
There are two separate rants here that aren't delineated well.

1) C++ is too complicated, and therefore hard to reason about and slow to compile.

We're going to argue about this forever, but you'll have to agree that the spec is very large and warty compared to other languages, and that C++ tends to take far longer to compile (this was already a problem a decade ago, it's not specific to "modern" C++).

2) The future of software development will include more of what I'm going to call "non-isotropic" software; rather than assuming a flat memory model and a single in-order execution unit, and exerting great effort to pretend that that's still the case, programmers will have to develop effectively on GPUs and reconfigurable hardware. Presumably this speculation is based on the Intel-Altera acquisition.

You can sort of do hardware programming in C (SystemC), but C++ is really not a good fit for hardware. Personally I'd like to see a Cambrian explosion of HDLs, but the time is not yet right for that.

It sounds like the author favours the "C with classes" programming style, maybe including smart pointers, and is probably not keen on lambdaization of everything.



Can't really argue about 1.

About 2, in-order single instruction execution hasn't been an assumption for a very long time; C and C++ optimizers (and programmers) have been able to take advantage of these CPU features for a while. There are language extensions (Cilk++, OpenMP) to take advantage of extra cores for fine-grained parallelism.

Regarding GPUs, arguably C and C++ have the most mature and transparent offloading support all around (OpenACC, again OpenMP, whatever MS offloading extensions are called) and the most popular GPU programming language (CUDA) is a C++ dialect.

Regarding the flat memory model, for large scale programming the only sane model is a flat, cache coherent one; architectures that don't provide that either evolve to provide it or die (cf. CELL), supplanted by those that do (yes, that doesn't mean that all memory is the same, but that is true with your standard CPU anyway).

I don't have an opinion on FPGAs. I expect that, if they ever go mainstream, initially people will just assemble predefined blocks via high level languages, but who knows what the future holds for us.


Parallelism: yes, although OpenMP isn't nearly as accessible as e.g. Swift async closures. C++ only got proper language-native threading in C++11.

for large scale programming the only sane model is a flat, cache coherent one

Does Google view their datacenters as a single flat cache-coherent memory space? No, they built MapReduce instead. That's the point of view I'm coming from: distributed systems engineering working downwards. Rather than a single large program operating on a single memory space, a set of fragments whose programmers are aware that there is latency when communicating between nodes. DRAM is just another "node" that you have to send messages to and wait for a response.


"Parallelism: yes, although OpenMP isn't nearly as accessible as e.g. Swift async closures."

I'm not familiar with them, do you have a pointer? Cilk does have powerful semantics and a very lightweight syntax.

"C++ only got proper language-native threading in C++11."

Sure, but OpenMP and Cilk are significantly older.

"Does Google view their datacenters as a single flat cache-coherent memory space?"

No, but I'm pretty sure they wish they could. Many HPC clusters do present a single memory image across thousands of machines.

"they built MapReduce instead."

MapReduce (and its extensions) is not a general programming model though.


Exactly. We had older stuff that could handle this. There was a ton of innovation in MPPs and clusters that could be applied to today's problems with ASICs or FPGAs, with interesting results. I intend to do just that at some point.

Far as nanosecond comms, they might want to consider using the old Active Messages or Fast Messages schemes. Their latency was tiny even on Ethernet. SGI also put FPGAs on NUMA interconnect back in the day. What Intel is doing is an increment on that rather than revolutionary or anything. One could use NUMAscale's chips to connect these things together.

There are also academic tools that could be polished for producing Verilog/VHDL automatically from higher-level descriptions. High-level synthesis, it's called. Works well enough for simple constructs like he says he uses. Don't have to go straight to RTL level haha.


> I'm pretty sure they wish they could [view their datacenters as a single flat cache-coherent memory space]. Many HPC clusters do present a single memory image across thousands of machines.

No, not really. At some point, when you're dealing with petabytes of RAM and millions of cores, the laws of physics kick in: your RAM is spread across a large physical area no matter how clever you are. If you want a flat memory space you have to guarantee access to any memory address in less than X cycles; otherwise you have a NUMA architecture[1].

While it is true that HPC clusters present a single memory image per cluster node (where one node = 8-32 processors, maybe 64), the other nodes' memory has to be accessed with message passing or other mechanisms.

You need a different programming model, MapReduce is too specific, that's why Google is trying things like their "DataFlow" platform.

[1]https://en.wikipedia.org/wiki/Non-uniform_memory_access#NUMA...


"If you want a flat memory space you have to guarantee access to any memory address in less than X cycles; otherwise you have a NUMA architecture[1]"

There is nothing wrong with NUMA (well, ccNUMA, but today that's a given). Even a simple modern two socket server is a NUMA machine.

Anyways, as I've commented elsewhere, I'm not arguing that shared memory is practical today on a large HPC cluster.


> Anyways, as I've commented elsewhere, I'm not arguing that shared memory is practical today on a large HPC cluster.

I think the point that was being made was that it'll never be practical, purely for physical reasons. Any physical separation means that light takes a certain amount of time to travel, and no known law of physics will let you circumvent that... A distance of a foot will always incur a latency of ~1ns (at best), so our models must account for latency. (At some point -- it's not obvious that we've reached the end of how compact a computer can be, but there is a limit where you just end up with a tiny black hole instead of a computer.)


I don't get it; our models have been accounting for latency for the last 30 years at least. We routinely use three levels of caches and highly out-of-order memory accesses to try to make latency manageable.

Now it is possible that our best coherency protocols simply aren't effective at high latencies, but that doesn't mean we can't come up with something workable in the future. Is there any no-go theorem in the field?


All the HPC I've done is explicitly message-passing. There's certainly no abstraction layer that allows me to treat it as a single memory space.


Distributed Shared Memory [1] is a thing, although I guess that today's ultra large clusters make it impractical.

[1] https://en.wikipedia.org/wiki/Distributed_shared_memory


With OpenMP your parallel loops look like normal loops (they have a #pragma to mark them as parallelizable). With asynchronous closures, your parallel code looks like callbacks. Obviously, which you prefer is personal preference, but I've never heard anybody say OpenMP wasn't accessible.


> Regarding the flat memory model, for large scale programming the only sane model is a flat, cache coherent one; those architectures that don't provide that, either evolve to provide it or die (cf. CELL) supplanted by those that do (yes, that doesn't mean that all memory is the same, but that is true with your standard CPU anyway).

CUDA on Nvidia GPUs gives a non-coherent last level cache. Most programs distributed over multiple nodes have no shared address space abstraction, but instead do explicit message passing. I agree that coherent caches make programming easier to think about, but I don't think I would go as far as to say they are the "only sane model" for "large scale programming", given that MPI and CUDA are both popular.


Sure, my point is that the general trend is for the hardware to get smarter and hide the ugliness of the system to the programmer. Every time it has been attempted to push the complexity to the programming language and compiler, it has ended in tragedy.


Computers don't really have the flat memory model today. They're networked cores with coherent distributed memory wrapped in a simple API that has very high performance variability.

Any kind of performant multithreaded work today has to treat the motherboard as such, or accept significant performance hits.


As for 2), totally agree: GPU and FPGA programming have become "the new assembly language".

It used to be you could drop from C/C++ to assembler and gain massive performance boosts. These days, with C intrinsics, there's no need for pure assembly code. But dropping to GPU or FPGA code is a total must now, if you need any significant juice from your system.


Compiler intrinsics as Assembly replacement go back all the way to the 60s.


But good codegen from intrinsics goes back only a dozen years or so (ymmv)


Precisely. Also, the instruction sets have been designed for compilers rather than for minimizing gate counts, making it easier for compilers to schedule optimally - a lot less weird and bizarre shit they have to deal with.


Well, we also have to thank C for the setback in optimizing compilers.

Fran Allen. In Coders at Work (pp. 501-502):

--- Begin Quote ---

-Seibel-: When do you think was the last time that you programmed?

-Allen-: Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization.

The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue. The motivation for the design of C was three problems they couldn't solve in the high-level languages: One of them was interrupt handling. Another was scheduling resources, taking over the machine and scheduling a process that was in the queue. And a third one was allocating memory. And you couldn't do that from a high-level language. So that was the excuse for C.

-Seibel-: Do you think C is a reasonable language if they had restricted its use to operating-system kernels?

-Allen-: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve.

By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are... basically not taught much anymore in colleges and universities.

--- End Quote ---


That's some serious sour grapes by Allen. Her assertion that Fortran and COBOL are higher level than C is... difficult to support, given the reliance of both languages on GOTO.

The assertion that compilers weren't taught any more is just silly.


Here's an experienced C programmer and fan listing the ways Fortran is higher-level and superior to C for numeric programming:

http://www.ibiblio.org/pub/languages/fortran/ch1-2.html


Most everything here is subjective, inaccurate, or outdated by C99, save Fortran's multi-dimensional array handling, which is legitimately superior to C's despite partial reconciliation by VLAs.


Well, darn, there goes that. I'll have to re-examine it against a C99 reference to assess its accuracy.


I don't think it's quite that bad, but "for numeric programming" is a very important caveat here. As is "define higher-level".


I think higher level would be an efficient, English-like representation that's closer to algorithm pseudo-code than to managing machine details.


I don't remember seeing many GO TOs in post-F77 FORTRAN. When did this exchange happen?


Higher level does not necessarily mean more modern language features and paradigms - it only means the language's computation model is farther removed from the actual hardware.

Allen was specifically discussing auto-optimizations (what we nowadays would just call 'compiler optimizations') and essentially argued that low level languages, in the quest of allowing fine-grained manual optimization, prevent many types of advanced auto-optimizations.

Specifically speaking, it is well known that FORTRAN still often beats C in numerical calculations just by virtue of not supporting pointer aliasing (especially pointers pointing to arbitrary positions in the middle of an array which is being looped over).


Fortran is certainly higher level than C, given the lack of aliasing and no decay of arrays into pointers.


You should bookmark this for future discussions bringing up Fortran. Great write-up.

http://www.ibiblio.org/pub/languages/fortran/ch1-2.html


Ah, you're siding with the Lisp machine vs today's hardware arch. I think if the Lisp hardware architecture has significant advantages we'll see that emerging in "soft CPUs" on FPGAs, particularly as the Xeon/FPGA gear gathers steam.


We've already seen the advantages, although the total cost-benefit is unknown. For one, several CPUs for Scheme/LISP in the past had hardware-accelerated garbage collection and/or a bunch of cores. Automatic memory management and multicore are now mainstream due to perceived benefits. In LISP machines this worked all the way down to the CPU.

Also, I've seen some of these benefits of Genera in modern stacks but I still don't have all of these capabilities:

http://www.symbolics-dks.com/Genera-why-1.htm

Do any jump out at you as particularly awesome for a developer OS?


Not only Lisp Machines, that was how many memory safe systems programming languages were done in the mainframe days.

Intel MPX, CHERI are just two modern approaches of reusing those ideas to tame C's memory issues.

Going back to FPGAs, I think function composition is very similar to digital circuit design, so FP concepts could be a very nice way to do GPGPU programming instead of the actual mainstream approaches. But it would require the GPU to be more FPGA-like.


Might be.

ESPOL, NEWP and Algol-68RS are some examples of Algol-based systems programming languages where the hardware was fully exposed as intrinsics and no Assembly was required.


> 1) C++ is too complicated, and therefore hard to reason about and slow to compile.

Don't we tend to reason predominantly about the written-down code rather than about the language itself? Once the developer has delineated the semantic circle he will be using, the big part of the specification is left out. Granted, the question is open when we are contemplating a blank ... err, code editor, but does that occur all that frequently?

> Presumably this speculation is based on the Intel-Altera acquisition.

Well, companies have been selling high performance network cards equipped with gigabytes of RAM and an FPGA for a while now. I'd be curious to know what people do with them. The price level seems to indicate the target is financial institutions, but how about the developers -- where does one find people proficient in finance, math and Verilog/VHDL all at once? And at what price?


SystemC is actually C++. It's built on a bunch of #defines, classes, and templates.





