
occam (MIMD) is an improvement over CUDA (SIMD)

with the latest (e.g. TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalents):

  -  8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
  - 40000 CPUs in single reticle (80 TIPS)
  - 4.5M CPUs per 300mm wafer (10 PIPS)
the transputer's async links (and the C004 switch) allow for decoupled clocking, CPU-level redundancy and agricultural interconnect

heat would be the biggest issue ... but >>50% of each CPU is low-power (local) memory



Nitpick here, but ...

I think CUDA shouldn't be labeled as SIMD but SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all of the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that if statements can actually execute different code for active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. Your view might instead be that the same instructions operate on different data, but the fork-and-join nature behaves very differently.

I enjoyed your other point about the comparison of machines, though.


Really curious why you think programming a vector machine is so painful? In terms of what? And what exactly do you mean by a "true vector machine"?

My experience with RVV so far (I am aware of vector architecture history, just using it as an example) indicates that while it is not the greatest thing, it is not that bad either. You play with what you have!

Yes, compared to regular SIMD it is a step up in complexity, but nothing a competent SIMD programmer cannot reorient to. Designing a performant hardware CPU is another matter though - lots of (micro)architectural choices and tradeoffs that can impact performance significantly.


No - in terms of flexibility of programming, SIMD is less flexible if you need any decision making. In SIMD you are also typically programming in intrinsics, not at a higher level like in CUDA.

For example, in CUDA I can write a grid-stride loop:

  for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += num_threads) {
    C[tid] = A[tid] * B[tid] + D[tid];
  }

In SIMD, yes, I can stride the array 32 at a time (or in RVV at vl), but generally speaking, as new archs come along I need to rewrite that loop for the wider add and multiply instructions, increase the lane width, etc. But in CUDA or other GPU SIMT strategies I just need to bump the compiler and maybe change one num_threads variable, and it will vectorize correctly.
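To make that concrete, here is the same loop strip-mined the way fixed-width SIMD forces you to write it - plain C standing in for intrinsics, with the made-up names LANES and fma_simd, and the inner loop standing in for one wide FMA instruction:

```c
#include <assert.h>

#define LANES 8  /* baked-in vector width: the part you rewrite per arch */

/* C[i] = A[i] * B[i] + D[i], strip-mined at a fixed width */
void fma_simd(float *C, const float *A, const float *B,
              const float *D, int n) {
    int i = 0;
    for (; i + LANES <= n; i += LANES)   /* full vectors */
        for (int l = 0; l < LANES; l++)  /* stands in for one wide FMA */
            C[i + l] = A[i + l] * B[i + l] + D[i + l];
    for (; i < n; i++)                   /* scalar tail loop */
        C[i] = A[i] * B[i] + D[i];
}
```

The point is that LANES (and the tail handling) is hard-coded into the source, whereas in the SIMT version the thread count is the only knob.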

Even with things like RVV, which I am actually pushing for my SIMD machine to move toward, these problems exist because it's really hard to write length-agnostic code in SIMD intrinsics. That said, there is a major benefit in terms of performance per watt. All that SIMT flexibility costs power - that's why Nvidia GPUs can burn a hole through the floor, while the majority of phones have a series of SIMD vector machines constantly computing matrix and FFT operations with your pocket becoming only slightly warm.
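For contrast, the RVV length-agnostic pattern can be sketched in plain C - vsetvl_sketch and fma_vla are made-up names, with the min() logic standing in for what a real vsetvl instruction returns:

```c
#include <assert.h>

/* Stand-in for vsetvl: the hardware tells you how many elements
   it will take this iteration, up to its own maximum vector length. */
static int vsetvl_sketch(int remaining, int vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* C[i] = A[i] * B[i] + D[i], with no vector width baked into the source,
   so the same binary runs on any hardware vector length. */
void fma_vla(float *C, const float *A, const float *B,
             const float *D, int n, int vlmax) {
    for (int i = 0, vl; i < n; i += vl) {
        vl = vsetvl_sketch(n - i, vlmax);
        for (int l = 0; l < vl; l++)  /* one vector op of vl elements */
            C[i + l] = A[i + l] * B[i + l] + D[i + l];
    }
}
```

Note there is no separate tail loop: the final short iteration just gets a smaller vl.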



