
So 2000 fps × 16 million pixels × 7000 operations per pixel works out to 224 TFLOPS.

An RTX 4090 is advertised as being able to compute 82 TFLOPS.

Where do you think the extra is coming from? Is it just straightforward optimisations like constant folding? Or do you think the compiler is noticing that 1000 iterations of the loop don't change the answer and optimising it down to a single iteration?



There are 7866 instructions including the constants. There are 1406 constants leaving only 6460 real instructions to be executed (max, min, add, neg, sub, square, sqrt). Those constants can be directly encoded into most (possibly all) of the instructions when mapped to real machine instructions, which a compiler would likely do unless there was a good reason to keep it in a register or memory location.

Something I saw from a cursory scan was some near-duplicate instructions (I haven't written code to find all instances):

  a = x - c
  ;; some time later
  b = c - x
Recognizing that b is the negation of a, you can convert the calculation of b to:

  b = -a
This may or may not be faster, but it does mean that we can possibly forget about c earlier (helpful for register allocation and can reduce memory accesses).

Negations can sometimes be removed depending on the lifetime of the variable and its uses. Taking the above, suppose we follow it with this use of b:

  d = e + b ;; or b + e
We can rewrite this as:

  d = e - a
Or if it had been:

  d = e - b
It can be rewritten to:

  d = e + a
And if there are no more uses of b we've eliminated both that variable and another instruction from the program. These and other patterns are things a compiler would detect and optimize.

Though looking at the uses of the results from neg, I think most are used in the sequence of max/min instructions following them so it may not be possible to eliminate them as I showed above.


Pretty sure it's optimisations.

Even small things like converting multiplications followed by additions to FMA will reduce the operation count.

Add to that constant folding etc. and a speedup factor of ~3 is not so hard to imagine.


I compiled it for Ampere and counted 6834 actual F32 operations in the SASS after optimizations. I only counted FFMA, FADD, FMUL, FMNMX, and MUFU.RSQ after eyeballing the SASS code, so there might even be more. It's possible the FMNMX doesn't actually take a FLOP since you can do f32 max as an integer operation, and perhaps MUFU.RSQ doesn't either, but even if you only count FFMA, FADD, and FMUL there are still 3685 ops.

  nvcc -arch=sm_86 prospero.cu -o prospero
  cuobjdump -sass prospero | grep -E 'FFMA|FADD|FMUL|FMNMX|MUFU\.RSQ' | wc -l


It’s a good question; admittedly I don’t know. I looked into it when Matt mentioned it, but I’m not familiar enough with what happens after PTX to say.

If someone does know I’d love to learn why though.



