So 2000 fps, 16 million pixels, and 7000 operations per pixel works out to 224 TFLOPS.
An RTX 4090 is advertised as being able to compute 82 TFLOPS.
Where do you think the extra is coming from? Is it just straightforward optimisations like constant folding? Or do you think the compiler is noticing that 1000 iterations of the loop don't change the answer and optimising it down to a single iteration?
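For reference, the arithmetic behind the 224 TFLOPS figure can be checked in a few lines. This is only a sketch using the round numbers quoted above; the variable names are mine, and it treats every operation as one FLOP, which is itself an assumption.

```python
# Round figures quoted in the comment above.
fps = 2000
pixels = 16_000_000
ops_per_pixel = 7000

# Operations per second needed if every op were really executed.
required_flops = fps * pixels * ops_per_pixel

# Advertised peak for an RTX 4090, as quoted above.
advertised_flops = 82e12

required_tflops = required_flops / 1e12
ratio = required_flops / advertised_flops

print(required_tflops)   # 224.0
print(round(ratio, 2))   # roughly 2.7x the advertised peak
```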
There are 7866 instructions including the constants. There are 1406 constants, leaving only 6460 real instructions to be executed (max, min, add, neg, sub, square, sqrt). Those constants can be directly encoded into most (possibly all) of the instructions when mapped to real machine instructions, which a compiler would likely do unless there was a good reason to keep them in a register or memory location.
Something I saw from a cursory scan was some near-duplicate instructions (I haven't written code to find all instances):
a = x - c
;; some time later
b = c - x
Recognizing that b is the negation of a, you can convert the calculation of b to:
b = -a
This may or may not be faster, but it does mean that we can possibly forget about c earlier (helpful for register allocation and can reduce memory accesses).
Negations can sometimes be removed depending on the lifetime of the variable and its uses. Taking the above, suppose we follow it with this use of b:
d = e + b ;; or b + e
We can rewrite this as:
d = e - a
Or if it had been:
d = e - b
It can be rewritten to:
d = e + a
And if there are no more uses of b we've eliminated both that variable and another instruction from the program. These and other patterns are things a compiler would detect and optimize.
Though looking at the uses of the results from neg, I think most are used in the sequence of max/min instructions following them so it may not be possible to eliminate them as I showed above.
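The rewrites above can be sketched as a small peephole pass. This is a rough illustration, not what any real compiler does: the (dest, op, args) tuple format, the function name rewrite, and the assumption that every name is assigned exactly once are all mine, and it ignores signed-zero edge cases (x - x is +0 while its negation is -0).

```python
def rewrite(program, outputs):
    """program: list of (dest, op, args) tuples, each dest assigned once.
    outputs: names that must survive (the final results)."""
    subs = {}  # (lhs, rhs) -> dest of an earlier `dest = lhs - rhs`
    negs = {}  # dest -> operand for every `dest = -operand`
    out = []
    for dest, op, args in program:
        if op == "sub" and (args[1], args[0]) in subs:
            # b = c - x where a = x - c already exists  =>  b = -a
            op, args = "neg", (subs[(args[1], args[0])],)
        elif op == "add" and args[1] in negs:
            # d = e + b with b = -a  =>  d = e - a
            op, args = "sub", (args[0], negs[args[1]])
        elif op == "add" and args[0] in negs:
            # d = b + e with b = -a  =>  d = e - a
            op, args = "sub", (args[1], negs[args[0]])
        elif op == "sub" and args[1] in negs:
            # d = e - b with b = -a  =>  d = e + a
            op, args = "add", (args[0], negs[args[1]])
        if op == "sub":
            subs[args] = dest
        elif op == "neg":
            negs[dest] = args[0]
        out.append((dest, op, args))

    # Single pass of dead-code removal: drop negations nobody reads.
    used = set(outputs)
    for _, _, args in out:
        used.update(args)
    return [ins for ins in out if ins[1] != "neg" or ins[0] in used]


example = [
    ("a", "sub", ("x", "c")),
    ("b", "sub", ("c", "x")),   # near duplicate of a
    ("d", "add", ("e", "b")),
]
optimized = rewrite(example, outputs=["d"])
for dest, op, args in optimized:
    print(dest, "=", op, *args)
```

In the example, b becomes a negation of a, the add is turned into a sub that reads a directly, and b is then dropped because nothing uses it. A neg that feeds a max or min, as mentioned above, would not match any of these patterns and would be left alone.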
I compiled it for Ampere and counted 6834 actual F32 operations in the SASS after optimizations. I only counted FFMA, FADD, FMUL, FMNMX, and MUFU.RSQ after eyeballing the SASS code, so there might even be more. It's possible the FMNMX doesn't actually take a FLOP since you can do f32 max as an integer operation, and perhaps MUFU.RSQ doesn't either, but even if you only count FFMA, FADD, and FMUL there are still 3685 ops.
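A count like the one above could be reproduced with a short script over the disassembly text instead of eyeballing it. This is a hedged sketch: the function name count_f32_ops is made up, the SAMPLE text is hand-written for illustration and is not real compiler output, and the parsing assumes each line has the opcode as the first token after any /* ... */ comment and optional @predicate.

```python
import re
from collections import Counter

# Opcodes counted in the comment above.
COUNTED = ("FFMA", "FADD", "FMUL", "FMNMX", "MUFU.RSQ")


def count_f32_ops(sass_text):
    """Count occurrences of the listed opcodes in SASS-style text."""
    counts = Counter()
    for line in sass_text.splitlines():
        # Strip /* address */ and /* encoding */ comments.
        line = re.sub(r"/\*.*?\*/", " ", line)
        # Skip predicate guards such as @P0 or @!P1.
        tokens = [t for t in line.split() if not t.startswith("@")]
        if not tokens:
            continue
        opcode = tokens[0]
        for name in COUNTED:
            # Match the bare opcode or the opcode with modifiers (e.g. FMUL.FTZ).
            if opcode == name or opcode.startswith(name + "."):
                counts[name] += 1
                break
    return counts


# Hand-written sample lines, NOT real output.
SAMPLE = """
        /*0000*/  FADD R0, R1, R2 ;
        /*0010*/  FFMA R3, R0, R0, R4 ;
        /*0020*/  MUFU.RSQ R5, R3 ;
        /*0030*/  @P0 FMNMX R6, R5, R0, !PT ;
        /*0040*/  IMAD.MOV.U32 R7, RZ, RZ, R6 ;
        /*0050*/  FMUL.FTZ R8, R7, R5 ;
"""

counts = count_f32_ops(SAMPLE)
total = sum(counts.values())
arithmetic_only = counts["FFMA"] + counts["FADD"] + counts["FMUL"]
print(total, arithmetic_only)
```

Splitting the total from the FFMA/FADD/FMUL subtotal mirrors the two figures quoted above, depending on whether FMNMX and MUFU.RSQ are counted as FLOPs.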