Cycles are often not what you're trying to measure with something like this. You care about whether the program has higher latency, higher inverse throughput, and other metrics denominated in wall-clock time.
Cycles are a fine thing to measure when trying to reason about pieces of an algorithm and estimate its cost (e.g., latency and throughput tables for assembly instructions are invaluable). They're also a fine thing to measure when frequency scaling is independent of the instructions being executed (since then you can perfectly predict which algorithm will be faster independent of the measurement noise).
That's not the world we live in though. Instructions cause frequency scaling -- some relatively directly (like a cost for switching into heavy AVX-512 paths on some architectures), some indirectly but predictably (physical limits on moving heat off the chip without cryo units), some indirectly but unpredictably (moving heat out of a laptop casing as you move between having it on your lap and somewhere else). If you just measure instruction counts, you ignore effects like the "faster" algorithm always throttling your CPU 2x because it's too hot.
One of the better use cases for something like RDTSC is when microbenchmarking a subcomponent of a larger algorithm. You take as your prior that no global state is going to affect performance (e.g., not overflowing the branch prediction cache), and then the purpose of the measurement is to compute the delta of your change in situ, measuring _only_ the bits that matter to increase the signal-to-noise ratio.
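A minimal sketch of what I mean (process_item() is a made-up placeholder, and it skips serialization details like lfence/rdtscp for brevity):

    // Time only the subcomponent: accumulate TSC deltas around the call you
    // care about and ignore everything else in the surrounding program.
    #include <x86intrin.h>  // __rdtsc on GCC/Clang
    #include <cstdint>
    #include <vector>

    static void process_item(int& x) { x *= 31; }  // hypothetical stand-in for the real work

    uint64_t ticks_in_subcomponent(std::vector<int>& items) {
        uint64_t total = 0;
        for (int& item : items) {
            uint64_t start = __rdtsc();
            process_item(item);            // the only code inside the timed region
            total += __rdtsc() - start;
        }
        return total;  // TSC ticks spent in process_item alone
    }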
In that world, I've never had the variance you describe be a problem. Computers are fast. Just bang a few billion things through your algorithm and compare the distributions. One might be faster on average. One might have better tail latency. Who knows which you'll prefer, but at least you know you actually measured the right thing.
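Concretely, "compare the distributions" just means collecting a big pile of per-iteration samples for each variant and looking at summary statistics instead of one number. A rough sketch (names are made up; assumes a non-empty sample vector):

    // Summarize a distribution of per-iteration timings by its mean and 99th
    // percentile, so "faster on average" and "better tail latency" can be
    // compared directly.
    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    struct Summary { double mean; uint64_t p99; };

    Summary summarize(std::vector<uint64_t> samples) {
        double mean = std::accumulate(samples.begin(), samples.end(), 0.0)
                      / samples.size();
        size_t idx = samples.size() * 99 / 100;  // index of the 99th percentile
        std::nth_element(samples.begin(), samples.begin() + idx, samples.end());
        return {mean, samples[idx]};
    }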
For that matter, even a stddev of 80% isn't that bad. At $WORK we frequently benchmark the whole application even for changes which could be microbenchmarked. Why? It's easier. Variance doesn't matter much if you just run the test longer; the standard error of the mean shrinks like 1/sqrt(n), so extra iterations buy back whatever precision the noise costs you.
You have a legitimate point in some cases. E.g., maybe a CLI tool does a heavy amount of work for O(1 second). Thermal throttling will never happen in the real world, but a sustained test would have throttling (and also different branch predictions and whatnot), so counting cycles is a reasonable proxy for the thing you actually care about.
I dunno; it's complicated.