One of the other comments around here[0] points to one truly classic StackOverfl...

magicalhippo · on Jan 1, 2023

Also, as I understand it, a big part of the "win" of branch prediction is to ensure those caches are filled.

If you can execute ahead, through branches, you can resolve cache misses while the pipeline catches up, effectively reducing the cache miss penalty. If instead you have to stop at a branch, you're always gonna pay the full cache miss penalty, which can be significant.

salawat · on Jan 1, 2023

Of course, now you're opening the can of worms of speculative execution leaking data through cache timing.

Which, again, may not be an issue given your constraints.

Retric · on Jan 1, 2023

That’s mostly true, but the heat generated by calculations which were thrown away still impacts the CPU’s thermal budget and the device’s battery life. This, coupled with branch prediction overhead, makes failed branch predictions far worse than simply stalling would be without them.

However, even very dumb branch prediction tends to be well worth it for the vast majority of code.
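To put a rough number on "very dumb", here is a minimal sketch comparing a static always-taken guess with a 2-bit saturating counter. The two branch patterns are made-up workloads for illustration, not measurements from real code, and real predictors are far more elaborate.

```python
def always_taken(outcomes):
    """Static predictor: guess 'taken' every time; return the fraction correct."""
    return sum(outcomes) / len(outcomes)


def two_bit_counter(outcomes):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    state = 2  # start at 'weakly taken' (arbitrary choice for this sketch)
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += predicted_taken == taken
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# Hypothetical workloads (assumptions, not real traces):
# a loop back-edge taken 9 times and then not taken once, repeated 100 times...
loop_branch = ([True] * 9 + [False]) * 100
# ...and a rarely-taken branch, e.g. an error check that fires 1 time in 10.
rare_branch = ([False] * 9 + [True]) * 100

loop_static = always_taken(loop_branch)
loop_2bit = two_bit_counter(loop_branch)
rare_static = always_taken(rare_branch)
rare_2bit = two_bit_counter(rare_branch)

print(f"loop branch: always-taken {loop_static:.1%}, 2-bit counter {loop_2bit:.1%}")
print(f"rare branch: always-taken {rare_static:.1%}, 2-bit counter {rare_2bit:.1%}")
```

On the loop pattern both predictors only miss the loop exit. On the rarely-taken pattern the static guess is wrong most of the time, while the counter adapts after the first couple of outcomes.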

Taniwha · on Jan 1, 2023

A useful (very rough) rule of thumb here is that every 5th instruction is a branch. This limits all sorts of things, like branch cost and prediction algorithms and sizes, but also how many instructions it's worth decoding per clock, which puts practical limits on system issue rates (a predicted taken branch typically reduces the issue rate).
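A back-of-the-envelope sketch of why that rule of thumb limits decode width: if each instruction is treated as an independent 1-in-5 chance of being a branch (a simplifying assumption, real code is clumpier), the chance that a decode group contains at least one branch climbs quickly with width.

```python
BRANCH_FRACTION = 1 / 5  # the "every 5th instruction" rule of thumb


def p_group_has_branch(width, p=BRANCH_FRACTION):
    """Probability that `width` consecutive instructions include at least one branch,
    assuming each instruction is independently a branch with probability p."""
    return 1 - (1 - p) ** width


for width in (2, 4, 8, 16):
    share = p_group_has_branch(width)
    print(f"decode width {width:2d}: {share:.0%} of groups contain a branch")
```

Under that assumption roughly 59% of 4-wide groups and 83% of 8-wide groups contain a branch, so wider decode runs into branches on most cycles.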

mysterydip · on Jan 1, 2023

If the branch predictor fails, does it have to take time undoing the steps it's already done? Or are pipelines long enough that the fail would be detected before registers/pointers are updated?

coder543 · on Jan 1, 2023

The results for partially-executed instructions aren't "committed" yet, so the pipeline is flushed (thrown away), and execution starts again with an empty pipeline at the corrected address. IIRC, each stage of a pipeline has its own registers. The registers your code knows about would not have been updated until the end of the pipeline, but that branch instruction is ahead of the speculative execution, so none of the speculative results would be stored before the mistake is identified.

So, there's no work required to "undo" those mistakes, but starting from an empty pipeline still means you're a dozen (or two) clock cycles from getting back to where you would have been with a successful branch prediction, which is why it is important for the CPU to predict correctly as often as it can.
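For a feel of what that penalty costs on average, here is a rough sketch of the usual cycles-per-instruction estimate. All the numbers are assumptions picked for illustration (1-in-5 branches, a 15-cycle flush, 95% prediction accuracy), and the "no prediction" case simply assumes every branch waits the same 15 cycles to resolve.

```python
def cpi_with_prediction(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction when only mispredicted branches pay the flush."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


def cpi_stall_on_every_branch(base_cpi, branch_fraction, resolve_latency):
    """Average cycles per instruction when every branch waits until it resolves."""
    return base_cpi + branch_fraction * resolve_latency


# Assumed, illustrative parameters (not measurements of any real CPU).
BASE_CPI = 1.0
BRANCH_FRACTION = 0.2
MISPREDICT_RATE = 0.05
PENALTY_CYCLES = 15

predicted = cpi_with_prediction(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, PENALTY_CYCLES)
stalled = cpi_stall_on_every_branch(BASE_CPI, BRANCH_FRACTION, PENALTY_CYCLES)

print(f"with prediction:    {predicted:.2f} cycles/instruction")
print(f"stall every branch: {stalled:.2f} cycles/instruction")
```

With these assumed numbers the predicted pipeline averages about 1.15 cycles per instruction versus 4.0 when stalling on every branch, which is why each extra point of accuracy is worth chasing.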

magicalhippo · on Jan 1, 2023

Not undoing AFAIK, as it won't have committed the results before the branch resolves. However, it will have to throw away all the work it did and start from the correct branch target. This is called a pipeline flush, and the cycles lost to it are the misprediction penalty.

The Pentium 4 in particular suffered from its long pipeline, and hence a long misprediction penalty. The successor to the Pentium 4 was an evolved Pentium III core with a much improved branch predictor and larger cache[1], which helped it outperform the Pentium 4[2].

[1]: https://en.wikipedia.org/wiki/Pentium_M

[2]: https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)