One of the other comments around here[0] points to one truly classic StackOverflow question[1] where the answer is "branch prediction". In that example, the CPU is 6x slower at processing the data when the branch predictor is failing repeatedly. It is surely still succeeding some of the time, since that question wasn't specifically built to foil the branch predictor, so that isn't even the upper bound for performance loss.
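To make the sorted-versus-unsorted effect concrete, here is a minimal sketch that counts mispredictions instead of measuring time. It assumes a single 2-bit saturating counter as the predictor, which is a textbook scheme and not what any particular CPU ships. The `value >= 128` condition and the array size are modeled on that question; `count_mispredictions` and `branch_outcomes` are hypothetical helper names.

```python
import random


def count_mispredictions(outcomes):
    """Run one 2-bit saturating counter over a sequence of branch outcomes.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """
    state = 1  # start at weakly "not taken" (arbitrary choice)
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Step one notch toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def branch_outcomes(data, threshold=128):
    """Outcome of `if value >= threshold` for each element, in order."""
    return [value >= threshold for value in data]


rng = random.Random(0)  # fixed seed so runs are repeatable
data = [rng.randrange(256) for _ in range(32768)]

shuffled_misses = count_mispredictions(branch_outcomes(data))
sorted_misses = count_mispredictions(branch_outcomes(sorted(data)))

print(f"shuffled: {shuffled_misses} mispredictions out of {len(data)}")
print(f"sorted:   {sorted_misses} mispredictions out of {len(data)}")
```

In this toy model the sorted input only mispredicts around the single point where the outcome flips, while the shuffled input mispredicts roughly every other branch.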
If you designed a CPU to lack a branch predictor, you would certainly make different choices that would reduce the penalty of that, but there would still be a significant penalty.
Modern CPUs are all pipelined[2] because it allows a sequence of instructions to be executed (to some degree) in parallel. With out-of-order CPUs, the amount of parallelism that is possible for a sequence of instructions increases even more.
Without a branch predictor, you would have to stall the entire pipeline as soon as you see a branch instruction, until that instruction is finished. Obviously if you don't have a branch predictor, you're going to choose to have a shorter pipeline, but the penalty of each branch instruction will still be significant.
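As a back-of-the-envelope sketch of that trade-off, the model below compares average cycles per instruction (CPI) for a core that stalls on every branch against one that only pays on mispredictions. Every number in it is an assumption picked for illustration (branch frequency, stall length, predictor accuracy, and an ideal base CPI of 1), and the function names are hypothetical.

```python
def cpi_stall_on_branch(branch_fraction, stall_cycles):
    """Average cycles per instruction when every branch stalls the pipeline."""
    return 1.0 + branch_fraction * stall_cycles


def cpi_with_predictor(branch_fraction, accuracy, flush_cycles):
    """Average cycles per instruction when only mispredicted branches cost cycles."""
    return 1.0 + branch_fraction * (1.0 - accuracy) * flush_cycles


# All numbers below are illustrative assumptions, not measurements.
BRANCH_FRACTION = 0.2  # assume roughly one instruction in five is a branch
DEEP_PIPELINE = 15     # assumed cycles lost per stall or flush on a long pipeline
SHORT_PIPELINE = 3     # assumed cycles lost per stall on a short pipeline
ACCURACY = 0.95        # assumed predictor hit rate

deep_no_predictor = cpi_stall_on_branch(BRANCH_FRACTION, DEEP_PIPELINE)
short_no_predictor = cpi_stall_on_branch(BRANCH_FRACTION, SHORT_PIPELINE)
deep_with_predictor = cpi_with_predictor(BRANCH_FRACTION, ACCURACY, DEEP_PIPELINE)

print(f"deep pipeline, stall on every branch:  {deep_no_predictor:.2f} CPI")
print(f"short pipeline, stall on every branch: {short_no_predictor:.2f} CPI")
print(f"deep pipeline, 95% accurate predictor: {deep_with_predictor:.2f} CPI")
```

Under these assumptions the shorter pipeline cuts the stall cost a lot, but it still loses to the deep pipeline with a predictor, which is the shape of the argument above.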
There's a lot of nuance to any proper answer to this question, and it's been years since I learned about this or thought deeply about this, so I'm probably not the right person to provide a deeper answer at this point. The performance impact would be significant.
Also, as I understand it, a big part of the "win" of branch prediction is to ensure those caches are filled.
If you can execute ahead, through branches, you can resolve cache misses while the pipeline catches up, effectively reducing the cache miss penalty. If instead you have to stop at a branch, you're always gonna pay the full cache miss penalty, which can be significant.
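A very rough way to put numbers on that: assume each cache miss costs a fixed latency, and that running ahead through branches lets a few independent misses be outstanding at once. The latency, the miss count, and the overlap factor below are all made-up assumptions, and `total_miss_cycles` is a hypothetical helper, so treat this as arithmetic and not as a model of any real memory system.

```python
import math


def total_miss_cycles(num_misses, miss_latency, misses_in_flight):
    """Cycles spent waiting on memory when up to `misses_in_flight` misses overlap."""
    batches = math.ceil(num_misses / misses_in_flight)
    return batches * miss_latency


# Illustrative assumptions: 8 independent misses, 200 cycles each.
stop_at_branches = total_miss_cycles(8, 200, misses_in_flight=1)
run_ahead = total_miss_cycles(8, 200, misses_in_flight=4)

print(f"stopping at every branch: {stop_at_branches} cycles waiting on memory")
print(f"running ahead:            {run_ahead} cycles waiting on memory")
```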
That’s mostly true, but the heat generated by calculations that were thrown away still counts against the CPU’s thermal budget and the device’s battery life. Coupled with the overhead of the branch predictor itself, this makes a failed branch prediction far worse than simply stalling would have been.
However, even very dumb branch prediction tends to be well worth it for the vast majority of code.
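For a feel of how far "very dumb" gets you, here is a small sketch that scores two trivial strategies on the backward branch of a nested loop: always predict taken, and predict whatever happened last time. The loop shape (100 outer iterations of a 10-iteration inner loop) is an arbitrary assumption, and the helper names are hypothetical.

```python
def loop_branch_outcomes(outer, inner):
    """Outcomes of the inner loop's backward branch: taken while looping, not taken on exit."""
    outcomes = []
    for _ in range(outer):
        outcomes.extend([True] * (inner - 1))
        outcomes.append(False)
    return outcomes


def accuracy(predict, outcomes):
    """Fraction of branches guessed correctly; `predict` sees only the previous outcome."""
    correct = 0
    previous = False  # assume "not taken" before the first branch is seen
    for taken in outcomes:
        if predict(previous) == taken:
            correct += 1
        previous = taken
    return correct / len(outcomes)


def always_taken(previous):
    return True


def same_as_last(previous):
    return previous


outcomes = loop_branch_outcomes(outer=100, inner=10)
always_taken_accuracy = accuracy(always_taken, outcomes)
same_as_last_accuracy = accuracy(same_as_last, outcomes)

print(f"always taken: {always_taken_accuracy:.0%} correct")
print(f"same as last: {same_as_last_accuracy:.0%} correct")
```

Both guess right most of the time on loop-heavy code, which is why even a static guess beats stalling on every branch.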
A useful (very rough) rule of thumb here is that every 5th instruction is a branch. This limits all sorts of things: branch cost, prediction algorithms and their table sizes, and also how many instructions it's worth decoding per clock, which puts practical limits on issue rates (a predicted-taken branch typically reduces the issue rate).
If the branch predictor fails, does it have to take time undoing the steps it's already done? Or are pipelines long enough that the fail would be detected before registers/pointers are updated?
The results for partially-executed instructions aren't "committed" yet, so the pipeline is flushed (thrown away), and execution starts again with an empty pipeline at the corrected address. IIRC, each stage of a pipeline has its own registers. The registers your code knows about would not have been updated until the end of the pipeline, but that branch instruction is ahead of the speculative execution, so none of the speculative results would be stored before the mistake is identified.
So, there's no work required to "undo" those mistakes, but starting from an empty pipeline still means you're a dozen (or two) clock cycles from getting back to where you would have been with a successful branch prediction, which is why it is important for the CPU to predict correctly as often as it can.
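Here is a toy sketch of why nothing needs undoing. It assumes speculative results sit in a holding buffer until the branch resolves, which is a big simplification of real pipeline registers and reorder buffers; `ToyCore` and its methods are hypothetical names for illustration.

```python
class ToyCore:
    """Toy model: speculative results wait in a buffer and are only committed later."""

    def __init__(self):
        self.registers = {"r1": 0, "r2": 0}  # architectural state the program can see
        self.in_flight = []                  # speculative results, not yet committed

    def execute_speculatively(self, register, value):
        """Record a result without touching the visible registers."""
        self.in_flight.append((register, value))

    def commit(self):
        """Prediction was right: make the buffered results visible, in order."""
        for register, value in self.in_flight:
            self.registers[register] = value
        self.in_flight.clear()

    def flush(self):
        """Prediction was wrong: drop the buffered results; returns how many were lost."""
        discarded = len(self.in_flight)
        self.in_flight.clear()
        return discarded


core = ToyCore()

# Wrong guess: work done past the branch is simply discarded.
core.execute_speculatively("r1", 42)
core.execute_speculatively("r2", 7)
discarded = core.flush()
registers_after_flush = dict(core.registers)

# Right guess: the same work becomes visible at commit time.
core.execute_speculatively("r1", 42)
core.commit()
registers_after_commit = dict(core.registers)

print(discarded, registers_after_flush, registers_after_commit)
```

The flush itself is just dropping the buffer; the cost is the cycles needed to refill the pipeline afterwards.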
Not undoing AFAIK, as it won't have committed the results before the branch resolves. However, it will have to throw away all the work it did and restart from the correct branch target. This is called a pipeline flush, and the cycles lost to it are the misprediction penalty.
The Pentium 4 in particular suffered from this, because its long pipeline made each misprediction expensive. The successor to the Pentium 4 was an evolved Pentium III core with a much improved branch predictor and larger cache[1], which helped it outperform the Pentium 4[2].
[0]: https://news.ycombinator.com/item?id=34202230
[1]: https://stackoverflow.com/questions/11227809/why-is-processi...
[2]: https://en.wikipedia.org/wiki/Instruction_pipelining