
Interesting. Do you figure this out by identifying the offending assembly instructions and then seeing that one of the operands comes from memory?


There's no single good way, but yes, as you said, logical deduction based on the surrounding instructions and their hardware counters is one way to do it. Instruction B might collect a ton of cycles in the hardware counters, but only because instruction A, which it depends on, is slow. Sometimes those dependencies are even implicit: since x86 commits in order, some instructions like locks/atomics have implicit, dynamic dependencies based on whatever is in the reorder buffer at the time.
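
To make that concrete, here is a minimal C sketch (the function and its names are invented for illustration, not taken from any real profile): sampled cycle events tend to skid past the stalled load onto a nearby consumer, so the dependent add ends up looking hot.

    /* Hypothetical example: instruction B gets charged for instruction A's
       latency, because cycle samples often skid from the stalled load to a
       nearby dependent instruction. */
    long sum_indirect(const long *idx, const long *data, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            long v = data[idx[i]]; /* A: data-dependent load, may miss cache */
            sum += v;              /* B: frequently collects the stall cycles */
        }
        return sum;
    }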

To give a concrete example I encountered analysing a GC: traversing the object graph in a loop means calculating the address of an object, loading that object, doing some work on it, and then grabbing the bits to calculate the children to visit next. This creates a long, brittle chain of data-dependent conditionals, each depending on a calculation that ultimately came from a much earlier load. That conditional branch might be 30/70 taken/untaken, so the branch predictor often guesses wrong, reducing the ILP and making it harder to hide the loads' latency. Now, dear Watson, would you say the blame lies with the front end? There are no stalls when all the loads go to fast cache, only when there is the occasional remote LLC hit, DRAM hit or, god forbid, cross-NUMA hit. And what if I told you that there's an atomic operation to mark the object as visited, which is fast in itself but can only issue once all prior loads have completed, and prevents newer instructions from issuing until it has committed?
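
Roughly what that traversal looks like, as a hedged C sketch (the Obj layout, visit, and the explicit mark stack are all invented for illustration, not the actual GC in question):

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical object layout, purely for illustration. */
    typedef struct Obj {
        atomic_int  mark;        /* visited flag, set with a locked RMW */
        size_t      nchildren;
        struct Obj *children[];  /* child pointers loaded during traversal */
    } Obj;

    /* Sketch of the chain described above: address calc -> load -> work ->
       next address calc, every step feeding the next. */
    static void visit(Obj *o, Obj **stack, size_t *sp) {
        /* On x86 this compiles to a locked exchange: it issues only after
           prior loads complete, and younger instructions wait on it. */
        if (atomic_exchange(&o->mark, 1) != 0)
            return;                          /* the ~30/70 taken branch */
        for (size_t i = 0; i < o->nchildren; i++) {
            Obj *child = o->children[i];     /* load feeds the next iteration */
            if (child != NULL)
                stack[(*sp)++] = child;      /* data-dependent conditional */
        }
    }

Every line depends on the line before it, so a single remote LLC or cross-NUMA miss stalls the whole chain, and the locked exchange keeps the machine from racing ahead to hide it.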

You need to look at a whole bunch of surrounding instructions and a variety of hardware counters to start forming a picture. Insert the It's Always Sunny in Philadelphia meme with the red-string crime board here.



