For me, it's about being aware of the entire stack, and deliberate about which possibilities I am downplaying.
At a previous company, I was assigned a task to fix requests that were timing out for certain users. We knew those users had far more data than average, so the team lead created a ticket that was something like "Optimize SQL queries for...". Turns out the issue was that our XML transformation pipeline (I don't miss that stack at all) was configured to spool any message over a certain size to disk.
Because I started by benchmarking the query, I realized fairly quickly that the slowness wasn't in the database; and since I was familiar with every layer of our stack, I knew where else to look.
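The approach amounts to timing each layer independently so you can rule them out one at a time. A minimal sketch (the layer functions and their delays are purely illustrative, not the actual system):

```python
import time

def timed(label, fn, *args):
    """Run fn and report wall-clock time, so layers can be compared."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result

# Hypothetical layers of the request path -- names are made up.
def run_query(payload):
    time.sleep(0.01)   # stands in for the actual database round-trip
    return payload

def transform_xml(payload):
    time.sleep(0.25)   # stands in for the transformation step
    return payload

payload = "x" * 1024
payload = timed("query", run_query, payload)
payload = timed("transform", transform_xml, payload)
```

Benchmarking the layers separately like this makes it obvious when the layer named in the ticket isn't the one eating the time.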
Instrumentation is vital as well. If you can get metrics and error information without having to gather and correlate it manually, it's much easier to gain context quickly.