> Knowing why that happens would be nice, of course, but no one actually has the...

KronisLV · on Nov 6, 2021

> ...you could find out the problem with quite a good chance.

It's not that technical limitations make it impossible. Even without JFR, there's still VisualVM and any number of APM solutions, like JavaMelody, Apache SkyWalking, Stagemonitor, etc.

It's rather a problem of telling the clients/business:

  Hey, look, for the next X days/weeks I won't be developing any new features or tending to your user stories, but instead will attempt to track down this persistent, yet somewhat hard to reproduce problem.
  And because of the limitations on accessing production environments, this process will likely take much longer than it otherwise should, especially since every request for production logs or heap dumps is a blocking, synchronous exchange, and the dumps are sometimes wrongly exported after the server restart, which makes them meaningless.
  Alternatively, I will spend a similar amount of time first getting the application instrumented, and then we'll run into similar challenges regarding access permissions for that data, before returning to the aforementioned attempts to debug and solve the application issues, because adding instrumentation doesn't magically solve those.

Depending on the environment you work in, this proposition might be accepted, you might find yourself fighting an uphill battle, or people might just look at you like you have two heads. And without the backing of the other engineers, you'll find yourself critiqued both for wasting time on debugging with no guarantee of an actual payoff and for the application quite possibly still not working.

I'm actually in the middle of implementing an APM solution to hopefully give better insights into how the application works, but in many of the environments out there this will be a Catch-22: https://www.merriam-webster.com/dictionary/catch-22

So, if you have control over the application from day one, instead of being onboarded into a maintenance project where SRE was never a concern during development, consider building for failure: treat it as a "when?" question rather than a "whether?" one, and do what you can to mitigate the actual user impact even when components fail.

Horizontal scaling is one way to achieve that, and a pretty decent one, as long as you don't attempt to scale your single source of truth.
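
At the code level, "building for failure" can be as small as deciding up front what the user sees when a dependency is down. Here's a rough sketch of one such approach, serving the last known good result when a call fails. It's in Python for brevity, and `LastKnownGood` and `fetch` are hypothetical names made up for illustration, not part of any particular library:

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LastKnownGood(Generic[T]):
    """Wraps a call to an unreliable component (hypothetical helper).

    On success the result is remembered; on failure the most recent
    good result (or the initial default) is returned instead of an error.
    """

    def __init__(self, fetch: Callable[[], T], default: T) -> None:
        self._fetch = fetch          # the call that may fail
        self._last_good = default    # what to serve before any success
        self.failures = 0            # counter you could export to monitoring

    def get(self) -> T:
        try:
            self._last_good = self._fetch()
        except Exception:
            # Assumption: any exception means "component unavailable".
            # Count it, but keep serving stale data to the user.
            self.failures += 1
        return self._last_good
```

Whether stale data is acceptable depends entirely on the component: fine for something like recommendations or a dashboard widget, not fine for anything that has to go through your single source of truth.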