Unfortunately I don't have deep knowledge of our infrastructure, so I can't answ...

flik · on Nov 13, 2016

>>If one of the hosts delays writing to the journal, then the rest of the fleet is waiting for that operation alone, and the whole file system is blocked. When this happens, all of the hosts halt, and you have a locked file system; no one can read or write anything and that basically takes everything down.

I have seen similar issues where a GC pause on one server freezes the entire cluster.

Is this a single monolithic file system? On the service side, could the code be made asynchronous, with a separate request queue for each shard? That would keep threads from blocking on one slow shard, so they stay free to serve requests for the other shards.
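
To make the per-shard queue idea concrete, here is a minimal Python asyncio sketch. Everything in it is hypothetical (the shard names, the simulated delay, the function names), since I don't know what their service actually looks like. It only shows the general pattern: one queue and one worker per shard, so a stall on one shard delays that shard's requests and nothing else.

```python
import asyncio


async def shard_worker(shard_id, queue, delays, log):
    """Drain one shard's queue; a stall here only holds up this shard."""
    while True:
        request = await queue.get()
        # Stand-in for a slow journal write (hypothetical delay per shard).
        await asyncio.sleep(delays.get(shard_id, 0))
        log.append((shard_id, request))
        queue.task_done()


async def run_demo():
    shards = ["a", "b"]      # hypothetical shard names
    delays = {"a": 0.2}      # assumption: shard "a" is the slow one
    queues = {s: asyncio.Queue() for s in shards}
    log = []

    # One worker task per shard, each reading only its own queue.
    workers = [
        asyncio.create_task(shard_worker(s, queues[s], delays, log))
        for s in shards
    ]

    # Route each request to the queue for its shard.
    for i in range(2):
        for s in shards:
            queues[s].put_nowait(f"req{i}")

    # Wait until every queue has been fully processed, then stop the workers.
    await asyncio.gather(*(q.join() for q in queues.values()))
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return log


log = asyncio.run(run_demo())
print(log)
```

In this toy run, both of shard "b"'s requests finish while shard "a" is still waiting on its first one. With a single shared queue and a blocking worker, "b" would have been stuck behind "a".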