> What makes the workload somewhat special is

I'll add that latency also doesn't matter that much. You are loading batch n+1 on the CPU while the GPUs are churning through batch n-1 and batch n is being copied over from host memory, all at the same time.
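To make the overlap concrete, here is a minimal sketch of the prefetch pattern, not any particular framework's loader. `prefetch` and `load_batches` are hypothetical names, and the "batches" are just small lists standing in for real data. A background thread keeps loading ahead while the consumer works on what is already queued.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` ahead in a background thread."""
    q = queue.Queue(maxsize=depth)  # bounded, so the loader cannot run arbitrarily far ahead
    done = object()                 # sentinel marking the end of the stream

    def worker():
        # Note: this sketch does not forward exceptions raised by `batches`.
        for batch in batches:
            q.put(batch)            # blocks while the queue is full
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def load_batches(n):
    """Hypothetical stand-in for the CPU-side 'load next batch' step."""
    for i in range(n):
        yield [i] * 4


# The consumer (the GPU step in the real setting) sees batches in order,
# while later batches are being loaded in the background.
consumed = [batch[0] for batch in prefetch(load_batches(5))]
```

With `depth=2` the loader stays at most a couple of batches ahead, which is roughly the n-1 / n / n+1 staging described above.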

So as long as a single "load next batch" doesn't take more than about 1s, you're fine. But a single "load next batch" on one worker means thousands (if not more) of random reads.
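A rough sketch of what that looks like from the worker's side, under assumptions of mine: fixed-size records in one file, a thread pool to keep many reads in flight, and `os.pread` (POSIX only). `RECORD_SIZE`, `NUM_RECORDS`, `read_record` and `load_batch` are made-up names, and the temp file just stands in for a real dataset. The point is that only the wall time of the whole batch matters, not the latency of any single read.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

RECORD_SIZE = 512    # assumed fixed record size, in bytes
NUM_RECORDS = 1024   # assumed dataset size for this toy example


def read_record(fd, index):
    """One random read: fetch record `index` without moving a shared file offset."""
    return os.pread(fd, RECORD_SIZE, index * RECORD_SIZE)


def load_batch(fd, indices, workers=32):
    """Issue all reads for one batch concurrently; results keep the order of `indices`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: read_record(fd, i), indices))


# Build a toy dataset: record i is filled with the byte value i % 256.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_RECORDS):
        f.write(bytes([i % 256]) * RECORD_SIZE)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    indices = random.Random(0).choices(range(NUM_RECORDS), k=2000)
    start = time.perf_counter()
    batch = load_batch(fd, indices)
    elapsed = time.perf_counter() - start  # compare against the ~1s budget
finally:
    os.close(fd)
    os.remove(path)
```

On a local file that sits in the page cache this will be far under the budget; the interesting case is remote or cold storage, where the number of reads in flight is what decides whether the batch finishes in time.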