
Thanks for sharing your experience. It's non-trivial and surprising behavior like this that drove me to build a custom system[0] myself.

When I started researching version control tools for large files, I remember feeling like git-annex and Git LFS were awkwardly bolted onto Git; Git simply wasn't designed for large files. Then I found DVC[1], and its approach rang true for me. However, after using DVC for a year or so, I grew tired of its many puzzling behaviors (most of which are outlined in the README at [0]). In the end, I built the tool I wanted for the job -- one that is exceptionally simple and fast.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org/



That's an interesting system, but it is pretty "in your face" -- it seems to take over some git operations and other steps.

We went the other way and made an "invisible" system -- there were two scripts: the first would move a data file to the cache, upload it, and create a metadata file that needed to be checked in; the second would take a list of metadata files and make sure each of them was downloaded to the cache. We then modified all of our code to call the second script before trying to read large files.
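
Roughly, the pair of scripts could look like the minimal Python sketch below. The cache location, bucket URL, and function names are assumptions for illustration, not our actual implementation:

    #!/usr/bin/env python3
    # Hypothetical sketch of the two-script scheme described above.
    import hashlib
    import json
    import shutil
    import subprocess
    from pathlib import Path

    CACHE = Path.home() / ".large-file-cache"   # assumed local cache dir
    REMOTE = "s3://example-bucket/large-files"  # assumed central storage

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def add_large_file(path: Path) -> None:
        # Script 1: move the file into the content-addressed cache,
        # upload it, and leave a small metadata file to check in.
        digest = sha256(path)
        CACHE.mkdir(parents=True, exist_ok=True)
        cached = CACHE / digest
        shutil.move(str(path), str(cached))
        subprocess.run(["aws", "s3", "cp", str(cached), f"{REMOTE}/{digest}"],
                       check=True)
        meta = path.with_suffix(path.suffix + ".meta")
        meta.write_text(json.dumps({"sha256": digest,
                                    "size": cached.stat().st_size}))

    def fetch_large_files(meta_paths: list[Path]) -> list[Path]:
        # Script 2: given metadata files, download any blobs missing
        # from the cache and return the cached paths.
        paths = []
        for meta in meta_paths:
            digest = json.loads(meta.read_text())["sha256"]
            cached = CACHE / digest
            if not cached.exists():
                CACHE.mkdir(parents=True, exist_ok=True)
                subprocess.run(["aws", "s3", "cp", f"{REMOTE}/{digest}",
                                str(cached)], check=True)
            paths.append(cached)
        return paths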

The overall experience was that no one had to know anything about our large-file system unless they wanted to add a new file. It was just that, every now and then, you would run a program and it would automatically download a few large data files.
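
For example, with the hypothetical sketch above, consuming code only needs one extra call before opening a file, and the download happens only on a cache miss:

    # In application code: resolve metadata to a cached blob first.
    [weights] = fetch_large_files([Path("models/weights.bin.meta")])
    data = weights.read_bytes()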

(We could get away with such a simple design because (1) we had centralized large-file storage, (2) all of our large files were immutable, and (3) they were only consumed by our own code, which we could modify.)



