Git-annex – Managing large files with Git (branchable.com)
16 points by ecliptik on Jan 15, 2022 | hide | past | favorite | 4 comments


One non-obvious part of git-annex is its "location tracking" system [0], which "keeps track of in which repositories it last saw a file's content."

The result is that any "checkout" operation will, by default, modify the source repo, and that information is eventually propagated to all the clients. Before you delete a repository, you'll want to de-register its git-annex files (e.g. by marking the repo dead with "git annex dead here" and syncing) to let others know their content is no longer available. This is a pretty big departure from the standard git approach, where you can delete any checkout without repercussions.

We tried to use git-annex for large-file storage in a small team, and our metadata kept getting "gummed up" with repos that no longer exist -- temporary CI checkouts, servers that have since been reformatted, old user laptops, etc. So we dropped it in favor of git-lfs, and then a custom system.

[0] https://git-annex.branchable.com/location_tracking/


Thanks for sharing your experience. It's non-trivial and surprising behavior like this that drove me to build a custom system[0] myself.

When I started researching version control tools for large files, I remember feeling like git-annex and Git LFS were awkwardly bolted onto Git; Git simply wasn't designed for large files. Then I found DVC[1], and its approach rang true for me. However, after using DVC for a year or so, I grew tired of its many puzzling behaviors (most of which are outlined in the README at [0]). In the end, I built the tool I wanted for the job -- one that is exceptionally simple and fast.

[0]: https://github.com/kevin-hanselman/dud

[1]: https://dvc.org/


That's an interesting system, but it's pretty "in your face" -- it seems like it takes over some git operations and adds other steps.

We went the other way and made an "invisible" system. There were two scripts: one would move a data file to the cache, upload it, and create a metadata file to be checked in; the second would take a list of metadata files and make sure each referenced file was downloaded to the cache. We then modified all of our code to call the second script before trying to read large files.

The overall experience was that one didn't have to know anything about our large-file system unless they wanted to add a new file. It's just that sometimes one would run a program, and that program would automatically download a few large data files.

(We could get away with this really simple design because (1) we had centralized large-file storage, (2) all of our large files were immutable, and (3) they were only consumed by our own code, which we could modify.)
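The two-script design above can be sketched roughly like this. Everything here is hypothetical -- the function names, the metadata format, and the content-addressed layout are invented for illustration, and the "central storage" is simulated by a local directory:

```python
import hashlib
import json
import shutil
from pathlib import Path

def put_large_file(path: Path, cache: Path, store: Path) -> Path:
    """Script 1 (sketch): move a data file into a content-addressed
    cache, 'upload' it to central storage, and write a small metadata
    file that gets checked in to git in the data file's place."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    store.mkdir(parents=True, exist_ok=True)
    cached = cache / digest
    shutil.move(str(path), cached)        # local cache copy
    shutil.copy(cached, store / digest)   # simulated upload
    meta = path.with_suffix(path.suffix + ".meta")
    meta.write_text(json.dumps({"sha256": digest, "name": path.name}))
    return meta

def fetch_large_files(metas, cache: Path, store: Path) -> list[Path]:
    """Script 2 (sketch): given a list of metadata files, ensure each
    referenced blob is present in the cache, 'downloading' it from
    central storage if needed; return the cache paths to read from."""
    paths = []
    for meta in metas:
        digest = json.loads(Path(meta).read_text())["sha256"]
        cached = cache / digest
        if not cached.exists():
            cache.mkdir(parents=True, exist_ok=True)
            shutil.copy(store / digest, cached)  # simulated download
        paths.append(cached)
    return paths
```

Because the blobs are keyed by content hash and immutable, a fetch never has to worry about stale cache entries -- which is presumably what made a design this simple workable.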


I've made a grievous error with git-lfs over NFS, though. Violating the standard git workflow is possible with git-lfs, too.




