What's a good way to be an "Archivist" on a low budget these days?
Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
Is this a use case for Torrents? What's the most suitable architecture available today for this?
I'm not an expert in such things, but this seems like a good use case for IPFS. It's similar to a torrent, except that it's natively content-addressed: the key you use to access the data is a hash of the data itself.
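The content-addressing idea can be sketched in a few lines: the store computes the key from the data, so you can only ever retrieve exactly what was put in. (This uses plain SHA-256 for illustration; real IPFS uses multihash-encoded CIDs, which are not shown here.)

```python
# Toy content-addressed store: the retrieval key is derived from the data.
import hashlib

class ContentStore:
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store data under its own SHA-256 digest and return that key."""
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

store = ContentStore()
key = store.put(b"some public dataset chunk")
assert store.get(key) == b"some public dataset chunk"
```

A nice side effect: a client can re-hash what it received and verify it matches the key, so untrusted mirrors can't silently serve tampered data.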
Set up a scrape using ArchiveTeam's fork of wget. It can save all the requests and responses into a single WARC file, which you can then browse with https://replayweb.page/ or some other tool.
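For reference, plain GNU wget also has built-in WARC support; ArchiveTeam's fork adds scripting hooks on top of it. A sketch of an invocation (the URL and output name are placeholders):

```shell
# Mirror a site and record all requests/responses into a WARC file.
wget --mirror --page-requisites \
     --warc-file=example-site \
     --warc-cdx \
     "https://example.com/"
# Writes example-site.warc.gz plus a CDX index alongside it.
```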
In my experience, to archive effectively you need a physical datacenter footprint, or to rent capacity from someone who does. Over a longer timespan (even just 6 months), having your own footprint has a lower total cost of ownership, provided you have the skills, or access to someone with the skills, to run Kubernetes + Ceph (or something similar).
> Is this a use case for Torrents?
Yes, provided you have a good way to dynamically append to a distributed index of torrents, and users willing to run that indexing software alongside their torrent client. Should be easy enough to define in container-compose.