
What's a good way to be an "Archivist" on a low budget these days?

Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you and publish them from a friendly jurisdiction - keyed by their MD5/SHA1 hashes - or make them available upon request. That is, be part of a large open storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
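Keying by MD5/SHA1 is straightforward with the standard library; a minimal sketch (the function name and sample input are mine, and SHA-256 is added as an aside since MD5/SHA1 are no longer collision-resistant):

```python
import hashlib

def content_keys(data: bytes) -> dict:
    """Return common hash digests to use as lookup keys for a blob."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        # MD5 and SHA1 are fine as identifiers but broken for
        # collision resistance; publish a SHA-256 alongside them.
        "sha256": hashlib.sha256(data).hexdigest(),
    }

print(content_keys(b"some public dataset")["sha1"])
```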

Is this a use case for Torrents? What's the most suitable architecture available today for this?



I’m not an expert in such things, but this seems like a good use case for IPFS. Kinda similar to a torrent except that it is natively content-addressed (essentially the key to access is a hash of the data).
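The content-addressing idea can be sketched in a few lines - a toy in-memory store where the key is literally the hash of the bytes, so anyone holding the key can verify what they fetched. (IPFS layers CIDs, chunking, and a DHT on top of this same principle; the class below is purely illustrative.)

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: key = SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Self-verifying: re-hash on retrieval to detect corruption.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored data does not match its key")
        return data
```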


https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

Set up a scrape using ArchiveTeam's fork of wget. It can save all the requests and responses into a single WARC file. Then you can use https://replayweb.page/ or some other tool to browse the contents.
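For plain GNU wget (1.14 or later), the built-in WARC options look like this; ArchiveTeam's fork adds Lua scripting hooks on top of the same mechanism. The URL and output name here are placeholders:

```shell
# Mirror a site and record every request/response into one WARC.
# --warc-file names the output (produces example-site.warc.gz);
# --warc-cdx also writes a CDX index for fast record lookup.
wget --mirror --page-requisites --no-parent \
     --warc-file=example-site --warc-cdx \
     "https://example.com/"
```

The resulting .warc.gz can be dropped straight into replayweb.page for browsing.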


In my experience, to archive effectively you need a physical datacenter footprint, or to rent capacity from someone who has one. Over a longer timespan (even just 6 months), having your own footprint has a lower total cost of ownership, provided you have the skills - or access to someone with the skills - to run Kubernetes + Ceph (or something similar).


> Is this a use case for Torrents?

Yes, provided you have a good way to dynamically append to a distributed index of torrents, and users willing to run that indexing software in addition to the torrent client. Should be easy enough to define in docker-compose.
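A sketch of what such a node might look like as a Compose file. The qBittorrent image is a real, commonly used headless client; the `indexer` service is a placeholder, since no standard index-syncing tool exists for this:

```yaml
# Hypothetical archivist node: torrent client + index-sync sidecar.
# The indexer image name is an illustrative assumption, not a real tool.
services:
  torrent:
    image: linuxserver/qbittorrent
    volumes:
      - ./data:/downloads        # the datasets you choose to seed
    ports:
      - "6881:6881"              # BitTorrent peer port
  indexer:
    image: example/torrent-index-sync   # placeholder
    volumes:
      - ./index:/index           # shared, append-only torrent index
```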



