I saw this post and I was curious what was out there.
https://neuromatch.social/@jonny/113444325077647843
I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?
One option that I’ve heard of in the past:
https://archivebox.io/
I am using ArchiveBox; it is pretty straightforward to self-host and use.
However, it is very difficult to archive most news sites with it, and many other sites as well. Most cookie pop-ups and similar overlays will render the archived page unusable, and often archiving won’t work at all because some bot protection (Cloudflare, etc.) will kick in when ArchiveBox tries to access a site.
If anyone else has more success using it, please let me know if I am doing something wrong…
Monolith has the same problem here. I think the best resolution might be some sort of browser-plugin-based solution where you could say “archive this” and have it push the result somewhere.
I wonder if I could combine a dumb plugin with Monolith to do that… A weekend project perhaps.
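For what it's worth, here is a rough sketch of what the receiving end of that could look like. The idea is that the plugin posts the HTML the browser has already rendered (so cookie banners and bot checks are already dealt with) to a small local server, which pipes it into Monolith. Everything named here is my own assumption, not an existing tool: the `X-Page-Url` header, the port, the `archive` directory, and the helper names are hypothetical. The Monolith flags (`-` for stdin, `-b` for base URL, `-o` for output) are as I understand them from its README, so double-check them against your installed version.

```python
import hashlib
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ARCHIVE_DIR = Path("archive")  # hypothetical output directory


def archive_path(page_url: str) -> Path:
    """Derive a stable file name from the page URL (short SHA-256 prefix)."""
    digest = hashlib.sha256(page_url.encode("utf-8")).hexdigest()[:16]
    return ARCHIVE_DIR / f"{digest}.html"


def monolith_command(page_url: str, out_file: Path) -> list[str]:
    """Build the Monolith invocation (flags assumed from its README)."""
    # "-" reads the HTML from stdin, -b sets the base URL used to resolve
    # relative links, -o names the single-file output.
    return ["monolith", "-", "-b", page_url, "-o", str(out_file)]


class ArchiveHandler(BaseHTTPRequestHandler):
    """Accepts POSTs whose body is the rendered HTML of a page."""

    def do_POST(self):
        page_url = self.headers.get("X-Page-Url", "")  # hypothetical header
        length = int(self.headers.get("Content-Length", 0))
        html = self.rfile.read(length)
        if not page_url or not html:
            self.send_error(400, "need an X-Page-Url header and an HTML body")
            return
        ARCHIVE_DIR.mkdir(exist_ok=True)
        out_file = archive_path(page_url)
        try:
            result = subprocess.run(monolith_command(page_url, out_file), input=html)
            ok = result.returncode == 0
        except FileNotFoundError:
            ok = False  # monolith is not installed or not on PATH
        self.send_response(201 if ok else 500)
        self.end_headers()


def serve(port: int = 8765) -> None:
    """Listen on localhost only; the plugin would POST to this port."""
    HTTPServer(("127.0.0.1", port), ArchiveHandler).serve_forever()
```

Calling `serve()` starts the listener. The plugin side would then only need to grab `document.documentElement.outerHTML` and POST it along with the page URL, which keeps the plugin as dumb as possible.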
Going to check that out because…yeah. Just gotta figure out what and where to archive.
That looks useful, I might host that. Does anyone have an RSS feed of at-risk data?
This seems pretty cool. I might actually host this.
Eyy, I want that!