ArXiv replication brainstorming

https://stackoverflow.com/questions/1206166

05-07-2019

Question

The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.

Solution

The full PDF content is available in the Amazon cloud (Amazon S3).

Although there are more than 600k papers on arXiv, the total size of the PDFs is less than 1/2 TB.

http://arxiv.org/help/bulk_data_s3
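To put those two bounds in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (more than 600k papers, less than 1/2 TB); the 100 Mbit/s link speed and the function names are assumptions made up for illustration, and 1 TB is taken as 10^12 bytes.

```python
# Rough estimates from the bounds quoted in the answer.
# Both inputs are bounds, not measurements, so the results are upper bounds too.
PAPER_COUNT_LOWER_BOUND = 600_000        # "> 600k papers"
TOTAL_BYTES_UPPER_BOUND = 0.5 * 10**12   # "< 1/2 TB", assuming 1 TB = 10^12 bytes


def average_paper_size_mb(total_bytes, paper_count):
    """Average size per paper in megabytes (10^6 bytes)."""
    return total_bytes / paper_count / 10**6


def transfer_time_hours(total_bytes, link_mbit_per_s):
    """Hours needed to move total_bytes over a fully used link.

    link_mbit_per_s is a hypothetical sustained rate in megabits per second.
    """
    bytes_per_second = link_mbit_per_s * 10**6 / 8
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    avg = average_paper_size_mb(TOTAL_BYTES_UPPER_BOUND, PAPER_COUNT_LOWER_BOUND)
    hours = transfer_time_hours(TOTAL_BYTES_UPPER_BOUND, 100)  # assumed 100 Mbit/s
    print(f"average PDF size: under {avg:.2f} MB")
    print(f"full copy at 100 Mbit/s: under {hours:.1f} hours")
```

With these numbers the average PDF comes out at under about 0.83 MB, and a full copy would take roughly 11 hours on the assumed link, so a complete mirror of the PDFs is within reach of a single machine.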

Other tips

arXiv recommends Squid in httpd accelerator mode for precisely this purpose. Is there any particular reason why this is not good enough?

My first thought is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well that works with PDFs, though.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow