Question

The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.


Solution

The full PDF content is available in the Amazon cloud (S3).

Although there are more than 600k papers on arXiv, the total size of the PDFs is less than half a terabyte.

http://arxiv.org/help/bulk_data_s3

T.
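For anyone who wants to keep a local mirror up to date from the bulk data, one possible approach is to compare a manifest of the available archive chunks against what is already on disk and fetch only the difference. The snippet below is only a rough sketch of that comparison step. The manifest layout (`file`, `filename`, `size` elements), the sample data, and the function name `chunks_to_fetch` are assumptions made for illustration; check the bulk data page above for the real format. The download itself is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest excerpt; the element names are assumptions,
# not a confirmed description of the real arXiv manifest.
SAMPLE_MANIFEST = """
<arXivPDF>
  <file>
    <filename>pdf/arXiv_pdf_0001_001.tar</filename>
    <size>100</size>
  </file>
  <file>
    <filename>pdf/arXiv_pdf_0001_002.tar</filename>
    <size>200</size>
  </file>
</arXivPDF>
"""


def chunks_to_fetch(manifest_xml, already_have):
    """Return (filename, size_in_bytes) pairs for chunks not yet mirrored.

    manifest_xml -- manifest contents as a string
    already_have -- set of chunk filenames already stored locally
    """
    root = ET.fromstring(manifest_xml)
    missing = []
    for entry in root.iter("file"):
        name = entry.findtext("filename")
        # Treat a missing <size> element as zero bytes.
        size = int(entry.findtext("size", default="0"))
        if name and name not in already_have:
            missing.append((name, size))
    return missing


if __name__ == "__main__":
    local = {"pdf/arXiv_pdf_0001_001.tar"}
    todo = chunks_to_fetch(SAMPLE_MANIFEST, local)
    total_bytes = sum(size for _, size in todo)
    print(f"{len(todo)} chunk(s) to fetch, {total_bytes} bytes in total")
```

Summing the sizes first gives an estimate of the transfer before anything is downloaded, which matters if bandwidth is the main concern.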

OTHER TIPS

arXiv recommends Squid in httpd-accelerator mode for precisely this purpose. Is there any particular reason why this is not good enough?

My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow