ArXiv replication brainstorming
Question
The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.
Solution
The full PDF content is available in the Amazon cloud (S3).
Although there are more than 600k papers on arXiv, the total size of the PDFs is under 0.5 TB:
http://arxiv.org/help/bulk_data_s3
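If the bulk data comes as archive chunks listed in a manifest, keeping a mirror up to date mostly means diffing that manifest against what you already have. Here is a rough sketch of that step, not a finished tool. The manifest layout is an assumption on my part: the element names (`file`, `filename`, `size`, `yymm`), the sample file names, and the `plan_sync` helper are all hypothetical, so check them against the real manifest before relying on this.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a bulk-data manifest. The element names and file names
# below are assumptions for illustration, not the confirmed arXiv layout.
SAMPLE_MANIFEST = """\
<manifest>
  <file><filename>pdf/chunk_0901_001.tar</filename><size>500</size><yymm>0901</yymm></file>
  <file><filename>pdf/chunk_0901_002.tar</filename><size>300</size><yymm>0901</yymm></file>
  <file><filename>pdf/chunk_0902_001.tar</filename><size>200</size><yymm>0902</yymm></file>
</manifest>
"""


def plan_sync(manifest_xml, have, months=None):
    """Return (filenames_to_fetch, total_bytes).

    have   -- set of chunk filenames already stored locally
    months -- optional set of yymm strings; None means "every month"
    """
    root = ET.fromstring(manifest_xml)
    todo = []
    total = 0
    for entry in root.iter("file"):
        name = entry.findtext("filename")
        if name in have:
            continue  # already mirrored, nothing to download
        if months is not None and entry.findtext("yymm") not in months:
            continue  # outside the months the user asked for
        todo.append(name)
        total += int(entry.findtext("size"))
    return todo, total


if __name__ == "__main__":
    already_have = {"pdf/chunk_0901_001.tar"}
    names, size = plan_sync(SAMPLE_MANIFEST, already_have)
    print(names, size)
```

A full mirror would call `plan_sync` with `months=None`, while someone who only wants recent material could pass the latest month or two. The returned list is what you would then hand to whatever does the actual transfer (S3 client, BitTorrent, or a local cache).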
T.
OTHER TIPS
arXiv recommends Squid in httpd-accelerator mode for precisely this purpose. Is there any particular reason why this is not good enough?
My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.