Fetching via wget to memory & bypassing disk writes
20-09-2019
Question
Is it possible to download contents of a website—a set of HTML pages—straight to memory without writing out to disk?
I have a cluster of machines with 24G of RAM installed each, but I'm limited by a disk quota of several hundred MB. I was thinking of redirecting the output of wget
to some kind of in-memory structure without storing the contents on disk. The other option is to create my own version of wget,
but maybe there is a simple way to do it with pipes.
Also, what would be the best way to run this download in parallel (the cluster has >20 nodes)? I can't use the file system in this case.
Solution
‘-O file’
‘--output-document=file’
The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If ‘-’ is used as file, documents will be printed to standard output, disabling link conversion. (Use ‘./-’ to print to a file literally named ‘-’.)
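As a quick sketch of that option (example.com is a placeholder URL), the page can be piped straight into another process, or held in a shell variable, without touching disk:

```shell
# Stream the page to stdout and into a consumer; nothing hits the disk
wget -q -O - https://example.com/ | wc -c

# Or capture the body in a shell variable (i.e., in memory)
content=$(wget -q -O - https://example.com/)
```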
If you want to read the files into a Perl program, you can invoke wget using backticks.
Depending on what you really need to do, you might be able to get by just using LWP::Simple's get.
use LWP::Simple;
my $content = get("http://www.example.com/");
die "Couldn't get it!" unless defined $content;
Update: I had no idea you could implement your own file system in Perl using FUSE and Fuse.pm. See also Fuse::InMemory.
OTHER TIPS
If you a) are already using Perl, b) want to download HTML, and c) want to parse it, I always recommend LWP and HTML::TreeBuilder.
Are you root? You could just use a tmpfs.
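If so, a RAM-backed mount is a one-liner (the mount point and size below are placeholders; this requires root):

```shell
# Create a 1 GB RAM-backed file system; files written under it never touch disk
mkdir -p /mnt/wgetcache
mount -t tmpfs -o size=1g tmpfs /mnt/wgetcache
```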
Re your edit: you're not CPU-bound, so you don't need to use every machine. You can use xargs -n SOME_NUMBER
to split your list of root URLs, assuming there are several.
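A sketch of the parallel variant with GNU xargs (urls.txt, one URL per line, and the parallelism of 8 are assumptions):

```shell
# Run up to 8 wget processes at once; each page is streamed to a
# consumer (here just wc -c) instead of being written to a file
xargs -n 1 -P 8 sh -c 'wget -q -O - "$1" | wc -c' _ < urls.txt
```

The `sh -c '…' _` idiom makes each URL available to the inner pipeline as `$1`.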
But if you are keen on sharing memory, you can set up a cluster memcache and mount it on every machine with memcachefs.
wget <url> -O -
will write the contents of the URL to standard output, which can then be captured in memory.