Question

We have about 60 million webpages in a compressed format. We would like to decompress them and work with the files individually.

Here are my questions!

First, if I decompress them onto the file system, will the FS cope with that number of files? My file system is ext4. (I have 4 different file systems, so I could divide the data between them, e.g. 15 million pages per file system.)

Secondly, would storing these files in a relational database be a better option, assuming all the hassle of cleaning the HTML text is done before inserting them into the database?

Thanks,


Solution

If you extract them into a single directory you may hit the file system's per-directory limits, and directory lookups will slow down long before that. If you spread them across multiple directories you will fare better.

60 million is definitely a fair amount. If you plan on doing any indexing or searching, then a database would be your best option; you can also index plain files using something like Lucene. It all depends on what you want to do with the files after they have been extracted.
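If you go the plain-files route and want full-text search, the Lucene side looks roughly like the sketch below. This is a minimal sketch assuming a reasonably recent Lucene release (5.x or later); the index path, field names, and sample values are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        // Open (or create) an on-disk index; the path is hypothetical.
        try (FSDirectory indexDir = FSDirectory.open(Paths.get("/data/page-index"));
             IndexWriter writer = new IndexWriter(indexDir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            // One Lucene document per decompressed page.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page-1", Field.Store.YES)); // exact-match key
            doc.add(new TextField("body", "cleaned page text goes here", Field.Store.NO)); // analyzed full text
            writer.addDocument(doc);
        }
    }
}
```

In practice you would loop over the decompressed pages and add one document per page, then use an IndexSearcher to query the result.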

I currently have a similar issue with images on a large user site. The way I got around it was to give each image a GUID and, for each byte of the GUID, assign it a directory level, with the next byte becoming a subdirectory under that (down to 8 bytes). If my fill ratio goes up I'll create more subdirectory levels to compensate; it also means I can spread the data across different network storage boxes.
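To make that layout concrete, here is a minimal sketch of GUID-based directory sharding along those lines. I've limited it to two directory levels rather than the full eight, and the root path, class name, and file contents are just placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class ShardedStore {
    // Map a GUID to a nested directory path: one level per byte (two hex chars),
    // limited here to `levels` levels for illustration.
    static Path pathFor(UUID id, Path root, int levels) {
        String hex = id.toString().replace("-", "");
        Path p = root;
        for (int i = 0; i < levels; i++) {
            p = p.resolve(hex.substring(i * 2, i * 2 + 2)); // one byte = one directory level
        }
        return p.resolve(hex); // final file named by the full GUID
    }

    public static void main(String[] args) throws Exception {
        UUID id = UUID.randomUUID();
        Path target = pathFor(id, Paths.get("/data/pages"), 2);
        Files.createDirectories(target.getParent());
        Files.write(target, "decompressed page contents".getBytes());
        System.out.println("stored at " + target);
    }
}
```

With two levels of 256 buckets each, 60 million files works out to roughly 900 files per leaf directory, which ext4 handles comfortably; adding a third level drops that further if needed.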
