Question

I'm a Perl programmer with some scripts that fetch HTTP pages (from a text-file list of URLs) with cURL and save them to a folder.

However, the number of pages to get is in the tens of millions. Sometimes the script fails around number 170,000 and I have to restart it manually. On restart it reads each URL, checks whether the page has already been downloaded, and skips it if so. But with a few hundred thousand pages already done, it still takes a few hours to skip back to where it left off. Obviously, this is not going to pan out in the end.

I've been told that instead of saving to a text file, which is hard to search and modify, I need to use a database. I don't know much about databases; I just messed around with MySQL on a school server a year ago. I need the ability to add millions of rows with a few static columns, search or modify a row quickly, and do this all locally on a LAN (or a single computer if that's difficult). And of course, I need to access this database from Perl.

Where should I start? What do I need to download to get a server started on Windows? Which Perl modules should I use? (I'm using an ActiveState distro)


Solution

Since you only need to search on one column, you may wish to consider a key/value store such as Berkeley DB, accessed from Perl through either the BerkeleyDB or DB_File module.

Generally, you can think of these key/value databases as Perl hashes that live on disk rather than in memory. Exact-key lookups are very fast; everything else requires scanning the whole dataset.
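For example, with DB_File you can tie a hash to an on-disk file and use it to record which URLs have already been fetched; a minimal sketch (the file name and URL key are just illustrative):

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;    # for O_CREAT and O_RDWR

# Tie a hash to an on-disk Berkeley DB file (file name is illustrative).
my %seen;
tie %seen, 'DB_File', 'downloaded.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot open downloaded.db: $!";

# Mark a URL as downloaded.
$seen{'http://example.com/page1.html'} = 1;

# Exact-key lookup is fast, so skipping already-fetched URLs is cheap
# even after a restart.
print "already fetched\n" if exists $seen{'http://example.com/page1.html'};

untie %seen;
```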

OTHER TIPS

There are many kinds of databases, but if you've already decided on an SQL database and want to keep the setup easy, have a look at SQLite and the DBI/DBD::SQLite modules, which let you use it from Perl.
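SQLite is a single local file, so there is no server to install. A minimal sketch of that approach, assuming a database file called urls.db and a simple url/status table (both names are illustrative, not from the question):

```perl
use strict;
use warnings;
use DBI;

# Connect to (and create, if needed) a local SQLite file; no server required.
my $dbh = DBI->connect("dbi:SQLite:dbname=urls.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS pages (
        url    TEXT PRIMARY KEY,
        status TEXT
    )
});

# Record a fetched page.
my $insert = $dbh->prepare(
    "INSERT OR REPLACE INTO pages (url, status) VALUES (?, ?)");
$insert->execute('http://example.com/page1.html', 'done');

# Check quickly whether a URL was already downloaded.
my ($status) = $dbh->selectrow_array(
    "SELECT status FROM pages WHERE url = ?", undef,
    'http://example.com/page1.html');
print "already fetched\n" if defined $status;

$dbh->disconnect;
```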

Look into DBI. If you do not like SQL in your programs, try SQL::Abstract.
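For instance, SQL::Abstract can build the SQL statement and bind values from plain Perl data structures, which you then hand to DBI; a sketch reusing the hypothetical pages table from above:

```perl
use strict;
use warnings;
use DBI;
use SQL::Abstract;

my $sql = SQL::Abstract->new;
my $dbh = DBI->connect("dbi:SQLite:dbname=urls.db", "", "",
                       { RaiseError => 1 });

# Build "SELECT url, status FROM pages WHERE status = ?" without
# writing the SQL by hand.
my ($stmt, @bind) = $sql->select('pages', ['url', 'status'],
                                 { status => 'pending' });

my $sth = $dbh->prepare($stmt);
$sth->execute(@bind);

while (my $row = $sth->fetchrow_hashref) {
    print "$row->{url}\n";
}

$dbh->disconnect;
```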

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow