Question

I have written a web crawler in Java, and I am using Berkeley DB to save the pages I crawl (for later indexing, etc.). I am storing each page as a Webpage object, which has the following instance fields:

@PrimaryKey
String url;
String docString;
Date lastVisited;
Date lastChecked;
ArrayList<String> stringLinks;

The largest field is docString, which holds the entire HTML content of the page (normally no more than 500 KB, even for a huge page). stringLinks keeps a String for each outbound link on the page; that shouldn't be large either: at most about 100 strings of length ~70, not even a few KB.
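For completeness, the class looks roughly like this (simplified; the @Entity annotation and the private no-argument constructor are what the DPL requires for any entity class, not code copied verbatim from my project):

import java.util.ArrayList;
import java.util.Date;
import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;

@Entity
public class Webpage {

    @PrimaryKey
    String url;                      // primary key: the page URL

    String docString;                // full HTML of the page
    Date lastVisited;
    Date lastChecked;
    ArrayList<String> stringLinks;   // outbound links found on the page

    private Webpage() {}             // no-arg constructor required by the DPL

    public Webpage(String url) {
        this.url = url;
    }
}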

I crawl a little faster than one page per second, sometimes two pages per second, and I am seeing the Berkeley DB environment grow by about 2-3 MB per page, which is absolutely crazy given the data being stored. The Webpage objects are stored in an EntityStore, and I sync it periodically. No matter what I change, I can't get the disk usage to go down!
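The store is opened and written to roughly like this (simplified sketch; names other than Webpage, and the way sync is scheduled, are placeholders rather than my exact code):

import java.io.File;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;

public class PageStore {

    private final Environment env;
    private final EntityStore store;
    private final PrimaryIndex<String, Webpage> pagesByUrl;

    public PageStore(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);

        StoreConfig storeConfig = new StoreConfig();
        storeConfig.setAllowCreate(true);

        env = new Environment(dir, envConfig);
        store = new EntityStore(env, "CrawlStore", storeConfig);
        pagesByUrl = store.getPrimaryIndex(String.class, Webpage.class);
    }

    // Called once per crawled page.
    public void put(Webpage page) {
        pagesByUrl.put(page);        // insert or overwrite by primary key (url)
    }

    // Called periodically to flush to disk.
    public void sync() {
        env.sync();
    }
}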

This is a pretty big problem, because if I run multiple instances of the crawler (I have built it to be distributed), each one will quickly use a ton of disk space. If the space grows linearly, I might be fine, but I have no way to tell how it actually grows; all I know is that it is many times the size of the actual data.

Is there something I am missing about EntityStore?

One thing to note is that I am both reading from and writing to the DB, so I can't set any flags to make it write-only or the like. I would also prefer not to increase the cache size much, since this is a heap-space-sensitive environment.
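For reference, the JE cache can be capped explicitly through EnvironmentConfig rather than enlarged; a quick sketch (the 32 MB figure is only an example, not a value I actually use):

import com.sleepycat.je.EnvironmentConfig;

// Cap the JE cache explicitly instead of letting it take the default
// percentage of the heap; 32 MB here is purely illustrative.
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
envConfig.setCacheSize(32 * 1024 * 1024L);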


Solution

The issue was with deferred write. I had to enable deferred write on the store and then call env.sync() on a timer, rather than calling env.sync() on each put, in order to keep the database size in check. The on-disk size decreased by a factor of more than 30.
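In code, the change looks roughly like this (the store name, directory, and 30-second interval are placeholders, not values from my setup):

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.StoreConfig;

public class DeferredWriteSetup {

    public static void main(String[] args) {
        File dir = new File("crawl-db");
        dir.mkdirs();                          // JE needs the environment directory to exist

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);

        StoreConfig storeConfig = new StoreConfig();
        storeConfig.setAllowCreate(true);
        storeConfig.setDeferredWrite(true);    // buffer writes instead of logging every put

        Environment env = new Environment(dir, envConfig);
        EntityStore store = new EntityStore(env, "CrawlStore", storeConfig);

        // Flush to disk on a timer instead of syncing after every put.
        ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();
        syncer.scheduleAtFixedRate(() -> {
            store.sync();                      // flush the deferred-write store
            env.sync();                        // flush and fsync the environment
        }, 30, 30, TimeUnit.SECONDS);

        // ... crawler runs and puts Webpage entities through the store ...
    }
}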

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow