Question

I am trying to understand Crawler4j, the open-source web crawler. In the meantime, I have some questions:

  1. What does statisticsDB do in the Counters class? Please explain the following code:

     public Counters(Environment env, CrawlConfig config) throws DatabaseException {
        super(config);
    
        this.env = env;
        this.counterValues = new HashMap<String, Long>();
    
        /*
         * When crawling is set to be resumable, we have to keep the statistics
         * in a transactional database to make sure they are not lost if crawler
         * is crashed or terminated unexpectedly.
         */
        if (config.isResumableCrawling()) {
            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setTransactional(true);
            dbConfig.setDeferredWrite(false);
            statisticsDB = env.openDatabase(null, "Statistics", dbConfig);
    
            OperationStatus result;
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            Transaction tnx = env.beginTransaction(null, null);
            Cursor cursor = statisticsDB.openCursor(tnx, null);
            result = cursor.getFirst(key, value, null);
    
            while (result == OperationStatus.SUCCESS) {
                if (value.getData().length > 0) {
                    String name = new String(key.getData());
                    long counterValue = Util.byteArray2Long(value.getData());
                    counterValues.put(name, counterValue);
                }
                result = cursor.getNext(key, value, null);
            }
            cursor.close();
            tnx.commit();
        }
    }
    

As far as I understand, it saves the crawled URLs, so that if the crawler crashes it does not need to start again from the beginning. Can you please explain the above code line by line?

  2. I couldn't find any good resource that explains SleepyCat, which Crawler4j uses to store intermediate information. Please point me to a good resource for learning the basics of SleepyCat. (I don't know what Transaction and Cursor mean in the code above.)

Please help. I look forward to your reply.


Solution

Basically, Crawler4j loads the existing statistics (the saved counter values) from the database by reading every record in it. The code is not quite right, though: a transaction is opened even though nothing in the database is modified. The lines dealing with tnx could therefore be removed, with null passed to openCursor in place of tnx.

Commented line by line:

// Create a database configuration object
DatabaseConfig dbConfig = new DatabaseConfig();
// Set some parameters: allow creation, make the database transactional, and disable deferred writes
dbConfig.setAllowCreate(true);
dbConfig.setTransactional(true);
dbConfig.setDeferredWrite(false);
// Open the database called "Statistics" with the configuration created above
statisticsDB = env.openDatabase(null, "Statistics", dbConfig);

OperationStatus result;
// Create empty entries that the cursor will fill with each key and value
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
// Start a transaction
Transaction tnx = env.beginTransaction(null, null);
// Open a cursor on the database
Cursor cursor = statisticsDB.openCursor(tnx, null);
// Position the cursor on the first key/value pair
result = cursor.getFirst(key, value, null);
// Loop for as long as the cursor returns a record
while (result == OperationStatus.SUCCESS) {
    // If the value at the current cursor position is not empty, read the counter's
    // name and value and add them to the counterValues HashMap
    if (value.getData().length > 0) {
        String name = new String(key.getData());
        long counterValue = Util.byteArray2Long(value.getData());
        counterValues.put(name, counterValue);
    }
    // Move the cursor to the next key/value pair
    result = cursor.getNext(key, value, null);
}
// Close the cursor
cursor.close();
// Commit the transaction; nothing was written, so this only ends the transaction
tnx.commit();
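
To see why the counters are reloaded at startup, here is a minimal sketch of the same pattern using only the JDK. It is not Crawler4j's code: a TreeMap stands in for the on-disk "Statistics" database, the class name ResumableCountersSketch and the counter name "processedPages" are hypothetical, and the 8-byte big-endian encoding is an assumption about what Util.byteArray2Long expects.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the Map passed in stands in for the persistent "Statistics"
// database, and the byte encoding is an assumption, not Crawler4j's confirmed format.
final class ResumableCountersSketch {
    private final Map<String, byte[]> store;                       // stand-in for statisticsDB
    private final Map<String, Long> counterValues = new HashMap<>();

    ResumableCountersSketch(Map<String, byte[]> store) {
        this.store = store;
        // Same shape as the cursor loop: visit every key/value pair,
        // skip empty values, and decode the rest into the in-memory map.
        for (Map.Entry<String, byte[]> entry : store.entrySet()) {
            if (entry.getValue().length > 0) {
                counterValues.put(entry.getKey(), bytesToLong(entry.getValue()));
            }
        }
    }

    // Current value of a counter, or 0 if it has never been incremented.
    long get(String name) {
        return counterValues.getOrDefault(name, 0L);
    }

    // Update the in-memory value and write it through to the store,
    // so that a later restart can pick it up again.
    void increment(String name) {
        long updated = get(name) + 1;
        counterValues.put(name, updated);
        store.put(name, longToBytes(updated));
    }

    // Encode a long as 8 bytes (ByteBuffer uses big-endian order by default).
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decode 8 bytes back into a long.
    static long bytesToLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        Map<String, byte[]> disk = new TreeMap<>();

        ResumableCountersSketch firstRun = new ResumableCountersSketch(disk);
        firstRun.increment("processedPages");
        firstRun.increment("processedPages");

        // Simulate a restart: a new instance reloads the counters from the same store.
        ResumableCountersSketch secondRun = new ResumableCountersSketch(disk);
        System.out.println(secondRun.get("processedPages")); // prints 2
    }
}
```

The point of the sketch is that what gets restored is a set of named counters, not the crawled URLs themselves. In the real class, the store is a transactional database, so the saved values survive a crash rather than just a new object being created.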

I also answered a similar question here. As for SleepyCat, are you referring to this?

Licensed under: CC-BY-SA with attribution