Question

I'm using Sesame for querying RDF via SPARQL. I work with big files (2 GB, 10 GB) and run several queries in succession. While working with such large files I get the error java.lang.OutOfMemoryError: Java heap space. I run my application with -Xmx3g, but it seems that is not enough for these files. Should I perhaps shut down the repository after every query?

Here is my code:

void runQuery() {
    try {
        con = repo.getConnection();
        TupleQuery tupleQuery = con.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
        TupleQueryResult result = tupleQuery.evaluate();
        while (result.hasNext()) {
            result.next();
        }
        result.close();
        con.close();
    } catch (Exception e) {
        ...
    }
}

void runTests() {
    File dataDir = new File("RepoDir/");
    repo = new SailRepository(new NativeStore(dataDir));
    repo.initialize();
    ...
    for (int j = 0; j < NUMBER_OF_QUERIES; ++j) {
        queryString = queries.get(j);
        runQuery(); 
    }
    ...
    repo.shutDown();
}
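
For reference, a minimal sketch (assuming the Sesame 2.x org.openrdf API) of the same query loop with cleanup moved into finally blocks, so the result and connection are closed even if evaluate() or the iteration throws; whether this alone fixes the heap error depends on what is actually holding the memory:

void runQuery() {
    RepositoryConnection con = null;
    TupleQueryResult result = null;
    try {
        con = repo.getConnection();
        TupleQuery tupleQuery = con.prepareTupleQuery(QueryLanguage.SPARQL, queryString);
        result = tupleQuery.evaluate();
        while (result.hasNext()) {
            result.next();
        }
    } catch (Exception e) {
        // handle or log the error
    } finally {
        // release resources even if evaluation or iteration failed
        try { if (result != null) result.close(); } catch (Exception ignore) { }
        try { if (con != null) con.close(); } catch (Exception ignore) { }
    }
}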

Also, is it possible to use MemoryStore instead of NativeStore for such large files?
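
(For context: a MemoryStore keeps the entire dataset in the JVM heap, so loading 2-10 GB files would need far more than -Xmx3g. A minimal sketch, using the standard Sesame 2.x classes, of how a memory-backed repository would be constructed:)

Repository memRepo = new SailRepository(new MemoryStore());   // org.openrdf.sail.memory.MemoryStore
memRepo.initialize();                                         // all statements are held in the JVM heap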

An example of a query that triggers the error:

SELECT DISTINCT ?name1 ?name2 
WHERE {
  ?article1 rdf:type bench:Article .
  ?article2 rdf:type bench:Article .
  ?article1 dc:creator ?author1 .
  ?author1 foaf:name ?name1 .
  ?article2 dc:creator ?author2 .
  ?author2 foaf:name ?name2 .
  ?article1 swrc:journal ?journal .
  ?article2 swrc:journal ?journal
  FILTER (?name1<?name2)
}

Solution

So that's SP2B Query 4 (information that would have been useful to provide in your original post; please be thorough with your questions if you expect people to be thorough with their answers).

SP2B Query 4 at the 5M scale returns ~18.4M results. The 5M dataset (in Turtle) is ~500 MB, so given your stated sizes, I'm guessing you're trying this with the 25M and 100M datasets?

The original authors were not even able to publish the size of the result set for Q4, as nothing could compute it (at least within the bounds of the study). Given the scale factor apparent in the dataset for that query's results, I'd imagine we're talking about 100M+ results at the 25M scale, and possibly as many as 1B results at the 100M scale.

The size of the intermediate joins needed to compute a result set of that size is enormous, and it's no wonder that 3 GB of RAM is not enough. Sesame is a good system, but I have no idea how much memory it would require to answer that query at that scale, or whether it could answer it at all.
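
One hypothetical way to probe this (not from the original answer) is to append a LIMIT to the same query string and see whether a bounded slice of the result can be produced within the current heap; Sesame evaluates a TupleQueryResult lazily, so a small LIMIT usually lets evaluation stop early, though the intermediate joins can still dominate memory:

String boundedQuery = queryString + "\nLIMIT 1000";   // hypothetical cap, just for testing
TupleQuery q = con.prepareTupleQuery(QueryLanguage.SPARQL, boundedQuery);
TupleQueryResult r = q.evaluate();
try {
    long n = 0;
    while (r.hasNext()) {
        r.next();
        n++;
    }
    System.out.println("Produced " + n + " bounded results");
} finally {
    r.close();
}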

To my knowledge, only one system has reported running that query at the 25M scale, and no one has run it at 100M. This is why SP2B is a great, but perverse, benchmark. You might read a little more background material on it, and also look into BSBM, if you're trying to benchmark triple-store performance.

Licensed under: CC-BY-SA with attribution