Question

I've noticed that instantiation using the RepositoryConnection method add was slower than modifying the model with a SPARQL update query. Despite the difference, even the SPARQL update method takes a long time (3.4 minutes for 10,000 triples). Executing multiple inserts (one query per triple) or one big insert query does not change the performance; both are still slow. Is there another method appropriate for adding 1 million triples, or are there any special configurations that can help?

Code for RepositoryConnection

Repository myRepository = new HTTPRepository(serverURL, repositoryId);
myRepository.initialize();
RepositoryConnection con = myRepository.getConnection();
ValueFactory f = myRepository.getValueFactory();

int i = 0;
int j = 1000000;

while (i < j) {
    URI event = f.createURI(ontologyIRI + "event" + i);
    URI hasTimeStamp = f.createURI(ontologyIRI + "hasTimeStamp");
    Literal timestamp = f.createLiteral(fields.get(0));
    con.add(event, hasTimeStamp, timestamp);
    i++;
}

Code for SPARQL

Repository myRepository = new HTTPRepository(serverURL, repositoryId);
myRepository.initialize();
RepositoryConnection con = myRepository.getConnection();

int i = 0;
int j = 1000000;

while (i < j) {
    String query = "INSERT { "
        + "st:event" + i + " st:hasTimeStamp '" + fields.get(0) + "'^^<http://www.w3.org/2001/XMLSchema#float> .\n"
        + "} "
        + "WHERE { ?x ?y ?z }";
    Update update = con.prepareUpdate(QueryLanguage.SPARQL, query);
    update.execute();

    i++;
}

Edit: I've experimented with both the in-memory and native store Sesame repositories, with the synchronization value set to 0.


Solution

(I only just noticed that you added the requested additional info, hence this rather late reply)

The problem is, as I suspected, that you are not using transactions to batch your update operations together. Effectively, each add operation you do becomes a single transaction (a Sesame repository connection runs in autocommit mode by default), and this is slow and inefficient.

To change this, start a transaction (using RepositoryConnection.begin()), then add your data, and finally call RepositoryConnection.commit() to finalize the transaction.

Here's how you should modify your first code example:

Repository myRepository = new HTTPRepository(serverURL, repositoryId);   
myRepository.initialize(); 
RepositoryConnection con = myRepository.getConnection(); 
ValueFactory f = myRepository.getValueFactory();

int i = 0;
int j = 1000000;

try {
  con.begin(); // start the transaction
  while(i < j) {
      URI event    = f.createURI(ontologyIRI + "event"+i);
      URI hasTimeStamp    = f.createURI(ontologyIRI + "hasTimeStamp");
      Literal timestamp   = f.createLiteral(fields.get(0));
      con.add(event, hasTimeStamp, timestamp);
      i++; 
  }
  con.commit(); // finish the transaction: commit all our adds in one go.
}
finally {
  // always close the connection when you're done with it. 
  con.close();
}

The same applies to your code with the SPARQL update. For more information on how to work with transactions, have a look at the Sesame manual, particularly the chapter about using the Repository API.
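For the SPARQL variant, it also helps to send fewer, larger updates. Below is a minimal sketch of building one INSERT DATA query for a whole batch of events; the st: prefix IRI is a placeholder (the question never shows it), and since the triples are fixed values, INSERT DATA without a WHERE clause suffices:

```java
public class BatchInsertBuilder {

    // Build a single SPARQL INSERT DATA query covering events [from, to).
    // INSERT DATA takes no WHERE clause, unlike the per-triple
    // "INSERT { ... } WHERE { ?x ?y ?z }" in the question, so the store
    // does not have to evaluate a graph pattern for every update.
    static String buildBatchInsert(double[] timestamps, int from, int to) {
        StringBuilder q = new StringBuilder();
        q.append("PREFIX st: <http://example.org/st#>\n"); // placeholder prefix IRI
        q.append("PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n");
        q.append("INSERT DATA {\n");
        for (int i = from; i < to; i++) {
            q.append("  st:event").append(i)
             .append(" st:hasTimeStamp \"").append(timestamps[i])
             .append("\"^^xsd:float .\n");
        }
        q.append("}");
        return q.toString();
    }

    public static void main(String[] args) {
        // One query inserting two triples instead of two round-trips.
        System.out.println(buildBatchInsert(new double[] {1.5, 2.5}, 0, 2));
    }
}
```

The resulting string would then be passed to prepareUpdate and executed once per batch, inside a transaction, instead of once per triple.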

As an aside: since you're working over HTTP, there is a risk that if your transaction becomes too large, it will start consuming a lot of memory in your client. If that happens, you may want to break up your update into several transactions. But with an update consisting of a million triples you should still be alright, I think.
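The chunking itself is just a counter wrapped around begin/commit. Here is an illustrative sketch of that loop structure, independent of the Sesame API (the actual con.add/con.commit calls are shown only as comments, and the batch size is a made-up number you would tune to your client's memory):

```java
public class ChunkedCommits {

    static final int BATCH_SIZE = 100_000; // illustrative; tune to available client memory

    // Returns the number of commits needed to load `total` triples,
    // committing after every BATCH_SIZE adds plus once for any remainder.
    static int loadInChunks(int total) {
        int commits = 0;
        int pending = 0;                 // adds since the last commit
        for (int i = 0; i < total; i++) {
            pending++;                   // con.add(event, hasTimeStamp, timestamp);
            if (pending == BATCH_SIZE) {
                commits++;               // con.commit(); con.begin();
                pending = 0;
            }
        }
        if (pending > 0) {
            commits++;                   // final con.commit() for the partial batch
        }
        return commits;
    }

    public static void main(String[] args) {
        System.out.println(loadInChunks(1_000_000)); // prints 10
    }
}
```

With a real connection, each `commits++` point would be a commit() immediately followed by begin() for the next batch, keeping any single transaction's memory footprint bounded.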

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow