Question

I have a Camel context configured to do some manipulation of input data in order to build RDF triples.

There's a final route with a processor that, using the Sesame client API, talks to a separate Sesame instance (running on Tomcat with 3GB of RAM) and sends add commands (each command contains about 5 - 10 statements).

The processor runs as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried 1, then 5, then 10, with more or less the same behaviour).

I'm using HTTPRepository from my processor to send the add commands and, while it runs, I observe a rapid, progressive degradation of indexing performance. The process starts out indexing triples very quickly, but after a short while the number of committed statements grows very slowly.

On the Sesame side I tried both MemoryStore and NativeStore, but the performance behaviour seems more or less the same.

The questions:

  • Which kind of store is recommended if I want to speed up the indexing phase?
  • Does Repository.getConnection do some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
  • Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of having it managed by a remote Sesame server (so that I wouldn't use an HTTPRepository)?

Solution

I am assuming that you're adding in transactions of 4 or 5 statements for a good reason, but if you have a way to use larger transactions, that will significantly boost speed. Ideally (and quickest), you would just send all 300,000 triples to the store in a single transaction.

Your questions, in order:

  • If you're only storing 300,000 statements, the choice of store is not that important, as both the native and the memory store can easily handle this scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but the native store has a lower memory footprint and is of course more robust.

  • HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but it does pool resources internally (the HttpConnections that Sesame uses under the hood are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch them in a single transaction.

  • Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.
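The batching advice above can be sketched as follows. This is only an illustration under stated assumptions, not Sesame code: `TripleSink`, `BatchingWriter` and `CountingSink` are hypothetical names, and `TripleSink` merely stands in for a store connection so that the example runs without any library. With Sesame, I would expect the three methods to map onto a RepositoryConnection's begin / add / commit (or setAutoCommit(false) plus commit on older releases), but check that against the version you run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a store connection (not a Sesame type). */
interface TripleSink {
    void begin();
    void add(String subject, String predicate, String object);
    void commit();
}

/** Buffers small add commands and writes them out in larger transactions. */
final class BatchingWriter {
    private final TripleSink sink;
    private final int batchSize;
    private final List<String[]> buffer = new ArrayList<>();

    BatchingWriter(TripleSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Queues one statement; synchronized because several consumers may call it. */
    synchronized void add(String subject, String predicate, String object) {
        buffer.add(new String[] {subject, predicate, object});
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends everything buffered so far as one transaction. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.begin();
        for (String[] t : buffer) {
            sink.add(t[0], t[1], t[2]);
        }
        sink.commit();
        buffer.clear();
    }
}

/** Test double that only counts statements and transactions. */
final class CountingSink implements TripleSink {
    int transactions = 0;
    int statements = 0;

    @Override public void begin() { }
    @Override public void add(String s, String p, String o) { statements++; }
    @Override public void commit() { transactions++; }
}
```

In the Camel setup from the question, the singleton processor would hold one such writer, call add for each statement it builds, and call flush once the input is exhausted. The batch size is a tuning knob rather than a recommended value.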

Licensed under: CC-BY-SA with attribution