I'm working with an RDF dataset generated as part of our data collection: around 1.6M small files totalling 6.5GB of N-Triples text, around 20M triples in all. My problem is the time it's taking to load this data into a Sesame triple store running under Tomcat.

I'm currently loading it from a Python script via the HTTP API (on the same machine), using simple POST requests one file at a time, and it's taking around five days to complete the load. Compared with published benchmarks this seems very slow, and I'm wondering what method I might use to load the data more quickly.

I did think that I could write Java code to connect directly to the store and so avoid the HTTP overhead. However, I read in an answer to another question here that concurrent access is not supported, so that doesn't look like an option.

If I were to write Java code to connect to the HTTP repository, does the Sesame library do some special magic that would make the data load faster?

Would grouping the files into larger chunks help? That would cut down the HTTP overhead of sending the files. What size of chunk would be good? This blog post suggests 100,000 lines per chunk (it's cutting a larger file up, but the idea would be the same).

Thanks,

Steve


Solution

If you are able to work in Java instead of Python, I would recommend using the transactional support of Sesame's Repository API to your advantage: start a transaction, add several files, then commit; rinse and repeat until you've sent all the files.
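A minimal sketch of that approach, assuming Sesame 2.7+ (where `begin()`/`commit()` are available on `RepositoryConnection`). The server URL, repository ID, data directory, and batch size are placeholders to adjust for your setup:

```java
import java.io.File;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;
import org.openrdf.rio.RDFFormat;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder server URL and repository ID -- adjust to your Tomcat setup.
        HTTPRepository repo =
            new HTTPRepository("http://localhost:8080/openrdf-sesame", "myrepo");
        repo.initialize();

        RepositoryConnection con = repo.getConnection();
        try {
            int filesPerTx = 1000; // batch size: tune for your data
            int count = 0;
            con.begin(); // Sesame 2.7+; older releases use setAutoCommit(false)
            for (File f : new File("/data/ntriples").listFiles()) {
                con.add(f, null, RDFFormat.NTRIPLES);
                if (++count % filesPerTx == 0) {
                    con.commit(); // flush this batch to the server
                    con.begin();
                }
            }
            con.commit(); // commit any remaining files
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}
```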

If that is not an option, then chunking the data into larger files (or larger POST request bodies; you of course do not necessarily need to physically modify your files) would indeed help. A good chunk size in your case would probably be around 500,000 triples. That's a bit of a guess, to be honest, but I think it will give you good results.
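Since N-Triples is line-based (one complete triple per line), you can safely concatenate files into larger request bodies without reparsing them. A rough sketch of that, assuming Sesame's standard HTTP protocol endpoint (`/repositories/{id}/statements`); the endpoint URL, directory, and chunk size are placeholders:

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedPost {
    // Placeholder endpoint: the statements URL for your repository.
    static final String ENDPOINT =
        "http://localhost:8080/openrdf-sesame/repositories/myrepo/statements";
    static final int CHUNK_LINES = 500_000; // roughly 500k triples per request

    public static void main(String[] args) throws IOException {
        StringBuilder chunk = new StringBuilder();
        int lines = 0;
        for (File f : new File("/data/ntriples").listFiles()) {
            try (BufferedReader r = new BufferedReader(new FileReader(f))) {
                String line;
                while ((line = r.readLine()) != null) {
                    chunk.append(line).append('\n');
                    if (++lines == CHUNK_LINES) {
                        post(chunk.toString()); // send one full chunk
                        chunk.setLength(0);
                        lines = 0;
                    }
                }
            }
        }
        if (lines > 0) post(chunk.toString()); // send the final partial chunk
    }

    static void post(String body) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "text/plain"); // N-Triples MIME type
        try (OutputStream out = c.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        if (c.getResponseCode() >= 300) {
            throw new IOException("Upload failed: HTTP " + c.getResponseCode());
        }
        c.disconnect();
    }
}
```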

You can also cut down on overhead by using gzip compression on the POST request body (if you don't do so already).
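In the `post()` sketch above, that would mean declaring the encoding and wrapping the output stream, assuming the server side is set up to accept gzip-encoded request bodies:

```java
import java.util.zip.GZIPOutputStream;

// Inside post(): declare the encoding and compress the body on the way out.
c.setRequestProperty("Content-Encoding", "gzip");
try (OutputStream out = new GZIPOutputStream(c.getOutputStream())) {
    out.write(body.getBytes("UTF-8"));
}
```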
