Question

Suppose I have a large number of heterogeneous JSON documents (i.e. named key-value mappings) and a hierarchy of classes (i.e. named sets) that these documents are attached to. I need to set up a data structure that allows:

  1. CRUD operations on JSON documents.
  2. Retrieving JSON documents by ID really quickly.
  3. Retrieving all JSON documents that are attached to a certain class really quickly.
  4. Editing the class hierarchy: adding/deleting classes, rearranging them.

I initially came up with the idea of storing the JSON documents in a document-oriented database (like CouchDB or MongoDB) and storing the class hierarchy in an RDF store (like 4store). Requirements 1, 2 and 4 then follow naturally, and 3 is solved by maintaining a list of attached document IDs for every class in the RDF store.
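To make the four requirements concrete, here is a minimal in-memory sketch of the layout described above. It is plain Python, not CouchDB, MongoDB or 4store code: the class `DocumentClassIndex` and all of its method names are hypothetical, with one dict standing in for the document database and two more standing in for the class records.

```python
from collections import defaultdict


class DocumentClassIndex:
    """Toy stand-in for a document store plus a per-class list of document IDs."""

    def __init__(self):
        self.documents = {}               # doc_id -> JSON-like dict
        self.members = defaultdict(set)   # class name -> IDs of attached documents
        self.parent = {}                  # class name -> parent class (None for a root)

    # Requirement 4: editing the class hierarchy
    def add_class(self, name, parent=None):
        self.parent[name] = parent

    def move_class(self, name, new_parent):
        # Rearranging only touches the class record, never the documents.
        self.parent[name] = new_parent

    # Requirement 1: CRUD on documents
    def put(self, doc_id, doc, classes=()):
        self.documents[doc_id] = doc
        for cls in classes:
            self.members[cls].add(doc_id)

    def delete(self, doc_id):
        del self.documents[doc_id]
        for ids in self.members.values():
            ids.discard(doc_id)

    # Requirement 2: lookup by ID is a single dict access
    def get(self, doc_id):
        return self.documents[doc_id]

    # Requirement 3: all documents attached to one class
    def by_class(self, cls):
        return [self.documents[i] for i in sorted(self.members[cls])]
```

The point of the sketch is that requirement 3 costs one index lookup plus one fetch per document, which is the trade-off the list-of-IDs approach makes: every attach or delete has to keep the per-class list in sync.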

But then I realized that an RDF store could actually handle the document-oriented part as well, i.e. retrieving JSON documents by ID. At first glance this seems true, but I'm still concerned about 2 and 3. Is there an RDF store that can retrieve documents (nodes) as fast as document-oriented databases serve documents? How fast will it answer queries like 3? I've heard a bit about RDF stores being slow, the reification problem, etc.

Is there an RDF store that is as convenient for casually retrieving objects by ID as, for example, CouchDB? What is the difference between using a document-oriented store and an RDF store for storing, retrieving and editing JSON-like objects?


Solution

The closest thing available in RDF databases is named graphs. A named graph holds a set of RDF triples. This set of triples can be asserted from one or many RDF documents, depending on your needs. Let's say you want one named graph per RDF document. You could name the graph with a URI that reflects the file location (a URL or an IRI). For instance:

http://yourdomain/files/rdf_file_1

or

file:///home/myrdffiles/file1

4store is a quad store. Quad stores support named graphs, and 4store is specifically designed to handle them.

With 4store you can run the following command to assert triples in a named graph:

curl -T your_file.rdf http://your_4store_database/data/http://yourdomain/files/rdf_file_1

After /data/ you put the graph identifier (IRI) under which the triples are going to be asserted. See 4store SPARQL Server and 4store Client Libs for more details.
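The same upload can be sketched from Python with only the standard library. `curl -T` issues an HTTP PUT, so the sketch builds a PUT request against the same `/data/<graph IRI>` URL shape shown above. The function name `build_graph_put`, the `localhost:8080` endpoint and the `application/rdf+xml` content type are assumptions for illustration; check what your 4store server actually expects before sending anything.

```python
import urllib.request


def build_graph_put(endpoint, graph_iri, rdf_bytes,
                    content_type="application/rdf+xml"):
    """Build (but do not send) a PUT request that uploads triples to one named graph."""
    # Same URL shape as the curl command: <endpoint>/data/<graph IRI>
    url = endpoint.rstrip("/") + "/data/" + graph_iri
    return urllib.request.Request(
        url,
        data=rdf_bytes,
        headers={"Content-Type": content_type},  # assumed, not taken from the answer
        method="PUT",
    )


if __name__ == "__main__":
    req = build_graph_put(
        "http://localhost:8080",                  # hypothetical endpoint
        "http://yourdomain/files/rdf_file_1",
        b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
    )
    # Only inspects the request; nothing goes over the network here.
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the upload. Keeping request construction separate from sending makes it easy to check the graph IRI before anything reaches the server.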

Once your data is asserted, you can use the named graph in SPARQL to direct your query to that graph:

SELECT * WHERE {
   GRAPH <http://yourdomain/files/rdf_file_1> {
        .... some triple patterns in here ....
   }
}

Moreover, 4store also supports JSON, so you can retrieve the SPARQL result set directly as JSON.
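As a rough illustration of how a client might use this, the sketch below builds a query restricted to one named graph and flattens a result set in the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`). The helper names `graph_query` and `rows_from_sparql_json` are hypothetical, and the sample response is hand-written for the example, not real 4store output. How the JSON is requested from the server is left out on purpose.

```python
import json


def graph_query(graph_iri, pattern="?s ?p ?o"):
    """Return a SELECT query whose triple patterns only match inside one named graph."""
    return "SELECT * WHERE { GRAPH <%s> { %s } }" % (graph_iri, pattern)


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON result set into a list of {variable: value} dicts."""
    payload = json.loads(text)
    return [
        {var: term["value"] for var, term in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results layout (made-up data).
SAMPLE = """
{"head": {"vars": ["s", "p", "o"]},
 "results": {"bindings": [
   {"s": {"type": "uri", "value": "http://yourdomain/doc/1"},
    "p": {"type": "uri", "value": "http://yourdomain/vocab/name"},
    "o": {"type": "literal", "value": "example"}}
 ]}}
"""

if __name__ == "__main__":
    print(graph_query("http://yourdomain/files/rdf_file_1"))
    print(rows_from_sparql_json(SAMPLE))
```

Each binding becomes a flat dict, which is close to the JSON-like objects the question wants to get back. Note that the flattening drops the `type` field, so URIs and literals are no longer distinguishable afterwards.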

If you decide to use 4store you'll find valuable support here: http://4store.org/contact

OTHER TIPS

You originally asked this question for graph databases (like Neo4j). That's why I'd like to add some notes.

  1. Graph databases use integrated indexing for nodes (and relationships), so the fast initial lookup of the root nodes of your documents goes through those indexes (external or in-graph).
  2. Additional in-graph indexes for paths (actually trees to the root) can be modelled more cleanly than just a key-value lookup.
  3. If you model your documents as trees of nodes with properties, you can do any simple or complex CRUD operation (including structural ones).
  4. Retrieving all documents of a "type" or "class" can again be done via an index (index root nodes by type) or via in-graph category nodes.
  5. You can put those "type" or "class" category nodes into a hierarchy (or graph), which can then be edited using the usual graph database API.
  6. Traversing the graph can be done using traversers or an integrated graph query language (e.g. Cypher for Neo4j).
  7. Loading hierarchical data can be done either by custom importers or by a more general sub-graph importer (e.g. GEOFF).
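
Points 4 to 6 can be sketched in plain Python to show the shape of the model. This is not Neo4j's API or Cypher: `CategoryGraph` and its methods are hypothetical names, category nodes are plain strings, and documents are represented only by the IDs of their root nodes.

```python
from collections import defaultdict, deque


class CategoryGraph:
    """Toy model of category nodes arranged in a hierarchy with documents attached."""

    def __init__(self):
        self.children = defaultdict(set)   # category -> direct sub-categories
        self.docs = defaultdict(set)       # category -> root node IDs of documents

    def add_subcategory(self, parent, child):
        self.children[parent].add(child)

    def attach(self, category, doc_id):
        self.docs[category].add(doc_id)

    def documents_under(self, category):
        """Breadth-first traversal: documents of the category and of everything below it."""
        seen = {category}
        queue = deque([category])
        found = set()
        while queue:
            current = queue.popleft()
            found |= self.docs.get(current, set())
            for child in self.children.get(current, ()):
                if child not in seen:      # guards against cycles in the category graph
                    seen.add(child)
                    queue.append(child)
        return found
```

In a graph database the traversal itself would be expressed with a traverser or a query. The sketch only shows why attaching documents to category nodes makes "all documents of a class, including its subclasses" a simple walk.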
Licensed under: CC-BY-SA with attribution