Question

We are calculating term weights (tf-idf) for a set of documents. The terms are represented as nodes, each related to the documents it appears in (which are also nodes).

The thing is that I have to fill our Neo4j database with weighted relationships between terms and documents, and that is a lot of data.

We have been working with HTTP REST services. My teammate says he will build a matrix that I can use to populate the graph with the relationships. I think that would be wrong, because a full term-by-document matrix grows as O(N^2).

I think it would be best to use a JSON structure, send it over HTTP, and then insert the relationships one by one.

What is the best way to handle this kind of data structure?


Solution

Please take a look at one of our new features in the latest milestone of Neo4j, Cypher's LOAD CSV clause.

http://docs.neo4j.org/chunked/milestone/import-importing-data-from-a-single-csv-file.html

Generate a CSV file from the document you are analyzing that contains each unique word and its frequency. Push that CSV file to a location that can be accessed by HTTP GET from the Neo4j database server.
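As a rough sketch of that step, the Python below counts each unique word in one document and writes the rows in the column order the query expects. The function names, the tokenizer regex, and the sample text are assumptions made for illustration. It writes raw counts; a tf-idf weight computed elsewhere could go in the same column.

```python
import csv
import io
import re
from collections import Counter


def term_frequency_rows(document_id, text):
    """Return (document_id, word, word_frequency) rows, one per unique word."""
    # Hypothetical tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Sort by word so the generated file is reproducible.
    return [(document_id, word, count) for word, count in sorted(counts.items())]


def write_term_csv(rows, out):
    """Write the header row followed by one line per (document, word) pair."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["document_id", "word", "word_frequency"])
    writer.writerows(rows)


# Illustrative usage with an in-memory buffer; a real run would open a file
# in the directory that the HTTP server exposes to Neo4j.
rows = term_frequency_rows("ABCD-0000001", "The cat saw the other cat")
buffer = io.StringIO()
write_term_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Each document yields one row per distinct word, so the file size follows the number of term-document pairs that actually occur rather than a full matrix.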

The Cypher query to import that file will look like this:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
    "http://localhost:8888/csv/docid-ABCD-0000001.csv"
AS csvLine
MERGE (doc:Document { id: csvLine.document_id })
MERGE (word:Word { word: csvLine.word })
MERGE (doc)-[:HAS_WORD { weight: csvLine.word_frequency }]->(word)

For each line of the CSV file, this query gets or creates the document node and the word node, then connects the two with a HAS_WORD relationship whose weight property holds that word's frequency in the document.

The header row of the CSV file would contain the columns document_id, word, and word_frequency.
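One detail worth checking before importing: LOAD CSV hands every field to the query as a string, so the weight is stored as text unless it is converted. The Python sketch below uses `csv.DictReader` as a stand-in to show the same behavior on a hypothetical sample file. In Cypher, a conversion function such as toFloat would play the equivalent role, assuming it is available in your Neo4j version.

```python
import csv
import io

# Hypothetical sample file contents using the header columns described above.
sample = (
    "document_id,word,word_frequency\n"
    "ABCD-0000001,cat,2\n"
    "ABCD-0000001,the,2\n"
)

# DictReader keys each row by header name, much as csvLine.word does in the query.
csv_lines = list(csv.DictReader(io.StringIO(sample)))

# Every field is read as text, so numeric columns need an explicit conversion.
weights = {line["word"]: float(line["word_frequency"]) for line in csv_lines}
```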

Note: You must download the latest milestone of Neo4j (2.1.0-M01) to use LOAD CSV as of the time I'm posting this. It's not advised to use milestones for production applications.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow