Question

I'm trying to insert a relatively small graph (2M relationships, a few hundred thousand nodes) into Neo4j 2.0.3 from a CSV file. Each line in the file represents one relationship. I'm using the BatchInserter API.

To test my code, I use a subset of the input file. With a 500-relationship subset, the insertion runs fast (a few seconds including JVM startup). With 1,000 relationships, the import takes 20 minutes and the resulting database is 130 GB in size! Even weirder, the result (in both time and space) is exactly the same with 5,000 relationships. 99% of the 20 minutes is spent writing the gigabytes to disk.

I don't understand what is happening here. I've tried configuring the inserter with various settings following the recommendations from the official documentation.

import static java.lang.Long.parseLong;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

import com.google.common.base.Charsets;
import com.google.common.io.Files;
import com.google.common.io.LineProcessor;

Files
  .asCharSource(new File("/path/to/input.csv"), Charsets.UTF_8)
  .readLines(new LineProcessor<Void>() {

    BatchInserter inserter = BatchInserters.inserter(
      "/path/to/db", 
      new HashMap<String, String>() {{
        put("dump_configuration","false");
        put("cache_type","none");
        put("use_memory_mapped_buffers","true");
        put("neostore.nodestore.db.mapped_memory","500M");
        put("neostore.relationshipstore.db.mapped_memory","1G");
        put("neostore.propertystore.db.mapped_memory","500M");
        put("neostore.propertystore.db.strings.mapped_memory","500M");
      }}
    );
    RelationshipType relationshipType = 
      DynamicRelationshipType.withName("relationshipType");
    Set<Long> createdNodes = new HashSet<>();

    @Override public boolean processLine(String line) throws IOException {
        // Fields 1 and 3 of each pipe-delimited line hold the
        // source and target node IDs.
        String[] components = line.split("\\|");
        long sourceId = parseLong(components[1]);
        long targetId = parseLong(components[3]);

        // Create each node the first time its ID is seen, reusing
        // the ID from the input file as the Neo4j node ID.
        if (createdNodes.add(sourceId)) {
            inserter.createNode(sourceId, new HashMap<>());
        }
        if (createdNodes.add(targetId)) {
            inserter.createNode(targetId, new HashMap<>());
        }
        inserter.createRelationship(
            sourceId, targetId, relationshipType, new HashMap<>());

        return true;
    }

    @Override public Void getResult() {
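        // End of input: flush pending writes and close the store.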
        inserter.shutdown();
        return null;
    }

});

Solution

I stumbled upon the solution by messing around with my code.

It turns out that if I call createNode without specifying the node ID, everything works perfectly well.

I was specifying the node ID because the API allowed it, and it was convenient to have the Neo4j node IDs match the IDs from the input file.
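
As a sketch, the fix looks roughly like this (nodeIdsByExternalId and nodeIdFor are illustrative names, not from my actual code):

Map<Long, Long> nodeIdsByExternalId = new HashMap<>();

long nodeIdFor(long externalId) {
    Long nodeId = nodeIdsByExternalId.get(externalId);
    if (nodeId == null) {
        // No explicit ID: the BatchInserter assigns the next free one,
        // so the node store stays densely packed.
        nodeId = inserter.createNode(new HashMap<String, Object>());
        nodeIdsByExternalId.put(externalId, nodeId);
    }
    return nodeId;
}

// ... and the relationship creation in processLine becomes:
inserter.createRelationship(
    nodeIdFor(sourceId), nodeIdFor(targetId),
    relationshipType, new HashMap<String, Object>());

The map replaces the createdNodes set: it both deduplicates nodes and remembers which generated ID corresponds to which ID from the file.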

My guess at the underlying reason: nodes are probably stored in a contiguous, fixed-size record array indexed by node ID. Most IDs in my input file are small (4 digits), but some are 12 digits long. So when I tried to insert one of those, Neo4j had to extend the store file all the way out to that ID just to place the node record at the end, writing gigabytes of mostly empty records to disk. Maybe someone can confirm this. It is surprising that this behavior doesn't seem to be documented for this method in the Neo4j API documentation.
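
A rough back-of-envelope check is consistent with this, assuming the node store uses a fixed-size record of roughly 14 bytes per ID slot in 2.0.x (an assumption on my part, not verified against the store format):

    130 GB ÷ 14 bytes per record ≈ 9.3 × 10^9 record slots

So a single node ID around 10 billion (10 digits) would already account for the observed file size, and a genuinely 12-digit ID would have pushed the store into the terabytes.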
