Question

I'm trying to index the English Wikipedia dump, around 40 GB, but it's not working. I've followed the tutorial at http://wiki.apache.org/solr/DataImportHandler#Configuring_DataSources and related Stack Overflow questions such as Indexing wikipedia with solr and Indexing wikipedia dump with solr.

I was able to import the Simple English Wikipedia (about 150k documents) and the Portuguese Wikipedia (more than 1 million documents) using the configuration explained in the tutorial. The problem happens when I try to index the English Wikipedia (more than 8 million documents). It gives the following error:

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:410)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:539)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
    ... 5 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ParallelPostingsArray.<init>(ParallelPostingsArray.java:34)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:254)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:279)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:307)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:324)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:165)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1520)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:569)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:705)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:504)
    ... 6 more

I'm using a MacBook Pro with 4 GB of RAM and more than 120 GB of free disk space. I've already tried to change the 256 value in solrconfig.xml, but no success so far.

Could anyone help me, please?

Edited

Just in case someone has the same problem: I used the command java -Xmx1g -jar start.jar suggested by Cheffe to solve my problem.


Solution

Your Java VM is running out of memory. Give it more memory, as explained in this SO question: Increase heap size in Java

java -Xmx1024m myprogram
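Once the flag is set, a quick way to confirm the JVM actually picked it up is to print the value of `Runtime.getRuntime().maxMemory()`. This is a minimal sketch; `HeapCheck` is a hypothetical helper class, not part of Solr:

```java
// HeapCheck.java -- prints the maximum heap the JVM will use.
// Compile with `javac HeapCheck.java`, then compare the output of
// `java HeapCheck` against `java -Xmx1g HeapCheck`.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxBytes / (1024 * 1024));
    }
}
```

If the printed value does not change when you add `-Xmx`, the flag is not reaching the JVM (for example because it was placed after `-jar`, where it is treated as a program argument).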

Further detail on the -Xmx parameter can be found in the Java docs; just search for -Xmxsize:

Specifies the maximum size (in bytes) of the memory allocation pool. This value must be a multiple of 1024 and greater than 2 MB. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes. The default value is chosen at runtime based on system configuration. For server deployments, -Xms and -Xmx are often set to the same value. For more information, see Garbage Collector Ergonomics at http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html

The following examples show how to set the maximum allowed size of allocated memory to 80 MB using various units:

  • -Xmx83886080
  • -Xmx81920k
  • -Xmx80m

The -Xmx option is equivalent to -XX:MaxHeapSize.
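The three values above really are the same size, which is easy to verify with a line of arithmetic (a throwaway check; `XmxUnits` is just an illustrative class name):

```java
// XmxUnits.java -- confirms that 83886080 bytes, 81920 KB and 80 MB
// all denote the same heap size.
public class XmxUnits {
    public static void main(String[] args) {
        long bytes = 83886080L;             // -Xmx83886080
        long kilobytes = 81920L * 1024;     // -Xmx81920k
        long megabytes = 80L * 1024 * 1024; // -Xmx80m
        System.out.println(bytes == kilobytes && kilobytes == megabytes); // prints "true"
    }
}
```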

OTHER TIPS

If you have tomcat6, you can increase java heap size in the file

/etc/default/tomcat6

change the -Xmx parameter in the JAVA_OPTS line (e.g. from -Xmx128m to -Xmx256m):

JAVA_OPTS="-Djava.awt.headless=true -Xmx256m -XX:+UseConcMarkSweepGC"

During the import, watch the Admin Dashboard web page, where you can see the actual JVM memory allocated.
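If you prefer the command line to the dashboard, the same numbers are available in-process through the standard java.lang.management API. A minimal sketch (the class name `HeapWatch` is made up for illustration; in practice you would attach a tool like jstat to the Solr process instead):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// HeapWatch.java -- prints a one-line snapshot of heap usage for the
// JVM it runs in: bytes currently used, committed, and the -Xmx ceiling.
public class HeapWatch {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.printf("used=%dMB committed=%dMB max=%dMB%n",
                heap.getUsed() >> 20,
                heap.getCommitted() >> 20,
                heap.getMax() >> 20);
    }
}
```

If "used" climbs toward "max" and stays there during the import, the heap is genuinely too small rather than merely fragmented, and raising -Xmx is the right fix.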

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow