Question

I have an issue while building my Solr index (Lucene & Solr 3.4.0 on an Apache Tomcat 6.0.33).

The data for the documents to index comes from an Oracle database. Since I have to handle a large number of CLOBs, I split the data import across several requestHandlers to speed up fetching from the database (a simulation of multithreading). These requestHandlers are configured in my solrconfig.xml as follows:

<requestHandler name="/segment-#" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">segment-#.xml</str>
    </lst>
</requestHandler>
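For context, a matching segment-#.xml might look like the sketch below. The table, column, and connection details are assumptions (the original configs are not shown); the ClobTransformer, however, is the standard DIH mechanism for reading Oracle CLOB columns as text:

```xml
<dataConfig>
    <dataSource type="JdbcDataSource"
                driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@dbhost:1521:SID"
                user="solr" password="secret"/>
    <document>
        <!-- Hypothetical: each segment selects its own slice of the table,
             e.g. via MOD(id, 17), so the 17 handlers cover disjoint rows -->
        <entity name="doc"
                query="SELECT id, title, body FROM documents WHERE MOD(id, 17) = 0"
                transformer="ClobTransformer">
            <field column="id" name="id"/>
            <field column="title" name="title"/>
            <!-- clob="true" makes ClobTransformer convert the CLOB to a String -->
            <field column="body" name="body" clob="true"/>
        </entity>
    </document>
</dataConfig>
```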

To build the index, I start the first DataImportHandler with the clean=true option and then start the full-import of all the other segments. When all segments are done, the status pages (http://host/solr/segment-#) tell me that each segment fetched and processed the correct number of rows (matching the SELECT COUNT(*) in the database). So far, so good.

But if I now call the status page of the core (http://host/solr/admin/core), numDocs is not the sum of all segments. Some documents are always missing, and the size of the gap varied across several rebuilds. In total there should be 8.3 million documents in the index, but roughly 100,000 entries are always missing. numDocs matches the count returned by a *:* query in the Solr admin interface.

I turned on the infoStream and looked through its entries as well as the Tomcat logs, but found no clue. What am I doing wrong?

I am using 17 requestHandlers and my <indexDefaults> are configured as follows:

<useCompoundFile>false</useCompoundFile>
<mergeFactor>17</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50000</maxBufferedDocs>
<maxFieldLength>2000000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
<lockType>native</lockType>

Help is very appreciated. Thank you very much in advance!


Solution

I found the problem; I just had to RTFM. I tricked myself: the default for the clean option is TRUE, but I thought it was FALSE. So I called only the first URL with &clean=true instead of calling all the other URLs with &clean=false, which meant every import call first wiped the whole index. With &clean=false on all subsequent calls, the total number of documents is correct.
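For reference, the corrected call sequence looks like this (hostname and handler names are placeholders matching the segment-# pattern above):

```shell
# Only the first segment cleans the index; all others must preserve it
curl "http://host/solr/segment-1?command=full-import&clean=true"
curl "http://host/solr/segment-2?command=full-import&clean=false"
# ... repeat with clean=false for the remaining segments ...
curl "http://host/solr/segment-17?command=full-import&clean=false"
```

Omitting the parameter entirely on the later calls is what caused the loss, since full-import defaults to clean=true.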

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow