Question

I am trying to convince Solr to perform a bulk import of an SQLite database. I configured the DataImportHandler to open that database through JDBC successfully, and I can start the import with wget http://localhost:8080/solr/dataimport?command=full-import, but whatever I do, Solr appears to index only the first 499 documents (as reported by wget http://localhost:8080/solr/dataimport?command=status).

The Jetty log file does not report any error message. Instead, it reports that indexing finished normally:

27-Jan-2012 19:08:13 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
27-Jan-2012 19:08:13 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to dataimport.properties
27-Jan-2012 19:08:13 org.apache.solr.handler.dataimport.DocBuilder execute
INFO: Time taken = 0:0:1.145

What could I have done wrong?


Solution

I know that it is not in very good taste to answer one's own question, but I eventually figured out the nasty problem that caused this behavior.

The directive used to configure Solr for a specific data source is this:

<dataSource type="JdbcDataSource" driver="org.sqlite.JDBC" url="jdbc:sqlite:/foo.db"/>

The JdbcDataSource class reads the batchSize attribute of this XML node and defaults it to 500 when it is not specified. So the above was in fact equivalent to:

<dataSource type="JdbcDataSource" ... batchSize="500"/>

Now, JdbcDataSource passes batchSize to the setFetchSize method of the underlying JDBC driver (in this case, the SQLite JDBC driver). This driver interprets that call as a request to limit the number of rows returned, and thus never returns more than 500 rows here. I am not sufficiently familiar with the expected semantics of the JDBC API to tell whether the SQLite driver is wrong in how it interprets this value, or whether Solr's JdbcDataSource class is wrong in how it expects drivers to react to this method call.

What I know, though, is that the fix is to specify batchSize="0", because the SQLite JDBC driver treats a value of zero as meaning "no row limit specified". I added this tip to the corresponding Solr FAQ page.
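For reference, the full working data source definition then looks like this (the driver and URL are just the ones from my setup; adjust the path to your own database file):

<dataSource type="JdbcDataSource" driver="org.sqlite.JDBC" url="jdbc:sqlite:/foo.db" batchSize="0"/>

After restarting Solr and re-running the full-import command, all documents were indexed rather than just the first 499.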

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow