Question

I'm having an issue where Solr won't clear the index during a full import.

All of the servers run Solr 3.4, and the configuration is as vanilla as it can be.

I tried this on our development environment and on an instance on my own computer, and received similar results.

The schema is rather simple; these are the salient points:

<schema name="System" version="1.4">
...
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
    <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0" />
    <fieldType name="documentKey" class="solr.TextField">
      <analyzer type="index"> 
        <tokenizer class="solr.KeywordTokenizerFactory"/> 
      </analyzer> 
      <analyzer type="query"> 
        <tokenizer class="solr.KeywordTokenizerFactory"/> 
      </analyzer> 
    </fieldType>
  </types>
  <fields>
    <field name="document_id" type="documentKey" indexed="true" stored="true" required="true" />
    <field name="entity_id" type="long" indexed="true" stored="true" required="true" />
    <field name="name" type="string" indexed="true" stored="true" required="true" />
    <field name="entity_type" type="string" indexed="true" stored="true" required="false" />
    <field name="Timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
  </fields>
</schema>

Of note:

- The document_id field is calculated in the materialized view that is used to populate the index. It is a combination of other fields not in this index, but it is independent of the entity_id. It's unique.
- The entity_id field is the key of a couple of tables, and for the same document_id it can change wildly from one refresh to the next.

Before a full refresh, if I query the index like this:

http://localhost:8080/qq-solr/system/select/?rows=10&q=document_id:%22French_Polynesia/Huahine~4034376%22

I get:

<?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">5</int>
      <lst name="params">
        <str name="indent">true</str>
        <str name="q">document_id:"French_Polynesia/Huahine~4034376"</str>
        <str name="rows">10</str>
      </lst>
    </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <date name="Timestamp">2012-03-08T09:47:26.335Z</date>
      <str name="document_id">French_Polynesia/Huahine~4034376</str>
      <long name="entity_id">22902728</long>
      <str name="name">Huahine</str>
      <str name="type">LOCATION</str>
    </doc>
  </result>
</response>

Then I refresh:

http://localhost:8080/qq-solr/system/dataimport?command=full-import&clean=true&commit=true&optimize=true

(I know the clean, commit, and optimize parameters are redundant, but I included them just to make sure.) After a while I get the message that everything is a-ok.

Then I query the index again:

http://localhost:8080/qq-solr/system/select/?rows=10&q=document_id:%22French_Polynesia/Huahine~4034376%22

And I get:

<?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">5</int>
      <lst name="params">
        <str name="indent">true</str>
        <str name="q">document_id:"French_Polynesia/Huahine~4034376"</str>
        <str name="rows">10</str>
      </lst>
    </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <date name="Timestamp">2012-03-09T08:31:07.317Z</date>
      <str name="document_id">French_Polynesia/Huahine~4034376</str>
      <long name="entity_id">22902728</long>
      <str name="name">Huahine</str>
      <str name="type">LOCATION</str>
    </doc>
  </result>
</response>

But in the database the entity_id is different!

I see that the Timestamp has been updated, so that record has been touched, but why is the old value being retained?
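To make the comparison concrete, here is a minimal sketch that parses the two responses above and lists the fields that differ. The helper names `doc_fields` and `changed_fields` are hypothetical, and the samples are the two `<doc>` elements copied from the responses with the header omitted.

```python
import xml.etree.ElementTree as ET


def doc_fields(response_xml: str) -> dict:
    """Return {field name: text} for the first <doc> in a Solr XML response."""
    root = ET.fromstring(response_xml)
    doc = root.find("./result/doc")
    if doc is None:
        return {}
    return {child.get("name"): child.text for child in doc}


def changed_fields(before: dict, after: dict) -> dict:
    """Map each field whose value differs to its (before, after) pair."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# The two documents from the responses above (responseHeader left out).
BEFORE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <date name="Timestamp">2012-03-08T09:47:26.335Z</date>
      <str name="document_id">French_Polynesia/Huahine~4034376</str>
      <long name="entity_id">22902728</long>
      <str name="name">Huahine</str>
      <str name="type">LOCATION</str>
    </doc>
  </result>
</response>"""

AFTER = BEFORE.replace("2012-03-08T09:47:26.335Z", "2012-03-09T08:31:07.317Z")

diff = changed_fields(doc_fields(BEFORE), doc_fields(AFTER))
print(diff)
```

Only Timestamp shows up in the diff; entity_id is identical in both responses.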


Solution

I would run your DataImportHandler (DIH) process through the Interactive Development Mode so that you can verify that your database query retrieves the entity_id you expect. The updated timestamp on the Solr document shows that your DIH process is running, so I am guessing the cause lies in the way the data is being retrieved.
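As a rough sketch, the same kind of debug run can also be requested by URL. The helper `dih_debug_url` is hypothetical, and the `debug`, `verbose`, and `rows` parameter names are written from memory of the DIH documentation, so treat them as assumptions and check them against your Solr version. The base URL is the one from the question.

```python
from urllib.parse import urlencode


def dih_debug_url(core_url: str, rows: int = 10) -> str:
    """Build a DIH full-import URL for a small debug run.

    Assumptions: the handler is registered at /dataimport (as in the
    question) and accepts debug/verbose/rows as request parameters.
    """
    params = {
        "command": "full-import",
        "debug": "on",      # assumed: run in debug mode
        "verbose": "on",    # assumed: include per-row query output
        "clean": "false",   # do not wipe the index for a test run
        "commit": "false",  # do not commit the test documents
        "rows": str(rows),  # assumed: limit how many rows are processed
    }
    return f"{core_url}/dataimport?{urlencode(params)}"


url = dih_debug_url("http://localhost:8080/qq-solr/system")
print(url)
```

If the response shows the old entity_id coming back from the database query itself, the problem is upstream of Solr (for example, in the materialized view) rather than in the index.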

OTHER TIPS

Any time I'm doing an operation like this with Solr, I always manually clear the index first using curl to be 100% sure it's wiped. Here is a tutorial: http://www.alphadevx.com/a/365-Clearing-a-Solr-search-index
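For reference, the manual wipe amounts to posting a delete-by-query for `*:*` to the update handler and then a commit. Below is a minimal Python sketch of those two requests, assuming the update handler is at the default `/update` path under the core URL from the question. The helper names `update_request` and `clear_index` are hypothetical.

```python
import urllib.request

# Standard Solr XML update messages: delete every document, then commit.
DELETE_ALL = b"<delete><query>*:*</query></delete>"
COMMIT = b"<commit/>"


def update_request(core_url: str, body: bytes) -> urllib.request.Request:
    """Build a POST of an XML update message to <core_url>/update."""
    return urllib.request.Request(
        url=f"{core_url}/update",
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def clear_index(core_url: str) -> None:
    """Send the delete-all message, then the commit (hits the network)."""
    for body in (DELETE_ALL, COMMIT):
        with urllib.request.urlopen(update_request(core_url, body)) as resp:
            resp.read()


# Usage (not executed here, since it wipes the index):
# clear_index("http://localhost:8080/qq-solr/system")
```

After the wipe and before the re-import, the select query from the question should return numFound="0". If it does not, the delete never reached the core you are querying.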

Licensed under: CC-BY-SA with attribution