Question

I've installed solr 4.6.0 and follow the tutorial available at Solr's home page. Everything was fine, untill I need to do a real job that I'm about to do. I have to get a fast access to wikipedia content and I was advised to use Solr. Well, I was trying to follow the example in the link http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't get the example. I am newbie, and I don't know what means data_config.xml!

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I couldn't find in the Solr home directory. Also, I tried to find some questions related to mine, How to index wikipedia files in .xml format into solr and Indexing wikipedia dump with solr, but they didn't solve my doubt.

I think I need something more basic, guiding me step by step, because the tutorial is confusing when deals with indexing wikipedia.

Any advice to give some directions to folow would be nice.

Was it helpful?

Solution 2

Well, I've read many things on the Web and tried to collected as many information as possible. This is how I could find the solution:

here is my solrconfig.xml:

...
  <!-- ****** Data import handler -->
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
...
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Here is my data-config.xml: (important: it must be in the same folder of solrconfig.xml)

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

Attention: The last line is very important!

My schema.xml:

...
   <field name="id"        type="string"  indexed="true" stored="true" required="true"/>
   <field name="title"     type="string"  indexed="true" stored="false"/>
   <field name="revision"  type="int"    indexed="true" stored="true"/>
   <field name="user"      type="string"  indexed="true" stored="true"/>
   <field name="userId"    type="int"     indexed="true" stored="true"/>
   <field name="text"      type="text_en"    indexed="true" stored="false"/>
   <field name="timestamp" type="date"    indexed="true" stored="true"/>
   <field name="titleText" type="text_en"    indexed="true" stored="true"/>
...
 <uniqueKey>id</uniqueKey>
...
   <copyField source="title" dest="titleText"/>
...

And it's done. That's all folks!

OTHER TIPS

  • For the data_config.xml

    Each Solr instance is configured using three main files:solr.xml,solrconfig.xml,schema.xml,and the data_config.xml file define the data source when you use DIH component,this URL would be usefull for you :DIH.

  • About Solr home directory

    You should start from here:https://cwiki.apache.org/confluence/display/solr/Running+Solr

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top