Question

I've installed Solr 4.6.0 and followed the tutorial available on Solr's home page. Everything was fine until I needed to do the real job I have in mind. I have to get fast access to Wikipedia content, and I was advised to use Solr. I tried to follow the example at http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't make sense of it. I am a newbie, and I don't know what data_config.xml is!

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I couldn't find this file in the Solr home directory. I also looked at some related questions, "How to index wikipedia files in .xml format into solr" and "Indexing wikipedia dump with solr", but they didn't answer my question.

I think I need something more basic that guides me step by step, because the tutorial is confusing when it deals with indexing Wikipedia.

Any advice on which directions to follow would be nice.


Solution 2

Well, I read many things on the Web and tried to collect as much information as possible. This is how I found the solution:

Here is my solrconfig.xml:

...
  <!-- ****** Data import handler -->
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
...
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
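
A quick way to check that the `<lib>` directive above actually picks up the DataImportHandler jar is to resolve the relative path by hand. The sketch below assumes that `dir` is resolved relative to the core's instance directory and that the `regex` must match the whole file name; the function name `find_dih_jars` and the example instance directory are hypothetical, not part of Solr.

```python
import re
from pathlib import Path


def find_dih_jars(instance_dir,
                  lib_dir="../../../dist/",
                  pattern=r"solr-dataimporthandler-.*\.jar"):
    """List jar names that the <lib dir=... regex=...> line would match.

    Assumption: lib_dir is relative to the core's instance directory,
    and the regex has to match the complete file name.
    """
    base = (Path(instance_dir) / lib_dir).resolve()
    if not base.is_dir():
        return []  # the relative path does not point at a directory
    rx = re.compile(pattern)
    return sorted(p.name for p in base.iterdir()
                  if p.is_file() and rx.fullmatch(p.name))


# Example call (the path is an assumption based on a default 4.6.0 layout):
# find_dih_jars("/Applications/solr-4.6.0/example/solr/collection1")
```

If the function returns an empty list, the relative `dir` probably does not point at the `dist` folder of the Solr installation and needs adjusting.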

Here is my data-config.xml (important: it must be in the same folder as solrconfig.xml):

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
        </document>
</dataConfig>

Attention: The last line is very important!
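
To see what this configuration extracts, here is a rough Python approximation of the same logic: one document per `/mediawiki/page`, the same field paths, and redirect pages skipped. It is only an illustration, not how Solr implements it. The sample XML is invented for the example and has no XML namespace (real dumps declare one, which this sketch does not handle), and `extract_docs` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Invented two-page sample in the shape of a MediaWiki dump (no namespace).
SAMPLE = """<mediawiki>
  <page>
    <title>April</title>
    <id>1</id>
    <revision>
      <id>4657771</id>
      <timestamp>2013-12-01T12:34:56Z</timestamp>
      <contributor><username>Example</username><id>42</id></contributor>
      <text>April is the fourth month of the year.</text>
    </revision>
  </page>
  <page>
    <title>Apr</title>
    <id>2</id>
    <revision>
      <id>100</id>
      <timestamp>2013-12-02T01:02:03Z</timestamp>
      <contributor><username>Example</username><id>42</id></contributor>
      <text>#REDIRECT [[April]]</text>
    </revision>
  </page>
</mediawiki>"""

# Field name -> path relative to <page>, mirroring the xpath attributes.
FIELDS = {
    "id": "id",
    "title": "title",
    "revision": "revision/id",
    "user": "revision/contributor/username",
    "userId": "revision/contributor/id",
    "text": "revision/text",
    "timestamp": "revision/timestamp",
}

# Same pattern as the $skipDoc field in data-config.xml.
REDIRECT = re.compile(r"^#REDIRECT .*")


def extract_docs(xml_text):
    """Return one dict per non-redirect page."""
    docs = []
    for page in ET.fromstring(xml_text).findall("page"):
        doc = {name: page.findtext(path) for name, path in FIELDS.items()}
        if REDIRECT.match(doc["text"] or ""):
            continue  # redirect page: skipped, like $skipDoc = true
        docs.append(doc)
    return docs
```

Running `extract_docs(SAMPLE)` keeps the "April" page and drops the "Apr" redirect, which is the effect the `$skipDoc` field is meant to have during the import.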

My schema.xml:

...
   <field name="id"        type="string"  indexed="true" stored="true" required="true"/>
   <field name="title"     type="string"  indexed="true" stored="false"/>
   <field name="revision"  type="int"    indexed="true" stored="true"/>
   <field name="user"      type="string"  indexed="true" stored="true"/>
   <field name="userId"    type="int"     indexed="true" stored="true"/>
   <field name="text"      type="text_en"    indexed="true" stored="false"/>
   <field name="timestamp" type="date"    indexed="true" stored="true"/>
   <field name="titleText" type="text_en"    indexed="true" stored="true"/>
...
 <uniqueKey>id</uniqueKey>
...
   <copyField source="title" dest="titleText"/>
...

And it's done. That's all folks!
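
With the three files in place, the import still has to be started by calling the `/dataimport` handler defined in solrconfig.xml. A minimal sketch of doing that from Python follows. The host, port, and core name in `SOLR_BASE` are assumptions (the defaults of the Solr 4.x example setup) and may differ on your machine; `dataimport_url` and `run` are hypothetical helper names. `full-import` and `status` are standard DataImportHandler commands.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Solr example server on localhost:8983 with core "collection1".
SOLR_BASE = "http://localhost:8983/solr/collection1"


def dataimport_url(command, base=SOLR_BASE, **params):
    """Build the URL for a DataImportHandler command."""
    query = {"command": command, "wt": "json"}
    query.update(params)
    return base.rstrip("/") + "/dataimport?" + urlencode(query)


def run(command, **params):
    """Send the command to Solr and return the decoded JSON response."""
    with urlopen(dataimport_url(command, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Typical use, with Solr running:
#   run("full-import", clean="true", commit="true")  # start indexing
#   run("status")                                    # poll progress
```

The same URLs can be opened directly in a browser; the script only saves retyping them while a large dump is being indexed.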

Other suggestions

  • For the data_config.xml

    Each Solr instance is configured using three main files: solr.xml, solrconfig.xml, and schema.xml. The data_config.xml file defines the data source when you use the DIH (DataImportHandler) component. This URL would be useful for you: DIH.

  • About Solr home directory

    You should start from here: https://cwiki.apache.org/confluence/display/solr/Running+Solr

Licensed under: CC-BY-SA with attribution