How to expose the Solr DataImportHandler dataSource name in the result doc

https://stackoverflow.com/questions/16910456

30-05-2022

Question

I am importing data into Solr 4.3.0 from two different dataSources. This all works fine except that the search results do not indicate the original dataSource for each result document.

Is there a "proper" way to get the dataSource (or entity name) into the result document?

My data-config.xml looks like this (based on example given in http://wiki.apache.org/solr/DataImportHandler#Multiple_DataSources ):

<dataConfig>
    <dataSource name="ds1" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB1" user="SCHEMA1" password="Passw0rd1"/>
    <dataSource name="ds2" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB2" user="SCHEMA2" password="Passw0rd2"/>
    <document>
        <entity name="apples" dataSource="ds1" pk="id" query="select id,name,color from apples" />
        <entity name="bananas" dataSource="ds2" pk="id" query="select id,name,desc from bananas" />
    </document>
</dataConfig>

Sample XML result set from a search looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
        <str name="indent">true</str>
        <str name="q">yellow</str>
        <str name="_">1370321809357</str>
        <str name="wt">xml</str>
    </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
      <str name="id">12</str>
      <str name="name">Golden Delicious</str>
      <str name="color">yellow</str></doc>
  <doc>
      <str name="id">5</str>
      <str name="name">Cavendish group</str>
      <str name="desc">Cavendish group is the common name for the triploid AAA group of Musa acuminata, by far the most popular cultivar by export volume. Cavendish bananas have a yellow skin and pale yellow inside when ripe.</str></doc>
</result>
</response>

Note the reason I want to know the dataSource for a given result is that the result entities have different schemas and thus need to be parsed/handled/rendered differently by the client application. Happy to see other answers that address this root problem in a different way.

Solution

Instead of storing the dataSource name, why not add an entity identifier column to each document?
This identifier would be a fixed-value column, most simply embedded in the query itself.

For example, use a column alias in the SQL: SELECT 'APPLE' AS ENTITY_TYPE

You can use this field to determine what type of parsing is needed for the respective entity.
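
Applied to the data-config.xml from the question, this might look roughly like the sketch below. The Solr field name entity_type and the literal 'BANANA' are illustrative assumptions, not part of the original answer. A matching field (for example a stored string field) would also have to be declared in schema.xml, which the question does not show.

```xml
<dataConfig>
    <dataSource name="ds1" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB1" user="SCHEMA1" password="Passw0rd1"/>
    <dataSource name="ds2" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB2" user="SCHEMA2" password="Passw0rd2"/>
    <document>
        <!-- Each query selects a constant literal, aliased as ENTITY_TYPE -->
        <entity name="apples" dataSource="ds1" pk="id"
                query="select id,name,color,'APPLE' as ENTITY_TYPE from apples">
            <!-- Explicit mapping to a Solr field; entity_type is an assumed field name -->
            <field column="ENTITY_TYPE" name="entity_type"/>
        </entity>
        <entity name="bananas" dataSource="ds2" pk="id"
                query="select id,name,desc,'BANANA' as ENTITY_TYPE from bananas">
            <field column="ENTITY_TYPE" name="entity_type"/>
        </entity>
    </document>
</dataConfig>
```

After a full re-import, each result doc should then carry an entity_type value that the client can branch on, provided the field is stored.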

Licensed under: CC-BY-SA with attribution
