Many hours later... There is a lot of misleading, wrong, or useless information about this problem, and no page seemed to provide everything in one place. All of it is well intentioned, but between differing versions and some of it going over my head, none of it solved the problem. Here is what I learned, along with the solution. To reiterate, I'm using Solr 4.0 (on Tomcat) with Oracle 11g.
Solution overview: DataImportHandler + TikaEntityProcessor + FieldStreamDataSource
Step 1: update your solrconfig.xml so that Solr can find the TikaEntityProcessor, DataImportHandler, and Solr Cell jars.
<lib dir="../contrib/dataimporthandler/lib" regex=".*\.jar" />
<!-- matches both the regular DIH jar and the extras jar (where TikaEntityProcessor lives) -->
<lib dir="../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="apache-solr-cell-\d.*\.jar" />
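The lib entries above only make the jars visible. The DataImportHandler also has to be registered as a request handler in solrconfig.xml and pointed at the data config file. A minimal sketch, assuming the handler path /dataimport and a file named data-config.xml in the core's conf directory (both are common conventions, not something taken from my setup above):

```xml
<!-- Registers the DataImportHandler; the handler path and config file name are assumptions -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Resolved relative to the core's conf directory -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, a full import is typically started by requesting /dataimport?command=full-import on the core.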
Step 2: modify your data-config.xml to include your BLOB table. This is where I had the most trouble, since the solutions to this problem have changed a lot across versions. Plus, using multiple data sources and plugging them together correctly was not intuitive to me. It's very sleek once it's done, though. Make sure to replace the IP, SID name, username, password, table names, etc. with your own.
<dataConfig>
  <dataSource name="dastream" type="FieldStreamDataSource" />
  <dataSource name="db" type="JdbcDataSource"
              driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@192.1.1.1:1521:sid"
              user="username"
              password="password"/>
  <document>
    <entity
        name="attachments"
        query="select * from schema.attachment_table"
        dataSource="db">
      <entity
          name="attachment"
          dataSource="dastream"
          processor="TikaEntityProcessor"
          url="blob_column"
          dataField="attachments.BLOB_COLUMN"
          format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
    <entity name="unrelated" query="select * from another_table" dataSource="db">
    </entity>
  </document>
</dataConfig>
Important note: if you're getting "No field available for name : whatever" errors when you attempt to import, the FieldStreamDataSource is not able to resolve the data field name you gave. In my case, the url attribute had to contain the lower-case column name, and the dataField attribute had to be outside_entity_name.UPPERCASE_BLOB_COLUMN. A misspelled column name causes the same error, which happened to me once.
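To make the naming rule concrete, here is a sketch with hypothetical names: an outer entity called docs, a table my_schema.documents, and a BLOB column FILE_DATA. None of these come from the config above. It follows the pattern that worked for me on Solr 4.0 with Oracle 11g; other versions or drivers may behave differently. Oracle reports unquoted column names in upper case, which is probably why dataField needs the upper-case form.

```xml
<!-- Hypothetical names: docs, my_schema.documents, FILE_DATA -->
<entity name="docs" query="select * from my_schema.documents" dataSource="db">
  <!-- url: lower-case column name; dataField: outer entity name + "." + upper-case column name -->
  <entity name="doc"
          dataSource="dastream"
          processor="TikaEntityProcessor"
          url="file_data"
          dataField="docs.FILE_DATA"
          format="text">
    <field column="text" name="body" />
  </entity>
</entity>
```

Note that the prefix in dataField is the name attribute of the enclosing entity (docs here), not the table name.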
Step 3: modify your schema.xml to add the BLOB-column field (and any other columns you need to index or store). Adjust according to your needs.
<field name="body" type="text_en" indexed="false" stored="false" />
<field name="attach_desc" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true" />
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true" />
<copyField source="body" dest="text" />
<copyField source="body" dest="content" />
With that, you should be well on your way (and many hours ahead of where I was) to getting binary, rich-text documents (aka rich documents) that are stored as BLOBs in a database column indexed with Solr.