Tika - url / file path issue

https://stackoverflow.com//questions/24036517

21-12-2019

Question

I am working with Solr and its DataImportHandler to index rich documents such as PDF, Word, and image files. I am using TikaEntityProcessor to extract the content from the files.

I have a small issue with setting the value of the 'url' attribute.

My data-config.xml file is like so:

<dataConfig>
<dataSource name="db_ds" type="JdbcDataSource"
driver="oracle.jdbc.OracleDriver"
url="jdbc:oracle:thin:@KOR308051.bmh.apac.bosch.com:1521:xe"
user="ezbdb"
password="ezbdb"/>

<dataSource name="tk_ds" type="BinFileDataSource" />

 <script>  
    <![CDATA[ 

    function getFilePath(row) { 
        var link = row.get('url_link');
        if (link === null || true === link.isEmpty() || link === '') {
                            row.remove('url_link');
                    } else {
            var path_arr = link.split("#");
            var file_path = path_arr[0];            
            row.put(file_path);
        }
         return row;
     }  
    ]]>  
</script>

<document name="db_doc">
    <entity name="db_link"
            query="SELECT 
                    d.doc_url as Link,
                    d.doc_name as Name,
                    cast(trunc(d.last_modified) as date) as Last_modified
                    FROM doc_data d"
            dataSource="db_ds" transformer="DateFormatTransformer,script:getFilePath">
                    <field column="LINK" name="link"/>
        <field column="NAME" name="name"/>
        <field column="LAST_MODIFIED" name="last_modified" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>

            <entity name="tika-doc" dataSource="tk_ds" processor="TikaEntityProcessor"
                          url="${db_link.LINK}" format="text" onError="skip">
                         <field column="text" name="content"/>
           </entity>

           </entity>
</document>
</dataConfig>

The problem is that the file path is stored in an unusual pattern in the database. "doc_url" is the database column that stores the URL or file path, and its value looks like this: D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc# That is, the path is stored twice, with each copy followed by a '#'. I am not sure why; it was set up this way by our client.

All I need is a single file path, i.e. D:\Games\CS2\setup.doc. I am passing the value to Tika as url="${db_link.LINK}", but ${db_link.LINK} contains the path exactly as it comes from the database. I have tried a script transformer, the getFilePath(row) function, which splits the string on '#' and takes the first part, but with no luck.

I am still getting the path as stored in the database. This causes a FileNotFound exception during indexing, which is expected because the path is invalid.

What can be done to get only the path, leaving out the '#' and everything after it?

Help would be much appreciated :)

Solution

You can use Solr's RegexTransformer:

http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

Add the RegexTransformer to your transformer attribute:

    <entity name="db_link"
                    query="SELECT ..." ... transformer="... ,org.apache.solr.handler.dataimport.RegexTransformer"...>

Modify the field tag for the 'link' column:

<field column="link" regex="^([^#]+)#" sourceColName="LINK"/>

That should be all.
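
To see what the pattern captures, here is a minimal sketch in JavaScript, the language of the script block in the question. It applies the same expression to the sample value from the question. The helper name firstPath is hypothetical and exists only for this illustration. JavaScript and Java regular expressions behave the same way for this simple pattern, but the snippet does not run inside Solr.

```javascript
// Same pattern as the regex attribute on the <field> tag.
const pattern = /^([^#]+)#/;

// Sample value in the format described in the question.
const stored = 'D:\\Games\\CS2\\setup.doc#D:\\Games\\CS2\\setup.doc#';

// firstPath is a hypothetical helper, not part of Solr.
// It returns the text before the first '#', or null when there is no match.
function firstPath(value) {
  const match = pattern.exec(value);
  return match ? match[1] : null;
}

console.log(firstPath(stored)); // prints D:\Games\CS2\setup.doc
```

A value that contains no '#' at all does not match this pattern, so such rows would need separate handling. Also note that the field tag above writes the extracted value to a column named link, so the nested Tika entity would presumably have to reference that column rather than LINK. This is an assumption worth verifying against your Solr version.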

EDIT: regex corrected.
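
As an alternative, the script transformer from the question might also work once two details are changed. The function reads 'url_link', while the query aliases the column as Link (which Oracle reports as LINK), and row.put is called without a key. The sketch below assumes the row behaves like a map with get, put, and remove. FakeRow is a hypothetical stand-in so the function can be run outside Solr, and this has not been tested inside DataImportHandler.

```javascript
// Corrected transformer: reads and rewrites the LINK column.
function getFilePath(row) {
  var link = row.get('LINK');
  if (link === null || link === undefined || String(link).length === 0) {
    // Nothing usable in the column, so drop it from the row.
    row.remove('LINK');
  } else {
    // Keep only the text before the first '#'.
    var filePath = String(link).split('#')[0];
    row.put('LINK', filePath);
  }
  return row;
}

// Hypothetical stand-in for the map-like row object passed to the
// transformer; it only exists to make this example runnable on its own.
function FakeRow(initial) {
  this.data = Object.assign({}, initial);
}
FakeRow.prototype.get = function (key) {
  return Object.prototype.hasOwnProperty.call(this.data, key) ? this.data[key] : null;
};
FakeRow.prototype.put = function (key, value) {
  this.data[key] = value;
};
FakeRow.prototype.remove = function (key) {
  delete this.data[key];
};

var row = getFilePath(new FakeRow({
  LINK: 'D:\\Games\\CS2\\setup.doc#D:\\Games\\CS2\\setup.doc#'
}));
console.log(row.get('LINK')); // prints D:\Games\CS2\setup.doc
```

Because this version overwrites LINK in place, the existing url="${db_link.LINK}" reference in the Tika entity would not need to change.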

Licensed under: CC-BY-SA with attribution
