Solr PDF extraction works but nothing is indexed

https://stackoverflow.com/questions/15701387

30-03-2022

Question

I am working with Solr to extract PDF files and index them. I am now able to extract the content with the following code:

private static void IndexPDFFile(ISolrOperations<Article> solr)
{
    string filecontent = null;

    using (var file = File.OpenRead(@"C:\cookbook.pdf"))
    {
        var response = solr.Extract(new ExtractParameters(file, "abcd1")
        {
            ExtractOnly = true,
            ExtractFormat = ExtractFormat.Text,
        });

        filecontent = response.Content;
    }
    solr.Commit();
}

But when I query Solr with the following URLs in the browser, nothing is returned:

http://berserkerpc:444/solr/select/?q=text:solr

http://berserkerpc:444/solr/select/?q=author:admin

The content of the PDF file is "This is a Solr cookbook...", and the author field should contain something with "admin".

Here is the output:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params"><str name="q">text:Solr</str></lst>
      </lst>
      <result name="response" numFound="0" start="0"/>
    </response>

Any suggestions for this issue?

Thanks, tro

Solution

This is because you have set ExtractOnly = true in your ExtractParameters. Here is the comment for the ExtractOnly property from the SolrNet source code:

    /// <summary>
    /// If true, return the extracted content from Tika without indexing the document. 
    /// This literally includes the extracted XHTML as a string in the response. 
    /// </summary>
    public bool ExtractOnly { get; set; }

If you want to index the extracted content, do not set this parameter to true.
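
A minimal sketch of the corrected call, assuming the same SolrNet types used in the question (ISolrOperations, ExtractParameters). The PdfIndexer class, the generic IndexPdfFile method, and its path and id parameters are hypothetical names introduced for illustration only:

```csharp
using System.IO;
using SolrNet;

public static class PdfIndexer
{
    // Hypothetical helper: sends a file to Solr's extracting handler so it gets indexed.
    public static void IndexPdfFile<T>(ISolrOperations<T> solr, string path, string id)
    {
        using (var file = File.OpenRead(path))
        {
            // ExtractOnly is not set, so it keeps its default value (false) and
            // Solr indexes the extracted content instead of only returning it.
            solr.Extract(new ExtractParameters(file, id));
        }

        // Commit so the newly indexed document becomes visible to searches.
        solr.Commit();
    }
}
```

With the values from the question, this would be called as PdfIndexer.IndexPdfFile(solr, @"C:\cookbook.pdf", "abcd1"). Whether queries on text or author then match depends on how the Solr schema maps the extracted content and metadata to those fields.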

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow