
I am working with Solr to extract PDF files and index them. So far I am able to extract the content with the following code:

    private static void IndexPDFFile(ISolrOperations<Article> solr)
    {
        string filecontent = null;

        using (var file = File.OpenRead(@"C:\cookbook.pdf"))
        {
            var response = solr.Extract(new ExtractParameters(file, "abcd1")
            {
                ExtractOnly = true,
                ExtractFormat = ExtractFormat.Text,
            });

            // Text extracted from the PDF
            filecontent = response.Content;
        }
    }

But when I query Solr with the following command in the browser, nothing is returned:




The content of the PDF file is: This is a Solr cookbook... The author field should contain something with admin.

Here is the output:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params"><str name="q">text:Solr</str></lst>
      </lst>
      <result name="response" numFound="0" start="0"/>
    </response>

Any suggestions for this issue?

Thanks, tro



This is because you have set ExtractOnly = true in your ExtractParameters, so the content is extracted and returned but never indexed. Here is the comment for the ExtractOnly parameter from the SolrNet source code:

    /// <summary>
    /// If true, return the extracted content from Tika without indexing the document. 
    /// This literally includes the extracted XHTML as a string in the response. 
    /// </summary>
    public bool ExtractOnly { get; set; }

If you want to index the extracted content, do not set this parameter to true.
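
As a rough sketch of what that could look like, the method below sends the same file with ExtractOnly set to false and then commits. The class name `PdfIndexer`, the generic type parameter (in the question it would be `Article`), and the explicit `Commit()` call are assumptions for illustration, not part of the original code. Whether a query such as `text:Solr` then matches also depends on which field your schema and extract handler configuration store the extracted content in.

```csharp
using System.IO;
using SolrNet;

public static class PdfIndexer
{
    // T stands in for the document class from the question (Article).
    public static void IndexPdfFile<T>(ISolrOperations<T> solr)
    {
        using (var file = File.OpenRead(@"C:\cookbook.pdf"))
        {
            // ExtractOnly = false asks Solr to index the extracted content
            // instead of only returning it in the response.
            solr.Extract(new ExtractParameters(file, "abcd1")
            {
                ExtractOnly = false
            });
        }

        // Make the newly indexed document visible to searches.
        solr.Commit();
    }
}
```

After the commit, rerunning the browser query should show whether the document was indexed; if numFound is still 0, the field mapping is the next thing to check.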

Licensed under: CC-BY-SA with attribution