Index PDF documents in Solr from a C# client

https://stackoverflow.com/questions/8935060

30-10-2019

Question

Basically, I'm trying to index Word or PDF documents in Solr. I found the ExtractingRequestHandler, but I can't figure out how to write C# code that performs the HTTP POST request shown in the Solr wiki: http://wiki.apache.org/solr/ExtractingRequestHandler.

I've installed Solr 3.4 on Tomcat 7 (7.0.22) using the files from the example/solr directory in the Solr zip, and I haven't altered anything. The ExtractingRequestHandler should be configured out of the box in solrconfig.xml and ready to use, right?

Can someone give a C# (HttpWebRequest) example of how to make the HTTP POST request and upload a PDF file, as is done with curl in the Solr wiki?

I've looked all over this site and many others for an example or a tutorial on how this is done, but haven't found anything.

EDIT:

I finally managed to get it to work using SolrNet!

For it to work, you need to copy the following from the Solr zip into a lib folder in your Solr installation directory:

The apache-solr-cell-3.4.0.jar file from the dist folder
The contents of the contrib\extraction\lib directory

With SolrNet 0.4.0 beta 2, this code does the job:

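using System.IO;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// Register the Solr URL once at application startup.
Startup.Init<IndexDocument>("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<IndexDocument>>();

// Stream the file to the ExtractingRequestHandler; "doc1" is the document's unique id.
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
    var response =
        solr.Extract(
            new ExtractParameters(fileStream, "doc1")
            {
                ExtractFormat = ExtractFormat.Text,
                // false = index the extracted content instead of only returning it
                ExtractOnly = false
            });
}

// Commit so the newly indexed document becomes searchable.
solr.Commit();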

Sorry for the trouble. I hope, however, that others will find this useful.
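
For anyone who wants the plain HttpWebRequest approach asked about in the question rather than SolrNet, here is a minimal sketch. It sends the file as the raw request body, which is the equivalent of the wiki's curl --data-binary variant. The class and method names (SolrExtractClient, BuildExtractUrl, PostFile) are made up for illustration. It also assumes the handler is mapped to /update/extract as in the example solrconfig.xml, and that the unique key field is called id. I have not verified it against every Solr version, so treat it as a starting point.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper for illustration; not part of Solr or SolrNet.
public static class SolrExtractClient
{
    // Builds the ExtractingRequestHandler URL.
    // literal.id supplies the value of the unique key field (assumed to be "id").
    public static string BuildExtractUrl(string solrBaseUrl, string documentId, bool commit)
    {
        return solrBaseUrl.TrimEnd('/')
            + "/update/extract?literal.id=" + Uri.EscapeDataString(documentId)
            + "&commit=" + (commit ? "true" : "false");
    }

    // POSTs the file as the raw request body and returns Solr's response text.
    // GetResponse() throws a WebException if Solr answers with an error status.
    public static string PostFile(string solrBaseUrl, string filePath,
                                  string documentId, string contentType)
    {
        var request = (HttpWebRequest)WebRequest.Create(
            BuildExtractUrl(solrBaseUrl, documentId, true));
        request.Method = "POST";
        request.ContentType = contentType; // e.g. "application/pdf"

        using (FileStream file = File.OpenRead(filePath))
        {
            request.ContentLength = file.Length;
            using (Stream body = request.GetRequestStream())
            {
                file.CopyTo(body); // Stream.CopyTo requires .NET 4.0 or later
            }
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Usage would look something like SolrExtractClient.PostFile("http://localhost:8080/solr", "FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED", "doc1", "application/pdf"), where the base URL is only a guess at a default Tomcat setup. Because commit=true is passed in the query string, no separate commit request is needed.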

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow