Question

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.

So what I need is the page number and a short text snippet of every search result.

I'm using Solr 4.1 to index the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.

I found "Indexing PDF with page numbers with Solr", but it wasn't really helpful.

Solution

I'm now splitting the PDF and sending each page separately to Solr. Every page is its own document with an id of the form <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document> and is used for grouping the results.
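A minimal sketch of that per-page scheme, assuming the text of each page has already been extracted into a list of strings. The field names `text` and `page` and both function names are hypothetical; only `id` and `doc_id` come from the answer above. The query uses Solr's standard result grouping (`group`, `group.field`) and highlighting (`hl`, `hl.fl`, `hl.fragsize`) parameters to get one group per PDF plus a snippet per matching page.

```python
from urllib.parse import urlencode


def build_page_docs(doc_id, page_texts):
    """Return one Solr document (a plain dict) per PDF page."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,                 # shared by all pages, used for grouping
            "page": page_number,              # assumed extra field holding the page number
            "text": text,                     # assumed name of the searchable text field
        })
    return docs


def build_search_query(term):
    """Build a query string that groups hits by PDF and asks for snippets.

    The term is not escaped for Solr query syntax here; real input would
    need escaping before being placed in `q`.
    """
    params = {
        "q": f"text:{term}",
        "group": "true",
        "group.field": "doc_id",
        "hl": "true",
        "hl.fl": "text",
        "hl.fragsize": 150,
        "wt": "json",
    }
    return urlencode(params)


docs = build_page_docs("report42", ["first page text", "second page text"])
query = build_search_query("invoice")
```

The dicts in `docs` can be serialized as a JSON array and posted to the core's update handler, and `query` appended to its select handler. The page number of a hit is then available either from the `page` field or from the suffix of the `id`.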

Other tips

There is a JIRA issue, SOLR-380, with a patch that you can take a look at.

I also tried to get the results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs in a directory and send the files to the Solr server.

I have not tried this myself, but here is an approach:

  1. Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs
  2. Create multiple attributes in Solr like page1, page2, page3…, pageN; alternatively, use dynamic attributes in Solr
  3. In the custom connector, read the PDFs page by page and index each page into its respective page attribute/dynamic attribute
  4. Enable search on all the “page” attributes
  5. When the user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits
  6. The “page” attributes that have a hit (found from the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase
  7. Link to the PDF with the “#PageNumber” of the page and pop up that page on click

I think this is a far better approach than splitting the PDFs and indexing the pages as separate Solr docs.
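The steps above can be sketched roughly as follows. This is an untested-against-Solr illustration under several assumptions: the schema declares a dynamic field matching `page_*` (a hypothetical pattern), highlighting is requested for those fields (for example with `hl.fl=page_*`), and the viewer accepts a `#page=N` URL fragment. All function names are made up for the example. The input to `pages_with_hits` mirrors the per-document entry of Solr's `highlighting` response section, which maps field names to lists of snippets.

```python
import re

# Matches the hypothetical per-page field names page_1, page_2, ...
PAGE_FIELD = re.compile(r"^page_(\d+)$")


def build_document(doc_id, page_texts):
    """Build ONE Solr document per PDF, with one field per page."""
    doc = {"id": doc_id}
    for page_number, text in enumerate(page_texts, start=1):
        doc[f"page_{page_number}"] = text
    return doc


def pages_with_hits(highlighting_for_doc):
    """Return sorted (page_number, first_snippet) pairs for page fields with hits.

    Fields that are not page fields, or that have no snippets, are skipped.
    """
    hits = []
    for field, snippets in highlighting_for_doc.items():
        match = PAGE_FIELD.match(field)
        if match and snippets:
            hits.append((int(match.group(1)), snippets[0]))
    return sorted(hits)


def page_link(pdf_url, page_number):
    """Build a link that asks the viewer to open the PDF at a given page."""
    return f"{pdf_url}#page={page_number}"


doc = build_document("report42", ["intro", "the invoice total", "appendix"])
sample_highlighting = {
    "page_3": ["see the <em>invoice</em> in the appendix"],
    "page_2": ["the <em>invoice</em> total"],
    "title": ["<em>invoice</em> report"],
    "page_1": [],
}
hits = pages_with_hits(sample_highlighting)
links = [page_link("/docs/report42.pdf", page) for page, _ in hits]
```

One trade-off worth noting: with this layout the number of fields per document grows with the page count, whereas the one-document-per-page layout keeps the schema fixed and relies on grouping instead.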

If you find a flaw in this design, respond to my thread. I will attempt to resolve it.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow