Question

I've searched far and wide. Perhaps I don't know what to search for...

I need to be able to index and search "secured" PDFs. These PDFs have the "No Copy" attribute set and are locked down, meaning there is no way to copy their content without the username and password. IFilter respects these settings and won't allow the PDFs to be indexed.

I'm looking for a way to index and search these PDFs on my server using ASP.NET. It appears that I'm stuck with one of the following:

  1. I would need the credentials to open these PDFs and get "copy" access to the content
  2. When a PDF is submitted to my tool, two items would need to be submitted: the Word copy and the PDF copy
  3. Have the full content, or at least some keywords, copied into the metadata of the PDF. I have not looked into what kinds of risks could be involved here. This would mean an extra step for the writers

Solutions 1 and 2 would mean maintaining a duplicate copy, either on the server or in a DB, and referring to the actual PDF for download programmatically. Has anyone come up with a solution for this? I would prefer the indexing approach, as it means no duplication of content. Solution 3 is appealing if the PDF's metadata can hold that much content and if security stays intact. I've also wondered about programmatic access to the PDF where, via C# or VB, I can use credentials to gain access... but it looks like I may be stuck.

This is my last-ditch effort to find another solution. Any help would be appreciated.


Solution 2

I ended up going with a completely different solution. I loved the idea of using MS's indexing, but it is proving much easier to use SQL and have the user who uploads the PDF paste keywords, or the content of the PDF, into a text box. SQL can then index that column and bam, a search engine does the rest.
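As a rough illustration of that idea (not the code actually used), here is a minimal sketch in Python, with SQLite's FTS5 full-text table standing in for a full-text indexed SQL column. The table name `pdf_index`, the columns `file_name` and `keywords`, the helper functions and the sample rows are all hypothetical, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory full-text table from (file_name, keywords) pairs."""
    conn = sqlite3.connect(":memory:")
    # Assumption: SQLite FTS5 stands in for the SQL full-text index.
    # file_name is stored but not searched; keywords is the searchable column
    # holding whatever the uploader pasted into the text box.
    conn.execute(
        "CREATE VIRTUAL TABLE pdf_index USING fts5(file_name UNINDEXED, keywords)"
    )
    conn.executemany(
        "INSERT INTO pdf_index (file_name, keywords) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def search(conn, term):
    """Return the file names whose pasted keywords match the search term."""
    # Quote the term so user input is treated as a phrase, not as query syntax.
    phrase = '"' + term.replace('"', '""') + '"'
    cur = conn.execute(
        "SELECT file_name FROM pdf_index WHERE pdf_index MATCH ? ORDER BY file_name",
        (phrase,),
    )
    return [row[0] for row in cur]


# Hypothetical sample data: the secured PDFs themselves are never opened.
conn = build_index([
    ("handbook.pdf", "employee handbook vacation policy"),
    ("pricing.pdf", "price list 2011 wholesale"),
])
print(search(conn, "vacation"))  # ['handbook.pdf']
```

The point of the design is that the locked PDF is only ever served for download; the search runs entirely against the text the uploader supplied.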

Thanks everyone for taking the time to consider this one.

OTHER TIPS

If you have the user names and passwords for the files, then maybe you could just open the files and extract the text from them?

You would then be able to build an index from the extracted data.

Docotic.Pdf, the library I am involved with, can open password-protected files for you. And it can extract text, too. Text can be extracted as plain or formatted text and can be split by words or chars.

Please have a look at the following samples:

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow