Question

I have thousands of CVs and I want to search for the ones that have 'computer science' as their background.

So I googled and learned that Lucene can do this job: I feed the data to Lucene and it indexes all the documents.

When I search for a particular text (say 'computer science'), it returns the CVs matching the query.

For this, I need to convert MS Word 93, MS Word 2007, and PDF documents to text and feed that text to Lucene.

I can get text out of MS Word 2007 documents, but I am unable to get it out of MS Word 2003 documents.

There are many PDF writers, but I couldn't find any PDF reader library that can do this.

Please shed some light on a PDF reader library and on converting MS Word 93 documents to text, or let me know if there are any alternatives to Lucene search.

Thanks, many thanks for your answers.

Solution

You can use Apache Solr, or Apache Tika directly, to extract text from PDF and MS Word documents and index it. Both are Java projects, but each can run as a server that you call from PHP.
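
To give a rough idea of what calling the Tika server looks like, here is a minimal sketch in Python (the same HTTP call can be made from PHP or any other language). It assumes a Tika server is already running locally on its default port 9998 and that the plain-text response is UTF-8; the helper names `build_tika_request` and `extract_text` are made up for this example. Indexing the returned text in Lucene or Solr is a separate step that is not shown.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, server_url=TIKA_URL):
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        server_url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, server_url=TIKA_URL):
    """Send one file (.doc, .docx, .pdf, ...) to Tika and return its text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), server_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the text/plain response body is UTF-8 encoded.
        return response.read().decode("utf-8")


# Example usage (hypothetical file name, needs the server to be running):
# text = extract_text("resume.doc")
```

Tika detects the file type itself, so the same call should work for old .doc files, .docx files, and PDFs, which covers the conversion step you are stuck on.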

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow