Question

Could someone tell me where to start in developing a simple full-text search engine for local files?

I have a Debian 7 server with LAMP, and I have mounted a Windows network drive on it. So far I am using this script to show other users on the local network a directory tree from which they can download files on the mounted drive.

Now I have to build a simple search engine that indexes the names and the content (if any) of the files in the mounted folder: Microsoft doc, docx, xls, xlsx, rtf, and txt. The search should return the file name, the path and, ideally, an excerpt of the text where the search words appear (if the file contains text).

Could someone point me in the right direction on what I need to read and learn to do this? Thanks.

Solution

You need a couple of tools for this. First, you need something to index and search content, and the three tools you've tagged the question with are all good choices for this task. Each of them has plenty of tutorials and examples to help you get started.

The other thing you will need is a way to read the content of all those different file types. I'd recommend Apache Tika. It's an excellent toolkit for this: it reads all the formats you've listed and works well with Lucene.

You can see an example of their use together in this question: "Tika in Action book examples Lucene StandardAnalyzer does not work".
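To make the two pieces concrete (extracting text, then indexing and searching it), here is a minimal sketch in plain Python. It is not Lucene or Tika: `extract_text` is a hypothetical placeholder that only reads `.txt` files, and in a real setup that step is where a toolkit such as Tika would handle doc, docx, xls and so on. The function names and the snippet width are assumptions made for illustration.

```python
import os
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def extract_text(path):
    """Placeholder extractor (assumption): reads plain-text files only.

    Other formats return an empty string here; a real setup would hand
    them to a content-extraction toolkit instead.
    """
    if not path.lower().endswith(".txt"):
        return ""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def build_index(root):
    """Walk `root` and build an inverted index: term -> set of file paths."""
    index = defaultdict(set)
    texts = {}  # path -> extracted text, kept so snippets can be shown
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            text = extract_text(path)
            texts[path] = text
            # Index the file name as well as the content.
            for term in TOKEN.findall((name + " " + text).lower()):
                index[term].add(path)
    return index, texts


def search(index, texts, query, width=40):
    """Return (file name, path, snippet) for files matching every query term."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return []
    paths = set.intersection(*(index.get(t, set()) for t in terms))
    results = []
    for path in sorted(paths):
        text = texts[path]
        pos = text.lower().find(terms[0])
        snippet = ""
        if pos >= 0:
            # Show some context around the first query term.
            snippet = text[max(0, pos - width): pos + len(terms[0]) + width]
        results.append((os.path.basename(path), path, snippet))
    return results
```

Calling `build_index("/mnt/share")` (a hypothetical mount point) followed by `search(index, texts, "budget report")` would return the name, path and a short excerpt for each matching file. A real engine adds ranking, stemming and incremental updates on top of this, which is the main reason to use one of the existing tools rather than writing your own.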

Other suggestions

You may or may not find this helpful.

I have Nutch set up to crawl my local filesystem and store the indexed documents in Solr, and I have guides on how I set them up that way.

This would provide a solid backend for your application.

Here are the links. The first two cover Solr setup; the last two cover Nutch integration.

http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html

http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
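Once the files are in Solr, the search page only has to query it over HTTP. The sketch below builds a request for Solr's `select` handler with highlighting turned on, so that matching text excerpts come back with each hit, and then flattens the JSON response. The base URL, the core name `files`, and the field names `content` and `id` are assumptions; they depend on your own schema and on how Solr is deployed.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, field="content", rows=10):
    """Build a Solr select URL that asks for highlighted snippets."""
    params = {
        "q": query,
        "df": field,      # default field to search in (assumed name)
        "rows": rows,
        "wt": "json",
        "hl": "true",     # enable highlighting
        "hl.fl": field,   # field to take snippets from
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def summarize(response, field="content", id_field="id"):
    """Pair each returned document id with its highlighted snippets.

    `response` is the parsed JSON body; the highlighting section is keyed
    by the document's unique key, then by field name.
    """
    highlights = response.get("highlighting", {})
    results = []
    for doc in response.get("response", {}).get("docs", []):
        doc_id = doc.get(id_field)
        snippets = highlights.get(doc_id, {}).get(field, [])
        results.append((doc_id, snippets))
    return results
```

The URL can be fetched with `urllib.request.urlopen` and decoded with `json.load`. With a filesystem crawl, the document id is typically the file URL, which gives you the path to show next to each excerpt.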

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow