Question

I want to implement search functionality for a website (assume it is similar to Stack Overflow). I don't want to use Google search or anything like that.

My question is:

How do I implement this?

There are two methods I am aware of:

  1. Search the application's database directly each time the user submits a query.
  2. Index all the data I have, store the index somewhere else, and query from there (like what Google does).
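To make option 2 concrete, here is a minimal, self-contained sketch of the core data structure behind it: an inverted index mapping each term to the set of documents containing it. The names here are illustrative, not from any particular library.

```java
import java.util.*;

// Minimal sketch of option 2: build an inverted index (term -> document ids)
// once, then answer queries from the index instead of scanning every row.
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND query: return the ids of documents containing every query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "How to implement site search");
        idx.add(2, "Database query performance");
        idx.add(3, "Implement full-text search with an index");
        System.out.println(idx.search("implement search")); // prints [1, 3]
    }
}
```

Real search engines add ranking, stemming, and phrase queries on top of this structure, which is why a library is usually a better choice than rolling your own.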

Can anyone tell me which way to go? What are the pros and cons?

Better, are there any better ways to do this?

Solution

Use Lucene:
http://lucene.apache.org/java/docs/

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

It is available in Java and .NET. It is also available in PHP in the form of a Zend Framework module.

Lucene does what you want (indexing of the searched items). You have to maintain a Lucene index, but it is much better than doing a database search in terms of performance. BTW, SO search is powered by Lucene. :D
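As a rough sketch of what using Lucene looks like, assuming a recent Lucene release on the classpath (class names and signatures vary between versions, so treat this as illustrative rather than definitive), indexing one document and searching it might be:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory; use FSDirectory for disk

        // Index one document with stored "title" and "body" fields.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "How do I implement site search?", Field.Store.YES));
            doc.add(new TextField("body", "Use an inverted index instead of SQL LIKE scans.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Parse a free-text query against the "body" field and search.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("inverted index");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```

In a real site you would rebuild or incrementally update the index as your database changes, which is the bookkeeping the answer above alludes to.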

OTHER TIPS

It depends on how comprehensive your web site is and how much you want to do yourself.

If you are running a small website without further possibilities to add a custom search, let Google do the work (maybe add a sitemap) and use Google Custom Search.

If you run a medium-sized site with an SQL engine, use the full-text search features of your SQL engine.

If you run a heavier software stack like Java EE or .NET, use Lucene, a great, powerful search engine, or its .NET port, Lucene.Net.

If you want to abstract your search from your application and be able to query it in a language-neutral way with XML/HTTP and JSON APIs, have a look at Solr. Solr runs Lucene in the background, but adds a nice web interface to it.
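Because Solr is queried over plain HTTP, a client in any language only needs to build a request URL. A small sketch, assuming a local Solr instance and a hypothetical core named `mycore` (both names are illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Build a Solr /select query URL by hand to show the language-neutral API.
// The host, port, and core name are assumptions for this sketch.
public class SolrQueryUrl {
    public static void main(String[] args) {
        String base = "http://localhost:8983/solr/mycore/select";
        String q = URLEncoder.encode("title:search AND body:index", StandardCharsets.UTF_8);
        String url = base + "?q=" + q + "&wt=json&rows=10";
        System.out.println(url);
        // An HTTP GET on this URL returns JSON with a "response" object
        // containing "numFound" and a "docs" array of matching documents.
    }
}
```

Any HTTP client (curl, JavaScript fetch, a backend library) can issue this request, which is exactly the application-neutral decoupling described above.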

You might want to have a look at Xapian and its Omega front end. It's essentially a toolkit on which you can build search functionality.

The best way to approach this will depend on how you construct your pages.

If they're frequently composed from a lot of different records (as I imagine stack overflow pages are), the indexing approach is likely to give better results unless you put a lot of work into effectively reconstructing the pages on the database side.

The disadvantage of the indexing approach is the turnaround time between content changing and the index reflecting it. There are workarounds (like Google's sitemap mechanism), but they're also complex to get right.

If you go down the database path, also be aware that modern search engine systems function much better if they have link data to process, so finding a system that can understand links between 'pages' in the database will have a positive effect.

If you are on a Microsoft platform, you could use the Indexing Service. This integrates very easily with IIS websites.

It has all the basic features like full-text search, ranking, and excluding and including certain file types, and you can add your own meta information as well via meta tags in the HTML pages.

Do a Google search and you'll find tons!

This is somewhat orthogonal to your question, but I highly recommend the idea of a RESTful search. That is, to perform a search that has never been performed before, the website POSTs a query to /searches/. To re-run a search, the website GETs /searches/{some id}.
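A minimal sketch of that idea, with the HTTP layer simulated by plain method calls (the resource paths, class, and method names here are hypothetical, not from any framework):

```java
import java.util.*;

// Sketch of a RESTful search: POST /searches/ stores the query and returns
// an id; GET /searches/{id} re-runs the stored query. The actual search
// execution is stubbed out since it isn't the point of the pattern.
public class SearchResource {
    private final Map<String, String> savedQueries = new HashMap<>();
    private int nextId = 1;

    // Simulates POST /searches/ -> returns the id of the stored search.
    public String post(String query) {
        String id = String.valueOf(nextId++);
        savedQueries.put(id, query);
        return id;
    }

    // Simulates GET /searches/{id} -> re-runs the stored query (stubbed).
    public String get(String id) {
        String query = savedQueries.get(id);
        return query == null ? null : "results for: " + query;
    }

    public static void main(String[] args) {
        SearchResource resource = new SearchResource();
        String id = resource.post("lucene indexing");
        System.out.println(id);                  // prints 1
        System.out.println(resource.get(id));    // prints results for: lucene indexing
    }
}
```

The benefit is that a search becomes an addressable resource: its URL can be bookmarked, cached, and shared, which a POST-only search form cannot offer.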

There are some good documents to be found regarding this, for example here.

(That said, I like indexing where possible, though it is an optimization, and thus can be premature.)

If your application uses the Java EE stack and you are using Hibernate, you can use the Compass Framework to maintain a searchable index of your database. The Compass Framework uses Lucene under the hood.

The only catch is that you cannot replicate your search index, so you need to use a clustered database to hold the index tables, or use the newer grid-based index storage mechanisms that were added in Compass Framework 2.x.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow