best practices for text-based searches in database

https://stackoverflow.com/questions/8977062

12-11-2019
|

Question

I have an application where I need to search in various text-based fields. The application is developed using NHibernate as an ORM.

I would like to implement Porter Stemming in searches, in order to be able to return relevant results even when the keyword matches a similar word, for example the description of a product contains memories while the search keyword is memory.

Can anyone suggest the best practices for such types of searches? The first idea that comes to mind is to store two version of the same field in database, for example:

Description
Description_Search

The Description column would be the text as entered by the website administrator, and is the text visible on the frontend.

The Description_Search would include the same text, but passed through a Porter-Stemming algorithm. Search queries would then be based on the Description_Search field, rather than Description.

Does this make sense? Is it a waste of space having to store two version of almost the same text?

Also, would Lucene.Net help in such a case? I am also looking into integrating Lucene.Net for full-text based searches but haven't yet looked into it in detail.

Thanks in advance!

Solution

There's no need to use two fields for this, one would be enough. A field has two "values", one stored that can be retrieved using Document.Get(...), and one indexed that's used for searching. It's not technically required to store the values either, a common solution is to store a id that's used to lookup the original content in a database. This would also allow you to lookup more information, like author information and document location.

Lucene.Net would help in this case, but it requires you to write the infrastructure yourself. You would need to take care of configuring analyzers (usually nothing to configure), and index your content. As mention in a comment, you could go for SQL Server's Full Text Search functionality, but that itself has some limitations (which may not affect you).

One big problem I've had using SQL Server's FTS, but works in Lucene.Net (this isn't really fair since you can do almost anything in Lucene.Net since you write the code that does it) is accent sensitivity. I've been unable to configure it using Swedish language rules, where åäö should be treated as real characters. Enabling accent sensitivity would do this, but it would also mean that diacritics is parsed as real characters, which means that ñ differs from n. (Imagine searching for a "jalapeno" and get no matches for "jalapeño"). Disabling accent sensitivity basically removes all diacritics, turning åäö into aao, and words turn out completely different.

Writing things in Lucene.Net (compared to SQL Server FTS) allows you to provide result highlighting (present which phrases in a document that matches the query), search for similar documents, spell-checking, custom result boosting, facets, and other things that would enhance your users' search experience.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow