What is Full Text Search vs LIKE

https://stackoverflow.com/questions/224714

03-07-2019
|

Question

I just read a post mentioning "full text search" in SQL.

I was just wondering what the difference between FTS and LIKE are. I did read a couple of articles but couldn't find anything that explained it well.

Solution

In general, there is a tradeoff between "precision" and "recall". High precision means that fewer irrelevant results are presented (no false positives), while high recall means that fewer relevant results are missing (no false negatives). Using the LIKE operator gives you 100% precision with no concessions for recall. A full text search facility gives you a lot of flexibility to tune down the precision for better recall.

Most full text search implementations use an "inverted index". This is an index where the keys are individual terms, and the associated values are sets of records that contain the term. Full text search is optimized to compute the intersection, union, etc. of these record sets, and usually provides a ranking algorithm to quantify how strongly a given record matches search keywords.

The SQL LIKE operator can be extremely inefficient. If you apply it to an un-indexed column, a full scan will be used to find matches (just like any query on an un-indexed field). If the column is indexed, matching can be performed against index keys, but with far less efficiency than most index lookups. In the worst case, the LIKE pattern will have leading wildcards that require every index key to be examined. In contrast, many information retrieval systems can enable support for leading wildcards by pre-compiling suffix trees in selected fields.

Other features typical of full-text search are

lexical analysis or tokenization—breaking a block of unstructured text into individual words, phrases, and special tokens
morphological analysis, or stemming—collapsing variations of a given word into one index term; for example, treating "mice" and "mouse", or "electrification" and "electric" as the same word
ranking—measuring the similarity of a matching record to the query string

OTHER TIPS

FTS involves indexing the individual words within a text field in order to make searching through many records quick. Using LIKE still requires you to do a string search (linear or the like) within the field.

MySQL creates an index from the words of the enabled full-text search column and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.

Also, from this SO answer:

There's a few advantages to full text searching.

Indexing:

Something like:
WHERE Foo LIKE '%Bar';
Cannot take advantage of an index. It has to look at every single row, and see if it matches. A fulltext index, however, can. In fact, fulltext indexes can offer a lot more flexibility in terms of the order of matching words, how close those words are together, etc.

Stemming:

A fulltext search can stem words. If you search for run, you can get results for "ran" or "running". Most fulltext engines have stem dictionaries in a variety of languages.

Weighted Results:

A fulltext index can encompass multiple columns. For example, you can search for "peach pie", and the index can include a title, keywords, and a body. Results that match the title can be weighted higher, as more relevant, and can be sorted to show near the top.

Disadvantages:

A fulltext index can potentially be huge, many times larger than a standard B-TREE index. For this reason, many hosted providers who offer database instances disable this feature, or at least charge extra for it. For example, last I checked, Windows Azure did not support fulltext queries.

Fulltext indexes can also be slower to update. If the data changes a lot, there might be some lag updating indexes compared to standard indexes.

Like uses wildcards only, and isn't all that powerful.

Full text allows much more complex searching, including And, Or, Not, even similar sounding results (SOUNDEX) and many more items.

I would start looking at the SQL CONTAINS() FREETEXT() and related Full Text search items to help get a better understanding of what is available.

The real difference is the scanning methodologies. For full-text search, the words (terms) are used as hash keys - each of which is associated with an array of documents the keys (terms) appears in. Its like this:

Document sets = {d1, d2, d3, d4, ... dn}
Term sets = {t1, t2, t3, .. tn}

Now term-document matrix (which term member of which document) can be represented as:

t1 -> {d1, d5, d9,.. dn}
t2 -> {d11, d50, d2,.. dn}
t3 -> {d23, d67, d34,.. dn}
:
tn -> {d90, d87, d57,.. dn}

When the request comes in asking for "Get me all documents containing the word/term t1" - then the document set {d1, d5, d9,.. dn} is returned.

You could hack a de-normalized table schema to store documents - each row in MySQL table will be considered as "document" and a TEXT column could contain a paragraph etc. The inverted index will contain the terms as hash keys and the row-ids as the document ids.

Remember that this SQL query will have more or less O(1) performance. The query will be independent of

Number of words/terms in the TEXT column
The number of rows/documents matching the criteria
The length of the words/terms

For instance this SQL could be fired to extract all rows matching the given word XYZ:

SELECT * 
FROM   my_table 
WHERE  MATCH (my_text_column) against ('XYZ' IN boolean mode) ;

Caveat: If you add ORDER BY to this query, your runtimes will vary based on the several parameters, one of which is the number of matching rows/documents. So beware.

The LIKE however has got nothing of this. It is forced to linearly scan the sentence/string and find all matching terms. Adding wild card adds to the mess. It works great for small length strings, as you can imagine, but will fail miserably for longer sentences. And definitely not comparable when having a paragraph or a whole page of text etc.

FTS is more efficient, powerful (especially for Word Breakers and stemming functionalities) ... but check your requirements because sometimes DBs don't support all languages for example MSSQL doesn't support Greek (check on this page http://msdn.microsoft.com/en-us/library/ms176076(v=sql.110).aspx )

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow