SQLite: Efficient substring search in large table

https://stackoverflow.com/questions/11334832

19-06-2021

Question

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).

CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)

Note that only 20% of all entries contain strings in the "name" column.

Performing the following query takes almost 2 minutes:

SELECT elementID, name FROM Elements WHERE name LIKE '%foo%'

I then tried FTS3 in order to speed up the query. That was quite successful: query time decreased to 1 minute (surprisingly, the database file size increased by only 5%, which is also quite good for my purpose).

The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.

So actually I have two questions:

1. Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...
2. How can I get real substring search using FTS3?
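
The difference between the two queries comes from FTS3 matching whole tokens rather than arbitrary runs of characters. A minimal sketch in Python, assuming the standard `sqlite3` module is linked against an SQLite build with FTS3 enabled; the table name `names_fts` and the helper functions are made up for illustration:

```python
import sqlite3

def build_db(names):
    # In-memory database holding the same names in a plain table
    # and in an FTS3 virtual table (hypothetical name: names_fts).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Elements (elementID INTEGER, name TEXT)")
    conn.execute("CREATE VIRTUAL TABLE names_fts USING fts3(name)")
    for i, name in enumerate(names, start=1):
        conn.execute("INSERT INTO Elements VALUES (?, ?)", (i, name))
        conn.execute("INSERT INTO names_fts(docid, name) VALUES (?, ?)", (i, name))
    return conn

def like_search(conn, term):
    # Plain substring search; wildcards inside `term` are not escaped here.
    rows = conn.execute(
        "SELECT name FROM Elements WHERE name LIKE ? ORDER BY elementID",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]

def fts_search(conn, term):
    # Full-text search: `term` has to match a whole token.
    rows = conn.execute(
        "SELECT name FROM names_fts WHERE name MATCH ? ORDER BY docid",
        (term,),
    )
    return [row[0] for row in rows]

conn = build_db(["foo bar", "foobar"])
print(like_search(conn, "bar"))  # both rows contain the substring
print(fts_search(conn, "bar"))   # only the row where "bar" is its own token
```

With the default tokenizer, "foobar" is indexed as a single token, so a MATCH on "bar" does not see it.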

Solution

Solution 1: If you can store every character in your database as an individual word, you can use phrase queries to search for substrings.

For example, assume "my_table" contains a single column "person":

person
------
John Doe
Jane Doe

You can change it to:

person
------
J o h n D o e
J a n e D o e

To search for the substring "ohn", use a phrase query:

SELECT * FROM my_table WHERE person MATCH '"o h n"'

Beware that a search for "JohnD" will then match "John Doe", because the space between the two words is no longer stored. This may not be desired. To fix it, replace the space character in the original string with something else.

For example, you can replace the space character with "$":

person
------
J o h n $ D o e
J a n e $ D o e
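
The transformation and the phrase query can be sketched in Python as follows. This is only an illustration, assuming an SQLite build with FTS3 enabled. The helper names (`spread_chars`, `substring_phrase`, `search`) and the extra `chars` column are made up here; the original text is kept in `person` only so that results are readable.

```python
import sqlite3

def spread_chars(text, space_marker=None):
    # "John Doe" -> "J o h n D o e"; with space_marker="$" the original
    # spaces are kept as a marker: "J o h n $ D o e".
    parts = []
    for ch in text:
        if ch.isspace():
            if space_marker is not None:
                parts.append(space_marker)
        else:
            parts.append(ch)
    return " ".join(parts)

def substring_phrase(substring):
    # Build an FTS phrase query such as '"o h n"'.
    # Assumes `substring` contains no double quotes.
    return '"' + spread_chars(substring) + '"'

def search(people, substring):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE my_table USING fts3(person, chars)")
    conn.executemany(
        "INSERT INTO my_table(person, chars) VALUES (?, ?)",
        [(p, spread_chars(p)) for p in people],
    )
    rows = conn.execute(
        "SELECT person FROM my_table WHERE chars MATCH ? ORDER BY docid",
        (substring_phrase(substring),),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

people = ["John Doe", "Jane Doe"]
print(search(people, "ohn"))    # substring inside one word
print(search(people, "JohnD"))  # also matches across the lost space
```

Whether a marker such as "$" is kept as a token depends on the tokenizer in use: the default `simple` tokenizer treats non-alphanumeric ASCII characters as separators, so the marker may need to be a character that the chosen tokenizer keeps.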

Solution 2: Following the idea of solution 1, you can use a custom tokenizer that treats every character as an individual word, and then use phrase queries to search for substrings.

The advantage over solution 1 is that you don't have to add spaces to your data, which would unnecessarily increase the size of the database.

The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.
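
The answer's C tokenizer is not reproduced here. Purely to illustrate the idea, the following Python sketch shows the kind of token stream a one-character-per-token tokenizer might produce. The tuple layout (token, start byte offset, end byte offset, position) mirrors what an FTS3 tokenizer reports for each token; the function name `char_tokens`, the lowercasing, and the choice to skip whitespace are assumptions made for this example, not details taken from the answer.

```python
def char_tokens(text):
    # Emit one token per non-whitespace character.
    # Offsets are counted in UTF-8 bytes; positions count emitted tokens.
    tokens = []
    byte_offset = 0
    position = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if not ch.isspace():
            tokens.append((ch.lower(), byte_offset, byte_offset + size, position))
            position += 1
        byte_offset += size
    return tokens

for token in char_tokens("Jo D"):
    print(token)
```

Because whitespace is skipped in this sketch, "John Doe" and "JohnDoe" would produce the same token sequence, which is the same caveat as in solution 1; a tokenizer that emits a token for the space would avoid it.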

Other suggestions

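You should add an index on the name column of your table; that should speed up the query considerably. Note, however, that SQLite can use an ordinary index for LIKE only when the pattern does not begin with a wildcard, so at best this helps prefix patterns such as 'foo%', not '%foo%'.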

I believe SQLite3 supports sub-string matching like so:

SELECT * FROM Elements WHERE name MATCH '*foo*';

http://www.sqlite.org/fts3.html#section_3

I am facing something similar to your problem. Here is my suggestion: try creating a translation table that maps all the words to numbers, then search for the numbers instead of the words.

Please let me know if this helps.

I'm not sure about speeding it up, since you're using SQLite, but for substring searches I have done things like this:

SET @foo_bar = 'foo bar'
SELECT * FROM table WHERE name LIKE '%' + REPLACE(@foo_bar, ' ', '%') + '%'

Of course, this only returns records that have the word "foo" before the word "bar".
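
The snippet above uses variable and `+` concatenation syntax that SQLite does not have. A rough SQLite equivalent, sketched with Python's standard `sqlite3` module, uses the `||` operator and a bound parameter instead; the function name `ordered_words_search` and the sample rows are made up for illustration.

```python
import sqlite3

def ordered_words_search(conn, phrase):
    # Turns 'foo bar' into the pattern '%foo%bar%' inside SQL.
    sql = (
        "SELECT name FROM Elements "
        "WHERE name LIKE '%' || REPLACE(?, ' ', '%') || '%' "
        "ORDER BY elementID"
    )
    return [row[0] for row in conn.execute(sql, (phrase,))]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)"
)
conn.executemany(
    "INSERT INTO Elements (elementID, name) VALUES (?, ?)",
    [(1, "foo bar"), (2, "foobar"), (3, "bar foo"), (4, "food and bars")],
)
print(ordered_words_search(conn, "foo bar"))
```

Rows 1, 2 and 4 match, since "foo" is followed somewhere later by "bar"; row 3 does not. This is still a LIKE pattern with a leading wildcard, so it is not expected to be faster than the original query.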

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow