Good algorithm to find all pairs of strings between 2 sets so that all words from the 1st string are contained in the 2nd string?

cs.stackexchange https://cs.stackexchange.com/questions/120658

  •  29-09-2020

Question

I have 2 large sets of strings (actually they are product names). "Large" means a few million strings.

Example:

Set 1:

Some good product
Another product
Some name
Blah

Set 2:

Very long some product name with words blah
Another very long product name
asd asd sad sad asdsa
Blah blah blah

Set 1 contains "good" names. Set 2 contains "dirty" names.

I want the following: for every item from Set 2 (further: item2), find the longest item from Set 1 (further: item1) such that all words from item1 are contained in item2.

For the given example the pairs will be the following:

Very long SOME product NAME with words blah => Some name
ANOTHER very long PRODUCT name              => Another product
asd asd sad sad asdsa                       => none
BLAH blah blah                              => blah

So far I couldn't think of anything better than a brute-force algorithm:

  1. Split every string from Set 1 into words; we get a set of word lists, let it be Set 3.
  2. Split every string from Set 2 into words; we get a set of word lists, let it be Set 4.
  3. Pick a list of words from Set 4 (further: list4), compare it to all the lists of words from Set 3, and keep the longest list3 that is fully contained in list4.

However it has pretty high complexity and works quite slowly. My simple implementation takes about 1.8s to find 1 pair (Set 1 has 3 million items, Set 2 has 4 million items). If I implement the same task using MySQL full-text indexes (which allow searching for strings that contain all the given words), then 1 search takes about 0.4s. So I'm wondering whether there are some good approaches which could be applied here without too much effort :)

My programming language is PHP7. The data is stored in MySQL DB.
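
For illustration, a minimal PHP sketch of the brute force described in the steps above might look like this (the function and variable names are placeholders of mine, matching is case-insensitive on whitespace-separated words, and "longest" is taken to mean the most words):

// Brute force: for every dirty name, scan every clean name and keep the longest
// one whose words are all contained in the dirty name.
function bruteForceMatch(array $clean, array $dirty): array
{
    $result = [];
    foreach ($dirty as $dirtyName) {
        // Word set of the dirty name, lower-cased for case-insensitive matching.
        $dirtyWords = array_flip(preg_split('/\s+/', mb_strtolower($dirtyName), -1, PREG_SPLIT_NO_EMPTY));
        $best = null;
        $bestLen = 0;
        foreach ($clean as $cleanName) {
            $cleanWords = preg_split('/\s+/', mb_strtolower($cleanName), -1, PREG_SPLIT_NO_EMPTY);
            $contained = true;
            foreach ($cleanWords as $w) {
                if (!isset($dirtyWords[$w])) { $contained = false; break; }
            }
            if ($contained && count($cleanWords) > $bestLen) {
                $best = $cleanName;
                $bestLen = count($cleanWords);
            }
        }
        $result[$dirtyName] = $best; // null when no clean name matches
    }
    return $result;
}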


Solution

I'll list two possible approaches that might be reasonably effective in practice, though their worst-case running time is no better than what you listed.

Indices

You can build an index over words. Build a hash table that maps each word (it is enough to cover the words that appear in some clean name) to a list of all the dirty names that contain that word. This hash table can be built once, in a linear scan of the set of dirty names (Set 2).
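
A minimal PHP sketch of such an index, assuming names are split on whitespace and lower-cased (the helper names are mine, not part of the answer):

// Split a name into its unique, lower-cased words.
function words(string $name): array
{
    return array_values(array_unique(
        preg_split('/\s+/', mb_strtolower($name), -1, PREG_SPLIT_NO_EMPTY)
    ));
}

// Build once, in a single pass over the dirty names: map every word to the list
// of dirty names (stored as indexes into $dirty) that contain it. Restricting
// this to words that also occur in some clean name is an easy extra filter.
function buildIndex(array $dirty): array
{
    $index = [];
    foreach ($dirty as $i => $dirtyName) {
        foreach (words($dirtyName) as $w) {
            $index[$w][] = $i;
        }
    }
    return $index;
}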

Then, given a clean name, iterate over the words in the clean name. For each word, look it up in the hash table and iterate over all the dirty names that contain it, checking how many words each of those dirty names has in common with the clean name. Keep the best match.

This can be optimized a bit. If the clean name contains a word that occurs in many dirty names, handling that word will be slow. So, you could count for each word how many dirty names it occurs in (its frequency) and store this in a hash table. Then, given a clean name, you could iterate over the words in the clean name in order of increasing frequency, keeping track of the best match you've found so far. Once you have found a match of length $\ell$, you can stop the iteration early and skip the $\ell-1$ highest-frequency words of the clean name without missing any better match.
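
A sketch of this frequency-ordered search with the early stop, reusing words() and buildIndex() from the sketch above (all names are mine; the "length" of a match is taken to be the number of words a dirty name shares with the clean name):

// For one clean name, return the index of the dirty name that shares the most
// words with it, probing the clean name's words from rarest to most frequent.
function bestMatchForCleanName(string $cleanName, array $index, array $dirty)
{
    $cleanWords = words($cleanName);
    // Rarest words first: their posting lists are the shortest.
    usort($cleanWords, function ($a, $b) use ($index) {
        return count($index[$a] ?? []) <=> count($index[$b] ?? []);
    });

    $best = null;    // index into $dirty of the best match so far
    $bestLen = 0;    // number of words it shares with the clean name
    $n = count($cleanWords);
    foreach ($cleanWords as $k => $w) {
        // Early stop: only the $n - $k most frequent words are left. A strictly
        // better match would share more than $bestLen words, so it must contain
        // at least one of the rarer words we have already processed.
        if ($n - $k <= $bestLen - 1) {
            break;
        }
        foreach ($index[$w] ?? [] as $i) {
            $common = count(array_intersect(words($dirty[$i]), $cleanWords));
            if ($common > $bestLen) {
                $bestLen = $common;
                $best = $i;
            }
        }
    }
    return $best;    // null if no dirty name shares any word with the clean name
}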

Tries

The order of words in a name is irrelevant, so sort the words in each phrase. For instance, 'some good product' becomes 'good product some'. Do this to each name in each set.

Next, build a trie to represent the set of good names (Set1). For instance, in your example the trie will be

+-- another --+-- product --+
|-- blah --+
|-- good --+-- product --+-- some --+
`-- name --+-- some --+

Now, pick a dirty name. We want to find a match for it from the trie. I suggest you use a recursive algorithm to find all matches: to find a match for the name $w_1 \cdots w_n$ in the trie $T$, check whether there is an edge out of the root of $T$ labelled $w_1$, and if so, recursively find all matches for $w_2 \cdots w_n$ in the subtrie pointed to by that edge; also recursively find all matches for $w_2 \cdots w_n$ in $T$. Once you have found all matches, keep the longest one.
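
A PHP sketch of the trie and of this recursive matching, under the same whitespace/lower-casing assumptions as before (the node layout and all names are mine):

// Sort a name's unique, lower-cased words (lexicographically, for now).
function sortedWords(string $name): array
{
    $w = array_values(array_unique(
        preg_split('/\s+/', mb_strtolower($name), -1, PREG_SPLIT_NO_EMPTY)
    ));
    sort($w);
    return $w;
}

// Each node is ['end' => original clean name or null, 'kids' => children keyed
// by word]; a clean name is stored at the node its sorted word list leads to.
function buildTrie(array $clean): array
{
    $root = ['end' => null, 'kids' => []];
    foreach ($clean as $name) {
        $node = &$root;
        foreach (sortedWords($name) as $w) {
            if (!isset($node['kids'][$w])) {
                $node['kids'][$w] = ['end' => null, 'kids' => []];
            }
            $node = &$node['kids'][$w];
        }
        $node['end'] = $name;
        unset($node);   // break the reference before the next name
    }
    return $root;
}

// Recursive matching as described above: at word $i either follow the edge
// labelled $words[$i] (if it exists) or skip that word.
// Returns [number of matched words, original clean name] or [0, null].
function longestMatch(array $trie, array $words, int $i = 0, int $depth = 0): array
{
    $best = $trie['end'] !== null ? [$depth, $trie['end']] : [0, null];
    if ($i === count($words)) {
        return $best;
    }
    if (isset($trie['kids'][$words[$i]])) {
        $viaEdge = longestMatch($trie['kids'][$words[$i]], $words, $i + 1, $depth + 1);
        if ($viaEdge[0] > $best[0]) {
            $best = $viaEdge;
        }
    }
    $skipped = longestMatch($trie, $words, $i + 1, $depth);
    return $skipped[0] > $best[0] ? $skipped : $best;
}

With $cleanNames holding the four Set 1 names from the question, longestMatch(buildTrie($cleanNames), sortedWords('Another very long product name')) returns [2, 'Another product'].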

For instance, for 'another very long product name', after sorting this becomes 'another long name product very'. You look that up in the trie by recursively finding all matches for 'long name product very' in the subtrie below the 'another' edge (i.e., +-- product --+), and by finding all matches for 'long name product very' in the main trie.

This search process can be optimized in various ways, e.g., by keeping track of the longest match found so far and stopping early if there's no way the recursive call could find a longer match based on how many words you've matched so far and how many words remain.
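
A sketch of that pruning, extending longestMatch() from the sketch above (keeping the best result as a [length, name] pair passed by reference is my choice, not something prescribed by the answer):

// Prune: $depth counts the trie edges followed so far; if even matching every
// remaining word of the dirty name cannot beat the best clean name found so
// far, give up on this branch.
function longestMatchPruned(array $trie, array $words, int $i, int $depth, array &$best)
{
    if ($trie['end'] !== null && $depth > $best[0]) {
        $best = [$depth, $trie['end']];   // $best = [matched word count, clean name]
    }
    $remaining = count($words) - $i;
    if ($remaining === 0 || $depth + $remaining <= $best[0]) {
        return;   // no way to improve on $best from here
    }
    $w = $words[$i];
    if (isset($trie['kids'][$w])) {
        longestMatchPruned($trie['kids'][$w], $words, $i + 1, $depth + 1, $best);
    }
    longestMatchPruned($trie, $words, $i + 1, $depth, $best);
}

// Usage: $best = [0, null];
//        longestMatchPruned($trie, sortedWords($dirtyName), 0, 0, $best);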

There's no requirement to sort in lexicographic order. You can sort in any other order, as long as it is consistent. For example, you could sort by the frequency of words in the entire dataset (least common words first), which might help reduce the number of recursive calls.
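
For instance, sortedWords() from the sketch above could order words by a precomputed frequency map instead of alphabetically (again only a sketch; $freq is assumed to map each word to the number of names containing it). Whatever order you pick, use the same one both when building the trie and when looking up dirty names.

// Order words by corpus frequency, rarest first, with an alphabetical tie-break
// so that the order is consistent across all names.
function sortedWordsByFrequency(string $name, array $freq): array
{
    $words = array_values(array_unique(
        preg_split('/\s+/', mb_strtolower($name), -1, PREG_SPLIT_NO_EMPTY)
    ));
    usort($words, function ($a, $b) use ($freq) {
        $fa = $freq[$a] ?? 0;
        $fb = $freq[$b] ?? 0;
        return $fa === $fb ? strcmp($a, $b) : ($fa <=> $fb);
    });
    return $words;
}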

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange