Problem

I'm looking for an efficient data structure to do string/pattern matching on a really huge set of strings. I've found out about tries, suffix-trees and suffix-arrays, but I couldn't find a ready-to-use implementation in C/C++ so far (and implementing one myself seems difficult and error-prone). I'm also still not sure whether suffix-arrays are really what I'm looking for... I've tried libdivsufsort and esaxx, but couldn't figure out how to use them for my needs:

I want to use a predefined set of strings, with wildcards (or even regular expressions), to match user input. I have a huge list of predefined strings, e.g.

"WHAT IS *?" "WHAT IS XYZ?" "HOW MUCH *?" ...

Now I want to find the best matching string (if there is one that matches at all). E.g. the user input "WHAT IS XYZ?" should find "WHAT IS XYZ?" rather than "WHAT IS *?", but "WHAT IS SOMETHING?" should find "WHAT IS *?" (assuming * is a wildcard for any number of characters).

Building the structure isn't time critical (and the structure doesn't have to be super space efficient), but the search shouldn't take too long. How can that be done easily? Any framework/library or code example is welcome.

Thanks


Solution 4

Here is a solution that, I believe, should work well if you have a very large number of patterns. For just 10k it may be overkill, and implementing it means a fair amount of work, but you may be interested nevertheless.

The basic idea is to create an inverted index that maps substrings of the patterns to pattern IDs. First, each pattern gets an ID:

1: what is *
2: where is *
3: do * need to
etc.

And then we create an inverted index. In the simplest case, we split the patterns into tokens and map each token to the list of pattern IDs it occurs in. We can be flexible in what we define as a token, but one method is to assume that every white-space separated word is one token. So here is the index:

what  -> 1
is    -> 1,2
where -> 2
do    -> 3
need  -> 3
to    -> 3

Then, when you get an input string from the user, you split that into tokens and look them up in the index. You combine all pattern IDs you get from the index. Example:

INPUT: what is something?

TOKENS:
   what      -> 1
   is        -> 1,2
   something -> n/a

You retrieve the pattern IDs for each token and put them into a temporary data structure that counts the frequency of each ID, for example a hash (e.g. a std::unordered_map<id_type,std::size_t>).

You then sort this by frequency to find out that rule 1 was found twice and rule 2 was found once.

You then apply the rules you found, in the order of frequency, to the input text. Here you use a regular expression library or something similar to generate matches. The most frequent rule has the most tokens in common with the input text, so it is likely to match well.

The overall advantage of the approach is that you need not apply all the rules to the input, but only those that share at least one token with it, and even among those you proceed in the order of how many tokens each rule shares with the input. Once you have found a matching rule, you could probably break off the rest of the matching procedure (or not – depending on whether you want all matching rules in each case, or just one that is a very good match).
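
To make this concrete, here is a minimal C++ sketch of the token-based preselection, using the three example patterns above. The whitespace tokenisation and the 0-based IDs are just illustrative assumptions, not part of the original answer:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        // Pattern IDs are simply indices into this vector (0-based here).
        std::vector<std::string> patterns = {"what is *", "where is *", "do * need to"};

        // Build the inverted index: token -> IDs of the patterns containing it.
        std::unordered_map<std::string, std::vector<std::size_t>> index;
        for (std::size_t id = 0; id < patterns.size(); ++id) {
            std::istringstream tok(patterns[id]);
            std::string t;
            while (tok >> t)
                if (t != "*")                  // the wildcard is useless as an index key
                    index[t].push_back(id);
        }

        // Query: tokenize the input and count how often each pattern ID is hit.
        std::unordered_map<std::size_t, std::size_t> hits;
        std::istringstream input("what is something?");
        std::string t;
        while (input >> t) {
            auto it = index.find(t);
            if (it != index.end())
                for (std::size_t id : it->second)
                    ++hits[id];
        }

        // Rank candidates by descending hit count; apply the rule regexes in this order.
        std::vector<std::pair<std::size_t, std::size_t>> ranked(hits.begin(), hits.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& r : ranked)
            std::cout << '"' << patterns[r.first] << "\" shares "
                      << r.second << " token(s) with the input\n";
    }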

Improvement: The above performs the rule preselection based on tokens. Instead, you could concatenate all the rules like this:

what is *||where is *||do * need to||...

Then you construct a suffix array of this concatenated string.

Then, given an input string, you match it against the suffix array to identify all substring-matches, including matches that are smaller than one token or span across multiple tokens. In the example above I assume that the wildcard symbols * and $ are included in the suffix array, although of course no part of an input string will ever match them. You can well exclude them from the suffix array or replace them with a dummy character.

Once you determine the matches, you sort them by length. You also must map the match positions in the concatenated string to rule IDs. This is readily possible by maintaining an array of starting positions of rules relative to the concatenated string; there are also highly-optimised methods based on indexed bit vectors (I can elaborate on this if necessary).
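
For the simple array-of-starting-positions variant, a binary search is enough to map a match position back to a rule ID. A small sketch (the function name and the example offsets are hypothetical):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Given the start offset of each rule inside the concatenated pattern string
    // (sorted ascending, first entry 0), return the ID of the rule containing
    // position match_pos.
    std::size_t rule_for_position(const std::vector<std::size_t>& rule_starts,
                                  std::size_t match_pos) {
        // First start offset strictly greater than match_pos, then step back one.
        auto it = std::upper_bound(rule_starts.begin(), rule_starts.end(), match_pos);
        return static_cast<std::size_t>(it - rule_starts.begin()) - 1;
    }

    // Example: in "what is *||where is *||do * need to" the rules start at
    // offsets 0, 11 and 23, so a match at position 13 maps to rule ID 1
    // ("where is *").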

Once you have the matching rule IDs, you do the same as in the inverted index case: Apply the matching rules, using standard regex matching (or similar).

Again, this approach is relatively complicated and makes sense only when you have a very large number of rules, and only if chances are that a token-based (or substring-based) lookup reduces the number of candidate rules significantly. From the example rules you gave I assume the latter is the case, but whether the number of rules you are dealing with (in the order of 10k) justifies this approach, I am not sure. It may make more sense if the total number of rules is in the 100ks or millions.

Other tips

Given your comment that the patterns do not need to be updated at runtime I'm not sure you need a runtime structure at all.

I'd recommend using re2c or ragel to compile the patterns to code that will do the pattern matching.

You might want to look at flex. From the manual:

flex is a tool for generating scanners. A scanner is a program which recognizes lexical patterns in text. The flex program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, lex.yy.c by default, which defines a routine yylex(). This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

Also this:

The main design goal of flex is that it generate high-performance scanners. It has been optimized for dealing well with large sets of rules.

For example, this scanner matches the three patterns in your post:

%%
"WHAT IS XYZ?"      puts("matched WHAT-IS-XYZ");
"WHAT IS ".*"?"     puts("matched WHAT-IS");
"HOW MUCH ".*"?"    puts("matched HOW-MUCH");

Flex works by generating a deterministic finite automaton (DFA). A DFA looks at each input character exactly once. There is no backtracking, even when matching wildcards. Run time is O(N), where N is the number of input characters. (More patterns will generate larger DFA tables, which will cause more cache misses, so there is some penalty for more patterns. But that is true of any matching system I can think of.)

However, you will have to list your patterns in the proper order to match them correctly. Flex may tell you if there's a problem. For example, if you reverse the order of the WHAT-IS-XYZ and WHAT-IS patterns in the above scanner, flex will tell you:

:; flex matcher.l
matcher.l:3: warning, rule cannot be matched

If you can meet flex's requirements, flex should give you a very fast scanner.

Many of the above answers won't work well in practice, or do not constitute the best known answer to your question: for instance, using scanner generators from compiler construction (re2c, lex, flex, ...) is likely to fail for large lexicons, as these tools were designed for programming languages, which typically have at most a few hundred built-in keywords.

There are two alternative approaches that I can recommend:

(1) choose a C++ trie class that maps strings to size_t IDs, which index into a second array holding the payload data. Use a web search query such as:

c++ trie c++14 implementation fast "small footprint" site:github.com

to find suitable candidates - at the time of writing e.g. https://github.com/Tessil/hat-trie looks quite good (clean, portable, modern code, and there's a scientific paper to go with it, which adds credibility); or

(2) represent the mapping from string lexicon to payload data as a Finite State Transducer (FST), an extension of NFAs (automata with output), as realized by FOMA, XFST or OpenFST.

The first option means you will have to test available libraries for ease of use and scalability. The second option means you will have to make yourself familiar with FSTs and existing implementation libraries and their licenses.

(The second approach is used by computational linguists to model large word-lists as well as governments to scan everything you write on the Internet, so it scales very well.)
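
As a small illustration of option (1), a lookup with the tsl::htrie_map from the hat-trie repository mentioned above might look roughly like this (based on that project's README at the time of writing; check the current documentation before relying on the exact API):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <tsl/htrie_map.h>   // https://github.com/Tessil/hat-trie

    int main() {
        // Payload data lives in a separate array; the trie maps pattern -> index.
        std::vector<std::string> payload = {"reply for WHAT IS XYZ?", "reply for HOW MUCH *?"};

        tsl::htrie_map<char, std::size_t> patterns;
        patterns["WHAT IS XYZ?"] = 0;
        patterns["HOW MUCH *?"]  = 1;

        // Exact lookup of a user input.
        auto it = patterns.find("WHAT IS XYZ?");
        if (it != patterns.end())
            std::cout << payload[it.value()] << '\n';

        // Prefix lookup: every stored pattern that starts with "WHAT".
        auto range = patterns.equal_prefix_range("WHAT");
        for (auto p = range.first; p != range.second; ++p)
            std::cout << p.key() << " -> payload index " << p.value() << '\n';
    }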

Check out CritBit trees:

Example source code that's trivial to C++-ise if you really feel the need.

To find all matches, you use the function critbit0_allprefixed.

e.g.

// Find all strings that start with, or are equal to, "WHAT IS"
critbit0_allprefixed(tree, "WHAT IS", SomeCallback);

SomeCallback is called for each match.

Have you tried a ternary search tree? Here is a C++ implementation: http://code.google.com/p/ternary-search-tree/

I do not have any experience of how slow it is to build a ternary tree but I do know that search is very fast.

[edit]

For matching a wildcard inside the tree in partialMatchSearch (disclaimer: this is just a suggestion and not tested in any way):

You could add * symbols to the tree and an if-clause like this at the beginning of the partialMatchSearch function:

  // If the current node's split character is the wildcard '*', let it absorb
  // one more character of the key by recursing with the same node and key+1.
  if ( ( *key != 0 ) && tree->splitChar == '*' )
  {
     this->partialMatchSearch( tree, key+1 );
  }

In other words, just recursively call partialMatchSearch with the same node but the key advanced to the next character.

If space is no issue then you could do the following, off the top of my head.

Create a tree whose children at each node are the possible characters at that point in the string, where the depth in the tree is the index into the string. The first level of the tree is index level 0, or rather array index 0 in the string.

Each level would be an index into the string, so the root node would be index 0 and its children would be index 1. Each node would have N children, where N is the number of possible characters at that point in the string.

So, say the root had the set of possible characters [ "a", "b", "c" ]: it would have three children. Then, say you wanted to find a possible match for the string "ab": you would recurse to the child labelled "a" and go from there.

If you reach the end of the string before getting to a leaf node then the list of possible strings is the entire sub tree of all children at your current node.

Hope that makes sense; it would look sort of like a Huffman tree, but each leaf would be a potential string to choose from.

I would take Kernighan and Pike's advice and pick a reasonable algorithm and then brute-force it.

All your examples are looking for "longest prefix", which suggests a simple trie rather than a suffix-tree to me. Given you only need ~10k strings stored, you should be able to implement a char-trie in a couple of hours at most, using nested STL maps.

Unless memory is tight or performance truly is critical, this should be fine.
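
A minimal sketch of such a nested-map char-trie with a longest-prefix lookup might look like this (the names and structure are just one illustrative possibility):

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>

    // Minimal char-trie built from nested std::map, as suggested above.
    struct TrieNode {
        std::map<char, std::unique_ptr<TrieNode>> children;
        bool is_terminal = false;   // marks the end of a stored string
    };

    void insert(TrieNode& root, const std::string& s) {
        TrieNode* node = &root;
        for (char c : s) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->is_terminal = true;
    }

    // Length of the longest stored string that is a prefix of s (0 if none).
    std::size_t longest_prefix(const TrieNode& root, const std::string& s) {
        const TrieNode* node = &root;
        std::size_t best = 0;
        for (std::size_t i = 0; i < s.size(); ++i) {
            auto it = node->children.find(s[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->is_terminal) best = i + 1;
        }
        return best;
    }

    int main() {
        TrieNode root;
        insert(root, "WHAT IS ");
        insert(root, "WHAT IS XYZ?");
        insert(root, "HOW MUCH ");

        std::cout << "longest stored prefix of \"WHAT IS XYZ?\" has length "
                  << longest_prefix(root, "WHAT IS XYZ?") << '\n';   // prints 12
    }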

This isn't the question you're asking. You wanted something that is pre-cooked. But...

How complex does this need to be? If I were trying to do what you are asking, I would try something a bit more space intensive but much less time intensive.

I would (and have) start with a tree whose first level is indexed by the alphabet, i.e. by the character at position 0.

Then each child node would be a word (the dictionary).

Then each child node of that would be a potential string segment match, knowing, for example, that "round" almost never DIRECTLY follows "square". You may say, "Put the round peg in the square hole," but the word that follows "round" is "peg". So the segment matches for "round" would be "round peg", "round cup", "round ball". I'd also strike out articles because they mean nothing to the sentence (usually). The above paragraph would translate to, "each child node is a word".

I would add a heuristic, though, as even an advanced B-tree can get slow when you have that much data. I have seen them slow down to a minute or more when searching very large data sets.

That is assuming that you didn't want to actually use a database, which would probably be fastest of all unless you wanted to code in ASM.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow