Question

Anyone know how one might adapt a search tree to handle limited regular expressions? The task is, given a file name, find all nodes matching that file name. Nodes may contain usual file name globs (* and ?). Obviously, since this is a search tree, speed is of the essence.

EDIT: I should add that the most important case for speed is the average time to rule out a match. That is, in most cases matching will fail.

An example: Say the tree contained the following nodes:

foo, bar, foo*, *bar, foo?bar

Searching for foo would return nodes 1 and 3. Searching for bar would return nodes 2 and 4. Searching for fob would return no nodes. Searching for fooxbar would return node 5. Searching for foobar would return nodes 3 and 4.

Was it helpful?

Solution

An aho-corasick search tree would fit the bill. Aho-Corasick a very good article about this sort of thing Tries, and the implementation used in Evolution to replace regex searching Etrie

Edit: To do the whole string matching, you can add beginning and ending anchor states, if scanning multi line data, you could add the newline to begin and end. You could also remove the part where it add the cross linking for the partial matching starting a different match, this also allows faster exclusion.

Another algorithm for checking for membership in a string set is CritBit. This doesn't have Regex, but it simple and test complete strings.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top