Fast string matching algorithm with simple wildcards support
-
21-08-2019 - |
Question
I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a "part" of a URL. That is fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area where "/*" should match anything after the slash.
Examples:
- *.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
- sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
- Fast lookup (I realize "fast" is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
- Work within the scope of a modern desktop (e.g. not a server implementation)
- Ability to return 0:n matches given a input string
- Matches will have rule data attached to them
What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.
Solution
If I'm not mistaken, you can take string rule and break it up into domain, path, and query pieces, just like it's a URL. Then you can apply a standard wildcard matching algorithm with each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.
Example
Rule: *.site.com/* domain => *.site.com path => /* query => [empty] URL: sub.site.com/path/home.html domain => sub.site.com path => /path/home.html query => [empty] Matching process: domain => *.site.com matches sub.site.com? YES path => /* matches /path/home.html? YES query => [empty] matches [empty] YES Result: MATCH
As you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the *
's to %
's and then use the database's native LIKE
operation to do the matching for you. Then you'd just have a query like
SELECT *
FROM ruleTable
WHERE @urlDomain LIKE ruleDomain
AND @urlPath LIKE rulePath
AND @urlQuery LIKE ruleQuery
where @urlDomain
, @urlPath
, and @urlQuery
are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.
OTHER TIPS
First of all, one of the worst performing searches you can do is with a wildcard at both ends of the string ".domain.com/path" -- and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domains as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.
I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.