Question

I am trying to have some sort of Data Object (I'm thinking a dictionary) to hold a TON of regular expressions as keys, then I need to take a string of text, and match against them to get the actual value from the Dictionary. I need an efficient way to do this for a large set of data.

I am in C# and I'm not sure where to begin.

Was it helpful?

Solution

Why not use LINQ?

Dictionary<string, string> myCollection = new Dictionary<string, string>();

myCollection.Add("(.*)orange(.*)", "Oranges are a fruit.");
myCollection.Add("(.*)apple(.*)", "Apples have pips.");
myCollection.Add("(.*)dog(.*)", "Dogs are mammals.");
// ...

string input = "tell me about apples and oranges";

var results = from result in myCollection
              where Regex.Match(input, result.Key, RegexOptions.Singleline).Success
              select result;

foreach (var result in results)
{
    Console.WriteLine(result.Value);
}

// OUTPUT:
//
// Oranges are a fruit.
// Apples have pips.

OTHER TIPS

I'm not sure if you actually need regular expressions for this - you could use a trie. Representing dictionaries is a common application for a trie. (I'm assuming you mean a dictionary as in a list of words, and not the "associative array" meaning).

Do you mean match a string against the regexes to get a regex match? Or just a text match? In other words, is the string you have going to BE one of those regexes, or some data to APPLY a regex to?

If it IS a regex and you want to find it in the list, you don't need a Dictionary, those are 2 part containers. You could just use a List or StringCollection, and ask for IndexOf(mytString), -1 meaning it's not in there.

If your regexps aren't trivial single-strings, and you care for efficiency, you'd want to represent them in a single NFA (nondeterministic finite-state automaton, with values in final states. If it's possible for an input to match more than one regexp, then final states would need a set of values.

At this point, you're prepared to consider optimizing the automaton. If it can be practically determinized (this give you a DFA that can be exponentially larger than the NFA), then by all means do that. Once you have a DFA, you can efficiently (and uniquely up to isomorphism) minimize it (but since you have values in your final states, an obvious modification of the usual algorithm is needed).

There are also techniques for minimizing NFA directly. For example, if two states have the same suffix sets ({(rest of string,value)}) they're equivalent and can be combined. Equivalence in an acyclic NFA can be done via hash-consing starting from the final states.

Remember that if you are planning on using a regex more than once you can create a regex object as compiled and re-use it to reduce overhead.

Regex RegexObject = new Regex(Pattern, RegexOptions.Compiled);

Using this model you would be best storing a regex object rather than the pattern string.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top