Question

I have a function that takes an input string and then runs the string through several regular expressions until it finds a match. Once a match is found, I return an output that is a function of the original string and the match. So in ruby:

str = "my very long original string ... millions of characters"
case str
  when regex1
    do_something1(str,$1)
  when regex2
    do_something2(str,$1)
  when regex3
    do_something3(str,$1)
  ...
  when regex100000
    do_something100000(str,$1)
  else
    do_something_else(str)
end

Now, is Ruby actually optimizing this switch loop so that str only gets traversed once? Assuming it isn't, then this functionality could be performed much more efficiently with one big long regular expression that had embedded callbacks. Something like this:

/(?<callback:do_something1>regex1)|
(?<callback:do_something2>regex2)|
(?<callback:do_something3>regex3)|
   ...
(?<callback:do_something100000>regex100000)/

Is there any technology that does this?

Was it helpful?

Solution

If Ruby 1.9, and if you use named groups in your regular expressions, then you can combine all of the regular expressions into one with a bit of trickery. Here's the class that does the heavy lifting:

class BigPatternMatcher

  def initialize(patterns_and_functions, no_match_function)
    @regex = make_big_regex(patterns_and_functions)
    @no_match_function = no_match_function
  end

  def match(s, context)
    match = @regex.match(s)
    context.send(function_name(match), match)
  end

  private

  FUNC_GROUP_PREFIX = "func_"
  FUNC_GROUP_REGEX = /^#{FUNC_GROUP_PREFIX}(.*)$/

  def function_name(match)
    if match
      match.names.grep(FUNC_GROUP_REGEX).find do |name|
        match[name]
      end[FUNC_GROUP_REGEX, 1]
    else
      @no_match_function
    end
  end

  def make_big_regex(patterns_and_functions)
    patterns = patterns_and_functions.map do |pattern, function|
      /(?<#{FUNC_GROUP_PREFIX}#{function}>#{pattern.source})/
    end
    Regexp.union(patterns)
  end

end

We'll come back to how this works. To use it, you'll need a list of regular expressions and the name of the function that should be invoked for each one. Make sure to use named groups only:

PATTERNS_AND_FUNCTIONS = [
  [/ABC(?<value>\d+)/, :foo],
  [/DEF(?<value>\d+)/, :bar],
]

And the functions, including one that gets called when there's no match:

def foo(match)
  p ["foo", match[:value]]
end

def bar(match)
  p ["bar", match[:value]]
end

def default(match)
  p ["default"]
end

And, finally, here's how it's used. BigPatternMatcher#match takes the string to match, and the object on which the function should be called:

matcher = BigPatternMatcher.new(PATTERNS_AND_FUNCTIONS, :default)
matcher.match('blah ABC1 blah', self)    # => ["foo", "1"
matcher.match('blah DEF2 blah', self)    # => ["bar", "2"]
matcher.match('blah blah', self)         # => ["default"]

See below the fold for the trickery that makes it work.


BigPatternMatcher#make_big_regex combines all of the regular expressions into one, each surrounded in parentheses and separated by |, e.g.

/(ABC(?<value>\\d+))|(DEF(?<value>\\d+))/

That isn't enough, though. We need some way, when one of the sub-expression matches, to identify which one matched, and therefore which function to call. To do that, we'll make each sub-expression its own named group, with the name based on the function that should be called:

/(?<func_foo>ABC(?<value>\\d+))|(?<func_bar>DEF(?<value>\\d+))/

Now, let's see how we get from a match to calling a function. Given the string:

"blah ABC1 blah"

Then the match group is:

#<MatchData "ABC1" func_foo:"ABC1" value:"1" func_bar:nil value:nil>

To figure out which function to call, we just have to find the match with a name that starts with "func_" and which has a non-nil value. The part of the group's name after "func_" names the function to call.


Measuring the performance of this against the straightforward technique in your question is an exercise for the reader.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top