After further searching I came across the rexical gem, which is itself a renamed-and-slightly-maintained version of rex. This is an old-school lexer-generator whose only dependency is on racc/parser, which has been part of ruby-core for long enough that I don't have to worry about it.
The documentation is sparse, but there were enough blog posts touching on the topic that I was able to get what I needed working.
In case you're curious enough to have read this far, here is my example .rex specification:
require 'generator'
class OptionSpecsLexer
rules
\d+(\.\d*)? { [:number, text] }
\w+: { [:syntax_hash_key, ":#{text[0, text.length - 1]} =>"] }
\:\w+ { [:symbol, text] }
\w+\( { [:funcall_open_paren, text] }
\w+ { [:identifier, text] }
\"(\\.|[^\\"])*\" { [:string, text] }
=> { [:rocket, text] }
, { [:comma, text] }
\{ { [:open_curly, text] }
\} { [:close_curly, text] }
\( { [:open_paren, text] }
\) { [:close_paren, text] }
\[ { [:open_square, text] }
\] { [:close_square, text] }
\\\s+ { }
\n { [:eol, text] }
\s+ { }
inner
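  # Wrap next_token in a Generator (from the 'generator' library required above)
  # so the token stream can be walked with each.
  def enumerate_tokens
    Generator.new { |token|
      loop {
        t = next_token
        break if t.nil?
        token.yield(t)
      }
    }
  end

  # Re-emit each token's text, prefixed with a single space.
  def normalize(source)
    scan_setup source
    out = ""
    enumerate_tokens.each do |token|
      out += ' ' + token[1]
    end
    out
  end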
end
This lexer understands just enough ruby syntax to preprocess specifications written in my vMATCodeMonkey DSL, replacing the new keyword-style hash key syntax with the old rocket operator syntax. [This was done to allow vMATCodeMonkey to work on un-updated Mac OS X 10.8, which still ships with a deprecated version of ruby.]
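To make the rewriting step concrete without needing rexical installed, here is a rough stand-alone sketch of the same idea using StringScanner from the standard library. It is not the code rexical generates: SKETCH_RULES, sketch_tokens and sketch_normalize are hypothetical names of mine, and the table only carries a handful of the rules above (numbers, for instance, simply fall through to the identifier rule here). The rules are tried top to bottom at the current position and the first match wins, which is why the hash-key rule has to come before the plain identifier rule.

```ruby
require 'strscan'

# Illustrative subset of the rules from the .rex spec, in the same order.
# Each entry pairs a pattern with a block that builds the token;
# a nil block means "match and discard" (like the empty { } actions).
SKETCH_RULES = [
  [/\w+:/,            lambda { |t| [:syntax_hash_key, ":#{t[0, t.length - 1]} =>"] }],
  [/:\w+/,            lambda { |t| [:symbol, t] }],
  [/\w+\(/,           lambda { |t| [:funcall_open_paren, t] }],
  [/\w+/,             lambda { |t| [:identifier, t] }],
  [/"(\\.|[^\\"])*"/, lambda { |t| [:string, t] }],
  [/=>/,              lambda { |t| [:rocket, t] }],
  [/,/,               lambda { |t| [:comma, t] }],
  [/\(/,              lambda { |t| [:open_paren, t] }],
  [/\)/,              lambda { |t| [:close_paren, t] }],
  [/\s+/,             nil]
]

# Hypothetical helper: returns [type, text] pairs for the whole input.
def sketch_tokens(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    # First rule whose pattern matches at the current position wins.
    pattern, action = SKETCH_RULES.find { |pat, _| scanner.match?(pat) }
    raise "can not match: '#{scanner.rest}'" if pattern.nil?
    text = scanner.scan(pattern)
    tokens << action.call(text) if action
  end
  tokens
end

# Same shape as normalize above: every token's text, prefixed with a space.
def sketch_normalize(source)
  sketch_tokens(source).map { |_type, text| " #{text}" }.join
end

puts sketch_normalize('foo(bar: 1, :baz => "x")')
```

Running it prints ` foo( :bar => 1 , :baz => "x" )` (with a leading space): the keyword-style key `bar:` comes out as `:bar =>`, while the existing rocket-style pair passes through untouched.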