Question

Ruby's regex literals accept the options i, m, and x, which are documented. Besides these, they also accept a wider variety of options. Here is an inventory of the options that seem to be allowed:

//e # => //
//i # => //i  ignore case
//m # => //m  multiline
//n # => //n
//o # => //
//s # => //
//u # => //
//x # => //x  extended
  • What do they do? Are some of them related to encoding? What about the others?
  • If they indicate an encoding, what happens when more than one encoding is specified?
  • Other option letters raise an unknown regexp options error, but the ones listed here do not. If the answer to the previous questions is that they do nothing, why are these particular options allowed?
  • Why is n reflected in the inspection while the others are not? Do the ones whose inspection shows no difference actually differ?

If there is documentation, a link to it would be appreciated.

Solution

Regular-expression modifiers:

Regular expression literals may include an optional modifier to control various aspects of matching. The modifier is specified after the second slash character, as shown previously, and may be one of these characters:

Modifier    Description
i           Ignore case when matching text.
o           Perform #{} interpolations only once, the first time the regexp literal is evaluated.
x           Ignores whitespace and allows comments in regular expressions.
m           Matches multiple lines, recognizing newlines as normal characters.
u,e,s,n     Interpret the regexp as Unicode (UTF-8), EUC, SJIS, or ASCII, respectively.
            If none of these modifiers is specified, the regular expression is
            assumed to use the source encoding.

source
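
The non-encoding modifiers in the table can be tried out directly. The following is a minimal sketch, assuming standard MRI behavior; the methods year_month, sources_with_o, and sources_without_o are hypothetical helpers written only for this illustration, not part of Ruby.

```ruby
# i: letter case is ignored.
p("RUBY" =~ /ruby/i)      # => 0

# m: "." also matches a newline.
p("a\nb" =~ /a.b/)        # => nil
p("a\nb" =~ /a.b/m)       # => 0

# x: whitespace and comments inside the pattern are ignored.
# year_month is a hypothetical helper, used only for illustration.
def year_month(text)
  pattern = /
    (\d{4}) - (\d{2})   # year, then month
  /x
  match = pattern.match(text)
  match && [match[1], match[2]]
end
p year_month("2013-05")   # => ["2013", "05"]

# o: the interpolation runs only the first time the literal is evaluated,
# so the second iteration reuses the regexp built in the first one.
def sources_with_o
  counter = 0
  2.times.map { /#{counter += 1}/o.source }
end

# Without o, the interpolation runs on every evaluation.
def sources_without_o
  counter = 0
  2.times.map { /#{counter += 1}/.source }
end
p sources_with_o          # => ["1", "1"]
p sources_without_o       # => ["1", "2"]
```

Note that o does not change how the pattern matches; it only affects when the literal is built, which is consistent with it not appearing in the inspection.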

Note: the description above comes with a proviso. See sawa's answer for details.

OTHER TIPS

I found some corrections and additions to guido's answer.

  • When no encoding is specified, the regular expression is assumed to use the source encoding (which is UTF-8 in Ruby 2.0 if there is no magic comment at the beginning of the file), unless the regex consists only of ASCII characters, in which case the regex is converted to US-ASCII.

  • When more than one encoding option is specified, the last one takes effect.

    //eu.encoding # => #<Encoding:UTF-8>
    //ue.encoding # => #<Encoding:EUC-JP>
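
The two points above, and the question of why only n shows up in the inspection, can be checked with a small probe. This is a rough sketch, assuming a reasonably recent MRI; encoding_info is a hypothetical helper, not a Ruby method. It relies only on Regexp#encoding, Regexp#fixed_encoding?, Regexp#options, and the Regexp::NOENCODING constant.

```ruby
# encoding_info is a hypothetical helper, used only for illustration.
def encoding_info(regexp)
  {
    encoding:   regexp.encoding.name,
    fixed:      regexp.fixed_encoding?,
    noencoding: (regexp.options & Regexp::NOENCODING) != 0,
    inspect:    regexp.inspect
  }
end

p encoding_info(/abc/)    # ASCII-only, no flag: US-ASCII, not fixed
p encoding_info(/abc/u)   # UTF-8, fixed; the inspection shows no flag
p encoding_info(/abc/n)   # NOENCODING bit set; the inspection shows the n
p encoding_info(/abc/ue)  # two encoding flags: the last one (e) wins
```

So the flags whose inspection looks identical can still differ: u, e, and s fix the encoding of the regexp, which is visible through encoding and fixed_encoding? rather than through inspect.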
    
Licensed under: CC-BY-SA with attribution