Question

Does anyone have any insight into why the named group ref_id in regex1 captures Some address: loststreet 4 in the match below?

I want it to be just loststreet 4 and I don't understand why it's not. The code below is from an IRB session.

I've already checked the encoding of the strings, and adding the u flag makes no difference:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos
# => "Burp\nFirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4\nZip code:\n" 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4"> 

str1.encoding
# => #<Encoding:UTF-8> 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/miu
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4"> 

Solution

Because the optional \s? after "Ref person:" can match a newline \n when the field is empty, so the ref_id capture starts on the next line. Replace it with [^\S\n]? (any whitespace except a newline), and do the same for every \s? that must not match a newline.

(Note that you use .* after each field to skip to the next one. Replace it with the lazy .*? to avoid excessive backtracking.)
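
As a rough sketch, here are both suggestions applied to the pattern from the question. The names regex_fixed, newline_is_space and newline_excluded are only illustrative, and the heredoc is the same sample text as in the question:

```ruby
str1 = <<EOS
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
EOS

# \s matches a newline, which is what lets the original pattern
# slip onto the following line when a field is empty.
newline_is_space = "\n".match?(/\s/)        # => true
# [^\S\n] means "whitespace, but not a newline".
newline_excluded = "\n".match?(/[^\S\n]/)   # => false

# Same pattern as regex1, with \s? replaced by [^\S\n]? and .* by the lazy .*?
regex_fixed = /FirstName:[^\S\n]?(?<name>[^\n]*).*?Ref person:[^\S\n]?(?<ref_id>[^\n]*).*?Some other address: (?<other>[^\n]*)/mi

m = str1.match(regex_fixed)
m[:name]    # => "Al Bundy"
m[:ref_id]  # => ""
m[:other]   # => "loststreet 4"
```

With this change an empty field yields an empty string instead of borrowing the contents of the next line.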

OTHER TIPS

Use MatchData#[] to get specific group string:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
matched = str1.match(regex1)

matched['name'] # => "Al Bundy"
matched['other'] # => "loststreet 4"
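
For reference, a small sketch of the other ways to read captures from a MatchData object. The sample string and the variable m are made up for illustration, and this only changes how the groups are read, not what the pattern captures:

```ruby
m = "FirstName: Al Bundy".match(/FirstName:\s?(?<name>[^\n]*)/i)

m['name']         # => "Al Bundy"  (string key)
m[:name]          # => "Al Bundy"  (symbol key)
m[1]              # => "Al Bundy"  (named groups are still numbered)
m.names           # => ["name"]
m.named_captures  # => {"name"=>"Al Bundy"}
```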

One of the objectives of writing code is to make it maintainable, which means making it easy to read and understand for whoever takes care of that code later.

Regular expressions are often a maintenance nightmare, and in my experience can often be reduced in their complexity, or replaced entirely, to come up with code that is just as useful. Parsing this sort of text is a great example of when to not use a complex pattern.

I'd do it this way:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

def get_value(s)
  _, value = s.split(':')
  value.strip if value
end

rows = str1.split("\n")
firstname          = get_value(rows[1]) # => "Al Bundy"
ref_person         = get_value(rows[2]) # => nil
some_address       = get_value(rows[3]) # => "loststreet 4"
some_other_address = get_value(rows[4]) # => "loststreet 4"
zip_code           = get_value(rows[5]) # => nil

Split the text into rows, and pick out the data needed.

That can be reduced using map into something more succinct:

firstname, ref_person, some_address, some_other_address, zip_code = rows[1..-1].map{ |s| get_value(s) }
firstname          # => "Al Bundy"
ref_person         # => nil
some_address       # => "loststreet 4"
some_other_address # => "loststreet 4"
zip_code           # => nil

If you absolutely must use a regex, simplify it and isolate its task. It's possible to write a regex that spans multiple lines, skipping and capturing text as it goes, but getting there is painful. Such a pattern becomes more and more fragile as it grows and will likely break if the incoming text changes. Reducing its complexity makes your code more robust:

def get_value(s)
  s[/^([^:]+):(.*)/]
  name, value = $1, $2
  value.strip! if value

  [name.downcase.tr(' ', '_'), value]
end

data_hash = Hash[
  str1.split("\n").select{ |s| s[':'] }.map{ |s| get_value(s) }
]
data_hash # => {"firstname"=>"Al Bundy", "ref_person"=>"", "some_address"=>"loststreet 4", "some_other_address"=>"loststreet 4", "zip_code"=>""}

It looks like your regexp has no group for the "Some address:" line, so ref_id ends up capturing that line. Try adding one:

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some address:\s?(?<address>[^\n]*).*Some other address:\s?(?<other>[^\n]*)/mi

Using extended mode (the x flag) makes the pattern much easier to read:

regex1 = %r{
  FirstName:\s?(?<name>[^\n]*).*
  Ref\ person:\s?(?<ref_id>[^\n]*).*
  Some\ address:\s?(?<address>[^\n]*).*
  Some\ other\ address:\s?(?<other>[^\n]*)
}xmi

Just make sure to escape literal spaces, because extended mode ignores unescaped whitespace in the pattern.
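
A quick way to see why the escaping matters under the x flag (the variable names here are only illustrative):

```ruby
# Under /x, an unescaped space in the pattern is dropped,
# so /Ref person:/x behaves like /Refperson:/.
unescaped = /Ref person:/x.match?("Ref person:")   # => false
squashed  = /Ref person:/x.match?("Refperson:")    # => true

# Escaping the space keeps it as a literal character.
escaped   = /Ref\ person:/x.match?("Ref person:")  # => true
```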

Licensed under: CC-BY-SA with attribution