Question

I'm trying to parse iCalendar (RFC2445) input using a regex.

Here's a [simplified] example of what the input looks like:

BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT

I'd like to get an array of matches: the "outer" match is each VEVENT block and the inner matches are each of the field:value pairs.

I've tried variants of this:

BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT

But given the input above, the result seems to have only ONE field for each matching VEVENT, despite the +? on the capture group:

**Match 1**
field   def:456
name    def
value   456

**Match 2**
field   ghi:789
name    ghi
value   789

In the first match, I would have expected TWO fields: the abc:123 and the def:456 matches...

I'm sure this is a newbie mistake (since I seem to perpetually be a newbie when it comes to regexes...) - but maybe you can point me in the right direction?

Thanks!


Solution

You need to split your regex into two: one that matches a VEVENT block and one that matches the name/value pairs. You can then use a nested scan to find all occurrences, e.g.

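str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # $~ holds the MatchData for the current VEVENT block
  $~[:vevent].scan(/(?<field>(?<name>\S+?):\s*(?<value>\S+))/) do
    # inside the inner block, $~ refers to the current name/value match
    p $~[:field], $~[:name], $~[:value]
  end
end

where str is your input. This outputs:

"abc:123"
"abc"
"123"
"def:456"
"def"
"456"
"ghi:789"
"ghi"
"789"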

If you want to make the code more readable, I suggest you require 'English' (note the capital E) and replace $~ with $LAST_MATCH_INFO.
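As a rough sketch of how that might look, here is the nested scan written with $LAST_MATCH_INFO and collecting the pairs into a nested array instead of printing them. The events and fields variables are names chosen only for this illustration, and the value group is written as a greedy \S+ so that whole values are captured.

```ruby
require 'English' # capital E; provides $LAST_MATCH_INFO as an alias for $~

# Sample input from the question.
str = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\nBEGIN:VEVENT\nghi:789\nEND:VEVENT\n"

events = []
# Outer scan: one iteration per VEVENT block.
str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  fields = []
  # Inner scan: one iteration per name:value pair inside the block.
  $LAST_MATCH_INFO[:vevent].scan(/(?<name>\S+?):\s*(?<value>\S+)/) do
    fields << [$LAST_MATCH_INFO[:name], $LAST_MATCH_INFO[:value]]
  end
  events << fields
end

p events
# => [[["abc", "123"], ["def", "456"]], [["ghi", "789"]]]
```

Each outer element corresponds to one VEVENT and each inner element to one field, which is roughly the shape the question asks for.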

OTHER TIPS

Use the icalendar gem. See the Parsing iCalendars section for more info.

You need a nested scan.

string.scan(/^BEGIN:VEVENT\n(.*?)\nEND:VEVENT$/m).each.with_index do |item, i|
  puts
  puts "**Match #{i+1}**"
  item.first.scan(/^(.*?):(.*)$/) do |k, v|
    puts "field".ljust(7)+"#{k}:#{v}"
    puts "name".ljust(7)+"#{k}"
    puts "value".ljust(7)+"#{v}"
  end
end

will give:

**Match 1**
field   abc:123
name    abc
value   123
field   def:456
name    def
value   456

**Match 2**
field   ghi:789
name    ghi
value   789

I think the problem is that Ruby's MatchData object, which is what the regexp returns its results in, doesn't have any provision for more than one value under the same group name. When a capture group repeats, each iteration overwrites the previous one, so only the last field survives.
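A minimal sketch of that behaviour, running the question's own pattern against a single block (the variable names here are only for illustration):

```ruby
# One VEVENT block with two fields, as in the question.
block = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT"

# The question's pattern: the <field> group repeats, but MatchData stores
# only the text captured by the group's final iteration.
pattern = /BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT/

match = pattern.match(block)
last_field = match[:field]
last_name  = match[:name]
last_value = match[:value]

# Only the second field is reported; "abc:123" was matched but not kept.
# Note that <field> also includes the trailing newline from the pattern.
p last_field.chomp, last_name, last_value
```

The whole block does match, so the regex itself is consuming both fields; it is only the capture bookkeeping that drops the first one. That is why a second scan over each block is needed.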

Ruby has a seldom-used method called slice_before that fits this need well:

'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).to_a

Results in:

[["BEGIN:VEVENT", "abc:123", "def:456", "END:VEVENT"],
 ["BEGIN:VEVENT", "ghi:789", "END:VEVENT"]]    

From there it's simple to grab just the inner array elements:

'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).map{ |a| a[1 .. -2] }

Which is:

[["abc:123", "def:456"], ["ghi:789"]]

And, from there it's trivial to break up each resulting string using map and split(':').
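That last step might look something like the sketch below. One deviation from the plain split(':') mentioned above: it passes a limit of 2, on the assumption that real values (URLs, for instance) may themselves contain colons. The events variable is just an illustrative name.

```ruby
# Sample input from the question, split into lines.
lines = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\nBEGIN:VEVENT\nghi:789\nEND:VEVENT".split("\n")

events = lines
  .slice_before(/^BEGIN:VEVENT/)      # start a new chunk at each BEGIN line
  .map { |block| block[1..-2] }       # drop the BEGIN and END lines
  .map { |fields| fields.map { |field| field.split(':', 2) } } # name, value

p events
# => [[["abc", "123"], ["def", "456"]], [["ghi", "789"]]]
```

If a hash per event is more convenient, calling to_h on each inner array of pairs should do it, provided no field name repeats within an event.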

Don't be seduced by the siren call of regular expressions trying to do everything. They're very powerful and convenient in their particular place, but often there are simpler and easier to maintain solutions.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow