Regex parsing of iCalendar (Ruby regex)
Question
I'm trying to parse iCalendar (RFC2445) input using a regex.
Here's a [simplified] example of what the input looks like:
BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT
I'd like to get an array of matches: the "outer" match is each VEVENT block and the inner matches are each of the field:value pairs.
I've tried variants of this:
BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT
But given the input above, the result seems to have only ONE field for each matching VEVENT, despite the +? on the capture group:
**Match 1**
field def:456
name def
value 456
**Match 2**
field ghi:789
name ghi
value 789
In the first match, I would have expected TWO fields: the abc:123 and the def:456 matches...
I'm sure this is a newbie mistake (since I seem to perpetually be a newbie when it comes to regexes...) - but maybe you can point me in the right direction?
Thanks!
Solution
You need to split your regex into two: one matching a VEVENT block and one matching the name/value pairs. You can then use a nested scan to find all occurrences, e.g.
str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # $~ holds the MatchData of the most recent match
  $~[:vevent].scan(/(?<field>(?<name>\S+?):\s*(?<value>\S+))/) do
    p $~[:field], $~[:name], $~[:value]
  end
end
where str is your input. Note that the value group must be greedy (\S+); a lazy \S+? at the end of a pattern matches only a single character. This outputs:
"abc:123"
"abc"
"123"
"def:456"
"def"
"456"
"ghi:789"
"ghi"
"789"
If you want to make the code more readable, I suggest you require 'English' (note the capital E) and replace $~ with $LAST_MATCH_INFO.
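For illustration, here is a minimal sketch of the same nested scan written with the English library. The sample input and the variable names str and pairs are assumptions made for this example, and the value group is greedy so that complete values are collected:

```ruby
require 'English' # standard library; the file name is capitalized

# Sample input taken from the question.
str = <<~ICS
  BEGIN:VEVENT
  abc:123
  def:456
  END:VEVENT
  BEGIN:VEVENT
  ghi:789
  END:VEVENT
ICS

pairs = []
str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # $LAST_MATCH_INFO is the readable alias for $~
  body = $LAST_MATCH_INFO[:vevent]
  body.scan(/(?<name>\S+?):\s*(?<value>\S+)/) do
    pairs << [$LAST_MATCH_INFO[:name], $LAST_MATCH_INFO[:value]]
  end
end

p pairs
# => [["abc", "123"], ["def", "456"], ["ghi", "789"]]
```

Collecting the pairs into an array instead of printing them makes the result easier to reuse later.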
OTHER TIPS
Use the icalendar gem. See the Parsing iCalendars section for more info.
You need a nested scan.
string.scan(/^BEGIN:VEVENT\n(.*?)\nEND:VEVENT$/m).each.with_index do |item, i|
puts
puts "**Match #{i+1}**"
item.first.scan(/^(.*?):(.*)$/) do |k, v|
puts "field".ljust(7)+"#{k}:#{v}"
puts "name".ljust(7)+"#{k}"
puts "value".ljust(7)+"#{v}"
end
end
will give:
**Match 1**
field abc:123
name abc
value 123
field def:456
name def
value 456
**Match 2**
field ghi:789
name ghi
value 789
I think the problem is that the Ruby MatchData object, which is what the regexp returns its results in, holds only one value per capture group. When a group repeats, each iteration overwrites the previous one, so only the last field:value pair survives.
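A short sketch that reproduces the behavior with the pattern from the question, applied to the first VEVENT block only. The variable names input, pattern and m are chosen just for this example:

```ruby
# First VEVENT block from the question.
input = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\n"

# The original pattern: the "field" group repeats once per line.
pattern = /BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT/

m = pattern.match(input)

# The group matched twice, but MatchData keeps only the last iteration.
p m[:field] # => "def:456\n"
p m[:name]  # => "def"
p m[:value] # => "456"
```

The abc:123 line was matched by the first iteration of the group, but nothing in m gives access to it afterwards.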
Ruby has a seldom-used method called slice_before that fits this need well:
'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).to_a
Results in:
[["BEGIN:VEVENT", "abc:123", "def:456", "END:VEVENT"],
["BEGIN:VEVENT", "ghi:789", "END:VEVENT"]]
From there it's simple to grab just the inner array elements:
'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).map{ |a| a[1 .. -2] }
Which is:
[["abc:123", "def:456"], ["ghi:789"]]
And from there it's trivial to break up each resulting string using map and split(':').
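One way that last step might look, as a sketch: each block becomes a hash of name => value. The names ics and events are made up for this example, and the limit of 2 passed to split is my own addition so that values containing a colon (a URL, say) are not cut apart:

```ruby
# Sample input taken from the question.
ics = <<~ICS
  BEGIN:VEVENT
  abc:123
  def:456
  END:VEVENT
  BEGIN:VEVENT
  ghi:789
  END:VEVENT
ICS

events = ics
  .split("\n")
  .slice_before(/^BEGIN:VEVENT/)       # start a new group at each BEGIN line
  .map { |lines| lines[1..-2] }        # drop the BEGIN and END lines
  .map { |fields| fields.map { |f| f.split(':', 2) }.to_h }

p events
# => [{"abc"=>"123", "def"=>"456"}, {"ghi"=>"789"}]
```

A hash per event assumes each name occurs at most once within a block; if names can repeat, keeping the arrays of pairs is safer.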
Don't be seduced by the siren call of regular expressions trying to do everything. They're very powerful and convenient in their particular place, but often there are simpler and easier to maintain solutions.