I am interested in parsing tracklistings in a variety of formats, containing lines such as:
artist - title
artist-title
artist / title
artist - "title"
1. artist - title
0:00 - artist - title
05 artist - title 12:20
artist - title [record label]
These are text files that generally contain one tracklist but may also contain other content that I don't want to parse. Ideally the regex should be strict enough to reject lines that aren't tracklist entries, although this is really a question of balance.
I am having some success with the following regex:
import re

simple = re.compile(r"""
    ^
    (?P<time>\d?\d:\d\d)?       # track time as 00:00 or 0:00
    (
        (?P<number>\d{1,2})     # track number, e.g. 0 or 01
        [^\w]                   # followed by a non-word character
    )?
    [-.)]?                      # possibly followed by punctuation
    "?
    (?P<artist>[^"@#]+)         # artist: anything except "@#
    "?
    \s[-/\u2013]\s              # dash or slash surrounded by spaces, possibly a unicode en dash
    "?
    (?P<title>[^"@#]+?)         # title, not greedy
    "?
    (?P<label>\[\w+\])?         # label, i.e. [something Records]
    (//|&\#13;)?                # tolerate some weird endings, e.g. an HTML-escaped carriage return
    $
    """, re.VERBOSE)
However, it's a bit horrible, as I only started learning regex very recently. It has problems with lines like these:
an artist-a title # couldn't find ' - '
2 Croozin' - 2 Pumpin' # mistakes 2 as track number
05 artist - title 12:20 # doesn't work at all
In the case of 2 Croozin' - 2 Pumpin', the only way to tell that 2 isn't a track number is to take the surrounding context into account, i.e. look at the other tracks. (I forgot to mention that these lines are usually part of a tracklist.)
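To illustrate what I mean by context, here is a rough sketch. The helper name and the rule itself are just my assumptions, not something my parser does yet: a leading number only counts as a track number if every line in the block starts with one and the numbers count up by one.

```python
import re

# One or two leading digits that are not the start of a time (0:00) or a longer number.
LEADING_NUMBER = re.compile(r"^(\d{1,2})(?![:\d])")


def leading_numbers_are_track_numbers(lines):
    """Return True only if every line starts with a number and they increase by one."""
    numbers = []
    for line in lines:
        match = LEADING_NUMBER.match(line.strip())
        if match is None:
            # At least one line has no leading number, so assume none are track numbers.
            return False
        numbers.append(int(match.group(1)))
    # A single line gives no context, so it is not enough to decide.
    return len(numbers) > 1 and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
```

With this rule, a lone "2 Croozin' - 2 Pumpin'" in the middle of an unnumbered list would keep its 2 as part of the artist name, at the cost of rejecting lists with a missing or repeated track number.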
So my question is, how can I improve this in general? Some ideas I've had are:
- Use several regexes, starting with very specific ones and falling back to less specific ones until a line parses properly.
- Dump regex and use a proper parser such as pyparsing or parsley, which might be able to make better use of surrounding context. However, I know absolutely nothing about parsing.
- Use lookahead/lookbehind in a multiline regex to look at the previous/next lines.
- Use separate regexes to get the time, track number, artist and title.
- Give up and do something less pointless.
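As a minimal sketch of the first idea (the pattern list, its ordering and the `parse_line` name are my own assumptions, and it only covers some of the formats above), the patterns could be tried in order from most to least specific:

```python
import re

SEP = r"\s+[-/\u2013]\s+"   # dash, slash or en dash surrounded by spaces
TIME = r"\d?\d:\d\d"

PATTERNS = [
    # "1. artist - title" or "05 artist - title 12:20"
    re.compile(r"^(?P<number>\d{1,2})[.)]?\s+(?P<artist>.+?)" + SEP
               + r"(?P<title>.+?)(?:\s+(?P<time>" + TIME + r"))?$"),
    # "0:00 - artist - title"
    re.compile(r"^(?P<time>" + TIME + r")" + SEP + r"(?P<artist>.+?)" + SEP + r"(?P<title>.+)$"),
    # "artist - title" or "artist / title"
    re.compile(r"^(?P<artist>.+?)" + SEP + r"(?P<title>.+)$"),
    # "artist-title" with no spaces around the dash (least specific, tried last)
    re.compile(r"^(?P<artist>[^-]+)-(?P<title>[^-]+)$"),
]


def parse_line(line):
    """Return a dict of the fields found by the first matching pattern, or None."""
    line = line.strip()
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Drop optional groups that did not take part in the match.
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None
```

This handles "05 artist - title 12:20" and "an artist-a title", but it still reads the 2 in "2 Croozin' - 2 Pumpin'" as a track number, so on its own it doesn't solve the context problem.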
I can validate that it has parsed properly (to some degree) by checking that artists and titles are all different, tracks are in order and times are sensible, and possibly even by checking that the artists/titles/labels actually exist.
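A rough sketch of the cheaper checks, assuming each parsed track is a dict with "artist" and "title" keys and an optional "number" key (that shape and the function name are hypothetical). It leaves out the time checks and the lookups against real artists or labels:

```python
def looks_like_tracklist(tracks):
    """Sanity-check a list of parsed tracks (dicts with 'artist', 'title', optional 'number')."""
    if len(tracks) < 2:
        # A single line is not enough to call it a tracklist.
        return False

    # Every (artist, title) pair should be different, ignoring case.
    pairs = [(t["artist"].lower(), t["title"].lower()) for t in tracks]
    if len(set(pairs)) != len(pairs):
        return False

    # Track numbers, where present, should appear in ascending order.
    numbers = [int(t["number"]) for t in tracks if t.get("number")]
    if numbers != sorted(numbers):
        return False

    return True
```

A failed check could then be used to retry the block with a different set of patterns rather than to reject it outright.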