How to better parse random tracklistings

https://stackoverflow.com/questions/18672955

28-06-2022

Problem

I am interested in parsing tracklistings in a variety of formats, containing lines such as:

artist - title
artist-title
artist / title
artist - "title"
1. artist - title
0:00 - artist - title
05 artist - title    12:20
artist - title [record label]

These are text files which generally contain one tracklist but which may also contain other stuff which I don't want to parse, so the regex ideally needs to be strict enough to not include lines which aren't tracklistings, although really this is probably a question of balance.

I am having some success with the following regex:

import re

simple = re.compile(r"""
^
(?P<time>\d?\d:\d\d)? # track time in 00:00 or 0:00
(
(?P<number>\d{1,2})   # track number as 0 01
[^\w]                 # not followed by word
)?
[-.)]?                # possibly followed by something
"?
(?P<artist>[^"@#]+)   # artist anything except "@#
"?
\s[-/\u2013]\s        # dash surrounded by spaces, possibly unicode
"?
(?P<title>[^"@#]+?)   # title, not greedy
"?
(?P<label>\[[^\]]+\])?   # label, e.g. [something Records]
(//|&\#13;)?          # remove some weird endings, i.e. ascii carriage return
$
""", re.VERBOSE)

However, it's a bit horrible; I only started learning regex very recently. It has problems with lines like this:

an artist-a title           # couldn't find ' - '
2 Croozin' - 2 Pumpin'      # mistakes 2 as track number
05 artist - title  12:20    # doesn't work at all

In the case of 2 Croozin' - 2 Pumpin', the only way of telling that 2 isn't a track number is to take into account the surrounding context, i.e. look at the other tracks. (I forgot to mention this - these tracks are usually part of a tracklist)

So my question is, how can I improve this in general? Some ideas I've had are:

Use several regexes, starting with very specific ones and falling back to less specific ones until a line parses properly.
Dump regex and use a proper parser such as pyparsing or parsley, which might be able to make better use of surrounding context. However, I know absolutely nothing about parsing.
Use lookahead/lookbehind in a multiline regex to look at previous/next lines.
Use separate regexes to get time, track number, artist, and title.
Give up and do something less pointless.
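As a rough illustration of the first and fourth ideas combined, here is a minimal sketch that peels off one optional field at a time with small regexes, then tries a strict separator before a loose one. All names here (TIME_RE, LABEL_RE, NUMBER_RE, SEPARATORS, parse_line) are hypothetical, and the patterns are assumptions based only on the sample lines above, not a tested solution for real tracklists.

```python
import re

# Hypothetical helper patterns; each one recognises a single optional field.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")        # 0:00 or 12:20, anywhere
LABEL_RE = re.compile(r"\[([^\]]+)\]\s*$")        # [record label] at the end
NUMBER_RE = re.compile(r"^\s*(\d{1,2})[.)]?\s+")  # 1. / 05 / 3) at the start
# Separators tried from most specific to least specific.
SEPARATORS = [re.compile(r"\s+[-/\u2013]\s+"), re.compile(r"[-/\u2013]")]


def parse_line(line):
    """Return a dict of fields, or None if no artist/title split is found."""
    fields = {"time": None, "number": None, "label": None}
    rest = line.strip()

    m = LABEL_RE.search(rest)
    if m:
        fields["label"] = m.group(1)
        rest = rest[:m.start()].rstrip()

    m = TIME_RE.search(rest)
    if m:
        fields["time"] = m.group(0)
        # Cut the time out and drop any dash or space left dangling at the ends.
        rest = (rest[:m.start()] + rest[m.end():]).strip(" -")

    m = NUMBER_RE.match(rest)
    if m:
        fields["number"] = int(m.group(1))
        rest = rest[m.end():]

    for sep in SEPARATORS:
        parts = sep.split(rest, maxsplit=1)
        if len(parts) == 2 and all(p.strip() for p in parts):
            fields["artist"] = parts[0].strip(' "')
            fields["title"] = parts[1].strip(' "')
            return fields
    return None
```

With this sketch, "05 artist - title    12:20" and "an artist-a title" both split into artist and title, but "2 Croozin' - 2 Pumpin'" is still read as track number 2, and the loose separator will also accept ordinary hyphenated prose, so the strictness trade-off remains.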

I can validate that it has parsed properly (to some degree) doing things such as making sure artists and titles are all different, tracks are in order, times are sensible, and possibly even check artists/titles/labels do actually exist.
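A minimal sketch of that kind of sanity check might look like the following. The function name validate and the dict keys ("artist", "title", "number", "time") are assumptions made for illustration, and "sensible" times are reduced here to a seconds part below 60; checking that artists or labels really exist would need an external source and is left out.

```python
def validate(tracks):
    """Return a list of problems found; an empty list means 'plausible'.

    `tracks` is assumed to be a list of dicts with "artist" and "title",
    plus optional "number" (int) and "time" ("m:ss" string).
    """
    problems = []
    seen = set()
    for track in tracks:
        # Compare case-insensitively so "A - One" and "a - one" count as equal.
        key = (track["artist"].casefold(), track["title"].casefold())
        if key in seen:
            problems.append(
                "duplicate track: %s - %s" % (track["artist"], track["title"])
            )
        seen.add(key)

        stamp = track.get("time")
        if stamp and int(stamp.split(":")[1]) >= 60:
            problems.append("implausible time: %s" % stamp)

    numbers = [t["number"] for t in tracks if t.get("number") is not None]
    if any(later <= earlier for earlier, later in zip(numbers, numbers[1:])):
        problems.append("track numbers are not strictly increasing")
    return problems
```

A parse that yields a non-empty problem list could then be retried with a different, looser or stricter, pattern.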

Solution

At best, you are dealing with a context-sensitive grammar, which moves you out of the realm of what regexps can handle alone and into parsing.

Even if your parser is implemented as regexps and a pile of heuristics, it is still a parser, and techniques from parsing will be valuable. Some languages have a chicken-and-egg problem: I'd like to call "The Artist Formerly Known as the Artist Formerly Known as Prince" an artist and not a track title, but until I see it a second time, I don't have the context to make that determination.
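One example of such a context heuristic, for the "2 Croozin' - 2 Pumpin'" case from the question: decide once per tracklist, rather than once per line, whether leading digits are track numbers. This is only a sketch under the assumption that genuine track numbers appear on every line and count up by one; the names LEADING_NUMBER and leading_numbers_are_track_numbers are hypothetical.

```python
import re

LEADING_NUMBER = re.compile(r"^\s*(\d{1,2})\b")


def leading_numbers_are_track_numbers(lines):
    """Guess from the whole list whether leading digits are track numbers."""
    numbers = []
    for line in lines:
        m = LEADING_NUMBER.match(line)
        if m is None:
            # At least one track has no leading number, so the digits
            # on the other lines are probably part of artist names.
            return False
        numbers.append(int(m.group(1)))
    # Assumption: real track numbers increase by exactly one per line.
    return len(numbers) > 1 and all(
        b - a == 1 for a, b in zip(numbers, numbers[1:])
    )
```

If this returns False, the per-line parser can be told not to strip a leading number at all. It is a heuristic, not a grammar: partial tracklists with gaps in the numbering would be rejected by it.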

To amplify @JonClements' comment: if the files do contain internal metadata, there are plenty of tools to extract and manipulate that information. Even if internal metadata just increases the probability that "A Question of Balance" is an album title, you'll need that information.

Steal as many design approaches as you can: look for open source tag manipulators (e.g. EasyTAG) and see how they do it. While you are learning, you might just find a tool that does your job for you.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow