Regular expression that takes <...> as one item in "foo bar <hello world> and so on" (Goal: Simple music/lilypond parsing)

StackOverflow https://stackoverflow.com/questions/14801122

  •  09-03-2022

Question

I am using the re module in Python(3) and want to substitute (re.sub(regex, replace, string)) a string in the following format

"foo <bar e word> f ga <foo b>" 

to

"#foo <bar e word> #f #ga <foo b>"

or even

"#foo #<bar e word> #f #ga #<foo b>" 

But I can't isolate single words from word boundaries within a <...> construct.

Help would be nice!

P.S 1

The whole story is a musical one: I have strings in the Lilypond format (or rather, a subset of the very simple core format, just notes and durations) and want to convert them to Python pairs of int(duration), list(of pitch strings). Performance is not important, so I can convert them back and forth, iterate over Python lists, split strings and join them again, etc. But for the above problem I did not find an answer.

Source String

"c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"

should result in

[
(4, ["c'"]),
(8, ["d"]),
(16, ["e'", "g'"]),
(4, ["fis'"]),
(0, ["a,,"]),
(0, ["g", "b'"]),
(1, ["c''"]),
]

The basic format is String+Number, like so: e4 bes16

  • the string can consist of multiple, at least one, [a-zA-Z] chars
  • the string is followed by zero or more digits: e bes g4 c16
  • the string is followed by zero or more ' or , (not combined): e' bes, f'''2 g,,4
  • the string can be replaced by a list of strings delimited by < and >; the number comes directly after the >, with no space allowed

P.S. 2

The goal is NOT to create a Lilypond parser. It is really just for very short snippets with no additional functionality and no extensions to insert notes. If this does not work I would go for another (simplified) format like ABC. So anything that has to do with Lilypond ("Run it through lilypond, let it give out the music data in Scheme, parse that") or its toolchain is certainly NOT the answer to this question. The package is not even installed.


Solution 2

Your first question can be answered in this way:

>>> import re
>>> t = "foo <bar e word> f ga <foo b>"
>>> t2 = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()
>>> t2
'#foo #<bar e word> #f #ga #<foo b>'

I added lstrip() to remove the single space that occurs before the result of this pattern. If you want to go with your first option, you could simply replace #< with <.
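The replacement mentioned above is a plain str.replace on the result. A minimal sketch, repeating the substitution so that it runs on its own; the name t1 is only illustrative, and it assumes the input never contains a literal "#<" of its own:

```python
import re

t = "foo <bar e word> f ga <foo b>"

# Same substitution as in the session above: mark every token start
# that is not inside a <...> group.
t2 = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()

# Drop the marker in front of each opening bracket to get the first variant.
t1 = t2.replace("#<", "<")
print(t1)  # prints: #foo <bar e word> #f #ga <foo b>
```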

Your second question can be solved in the following manner, although you need to decide what to do about the , in a list like ['g,', "b'"]: should the comma from your string be kept or not? There may be a faster way; the following is merely a proof of concept. A list comprehension might take the place of the final loop, although it would be fairly complicated.

>>> s = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
>>> q2 = re.compile(r"(?:<)\s*[^>]*\s*(?:>)\d*|(?<!<)[^\d\s<>]+\d+|(?<!<)[^\d\s<>]+")
>>> s2 = q2.findall(s)
>>> s3 = [re.sub(r"\s*[><]\s*", '', x) for x in s2]
>>> s4 = [y.split() if ' ' in y else y for y in s3]
>>> s4
["c'4", 'd8', ["e'", "g'16"], "fis'4", 'a,,', ['g,', "b'"], "c''1"]
>>> q3 = re.compile(r"(\D+)(\d*)")
>>> result = []
>>> for item in s4:
    # treat a single note as a one-element chord
    elems = item if isinstance(item, list) else [item]
    pitches, duration = [], 0
    for elem in elems:
        m = q3.search(elem)
        pitches.append(m.group(1))
        if m.group(2):
            # in a chord, the duration is attached to the last pitch
            duration = int(m.group(2))
    result.append((duration, pitches))


>>> result
[(4, ["c'"]), (8, ['d']), (16, ["e'", "g'"]), (4, ["fis'"]), (0, ['a,,']), (0, ['g,', "b'"]), (1, ["c''"])]
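As a possible shortcut, the same result can probably be reached in a single pass with re.finditer and one alternation: either a <...> group followed by optional digits, or a single note followed by optional digits. This is only a sketch under the format rules stated in the question; the names TOKEN and parse are hypothetical, the contents of a <...> group are split on whitespace without being validated, and any text the pattern does not match is silently skipped.

```python
import re

# Hypothetical helper names; not part of the answer above.
# Alternative 1: <...> group plus optional duration.
# Alternative 2: letters, then optional run of ' or , (not mixed), plus optional duration.
TOKEN = re.compile(r"<([^<>]*)>(\d*)|([a-zA-Z]+(?:'+|,+)?)(\d*)")

def parse(source):
    result = []
    for m in TOKEN.finditer(source):
        chord, chord_dur, note, note_dur = m.groups()
        if chord is not None:
            # Pitches inside <...> are separated by whitespace.
            pitches, dur = chord.split(), chord_dur
        else:
            pitches, dur = [note], note_dur
        # A missing duration defaults to 0, as in the question.
        result.append((int(dur) if dur else 0, pitches))
    return result

print(parse("c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"))
```

Because the duration is captured in its own group, no second regex is needed to separate pitch from number.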

Other answers

I know you are not looking for a general parser, but pyparsing makes this process very simple. Your format seemed very similar to the chemical formula parser that I wrote as one of the earliest pyparsing examples.

Here is your problem implemented using pyparsing:

from pyparsing import (Suppress,Word,alphas,nums,Combine,Optional,Regex,Group,
                       OneOrMore)

"""
List item
 -the string can consist of multiple, at least one, [a-zA-Z] chars
 -the string is followed by zero or more digits: e bes g4 c16
 -the string is followed by zero or more ' or , (not combined): 
  e' bes, f'''2 g,,4
 -the string can be substituted by a list of strings, list limiters are <>;
  the number comes behind the >, no space allowed
"""

LT,GT = map(Suppress,"<>")

integer = Word(nums).setParseAction(lambda t:int(t[0]))

note = Combine(Word(alphas) + Optional(Word(',') | Word("'")))
# or equivalent using Regex class
# note = Regex(r"[a-zA-Z]+('+|,+)?")

# define the list format of one or more notes within '<>'s
note_list = Group(LT + OneOrMore(note) + GT)

# each item is a note_list or a note, optionally followed by an integer; if
# no integer is given, default to 0
item = (note_list | Group(note)) + Optional(integer, default=0)

# reformat the parsed data as a (number, note_or_note_list) tuple
item.setParseAction(lambda t: (t[1],t[0].asList()) )

source = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
print(OneOrMore(item).parseString(source))

With this output:

[(4, ["c'"]), (8, ['d']), (16, ["e'", "g'"]), (4, ["fis'"]), (0, ['a,,']), 
 (0, ['g,', "b'"]), (1, ["c''"])]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow