Python regex to remove whitespace inside a pattern match

https://stackoverflow.com/questions/12205455

29-06-2021

Question

I have some well-behaved XML files that I want to reformat (NOT PARSE!) using regex. The goal is to put every <trkpt>...</trkpt> pair on a single line.

The following code works, but I'd like to get the operations performed in a single regex substitution instead of the loop, so that I don't need to concatenate the strings back.

import re

xml = """
    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581">
        <time>2012-08-25T10:20:44Z</time>
        <ele>0</ele>
      </trkpt>
    </trkseg>
"""

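for trkpt in re.findall(r'<trkpt.*?</trkpt>', xml, re.DOTALL):
    # \s already matches newlines, so no flag is needed here
    # (a fourth positional argument to re.sub would be taken as count)
    print(re.sub(r'>\s*<', '><', trkpt))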

An answer using sed would also be welcome.

Thanks for reading

Solution

How about this:

>>> regex = re.compile(
    r"""\n[ \t]*  # Match a newline plus following whitespace
    (?=           # only if... 
     (?:          # ...the following can be matched:
      (?!<trkpt)  #  (unless an opening <trkpt> tag occurs first)
      .           #  any character
     )*           # any number of times,
     </trkpt>     # followed by a closing </trkpt> tag
    )             # End of lookahead""", 
    re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))

    <trkseg>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
      <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
    </trkseg>

Other tips

This isn't really what you were asking for, but here's a one-liner for the sake of being a one-liner:

>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
                 lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                 xml, flags=re.DOTALL))

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

Also note that this approach will break if any string attributes contain the string "<trkpt", which probably won't happen, but that's the problem with not using a real parser.
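
For readability, the nested substitution in the one-liner above could be unpacked into two compiled patterns and a small function. This is only a sketch in Python 3 syntax; the names TRKPT, BETWEEN_TAGS, flatten_trkpts and the shortened sample input are made up for illustration and do not come from the original answers.

```python
import re

# Hypothetical names, chosen for this sketch only.
TRKPT = re.compile(r'<trkpt\b.*?</trkpt>', re.DOTALL)  # one whole <trkpt> element
BETWEEN_TAGS = re.compile(r'>\s*<')                    # whitespace between two tags


def flatten_trkpts(xml):
    """Collapse whitespace between tags, but only inside <trkpt> elements."""
    return TRKPT.sub(lambda m: BETWEEN_TAGS.sub('><', m.group(0)), xml)


# Shortened stand-in for the GPX fragment in the question.
sample = (
    '<trkseg>\n'
    '  <trkpt lon="1" lat="2">\n'
    '    <time>T</time>\n'
    '    <ele>0</ele>\n'
    '  </trkpt>\n'
    '</trkseg>\n'
)
print(flatten_trkpts(sample))
```

Because \s already matches newlines, the inner pattern needs no flag; only the outer .*? depends on re.DOTALL.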

Do you want to keep the <trkseg>? If so, this could work for you:

print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))

This removes all whitespace between two tags, provided the first tag does not end in t or g. That condition leaves the line breaks after </trkpt> and <trkseg> alone.

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
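
To see why the [^gt] condition is enough for this particular file, it may help to list every whitespace gap between tags together with the character that precedes it. The snippet below is a rough illustration in Python 3 syntax, using a shortened sample; the variable names are invented for this sketch. It also hints at how fragile the trick is: any other tag whose name ends in g or t (for example an element named <sat>) would be left untouched as well.

```python
import re

# Shortened stand-in for the GPX fragment in the question.
sample = (
    '<trkseg>\n'
    '  <trkpt lon="1" lat="2">\n'
    '    <time>T</time>\n'
    '    <ele>0</ele>\n'
    '  </trkpt>\n'
    '</trkseg>\n'
)

# For each whitespace gap between tags, capture the character before '>'
# and the name of the tag that follows (the lookahead consumes nothing).
gaps = re.findall(r'(.)>\s+(?=<(/?\w+))', sample)
for before, next_tag in gaps:
    action = 'kept' if before in 'gt' else 'collapsed'
    print(repr(before), 'before <%s>' % next_tag, '->', action)

# The substitution from the answer, applied to the same sample.
collapsed = re.sub(r'([^gt])>\s*<', r'\g<1>><', sample)
print(collapsed)
```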

Another one-liner is

print(re.sub(r"(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))

produces

<trkseg>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
  <trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>

This has the advantage of being easy to change for other tags.
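
As a sketch of what "easy to change" could look like, the pattern above can be assembled from a parent tag and a list of child tags. The helper name collapse_element and its parameters are hypothetical, and the code uses Python 3 syntax. It assumes every listed child appears exactly once and in the given order inside each parent; anything between the captured pieces (including child elements that are not listed) is discarded, just as with the hand-written pattern.

```python
import re


def collapse_element(xml, parent, children):
    """Put <parent> and the listed children on one line (hypothetical helper).

    Text between the captured groups is dropped, so unlisted child
    elements disappear from the output.
    """
    parts = [r"(<%s\b[^>]*>)" % re.escape(parent)]            # opening tag with attributes
    for tag in children:
        parts.append(r"(<{0}>.+?</{0}>)".format(re.escape(tag)))  # one child element
    parts.append(r"(</%s>)" % re.escape(parent))              # closing tag
    pattern = re.compile(r".*?".join(parts), re.DOTALL)
    # Re-emit only the captured groups, back to back.
    return pattern.sub(lambda m: "".join(m.groups()), xml)


# Shortened stand-in for the GPX fragment in the question.
sample = (
    '<trkseg>\n'
    '  <trkpt lon="1" lat="2">\n'
    '    <time>T</time>\n'
    '    <ele>0</ele>\n'
    '  </trkpt>\n'
    '</trkseg>\n'
)
print(collapse_element(sample, "trkpt", ["time", "ele"]))
```

If a parent is missing one of the listed children, the lazy .*? can run on into the next parent element, so this is best kept to files whose structure is known to be uniform.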

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow