How to parse sentences into tokens with either regex or toolkits

https://stackoverflow.com/questions/22665065

21-06-2023

Question

How can I parse a sentence like this, using either a regex or a toolkit such as BeautifulSoup or lxml:

input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I cannot use re.findall("<person>(.*?)</person>", input) because the tag varies.

Solution

Look how easy it is using BeautifulSoup:

from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    print(item)

prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

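Update (splitting non-tag items on whitespace and printing every part on a new line):

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: split on whitespace, one word per line
        for part in item.split():
            print(part)
    else:
        # Tag node: print the whole element unchanged
        print(item)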

prints:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

Hope that helps.

Other tips

Try this regex:

>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

>>>
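The substitution above keeps "drove to" on one line. If every loose word should become its own token, as in the question, a regex-only variant might look like the sketch below. The names TOKEN_RE and tokenize are made up for illustration, and the pattern assumes simple, non-nested tags whose closing tag repeats the opening name; for anything messier, a real parser is the safer choice.

```python
import re

# Either a whole element (<name ...>text</name>, where \1 forces the closing
# tag to repeat the opening name) or a run of characters that is neither
# whitespace nor "<".
# Assumption: tags are well formed and same-name tags are never nested.
TOKEN_RE = re.compile(r"<(\w+)[^>]*>.*?</\1>|[^<\s]+", re.DOTALL)


def tokenize(text):
    """Return elements as whole strings and loose text as single words."""
    # finditer + group(0) is used because findall would return only the
    # captured tag name for the first alternative.
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


data = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for token in tokenize(data):
    print(token)
```

The backreference is what lets the tag name vary without listing the names up front.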


Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow