Python RegEx skipping the first few characters?

https://stackoverflow.com/questions/1620889

06-07-2019

Question

Hey, I have a fairly basic question about regular expressions. I want to return just the text inside (and including) the body tags. I know the following isn't right, because it also matches all the characters before the opening body tag. How would you go about skipping those?

x = re.match('(.*<body).*?(</body>)', fileString)

Thanks!

Solution

Here is some example code that uses a regex to find all the text between <body>...</body> tags. Although this demonstrates some features of Python's re module, note that the Beautiful Soup module is very easy to use and is a better tool if you plan on parsing HTML or XML. (See below for an example of how you could parse this using BeautifulSoup.)

#!/usr/bin/env python
import re

# Here we have a string with a multiline <body>...</body>
fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

# re.DOTALL tells re that '.' should match any character, including newlines.
x = re.search('(<body>.*?</body>)', fileString, re.DOTALL)
for match in x.groups():
    print(match)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>
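The switch from re.match to re.search is what removes the need to skip the leading characters: re.match only succeeds if the pattern matches at position 0, while re.search scans forward to the first position where it matches. A minimal sketch of the difference, using a shortened stand-in for fileString (the sample text is illustrative only):

```python
import re

# Shortened, illustrative stand-in for the fileString used above.
sample = 'baz<body>foo\nbaby foo\n</body><body>bar</body>'
pattern = '(<body>.*?</body>)'

# re.match is anchored at the start of the string, where 'baz' is, so it fails.
anchored = re.match(pattern, sample, re.DOTALL)
print(anchored)  # None

# re.search tries each starting position and returns the first match.
found = re.search(pattern, sample, re.DOTALL)
print(found.group(1))
print(found.start())  # 3, the index where '<body>' begins
```

The match object's start() and end() methods give the position of the match if you need to slice the original string yourself.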

If you wish to collect all matches, you could use re.findall:

print(re.findall('(<body>.*?</body>)', fileString, re.DOTALL))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

and if you plan to use this pattern more than once, you can pre-compile it:

pat=re.compile('(<body>.*?</body>)', re.DOTALL)
print(pat.findall(fileString))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

And here is how you could do it with BeautifulSoup:

#!/usr/bin/env python
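from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''
# Name the parser explicitly; 'html.parser' ships with Python.
soup = BeautifulSoup(fileString, 'html.parser')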
print(soup.body)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

print(soup.findAll('body'))
# [<body>foo
# baby foo
# baby foo
# baby foo
# </body>, <body>bar</body>]

OTHER TIPS

I don't know Python, but here's a quick example thrown together using Beautiful Soup, which I often see recommended for Python HTML parsing.

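from bs4 import BeautifulSoup  # Beautiful Soup 4

soup = BeautifulSoup(fileString, 'html.parser')

# The first <body> tag, or None if there is none; str(bodyTag) gives its markup, tags included.
bodyTag = soup.body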

That will (in theory) deal with all the complexities of HTML, which is very difficult with pure regex-based answers, because it's not what regex was designed for.
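To make those complexities concrete: real pages often write the tag as <BODY> or give it attributes, and the literal pattern '<body>' misses both. The sketch below loosens the pattern a little. The sample markup is made up for illustration, and the loosened pattern is still only an approximation (it would be fooled by a '</body>' inside a comment or script, for example), which is why a parser is the safer choice.

```python
import re

# Made-up markup: uppercase tag name plus an attribute on the opening tag.
page = '<html><BODY class="main">\nhello\n</BODY></html>'

# The literal pattern from the regex answers finds nothing here.
strict = re.search('(<body>.*?</body>)', page, re.DOTALL)
print(strict)  # None

# Loosened: \b ends the tag name, [^>]* allows attributes,
# and re.IGNORECASE accepts <BODY> as well as <body>.
loose = re.search(r'(<body\b[^>]*>.*?</body\s*>)', page,
                  re.DOTALL | re.IGNORECASE)
print(loose.group(1))
```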

You cannot parse HTML with regex. HTML is not a regular language. Use an HTML parser like lxml instead.

x = re.match('.*(<body>.*?</body>)', fileString)

Consider minidom for HTML parsing.
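One caveat: xml.dom.minidom is an XML parser, so it only accepts well-formed markup (XHTML-style, with every tag closed) and raises an error on typical loose HTML. A minimal sketch, assuming the input is well-formed; the sample document is invented for illustration:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Invented, well-formed sample: a single root element and every tag closed.
doc_text = '<html><head><title>t</title></head><body>foo <b>bar</b></body></html>'

dom = minidom.parseString(doc_text)
body = dom.getElementsByTagName('body')[0]
body_markup = body.toxml()  # serializes the element, tags included
print(body_markup)

# Loose HTML (here an unclosed <p>) is rejected rather than repaired.
try:
    minidom.parseString('<body><p>unclosed</body>')
    parsed_loose = True
except ExpatError:
    parsed_loose = False
print(parsed_loose)
```

If the input is ordinary HTML from the web, an HTML-aware parser such as Beautiful Soup or lxml is likely a better fit.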

x = re.search('(<body>.*</body>)', fileString)
x.group(1)

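This is less typing than the re.match answers, because re.search does not need a leading .* to skip the text before <body>. Pass re.DOTALL as well if the body spans several lines.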

Does your fileString contain multiple lines? By default '.' does not match a newline, so in that case you need to match newlines explicitly:

x = re.match(r"(?:.|\n)*(<body>(?:.|\n)*</body>)", fileString)

or, more simply, pass the re.DOTALL flag so that '.' matches newlines too:

x = re.match(r".*(<body>.*</body>)", fileString, re.DOTALL)

x.groups()[0] should contain your string if x is not None.
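Note that the greedy .* in these patterns affects which text is captured when the string holds more than one body element. A small sketch of the difference, using an illustrative two-body string similar to the one in the solution above:

```python
import re

# Illustrative string with two <body> elements.
text = 'baz<body>foo\nbaby foo\n</body><body>bar</body>'

# A leading greedy .* consumes as much as possible, so the group is the LAST body.
greedy_prefix = re.match(r'.*(<body>.*</body>)', text, re.DOTALL)
print(greedy_prefix.group(1))  # <body>bar</body>

# A greedy .* inside the group runs from the first <body> to the last </body>.
greedy_inner = re.search(r'(<body>.*</body>)', text, re.DOTALL)
print(greedy_inner.group(1) == text[3:])  # True

# The non-greedy .*? stops at the first </body>.
lazy = re.search(r'(<body>.*?</body>)', text, re.DOTALL)
print(lazy.group(1))
```

With a single body element all three patterns capture the same text, so the choice only matters for input like the sample here.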

Licensed under: CC-BY-SA with attribution