Regex for texts NOT enclosed in angled bracket - python

https://stackoverflow.com/questions/19378576

30-06-2022

Question

I need to extract the text that is not enclosed in angle brackets.

My input looks like this:

> whatever something<X="Y" zzz="abc">this is a foo bar <this is a
> < whatever>and i ><only want this

and the desired output is:

> whatever something
> this is a foo bar <this is a
>
> and i ><only want this

I first tried detecting the parts inside the brackets and then removing them, but it seems I'm matching only the attributes inside the <> rather than the whole <...>. How do I achieve the desired output?

import re
x = """whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n< whatever>and i ><only want this"""
re.findall("<([^>]*)>", x.strip())
['X="Y" zzz="abc"', 'this is a\n    ', ' whatever']

Solution

Move the parentheses in the regex pattern so that they sit just inside the quotes and wrap the whole pattern (and remove the pair you already have). The group then captures the entire <...> tag, brackets included, rather than only the text between them. You also need to exclude \n from the character class so that a match cannot run across a line break; that is what produces the output you want.

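import re

# Same input as in the question; the line break in the text is written as \n.
x = ('whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n'
     '< whatever>and i ><only want this')

# Match each complete <...> tag, brackets included, without crossing a newline.
y = re.findall(r"(<[^>\n]*>)", x.strip())

# Replace every matched tag with a newline.
z = x
for i in y:
    z = z.replace(i, '\n')
print(z)

This prints:

whatever something
this is a foo bar <this is a

and i ><only want this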
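
The find-then-replace loop above can probably be collapsed into a single substitution with re.sub, which is a different technique from the one in the answer. The following is a minimal sketch under that assumption; the names text, TAG, and strip_tags are illustrative and do not come from the original post.

```python
import re

# Same input as in the question.
text = ('whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n'
        '< whatever>and i ><only want this')

# A tag is "<", then any run of characters other than ">" or a newline, then ">".
TAG = re.compile(r"<[^>\n]*>")


def strip_tags(s, replacement="\n"):
    """Replace every complete <...> tag in s with replacement (hypothetical helper)."""
    return TAG.sub(replacement, s)


result = strip_tags(text)
print(result)
```

No capturing group is needed here because re.sub replaces the whole match. Passing replacement="" would drop the tags without inserting line breaks.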

The parentheses mark the group that re.findall returns for each match: when a pattern contains exactly one group, findall returns only that group's text instead of the whole match.
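
To see how the position of the group and the \n exclusion each affect the result, the patterns can be compared side by side. This is only an illustrative sketch; the variable names (inner, outer, whole, spanning) are made up for the comparison.

```python
import re

# Same input as in the question.
text = ('whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n'
        '< whatever>and i ><only want this')

# Group around the contents only: findall returns the text between the brackets.
inner = re.findall(r"<([^>\n]*)>", text)

# Group around the whole tag: findall returns the tag together with its brackets.
outer = re.findall(r"(<[^>\n]*>)", text)

# No group at all: findall returns the whole match, which equals `outer` here.
whole = re.findall(r"<[^>\n]*>", text)

# Without excluding \n, one match runs across the line break.
spanning = re.findall(r"(<[^>]*>)", text)

print(inner)     # ['X="Y" zzz="abc"', ' whatever']
print(outer)     # ['<X="Y" zzz="abc">', '< whatever>']
print(spanning)  # ['<X="Y" zzz="abc">', '<this is a\n< whatever>']
```

The last result shows why the newline has to be excluded: otherwise "<this is a" on the first line is paired with the ">" on the second line and would be removed along with the real tag.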

Licensed under: CC-BY-SA with attribution
