Question

I need to get the text that is not enclosed in angle brackets.

My input looks like this:

> whatever something<X="Y" zzz="abc">this is a foo bar <this is a
> < whatever>and i ><only want this

and the desired output is:

> whatever something
this is a foo bar <this is a
> 
and i ><only want this

I've tried first detecting the things inside the brackets and then removing them. But it seems I'm matching the attributes inside the <> instead of the whole <...>. How do I achieve the desired output?

import re
x = """whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n< whatever>and i ><only want this"""
re.findall("<([^>]*)>", x.strip())
['X="Y" zzz="abc"', 'this is a\n    ', ' whatever']

Solution

Move the parentheses in the regex pattern so that they sit just inside the quotes and wrap the whole pattern, and remove the pair you already have. The group then captures all of the text between <...>, including the brackets themselves. You also need to exclude \n in the character class, so that a match cannot run across a line break, to get the output you want.

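import re

x = """whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n\
        < whatever>and i ><only want this"""
# The group wraps the whole pattern, so findall returns each tag with its
# brackets. [^>\n] stops a match from running across a line break.
y = re.findall("(<[^>\n]*>)", x)
z = x
for i in y:
    # Swap each complete <...> tag for a newline.
    z = z.replace(i, '\n')
print(z)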
whatever something
this is a foo bar <this is a

and i ><only want this

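The parentheses mark the group whose text re.findall returns for each match. When the pattern contains one group, findall returns only that group's text rather than the whole match, which is why your original pattern returned the text inside the brackets without the brackets.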
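
To make the grouping behaviour concrete, here is a minimal sketch comparing findall with and without a capture group. It also shows re.sub as an alternative to the findall-plus-replace loop. re.sub is not part of the answer above, just another way to do the same replacement in one pass. The names TAG, inner, whole and cleaned are illustrative only, and the input is the single-line form of the string from the question.

```python
import re

text = 'whatever something<X="Y" zzz="abc">this is a foo bar <this is a\n< whatever>and i ><only want this'

# Pattern for a complete <...> tag that stays on one line.
TAG = r"<[^>\n]*>"

# With a capture group, findall returns only the group's text.
inner = re.findall(r"<([^>\n]*)>", text)

# Without a group, findall returns each whole match, brackets included.
whole = re.findall(TAG, text)

# re.sub replaces every match in a single pass, so no loop is needed.
cleaned = re.sub(TAG, "\n", text)

print(inner)    # ['X="Y" zzz="abc"', ' whatever']
print(whole)    # ['<X="Y" zzz="abc">', '< whatever>']
print(cleaned)
```

Because the unclosed `<this is a` and `<only want this` never reach a `>` on the same line, they are left in the text, which matches the desired output in the question.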

Licensed under: CC-BY-SA with attribution