Question

I have some strings that look like this:

<a href="javascript:updateParent('higashino/index.html')">東野 圭吾「夢幻花」「白夜行」</a>他<br>

Now I want to extract the link and the strings inside the corner brackets ("「" and "」"), like this:

['higashino/index.html', '夢幻花', '白夜行']

I've tried:

import re
str = u'''<a href="javascript:updateParent('higashino/index.html')">東野 圭吾「夢幻花」「白夜行」</a>他<br>'''
myre = re.compile(ur'''\('(.*)'\)">.*「(.*?)」.*''', re.UNICODE)
myre.findall(str)

The result is:

['higashino/index.html', '白夜行']

Then I tried the pattern \('(.*)'\)">.*「([^」]*)」.* instead, but the result was the same: only one of the elements inside the corner brackets was found.

How can I get not just one, but all elements inside the corner brackets? Thanks.


Solution

Use re.findall() (or re.finditer()) with the regex 「([^」]*?)」, which matches each bracketed segment on its own:

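import re
# Named "text" rather than "str" so the built-in str type is not shadowed.
text = '''<a href="javascript:updateParent('higashino/index.html')">東野 圭吾「夢幻花」「白夜行」</a>他<br>'''
# One capture group, so findall returns a flat list of the bracketed strings.
match = re.findall(r'「([^」]*?)」', text)
print(match)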

Giving:

['夢幻花', '白夜行']

This was run with Python 3. If you are not already using Python 3, I recommend switching, since it handles Unicode strings much better than Python 2.
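The snippet above returns only the bracketed titles, not the link. In the original pattern the greedy .* in front of 「 runs ahead to the last bracket pair, and the trailing .* consumes the rest of the line, so the whole string is one match and only one title is captured. A rough sketch of one way to get the list asked for in the question is to extract the link and the titles with two separate patterns and join the results. The helper name extract_link_and_titles and the variable names are made up for illustration, and it assumes the link is always wrapped in ('...') as in the sample string:

```python
import re

html = '''<a href="javascript:updateParent('higashino/index.html')">東野 圭吾「夢幻花」「白夜行」</a>他<br>'''


def extract_link_and_titles(text):
    """Return [link, title1, title2, ...] for one line of markup.

    Hypothetical helper: assumes the link sits inside ('...') and the
    titles sit inside corner brackets, as in the sample string.
    """
    # First occurrence of ('...'); [^']* cannot run past the closing quote.
    link = re.search(r"\('([^']*)'\)", text)
    # Every corner-bracket pair; [^」]* stops at the first closing bracket.
    titles = re.findall(r'「([^」]*)」', text)
    return ([link.group(1)] if link else []) + titles


result = extract_link_and_titles(html)
print(result)  # ['higashino/index.html', '夢幻花', '白夜行']
```

Keeping the two patterns separate avoids having to describe the whole line in one expression, which is what made the original attempt return a single match.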

OTHER TIPS

>>> myre = re.compile(ur'''(?<=\(').+?(?='\)">)|(?<=「)[^」]+''', re.UNICODE)
>>> myre.findall(str)
[u'higashino/index.html', u'\u5922\u5e7b\u82b1', u'\u767d\u591c\u884c']
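The session above is Python 2 (note the ur prefix and the u'...' output). For anyone on Python 3, a sketch of the same idea might look like the following. The ur prefix no longer exists there and str patterns are Unicode-aware by default, so re.UNICODE is not needed. The variable names are only illustrative:

```python
import re

text = '''<a href="javascript:updateParent('higashino/index.html')">東野 圭吾「夢幻花」「白夜行」</a>他<br>'''

# Two alternatives, neither with a capture group, so findall returns the
# full text of each match:
#   1. anything preceded by (' and followed by ')">   -> the link
#   2. a run of non-」 characters preceded by 「        -> each title
pattern = re.compile(r'''(?<=\(').+?(?='\)">)|(?<=「)[^」]+''')

matches = pattern.findall(text)
print(matches)  # ['higashino/index.html', '夢幻花', '白夜行']
```

Because [^」]+ requires at least one character, an empty pair 「」 would be skipped by this pattern.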
Licensed under: CC-BY-SA with attribution