You can loop over the findall() results and collect them in a collections.defaultdict object. Do adjust your regular expressions to not include the quotes, and add some whitespace tolerance, though:
from collections import defaultdict
import re
regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
results = defaultdict(list)
for id_, tag in regex.findall(s):
    results[id_].append(tag)
print(list(results.items()))
You can replace list with set and append() with add() if all you want is unique values.
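As a minimal sketch of that set-based variant: the sample string below is made up for illustration and deliberately repeats one pair, and the name unique_results is just a placeholder.

```python
from collections import defaultdict
import re

# Hypothetical sample input; the ("id1", "A") pair appears twice on purpose.
s = 'label("id1","A") label("id1","A") label("id1","B") label("id2", "C")'

regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')

# defaultdict(set) creates an empty set the first time an id is seen.
unique_results = defaultdict(set)
for id_, tag in regex.findall(s):
    unique_results[id_].add(tag)  # adding a tag that is already present is a no-op
```

Sets do not keep insertion order, so wrap a value in sorted() if you need a stable ordering of the tags.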
Demo:
>>> from collections import defaultdict
>>> import re
>>> s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
>>> regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
>>> results = defaultdict(list)
>>> for id_, tag in regex.findall(s):
...     results[id_].append(tag)
...
>>> results.items()
[('id2', ['C', 'A', 'D']), ('id3', ['A']), ('id1', ['A', 'B'])]
You can sort this result too, if so desired.
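One possible way to do that sorting, assuming you want the ids in order and the tags within each id in order as well (sorted_results is a name picked for this sketch):

```python
from collections import defaultdict
import re

s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')

results = defaultdict(list)
for id_, tag in regex.findall(s):
    results[id_].append(tag)

# Sort the (id, tags) pairs by id, and sort each tag list as well.
sorted_results = [(id_, sorted(tags)) for id_, tags in sorted(results.items())]
print(sorted_results)
# [('id1', ['A', 'B']), ('id2', ['A', 'C', 'D']), ('id3', ['A'])]
```

Drop the inner sorted() call if the tags should stay in the order they appeared in the input.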