Here is some advice to improve your code:
- compile the regex once, so the pattern is not processed again each time you apply it,
- use raw strings, so that Python does not interpret backslashes in the regex string,
- use a regex that matches anything but the tag's closing character ('>') inside the tag,
- you don't need to repeat the substitution, since by default it replaces every occurrence on the line.
Here is a simpler and better version:
>>> import re
>>> r = re.compile(r'<[^>]+>')
>>> for it in source:
...     r.sub('', it)
...
'Twitter for Android Tablets'
'Twitter for Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
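To see why the negated character class matters, here is a minimal sketch comparing it with a greedy '.+' pattern. The sample line and its URL are made up for illustration, since the original source list is not shown here:

```python
import re

# Hypothetical sample line; the URL and attributes are placeholders.
line = '<a href="http://example.com/app" rel="nofollow">Twitter for Android</a>'

# '.' also matches '>', so the greedy pattern runs from the first '<'
# to the last '>' and swallows the whole line.
greedy = re.compile(r'<.+>')

# '[^>]+' cannot cross a '>', so each tag is matched on its own.
negated = re.compile(r'<[^>]+>')

greedy_result = greedy.sub('', line)
negated_result = negated.sub('', line)

print(repr(greedy_result))
print(repr(negated_result))
```

With the greedy pattern nothing is left of the line, while the negated class removes only the two tags.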
N.B.: the best solution for your use case would be @bakuriu's suggestion:
>>> for it in source:
...     it[it.index('>')+1:it.rindex('<')]
'Twitter for Android Tablets'
'Twitter for Android'
'foursquare'
'Twitter for iPhone'
'Twitter for BlackBerry'
This adds no significant overhead and uses basic, fast string operations. But that solution keeps only what is between the first '>' and the last '<', instead of removing the tags, which may have side effects if there are tags nested within the <a> and </a>, or no tags at all: it won't work for the 'web' string, where index() raises a ValueError. A solution that handles the no-tags case:
>>> for it in source:
...     if '>' in it and '<' in it:
...         it[it.index('>')+1:it.rindex('<')]
...     else:
...         it
'Twitter for Android Tablets'
'Twitter for Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
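If you want to reuse this, the slicing approach could be wrapped in a small function. This is only a sketch: the name slice_between_tags and the nested sample string are my own, not from the question. It uses find()/rfind(), which return -1 instead of raising, and it also shows the nested-tag side effect next to the regex version:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def slice_between_tags(text):
    """Return what sits between the first '>' and the last '<'.

    Hypothetical helper: falls back to the unchanged text when there
    is no such pair of characters in the expected order.
    """
    start = text.find('>')   # -1 when there is no '>'
    end = text.rfind('<')    # -1 when there is no '<'
    if start == -1 or end == -1 or end < start:
        return text
    return text[start + 1:end]

# Hypothetical input with a tag nested inside the <a> element.
nested = '<a href="http://example.com"><b>Twitter</b> for Android</a>'

sliced = slice_between_tags(nested)   # inner tags are kept
stripped = TAG_RE.sub('', nested)     # every tag is removed
plain = slice_between_tags('web')     # no tags: returned as is

print(repr(sliced))
print(repr(stripped))
print(repr(plain))
```

The slicing version leaves the inner <b> tags in place, so prefer the regex if nested tags can show up in your data.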