Question

I'm cleaning a series of sources from a Twitter stream. Here is an example of the data:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']


import re
for i in source:
    re.sub('<.*?>', '', re.sub(r'(<.*?>)(Twitter for)(\s+)', r'', i))

### This would be the expected output ###
'Android Tablets'
'Android'
'foursquare'
'web'
'iPhone'
'BlackBerry'

The code above does the job but looks awful. I was hoping there is a better way of doing this, using re.sub() or another function that would be more appropriate.


Solution

Here is some advice to improve your code:

  • compile the regex so it isn't reprocessed each time you apply it,
  • use raw strings to avoid any interpretation of the regex string by Python,
  • use a regex that matches anything but the closing-tag character inside the tag,
  • you don't need to repeat the substitution, as it replaces every occurrence on the line by default.

Here's a simpler and better version:

>>> import re
>>> r = re.compile(r'<[^>]+>')
>>> for it in source:
...     r.sub('', it)
... 
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
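
The output above still carries the "Twitter for" prefix that the question wants removed. A minimal sketch of one way to handle both in a single pass, assuming the prefix is always the literal text "Twitter for" followed by whitespace (the names TAG_OR_PREFIX and clean_source are illustrative, not part of the original answer):

```python
import re

# One compiled pattern: either an HTML tag, or the literal "Twitter for"
# prefix plus the whitespace after it.
# Assumption: the prefix is always spelled exactly this way.
TAG_OR_PREFIX = re.compile(r'<[^>]+>|Twitter for\s+')

def clean_source(raw):
    """Strip tags and the 'Twitter for' prefix from one source string."""
    return TAG_OR_PREFIX.sub('', raw)

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>',
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

cleaned = [clean_source(s) for s in source]
print(cleaned)
```

Strings without any tag, such as 'web', pass through unchanged because neither alternative matches.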

N.B.: the best solution for your use case would be @bakuriu's suggestion:

>>> for it in source:
...     it[it.index('>')+1:it.rindex('<')]
...
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'Twitter for iPhone'
'Twitter for BlackBerry'

which adds no significant overhead and uses basic, fast string operations. But that solution keeps only what is between the first '>' and the last '<' instead of removing the tags, which may have side effects if there are tags nested within the <a> and </a>, and it fails when there are no tags at all: for the 'web' string, index() raises a ValueError. A version that guards against strings with no tags:

>>> for it in source:
...     if '>' in it and '<' in it:
...         it[it.index('>')+1:it.rindex('<')]
...     else:
...         it
...
'Twitter for Android Tablets'
'Twitter for  Android'
'foursquare'
'web'
'Twitter for iPhone'
'Twitter for BlackBerry'
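
To illustrate the nested-tag caveat, here is a small sketch comparing the two approaches on a made-up input. The <b> markup and the example.com URL are hypothetical and do not come from the question's data:

```python
import re

# Hypothetical input with a nested <b> tag; not part of the question's data.
nested = '<a href="http://example.com" rel="nofollow"><b>Twitter</b> for iPhone</a>'

# Slicing keeps everything between the first '>' and the last '<',
# so the inner tags survive.
sliced = nested[nested.index('>') + 1:nested.rindex('<')]

# The regex removes every tag, nested or not.
stripped = re.sub(r'<[^>]+>', '', nested)

print(sliced)    # <b>Twitter</b> for iPhone
print(stripped)  # Twitter for iPhone
```

If nested markup can show up in the stream, the regex (or a real parser) is probably the safer choice; if it cannot, the slice is fine.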

Other tips

Just another alternative, using BeautifulSoup html parser:

>>> from bs4 import BeautifulSoup
>>> for link in source:
...     print(BeautifulSoup(link, 'html.parser').text.replace('Twitter for', '').strip())
... 
Android Tablets
Android
foursquare
web
iPhone
BlackBerry

If you're doing a lot of these, use a library designed to handle (X)HTML. lxml works well but I'm more familiar with the BeautifulSoup wrapper.

from bs4 import BeautifulSoup

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
      '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
      '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
      '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
      '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

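# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup('\n'.join(source), 'html.parser')
# Only <a> elements are visited, so the bare 'web' entry is skipped.
for tag in soup.find_all('a'):
    print(tag.text)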

This might be a little overkill for your use case, though.
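
If pulling in a third-party package feels too heavy, the standard library's html.parser module can do roughly the same job. This is a different technique from the BeautifulSoup code above, sketched here under the assumption that only the text content matters; TextExtractor and tag_text are illustrative names:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of a fragment and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.chunks.append(data)

def tag_text(raw):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

print(tag_text('<a href="http://foursquare.com" rel="nofollow">foursquare</a>'))
print(tag_text('web'))
```

Unlike the find_all('a') loop, this keeps plain strings such as 'web', since text outside any tag is still reported through handle_data.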

One option, if the text really is in such a consistent format, is to just use string operations instead of regex:

source = ['<a href="https://twitter.com/download/android" rel="nofollow">Twitter for Android Tablets</a>', 
          '<a href="https://twitter.com/download/android" rel="nofollow">Twitter for  Android</a>',
          '<a href="http://foursquare.com" rel="nofollow">foursquare</a>', 'web',
          '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
          '<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry</a>']

for i in source:
    print(i.partition('>')[-1].rpartition('<')[0])

This code finds the first '>' in the string, takes everything after it, finds the last '<' in what remains, and returns everything before that; i.e., it gives you any text between the first '>' and the last '<'.
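
A short trace of the two steps may help. It also shows that a string with no tags, like 'web', comes out empty with this approach. The between_tags helper is an illustrative wrapper, not part of the original answer, that falls back to the raw string in that case:

```python
s = '<a href="http://foursquare.com" rel="nofollow">foursquare</a>'

# Step 1: split at the first '>' and keep what follows it.
tail = s.partition('>')[-1]
print(tail)   # foursquare</a>

# Step 2: split at the last '<' and keep what precedes it.
inner = tail.rpartition('<')[0]
print(inner)  # foursquare

# With no '>' at all, partition() puts an empty string in the last slot,
# so the chain returns '' rather than the original text.
empty = 'web'.partition('>')[-1].rpartition('<')[0]

def between_tags(raw):
    """Text between the first '>' and the last '<'; raw itself if untagged."""
    if '>' not in raw or '<' not in raw:
        return raw
    return raw.partition('>')[-1].rpartition('<')[0]

print(between_tags('web'))  # web
```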

There's also the far more minimal version @Bakuriu put in a comment, which is probably better than mine!

This looks less ugly to me and should work equally well:

import re
for i in source:
    print(re.sub(r'(<.*?>)|(Twitter for\s+)', '', i))

Licensed under: CC-BY-SA with attribution