First, unescape HTML entities, then remove punctuation chars:
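import HTMLParser
import string

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Decode HTML entities such as &quot; or &amp; back into characters.
    text = HTMLParser.HTMLParser().unescape(text)
    # Lowercase each word and drop punctuation; skip short words and links.
    # The join works for both str and unicode (two-argument translate is str-only).
    shortenedText = [''.join(c for c in e.lower() if c not in string.punctuation)
                     for e in text.split() if len(e) >= 3 and not e.startswith('http')]
    print shortenedText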
Here's an example of how unescape works:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape("&quot;The truth is out there")
u'"The truth is out there'
UPD: the solution to the UnicodeDecodeError problem: use text.decode('utf8'). Here's a good explanation of why you need to do this.
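If you're on Python 3, the same steps can be sketched roughly like this. It's only a sketch: clean_tweet is a made-up helper name, UTF-8 is an assumption about how your tweets are encoded, and html.unescape plus str.maketrans stand in for the Python 2 calls used above.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Return lowercased, punctuation-free words (hypothetical helper)."""
    # Byte strings have to be decoded first; UTF-8 is assumed here.
    if isinstance(text, bytes):
        text = text.decode("utf8")
    # Turn entities such as &quot; and &amp; back into plain characters.
    text = html.unescape(text)
    words = []
    for word in text.split():
        # Same filter as above: drop short tokens and links.
        if len(word) >= 3 and not word.startswith("http"):
            words.append(word.lower().translate(_PUNCT_TABLE))
    return words


print(clean_tweet(b"&quot;The truth is out there&quot; http://example.com"))
# prints ['the', 'truth', 'out', 'there']
```

Decoding up front means everything after it works on text rather than raw bytes, so the entity replacement never has to mix the two.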