Python NLTK not taking out punctuations correctly

https://stackoverflow.com/questions/22978956

30-06-2023

Question

I have defined the following code:

import string
import nltk

exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer

I printed exclude beforehand, and the quotation mark " is in it, so I expected answer to be ['the']. However, when I print answer, it shows up as ['"the']. I'm not sure why the punctuation isn't being removed. Would I need to check each character individually instead?

Solution

When you create a set from wordList, it stores the whole string '"the' as its only element:

>>> set(wordList)
set(['"the'])

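Set difference compares whole elements, and the string '"the' is not equal to any single punctuation character, so the difference returns the same set: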

>>> set(wordList) - set(string.punctuation)
set(['"the'])
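To make the element-versus-character distinction concrete, here is a small Python 3 sketch. The variable names are illustrative and not taken from the question.

```python
import string

punctuation = set(string.punctuation)
word_list = ['"the']

# Set difference compares whole elements: the string '"the' is not equal
# to any single punctuation character, so nothing is removed.
whole_words = set(word_list) - punctuation

# Splitting the word into characters first does drop the quote,
# but the result is an unordered set of characters, not a word.
chars = set('"the') - punctuation

print(whole_words)    # {'"the'}
print(sorted(chars))  # ['e', 'h', 't']
```

The second form shows why a character-level set is not a good fix either: it loses the order of the letters and any repeated ones.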

If you just want to remove punctuation, you probably want something like this:

>>> [word.translate(None, string.punctuation) for word in wordList]
['the']

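Here I'm using the translate method of strings, passing None as the translation table and string.punctuation as the second argument, which specifies the characters to delete. This two-argument form works on Python 2 byte strings only.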
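On Python 3, str.translate takes a single table argument, so the call above raises a TypeError there. A minimal sketch of the equivalent, assuming Python 3 and using str.maketrans to build a deletion table (the extra sample words are made up for illustration):

```python
import string

# Python 3: the third argument to str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)

word_list = ['"the', "don't!", "end."]  # sample input, not from the question
cleaned = [word.translate(table) for word in word_list]
print(cleaned)  # ['the', 'dont', 'end']
```

Note that this strips every character in string.punctuation, including apostrophes inside words.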

You can then perform the lemmatization on the new list.
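The two steps can be combined roughly as follows. This is a Python 3 sketch under stated assumptions: clean_and_lemmatize is a hypothetical helper name, and the lemmatizer is passed in as a plain callable so the function does not depend on NLTK being installed.

```python
import string

# Deletion table for all ASCII punctuation (Python 3).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_and_lemmatize(words, lemmatize):
    """Strip punctuation, lowercase, drop empty results, then lemmatize.

    `lemmatize` is any callable mapping a word to its lemma.
    """
    cleaned = (word.translate(_PUNCT_TABLE).lower() for word in words)
    return [lemmatize(word) for word in cleaned if word]

# Stand-in lemmatizer that returns the word unchanged, for demonstration.
print(clean_and_lemmatize(['"the', '!!'], lambda w: w))  # ['the']
```

With NLTK and its WordNet data available, the lemmatizer from the question can be passed as nltk.stem.wordnet.WordNetLemmatizer().lemmatize. Unlike the original code, this version does not deduplicate the input; wrap the word list in set() first if that is wanted.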

Licensed under: CC-BY-SA with attribution
