Extract non-content English language words string - python [duplicate]

https://stackoverflow.com/questions/22904678

28-06-2023

Question

I am working on a Python script in which I want to remove common English words such as "the", "an", "and", "for", and many more from a string. Currently I keep a local list of all such words and call remove() to remove them from the string. I would like a more Pythonic way to achieve this. I have read about nltk and wordnet, but I am clueless about whether that is what I should use and how to use it.

Edit

Well, I don't understand why this was marked as a duplicate. My question does not in any way imply that I already knew about stop words and only wanted to know how to use them. The question is about what I can use in my scenario, and the answer to that was stop words, but when I posted this question I didn't know anything about stop words.

Solution 3

I have found that what I was looking for is this:

from nltk.corpus import stopwords
my_stop_words = stopwords.words('english')

Now I can remove or replace the words in my list/string wherever there is a match in my_stop_words, which is a list.

For this to work I had to install NLTK for Python and then use its downloader to download the stopwords package.

The downloader also offers many other packages that can be used in different NLP situations, such as words, brown, wordnet, etc.
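To make the removal step concrete, here is a minimal sketch of filtering a string against the stop-word list. The helper names load_stop_words and remove_stop_words are made up for illustration, and the small fallback set is only an assumption so the snippet still runs when NLTK or its stopwords corpus is not available.

```python
def load_stop_words():
    """Return NLTK's English stop words, or a tiny fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # NLTK is not installed, or the corpus has not been fetched yet
        # (one-time setup: import nltk; nltk.download("stopwords")).
        return {"the", "an", "and", "for"}  # illustrative fallback only


def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


my_stop_words = load_stop_words()
print(remove_stop_words("an elevator is made for five people", my_stop_words))
```

This splits on whitespace only, so punctuation attached to a word (for example "and,") is not matched. Running the text through a tokenizer first would probably handle that better.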

OTHER TIPS

Do this.

# english_dictionary and source_text are assumed to be defined elsewhere
vocabular = set(english_dictionary)
unique_words = [word for word in source_text.split() if word not in vocabular]

It is as simple and efficient as can be. If you don't need the positions of the unique words, make them a set too! The in operator is extremely fast on sets (and slow on lists and other sequences).
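As a rough illustration of the two variants mentioned above, the sketch below keeps the filtered words once as an ordered list and once as a set. The sample sentence and the words_to_drop set are stand-ins chosen for the example, not data from the answer.

```python
source_text = "an elevator is made for five people and it is fast"

# Hypothetical stand-in for english_dictionary; a set gives
# constant-time membership tests on average.
words_to_drop = {"the", "an", "and", "for"}

# Keeps positions and duplicates ("is" appears twice).
kept_in_order = [word for word in source_text.split() if word not in words_to_drop]

# Drops order and duplicates; use this when positions do not matter.
kept_unique = {word for word in source_text.split() if word not in words_to_drop}

print(kept_in_order)
print(sorted(kept_unique))
```

The list version is the one to use if the cleaned text has to be joined back into a sentence. The set version suits cases where only the vocabulary matters.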

This will also work for simple cases:

yourString = "an elevator is made for five people and it's fast"
wordsToRemove = ["the ", "an ", "and ", "for "]

# Caveat: replace() matches substrings, so "an " is also cut out of "fan " or "man "
for word in wordsToRemove:
    yourString = yourString.replace(word, "")
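Because str.replace matches substrings, a word-boundary regular expression is one way to remove only whole words. The following is a sketch of that alternative, not the answerer's method, and remove_words is a hypothetical helper name.

```python
import re


def remove_words(text, words):
    """Remove whole-word occurrences of `words` plus any whitespace after them."""
    # \b anchors each match at word boundaries, so "an" inside "fan" is left alone.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b\s*"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


your_string = "an elevator is made for five people and it's fast"
print(remove_words(your_string, ["the", "an", "and", "for"]))
```

Note that the words are listed without trailing spaces here, because the pattern itself consumes the whitespace that follows each match.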

Licensed under: CC-BY-SA with attribution
