Add stop_words while performing TF-IDF cosine similarity

https://stackoverflow.com/questions/19655951

python
tf-idf
cosine-similarity

01-07-2022

Question

I'm using sklearn to perform cosine similarity.

Is there a way to consider all the words starting with a capital letter as stop words?

Solution

The regex below takes a string and replaces every word that begins with an uppercase ASCII letter (A-Z followed by word characters, i.e. letters, digits, or underscores) with the empty string. Words that merely contain a capital, such as doNotMatch, are left alone. See http://docs.python.org/2.7/library/re.html for more options.

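import re

s1 = "The cat Went to The store To get Some food doNotMatch"
# \b anchors at a word boundary, [A-Z] requires an uppercase first letter,
# and \w* (rather than \w+) also catches one-letter words such as "I"
r1 = re.compile(r'\b[A-Z]\w*')
r1.sub('', s1)
# returns ' cat  to  store  get  food doNotMatch'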
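Note that the pattern above only recognizes ASCII capitals and leaves doubled spaces where words were removed. If either matters, a variant along the following lines may help. It is only a sketch: the helper name drop_capitalized is made up for illustration, and it assumes that discarding punctuation is acceptable (which usually suits bag-of-words features, but not every use). It splits the text into runs of word characters, checks the first character with str.isupper(), and joins the survivors with single spaces.

```python
import re


def drop_capitalized(text):
    """Return text without the words whose first character is uppercase.

    Hypothetical helper: words are runs of word characters, so punctuation
    is discarded and the remaining words are joined by single spaces.
    """
    words = re.findall(r'\w+', text)
    # str.isupper() is Unicode-aware, so non-ASCII capitals count as well
    kept = [w for w in words if not w[0].isupper()]
    return ' '.join(kept)


print(drop_capitalized("The cat Went to The store To get Some food doNotMatch"))
# prints: cat to store get food doNotMatch

# "\u00c9mile" and "\u00d6sterreich" start with non-ASCII capitals
print(drop_capitalized("\u00c9mile went to \u00d6sterreich"))
# prints: went to
```

Keep in mind that either approach also drops sentence-initial words, since those are capitalized too.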

Sklearn also has many facilities for text feature generation, such as the sklearn.feature_extraction.text module. You might also consider NLTK for sentence segmentation, stop-word removal, and similar tasks.
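To tie this back to the question, one possible way to combine the regex with sklearn is to hand it to TfidfVectorizer as a custom preprocessor, so that capitalized words are stripped before tokenization. This is a sketch under assumptions rather than the only approach: the name strip_capitalized and the sample documents are invented for illustration. A plain stop_words list is less convenient here because, with the default settings, text is lowercased before stop words are filtered, so the capitalization is already gone by then.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CAPITALIZED = re.compile(r'\b[A-Z]\w*')


def strip_capitalized(doc):
    # Remove capitalized words first, then lowercase what is left.
    # A custom preprocessor replaces sklearn's default lowercasing step,
    # so the lowercasing is done here explicitly.
    return CAPITALIZED.sub('', doc).lower()


# Made-up sample documents
docs = [
    "The cat went to the Store to get some food",
    "A cat went to Paris to get some food",
    "Dogs rarely eat fish",
]

vectorizer = TfidfVectorizer(preprocessor=strip_capitalized)
tfidf = vectorizer.fit_transform(docs)

# Capitalized words such as "Store" and "Paris" never reach the vocabulary
print(sorted(vectorizer.vocabulary_))

# Pairwise cosine similarity between the documents' TF-IDF vectors
similarity = cosine_similarity(tfidf)
print(similarity)
```

The first two documents share most of their remaining words and should score close to each other, while the third shares none with them once "Dogs" is removed.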

Licensed under: CC-BY-SA with attribution
