Question

I'm using sklearn to compute cosine similarity.

Is there a way to consider all the words starting with a capital letter as stop words?

Solution

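The following regex takes a string and replaces with the empty string every run of word characters (letters, digits, underscore) that begins with an uppercase ASCII letter. Because \w+ requires at least one character after the capital, one-letter words such as "I" are left in place; use \w* instead if those should be removed too. See http://docs.python.org/2.7/library/re.html for more options.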

import re

s1 = "The cat Went to The store To get Some food doNotMatch"
# Match whole words that begin with an uppercase ASCII letter.
r1 = re.compile(r'\b[A-Z]\w+')
r1.sub('', s1)
# returns ' cat  to  store  get  food doNotMatch'

Sklearn also has many facilities for text feature generation, such as the sklearn.feature_extraction.text module. You might also consider NLTK to assist with sentence segmentation, stop-word removal, and similar tasks.
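
One possible way to plug the regex into sklearn is sketched below. It is not the only approach. sklearn's vectorizers lowercase text by default before tokenizing and stop-word filtering, so a plain stop_words list cannot tell "The" from "the". The sketch instead passes a custom function through the preprocessor parameter of TfidfVectorizer, which replaces the built-in lowercasing step. The helper name drop_capitalized and the two sample documents are made up for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same pattern as above: whole words starting with an uppercase ASCII letter.
CAPITALIZED = re.compile(r'\b[A-Z]\w+')


def drop_capitalized(text):
    """Remove capitalized words, then lowercase what is left."""
    # A custom preprocessor replaces sklearn's own lowercasing step,
    # so capitals are still visible here and we lowercase ourselves.
    return CAPITALIZED.sub('', text).lower()


# Hypothetical sample documents, for illustration only.
docs = [
    "The cat Went to The store To get Some food",
    "A cat went to a store for food",
]

vectorizer = TfidfVectorizer(preprocessor=drop_capitalized)
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the documents (a 2 x 2 matrix).
similarity = cosine_similarity(tfidf)

print(sorted(vectorizer.vocabulary_))
print(similarity)
```

Capitalized words never reach the tokenizer, so they behave like stop words. The lowercase "went" in the second document is still counted, because only the capitalized form is removed.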

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow