Question

I want to know the most widely accepted ways to find features (special words) within a large data set. By special words, I mean words that are used most heavily in a specific field.

For example, I have two books:

  1. book1: a book about economics
  2. book2: a book about art

Now, I choose book1 and want to see which words are most related to it. I expect words such as 'financial', 'dollar', 'revenue', etc. to dominate the top of the most-used-words list. Even though these words may also occur in book2, their frequencies will be lower than in book1.

On the other hand, choosing book2 should yield words such as 'abstract', 'renaissance', 'romanticism', 'culture', etc.

Of course, the result depends on the context (in the above example, on book1 and book2).

Obviously, the chosen algorithm must be able to eliminate stop words.

So, I am wondering which methods are used for this problem.

Solution

tf-idf should help, since it combines:

  1. how often a word appears in a given document (i.e. each book)
  2. how many documents in the set (a.k.a. the corpus) contain the word

If a word appears a lot in a document but not so much in the rest of the corpus, it is likely characteristic of that document and will have a high tf-idf score. If, on the other hand, a word appears frequently in a document but also frequently across the whole corpus, it is not very characteristic of that document and thus will not have a high tf-idf score. The words with the highest tf-idf scores per document are the most relevant.
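
For concreteness, here is a minimal tf-idf sketch in plain Python. The toy docs dictionary and the top_tfidf helper are hypothetical names introduced purely for illustration:

import math
from collections import Counter

# Toy corpus: book name -> list of tokens (stand-ins for the two books).
docs = {
    "book1": ["dollar", "revenue", "dollar", "market", "the", "the"],
    "book2": ["renaissance", "culture", "abstract", "the", "the"],
}

def top_tfidf(doc_name, n=5):
    tokens = docs[doc_name]
    tf = Counter(tokens)                   # term frequency within this document
    scores = {}
    for word, count in tf.items():
        # document frequency: in how many documents does the word occur?
        df = sum(1 for toks in docs.values() if word in toks)
        idf = math.log(len(docs) / df)     # words present in every document get idf = 0
        scores[word] = (count / len(tokens)) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_tfidf("book1"))  # 'dollar' and 'revenue' outrank 'the', whose idf is 0

Because 'the' occurs in both books, its idf is log(2/2) = 0 and it drops to the bottom even before any explicit stop-word removal; in practice, tf-idf and stop-word lists complement each other.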

Stop-word removal is a step you may want to perform on your data before computing tf-idf measures for your documents, but you may want to try it both with and without stop words and compare the results.

EDIT:

To support what I mentioned in the comment re. not having to come up with the stop words yourself, here are NLTK's English stop words, which you can add to or remove from to suit whatever you want to implement:

>>> import nltk
>>> nltk.download('stopwords')  # one-time download of the stopword corpus
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 
'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 
'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 
'with', 'about', 'against', 'between', 'into', 'through', 'during', 
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 
'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 
't', 'can', 'will', 'just', 'don', 'should', 'now']
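
As a quick illustration of putting that list to use (a sketch only; the token list is made up for this example), you can filter tokens before computing term frequencies or tf-idf:

>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> tokens = ['the', 'revenue', 'of', 'the', 'financial', 'sector']
>>> [t for t in tokens if t.lower() not in stop]
['revenue', 'financial', 'sector']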

OTHER TIPS

Take a look at Latent Dirichlet Allocation (LDA). It is an unsupervised algorithm that treats "topics" as distributions over terms and documents as distributions over topics. Implementations are widely available in multiple languages.
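
As one possible illustration (not part of the original answer), here is a hedged sketch using scikit-learn's LatentDirichletAllocation; the toy documents and parameter choices are assumptions made only to show the shape of the API, and a recent scikit-learn version is assumed:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for book1 and book2.
docs = [
    "financial revenue dollar market economy financial",
    "renaissance romanticism abstract culture painting",
]

vec = CountVectorizer(stop_words='english')      # built-in English stop-word removal
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}: {top_terms}")

With only two tiny documents the topics are not meaningful, but on book-length texts each topic's top terms play the role of the "special words" you are after.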

To eliminate stop words, you can simply find a stop-word list online or use a package available in your language of choice; this option is often built into text-mining or NLP packages (NLTK's English list shown above is one example).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow