Question

I want to analyse a document for items such as letters, bigrams, words, etc., and compare how frequent they are in my document with how frequent they are over a large corpus of documents.

The idea is that words such as "if", "and", "the" are common in all documents, but some words will be much more common in this document than is typical for the corpus.

This must be pretty standard; what is it called? Doing it the obvious way, I always had a problem with novel words that appear in my document but not in the corpus being rated as infinitely significant. How is this dealt with?


Solution

Most likely you've already checked tf-idf or some of the other metrics from the Okapi BM25 family.
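
To make tf-idf concrete, here is a minimal plain-Python sketch. It is not any particular library's implementation; the corpus format (a list of token lists), the smoothed-idf variant, and names like `score_terms` are assumptions chosen for illustration:

```python
import math
from collections import Counter

def score_terms(doc_tokens, corpus_docs):
    """Score each term in doc_tokens by tf-idf against corpus_docs."""
    tf = Counter(doc_tokens)            # raw term counts in the document
    n_docs = len(corpus_docs)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for d in corpus_docs if term in d)
        # The "+ 1" terms keep the idf finite even for novel words that
        # never appear in the corpus (a simple form of smoothing).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    return scores

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["if", "the", "cat"]]
doc = ["the", "quantum", "cat", "quantum"]
print(sorted(score_terms(doc, corpus).items(), key=lambda kv: -kv[1]))
```

On the toy data, "quantum" (novel to the corpus) scores highest but stays finite, while corpus-wide words like "the" are discounted, which is exactly the behaviour the question is after.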

You can also check the Natural Language Toolkit (NLTK) for some ready-made solutions.

UPDATE: as for novel words, smoothing should be applied: Good-Turing smoothing, Laplace (add-one) smoothing, etc.
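
Here is a minimal sketch of Laplace (add-one) smoothing in plain Python, so a word unseen in the corpus gets a small non-zero probability instead of zero; the toy counts and the choice of vocabulary size are assumptions for illustration. NLTK's `nltk.probability` module also ships ready-made estimators such as `LaplaceProbDist` and `SimpleGoodTuringProbDist`.

```python
from collections import Counter

def laplace_prob(word, counts, vocab_size):
    """P(word) = (count + 1) / (N + V) under add-one smoothing."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

corpus_counts = Counter(["the", "the", "cat", "sat", "if", "and"])
V = len(corpus_counts) + 1          # reserve one slot for unseen words

print(laplace_prob("the", corpus_counts, V))      # seen word
print(laplace_prob("quantum", corpus_counts, V))  # novel word: small but non-zero
```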

Other tips

It comes under the heading of linear classifiers, with Naive Bayesian classifiers being the best-known form (due to their remarkable simplicity and robustness in attacking real-world classification problems).
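
For a concrete sense of the Naive Bayes approach, here is a toy sketch using NLTK's built-in classifier; the two-label training data and the bag-of-words feature function are made up purely for illustration:

```python
import nltk

def features(text):
    # Bag-of-words featureset: each word present maps to True.
    return {word: True for word in text.lower().split()}

train = [
    (features("the cat sat on the mat"), "animals"),
    (features("the dog ran in the park"), "animals"),
    (features("stocks fell and bonds rallied"), "finance"),
    (features("the market closed higher"), "finance"),
]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("a cat and a dog")))  # expected: "animals"
```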

Licensed under: CC-BY-SA with attribution