How does the Naive Bayes algorithm function effectively as a classifier, despite its conditional independence and bag-of-words assumptions?

datascience.stackexchange https://datascience.stackexchange.com/questions/74134

Problem

The Naive Bayes algorithm, as used for text classification, relies on two assumptions to keep it computationally fast:

  • Bag-of-words assumption: the position of words in the document is ignored

  • Conditional independence: words are assumed to be independent of one another given the class

In reality, neither assumption usually holds, yet Naive Bayes is quite effective. Why is that?
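To see what the two assumptions buy, note that together they reduce classification to multiplying per-word likelihoods: predict the class c maximizing P(c) · ∏ P(wᵢ | c). Below is a minimal sketch of that decision rule; the classes and word probabilities are made-up toy numbers, purely for illustration:

```python
import math

# Toy per-class word likelihoods P(w | c) -- illustrative numbers only.
likelihood = {
    "sports":   {"goal": 0.05,  "match": 0.04,  "vote": 0.001},
    "politics": {"goal": 0.002, "match": 0.003, "vote": 0.06},
}
prior = {"sports": 0.5, "politics": 0.5}  # P(c)

def classify(words):
    """argmax_c  log P(c) + sum_i log P(w_i | c).

    Word order is ignored (bag of words) and each word contributes an
    independent factor (conditional independence).
    """
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in words:
            # A tiny floor stands in for smoothing of unseen words.
            score += math.log(likelihood[c].get(w, 1e-6))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["goal", "match", "match"]))  # -> sports
```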


Solution

The main reason is that in many cases (though not always) the model gets enough evidence to make the right decision just from knowing which words do and don't appear in the document (possibly also using their frequencies, but even that is not always needed).
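As a hedged illustration of the presence-versus-frequency point, here is a sketch using scikit-learn, where BernoulliNB sees only word presence/absence and MultinomialNB sees word counts. The four-document corpus is invented for illustration; on data this separable, both variants agree:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = [
    "the team won the match with a late goal",    # sports
    "striker scores goal in cup final",           # sports
    "parliament passed the budget after a vote",  # politics
    "senators debate the new election law",       # politics
]
labels = ["sports", "sports", "politics", "politics"]
test = ["referee awards a penalty goal"]

# Presence/absence features only.
vec_bin = CountVectorizer(binary=True)
X_bin = vec_bin.fit_transform(docs)
print(BernoulliNB().fit(X_bin, labels).predict(vec_bin.transform(test)))

# Raw count (frequency) features.
vec_cnt = CountVectorizer()
X_cnt = vec_cnt.fit_transform(docs)
print(MultinomialNB().fit(X_cnt, labels).predict(vec_cnt.transform(test)))
# Both print ['sports']: the single word "goal" is evidence enough here.
```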

Let's take the textbook example of topic detection in news documents. A 'sports' article is likely to contain at least a few words which are unambiguously related to sports, and the same holds for many other topics as long as they are sufficiently distinct.
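One way to make this visible with a trained model: scikit-learn's MultinomialNB exposes the per-class log-likelihoods log P(w | c) in its feature_log_prob_ attribute, so you can inspect which words the model treats as strong topic evidence. A sketch on the same invented toy corpus as above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the team won the match with a late goal",
        "striker scores goal in cup final",
        "parliament passed the budget after a vote",
        "senators debate the new election law"]
labels = ["sports", "sports", "politics", "politics"]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(docs), labels)

vocab = np.array(vec.get_feature_names_out())
i_spo = list(nb.classes_).index("sports")
i_pol = list(nb.classes_).index("politics")
# log P(w|sports) - log P(w|politics): large positive values mark words
# the model treats as strong evidence for 'sports'.
diff = nb.feature_log_prob_[i_spo] - nb.feature_log_prob_[i_pol]
order = np.argsort(diff)
print("most sports-like:  ", vocab[order[-3:]])
print("most politics-like:", vocab[order[:3]])
```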

In general, tasks that depend on the overall semantics of the text work reasonably well with unigrams (single words, unordered) as features, whether with NB or other methods. It's different for tasks that require taking syntax into account, or a deeper understanding of the semantics.

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange