Question

At least 3 types of n-grams can be considered for representing text documents:

  • byte-level n-grams
  • character-level n-grams
  • word-level n-grams

It's unclear to me which one should be used for a given task (clustering, classification, etc). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".

Are there other criteria to consider for choosing the "right" representation?


Solution

Evaluate. The criterion for choosing a representation is whatever works best for your task.

Indeed, character-level n-grams (not the same as bytes, unless you only care about English) are probably the most common representation, because they are robust to spelling differences (which need not be errors; if you look at history, spelling changes over time). So for spelling-correction purposes this works well.
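To make the typo example from the question concrete, here is a minimal sketch in plain Python (the helper names are illustrative, not from any library): character trigrams of "Mary loves dogs" and "Mary lpves dogs" still overlap heavily, while the word token "loves" is lost entirely.

```python
def char_ngrams(text, n=3):
    # set of overlapping character n-grams of a string
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # set overlap: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

clean, typo = "mary loves dogs", "mary lpves dogs"
print(jaccard(char_ngrams(clean), char_ngrams(typo)))  # 0.625: most trigrams survive the typo
print(jaccard(set(clean.split()), set(typo.split())))  # 0.5: the token "loves" never matches "lpves"
```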

On the other hand, the Google Books Ngram Viewer uses word-level n-grams on its books corpus, because the goal is not to analyze spelling but term usage over time; e.g. "child care", where the individual words are not as interesting as their combination. Word-level n-grams have also proven very useful in machine translation, in what is sometimes referred to as the "refrigerator magnet" model.
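As a toy illustration of that word-level counting (the two example sentences are made up), here is a minimal sketch of the kind of statistic the Ngram Viewer aggregates at a vastly larger scale:

```python
# Count word-level bigrams across a small corpus.
from collections import Counter

corpus = [
    "affordable child care programs",
    "child care and early education",
]

bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs

print(bigrams[("child", "care")])  # 2
```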

If you are not processing international text, byte-level n-grams may be meaningful, too.

OTHER TIPS

I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.

Of the two remaining levels, character-level n-grams need much less storage space and, consequently, hold much less information. They are usually used for tasks such as language identification, writer identification (i.e. fingerprinting), and anomaly detection.

As for word-level n-grams, they can serve the same purposes, and much more, but they need far more storage. For instance, you may need up to several gigabytes to hold a useful subset of English word 3-grams in memory (for general-purpose tasks). Yet, if the set of texts you need to work with is limited, word-level n-grams may not require that much storage.

As for the issue of errors, a sufficiently large word n-gram corpus will include and represent them as well. Besides, there are various smoothing methods to deal with sparsity.
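For the sparsity point, here is a minimal sketch of one such method, add-one (Laplace) smoothing, on a toy sentence; real systems use much larger corpora and stronger smoothers (e.g. Kneser-Ney):

```python
# Add-one (Laplace) smoothing for word bigrams: unseen bigrams get a
# small non-zero probability instead of zero.
from collections import Counter

tokens = "mary loves dogs and mary loves cats".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(set(tokens))

def smoothed_prob(w1, w2):
    # P(w2 | w1) with add-one smoothing
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

print(smoothed_prob("mary", "loves"))  # seen bigram: relatively high
print(smoothed_prob("dogs", "mary"))   # unseen bigram: small but non-zero
```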

The other issue with n-grams is that they will almost never capture all of the needed context, so they can only approximate it.

You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.

I use character n-grams on small strings, and word n-grams for tasks like text classification of larger chunks of text. It is a matter of which method will preserve the context you need, more or less...

In general, for text classification, word n-grams help a bit with word-sense disambiguation, where character n-grams would easily be confused and your features could be completely ambiguous. For unsupervised clustering, it depends on how general you want your clusters to be and on what basis you want documents to converge. I find that stemming, stop-word removal, and word bigrams work well in unsupervised clustering tasks on fairly large corpora.
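For that setting, here is a minimal sketch of such a feature pipeline, assuming scikit-learn is available (the stemming step would need an extra tool, e.g. NLTK's PorterStemmer, and is left out here):

```python
# Word bigrams with English stop-word removal, the kind of features
# one might feed into a classifier or clustering algorithm.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the mat",
]

vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat sat' 'dog sat' 'sat mat']
print(X.toarray())                         # one row of bigram counts per document
```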

Character n-grams are great for fuzzy string matching of small strings.

I like to think of a set of grams as a vector: imagine comparing the vectors built from the grams you have, then ask yourself whether what you are comparing preserves enough context to answer the question you are trying to answer.
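To make that mental model concrete, here is a minimal sketch in plain Python that ties the last two paragraphs together: short strings become character-bigram count vectors, and cosine similarity between those vectors drives a simple fuzzy match (the candidate names are made up):

```python
from collections import Counter
from math import sqrt

def bigram_counts(s, n=2):
    # character n-gram counts of a short string
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

candidates = ["Mary", "Maria", "Martin", "Harry"]
query = bigram_counts("Mari")
best = max(candidates, key=lambda c: cosine(query, bigram_counts(c)))
print(best)  # "Maria" shares the most character bigrams with "Mari"
```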

HTH
