Question

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:

Sentence: "I live in NY."

word level bigrams (2 for n): "# I", "I live", "live in", "in NY", "NY #"
character level bigrams (2 for n): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:

word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
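As a rough check of the example above, here is a minimal Python sketch that builds the character bigrams and counts them. The helper name `char_bigrams` and the choice to strip trailing punctuation are assumptions made for illustration, not part of the original question. Note that with case preserved every bigram in this sentence occurs once; a count of 2 only appears if the text is lowercased first, so that "#I" and "#i" merge.

```python
from collections import Counter

def char_bigrams(sentence, boundary="#"):
    """Character bigrams per word, padding each word with a boundary marker."""
    grams = []
    for word in sentence.split():
        # Assumption: punctuation at the edges of a word is dropped.
        padded = boundary + word.strip(".,!?") + boundary
        grams.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams

sentence = "I live in NY."
case_sensitive = Counter(char_bigrams(sentence))
case_folded = Counter(char_bigrams(sentence.lower()))

print(case_sensitive["#I"], case_sensitive["#i"])  # 1 1
print(case_folded["#i"])                           # 2
```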

Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12 - but without stating why.

What is a good nGram value and how should I take it into account when using a tool like LingPipe?

Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

Solution 2

N-grams are simply all sequences of n adjacent words or letters that you can find in your source text. For example, given the word "fox", all 2-grams (or “bigrams”) are "fo" and "ox". You may also count the word boundary – that would expand the list of 2-grams to "#f", "fo", "ox", and "x#", where # denotes a word boundary.
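The "fox" example can be sketched in a few lines of Python. The function name `char_ngrams` and its `boundary` parameter are illustrative assumptions; the sketch pads with a single boundary character on each side, which matches the bigram case described here.

```python
def char_ngrams(word, n=2, boundary=None):
    """Return all character n-grams of a word, optionally with boundary padding."""
    if boundary is not None:
        word = boundary + word + boundary
    # Slide a window of width n across the (possibly padded) word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("fox"))                # ['fo', 'ox']
print(char_ngrams("fox", boundary="#"))  # ['#f', 'fo', 'ox', 'x#']
```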

You can do the same on the word level. As an example, the text "hello, world!" contains the following word-level bigrams: "# hello", "hello world", "world #".
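A word-level version might look like the sketch below. The tokenizer here (a simple `\w+` regular expression that discards punctuation) is an assumption chosen to reproduce the example; real tools such as LingPipe use their own tokenization.

```python
import re

def word_ngrams(text, n=2, boundary="#"):
    """Return word-level n-grams, with a boundary marker at each end of the text."""
    tokens = re.findall(r"\w+", text)  # assumption: punctuation is discarded
    padded = [boundary] + tokens + [boundary]
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(word_ngrams("hello, world!"))  # ['# hello', 'hello world', 'world #']
```

The same function with `n=3` on "I live in NY." yields "# I live", "I live in", "live in NY", "in NY #".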

The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
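One way to see the "too long" side of this trade-off is to measure how many n-grams in a text occur only once: as n grows, almost every n-gram becomes unique, so the counts stop generalizing. The toy sentence and the helper `singleton_share` below are made up for illustration, and on real data the exact numbers will differ.

```python
from collections import Counter

def singleton_share(tokens, n):
    """Fraction of n-gram occurrences whose n-gram appears exactly once."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    return sum(1 for g in grams if counts[g] == 1) / len(grams)

tokens = "the cat sat on the mat the cat ate the rat".split()
for n in (1, 2, 3):
    print(n, round(singleton_share(tokens, n), 2))
# 1 0.45
# 2 0.8
# 3 1.0
```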

OTHER TIPS

Usually a picture is worth a thousand words. [image not available]

Source: http://recognize-speech.com/language-model/n-gram-model/comparison

An n-gram is an n-tuple: a group of n words or characters that follow one another (the "gram" comes from the Greek for "letter" or "something written"). So with an n of 3, the word-level n-grams from your sentence would be "# I live", "I live in", "live in NY", "in NY #". These are used to build an index of how often words follow one another. You can use that index in a Markov chain to generate text that resembles the language. As you populate a mapping of the distributions of word groups or character groups, you can recombine them, and the longer the n-gram, the more likely the output is to read as natural language.

If n is too high, your output will be a word-for-word copy of the original; if it is too low, the output will be too messy.
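A minimal Markov chain sketch in Python illustrates both extremes. The function names, the toy corpus, and the fixed random seed are assumptions for illustration only. Here `order` is the number of context words, so the model is built from n-grams of length `order + 1`.

```python
import random
from collections import defaultdict

def build_model(tokens, order):
    """Map each context of `order` words to the words observed right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(tokens, order, length, seed=0):
    """Generate up to `length` words, starting from the first `order` words."""
    rng = random.Random(seed)
    model = build_model(tokens, order)
    out = list(tokens[:order])
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:  # context never seen with a follower: stop
            break
        out.append(rng.choice(followers))
    return out

tokens = "the cat sat on the mat the cat ate the rat".split()
print(" ".join(generate(tokens, order=1, length=11)))  # a loose recombination
print(" ".join(generate(tokens, order=3, length=11)))  # reproduces the original text
```

With `order=3` every context in this tiny corpus has exactly one follower, so the generator can only copy the source. With `order=1` it wanders between words that happened to be adjacent somewhere in the text.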

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow