Question

Hello, I got an assignment on Information Retrieval and I could not figure out how to create that partial specification, I mean the values of the words, like here: http://nlp.stanford.edu/IR-book/html/htmledition/finite-automata-and-language-models-1.html

the = 0.2

a = 0.1

frog = 0.01 ... and so on. I would be thankful if someone could explain how to calculate these values.

Learn about Language models!

a) Explain the idea!

b) Consider the following document collection:

D1: Today is sunny. Sunny Berlin! To be or not to be.

D2: She is in Berlin today. She is a sunny girl. Berlin is always exciting!

Calculate the corresponding Unigram Language Model for each document! Assume the stop probability to be fixed across models (and equal to 0.2). Use these models to rank the documents given the query "sunny Berlin"!


Solution

The values of those words are not calculated on that page. They are obtained from statistics or from the definition of the model.

For example, if you look at the picture below, there are two different models with different probabilities for each word. As the designer of your model, you need to define the probabilities yourself.

[Image: two language models, each with its own table of word probabilities]
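For the assignment in the question, one common way to get the values "from statistics" is a maximum-likelihood estimate: count how often each word occurs in a document and divide by the document length. The sketch below is only one possible reading of the task. It assumes lowercasing, stripping punctuation, no smoothing, and the stop-probability convention from the linked IR-book page (continue with probability 0.8 after every term except the last, stop with probability 0.2 after the last). The function names are made up for illustration.

```python
import re
from collections import Counter

STOP_PROB = 0.2  # fixed stop probability given in the assignment


def unigram_model(text):
    """Maximum-likelihood unigram model: P(w) = count(w) / number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumption: lowercase, drop punctuation
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def query_likelihood(query, model, stop_prob=STOP_PROB):
    """P(query | model) for a non-empty query.

    Product of the term probabilities, times (1 - stop_prob) after every
    term except the last, times stop_prob after the last term.
    """
    terms = re.findall(r"[a-z]+", query.lower())
    score = stop_prob * (1 - stop_prob) ** (len(terms) - 1)
    for term in terms:
        score *= model.get(term, 0.0)  # unseen terms get probability 0 (no smoothing)
    return score


docs = {
    "D1": "Today is sunny. Sunny Berlin! To be or not to be.",
    "D2": "She is in Berlin today. She is a sunny girl. Berlin is always exciting!",
}
models = {name: unigram_model(text) for name, text in docs.items()}
scores = {name: query_likelihood("sunny Berlin", m) for name, m in models.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 5))
```

Under these assumptions D1 has 11 tokens (sunny = 2/11, berlin = 1/11) and D2 has 14 tokens (sunny = 1/14, berlin = 2/14), so D1 scores about 0.00264 and D2 about 0.00163, and D1 is ranked first. Because the stop probability is the same in both models, it scales both scores by the same factor and does not change the ranking.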

If you don't understand what a language model is, here is a simple example:

Imagine that people living in London have one language model, M1, and people living in NY have another language model, M2.

Based on some statistics, we know that people in London use the word "sunny" twice as often as people in NY (for whatever reason), so in M1 the probability of "sunny" will be 0.04 and in M2 it will be 0.02. Referring to other texts (TV, magazines and so on), we can define with what probability people in London (M1) and NY (M2) use other words, and we create a table like the one shown above.

Now we have a sentence, "She is a sunny girl", and we don't know whether it comes from a person in London or in NY.

Referring to the table, we can guess that it more likely comes from a Londoner (M1), because they use the word "sunny" more often!
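The comparison can be sketched in a few lines. Only the two "sunny" values (0.04 and 0.02) come from the example; every other probability in the tables below is a made-up placeholder, deliberately set to the same value in both models so that "sunny" is the only word that tells them apart. The stop probability is left out because it is the same for both models.

```python
# Hypothetical word tables: only the "sunny" entries come from the example above.
# All other values are invented placeholders, identical in both models.
M1 = {"she": 0.01, "is": 0.03, "a": 0.1, "sunny": 0.04, "girl": 0.005}  # London
M2 = {"she": 0.01, "is": 0.03, "a": 0.1, "sunny": 0.02, "girl": 0.005}  # NY


def sentence_probability(sentence, model):
    """Unigram probability of a sentence: the product of its word probabilities."""
    probability = 1.0
    for word in sentence.lower().split():
        probability *= model[word]
    return probability


sentence = "She is a sunny girl"
p1 = sentence_probability(sentence, M1)
p2 = sentence_probability(sentence, M2)
more_likely = "M1 (London)" if p1 > p2 else "M2 (NY)"
print(more_likely, p1 / p2)
```

With these placeholder tables the sentence is twice as likely under M1 as under M2, which is exactly the factor by which the "sunny" probabilities differ. With real tables the other words would also contribute to the comparison.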

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow