Question

I've been researching different algorithms, but haven't found exactly what I'm looking for.

Hamming distance (only good for strings of the same length)

Levenshtein distance (finds similar words like kitten and sitten)

What I'm looking for is something that would find sentences about the same idea.

For example:

Sentence 1: Josh got hurt while playing in the park.
Sentence 2: Josh fell off the slide and got hurt at the park.
Sentence 3: Be careful at the park, your kids could get hurt.
Sentence 4: Josh likes to go shopping.

What I'm looking for would consider sentences 1 and 2 on topic, but not sentences 3 or 4.

I guess I could try to compare each word in the sentence?

I would greatly appreciate anyone who could point me in the right direction.


Solution

In general you would need to use some natural language processing (NLP). If you are new to the subject, I recommend taking a look at NLTK. It is a Python library that includes tools for a variety of NLP problems. They also have a free book that gives a quick overview of the tools you may need.

www.nltk.org/book/

I hope it helps

Other tips

Check out http://en.wikipedia.org/wiki/Topic_model to see how people model documents in terms of hidden "topics" that they share. Some common models and algorithms are mentioned. In general you are looking for a topic model. Some googling should find papers if you are looking for more advanced stuff than what's on the wiki.

Levenshtein and Hamming distances are very concerned with differences at a local level. If you want to look for the topic behind the sentence, it's better to consider all the words in the sentence together.

A simple whole-sentence approach would be tf-idf. Treat each sentence as a document. For each term (word), count the number of times it appears in the sentence and weight that count by how rare the term is across all documents; the usual form multiplies the count by log(N / df), where N is the number of documents and df is the number of documents containing the term. That gives a number for each distinct term in the sentence. Sentences with similar numbers for the same terms are likely to be about the same topic.
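As a rough illustration of the weighting described above, here is a minimal sketch using only the Python standard library. The helper names `tokenize` and `tfidf_vectors` are hypothetical, the tokenizer is a deliberately crude regex, and the weight used is the plain count times log(N / df) variant; libraries often apply smoothing or normalization on top of this.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_vectors(sentences):
    """Return one {term: weight} dict per sentence.

    weight = (count of the term in the sentence) * log(N / df), where N is
    the number of sentences and df is the number of sentences that contain
    the term. This is one common variant, chosen here as an assumption.
    """
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)

    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in counts.items()}
        for counts in docs
    ]


sentences = [
    "Josh got hurt while playing in the park.",
    "Josh fell off the slide and got hurt at the park.",
    "Be careful at the park, your kids could get hurt.",
    "Josh likes to go shopping.",
]

vectors = tfidf_vectors(sentences)

# Show the terms of sentence 1, most distinctive first.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in only one sentence (such as "playing") get the largest weights, while terms shared by most sentences (such as "the" or "park") are pushed toward zero.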

You could start with plain words as the terms, and then try lemmatization or other grouping schemes if you need better results.

A simple way to compare the numbers for two sentences is cosine similarity: treat each sentence's term weights as a vector and measure the angle between the two vectors.
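A minimal sketch of cosine similarity over sparse term vectors might look like the following. The helper names `word_counts` and `cosine_similarity` are hypothetical, and for brevity the vectors here are raw word counts (an assumption); tf-idf weights stored as `{term: weight}` dicts could be passed in the same way.

```python
import math
import re
from collections import Counter


def word_counts(sentence):
    """Turn a sentence into a {term: count} vector (crude regex tokenizer)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


s1 = word_counts("Josh got hurt while playing in the park.")
s2 = word_counts("Josh fell off the slide and got hurt at the park.")
s4 = word_counts("Josh likes to go shopping.")

print(cosine_similarity(s1, s2))  # sentences 1 and 2 share several terms
print(cosine_similarity(s1, s4))  # sentences 1 and 4 share only "josh"
```

The score for sentences 1 and 2 comes out higher than for sentences 1 and 4. Keep in mind that this only measures word overlap, so sentence 3 would still get a nonzero score against sentence 1 because it shares "hurt", "the", and "park"; separating those cases needs a threshold or a richer model.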

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow