Algorithm to match sentences about the same topic

Question 1

In general, you will need some natural language processing (NLP). If you are new to the subject, I recommend taking a look at NLTK, a Python library that includes tools for a variety of NLP problems. It also comes with a free book that gives a quick overview of the tools you may need.

http://www.nltk.org/book/

I hope it helps

Question 2

In general, you are looking for a topic model. See http://en.wikipedia.org/wiki/Topic_model for how documents are modeled in terms of the hidden "topics" they share; the article mentions some common models and algorithms. If you need something more advanced than what is on the wiki, a search should turn up papers.

Question 3

Levenshtein and Hamming distances measure character-level edits, so they are sensitive to local differences rather than to meaning. If you want to find the topic behind a sentence, it is better to consider all the words in the sentence together.
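To illustrate why character-level distances are a poor fit here, the following is a minimal sketch in plain Python. The function names and the example sentences are made up for illustration; they do not come from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical examples: close in characters, different in meaning...
near_a = "The cat sat on the mat"
near_b = "The car sat on the map"
# ...and far apart in characters, but about the same event.
topic_a = "The cat chased the mouse"
topic_b = "A mouse was chased by the cat"

print(hamming(near_a, near_b), levenshtein(near_a, near_b))
print(levenshtein(topic_a, topic_b))
```

The first pair differs by only two characters even though the sentences talk about different things, while the second pair needs many more edits despite describing the same event.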

A simple whole-sentence approach is tf-idf. Treat each sentence as a document, count the number of times a term (word) appears in the sentence, and divide by the number of documents that term appears in. This gives a weight for each distinct term in the sentence. (The usual formulation multiplies the count by the logarithm of the inverse document frequency, log(N/df), but the idea is the same: terms that appear in many documents get a low weight.) Sentences with similar weights for the same terms are likely to be about the same topic.
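The weighting described above can be sketched in a few lines of plain Python. This follows the simplified "count divided by document frequency" description rather than the more common log-scaled variant, and the helper names (`tokenize`, `simple_tfidf`) and the whitespace tokenizer are assumptions made for the example.

```python
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return sentence.lower().split()


def simple_tfidf(sentences: list[str]) -> list[dict[str, float]]:
    """For each sentence, map each term to
    (count in this sentence) / (number of sentences containing the term)."""
    term_counts = [Counter(tokenize(s)) for s in sentences]

    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    return [
        {term: count / doc_freq[term] for term, count in counts.items()}
        for counts in term_counts
    ]


sentences = ["the cat sat", "the dog sat", "the cat ran"]
weights = simple_tfidf(sentences)
for sentence, w in zip(sentences, weights):
    print(sentence, w)
```

In this toy corpus "the" occurs in every sentence and so gets the lowest weight, while "dog" and "ran" occur in only one sentence each and get the highest.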

You could start with that simple approach, and then try lemmatization or other term-grouping schemes if you need better results.

A simple way to compare the per-sentence term weights is cosine similarity.
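As a rough sketch, cosine similarity between two sentences can be computed directly on sparse term-to-weight dictionaries. The function name and the hand-written weights below are illustrative assumptions; in practice the weights would come from whatever tf-idf scheme you use.

```python
import math


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors stored as
    term -> weight dictionaries. Returns 0.0 if either vector is empty."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-sentence term weights.
cat_sat = {"cat": 0.5, "sat": 0.5}
cat_ran = {"cat": 0.5, "ran": 1.0}
dog_only = {"dog": 1.0}

print(cosine_similarity(cat_sat, cat_ran))   # shares "cat": between 0 and 1
print(cosine_similarity(cat_sat, dog_only))  # no shared terms
```

A score near 1 means the two sentences weight the same terms in similar proportions; a score of 0 means they share no terms at all.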