Question

I'd like to sift text (in particular, Twitter messages) to see if they relate to a particular topic. Have you been down that road? If so, I'd love to hear what approach you'd use.

For my case, just searching for topic keywords gets me useful text about 7% of the time; the keywords have multiple meanings, some of which aren't on topic. For my use, automatic filtering doesn't need to be perfect; I'd be happy if the extracted messages related to the topic 80% of the time. I'm also willing to lose 10-30% of the on-topic messages.

From a first pass by hand, I found some characteristics that make messages pretty likely to be good, like certain English phrases. Other characteristics give a high likelihood of rejection, like URLs, multiple hashtags, and certain other phrases. Still others are harder to evaluate.

I could manually make a bunch of regexes and associated weights, and tweak things by hand until I got output I liked. That could well work. But I can name several other possible approaches, and I'm wondering which ones Stack Overflow readers have had good luck with.

Thanks!

Solution

This is an entire field in itself! I recommend doing some research in the natural language processing literature.

There are ad hoc ways to do it, but they tend to be error-prone, producing many false positives and false negatives. They may still be a good starting point.

  1. If you filter on a keyword that has multiple meanings, you can try to disambiguate it by looking at the words that surround it. Doing so requires a processed corpus (a collection of documents) from which you can determine which words most often appear together and which are likely to share a meaning.

  2. You could measure the distance between the text you are analyzing and a document that is known to be on topic. To do that, compute word counts for both texts and compare the resulting term vectors, for example with cosine similarity. Look up "document vector model" (more commonly called the vector space model) for a more thorough treatment.
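
As a rough illustration of the second approach, here is a minimal sketch that builds word-count vectors and compares them with cosine similarity. The reference text, the sample messages, the tokenizing regex, and the 0.3 threshold are all made up for the example; a real filter would need a reference built from messages you have labeled by hand and a threshold tuned against them.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (a bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical reference text standing in for "known on-topic" material.
reference = term_vector("python programming language code function library script")

# Hypothetical messages that both match the keyword "python".
on_topic = term_vector("Writing a python script that calls a library function")
off_topic = term_vector("Saw a huge python at the zoo today")

THRESHOLD = 0.3  # illustrative cutoff, not a recommended value

for name, vec in [("on_topic", on_topic), ("off_topic", off_topic)]:
    score = cosine_similarity(reference, vec)
    print(name, round(score, 3), "keep" if score >= THRESHOLD else "reject")
```

Raw counts are the simplest choice here. Weighting terms (for example with TF-IDF) or comparing against several reference documents are common refinements, but whether they help on short messages like tweets is something you would have to measure.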

This is a good project to work on, but it is not simple.

Licensed under: CC-BY-SA with attribution