Pergunta

I'm working on an app which is creating a data bank of questions from old question papers. I wanted to maintain a table linking similar questions together as they were inserted. (The table I had in mind was a Modified Preordered Traversal Tree).

The requirements I have are:

  1. Word problems with changed numbers should be linked together
  2. Word problems with proper nouns/names being different should be linked together.
  3. XYZ, ABC, PQR, MNO are equivalent (eg. triangle nomenclatures)
  4. Ignore punctuations and conjunctions and 'small words'.
  5. Tags! I'm tagging each question with its subject. The likelyhood of a Math question being similar to a History question is rare. But a Chemistry thermodynamics question could be similar to a Physics thermodynamics question.

Any idea on how to proceed on the algorithm side of things would be very much appreciated.

Also I'll be dealing with images containing Math notation. Should I make sure all my images have LaTeX in the 'ALT' attribute to make sure they are too processable by this algorithm or is there a better way of doing it?

Foi útil?

Solução

It sounds like you want to consider two questions to be similar when they have the same sentence structure, after stripping out a laundry list of syntactic patterns you expect to vary. As such this problem looks similar to the problem of detecting near-duplicate documents in a corpus.

One way to do that is a technique called "simhashing"; one takes a (preprocessed) document and calculates a simhash fingerprint. Like a typical hash, the fingerprint has a fixed size and looks like binary gibberish. Unlike a typical hash, documents that are textually similar will also have similar fingerprints. By choosing a maximum (Hamming) distance that fingerprints can differ by, you can define clusters of documents (questions) that you consider "similar".

The process for indexing a new question would then look like this:

  1. Normalize the question text. This is a standard information retrieval task and it means slightly different things to everyone, but transformations like collapsing whitespace, putting everything in lower-case and stripping punctuation are typical. You may also want to turn all numbers or a white list of proper names into place holders ("NUMBER", "NAME", etc.)
  2. Feed the resulting text into a simhash implementation to get a fingerprint.
  3. Search your corpus for fingerprints sufficiently near the new one. This is actually trickier than you might think to do efficiently. Google came up with a reasonable approach, which boils down to a sorted table of finger prints with a few artificial ones added in.
  4. Having found the questions the inserted one should be considered similar to, you're free discard the duplicate question, hang on to it and do book keeping, etc.

This book is an excellent primer on information retrieval in general. This is the simhash paper. Here's the manpage of a simple program to compute simhashes, it may be a good starting point if you don't want to implement the algorithm yourself.

Outras dicas

You need to do feature extraction and read up on this.

The key to data mining is preprocessing. You can't just throw your harddisk at it and expect it to find anything useful, but you need to direct the search by preprocessing the data appropriately. This maybe is 90% of the work. So read up on feature extraction!

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top