Question

I am a computer scientist working on a problem that requires some statistical measures, though (not being very well versed in statistics) I am not quite sure what statistics to use.

Overview:

I have a set of questions (from StackExchange sites, of course) and with this data, I am exploring algorithms that will find similar questions to one I provide. Yes, StackExchange sites already perform this function, as do many other Q&A sites. What I am trying to do is analyze the methods and algorithms that people employ to accomplish this task to see which methods perform best. My problem is finding appropriate statistical measures to quantitatively determine "which methods perform best."

The Data:

I have a set of StackExchange questions, each of which is saved like this: {'questionID':"...", 'questionText':"..."}. For each question, I have a set of other questions that are linked to it or from it. It is common practice for people answering questions on StackExchange sites to add links to other, similar posts in their answers, e.g. "Have you read this post [insert link to post here] by so-and-so? They're solving a similar problem..." I am considering these linked questions to be 'similar' to one another.

More concretely, let's say we have question A. Question A has a collection of linked questions {B, C, D}. So  A_linked = {B, C, D}. My intuition tells me that the transitive property does not apply here. That is, just because A is similar to B, and A is similar to C, I cannot confirm that B is similar to C. (Or can I?) However, I can confidently say that if A is similar to B, then B is similar to A.

So, to simplify these relationships, I will create a set of similar pairs: {A, B}, {A, C}, {A, D}. These pairs will serve as a ground truth of sorts. These are questions we know are similar to one another, so their similarity confidence value equals 1, i.e. similarityConfidence({A,B}) = 1.
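For illustration, here is a minimal sketch of how this pair set could be built, assuming a hypothetical dictionary linked that maps each questionID to the IDs of the questions linked to or from it (that dictionary is not part of my data description above; it only exists for the example):

# Build the symmetric set of known-similar pairs from the link structure.
# `linked` is a hypothetical dict: questionID -> iterable of linked questionIDs.
ground_truth_pairs = set()
for question_id, linked_ids in linked.items():
    for other_id in linked_ids:
        # frozenset makes {A, B} and {B, A} the same pair
        ground_truth_pairs.add(frozenset((question_id, other_id)))

def similarityConfidence(pair):
    # Known linked pairs get confidence 1; anything else is simply unknown
    return 1.0 if frozenset(pair) in ground_truth_pairs else None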

Something to note about this set-up is that we know only a few similar questions for each question in our dataset. What we don't know is whether some other question E is also similar to A. It might be, it might not be; we don't know. So our 'ground truth' is really only part of the truth.

The algorithm:

A simplified pseudocode version of the algorithm is this:

for q in questions:  # remember q = {'questionID':"...", 'questionText':"..."}
    similarities = {}  # will hold a mapping from questionID to similarity to q
    q_Vector = vectorize(q)  # create a vector from the question text (each word is a dimension)
    for o in questions:
        if o['questionID'] == q['questionID']:
            continue  # skip comparing a question with itself
        o_Vector = vectorize(o)
        # values range from 1.0 (identical) down to 0.0 (not similar at all)
        similarities[o['questionID']] = cosineSimilarity(q_Vector, o_Vector)
    # now what???

So now I have a complete mapping of cosine similarity scores between q and every other question in my dataset. My ultimate goal is to run this code for many variations of the vectorize() function (each of which will return a slightly different vector) and determine which variation performs best in terms of cosine scores.
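For concreteness, one naive baseline version of vectorize() and cosineSimilarity() might look like the sketch below. This is just a plain bag-of-words example to pin down the shapes of the inputs and outputs; my actual variations of vectorize() will differ:

import math
from collections import Counter

def vectorize(question):
    # Naive bag-of-words: one dimension per lowercase token, value = term count.
    # Other variations might use TF-IDF weights, stemming, stop-word removal, etc.
    return Counter(question['questionText'].lower().split())

def cosineSimilarity(vec_a, vec_b):
    # Cosine of the angle between two sparse term-count vectors.
    dot = sum(count * vec_b.get(term, 0) for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)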

The Problem:

So here lies my question. Now what? How do I quantitatively measure how good these cosine scores are?

These are some ideas of measurements I've brainstormed (though I feel like they're unrefined, incomplete):

  • Some sort of error function similar to Root Mean Square Error (RMSE). So for each document in the ground-truth similarities list, accumulate the squared error (with error roughly defined as 1 - similarities[questionID]). We would then divide that accumulation by twice the total number of similar pairs (since we will consider a->b as well as b->a). Finally, we'd take the square root of this error (see the sketch after this list).

    • This requires some thought, since these values may need to be normalized. Though all variations of vectorize() will produce cosine scores in the range of 0 to 1, the cosine scores from two different vectorize() functions may not be directly comparable to one another. vectorize_1() might produce generally high cosine scores for every question, so a score of .5 might be a very low score. Alternatively, vectorize_2() might produce generally low cosine scores for every question, so a .5 might be a very high score. I need to account for this variation somehow.
    • Also, I proposed an error function of 1-similarities[questionID]. I chose 1 because we know that the two questions are similar, therefore our similarity confidence is 1. However, a cosine similarity score of 1 means the two questions are identical. We are not claiming that our 'linked' questions are identical, merely that they are similar. Is this an issue?  
  • We can find the recall (number of similar documents returned/number of similar documents), so long as we set a threshold for which questions we return as 'similar' and which we do not.

    • Although, for the reasons mentioned above, this shouldn't be a predefined threshold like similarity[documentID] > 0.7, because each vectorize() function may return different values.
  • We could find recall @ k, where we only analyze the top k posts (also covered in the sketch after this list).

    • This could be problematic though, because we don't have the full ground truth. If we set k=5, and only 1 document (B) of the 3 documents we knew to be relevant ({B,C,D}) were in the top 5, we would not know whether the other 4 top documents are irrelevant or are in fact just as similar to A as (or more similar than) the 3 we knew about; perhaps no one simply happened to link them.
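To make the first and third ideas concrete, here is a rough sketch of how the RMSE-like error and recall @ k could be computed. It assumes the per-question similarity mappings from the pseudocode above have been collected into a dictionary similarities_by_question (questionID -> {otherID: cosine score}), and that linked maps each questionID to its known-similar questionIDs; both names are just for illustration:

import math

def rmse_like_error(similarities_by_question, ground_truth_pairs):
    # Accumulate (1 - score)^2 for every known-similar pair, in both directions,
    # then average and take the square root.
    squared_error, count = 0.0, 0
    for pair in ground_truth_pairs:
        a, b = tuple(pair)
        for src, dst in ((a, b), (b, a)):
            squared_error += (1.0 - similarities_by_question[src][dst]) ** 2
            count += 1
    return math.sqrt(squared_error / count)

def recall_at_k(similarities_by_question, linked, k=5):
    # Fraction of known-similar questions that show up in the top-k ranking.
    hits, total = 0, 0
    for qid, linked_ids in linked.items():
        ranking = sorted(similarities_by_question[qid],
                         key=similarities_by_question[qid].get, reverse=True)
        hits += len(set(ranking[:k]) & set(linked_ids))
        total += len(linked_ids)
    return hits / total if total else 0.0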

Do you have any other ideas? How can I quantitatively measure which vectorize() function performs best?


Solution

First, note that this question is highly relevant to the Information Retrieval problems of similarity search and near-duplicate detection.

As far as I see it, your problem can be split into two problems:

  1. Determining ground truth: In many 'competitions' where the ground truth is unclear, the way to determine which documents are relevant is to take the documents that were returned by X% of the candidates.
  2. Choosing the best candidate: first note that comparing the raw scores of two different algorithms is usually meaningless - the scales can be completely different. To compare two algorithms, you should use the ranking each algorithm produces: how each algorithm orders the documents, and how far that ordering is from the ground truth.
    A naive way to do this is to use precision and recall, which you can combine into the F-measure. The problem is that a document ranked 10th then counts as much as a document ranked 1st.
    A better way is NDCG (Normalized Discounted Cumulative Gain) - this is the most common way to compare algorithms in the articles I have encountered, and it is widely used in the main IR conferences: WWW and SIGIR. NDCG assigns a score to a ranking, giving high importance to documents that were ranked 'better' and reduced importance to documents that were ranked 'worse'. Another common variation is NDCG@k, where NDCG is computed only over the top k documents for each query (a small sketch follows below).
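For example, a minimal binary-relevance NDCG@k (treating the linked questions as the relevant set) could look like the sketch below; the function name and arguments are only illustrative, not a reference implementation:

import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # DCG@k with binary relevance: each relevant document at 1-based rank r
    # contributes 1 / log2(r + 1), i.e. 1 / log2(pos + 2) for 0-based pos.
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    # Ideal DCG: all known-relevant documents ranked at the very top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

For instance, ndcg_at_k(['B', 'E', 'C', 'F', 'D'], {'B', 'C', 'D'}, k=5) gives roughly 0.89: the ranking is rewarded for placing B first, and penalized only mildly for placing C and D lower down.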

Hope this background and advice help.

Licensed under: CC-BY-SA with attribution