Question

I'm trying to calculate a confidence score that a string appears within a subset of a much larger set.

Say I have 10 words in my original list and I match a new word against all 10 words. Each match returns a similarity score. I've set a threshold to ignore any similarity score that is below 70%. So at the end I'm left with my input word possibly matching 3 words within my list.

To me this gives a 33.333% chance that my input word matches each of the 3 words with the higher similarity scores. I want to calculate how confident I am that the word is a match for these three. I've calculated my confidence score as follows, but this seems wrong and way too simple.

  1. Cat 1 - 70% similarity - 33.3% chance.
  2. Cat 2 - 75% similarity - 33.3% chance.
  3. Cat 3 - 80% similarity - 33.3% chance.

((0.70) * (0.333)) + ((0.75) * (0.333)) + ((0.80) * (0.333)) = 75% Confident.

What is the best method of calculating confidence levels?

EDIT: Better Sample as requested

Original Word Set

  1. Hello
  2. Help
  3. Hell
  4. Problem
  5. World
  6. Ocean
  7. Animal
  8. Carrot
  9. Brown
  10. Black

Match New Word - Helicopter against original word set. The match returns 3 words from the original set with a similarity score of over 70%. The words returned were:

  1. Hello - Similarity 70%
  2. Help - Similarity 75%
  3. Hell - Similarity 80%

I want to calculate a score that shows how confident I am that helicopter is a match to the words returned.

Answer: see http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/ff9fc38e-8ca3-4d9a-b505-dfbe37910b17

Solution

Your probabilities are not right (or are not probabilities). You seem to have assumed that your word is a match for one of the top three similarity scores (if it is, your confidence level is de facto 100%...). Also, the probability and similarity scores are not independent, so your calculation is also flawed if you're looking for anything that has a basis in probability/statistics.

What you have actually done is work out the mean "similarity" for the top three cases. If that's acceptable as your (non-statistical) confidence level, then that's fine. But you're going to have to make a value call on this yourself - there's no mathematical basis really to what you are trying to do. To help further, you'll have to give us a lot more information on:

  • How your similarity score is calculated.
  • What the probability is that your word matches something in the list of 10 to start with.
  • How similar are the 10 words in your list.
  • etc. etc.
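
To make the point about the mean concrete, here is a minimal sketch using the three scores from the question (the variable names are illustrative, not from the original post). Weighting each score by an equal 1/3 "chance" and summing is arithmetically the same as averaging the scores, so the 75% figure carries no more probabilistic meaning than the scores themselves.

```python
# Similarity scores from the question's example (variable names are hypothetical).
similarities = [0.70, 0.75, 0.80]

# The question multiplies each score by an equal "chance" of 1/3 and sums.
weight = 1 / len(similarities)
weighted_sum = sum(score * weight for score in similarities)

# The plain arithmetic mean of the same scores.
mean_similarity = sum(similarities) / len(similarities)

print(f"weighted sum: {weighted_sum:.4f}")
print(f"mean:         {mean_similarity:.4f}")
```

Both numbers agree (up to floating-point rounding), which is why the result is a summary of the similarity scores rather than a confidence level in the statistical sense.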

Edit following your edit:

Your three "similarity" scores are far from independent, because the three words themselves are very "similar". And in any event, any algorithm that says "helicopter" is 80% similar to "hell" is not very good. I'd say the confidence level is pretty close to zero in this case....!
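
As a rough sanity check on those scores, the sketch below compares "helicopter" with the three returned words using `difflib.SequenceMatcher` from the Python standard library. This is an assumption for illustration only: the question never says how its similarity score is computed, so this is a different measure, not a reproduction of the asker's algorithm.

```python
from difflib import SequenceMatcher

# Words taken from the question; lowercased so case does not affect the comparison.
new_word = "helicopter"
candidates = ["hello", "help", "hell"]

# ratio() returns 2 * matches / total_characters, a value in [0, 1].
ratios = {
    word: SequenceMatcher(None, new_word, word).ratio()
    for word in candidates
}

for word, ratio in ratios.items():
    print(f"{word}: {ratio:.2f}")
```

Under this particular measure all three words fall well below the 70% threshold used in the question, which suggests the original scoring is rewarding the shared "hel" prefix far more than a general-purpose string comparison would.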

License: CC-BY-SA with attribution
Not affiliated with StackOverflow