Fuzzy Matching, Confidence Score, C#

https://stackoverflow.com/questions/10603449

09-06-2021

Question

I'm trying to calculate a confidence score that a string appears within a subset of a much larger set.

Say I have 10 words in my original list and I match a new word against all 10 words. Each match returns a similarity score. I've set a threshold to ignore any similarity score that is below 70%. So at the end I'm left with my input word possibly matching 3 words within my list.

To me this gives a 33.333% chance that my input word is a match for any one of the 3 words with the higher similarity scores. I want to calculate how confident I am that the word is a match for these three. I've calculated my confidence score as follows, but this seems wrong and way too simple.

Cat 1 - 70% similarity - 33.3% chance.
Cat 2 - 75% similarity - 33.3% chance.
Cat 3 - 80% similarity - 33.3% chance.

((0.70) * (0.333)) + ((0.75) * (0.333)) + ((0.80) * (0.333)) = 75% Confident.

What is the best method of calculating confidence levels?

EDIT: Better Sample as requested

Original Word Set

Hello
Help
Hell
Problem
World
Ocean
Animal
Carrot
Brown
Black

Match New Word - Helicopter against original word set. The match returns 3 words from the original set with a similarity score of over 70%. The words returned were:

1. Hello - Similarity 70%
2. Help - Similarity 75%
3. Hell - Similarity 80%

I want to calculate a score that shows how confident I am that Helicopter is a match for the words returned.

Answer: see http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/ff9fc38e-8ca3-4d9a-b505-dfbe37910b17

Solution

Your probabilities are not right (or are not probabilities). You seem to have assumed that your word is a match for one of the top three similarity scores (if it is, your confidence level is de facto 100%...). Also, the probability and similarity scores are not independent, so your calculation is also flawed if you're looking for anything that has a basis in probability/statistics.

What you have actually done is work out the mean "similarity" for the top three cases. If that's acceptable as your (non-statistical) confidence level, then that's fine. But you're going to have to make a value call on this yourself - there's no mathematical basis really to what you are trying to do. To help further, you'll have to give us a lot more information on:

How your similarity score is calculated.
What the probability is that your word matches something in the list of 10 to start with.
How similar are the 10 words in your list.
etc. etc.
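
To see why the calculation in the question is just the mean of the three similarity scores, here is a minimal C# sketch. The class and method names (ConfidenceSketch, EqualWeightScore, Mean) are made up for illustration and do not come from the question; the only inputs are the three scores quoted there.

```csharp
using System;
using System.Linq;

public static class ConfidenceSketch
{
    // The question's formula: each similarity multiplied by an equal
    // 1/n "chance", then summed.
    public static double EqualWeightScore(double[] similarities)
    {
        double weight = 1.0 / similarities.Length;
        return similarities.Sum(s => s * weight);
    }

    // The plain arithmetic mean of the same scores.
    public static double Mean(double[] similarities)
    {
        return similarities.Average();
    }
}
```

Calling both methods with new[] { 0.70, 0.75, 0.80 } gives 0.75 (up to floating-point rounding) in each case, so the 33.3% weights add no information beyond averaging.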

Edit following your edit:

Your three "similarity" scores are far from independent, because the three words themselves are very "similar". And in any event, any algorithm that says "helicopter" is 80% similar to "hell" is not very good. I'd say the confidence level is pretty close to zero in this case!
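
For comparison, here is a rough C# sketch of one common similarity measure, Levenshtein edit distance normalized by the longer string's length. This is an assumption chosen for illustration: the question never says how its scores were produced, so this is not the asker's algorithm, and the names EditDistanceSketch, Levenshtein and Similarity are hypothetical.

```csharp
using System;

public static class EditDistanceSketch
{
    // Classic dynamic-programming Levenshtein distance: the minimum number
    // of single-character insertions, deletions and substitutions needed
    // to turn a into b. Only two rows of the table are kept in memory.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1, curr[j - 1] + 1), // delete / insert
                    prev[j - 1] + cost);                    // match / substitute
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    // The comparison is case-sensitive.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

Under this measure, "helicopter" and "hell" are 7 edits apart, which works out to a similarity of about 0.30, well below the 70% threshold in the question. Other measures will give other numbers, so the figure only illustrates how much the score depends on the algorithm behind it.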

License: CC-BY-SA with attribution

Not affiliated with StackOverflow