Fuzzy Matching, Confidence Score, C#
09-06-2021
Question
I'm trying to calculate a confidence score that a string matches one of a small subset of words taken from a much larger set.
Say I have 10 words in my original list and I match a new word against all 10 words. Each match returns a similarity score. I've set a threshold to ignore any similarity score below 70%. So at the end I'm left with my input word possibly matching 3 words within my list.
As I see it, this gives each of the 3 words with the highest similarity scores a 33.333% chance of being the match for my input word. I want to calculate how confident I am that the word is a match for one of these three. I've calculated my confidence score as follows, but this seems wrong and way too simple.
- Cat 1 - 70% similarity - 33.3% chance.
- Cat 2 - 75% similarity - 33.3% chance.
- Cat 3 - 80% similarity - 33.3% chance.
((0.70) * (0.333)) + ((0.75) * (0.333)) + ((0.80) * (0.333)) = 75% Confident.
What is the best method of calculating confidence levels?
EDIT: Better Sample as requested
Original Word Set
- Hello
- Help
- Hell
- Problem
- World
- Ocean
- Animal
- Carrot
- Brown
- Black
Match new word "Helicopter" against the original word set. The match returns 3 words from the original set with a similarity score of at least 70%. The words returned were:
- Hello - Similarity 70%
- Help - Similarity 75%
- Hell - Similarity 80%
I want to calculate a score that shows how confident I am that "Helicopter" is a match for the words returned.
Answer at: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/ff9fc38e-8ca3-4d9a-b505-dfbe37910b17
Solution
Your probabilities are not right (or are not probabilities). You seem to have assumed that your word is a match for one of the top three similarity scores (if it is, your confidence level is de facto 100%...). Also, the probability and similarity scores are not independent, so your calculation is also flawed if you're looking for anything that has a basis in probability/statistics.
What you have actually done is work out the mean "similarity" for the top three cases. If that's acceptable as your (non-statistical) confidence level, then that's fine. But you're going to have to make a value call on this yourself - there's no mathematical basis really to what you are trying to do. To help further, you'll have to give us a lot more information on:
- How your similarity score is calculated.
- What the probability is that your word matches something in the list of 10 to start with.
- How similar the 10 words in your list are to each other.
- etc. etc.
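To see why the calculation in the question is just the mean similarity, here is a minimal C# sketch. The class and method names (ConfidenceCheck, EqualWeightSum) are made up for illustration; the three scores are the ones from the question.

```csharp
using System;
using System.Linq;

public static class ConfidenceCheck
{
    // The sum as written in the question: each similarity score
    // multiplied by the same 1/n "chance", then added up.
    public static double EqualWeightSum(double[] scores) =>
        scores.Sum(s => s * (1.0 / scores.Length));

    public static void Main()
    {
        double[] scores = { 0.70, 0.75, 0.80 };

        // With equal weights the weighted sum and the plain average coincide.
        Console.WriteLine($"Equal-weight sum: {EqualWeightSum(scores):F4}");
        Console.WriteLine($"Plain average:    {scores.Average():F4}");
    }
}
```

Both lines report the same value, 0.75. Multiplying by 1/3 adds no probabilistic information; it only averages the scores that survived the threshold.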
Edit following your edit:
Your three "similarity" scores are far from independent, because the three words themselves are very "similar". And in any event, any algorithm that says "helicopter" is 80% similar to "hell" is not very good. I'd say the confidence level is pretty close to zero in this case....!
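As a sanity check on those scores, here is a rough C# sketch that compares the words using Levenshtein edit distance. This is an assumption on my part: the question never says how its similarity score is calculated, and the normalisation used here (1 minus distance divided by the longer length) is only one common choice. The names SimilarityCheck, Levenshtein and Similarity are illustrative.

```csharp
using System;

public static class SimilarityCheck
{
    // Classic dynamic-programming Levenshtein distance:
    // minimum number of single-character inserts, deletes and substitutions.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed normalisation (not the asker's): 1 - distance / longer length.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0; // two empty strings are identical
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }

    public static void Main()
    {
        string input = "helicopter";
        foreach (var candidate in new[] { "hello", "help", "hell" })
            Console.WriteLine($"{candidate}: {Similarity(input, candidate):P0}");
    }
}
```

Under this measure "hello" and "help" score 40% and "hell" scores 30% against "helicopter", so none of the three would clear a 70% threshold. A different metric will give different numbers, but it shows how far the 70-80% figures are from what an edit-distance view suggests.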