Question

Suppose I have an error log and I wish to get a count of each type of error. I have already performed a naive count by grouping on the error message, but many of the messages contain context-specific information, so errors caused by the same bug cannot simply be grouped by message text.

Some examples, where the user ID and page segments vary per instance of the error:

  • failed to retrieve results for user 188a9e12-6797-4d9b-8adf-4588b2435326 on page /primate/gorilla
  • failed to retrieve results for user 08c610d2-27d2-4f97-bf60-d5b3010e8dd6 on page /primate/monkey

I would like to group all such messages using some fuzzy logic. I understand the Levenshtein distance algorithm is useful for this type of processing, but I suspect the raw distance alone is not, because it is not normalised against the strings' length (a distance of 30 is less significant in a string of 1,000 characters than in one of 100).
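To illustrate the normalisation I have in mind, here is a minimal sketch in Python (a plain dynamic-programming Levenshtein, scaled by the longer string's length; the 0.0–1.0 score is just my guess at what a length-independent distance could look like):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programming over a single rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalised_distance(a: str, b: str) -> float:
    # 0.0 means identical, 1.0 means entirely different,
    # regardless of how long the strings are.
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```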

So my aim is to iterate over a list of messages and get some kind of fuzzily matched count. There may be a side issue of generating some kind of consistent key for each fuzzily matched group. How would I go about this?


Solution

I would give the q-gram distance a try. The distance between two strings is then determined by the number of q-grams (substrings of length q) they have in common. q has to be large enough that a q-gram captures a relevant detail; q = 4 might be a good starting point.
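A minimal sketch of that idea in Python, using the classic formulation of the q-gram distance (the total count difference between the two strings' q-gram profiles; q = 4 as suggested above):

```python
from collections import Counter

def qgrams(s: str, q: int = 4) -> Counter:
    # Multiset of all overlapping substrings of length q.
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a: str, b: str, q: int = 4) -> int:
    # Classic q-gram distance: the total count difference between
    # the two q-gram profiles; 0 means identical profiles.
    pa, pb = qgrams(a, q), qgrams(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
```

For the two example messages in the question, the shared q-grams come almost entirely from the fixed template text, while the UUID and page segments account for most of the difference.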

Further string distances are derived from the concept of q-grams, e.g. the cosine and Jaccard distances.
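As a sketch of how those two look on top of the same q-gram profiles (Jaccard over the sets of q-grams, cosine over the count vectors; `qgrams` is the helper from the previous snippet):

```python
import math

def jaccard_distance(a: str, b: str, q: int = 4) -> float:
    # 1 - |intersection| / |union| of the two q-gram sets.
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def cosine_distance(a: str, b: str, q: int = 4) -> float:
    # 1 - cosine similarity of the q-gram count vectors.
    pa, pb = qgrams(a, q), qgrams(b, q)
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    norm = (math.sqrt(sum(v * v for v in pa.values()))
            * math.sqrt(sum(v * v for v in pb.values())))
    return 1.0 - (dot / norm if norm else 0.0)
```

Both of these return a value in the range 0.0–1.0, which also addresses the question's concern about scores being comparable across strings of different lengths.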

This text explains different types of string distance algorithms in the context of R.
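To tie this back to the question's goal (a count per fuzzy group, plus a consistent key), one simple approach is greedy clustering: keep the first message seen in each group as its representative, and assign each new message to the first representative within a distance threshold. A sketch, assuming the `jaccard_distance` helper above; the 0.7 threshold is an arbitrary starting value that will need tuning, since messages dominated by a long UUID differ in many q-grams even when the template is identical:

```python
from collections import defaultdict

def group_messages(messages, threshold=0.7):
    representatives = []        # first message seen in each group
    counts = defaultdict(int)   # representative -> occurrence count
    for msg in messages:
        for rep in representatives:
            if jaccard_distance(msg, rep) <= threshold:
                counts[rep] += 1
                break
        else:  # no representative was close enough: start a new group
            representatives.append(msg)
            counts[msg] = 1
    return counts

messages = [
    "failed to retrieve results for user 188a9e12-6797-4d9b-8adf-4588b2435326 on page /primate/gorilla",
    "failed to retrieve results for user 08c610d2-27d2-4f97-bf60-d5b3010e8dd6 on page /primate/monkey",
]
for rep, count in group_messages(messages).items():
    print(count, rep)
```

The representative message doubles as the consistent key for its group; hashing it, or masking the variable segments (UUIDs, paths) before grouping, would give a more compact and stable key.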
