Data cleansing - how to decide which names are misspellings or are equivalent but slightly different?

StackOverflow https://stackoverflow.com/questions/18825285

Question

We have a table of company names with a numeric identity primary key. While cleaning up the data we discovered that the name column is full of similar names that represent the same company.

E.g. BA and Ba or GTC Ltd and GTC Limited.

Is there any way in SQL Server to get counts and a summary of all items that have similar names, together with a list of their IDs? I wondered whether there is some sort of similarity comparison for which we could set a threshold value.

We need to present the client with a list of names that look like they need to be merged.

Solution

The basic answer is "No". Name rectification is a hard problem. By the more obvious measures, two names like "GTC Ltd" and "GTC Limited" are more different than "GTC" and "GTE". There are outside service bureaus and special-purpose software for this.

If you are dealing with a smallish amount of data, I would suggest that you alphabetize the values, load them into Excel, and add a column in Excel with the "official" name. You can then re-import this as a table in the database to do what you want. It might help if you remove known suffixes and prefixes, such as "ltd", "bros", "partners" and so on.
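The suffix-stripping idea can be prototyped outside the database before anything is merged. The following is a minimal Python sketch, assuming the rows have been exported as (id, name) pairs; the suffix list, the `normalize` and `group_candidates` helpers, and the sample rows are illustrative assumptions rather than part of the original answer.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; extend it to match your own data.
SUFFIXES = {"ltd", "limited", "bros", "partners", "inc", "co"}

def normalize(name):
    """Lower-case, replace punctuation with spaces, and drop known trailing suffixes."""
    # Note: this keeps only ASCII letters and digits, which is an assumption about the data.
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while len(words) > 1 and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)

def group_candidates(rows):
    """Group (id, name) rows by normalized name; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for row_id, name in rows:
        groups[normalize(name)].append((row_id, name))
    return {key: members for key, members in groups.items() if len(members) > 1}

rows = [(1, "BA"), (2, "Ba"), (3, "GTC Ltd"), (4, "GTC Limited"), (5, "GTE")]
for key, members in group_candidates(rows).items():
    # Prints the normalized key, the group size, and the IDs to review.
    print(key, len(members), [row_id for row_id, _ in members])
```

Each printed group is only a candidate for merging; the "official" name still has to be chosen by a person, as described above.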

If you do try to go down the path of something like soundex(), then be sure that you understand it well. For instance, the soundex() values of the following two strings are the same: "gte, blah blah blah" and "gdteey, junk goes here".
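To see why those two strings collide, here is a simplified Python sketch of the classic Soundex scheme. It is an approximation for illustration, not SQL Server's implementation: it assumes encoding stops at the first non-letter (which is consistent with the collision described above), and the `soundex` helper is a hypothetical name.

```python
# Classic Soundex digit groups; vowels and h, w, y carry no digit.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def soundex(text):
    """Simplified Soundex over the leading run of letters only (an assumption)."""
    letters = []
    for ch in text.strip().lower():
        if not ch.isalpha():
            break  # everything after the first non-letter is ignored
        letters.append(ch)
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated digits
            prev = digit
    return (code + "000")[:4]  # pad or truncate to four characters

print(soundex("gte, blah blah blah"))     # G300
print(soundex("gdteey, junk goes here"))  # G300
```

Only the first letter and the next few consonants contribute, so long names that share a short opening word end up with identical codes.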

Other tips

Your answer lies in the SoundEx() and Difference() functions.

DECLARE @a varchar(50) = 'BA'
      , @b varchar(50) = 'Ba'
;

SELECT @a
     , @b
     , SoundEx(@a)
     , SoundEx(@b)
     , Difference(@a, @b)
;

SET @a = 'GTC Ltd';
SET @b = 'GTC Limited';

SELECT @a
     , @b
     , SoundEx(@a)
     , SoundEx(@b)
     , Difference(@a, @b)
;

SET @a = 'BLAH';

SELECT @a
     , @b
     , SoundEx(@a)
     , SoundEx(@b)
     , Difference(@a, @b)
;

Think of SoundEx as "sounds like": it is a function that returns a coded representation of its input, which you can compare with the codes of other inputs.

The Difference() function returns a value between 0 and 4, where the higher numbers represent better matches.

There are plenty of functions for checking similarity. MS SQL provides the SOUNDEX and DIFFERENCE functions, which I have never actually used.

I did, however, once use Levenshtein distance (the minimum number of edits needed to convert string1 into string2) in PHP, and it was very efficient. Devio's T-SQL implementation can be created as a user-defined function and then called in your queries:

SELECT 
    dbo.LEVENSHTEIN(COL1, COL2)  -- scalar UDFs must be called with a schema prefix; dbo is assumed here
FROM 
    ExampleTable

Or in a WHERE condition:

SELECT 
    COL1, COL2
FROM
    ExampleTable
WHERE
    dbo.LEVENSHTEIN(COL1, COL2) < 5  -- schema prefix required for scalar UDFs; dbo is assumed here

Here I'd suggest implementing some CASE - WHEN - THEN logic to pick the Levenshtein distance threshold that suits your needs, since a fixed cut-off behaves very differently on short and long names.
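As a rough illustration of why the threshold matters, here is a self-contained Python sketch of Levenshtein distance with a cut-off that scales with name length. It stands in for the T-SQL function, which is not reproduced here; the `similar_pairs` helper, the 0.4 ratio, and the sample rows are assumptions chosen for the example.

```python
from itertools import combinations

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def similar_pairs(rows, max_ratio=0.4):
    """Yield (id_a, id_b, distance) when the distance is small relative to the longer name."""
    for (id_a, a), (id_b, b) in combinations(rows, 2):
        distance = levenshtein(a.lower(), b.lower())
        if distance <= max_ratio * max(len(a), len(b)):
            yield id_a, id_b, distance

rows = [(1, "BA"), (2, "Ba"), (3, "GTC Ltd"), (4, "GTC Limited"), (5, "GTE")]
print(list(similar_pairs(rows)))  # [(1, 2, 0), (3, 4, 4)]
```

With a fixed cut-off such as "less than 5", "BA" and "GTE" (distance 3) would also be reported as similar, which is the kind of false positive a length-aware rule is meant to avoid.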

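You can use a case-insensitive collation together with the LIKE operator to match BA and Ba. Note that UTF8_GENERAL_CI is a MySQL collation name; on SQL Server a case-insensitive collation such as Latin1_General_CI_AS plays the same role. For GTC Ltd and GTC Limited you can use the same approach (for example LIKE 'GTC L%'), but you should then check the matches manually and merge them carefully.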

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow