Question

I have searched the questions and there are similar questions asked but no solution that I think I can use. This question is similar to a fuzzy match...sorta. I need help with comparing two tables. One table is a company reference table and the other is a table that get's raw company data imported into it daily. The reference table is clean and has a company id associated with every single company. The daily data that is imported does not have a company id. What I'm trying to do is have the daily data reference the 'Company Reference Table' on company_name and update the 'Company Table's column company_state based on the company_name. Unfortunately, the daily data string for company_name coming in is not always the same each day. There can be various characters (a-z, 0-9, +, -, .) and spaces in front or after the actual company name with varying lengths daily so I do not believe I can use charindex to clean it up.

Company Reference Table

company_id  company_name  company_state
1           Awesome Inc   NY
2           Excel-guru    AL
3           Clean All     MI 

Company Table

company_name              company_state
abc123 Awesome   Inc      NULL
Excel gur xyz-987         NULL
Clean All Cleanall        NULL

What I want it to do is this. Sorta like a fuzzy match.

Company Table

company_name              company_state
abc123 Awesome   Inc      NY
Excel gur xyz-987         AL
Clean All Cleanall        MI

Any help is much appreciated. Thank you.

Was it helpful?

Solution

Try below Query to update company table:

update company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%') 
SET c.company_state = cr.company_state;

Another way just by using SELECT

SELECT c.*, cr.* FROM company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%');

SQL Fiddle: http://sqlfiddle.com/#!2/ec76f/1

OTHER TIPS

If I understand, the company_name in the company table always contains the entire string that is in the reference table - it just might contain some junk before or after that string. If so, you just need to find an appropriate string function for your DBMS that lets you check if string A contains string B. For example, with MySQL I think the following will work (not tested):

select c.company_name, r.company_state from company_table c, reference_table r where locate(r.company_name, c.company_name) != 0

that works because the MySQL locate(A, B) function returns 0 if and only if the string A doesn't occur in the string B.

Because the incoming data is not in a consistent format, I don't think you will be able to do this with a database. In fact, I would suggest NOT doing it with a database, allowing you to run a matching routine beforehand.

You'll then need to examine as much of the data as possible, and see if you can find any patterns, or things which you can do to the data in bulk to make it easier to match. For example:

  • Remove repeated whitespace (e.g. "Awesome Inc" -> "Awesome Inc")
  • Remove non-alphanumeric characters
  • If possible, can you remove the obvious codes?

I would then suggest something similar to the following:

  • Add a field to your Company Table (the incoming data) to indicate the matched company, allowing you to keep track of matched items (and use for joins further on). If you don't want to modify this table, add a second table to link the two.
  • Run repeated attempts to match, starting with the most definite versions (e.g. State in Company Table is present AND States match AND Company Reference Name within Company Table Name) - store these associations. They reduce the possible matches on your next attempts. At any point where your match returns > 1 possibility, it should not be used.
  • When you've eliminated the easy matches, you can proceed to more fuzzy methods, such as Levenshtein Distance, individual words (tokens) matching.

I would expect that for a while, you should probably flag up low confidence matches, having a human review them, while you tune your process.

You can also store all previous matches for a company, meaning that over time your system might get better. It depends on how much the data varies each day.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top