SQL - comparing strings from two tables (fuzzy match...sorta)

Question 1

Try the query below to update the company table:

update company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%') 
SET c.company_state = cr.company_state;

Alternatively, use a SELECT with the same join to see the matched rows:

SELECT c.*, cr.* FROM company c INNER JOIN company_ref cr
ON c.company_name LIKE concat('%', cr.company_name, '%');

SQL Fiddle: http://sqlfiddle.com/#!2/ec76f/1

Question 2

If I understand correctly, the company_name in the company table always contains the entire string from the reference table; it just might have some junk before or after that string. If so, you just need a string function in your DBMS that checks whether string A contains string B. For example, with MySQL I think the following will work (not tested):

select c.company_name, r.company_state
from company_table c, reference_table r
where locate(r.company_name, c.company_name) != 0;

That works because the MySQL locate(A, B) function returns 0 if and only if the string A doesn't occur in the string B.
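To check the containment join on a few rows without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It assumes SQLite's instr(haystack, needle), which takes its arguments in the opposite order from MySQL's locate(needle, haystack) but also returns 0 when the needle is absent. The sample rows are made up for illustration, and case sensitivity may differ between the two engines.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_table (company_name TEXT);
CREATE TABLE reference_table (company_name TEXT, company_state TEXT);
-- Hypothetical sample rows, not taken from the question.
INSERT INTO company_table VALUES ('X12 Awesome Inc 998'), ('Unknown LLC');
INSERT INTO reference_table VALUES ('Awesome Inc', 'CA'), ('Acme Corp', 'NY');
""")

# instr(X, Y) returns the 1-based position of Y inside X, or 0 if Y is absent,
# so "!= 0" keeps only rows where the reference name occurs in the incoming name.
rows = conn.execute("""
    SELECT c.company_name, r.company_state
    FROM company_table c
    JOIN reference_table r
      ON instr(c.company_name, r.company_name) != 0
""").fetchall()

print(rows)
```

Rows with no containing reference name (here 'Unknown LLC') simply drop out of the result, which is what an inner join on this condition does.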

Question 3

Because the incoming data is not in a consistent format, I don't think you will be able to do this in the database. In fact, I would suggest NOT doing it in the database, and instead running a matching routine on the data beforehand.

You'll then need to examine as much of the data as possible to find patterns, or changes you can apply to the data in bulk to make it easier to match. For example:

Remove repeated whitespace (e.g. "Awesome  Inc" -> "Awesome Inc")
Remove non-alphanumeric characters
Remove the obvious codes, if possible
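The first two clean-up steps could be sketched in Python roughly as follows. The function name normalize_name is hypothetical, and two choices here are my own assumptions rather than part of the suggestion above: non-alphanumeric characters are replaced by a space (instead of deleted) so that words stay separated, and the result is lower-cased. Stripping the codes is left out because it depends on what the codes look like in your data.

```python
import re

def normalize_name(raw: str) -> str:
    """Normalize a company name for matching (hypothetical helper)."""
    # Replace each run of non-alphanumeric characters with a single space.
    # Note: this ASCII-only pattern also strips accented letters.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", raw)
    # Collapse repeated whitespace, trim the ends, and lower-case (assumption).
    return " ".join(cleaned.split()).lower()

print(normalize_name("Awesome   Inc."))
```

Applying the same function to both the incoming names and the reference names keeps the two sides comparable.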

I would then suggest something similar to the following:

Add a field to your Company Table (the incoming data) to indicate the matched company, so you can keep track of matched items (and use it for joins later on). If you don't want to modify this table, add a second table to link the two.
Run repeated matching passes, starting with the most definite criteria (e.g. State in Company Table is present AND States match AND Company Reference Name is contained within Company Table Name), and store these associations. They reduce the possible matches on your next passes. If a pass returns more than one possibility for a company, do not use that match.
When you've eliminated the easy matches, you can proceed to fuzzier methods, such as Levenshtein distance or matching on individual words (tokens).
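The passes described above might look something like this in Python. It is only a sketch under several assumptions: names are already cleaned up, the reference data is a list of (name, state) tuples, the function names and tier labels are hypothetical, and the max_distance threshold of 2 is an arbitrary starting point to tune against your own data. Token matching is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def match_company(name, state, reference, max_distance=2):
    """Return (reference_row, tier) for a unique match, else None.

    reference is a list of (ref_name, ref_state) tuples (assumed layout).
    A tier that yields more than one candidate is skipped as ambiguous.
    """
    tiers = []
    if state:
        # Most definite: state is present, states match, name is contained.
        tiers.append(("state+contains",
                      [r for r in reference if r[1] == state and r[0] in name]))
    # Less definite: reference name is contained in the incoming name.
    tiers.append(("contains", [r for r in reference if r[0] in name]))
    # Fuzzy: whole-name edit distance within the threshold.
    tiers.append(("levenshtein",
                  [r for r in reference
                   if levenshtein(name, r[0]) <= max_distance]))
    for label, hits in tiers:
        if len(hits) == 1:
            return hits[0], label
    return None

# Hypothetical reference data for illustration only.
reference = [("awesome inc", "CA"), ("acme corp", "NY"), ("acme corp", "TX")]
print(match_company("x12 awesome inc 998", "CA", reference))
print(match_company("acme corp 77", None, reference))
```

Returning the tier label alongside the match makes it easy to store how each association was made, and to treat the fuzzy tier as lower confidence than the containment tiers.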

I would expect that, for a while, you should flag low-confidence matches and have a human review them while you tune your process.

You can also store all previous matches for a company, so that over time your system might get better. How much this helps depends on how much the data varies each day.
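One simple way to reuse previous matches is a lookup that is consulted before any matching runs. The sketch below keeps the history in a plain dictionary; the class and function names are hypothetical, and in practice you would probably persist the history in a table and record only matches that were unambiguous or confirmed by a reviewer.

```python
class MatchHistory:
    """Remembers which reference company a raw incoming name resolved to."""

    def __init__(self):
        self._known = {}

    def record(self, raw_name, ref_name):
        self._known[raw_name] = ref_name

    def lookup(self, raw_name):
        # Returns None when this raw name has not been seen before.
        return self._known.get(raw_name)

def resolve(raw_name, history, matcher):
    """Use a stored match if there is one, otherwise call the matcher."""
    known = history.lookup(raw_name)
    if known is not None:
        return known
    result = matcher(raw_name)  # matcher returns a reference name or None
    if result is not None:
        history.record(raw_name, result)
    return result

# Usage with a stand-in matcher that counts how often it is called.
calls = []
def toy_matcher(raw_name):
    calls.append(raw_name)
    return "Awesome Inc" if "Awesome" in raw_name else None

history = MatchHistory()
first = resolve("X12 Awesome Inc 998", history, toy_matcher)
second = resolve("X12 Awesome Inc 998", history, toy_matcher)
print(first, second, len(calls))
```

The second call is answered from the history, so the matcher runs only once for a name that recurs in later batches.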