Because the incoming data isn't in a consistent format, I don't think you'll be able to do this purely within the database. In fact, I'd suggest NOT doing it in the database at all; instead, run a matching routine over the data beforehand.
You'll then need to examine as much of the data as possible and look for patterns, or for bulk transformations you can apply to make the data easier to match. For example:
- Remove repeated whitespace (e.g. "Awesome  Inc" -> "Awesome Inc")
- Remove non-alphanumeric characters
- If possible, strip out any obvious codes embedded in the names
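The cleanup steps above can be sketched as a small normalisation function. This is a minimal example, not a complete routine; the regexes and the lowercasing are assumptions you'd adjust for your own data:

```python
import re

def normalize(name: str) -> str:
    """Normalise a company name for matching:
    drop punctuation, collapse repeated whitespace, lowercase."""
    name = re.sub(r"[^A-Za-z0-9 ]+", " ", name)  # non-alphanumeric -> space
    name = re.sub(r"\s+", " ", name)             # collapse whitespace runs
    return name.strip().lower()

print(normalize("Awesome,  Inc."))  # -> "awesome inc"
```

Run every incoming name (and every reference name) through the same function before comparing, so both sides are in the same canonical form.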
I would then suggest something similar to the following:
- Add a field to your Company Table (the incoming data) to indicate the matched company, allowing you to keep track of matched items (and use it for joins later on). If you don't want to modify this table, add a second table linking the two.
- Run repeated matching attempts, starting with the most definite rules (e.g. the State in the Company Table is present AND the States match AND the Company Reference Name appears within the Company Table Name) and store these associations; each pass reduces the candidates for the next. Any rule that returns more than one possible match is ambiguous and should not be used.
- When you've eliminated the easy matches, you can move on to fuzzier methods, such as Levenshtein distance or matching on individual words (tokens).
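One "most definite" pass from the list above might look like the sketch below. The record shapes (dicts with `state` and `name` keys) are assumptions for illustration; in practice these would be rows from your Company Table and reference table:

```python
def exact_pass(incoming, references):
    """One deterministic pass: the state must be present and match,
    and the reference name must appear within the incoming name.
    Returns the single candidate, or None if zero or ambiguous."""
    matches = []
    for ref in references:
        if (incoming["state"]
                and ref["state"] == incoming["state"]
                and ref["name"] in incoming["name"]):
            matches.append(ref)
    # More than one candidate means the rule is ambiguous; skip it.
    return matches[0] if len(matches) == 1 else None
```

Records matched here get their association stored and are excluded from later, fuzzier passes.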
I would expect that, for a while at least, you should flag low-confidence matches for human review while you tune your process.
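A simple way to drive that review queue is to turn Levenshtein distance into a normalised similarity score and bucket the result. The distance function below is the standard dynamic-programming algorithm; the two thresholds are placeholder values you would tune against your own data:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def classify(score: float, accept: float = 0.92, review: float = 0.75) -> str:
    # Thresholds are illustrative; tune them on your real data.
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"
```

Anything landing in the "review" band goes to a human; their decisions tell you where to move the thresholds.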
You can also store all previous matches for a company, so that over time your system gets better at recognising repeat variants. How much this helps depends on how much the data varies from day to day.
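Storing past matches can be as simple as a lookup table keyed on the raw incoming name. This is a hypothetical in-memory sketch; in practice the mapping would live in a database table alongside your other data:

```python
# Hypothetical match history: raw incoming name -> matched company id.
match_history = {}

def remember(raw_name, company_id):
    """Record a confirmed match so the same variant is free next time."""
    match_history[raw_name] = company_id

def lookup(raw_name):
    """Return the previously matched company id, or None if unseen."""
    return match_history.get(raw_name)

remember("Awesome Inc (NY)", 42)
print(lookup("Awesome Inc (NY)"))  # -> 42
```

Checking this history first means every name variant only ever needs to be resolved (by rules or by a human) once.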