Approximate String Matching - Machine Learning [closed]

Question 1

There are typically two parts to such a problem: figuring out which items are likely in error, and then fixing those.

If you assume that the majority of items are spelled correctly, then finding the likely errors is pretty easy. Fixing the errors is a lot harder to automate, and it's probably impossible to do it 100% correctly in any reasonable length of time. But you might find that if you do a good job finding the errors, fixing them manually is no big deal.

To find the errors, I suggest building a list of every distinct skill along with a count of how many times it is referenced in the entire data set. When you're done, you'll have a list like:

MANAGEMENT, 22
JAVA, 298
HADOOP, 12
HADUP, 1
SALES, 200
SALS, 1

and so on. Each skill is listed along with the number of users who have that skill.
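As a rough sketch of that counting step: the record layout below (a list of dicts with a `skills` field) and the name `count_skills` are assumptions made for illustration, not something from the original data set.

```python
from collections import Counter

# Hypothetical user records; the real data set will look different.
users = [
    {"name": "alice", "skills": ["JAVA", "HADOOP", "SALES"]},
    {"name": "bob", "skills": ["JAVA", "HADUP"]},
    {"name": "carol", "skills": ["SALES", "JAVA", "SALS"]},
]

def count_skills(records):
    """Count how many users list each skill (normalized to upper case)."""
    counts = Counter()
    for record in records:
        # A set keeps a skill from being counted twice for the same user.
        counts.update({s.strip().upper() for s in record["skills"]})
    return counts

skill_counts = count_skills(users)
for skill, n in skill_counts.most_common():
    print(f"{skill}, {n}")
```

Normalizing case and whitespace before counting is a design choice here: it keeps "java" and "JAVA " from showing up as separate rare terms.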

Now sort those by frequency and choose a threshold. Say you choose to examine more closely anything that has a frequency of 3 or fewer. The idea is that items used only a handful of times, relative to the other items, are probably misspellings.
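A minimal sketch of the thresholding step, using the example counts from above. The function name `rare_terms` and the default threshold of 3 are illustrative; the right cutoff depends on the size of the data set.

```python
def rare_terms(counts, threshold=3):
    """Return (term, count) pairs with count <= threshold, rarest first."""
    suspects = [(term, n) for term, n in counts.items() if n <= threshold]
    # Sort by count, then alphabetically so the output is stable.
    return sorted(suspects, key=lambda pair: (pair[1], pair[0]))

counts = {
    "MANAGEMENT": 22,
    "JAVA": 298,
    "HADOOP": 12,
    "HADUP": 1,
    "SALES": 200,
    "SALS": 1,
}

for term, n in rare_terms(counts):
    print(f"{term}, {n}")
```

Not everything under the threshold is a misspelling; a legitimately rare skill will land in this list too, which is one reason to review it by hand.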

Once you've identified the terms you want to examine more closely, you can decide whether to automate the corrections or make them manually. When I had to do this, I got my list of likely misspellings and manually created a file that paired each misspelling with its correction. For example:

SALS,SALES
HADUP,HADOOP
PREFORMANCE,PERFORMANCE

There were a couple hundred, but manually creating the file was a whole lot faster than writing a program to figure out what the correct spelling should be.

Then I loaded that file and went through my user records, making the replacements as required.
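Loading the file and applying it might look something like the sketch below. The helper names `load_corrections` and `fix_skills` are made up for this example, and an in-memory `io.StringIO` stands in for the real corrections file.

```python
import csv
import io

def load_corrections(lines):
    """Parse 'MISSPELLING,CORRECTION' rows into a dict."""
    corrections = {}
    for row in csv.reader(lines):
        if len(row) == 2:  # skip blank or malformed rows
            wrong, right = (cell.strip().upper() for cell in row)
            corrections[wrong] = right
    return corrections

def fix_skills(skills, corrections):
    """Replace known misspellings and drop duplicates, keeping order."""
    fixed = []
    for skill in skills:
        key = skill.strip().upper()
        corrected = corrections.get(key, key)
        if corrected not in fixed:
            fixed.append(corrected)
    return fixed

# Stand-in for open("corrections.csv"); the file name is hypothetical.
corrections_file = io.StringIO("SALS,SALES\nHADUP,HADOOP\nPREFORMANCE,PERFORMANCE\n")
corrections = load_corrections(corrections_file)

print(fix_skills(["JAVA", "HADUP", "SALS", "SALES"], corrections))
```

Dropping duplicates after correction matters because a user may have listed both the misspelling and the correct spelling.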

The big time saver is finding the likely candidates for replacement. After that, fixing them is almost an afterthought.

That is, unless you really want to spend months on a research project. Then you can knock yourself out playing with edit distance algorithms, phonetic algorithms, and other stuff that might figure out that "edicit" and "etiquette" are supposed to be the same word.
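If you do want a taste of the automated route, here is a small sketch of one edit-distance approach: plain Levenshtein distance, used to suggest the closest frequent term for each rare one. The `suggest` helper and the `max_distance=2` cutoff are assumptions for illustration, not a tuned solution.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def suggest(term, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance, else None."""
    best = min(vocabulary, key=lambda word: (levenshtein(term, word), word))
    return best if levenshtein(term, best) <= max_distance else None

common = ["MANAGEMENT", "JAVA", "HADOOP", "SALES", "PERFORMANCE"]
for rare in ["HADUP", "SALS", "PREFORMANCE"]:
    print(rare, "->", suggest(rare, common))
```

This handles simple typos, but a small cutoff like 2 can never link "edicit" to "etiquette", since the two differ in length by three characters. Cases like that are where the phonetic algorithms and the months of research come in.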

Question 2

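Something that works very nicely for this in the machine learning paradigm is string kernels. Because these are genuine kernel functions, they are very convenient if you want to formulate the learning problem as an SVM: the kernel compares the strings directly, so you don't have to hand-craft a feature vector for each one.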
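To make that concrete, here is a sketch of one simple member of the family, the k-spectrum kernel, which counts the length-k substrings two strings share. The choice of this particular kernel, k=2, and the function names are my assumptions; the answer above does not say which string kernel it has in mind.

```python
import math
from collections import Counter

def kmers(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=2):
    """k-spectrum kernel: dot product of the two k-mer count vectors."""
    cs, ct = kmers(s, k), kmers(t, k)
    return sum(n * ct[kmer] for kmer, n in cs.items())

def normalized_kernel(s, t, k=2):
    """Scale to [0, 1] so that long strings do not dominate."""
    denom = math.sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0

def gram_matrix(strings, k=2):
    """Pairwise kernel values, the form a kernel method consumes."""
    return [[normalized_kernel(a, b, k) for b in strings] for a in strings]

words = ["HADOOP", "HADUP", "SALES", "SALS"]
for word, row in zip(words, gram_matrix(words)):
    print(word, [round(value, 2) for value in row])
```

A Gram matrix like this can be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.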