Try the user-written strgroup command from SSC, which matches strings using Levenshtein distance.
It comes with another command, levenshtein, that computes the distance between two string variables directly.
Some toy code to give you an idea:
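* install strgroup (the package also provides levenshtein)
ssc install strgroup
* toy data; clear drops whatever is currently in memory
clear
input str8 names
Bob
James
Jim
Jameson
end
* reference string to compare every name against
gen james = "James"
* LD holds the Levenshtein distance between names and james
levenshtein names james, gen(LD)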
You can then sort by LD to get an idea of what might work well in your case.
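A minimal sketch of that inspection step, assuming the toy data and the LD variable from the code above are still in memory. Working the example by hand, the edit distances to "James" should come out as 0 for James, 2 for Jameson, 3 for Jim and 5 for Bob, so the closest candidates end up at the top of the listing.

```stata
* put the closest matches first, then eyeball names against their distance
sort LD
list names LD, noobs
```

Looking at where the plausible matches stop and the obvious non-matches begin gives you a rough cutoff to work with.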
The other way would be to do this, which creates groups for you:
strgroup names , gen(group) threshold(0.5)
and play around with the threshold.
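One way to play around is to build one grouping per threshold and compare them side by side. This is only a sketch: the threshold values and the variable names (group0_25 and so on) are arbitrary choices for illustration. As far as I recall from the help file, strgroup compares the threshold against the edit distance normalized by string length (by the shorter string by default), but check help strgroup to confirm before relying on that.

```stata
* try several thresholds; each pass creates its own group variable
foreach t in 0.25 0.5 0.75 {
    * turn 0.25 into 0_25 so it can be part of a variable name
    local suffix = subinstr("`t'", ".", "_", .)
    strgroup names, gen(group`suffix') threshold(`t')
}
* compare how the groupings change as the threshold loosens
list names group*, noobs
```

A lower threshold is stricter, so you should generally see fewer names merged into the same group.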
I don't think a standard exists and these procedures will still entail lots of manual work.