Question

Is there an industry standard for how large the Jaro-Winkler score should be to say that the two strings are likely similar?

I have a list of strings and I want to see whether any of them are plausible typographical errors for the name James. The strings come from a Stata dataset, and I am comparing them with the Perl module Text::JaroWinkler, which is written in C. (So if there were a Stata module, I'd be all ears!)

Here is the code that I wrote so far in perl to make the comparisons to the string James.

   #!/usr/bin/perl

   use 5.10.0;
   use strict;
   use warnings;
   use Text::JaroWinkler qw( strcmp95 );
   use List::Util qw( min );

   open( my $l, '<', 'Strings.txt' ) or die "Can't open Strings.txt: $!";
   open( my $o, '>', 'JW.txt' )      or die "Can't open JW.txt: $!";

   while ( my $line = <$l> ) {
       chomp($line);
       # Length passed as strcmp95's third argument: the shorter of the two strings
       my $length = min( length($line), length('JAMES') );
       my $jarow  = strcmp95( $line, 'JAMES', $length );
       print "$line,'JAMES',$jarow\n";         # echo to the terminal
       print {$o} "$line,'JAMES',$jarow\n";    # write the same record to JW.txt
   }
   close $l;
   close $o;

I'm also not sure whether I'm interpreting the third parameter of strcmp95 correctly. Perhaps I should be passing length('JAMES') instead?

Solution

Try the user-written strgroup command from SSC, which matches strings using Levenshtein distance. It comes with another command, levenshtein, that you can use for this. Some toy code to give you an idea:

ssc install strgroup

input str8 names
Bob
James
Jim
Jameson
end

gen james = "James"

levenshtein names james, gen(LD)

You can then sort by LD to get an idea what might work well in your case.
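
A minimal continuation of the toy example, assuming the levenshtein call above has already created LD. Sorting and listing puts the closest candidates first:

```stata
* Rank the candidate strings by edit distance to "James" (smallest first)
sort LD
list names LD, noobs
```

Values of LD that are small relative to the length of the name are the ones worth inspecting by hand.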

The other way would be to do this, which creates groups for you:

strgroup names , gen(group) threshold(0.5)

and play around with the threshold.
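
To get a feel for what a threshold means, one option is to rescale the raw distance by string length yourself. This is only an illustrative sketch: nLD and the 0.5 cutoff are hypothetical choices, and dividing by the shorter string's length is an assumption about a sensible normalization, not a statement of what strgroup does internally, so check its help file.

```stata
* Assumes names, james, and LD exist from the levenshtein example above
* Normalize the edit distance by the length of the shorter string (an assumption)
gen nLD = LD / min(strlen(names), strlen(james))

* Show only candidates at or below a hypothetical cutoff of 0.5
sort nLD
list names LD nLD if nLD <= 0.5, noobs
```

A normalized distance lets one cutoff apply to short and long strings alike, since two edits in a five-letter name are not comparable to two edits in a twenty-letter one.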

I don't think a standard exists, and these procedures will still entail a lot of manual work.

Licensed under: CC-BY-SA with attribution