Question

I have a very interesting problem statement. I have two datasets that are related to each other (both relate to car makes and models). One of them is processed data (make, model and variant have been split, processed and edited), while the other is a raw feed.

Is there any way to associate the two? I am totally lost and hence have not provided any code. The problem is that there is no way to link the two columns, as there is no SID; it's purely names.

Solution

Joe is right, you need to provide sample data or at least a starting point for this to be a good question. But here's an attempt at an answer, regardless.

If all you have is name variables, assuming those are character variables, you're going to want to use string comparison functions. A general procedure for this is as follows:

  1. Clean both name variables by removing punctuation and standardizing the case. You should be using the compress() and upcase() or lowcase() functions as starting points.

  2. Next, you need to compare each name in one data set to every name in the other, and choose the most similar pair as a preliminary match. Look into the spedis() and complev() functions for ways to create a similarity score. Note that both return a distance, so a lower value means a closer match and 0 means the strings are identical.

  3. Review the output data set results! Fuzzy matching like this can be tuned to perform pretty well, but it will NOT be perfect over the long run, and you must review at least a random sample of the results to check for errors. The first few times you will catch problems and hopefully begin to iterate towards a better solution by updating your scoring method in step 2.
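As a rough illustration of steps 1 and 2 on a single pair of names, here is a minimal sketch. The two sample strings and the variable names (raw, ref, clean_raw, clean_ref) are made up for the example and are not from the asker's data. Be aware that the 'p' modifier removes every punctuation character, including hyphens and periods, so you may want to translate some of them to blanks first.

```sas
/* Sketch only: the sample values below are hypothetical. */
data _null_;
    length raw ref clean_raw clean_ref $40;

    raw = 'Toyota  Corolla (Altis), G';   /* stand-in for a raw feed value   */
    ref = 'Toyota Corolla Altis';         /* stand-in for a processed value  */

    /* Step 1: drop punctuation, collapse repeated blanks, standardize case. */
    clean_raw = upcase(compbl(compress(raw, , 'p')));
    clean_ref = upcase(compbl(compress(ref, , 'p')));

    /* Step 2: two candidate distance scores (lower = more similar). */
    lev_dist    = complev(clean_raw, clean_ref);  /* Levenshtein edit distance */
    spell_dist  = spedis(clean_raw, clean_ref);   /* asymmetric spelling distance */

    put clean_raw= / clean_ref= / lev_dist= spell_dist=;
run;
```

Since spedis() is asymmetric, the order of its two arguments matters; complev() does not have that issue, which may make it the easier starting point.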

Here is a very basic shell of code that might be helpful:

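DATA output_matches (keep = name_1 match_name match_score); 
    SET input_data_1;

    match_score = .; /* No candidate has been scored yet for this name_1. */

    do i = 1 to N_data_2; /* N_data_2 is the number of observations in data set 2. */

        /* nobs= fills N_data_2 at compile time; point= reads observation i directly. */
        SET input_data_2 point = i nobs = N_data_2;

        score = ...; /* You need to edit this to calculate a similarity score between the variables name_1 and name_2. */

        /* spedis() and complev() return distances, so the lowest score is the best match. */
        if missing(match_score) or score < match_score then do;
            match_score = score;
            match_name = name_2;
        end;            

    end;    
RUN;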
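
For reference, here is one way the shell could be filled in and run end to end. This is a sketch under assumptions, not the only way to do it: the two small input data sets are invented sample data that have already been cleaned, and complev() is just one possible choice for the score.

```sas
/* Hypothetical, already-cleaned sample names for illustration only. */
data input_data_1;
    input name_1 $40.;
    datalines;
TOYOTA COROLLA ALTIS G
HONDA CIVIC VTI
;
run;

data input_data_2;
    input name_2 $40.;
    datalines;
HONDA CIVIC
TOYOTA COROLLA ALTIS
;
run;

data output_matches (keep = name_1 match_name match_score);
    set input_data_1;

    length match_name $40;
    match_score = .;    /* reset the best score for each name_1 */
    match_name  = ' ';

    do i = 1 to n_data_2;
        /* Direct-access read of every observation in data set 2. */
        set input_data_2 point = i nobs = n_data_2;

        /* Assumed scoring choice: Levenshtein distance, lower is closer. */
        score = complev(name_1, name_2);

        if missing(match_score) or score < match_score then do;
            match_score = score;
            match_name  = name_2;
        end;
    end;
run;

proc print data = output_matches;
run;
```

This compares every pair, so the run time grows with the product of the two data set sizes. For large inputs, consider restricting the comparison to names that share a make before scoring.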
Licensed under: CC-BY-SA with attribution