Question

I was given these crude database exports to fix up. In SQL have two tables that each have three columns: ID, first name and last name. In sheet1, the ID column is empty at the start whereas there are ID's in sheet2. I needed to join sheet1 and sheet2 where both first and last name matched and if there was an ID present in sheet2 from the match, to set it in the ID column in sheet1. Very simple SQL query:

UPDATE sheet1
JOIN sheet2
ON sheet1.NAME_LAST = sheet2.Last_Name
AND sheet1.NAME_FIRST = sheet2.First_Name
SET sheet1.CONSTITUENT_ID = sheet2.ID

As this is a task that needs to be done often, what I'd like to do is set up my less savvy co-workers with a simple R script. I'm still a total beginner at R however. At present, I've got sheet1 and sheet2 in two data.frames (oh-so-descriptively named book1 and book2) via the native R method: book2 = read.csv("sheet2.csv") Through some googling it was suggested I take a look at the data.table library, but I'm still at a bit of a loss for where to start. Can someone help me translate SQL in to R?

Was it helpful?

Solution

While SQL is a special-purpose language that for the most part id declarative and set-based, R is a functional programming language. It is often difficult to do direct language translations. Using base R, you would probably use a merge() command to bring different tables together.

If we have these sample tables

sheet1<-data.frame(
    NAME_LAST=letters[1:10],
    NAME_FIRST=letters[11:20],
    SEX=sample(c("M","F"), 10, replace=T)
)

sheet2<-data.frame(
    Last_Name=letters[1:10],
    First_Name=letters[11:20],
    ID = sample(100:200, 10)
)[sample(1:10), ]

sheet1;sheet2;

Then we can merge them with

mm<-merge(sheet1, sheet2, 
    by.x=c("NAME_LAST","NAME_FIRST"),
    by.y=c("Last_Name","First_Name")
)
mm;

You can them probably subset this merged data set to get what you need.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top