tag duplicates by time difference, generate id

https://stackoverflow.com/questions/8840083

15-04-2021

Question

I am just starting out with R and am somewhat stuck at the moment. I have a dataset of election results, and the only identifier for a person is a string variable holding their name. Many politicians appear more than once because they took part in more than one election.

I want to generate an id to identify each politician. However, some names are common and actually refer to different people. I want to single out these cases by looking at the time difference between occurrences, i.e. if there are more than 30 years between appearances, the same name belongs to a different person.

I have computed the difference between consecutive occurrences, and each time that difference is larger than 30 years, I want to record that all subsequent occurrences belong to a different person. I have dabbled with loops but didn't get them to work the way I wanted, and I guess there's a more idiomatic way to solve this.

Then I want to create a unique id for each person using the name variable and that record, but I guess this can simply be done using the id() function.

df <- df[order(df$name, df$year),]

# difference between consecutive occurrences, NA for the first occurrence
df$timediff <- ave(df$year, df$name, FUN=function(x) c(NA,diff(x)))

# absolute difference to the first occurrence, haven't used this so far
df$timediff.abs <- ave(df$year, df$name, FUN=function(x) x - x[1])

Solution

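You can sort the data by name and year and then compare each row with the one before it. If the name differs from the previous row, it is a new person. If the name is the same but the gap is 30 years or more, it is also a new person. If the name is the same and the gap is under 30 years, it is the same person. Because the data is sorted, a negative gap can only occur where the name has changed, so such a row is necessarily a new person. (The code below tests `< 30`, so a gap of exactly 30 years starts a new person; use `<= 30` if only gaps of more than 30 years should split a name.)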

In short, a row keeps the identity of the previous row only if it has the same name and the gap is under 30 years. Whenever it does not, you increment the unique identifier.
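As a minimal sketch of this rule on hand-made data (the names and years are made up for illustration, and the vectors are assumed to be already sorted by name and then by year), the identifier can be traced by hand:

```r
# Toy data, already sorted by name and then by year (hypothetical values).
name <- c("a", "a", "a", "b", "b")
year <- c(1900, 1910, 1950, 1920, 1925)
n <- length(name)

# TRUE where the row has the same name as the row before it.
same_name <- c(FALSE, name[-1] == name[-n])
# Years since the previous row; 0 for the first row.
gap <- c(0, diff(year))

# A row starts a new person unless the name repeats within 30 years.
new_person <- !(same_name & gap < 30)
id <- cumsum(new_person)
print(id)  # 1 1 2 3 3
```

The third "a" is 40 years after the second, so it gets a new id, and the first "b" gets a new id because the name changed.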

Here is an example that assigns a unique identifier, using the above rules.

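set.seed(0)
n <- 100
# Simulated data: random years and single-letter "names"
d <- sample(1900:2000, n, replace = TRUE)
v <- sample(letters, n, replace = TRUE)
t1 <- data.frame(v, d)
# Sort by name, then by year, so each row can be compared with the previous one
t2 <- t1[order(t1$v, t1$d), ]
# TRUE if the row has the same name as the previous row
t2$sameName <- c(FALSE, t2$v[-1] == t2$v[-n])
# Years since the previous row (can be negative where the name changes)
t2$diffYrs <- c(0, diff(t2$d))
t2$close <- (t2$diffYrs >= 0) & (t2$diffYrs < 30)
t2$keepPerson <- t2$sameName & t2$close
# Every row that does not continue the previous person starts a new identifier
t2$identifier <- cumsum(!t2$keepPerson)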
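The same idea can be applied per name with ave(), which is closer to the code in the question. This is only a sketch: the data frame below is a made-up stand-in for the asker's data, the column names `name` and `year` are taken from the question, and the `person` and `id` columns are hypothetical additions. It uses the same boundary as the example above (a gap of 30 years or more starts a new person).

```r
# Hypothetical stand-in for the asker's data: one row per candidacy.
df <- data.frame(
  name = c("Smith", "Jones", "Smith", "Smith", "Jones", "Smith"),
  year = c(1950, 1960, 1900, 1904, 1964, 1954),
  stringsAsFactors = FALSE
)
df <- df[order(df$name, df$year), ]

# Within each name, number the persons: the counter starts at 1 and
# goes up by one after every gap of 30 years or more.
df$person <- ave(df$year, df$name,
                 FUN = function(x) cumsum(c(TRUE, diff(x) >= 30)))

# Combine name and counter into a single integer id per person.
key <- paste(df$name, df$person, sep = "_")
df$id <- match(key, unique(key))
print(df)
```

Here the two early Smith rows (1900, 1904) share one id and the two later ones (1950, 1954) share another, while both Jones rows share a single id. Using match() against the unique keys is just one way to turn the name and counter pair into an integer.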

Licensed under: CC-BY-SA with attribution