Question

Say I have two tables, name and age, like this:

> name
    key   name
1 a,b,c   jack
2     d daniel
3     e    foo
4   f,g    bar
> age
  key age
1   b  13
2   d  21
3   e  24
4   k  34
5   f 100

I am trying to join these two tables on the key column, which is present in both. The challenge is that the key column in the name table is not normalized: a single row can hold several comma-separated keys. What is the best way to combine the two tables so that every row of the name table appears in the joined table exactly as in the original (like the "res" table below)?

> res
    key   name age
1 a,b,c   jack  13
2     d daniel  21
3     e    foo  24
4   f,g    bar 100

Here is the dput output for each table:

> dput(name)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor")), .Names = c("key", 
"name"), class = "data.frame", row.names = c(NA, -4L))

> dput(age)

structure(list(key = structure(c(1L, 2L, 3L, 5L, 4L), .Label = c("b", 
"d", "e", "f", "k"), class = "factor"), age = c(13L, 21L, 24L, 
34L, 100L)), .Names = c("key", "age"), class = "data.frame", row.names = c(NA, 
-5L))

> dput(res)

structure(list(key = structure(1:4, .Label = c("a,b,c", "d", 
"e", "f,g"), class = "factor"), name = structure(c(4L, 2L, 3L, 
1L), .Label = c("bar", "daniel", "foo", "jack"), class = "factor"), 
    age = c(13L, 21L, 24L, 100L)), .Names = c("key", "name", 
"age"), class = "data.frame", row.names = c(NA, -4L))

Solution 2

If you don't mind using two joins:

library(plyr)
# factors to character vectors:
name <- as.data.frame(sapply(name, as.character), stringsAsFactors=FALSE)

# split comma-separated ids into named list:
(tmp <- setNames(strsplit(name$key, ","), name$name))
# $jack
# [1] "a" "b" "c"
# 
# $daniel
# [1] "d"
# 
# $foo
# [1] "e"
# 
# $bar
# [1] "f" "g"

# list to long 2-column data frame:
(tmp <- setNames(ldply(tmp, matrix), c("name", "key")) )
#     name key
# 1   jack   a
# 2   jack   b
# 3   jack   c
# 4 daniel   d
# 5    foo   e
# 6    bar   f
# 7    bar   g

# join the long data frame with the age table (1st join) &
# add the original comma-separated key column (2nd join)
join(join(age, tmp, type="inner"),
     name, by="name")[-1] 
#   age   name   key
# 1  13   jack a,b,c
# 2  21 daniel     d
# 3  24    foo     e
# 4 100    bar   f,g
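
The same split-then-look-up idea can also be sketched in base R, without plyr. This is only an illustration under a few assumptions: the tables are rebuilt here with character columns, the helper names (parts, long, first_age) are made up for the example, and when several keys of one row match, the first non-missing age is kept.

```r
# Example data, rebuilt from the question (character columns instead of factors)
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Split each comma-separated key into its individual keys
parts <- strsplit(name$key, ",", fixed = TRUE)

# Long table: one row per (original row number, single key) pair
long <- data.frame(row = rep(seq_along(parts), lengths(parts)),
                   key = unlist(parts),
                   stringsAsFactors = FALSE)

# Look up the age of every single key (NA where age has no such key)
long$age <- age$age[match(long$key, age$key)]

# Assumption: keep the first non-missing age per original row
first_age <- tapply(long$age, long$row, function(a) a[!is.na(a)][1])

# Attach the result to an untouched copy of name
res <- name
res$age <- as.integer(first_age[as.character(seq_len(nrow(name)))])
# res$age is 13, 21, 24, 100 for the example data
```

Rows of name whose keys do not appear in age at all would get NA here instead of being dropped, which keeps the name table intact.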

OTHER TIPS

Perhaps you can coerce the "key" column from the "name" data.frame to a regex pattern and use sapply as follows:

sapply(gsub(",", "|", name$key), function(x) grep(x, age$key))
# a|b|c     d     e   f|g 
#     1     2     3     5 

The above returns, for each row of "name", the row number in the "age" data.frame where a match was found.

You could then use these row numbers to extract the "age" values from the "age" data.frame with basic [row, col] indexing; the result can be assigned to a new column such as name$age:

age[sapply(gsub(",", "|", name$key), function(x) grep(x, age$key)), "age"]
# [1]  13  21  24 100
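
One caveat with the pattern above: grep matches anywhere in the string, so with longer keys a pattern such as a|b|c would also match a key like "ab". A hedged variant, assuming the keys contain no regex metacharacters, anchors each pattern and takes the first match (or NA when nothing matches). The names patterns and idx are made up for this sketch, and the tables are rebuilt with character columns.

```r
# Example data, rebuilt from the question (character columns instead of factors)
name <- data.frame(key = c("a,b,c", "d", "e", "f,g"),
                   name = c("jack", "daniel", "foo", "bar"),
                   stringsAsFactors = FALSE)
age <- data.frame(key = c("b", "d", "e", "k", "f"),
                  age = c(13L, 21L, 24L, 34L, 100L),
                  stringsAsFactors = FALSE)

# Turn "a,b,c" into "^(a|b|c)$" so that only whole keys can match
# (assumes the keys themselves contain no regex metacharacters)
patterns <- paste0("^(", gsub(",", "|", name$key, fixed = TRUE), ")$")

# For each pattern, take the first matching row of age; NA if there is none
idx <- vapply(patterns, function(p) grep(p, age$key)[1],
              integer(1), USE.NAMES = FALSE)

# Indexing with NA yields NA, so unmatched rows of name are kept
name$age <- age$age[idx]
```

Using vapply rather than sapply guarantees exactly one integer per row, even when a row has no match or more than one.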

For each row, I would split each complex key with the stri_split_fixed function from the stringi package and then try to match one of the keys from the second dataset.

library(stringi)
res <- name
keys <- stri_split_fixed(name$key, ",") # returns a list of individual keys in each row
res$age <- sapply(seq_len(nrow(name)), function(r) {
   row_keys <- keys[[r]] # get the keys in rth row
   age$age[age$key %in% row_keys] # ages whose key is one of them
})

This gives the result you asked for.

If the keys contain (or may contain) spaces, then a regex search would be more appropriate:

stri_split_regex(name$key, ",\\p{Z}*")

or even the extraction of sequences of word characters

stri_extract_all_regex(name$key, "\\w+")
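
If stringi is not available, a rough base R equivalent of the space-tolerant split could look like the sketch below. The input vector keys_with_spaces is invented for the example; trimws needs R 3.2.0 or later.

```r
# Hypothetical keys with stray spaces around the commas
keys_with_spaces <- c("a, b,c", "d", "f ,g")

# Split on the literal comma, then trim whitespace from every piece
parts <- lapply(strsplit(keys_with_spaces, ",", fixed = TRUE), trimws)
# parts is list(c("a", "b", "c"), "d", c("f", "g"))
```

Splitting on a fixed comma and trimming afterwards avoids having to get the whitespace regex right.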
Licensed under: CC-BY-SA with attribution