Question

I have a data.frame that looks like this:

repex <- structure(list(cat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("x", 
"y", "z"), class = "factor"), year = c(1980, 1980, 1982, 1982, 
1990, 1991, 1991, 1991, 1993, 1981, 1981, 1983, 1990, 1996, 1996, 
1996, 1996, 1999, 2002, 1994), org = structure(c(2L, 3L, 4L, 
2L, 5L, 6L, 7L, 8L, 9L, 2L, 3L, 5L, 3L, 10L, 11L, 4L, 9L, 10L, 
3L, 9L), .Label = c("709340", "a", "b", "c", "d", "f", "j", "k", 
"e", "h", "m"), class = "factor")), .Names = c("cat", "year", 
"org"), row.names = c(NA, 20L), class = "data.frame")

I want to create a new object (ideally a data.table or data.frame) in which the elements of org are laid out horizontally after each cat, year combination.

I tried to run the following:

repex <- data.table(repex)
setkey(repex,cat,year)
repex[, list(org), by="cat,year"]  #OR
repex[, paste(org,sep="_"), by="cat,year"] # OR
with(repex, tapply(org,paste(cat,year,sep="_"),paste))

The first two data.table options merely return a copy of the entire data.table. The tapply option (applied to repex as either a data.table or a data.frame) works for a small dataset, but it creates a list object, which is inconvenient because I need to add the output to another data.frame that is keyed on the cat_year combination. Additionally, for a long dataset (nrow > 100,000) it takes forever, especially as in some cases it needs to paste more than 100 org variants.

My desired output would be a data.table that looks something like this:

x 1980 a b
x 1982 a c # org would ideally be rearranged
x 1990 d
x 1991 f j k 
...
y 1996 c e h m
...
z 2002 b

Solution

One of your actual problems is using the incorrect arguments to paste. You are looking for collapse, not sep. Another problem is using "data.table" syntax incorrectly.
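
To make the difference between the two arguments concrete, here is a minimal base R sketch. The vector orgs is a made-up example, not part of the original data. sep goes between the arguments passed to paste, while collapse joins the elements of the result into one string:

```r
# Hypothetical example vector, for illustration only
orgs <- c("f", "j", "k")

# sep is placed between the *arguments* of paste(); with a single vector
# argument there is nothing to separate, so the vector comes back unchanged
with_sep <- paste(orgs, sep = "_")
print(with_sep)       # still a character vector of length 3

# collapse joins the *elements* of the result into a single string
with_collapse <- paste(orgs, collapse = " ")
print(with_collapse)  # a single string
```

This is why paste(org, sep = "_") inside a grouped data.table call appears to just copy the column: every row is pasted on its own.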


Update

Considering the comments to this answer, I would suggest something like this instead:

library(data.table)
library(reshape2)
DT <- as.data.table(repex)

setkey(DT, cat, year, org) ## Sorts everything

## Creates a column "var" with the sequence of values ("V1", "V2", and so on)
DT[, var := paste("V", sequence(.N), sep = ""), by = list(cat, year)]
head(DT)
#    cat year org var
# 1:   x 1980   a  V1
# 2:   x 1980   b  V2
# 3:   x 1982   a  V1
# 4:   x 1982   c  V2
# 5:   x 1990   d  V1
# 6:   x 1991   f  V1

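Convert that to a "wide" format:

## In data.table 1.9.6 and later, the same call is written dcast(DT, ...) and reshape2 is not needed
dcast.data.table(DT, cat + year ~ var, value.var="org")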
#     cat year V1 V2 V3 V4
#  1:   x 1980  a  b NA NA
#  2:   x 1982  a  c NA NA
#  3:   x 1990  d NA NA NA
#  4:   x 1991  f  j  k NA
#  5:   x 1993  e NA NA NA
#  6:   y 1981  a  b NA NA
#  7:   y 1983  d NA NA NA
#  8:   y 1990  b NA NA NA
#  9:   y 1996  c  e  h  m
# 10:   z 1994  e NA NA NA
# 11:   z 1999  h NA NA NA
# 12:   z 2002  b NA NA NA
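
The question mentions adding the result to another table that is based on the cat/year combination. One way that could look, as a base R sketch: both wide and other below are small made-up tables (hypothetical names and values), with wide shaped like the dcast output above.

```r
# Hypothetical wide result, shaped like the dcast output above
wide <- data.frame(
  cat  = c("x", "x", "y"),
  year = c(1980, 1982, 1981),
  V1   = c("a", "a", "a"),
  V2   = c("b", "c", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical second table with one row per cat/year combination
other <- data.frame(
  cat   = c("x", "y", "z"),
  year  = c(1980, 1981, 2002),
  score = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `other`, add the org columns where a match exists
joined <- merge(other, wide, by = c("cat", "year"), all.x = TRUE)
print(joined)
```

Rows of other without a match in wide keep NA in the org columns.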

Original answer

This is a pretty straightforward aggregate problem:

aggregate(org ~ cat + year, repex, function(x) paste(sort(x), collapse = " "))
#    cat year     org
# 1    x 1980     a b
# 2    y 1981     a b
# 3    x 1982     a c
# 4    y 1983       d
# 5    x 1990       d
# 6    y 1990       b
# 7    x 1991   f j k
# 8    x 1993       e
# 9    z 1994       e
# 10   y 1996 c e h m
# 11   z 1999       h
# 12   z 2002       b

A "data.table" approach:

library(data.table)
DT <- as.data.table(repex)
DT[, list(org = paste(sort(org), collapse = " ")), by = list(cat, year)]

And, to round things out, a "dplyr" approach:

library(dplyr)
## %>% replaces the old %.% operator, which was removed from dplyr
repex %>% group_by(cat, year) %>% summarise(org = paste(sort(org), collapse = " "))

OTHER TIPS

@Ananda,

Here's my inelegant way of solving the problem myself. I am sure there is an easier way that takes your advice but just thought I'd share as well.

Step 1: Paste all the org into a list

tmp1 <- with(repex, tapply(org,paste(cat,year,sep="_"), paste))

Step 2: Find the longest length of the list (very inelegantly)

x <- integer(length(tmp1))
for (i in seq_along(tmp1)) {
  x[i] <- length(tmp1[[i]])  # number of org values in group i
}
max(x)

Step 3: Using the maximum of x, construct a matrix in which each organization occupies its own cell (with special thanks to @agstudy for a previous answer):

tmp <- do.call(rbind,lapply(tmp1,
               function(y)
                 if(length(y)>0)c(y,rep(NA, max(x)-length(y)))
                             else c(y,rep(NA,max(x)))))

Step 4: Turn tmp into a data.frame

tmp <- data.frame(tmp)

I know it's pretty cumbersome, but it has the advantage of making it a lot easier to search for a specific org, as each org appears in its own cell.
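
For what it's worth, the same four steps can probably be condensed in base R. This is only a sketch on a small made-up data set (toy is hypothetical and stands in for repex); it uses split, lengths (R 3.2.0 or later) and the fact that extending a vector with length<- fills it with NA:

```r
# Small made-up data set (hypothetical; stands in for repex)
toy <- data.frame(
  cat  = c("x", "x", "x", "y"),
  year = c(1980, 1980, 1982, 1981),
  org  = c("b", "a", "c", "a"),
  stringsAsFactors = FALSE
)

# Step 1: collect the org values for each cat_year key
groups <- split(toy$org, paste(toy$cat, toy$year, sep = "_"))

# Step 2: size of the longest group, without an explicit loop
max_len <- max(lengths(groups))

# Step 3: sort each group and pad it with NA up to max_len
padded <- lapply(groups, function(y) {
  y <- sort(y)
  length(y) <- max_len  # extending a vector fills the new slots with NA
  y
})

# Step 4: bind the padded vectors row-wise and convert to a data.frame
wide <- data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(wide)
```

The row names of wide carry the cat_year key, so it can still be matched against another table.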

Licensed under: CC-BY-SA with attribution