Question

I have the example data frame

test.df<-data.frame(id=c("A","A","A","B","B","B"), time=c(1:3,1:3), x1=c(1,1,1,2,2,2), x2=c("A","A","A","B","B","B"))

x1 and x2 variables are the same within each id

I would like to aggregate the above data frame to obtain the following

target.df<-data.frame(id=c("A","B"), x1=c(1,2), x2=c("A","B"))

In a sense I wish to aggregate without any FUN. I tried with FUN=unique but does not seem to work. My original dataframe has 1 million rows and thousands x1,x2.... variables of different type (character, dates etc) but are the same within each ID. This is the same as a pivot table in excel

Many thanks

Was it helpful?

Solution

The problem as you state seems to be removing duplicate rows from a data.frame and this does not require any aggregation. Based on your example this is what you're after:

unique(test.df[c(1,3,4)])
# id x1 x2
#1  A  1  A
#4  B  2  B

EDIT:

I don't quite get as to what do you mean by:

"I tried with FUN=unique but does not seem to work."

Just for the sake of explaining as to what you might have gotten with aggregate wrong, here, I show how one could get the same with aggregate:

test.df$x2 <- as.character(test.df$x2)
aggregate(. ~ id, FUN=unique , data = test.df[c(1,3,4)] )

#  id x1 x2
#1  A  1  A
#2  B  2  B

However, there is no need to use aggregate() here. It's terribly inefficient for this problem. You can check it out with system.time(.) which already gives a difference even on this data:

system.time(unique(test.df[c(1,3,4)]))
#    user  system elapsed 
#   0.001   0.000   0.001 
system.time(aggregate(. ~ id, FUN=unique , data = test.df[c(1,3,4)] ))
#    user  system elapsed 
#   0.004   0.000   0.004 

Go ahead and run this on your million rows and check your results with identical and have a look at the run time.

From your comments I think you're confused with the behaviour of unique. As @mnel explains, it (unique.data.frame) removes all duplicate rows alone from the given data.frame. It works for your case because you say that x1 and x2 will have the same values for each ID. So, you dont have to know where in the data.frame ID is. You just have to pick 1 row for each ID.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top