The problem as you state it seems to be removing duplicate rows from a data.frame, and this does not require any aggregation. Based on your example, this is what you're after:
```r
unique(test.df[c(1,3,4)])
#   id x1 x2
# 1  A  1  A
# 4  B  2  B
```
EDIT:

I don't quite get what you mean by: "I tried with `FUN=unique` but it does not seem to work." Just for the sake of explaining what you might have gotten wrong with `aggregate`, here is how one could get the same result with it:
```r
test.df$x2 <- as.character(test.df$x2)
aggregate(. ~ id, FUN = unique, data = test.df[c(1,3,4)])
#   id x1 x2
# 1  A  1  A
# 2  B  2  B
```
However, there is no need to use `aggregate()` here; it's terribly inefficient for this problem. You can check that with `system.time(.)`, which already shows a difference even on this small data:
```r
system.time(unique(test.df[c(1,3,4)]))
#   user  system elapsed
#  0.001   0.000   0.001
system.time(aggregate(. ~ id, FUN = unique, data = test.df[c(1,3,4)]))
#   user  system elapsed
#  0.004   0.000   0.004
```
Go ahead and run this on your million rows, check that the results agree using `identical`, and compare the run times.
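As a sketch of that comparison on a toy stand-in for your data (the `test.df` below is an assumption, not your real data; note that `unique()` keeps the original row names while `aggregate()` renumbers, so the row names have to be reset before comparing):

```r
# Toy stand-in for the question's data (assumed columns: id, day, x1, x2)
test.df <- data.frame(id  = c("A", "A", "B", "B"),
                      day = 1:4,
                      x1  = c(1, 1, 2, 2),
                      x2  = c("A", "A", "B", "B"),
                      stringsAsFactors = FALSE)

u <- unique(test.df[c(1, 3, 4)])
a <- aggregate(. ~ id, FUN = unique, data = test.df[c(1, 3, 4)])

# unique() keeps the original row names (1 and 4 here); reset them
# so the two results can be compared on contents alone:
rownames(u) <- NULL
identical(u, a)   # use all.equal(u, a) if column classes differ slightly
```

If `identical` comes back `FALSE`, inspect `str(u)` and `str(a)`: the usual culprit is a column stored as factor in one result and character in the other.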
From your comments I think you're confused about the behaviour of `unique`. As @mnel explains, it (`unique.data.frame`) removes all duplicate rows from the given data.frame. It works for your case because you say that `x1` and `x2` will have the same values for each `id`. So you don't have to know where in the data.frame each `id` is; you just have to pick one row per `id`.
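To make that behaviour concrete, a minimal toy example (the data here is illustrative, not yours):

```r
df <- data.frame(id = c("A", "A", "B"), x1 = c(1, 1, 2))

# duplicated() flags rows that exactly repeat an earlier row:
duplicated(df)   # FALSE  TRUE FALSE

# unique() simply drops the flagged rows -- no aggregation involved,
# and the surviving rows keep their original row names:
unique(df)
#   id x1
# 1  A  1
# 3  B  2
```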