Removing duplicate rows with ddply

https://stackoverflow.com/questions/23079248

r
plyr

03-07-2023

Question

I have a dataframe df containing two factor variables (Var and Year) as well as one (in reality several) column with values.

df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Year = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L), .Label = c("2000", "2001", 
"2002"), class = "factor"), Val = structure(c(1L, 2L, 2L, 4L, 
1L, 3L, 3L, 5L, 6L, 6L), .Label = c("2", "3", "4", "5", "8", 
"9"), class = "factor")), .Names = c("Var", "Year", "Val"), row.names = c(NA, 
-10L), class = "data.frame")

> df
   Var Year Val
1    A 2000   2
2    A 2001   3
3    A 2002   3
4    B 2000   5
5    B 2001   2
6    B 2002   4
7    B 2002   4
8    C 2000   8
9    C 2001   9
10   C 2002   9

Now I'd like to find rows with the same value for Val for each Var and Year and only keep one of those. So in this example I would like row 7 to be removed.

I've tried to find a solution with plyr using something like df_new <- ddply(df, .(Var, Year), summarise, !duplicate(Val)) but obviously that is not a function accepted by ddply.

I found this similar question but the plyr solution by Arun only gives me a dataframe with 0 rows and 0 columns and I do not understand the answer well enough to modify it according to my needs.

Any hints on how to go about that?

Solution 2

You can just use the unique() function instead of !duplicate(Val):

df_new <- ddply(df, .(Var, Year), summarise, Val=unique(Val))
# or
df_new <- ddply(df, .(Var, Year), function(x) x[!duplicated(x$Val),])
# or if you only have these 3 columns:
df_new <- ddply(df, .(Var, Year), unique)
# with dplyr (requires library(dplyr); %>% replaces the older %.% operator)
df %>% group_by(Var, Year) %>% filter(!duplicated(Val))

hth

Other suggestions

Non-duplicates of Val by Var and Year are the same as non-duplicates of Val, Var, and Year. You can specify several columns for duplicated (or the whole data frame).

I think this does what you'd like.

df[!duplicated(df), ]

Or, naming the columns explicitly:

df[!duplicated(df[, c("Var", "Year", "Val")]), ]
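The column-subset form matters once the data frame has more value columns, as the question says it does in reality. Here is a minimal sketch of that case, assuming a hypothetical extra column Val2 that is not part of the original data:

```r
# Rebuild the example data; Val2 is a hypothetical extra column added for illustration.
df <- data.frame(
  Var  = factor(c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C")),
  Year = factor(c(2000, 2001, 2002, 2000, 2001, 2002, 2002, 2000, 2001, 2002)),
  Val  = factor(c(2, 3, 3, 5, 2, 4, 4, 8, 9, 9)),
  Val2 = seq_len(10)
)

# Flag rows whose (Var, Year, Val) combination has already appeared.
key_cols <- c("Var", "Year", "Val")
is_dup <- duplicated(df[, key_cols])

# Keep the first occurrence of each combination; all columns are retained.
df_new <- df[!is_dup, ]

which(is_dup)   # 7
nrow(df_new)    # 9
```

With Val2 present, df[!duplicated(df), ] would keep all 10 rows, because rows 6 and 7 differ in that column. Restricting duplicated() to the key columns is what drops row 7.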

You don't need the plyr package here. If your whole dataset consists of only these 3 columns and you need to remove the duplicates, then you can use:

df_new <- unique(df)

Otherwise, if you just need to pick the first observation for each combination of the grouping variables, you can use the method suggested by Richard. That's usually how I do it.
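To illustrate how "first observation per group" differs from removing fully identical rows, here is a rough sketch in base R. The data frame df2 and its values are hypothetical, chosen so that one group holds two different values:

```r
# Hypothetical data: the group (B, 2002) holds two different values of Val.
df2 <- data.frame(
  Var  = c("A", "B", "B", "B"),
  Year = c(2000, 2002, 2002, 2001),
  Val  = c(2, 4, 7, 5)
)

# Keep only the first row seen for each (Var, Year) combination.
first_per_group <- df2[!duplicated(df2[, c("Var", "Year")]), ]

# Deduplicating on all columns instead keeps both (B, 2002) rows,
# because their Val entries differ.
all_distinct <- unique(df2)

nrow(first_per_group)  # 3
nrow(all_distinct)     # 4
```

Which variant is appropriate depends on whether differing values within a group should be kept or collapsed.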

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow