Question

So, I have a data.frame object called "DATA". This object contains one column called "Point" (DATA$Point). Since there are some duplicates in this particular column, I would like to build a function that samples only one row among these duplicates in DATA.

I've been trying to do it this way:

sort.song<-function(DATA){

 Point<-levels(DATA$Point)
 DATA.NEW<-DATA[1:length(Point),] 

#Ideally DATA.NEW should have an empty dataframe with nrow=length(Point) and the same columns
#as in DATA. But I THINK it will work (I don't know how to do the "ideally" way)

 for(i in 1:dim(DATA)[1]){ #dim(DATA)[1] always bigger than length(Point)
  SUBDATA<-DATA[which(DATA$Point%in%Point[i]),]

#I need to sample one row of the original data set only of the duplicates of the same value.
#So if there isn't a duplicate of one particular value, move on. Otherwise sample one between
#those duplicates.

  l<-dim(SUBDATA)[1]
  if (l==1){DATA.NEW[i,]<-SUBDATA[l,]}else{lc<-sample(1:l,1)}
  DATA.NEW[i,]<-SUBDATA[lc,]
  }
 return(DATA.NEW)
}

test<-sort.song(DATA)

But it doesn't work! :( I get the following error message:

Error in `[<-.factor`(`*tmp*`, iseq, value = integer(0)) : 
replacement has length zero

It may be a silly question, but I'm kind of without options here (total R beginner)

Any help will be highly appreciated!!!!

Solution

If you want to choose a random duplicate to keep, rather than duplicated's default behaviour of keeping only the first, then why not randomly shuffle the whole dataset, so that choosing the first row in the shuffled set is effectively choosing a random row from the original:

DATAr <- DATA[sample(nrow(DATA)),]
DATAr <- DATAr[!duplicated(DATAr$Point),]

If the order of your original DATA was important, store the result of sample(...) in a variable, use that to re-order your data, and use it again to restore the original order once you've removed duplicates (or add a column DATA$ind <- 1:nrow(DATA) before shuffling and sort by it afterwards to restore the order).
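
The index-column idea could be sketched roughly as follows. The column name ind comes from the suggestion itself; the example data and the set.seed call are assumptions made only for illustration.

```r
# Hypothetical example data: the values 3 to 7 appear twice in Point
set.seed(42)  # only so that the example is reproducible
DATA <- data.frame(Point = c(1:10, 3:7), some_col = 1:15)

# Remember the original row order in a helper column
DATA$ind <- seq_len(nrow(DATA))

# Shuffle the rows, then keep the first row seen for each Point
DATAr <- DATA[sample(nrow(DATA)), ]
DATAr <- DATAr[!duplicated(DATAr$Point), ]

# Restore the original ordering and drop the helper column
DATAr <- DATAr[order(DATAr$ind), ]
DATAr$ind <- NULL
```

Sorting by ind avoids having to invert the permutation by hand, at the cost of one temporary column.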

OTHER TIPS

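R has built-in functions, sample and duplicated. Thus you can simply use

DATA[ sample( which(!duplicated(DATA$Point)), N ), ]
# where `N` is the sample size you'd like (at most the number of distinct Point values).
# which() turns the logical vector into row numbers, so that sample() draws rows rather than TRUE/FALSE values.

In data.table syntax, the above would be

DATA[ sample( which(!duplicated(Point)), N )]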

So you want every row that is not duplicated AND the first instance of those that are duplicated, right?

Then try this:

# build fake dataset
DATA <- as.data.frame(cbind(sample(c(1:10,3:7)),sample(1:15),sample(1:15)))
names(DATA) <- c("Point","some_col","some_other_col")

# check
print(DATA) # See Point has duplicate values


# your function
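filter_data <- function(DATA){
  distinct_points <- unique(DATA$Point)
  # take the first row for each distinct Point, then stack the rows;
  # rbind returns a regular data frame rather than one with list columns
  do.call(rbind, lapply(distinct_points, function(x){DATA[DATA$Point == x, ][1,]}))
}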


#result
DATA.new <- filter_data(DATA)
print(DATA.new)
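
Note that filter_data keeps the first row for each Point. If a random row per value is wanted instead, as in the original question, one possible sketch is to split the row numbers by Point and draw one per group. This is not part of the answer above; the function name sample_one_per_point and the example data are made up here.

```r
sample_one_per_point <- function(DATA) {
  # Group the row numbers by the value of Point (unused factor levels are dropped)
  groups <- split(seq_len(nrow(DATA)), DATA$Point, drop = TRUE)
  # Draw one row number per group; picking a position with sample.int avoids
  # the surprise that sample(x, 1) treats a single number x as 1:x
  keep <- vapply(groups, function(idx) idx[sample.int(length(idx), 1)], integer(1))
  # Return the chosen rows in their original order
  DATA[sort(unname(keep)), ]
}

# Hypothetical example data: the values 3 to 7 appear twice in Point
DATA <- data.frame(Point = c(1:10, 3:7), some_col = 1:15)
DATA.new <- sample_one_per_point(DATA)
print(DATA.new)
```

Rows whose Point occurs only once are always kept, since their group has a single candidate.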
Licensed under: CC-BY-SA with attribution