Question

I have a data frame consisting of:

df <- data.frame(V1 = c("red", "white", "blue", "Flag", "red", "yellow", "black", "Flag", "Flag", "white", "red", "Flag"))

I want a second column beside it that numbers the groups separated by Flags. Consecutive Flags should count as a single separator: the number only goes up once a color is reported after a Flag. So it should be:

df[2] <- c(1,1,1,1,2,2,2,2,2,3,3,3) 

I have code that does this in a for loop:

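# number the groups of unique Flags
numrows <- nrow(df)
df[2] <- rep(1, numrows)

counter <- 1
# stop one row early so that df[i + 1, 1] never reads past the last row
for (i in seq_len(numrows - 1)) {
  df[i, 2] <- counter
  # advance the counter after the last Flag of a run
  if (df[i, 1] == "Flag" && df[i + 1, 1] != "Flag") {
    counter <- counter + 1
  }
}
df[numrows, 2] <- counter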

The problem is that my full dataset has 650,000 rows, and the loop will take 8+ hours. Is there a way to get this specific result without a for loop in R?

Solution

Here is a slightly convoluted solution using cumsum() and data.table. It uses the .SD object so that the counter only advances for "Flags" that have a color following them. I'm sure it could be made more concise with a bit of thought.

It takes about 6.24 seconds (elapsed) for 650k rows, including generating the sample data.

library(data.table)
# Return a leading 1 followed by 0s for a group of x rows (a Flag plus the colors after it).
# A group of size 1 (a Flag with no color after it) returns a single 0, so it adds nothing to the count.
get_s <- function(x) {
  if (x == 1) {
    0
  } else {
    c(1, rep(0, x - 1))
  }
}

system.time({
  # 650k rows of sample data
  df <- data.frame(V1 = sample(c("red", "white", "blue", "Flag", "yellow", "black"), 650000, TRUE))
  # index each "Flag": every Flag starts a new group
  df$V2 <- cumsum(ifelse(df$V1 == "Flag", 1, 0))
  # put a 1 on the first row of every group with more than one row, then take the running total
  df$V2 <- cumsum(data.table(df, key = "V2")[, list(get_s(nrow(.SD))), by = "V2"][, V1])
})

#user  system elapsed 
#6.16    0.06    6.24 
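
For comparison, here is a minimal base R sketch of the same cumsum() idea without data.table. It is not part of the answer above: the helper name flag_groups and the vector name x are made up for illustration. It assumes a new group should start at every color that directly follows a Flag, which is what the expected output in the question shows. Traced by hand on the 12-row example, the data.table code above appears to step up on the last Flag of a run instead (giving 1,1,1,2,2,2,2,2,3,3,3,3), so this variant may be closer to what was asked for.

```r
# Assumption: the colors are held in a character vector `x` (hypothetical name).
x <- c("red", "white", "blue", "Flag", "red", "yellow", "black",
       "Flag", "Flag", "white", "red", "Flag")

# Hypothetical helper, not taken from the original answer.
flag_groups <- function(x) {
  is_flag <- x == "Flag"
  # TRUE where the previous row was a Flag (nothing precedes the first row)
  prev_is_flag <- c(FALSE, head(is_flag, -1))
  # a new group starts at every color that directly follows a Flag
  starts <- !is_flag & prev_is_flag
  # running count of group starts, shifted so the first group is numbered 1
  1 + cumsum(starts)
}

ids <- flag_groups(x)
print(ids)
```

On a data frame this would be used as df$V2 <- flag_groups(df$V1). Everything is a single pass over logical vectors, so it should scale to 650k rows without trouble, though no timing was measured here.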
Licensed under: CC-BY-SA with attribution