Question

I am using data.table in R and looping over my table; it's really slow because of the table size. I wonder if someone has an idea on how to speed this up.

I have a set of values that I want to "cluster". Each row has a position, a positive integer. You can build a simple view of the data like this:

    library(data.table)
    # Here is a toy example: 100 positive integer positions, sorted ascending
    fulltable = seq(1, 4) * seq(1, 1000, 10)
    fulltable = data.table(pos = sort(fulltable))
    fulltable[, id := 1]

Then I loop over the rows, and whenever there is a gap of more than 50 between two consecutive positions I start a new group:

    # Here is the main loop
    lastposition = fulltable[1]$pos
    lastid = fulltable[1]$id
    for (i in 2:nrow(fulltable)) {
        # a gap of more than 50 from the previous position starts a new group
        if (fulltable[i]$pos - 50 > lastposition) {
            lastid = lastid + 1
            print(lastid)
        }
        fulltable[i]$id = lastid  # row-by-row assignment: this is what makes it slow
        lastposition = fulltable[i]$pos
    }

Any idea for an efficient way to do this?


Solution

    # Flag the rows that start a new group: wherever the gap from the
    # previous pos exceeds 50, assign increasing ids 2, 3, ...
    fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N + 1)]
    # All remaining rows get the placeholder id 1
    fulltable[is.na(new_group), new_group := 1L]
    # Carry each group id forward with a cumulative max, then drop the helper column
    fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]
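For reference, the same grouping can also be written as a single `cumsum` over the gap flags; this is a common data.table idiom, shown here as a sketch on the toy table from the question (the column name `lastid_new` is chosen to match the solution's output):

    library(data.table)
    # Rebuild the toy table from the question
    fulltable = data.table(pos = sort(seq(1, 4) * seq(1, 1000, 10)))
    # A new group starts at row 1 and wherever the gap to the previous
    # position exceeds 50; cumsum turns those TRUE flags into group ids
    fulltable[, lastid_new := cumsum(c(TRUE, diff(pos) > 50))]

`cumsum(c(TRUE, diff(pos) > 50))` increments the id at exactly the rows where the original loop would, so the result should match the loop's `id` column while staying fully vectorized.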
Licensed under: CC-BY-SA with attribution