Quicken this for loop?

https://stackoverflow.com/questions/23599806

20-07-2023

Question

I have a dataset called cpue with 3.3 million rows. I have made a subset of this dataframe called dat.frame. (See below for the heads of cpue and dat.frame.) I have added two new fields to dat.frame: "ssh_vec" and "ssh_mag". Although the heads of cpue and dat.frame look the same, the rest of the rows are not actually in the same order.

head(cpue)
  code  event    Lat   Long stat_area Day Month Year id
1  BCO 447602 -43.45 182.73        49  17     3 1995  1



head(dat.frame)
  code  event    Lat   Long stat_area Day Month Year id cal.jdate  ssh_vec  ssh_mag
1  BCO 447602 -43.45 182.73        49  17     3 1995  1   2449857 56.83898 4.499350

Currently, I am running a loop to add the ssh_vec and ssh_mag variables to "cpue" using the unique identifier "id":

cpue_full$ssh <- NA
cpue_full$sshmag <- NA

for(i in 1:nrow(dat.frame))
{
    ndx<- dat.frame$id[i]
    cpue_full$ssh[ndx]<- dat.frame$ssh_vec[i]
    cpue_full$sshmag[ndx]<- dat.frame$ssh_mag[i]
}

This has been running over the weekend and is only up to:

i
[1] 132778

... out of:

nrow(dat.frame)
[1] 2797789

Within the loop, there is nothing that looks too computationally demanding. Is there a better alternative?

Solution

Are you sure you need a for loop at all? I think this might be equivalent:

cpue_full$ssh[dat.frame$id]<- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag
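
The one-liners work because R's indexed assignment `x[idx] <- values` writes each `values[k]` to position `idx[k]` in a single step. Here is a minimal sketch on made-up data that checks this against the loop. The names `full` and `sub` are hypothetical stand-ins for cpue_full and dat.frame. The question does not confirm that `id` equals the row number in cpue_full, so the sketch also includes a `match()` variant that looks up row positions by `id` instead of assuming it.

```r
set.seed(1)

# Hypothetical stand-ins for cpue_full and dat.frame
full <- data.frame(id = 1:20, ssh = NA_real_)
sub  <- data.frame(id = sample(20, 8), ssh_vec = rnorm(8))

# Loop version, as in the question
loop_res <- full
for (i in seq_len(nrow(sub))) {
  loop_res$ssh[sub$id[i]] <- sub$ssh_vec[i]
}

# Vectorised version: one indexed assignment
vec_res <- full
vec_res$ssh[sub$id] <- sub$ssh_vec

# Variant that finds row positions by id rather than
# assuming id equals the row number
match_res <- full
match_res$ssh[match(sub$id, match_res$id)] <- sub$ssh_vec

# Compare the three results
identical(loop_res, vec_res)
identical(loop_res, match_res)
```

If the rows of cpue_full have been reordered or filtered so that `id` no longer matches the row position, only the `match()` form would write to the intended rows.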

Other suggestions

I would recommend taking a look at data.table. Since I don't have your data, here is a simple example using dummy data.

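library(data.table)
N = 10^6

# Dummy data: N values and a random group label for each
dat <- data.table(
  x = rnorm(N),
  g = sample(LETTERS, N, replace = TRUE)
)

# Per-group mean of x
dat2 <- dat[, list(mx = mean(x)), by = g]

# Join the group means back onto the full table by g
h = merge(dat, dat2, by = 'g')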
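
The dummy example above merges on a grouping column. For the task in the question, a data.table update join on `id` might look like the sketch below. The table names `cpue_dt` and `sub_dt` and their contents are hypothetical, and the sketch assumes `id` is unique in both tables.

```r
library(data.table)

set.seed(1)

# Hypothetical stand-ins for cpue and dat.frame
cpue_dt <- data.table(id = 1:20)
sub_dt  <- data.table(id = sample(20, 8),
                      ssh_vec = rnorm(8),
                      ssh_mag = runif(8))

# Update join: rows of cpue_dt whose id appears in sub_dt get the
# two new columns, added by reference; unmatched rows are left as NA
cpue_dt[sub_dt, on = "id",
        `:=`(ssh = i.ssh_vec, sshmag = i.ssh_mag)]

head(cpue_dt)
```

The `i.` prefix refers to columns of the table in the `i` position (here `sub_dt`). Because `:=` modifies `cpue_dt` in place, the large table is not copied.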

Do you even need to loop? From the code fragment posted it would appear not.

cpue_full$ssh[dat.frame$id] <- dat.frame$ssh_vec
cpue_full$sshmag[dat.frame$id]<- dat.frame$ssh_mag

should work. A quick (and small) dummy example:

set.seed(666)
ssh <- rnorm(10^4) 
datf <- data.frame(id = sample.int(10000L), ssh = NA)

system.time(datf$ssh[datf$id] <- ssh) # user 0, system 0, elapsed 0

# Reset dummy data
datf$ssh <- NA 

system.time({
  for (i in 1:nrow(datf) ) {
    ndx <- datf$id[i]
    datf$ssh[ndx] <- ssh[i]
  }
} ) # user 2.26, system 0.02, elapsed 2.28

PS - I've not used the data.table package, so I don't follow Ramnath's answer. In general you should avoid loops if possible (see fortune(142) and Circle 3 of The R Inferno).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow