Question

I'm trying to create a list of lists of data frames so I can pass them to multiple cores via mclapply, but that's not the part I'm having trouble with. I wrote a function that creates a list of smaller data frames from one large data frame, then applied it a second time to break each of those down further, giving a list of lists of small data frames. The problem is that when the function is called the second time (via lapply over the first list of data frames), it adds extra small data frames to each list of data frames in the larger list. I have no idea why. I don't think it's the lapply, since when I ran the function manually on one frame from the first list it also didn't work. Here's the code:

create_frame_list<-function(mydata,mystep,elnames){

    datalim<-dim(mydata)[1]
    mylist<-list()
    init<-1
    top<-mystep
    i<-1

    repeat{

        if(top < datalim){
            mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:top,]))
            }
        else {
            mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:datalim,]))
            }

        if(top > datalim){break}    

        i<-i+1
        init<-top+1
        top<-top+mystep

        }

        return(mylist)
    }

test_data<-data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))

#Create the first list of data frames, works fine
master_list<-create_frame_list(test_data,300,"bd")

#check the dimensions of the data frames created, they are correct
lapply(master_list,dim)

#create a list of lists of data frames, doesn't work right
list_list<-lapply(master_list,create_frame_list,50,"children")

#check the dimensions of the data frames in the various lists. The function when called again is making extra data frames of length 2 for no reason I can see
lapply(list_list,lapply,dim)

So that's it. Any help is appreciated as always.


Solution

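Okay, so your code only has one small bug, but there are definitely better ways of doing this. Your code doesn't work when the number of rows is an exact multiple of step. This has to do with the position of your break: when top lands exactly on datalim, the top > datalim test is false, so the loop runs one more time with init already past the last row. You didn't see it on the first call because 1000 is not a multiple of 300, but the 300-row and 100-row frames in master_list are both exact multiples of 50. Here is a fix: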

create_frame_list<-function(mydata,mystep,elnames){
  datalim<-dim(mydata)[1]
  mylist<-list()
  init<-1
  top<-mystep
  i<-1
  repeat{
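    # Take a full chunk while one fits; otherwise take whatever rows remain.
    if(top < datalim)
      # mylist[[i]]<-assign(paste0(elnames,as.character(i)),data.frame(mydata[init:top,]))
      mylist[[i]]<-mydata[init:top,]
    else
      mylist[[i]]<-mydata[init:datalim,]
    # if(top > datalim) break 
    i<-i+1
    init<-top+1
    top<-top+mystep
    # Stop as soon as the next chunk would start past the last row.
    if(init > datalim) break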
  }
  return(mylist)
}

The main fix was to move the break check to the end of the loop and make it depend on init rather than top.
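
To see where the extra two-row frames came from, here is a minimal sketch of what `init:datalim` does once init has moved past the last row. The data frame `df` is a hypothetical stand-in for one of the 300-row frames, not part of the original code.

```r
# Hypothetical stand-in for one 300-row frame from master_list.
df <- data.frame(x = 1:300)
datalim <- nrow(df)
init <- datalim + 1      # where the original loop ends up after the last full chunk

idx <- init:datalim      # ':' counts down when the left side is larger
idx                      # 301 300

extra <- df[idx, , drop = FALSE]
nrow(extra)              # 2: an all-NA row (index 301 is out of range) plus row 300

# seq_len() is a safer way to build index ranges that may be empty.
seq_len(0)               # integer(0)
```

So the "extra" frame is just row 301 (which doesn't exist, hence NA) followed by row 300.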

You'll note that I cleaned up your code and removed the assign statements (so elnames is now unused). One good rule of thumb is: if you think you need to use assign or get, you're doing it wrong. In your case, the assign was completely redundant: it created a variable local to the function call and did not name the list elements the way you wanted.
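
As a rough illustration of that last point, the sketch below shows what assign does when used inside a function the way the original code used it. The function `make_list` and the list `pieces` are made up for this example.

```r
make_list <- function() {
  out <- list()
  # assign() creates a variable named "bd1" inside this function call only,
  # then returns the value, which is what actually lands in the list.
  out[[1]] <- assign("bd1", data.frame(a = 1:3))
  list(element_names = names(out),
       local_var_exists = exists("bd1", envir = environment(), inherits = FALSE))
}

res <- make_list()
res$element_names      # NULL: the list element is unnamed
res$local_var_exists   # TRUE: the variable existed, but only inside the call

# Naming the elements directly does what the assign() calls were aiming for.
pieces <- list(data.frame(a = 1:3), data.frame(a = 4:6))
names(pieces) <- paste0("bd", seq_along(pieces))
names(pieces)          # "bd1" "bd2"
```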


If you're looking for a better way to do this, here is one option:

n<-nrow(test_data)
step<-300
split.var<-rep(1:ceiling(n/step),each=step,length.out=n)
master_list<-split(test_data,split.var)
names(master_list)<-paste0('bd',seq_along(master_list))
# If you didn't care which rows end up together, you could just do 
# split(test_data,seq(ceiling(n/step)))
# (the short grouping vector is recycled, so rows are dealt out round-robin)
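
To make the difference between the two groupings concrete, here is a small comparison. The seed and the names `blocked` and `dealt` are arbitrary choices for this sketch.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
test_data <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))
n <- nrow(test_data)
step <- 300

# Blocked grouping: 1 repeated 300 times, then 2, and so on, cut off at n values.
blocked <- split(test_data, rep(seq_len(ceiling(n / step)), each = step, length.out = n))
block_sizes <- vapply(blocked, nrow, integer(1))
block_sizes                       # 300 300 300 100

# Recycled grouping: 1,2,3,4,1,2,3,4,... so rows are dealt out round-robin.
dealt <- split(test_data, seq(ceiling(n / step)))
dealt_sizes <- vapply(dealt, nrow, integer(1))
dealt_sizes                       # 250 250 250 250
head(rownames(dealt[[1]]), 3)     # "1" "5" "9"
```

The recycled version gives pieces of near-equal size rather than pieces of exactly step rows, which may or may not matter for your use.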

If you want to get fancy, you could do something like:

special.split<-function(data,step) 
  split(data,rep(1:ceiling(nrow(data)/step),each=step,length.out=nrow(data)))
lapply(special.split(test_data,300),special.split,step=50)

And that would do everything in one step.
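
A quick way to check the nested result is to count the pieces and their rows. This is only a sanity check built on the code above; `list_list` and `child_rows` are names chosen for the sketch.

```r
special.split <- function(data, step)
  split(data, rep(1:ceiling(nrow(data) / step), each = step, length.out = nrow(data)))

test_data <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))
list_list <- lapply(special.split(test_data, 300), special.split, step = 50)

lengths(list_list)                   # 6 6 6 2
child_rows <- unlist(lapply(list_list, function(children) vapply(children, nrow, integer(1))))
all(child_rows == 50)                # TRUE: no stray two-row frames
sum(child_rows) == nrow(test_data)   # TRUE: the row counts add up to the original
```

A list like this can then be passed to parallel::mclapply in place of lapply; keep in mind that mclapply relies on forking, so it won't use more than one core on Windows.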

Licensed under: CC-BY-SA with attribution