Question

How do I efficiently insert multiple disjoint rows into a data frame or data table? My code will be doing this repeatedly, re-evaluating the results after each insertion.

I have two data frames, x and tmp. tmp is the complement of x and needs to be inserted into x. tmp has one additional column, the first one, which indicates the proper position of each of tmp's rows in x. I found a solution on SO that deals with inserting a single row at one position, but I couldn't generalize it to my needs.

x <- matrix(as.character(seq(100)),20,5)
tmp <- rbind(c(6,letters[1:5]),c(15,LETTERS[1:5]))

This is the solution provided on SO for inserting a single row into a data.frame: stackoverflow.com/questions/11561856/add-new-row-to-dataframe


Solution

The above solution is very elegant and succinct. If you're interested in a function similar to the one described in the original posting, but one that avoids the often slow call to rbind, you could use this:

existingDF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
rs <- c(2,4)
newrows <- matrix(seq(-8, -1),nrow=2,ncol=4)
insertRow <- function(existingDF, newrows, rs) {
    rs <- sort(rs) + seq(0, length(rs) - 1)           # shift the target positions to account for earlier insertions
    old_rs <- seq(nrow(existingDF) + length(rs))[-rs] # positions the original rows will occupy in the expanded frame
    existingDF[old_rs,] <- existingDF                 # reassign the old rows (this grows the data frame)
    existingDF[rs,] <- newrows                        # drop the new rows into the gaps
    existingDF
}

insertRow(existingDF, newrows, rs)

This also essentially expands the old data frame by the number of new rows to be inserted, but it skips the indices of the new rows when reassigning the old data frame, and then inserts the new rows at the appropriate positions.

UPDATE: I forgot to take into account the shifting of rows caused by prior insertions; this is what the line rs <- sort(rs) + seq(0, length(rs) - 1) takes care of. Rows are now inserted at the correct positions relative to the original data frame, i.e. always before the specified rows of the original data frame. Without it, the new rows would be inserted at exactly the specified positions of the expanded data frame.
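To make the adjustment concrete, here is what the index arithmetic produces for the example above, evaluated outside the function with the same rs = c(2,4) against the 5-row existingDF:

rs <- c(2,4)
rs <- sort(rs) + seq(0, length(rs) - 1)
rs                                                # 2 5 : positions of the new rows in the expanded frame
old_rs <- seq(nrow(existingDF) + length(rs))[-rs]
old_rs                                            # 1 3 4 6 7 : positions the original 5 rows move to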

UPDATE 2: This is a modification that lets the function be used directly with the original data set put forth by the OP:

x <- matrix(as.character(seq(100)),20,5)
tmp <- rbind(c(6,letters[1:5]),c(15,LETTERS[1:5]))

insertRow <- function(existingDF, newrows) {
    new_idx <- as.integer(newrows[,1]) # get indices of the new rows
    new_idx <- sort(new_idx) + seq(0, length(new_idx) - 1) # adjust for rows shifting due to other insertions 
    old_idx <- seq(nrow(existingDF) + length(new_idx))[-new_idx] # get indices for the old rows
    existingDF[old_idx,] <- existingDF # assign old rows
    existingDF[new_idx,] <- newrows[,-1] # assign new rows
    existingDF
}

insertRow(data.frame(x, stringsAsFactors = F), tmp)
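As a quick check, following the index arithmetic above, the two inserted rows end up at positions 6 and 16 of the 22-row result, each sitting directly before the original row it was keyed to:

res <- insertRow(data.frame(x, stringsAsFactors = F), tmp)
res[c(6, 16), ]   # the rows holding letters[1:5] and LETTERS[1:5]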

Other Tips

You can expand x to include extra rows:

x2 <- x[rep(1:nrow(x), times=ifelse(1:nrow(x) %in% tmp[,1], 2,1)), ]
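For the example data, the times vector evaluates to the following (rows 6 and 15 get a count of 2, everything else 1):

ifelse(1:nrow(x) %in% tmp[,1], 2, 1)
# 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1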

This duplicates the rows whose original row number appears in tmp[,1]. Now you can insert the tmp values:

tmp <- tmp[order(as.numeric(tmp[,1])),]                   # apply insertions in increasing row order
x2[as.numeric(tmp[,1]) - 1 + 1:nrow(tmp), ] <- tmp[,-1]   # each target row is offset by the number of earlier insertions

We re-order tmp so that the rows are inserted in the correct order. If the first element needs to be inserted at row 6 of the original, that's where it goes in the new x2. But the second needs to be inserted at row 15 of the original, which has 'moved down' to account for the previous insertion, which is why the row is offset by j - 1, where j is the current insertion count.
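If you evaluate the offset expression on the (re-ordered) example tmp, you get the rows of x2 that receive the new values:

as.numeric(tmp[,1]) - 1 + 1:nrow(tmp)   # 6 16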

Or you can do:

x2 <- rbind(x, tmp[,-1])[order(c(1:nrow(x), as.numeric(tmp[,1]))), ]
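To see why the one-liner works: the sort key pairs every row of rbind(x, tmp[,-1]) with a position, 1 to 20 for the original rows and 6 and 15 for the two appended rows, so ordering by that key slots the appended rows in next to original rows 6 and 15:

key <- c(1:nrow(x), as.numeric(tmp[,1]))   # one position per row of rbind(x, tmp[,-1])
x2  <- rbind(x, tmp[,-1])[order(key), ]
nrow(x2)                                   # 22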

Here is my solution; it doesn't build off the other post. It works with rbind, so it might be a little easier to understand.

df=matrix(1:40,10,4)
breaks=c(3,5,8)
breaks=append(breaks,nrow(df))   # add the last row of df so the final chunk gets copied too
add1=1:4
add2=2:5
add3=3:6
newrows=rbind(add1,add2,add3)
newmat=df[1:breaks[1],]          # start with everything up to the first break
for(i in 1:(length(breaks)-1)){
  newmat=rbind(newmat,newrows[i,],df[(breaks[i]+1):(breaks[i+1]),])  # append the i-th new row, then the next chunk of df
}

newmat

Of course you can always just do things manually and rbind all at once.

newmat=rbind(df[1:breaks[1],], add1,
             df[(breaks[1]+1):breaks[2],], add2,
             df[(breaks[2]+1):breaks[3],], add3,
             df[(breaks[3]+1):nrow(df),])
newmat

Alternative

For increased speed.

insertrows <- function(df,breaks,newrows){
  # As above, the new rows go in as a matrix. breaks is a vector of insertion
  # points and df is the data you want all the rows to go into.
  xx=1:length(breaks)
  breaks=breaks+xx                                     # space out the insertion points
  newmat=matrix(NA,length(breaks)+nrow(df),ncol(df))   # preallocate memory by creating the final matrix
  for(i in 1:length(breaks)){newmat[breaks[i],]=newrows[i,]}   # insert the added rows into the new matrix
  x=1:nrow(newmat)
  x=x[-(breaks)]                                       # the rows of the new matrix that will receive old rows
  for(i in 1:nrow(df)){newmat[x[i],]=df[i,]}           # use x to index the new matrix for placement of old rows
  return(newmat)
}

add1=1:4
add2=2:5
add3=3:6
newrows=rbind(add1,add2,add3)
df=matrix(1:40,10,4)
breaks=c(3,5,8)

insertrows(df,breaks,newrows)
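One thing to keep in mind, which the code above does not check: the position arithmetic breaks+xx assumes breaks is sorted in increasing order, with newrows arranged in the matching order. If your insertion points might be unsorted, a safe way to call the function is something along these lines:

ord <- order(breaks)                                      # sort the insertion points ascending
insertrows(df, breaks[ord], newrows[ord, , drop=FALSE])   # keep newrows in the same order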

How fast is this?

Pretty fast.

#Some new data. We're inserting 100 rows into a dataset of 1000 rows. There are 4 columns. 
df=matrix(1:4000,1000,4)
breaks=sample(1:1000,100)
newrows=matrix(1:400,100,4)

library("microbenchmark"
microbenchmark(insertrows(df,breaks,newrows))
Unit: milliseconds
                            expr      min       lq   median       uq      max neval
 insertrows(df, breaks, newrows) 3.333208 3.372965 3.408644 3.494566 4.995151   100

Let's go for broke!

df=matrix(1:400000,100000,4)
breaks=sample(1:100000,10000)
newrows=matrix(1:40000,10000,4)
microbenchmark(insertrows(df,breaks,newrows))
Unit: milliseconds
                            expr     min       lq   median       uq      max neval
 insertrows(df, breaks, newrows) 349.581 354.8166 358.2672 409.6821 470.7878   100

Remember these are milliseconds, so the run time is actually only about 0.36 seconds even with this huge data set. I don't doubt there are improvements to be made here and there in this code, but I'd be surprised if you ever had a reason to need more speed than this.
