Question

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.

I have the following dataframe

df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))

df1
  LOCAT START  END CODE1 CODE2
1     1   120  150     1     1
2     2   345  390     1     0
3     3   765  802     0     0
4     4  1045 1120     1     0
5     5  1347 1436     0    -1
6     6  1879 1935     0    -1

This is a sample dataframe. The real one has 180000 rows and over 400 columns. For each CODE column, I need to create a new dataframe that lists every contiguous run of "1" or "-1", along with its location, size and value.

Something like this for CODE1:

   LOCAT SIZE VALUE
1 1 to 2  270   POS
2 4 to 4   75   POS

And like this for CODE2:

   LOCAT SIZE VALUE
1 1 to 1   30   POS
2 5 to 6  588   NEG

Unfortunately, I haven't figured out how to do this. I have tried writing a function to do it automatically, but I keep getting lost or stuck in loops, and nothing seems to work.

Any help would be appreciated. Thanks in advance

Solution

The code below gives the answer in the format you asked for, except that your "LOCAT" column is split into two columns named "Starts" and "Stops". It works on the entire data frame, so there is no need to repeat it manually for each CODE column (CODE1, CODE2, etc.).

It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".

# need package "plyr"
library("plyr")

# test2 is the example data frame that you gave in the question
test2 <- data.frame(
    "LOCAT"=1:6,
    "START"=c(120,345,765,1045,1347,1879),
    "END"=c(150,390,802,1120,1436,1935),
    "CODE1"=c(1,1,0,1,0,0),
    "CODE2"=c(1,0,0,0,-1,-1)
    )

codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column

sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
    rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
    stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
    starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
    ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
    ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values of 0 (neither POS nor NEG)

    # The next 3 lines calculate the equivalent of your SIZE column
    sizeX1 <- x[ssX[,"Starts"],"START"]
    sizeX2 <- x[ssX[,"Stops"],"END"]
    sizeX <- sizeX2 - sizeX1

    sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
    return(sssX)
}

answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # ddply() from "plyr" applies sss() to each Code group, which is why we reshaped to long format
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
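The recoding on the last line relies on indexing a lookup vector by position. As a small illustration, with a made-up vector `vals` standing in for the "Value" column:

```r
# Hypothetical run values, as they might appear in the "Value" column
vals <- c(1, -1, 1)

# -1 + 2 = 1 picks "NEG"; 1 + 2 = 3 picks "POS"; a 0 would pick the NA in the middle
labels <- c("NEG", NA, "POS")[vals + 2]
labels
```

Because the rows with a value of 0 were already dropped inside sss(), the NA slot should never be selected in the actual answer.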

Hopefully this helps, good luck!

OTHER TIPS

Use run-length encoding to determine groups where CODE1 takes the same value.

rle_of_CODE1 <- rle(df1$CODE1)
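To make the following steps easier to follow, here is a sketch of what rle() returns for the CODE1 column of the sample data. The column is typed out as a plain vector (`code1`, a name chosen for this example) so the snippet runs on its own.

```r
code1 <- c(1, 1, 0, 1, 0, 0)   # CODE1 from the sample data frame
runs <- rle(code1)

runs$lengths  # how many rows each run of identical values spans
runs$values   # the value repeated within each run

# Last and first row index of every run, derived from the lengths
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
```

Here the runs have lengths 2, 1, 1, 2 with values 1, 0, 1, 0, so the non-zero blocks cover rows 1 to 2 and row 4.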

For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.

CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]

Ignore the parts of df1 where CODE1 is zero.

df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)

Define a group based on the contiguous blocks we found with rle.

df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)

Use ddply from the plyr package to get summary stats for each group.

library(plyr) # provides ddply() and .()
summarised_by_CODE1 <- ddply(
  df1_with_nonzero_CODE1, 
  .(GROUP), 
  summarise, 
  MinOfLOCAT = min(LOCAT), 
  MaxOfLOCAT = max(LOCAT),
  SIZE       = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
  rle_of_CODE1$values[CODE1_is_nonzero] == 1, 
  "POS", 
  "NEG"
)
summarised_by_CODE1
##   GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1     1          1          2  270   POS
## 2     2          4          4   75   POS

Now repeat with CODE2.
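With over 400 CODE columns, repeating the steps by hand is not practical. One possible way to wrap them in a function and loop over every CODE column is sketched below in base R. `summarise_runs` is a hypothetical helper written for this example, not part of any package. It assumes the rows are sorted by position, so that the first START and last END of a run match min(START) and max(END), and that the CODE columns contain no NA values.

```r
df1 <- data.frame(
  LOCAT = 1:6,
  START = c(120, 345, 765, 1045, 1347, 1879),
  END   = c(150, 390, 802, 1120, 1436, 1935),
  CODE1 = c(1, 1, 0, 1, 0, 0),
  CODE2 = c(1, 0, 0, 0, -1, -1)
)

# Hypothetical helper: summarise the non-zero runs of one CODE column
summarise_runs <- function(df, code_col) {
  runs  <- rle(df[[code_col]])
  last  <- cumsum(runs$lengths)       # last row of each run
  first <- last - runs$lengths + 1    # first row of each run
  keep  <- runs$values != 0           # drop the runs of zeros
  data.frame(
    CODE       = rep(code_col, sum(keep)),
    MinOfLOCAT = df$LOCAT[first[keep]],
    MaxOfLOCAT = df$LOCAT[last[keep]],
    SIZE       = df$END[last[keep]] - df$START[first[keep]],
    VALUE      = ifelse(runs$values[keep] == 1, "POS", "NEG"),
    stringsAsFactors = FALSE
  )
}

# Every column other than LOCAT, START and END is treated as a CODE column
code_cols <- setdiff(names(df1), c("LOCAT", "START", "END"))
result <- do.call(rbind, lapply(code_cols, summarise_runs, df = df1))
result
```

On the sample data this gives four rows, with the sizes from the question (270 and 75 for CODE1, 30 and 588 for CODE2).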

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow