Finding the median of values in one column where consecutive values in another column are equal

StackOverflow https://stackoverflow.com/questions/19303075

  •  30-06-2022

Question

I have a data frame that looks like this in R:

Date | Time | value  
A  | 1  | 3       
A  | 1  | 6     
A  | 2  | 4  
A  | 3  | 3  
A  | 4  | 2  
A  | 5  | 7  
B  | 1  | 6       
B  | 2  | 5  
B  | 2  | 3  
B  | 2  | 4  
B  | 3  | 2  
B  | 5  | 3  
B  | 6  | 4  
... 

My aim is to find the median of the numbers in the third column whenever the numbers in the second column are equal in consecutive rows. That is, take the median of the values that occur at the same time, and replace them with that single median value for the corresponding time slot.

So the output I am aiming for is:

A  | 1  | median (3,6)       
A  | 2  | 4  
A  | 3  | 3  
A  | 4  | 2  
A  | 5  | 7  
B  | 1  | 6       
B  | 2  | median (3,4,5)   
B  | 3  | 2  
B  | 5  | 3  
B  | 6  | 4  
... 

I am trying hard to avoid loops since the data set is huge. The main problem I am having is collecting the values for each group separately. This is what I have so far:

#First find consecutive time slots that are equal:
timeslots_equal<-which(diff(data_RAW$TIME)==0)

coordinates_placesholder <- sort(c(as.vector(timeslots_equal), as.vector(timeslots_equal)+1))

coordinates_placesholder2  <-  coordinates_placesholder[-c(which(diff(coordinates_placesholder)==0), which(diff(coordinates_placesholder)==0) +1)]

 #The following matrix holds the start and end row indices of each run of equal time slots in the value vector
 matrix_ranges<-t(matrix(coordinates_placesholder2,2))

matrix_ranges for the example above would look like:

1 | 2  
8 | 10  

Then I tried to apply something like

median(data_RAW$Value[matrix_ranges[,1]:matrix_ranges[,2]])

This did not work. Does anyone know how to do this?

Also, is there an easier way to do this than what I did above?


Solution

Two interpretations come to mind.

Interpretation 1: It's the combination of "Date" + "Time" that matters, not the consecutive repetition. In this case, just use aggregate (or your favorite aggregating function or package, like "data.table").

aggregate(value ~ Date + Time, mydf, median)
#    Date Time value
# 1     A    1   4.5
# 2     B    1   6.0
# 3     A    2   4.0
# 4     B    2   4.0
# 5     A    3   3.0
# 6     B    3   2.0
# 7     A    4   2.0
# 8     A    5   7.0
# 9     B    5   3.0
# 10    B    6   4.0

Interpretation 2: The consecutive repetitions are important. In this case, you need another "grouping" variable. For this, we can use rle. After that, the aggregation step is pretty much the same.

RLE <- rle(DF$Time)$lengths
RLE <- rep(seq_along(RLE), RLE)
aggregate(value ~ Date + Time + RLE, DF, median)
#    Date Time RLE value
# 1     A    1   1   4.5
# 2     A    2   2   4.0
# 3     A    3   3   3.0
# 4     A    4   4   2.0
# 5     A    5   5   7.0
# 6     B    1   6   6.0
# 7     B    2   7   4.0
# 8     B    3   8   2.0
# 9     B    5   9   3.0
# 10    B    6  10   4.0
# 11    A    1  11   3.0
# 12    B    3  12   2.0
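
To see what the rle step produces, here is a small base R sketch on a made-up vector (the names time, runs and run_id are illustrative, not from the data above). Each run of identical consecutive values gets its own ID, so a value that reappears later starts a new group.

```r
# Toy vector: the value 1 appears in two separate runs.
time <- c(1L, 1L, 2L, 3L, 1L, 1L)

runs <- rle(time)
runs$lengths  # 2 1 1 2
runs$values   # 1 2 3 1

# Repeat each run's index by its length to get one group ID per row.
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id        # 1 1 2 3 4 4
```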

For the benefit of others, here's some reproducible data: mydf and DF. (DF is just mydf with a few rows repeated.)

mydf <- structure(list(Date = c("A", "A", "A", "A", "A", "A", "B", "B", 
        "B", "B", "B", "B", "B"), Time = c(1L, 1L, 2L, 3L, 4L, 5L, 1L, 
        2L, 2L, 2L, 3L, 5L, 6L), value = c(3L, 6L, 4L, 3L, 2L, 7L, 6L, 
        5L, 3L, 4L, 2L, 3L, 4L)), .Names = c("Date", "Time", "value"), 
        class = "data.frame", row.names = c(NA, -13L))
DF <- rbind(mydf, mydf[c(1, 1, 11, 11), ])
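
As for why the attempt in the question failed: the `:` operator only uses the first element of each side, so matrix_ranges[,1]:matrix_ranges[,2] builds a single range. The following is a rough sketch, not the only way, of computing one median per row range with mapply. It rebuilds the ranges from rle, and the names starts, ends and range_medians are illustrative. It assumes runs never cross a Date boundary, which holds for the sample data.

```r
mydf <- data.frame(
  Date  = rep(c("A", "B"), times = c(6, 7)),
  Time  = c(1L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 2L, 2L, 3L, 5L, 6L),
  value = c(3L, 6L, 4L, 3L, 2L, 7L, 6L, 5L, 3L, 4L, 2L, 3L, 4L)
)

# First and last row of every run of equal consecutive Time values.
runs   <- rle(mydf$Time)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only runs longer than one row, as in the question's matrix_ranges.
matrix_ranges <- cbind(starts, ends)[runs$lengths > 1, , drop = FALSE]

# Pair the start and end columns element-wise: one median per row range.
range_medians <- mapply(
  function(from, to) median(mydf$value[from:to]),
  matrix_ranges[, 1], matrix_ranges[, 2]
)
range_medians  # 4.5 4.0
```

This still iterates over the ranges internally, so for large data the aggregate approach above is likely the simpler choice.
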
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow