Sum rows based on other column values

https://stackoverflow.com/questions/22710778

23-06-2023
Question

I am a new R user and am looking for someone to point me in the right direction regarding what function I should use to achieve the following.

I have the following data frame, shown here as the output of the dput command.

structure(list(ID = 4701:4702, Date.1 = structure(c(5L, 5L), .Label = c("01/02/2013", 
"01/03/2013", "01/05/2013", "02/05/2013", "04/02/2013", "04/03/2013", 
"05/02/2013", "05/03/2013", "06/02/2013", "06/03/2013", "07/02/2013", 
"07/03/2013", "08/02/2013", "08/07/2013", "12/12/2012", "13/12/2012", 
"14/01/2013", "14/12/2012", "15/01/2013", "16/01/2013", "17/01/2013", 
"17/12/2012", "18/01/2013", "18/04/2013", "18/12/2012", "19/04/2013", 
"23/01/2013", "24/01/2013", "25/01/2013", "26/04/2013", "28/01/2013", 
"29/01/2013", "29/04/2013", "30/04/2013", "31/01/2013"), class = "factor"), 
 Day.of.Week.1 = structure(c(2L, 2L), .Label = c("Friday", 
"Monday", "Thursday", "Tuesday", "Wednesday"), class = "factor"), 
Sedentary.1 = c(511.5, 405.5), Light.1 = c(133.666666666667, 
119.166666666667), Moderate.1 = c(12.1666666666667, 13.1666666666667
), Vigorous.1 = c(4.33333333333333, 3.5), Axis.1.Counts.1 = c(157124L, 
126177L), Axis.1.CPM.1 = c(237.5, 233.1), Time.1 = c(661.67, 
541.33), Day.of.Week.2 = structure(c(1L, 4L), .Label = c("Friday", 
"Monday", "Thursday", "Tuesday", "Wednesday"), class = "factor"), 
Sedentary.2 = c(370.166666666667, 601.833333333333), Light.2 = c(113, 
162.5), Moderate.2 = c(12, 13), Vigorous.2 = c(4, 10), Axis.1.Counts.2 = c(141593L, 
201373L), Axis.1.CPM.2 = c(283.7, 255.8), Number.of.Epochs.2 = c(2995L, 
4724L), Time.2 = c(499.17, 787.33), Day.of.Week.3 = structure(c(NA, 
5L), .Label = c("Friday", "Monday", "Thursday", "Tuesday", 
"Wednesday"), class = "factor"), Sedentary.3 = c(NA, 463), 
Light.3 = c(NA, 121.666666666667), Moderate.3 = c(NA, 14.5
), Vigorous.3 = c(NA, 11.5), Axis.1.Counts.3 = c(NA, 196192L
), Axis.1.CPM.3 = c(NA, 321.3), Number.of.Epochs.3 = c(NA, 
3664L), Time.3 = c(NA, 610.67), Day.of.Week.4 = structure(c(NA, 
3L), .Label = c("Friday", "Monday", "Thursday", "Tuesday", 
"Wednesday"), class = "factor"), Sedentary.4 = c(NA, 472.333333333333
), Light.4 = c(NA, 149.166666666667), Moderate.4 = c(NA, 
11.3333333333333), Vigorous.4 = c(NA, 14.1666666666667), 
Axis.1.Counts.4 = c(NA, 218895L), Axis.1.CPM.4 = c(NA, 338.3
), Number.of.Epochs.4 = c(NA, 3882L), Time.4 = c(NA, 647), 
Day.of.Week.5 = structure(c(NA, 1L), .Label = c("Friday", 
"Monday", "Thursday", "Tuesday", "Wednesday"), class = "factor"), 
Sedentary.5 = c(NA, 383.166666666667), Light.5 = c(NA, 106.5
), Moderate.5 = c(NA, 8), Vigorous.5 = c(NA, 0.5), Axis.1.Counts.5 = c(NA, 
92163L), Axis.1.CPM.5 = c(NA, 185), Number.of.Epochs.5 = c(NA, 
2989L), Time.5 = c(NA, 498.17)), .Names = c("ID", "Date.1", 
"Day.of.Week.1", "Sedentary.1", "Light.1", "Moderate.1", "Vigorous.1", 
"Axis.1.Counts.1", "Axis.1.CPM.1", "Time.1", "Day.of.Week.2", 
"Sedentary.2", "Light.2", "Moderate.2", "Vigorous.2", "Axis.1.Counts.2", 
"Axis.1.CPM.2", "Number.of.Epochs.2", "Time.2", "Day.of.Week.3", 
"Sedentary.3", "Light.3", "Moderate.3", "Vigorous.3", "Axis.1.Counts.3", 
"Axis.1.CPM.3", "Number.of.Epochs.3", "Time.3", "Day.of.Week.4",  
"Sedentary.4", "Light.4", "Moderate.4", "Vigorous.4", "Axis.1.Counts.4", 
"Axis.1.CPM.4", "Number.of.Epochs.4", "Time.4", "Day.of.Week.5", 
"Sedentary.5", "Light.5", "Moderate.5", "Vigorous.5", "Axis.1.Counts.5", 
"Axis.1.CPM.5", "Number.of.Epochs.5", "Time.5"), reshapeWide = structure(list(
v.names = NULL, timevar = "ID2", idvar = "ID", times = 1:5, 
varying = structure(c("Filename.1", "Epoch.1", "Weight..kg..1", 
"Age.1", "Gender.1", "Date.1", "Day.of.Week.1", "Day.of.Week.Num.1", 
"Sedentary.1", "Light.1", "Moderate.1", "Vigorous.1", "Axis.1.Counts.1", 
"Axis.1.Average.Counts.1", "Axis.1.CPM.1", "Number.of.Epochs.1", 
"Time.1", "Calendar.Days.1", "Filename.2", "Epoch.2", "Weight..kg..2", 
"Age.2", "Gender.2", "Date.2", "Day.of.Week.2", "Day.of.Week.Num.2", 
"Sedentary.2", "Light.2", "Moderate.2", "Vigorous.2", "Axis.1.Counts.2", 
"Axis.1.Average.Counts.2", "Axis.1.CPM.2", "Number.of.Epochs.2", 
"Time.2", "Calendar.Days.2", "Filename.3", "Epoch.3", "Weight..kg..3", 
"Age.3", "Gender.3", "Date.3", "Day.of.Week.3", "Day.of.Week.Num.3", 
"Sedentary.3", "Light.3", "Moderate.3", "Vigorous.3", "Axis.1.Counts.3", 
"Axis.1.Average.Counts.3", "Axis.1.CPM.3", "Number.of.Epochs.3", 
"Time.3", "Calendar.Days.3", "Filename.4", "Epoch.4", "Weight..kg..4", 
"Age.4", "Gender.4", "Date.4", "Day.of.Week.4", "Day.of.Week.Num.4", 
"Sedentary.4", "Light.4", "Moderate.4", "Vigorous.4", "Axis.1.Counts.4", 
"Axis.1.Average.Counts.4", "Axis.1.CPM.4", "Number.of.Epochs.4", 
"Time.4", "Calendar.Days.4", "Filename.5", "Epoch.5", "Weight..kg..5", 
"Age.5", "Gender.5", "Date.5", "Day.of.Week.5", "Day.of.Week.Num.5", 
"Sedentary.5", "Light.5", "Moderate.5", "Vigorous.5", "Axis.1.Counts.5", 
"Axis.1.Average.Counts.5", "Axis.1.CPM.5", "Number.of.Epochs.5", 
"Time.5", "Calendar.Days.5"), .Dim = c(18L, 5L))), .Names = c("v.names", 
"timevar", "idvar", "times", "varying")), row.names = c(1L, 3L
), class = "data.frame")

For each row, I would like to sum across the columns Sedentary.1, Sedentary.2, Sedentary.3, Sedentary.4 and Sedentary.5. But I want each column to be included in the calculation only if another column meets a certain criterion.

That is, include column:

- Sedentary.1 if the value in Time.1 >= 377
- Sedentary.2 if the value in Time.2 >= 377
- Sedentary.3 if the value in Time.3 >= 377
- Sedentary.4 if the value in Time.4 >= 377
- Sedentary.5 if the value in Time.5 >= 377

I could do this in Excel with the SUMIF function, but I don't know where to start in R. If you could point me to a function I could read up on, I would be most grateful.

Many thanks,

Ash

Solution

There's probably a more efficient or cleaner-looking way, but here I find which Time columns are not NA and meet your criterion, multiply the Sedentary columns by that TRUE/FALSE table, and then take rowSums. TRUE is treated as 1 and FALSE as 0, so unwanted Sedentary values are multiplied by 0 before summing, and the result is each row's sum over only the columns that meet the criterion.

x is the name of the data frame you provided.

rowSums(
  x[c("Sedentary.1","Sedentary.2","Sedentary.3","Sedentary.4","Sedentary.5")] *
    (!is.na(x[,c("Time.1","Time.2","Time.3","Time.4","Time.5")]) &
       x[,c("Time.1","Time.2","Time.3","Time.4","Time.5")] >= 377),
  na.rm=TRUE
)
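To see the multiply-by-logical idea on something small, here is a minimal sketch. The data frame `toy` and its values are made up for illustration (they are not the asker's data), and the column names are built with paste0 only to avoid typing them out.

```r
# Hypothetical two-row data frame with the same column naming as the question
toy <- data.frame(
  Sedentary.1 = c(500, 400),
  Sedentary.2 = c(300, 600),
  Time.1      = c(660, 540),
  Time.2      = c(NA, 300)
)

sed_cols  <- paste0("Sedentary.", 1:2)
time_cols <- paste0("Time.", 1:2)

# TRUE where the Time value is present and at least 377
keep <- !is.na(toy[time_cols]) & toy[time_cols] >= 377

# Multiplying by the logical matrix zeroes out the excluded Sedentary values
result <- rowSums(toy[sed_cols] * keep, na.rm = TRUE)
print(result)
```

Both rows keep only their first Sedentary value here: row 1 because Time.2 is NA, row 2 because Time.2 is below 377.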

Edit for question in comments:

Something like this should work:

# make TRUE/FALSE table
TF <- !is.na(x[,c("Time.1","Time.2","Time.3","Time.4","Time.5")]) & x[,c("Time.1","Time.2","Time.3","Time.4","Time.5")] >= 377

# take rowSums of Sedentary.x when TF rowSums are greater than or equal to 3
# (drop=FALSE keeps the matrix shape even if only one row qualifies)
rowSums(x[rowSums(TF) >= 3,c("Sedentary.1","Sedentary.2","Sedentary.3","Sedentary.4","Sedentary.5"), drop=FALSE] * TF[rowSums(TF) >= 3, , drop=FALSE], na.rm=TRUE)

You could make it a one-liner if you wanted, but I've split it into stages, saving the TRUE/FALSE table as "TF" to improve readability.
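If the same calculation is needed more than once, the staged version could be wrapped in a small function. This is only a sketch: the function name `sum_if_valid`, its arguments, and the `toy` data are hypothetical, and it assumes the columns are named Sedentary.1..n and Time.1..n as in the question.

```r
# Hypothetical helper: sum Sedentary.* across columns whose Time.* qualifies,
# but only for rows with at least `min_valid` qualifying Time columns
sum_if_valid <- function(data, n_days, min_time = 377, min_valid = 3) {
  sed_cols  <- paste0("Sedentary.", seq_len(n_days))
  time_cols <- paste0("Time.", seq_len(n_days))

  # TRUE/FALSE table: Time present and at least min_time
  valid <- !is.na(data[time_cols]) & data[time_cols] >= min_time

  # Rows with enough qualifying columns
  enough <- rowSums(valid) >= min_valid

  # drop = FALSE keeps the shapes intact when a single row qualifies
  rowSums(data[enough, sed_cols, drop = FALSE] * valid[enough, , drop = FALSE],
          na.rm = TRUE)
}

# Made-up data: row 1 has two qualifying Time values, row 2 has three
toy <- data.frame(
  Sedentary.1 = c(500, 400), Sedentary.2 = c(300, 600), Sedentary.3 = c(NA, 450),
  Time.1 = c(660, 540), Time.2 = c(500, 780), Time.3 = c(NA, 610)
)

res <- sum_if_valid(toy, n_days = 3)
print(res)
```

Only row 2 is returned, because row 1 has fewer than three qualifying Time columns.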

OTHER TIPS

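Indexing on other columns will get you started. Note that this sums down the rows of a single column rather than across columns; na.rm = TRUE guards against NA values in the Time column.

sum(df$Sedentary.1[df$Time.1 >= 377], na.rm = TRUE)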
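The same indexing idea can be turned into a per-row total by working out each column's contribution and adding the results. This is a rough sketch with made-up data; the helper name `day_value` and the `toy` data frame are hypothetical.

```r
# Made-up data with the same column naming as the question
toy <- data.frame(
  Sedentary.1 = c(500, 400), Sedentary.2 = c(300, 600),
  Time.1 = c(660, 540), Time.2 = c(NA, 300)
)

# Contribution of one column: the Sedentary value if its Time qualifies, else 0
day_value <- function(sed, time, min_time = 377) {
  ifelse(!is.na(time) & time >= min_time & !is.na(sed), sed, 0)
}

# One value per row, summed across the two columns
row_total <- day_value(toy$Sedentary.1, toy$Time.1) +
  day_value(toy$Sedentary.2, toy$Time.2)
print(row_total)
```

This spells out every column by hand, so it is mainly useful for seeing the logic; with many columns a matrix-based approach is less repetitive.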

The plyr package is a nice way to get the sums of multiple columns at once.

library(plyr)

df2 <- ddply(df, .(), summarise, Sedentary.1 = sum(Sedentary.1[Time.1 >= 377], na.rm = TRUE), 
             Sedentary.2 = sum(Sedentary.2[Time.2 >= 377], na.rm = TRUE))

   .id Sedentary.1 Sedentary.2
1 <NA>         917         972

I went about it like this. First I find which Time* columns have values >= 377, and then multiply that by a subset of the data.frame containing only the Sedentary* columns. R treats TRUE as 1 and FALSE as 0, so values paired with FALSE are turned into 0. If there's an NA, the value remains NA, and na.rm = TRUE then leaves it out of the sum.

This code assumes that Time and Sedentary are listed in the same order.

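# subset the Time* columns and test them against the threshold
sub.time <- mydf[, grepl("Time", names(mydf)), drop = FALSE]
sumif <- sub.time >= 377
# subset the Sedentary* columns and sum each row, ignoring NA
sub.sed <- mydf[, grepl("Sedentary", names(mydf)), drop = FALSE]
apply(sub.sed * sumif, MARGIN = 1, sum, na.rm = TRUE)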

        1         3 
 881.6667 2325.8333
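If the column order cannot be relied on, one option is to derive each Time column name from its Sedentary counterpart instead of matching by position. A minimal sketch, assuming the names always pair up as Sedentary.N and Time.N; the `toy` data frame is made up, with its columns deliberately out of order.

```r
# Made-up data with Time and Sedentary columns in a scrambled order
toy <- data.frame(
  Time.2 = c(500, 300), Sedentary.1 = c(500, 400),
  Time.1 = c(660, 540), Sedentary.2 = c(300, 600)
)

# Find the Sedentary columns, then build the matching Time names by suffix
sed_cols  <- grep("^Sedentary\\.", names(toy), value = TRUE)
time_cols <- sub("^Sedentary", "Time", sed_cols)

# Fail early if any expected Time column is missing
stopifnot(all(time_cols %in% names(toy)))

keep   <- !is.na(toy[time_cols]) & toy[time_cols] >= 377
result <- rowSums(toy[sed_cols] * keep, na.rm = TRUE)
print(result)
```

Because both sets of columns are selected by name, Sedentary.1 is always paired with Time.1 regardless of where the columns sit in the data frame.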

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow