The data.table package is a natural fit for the first problem. (data.table is an evolved version of data.frame with a lot more functionality and speed.)
For the second problem, R offers several ways to handle missing values: the na.rm argument of functions such as sum and mean, the na.action argument of modelling functions, etc.
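As a quick illustration of what na.rm does, here is a minimal base-R sketch. The vector x and the data frame df are made up for this illustration and are not part of the question's data:

```r
# Made-up vector with one missing value
x <- c(10, NA, 30)

total_raw   <- sum(x)                 # NA: the missing value propagates
total_clean <- sum(x, na.rm = TRUE)   # 40: NA is dropped before summing
avg_clean   <- mean(x, na.rm = TRUE)  # 20

# na.omit() is one of the functions typically passed as na.action;
# applied directly, it drops rows that contain any NA
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
df_complete <- na.omit(df)            # keeps rows 1 and 3

print(total_raw)
print(total_clean)
print(df_complete)
```

Whether to drop NAs with na.rm or remove incomplete rows up front depends on whether the other columns in those rows are still needed.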
Here is a toy example:
library(data.table)
set.seed(12345)
dt <- data.table(
  Year        = sample(1980:2014, 1000, replace = TRUE),
  Date        = sample(1:10000, 1000, replace = TRUE),
  killed_min  = sample(c(15:150, NA), 1000, replace = TRUE),
  killed_max  = sample(c(NA, 250:1500), 1000, replace = TRUE),
  Injured_min = sample(150:250, 1000, replace = TRUE),
  Injured_max = sample(500:4000, 1000, replace = TRUE))
dt # Note the missing value in row 996 (the exact row may differ: R 3.6.0 changed the default sampler)
dt[, list(killed_min = sum(killed_min, na.rm = TRUE),
          killed_max = sum(killed_max, na.rm = TRUE)), by = Year]
Hope this helps!!
Alternatively, you can also use .SDcols here with lapply in j as follows:
dt[, lapply(.SD, sum, na.rm = TRUE), by = Year,
   .SDcols = c("killed_min", "killed_max")]
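To check that the two forms agree, here is a small sketch on a hand-checkable table. The table small and its values are invented for this illustration; only the column names are taken from the example above:

```r
library(data.table)

# Invented miniature table, small enough to verify by hand
small <- data.table(
  Year       = c(2000, 2000, 2001, 2001),
  killed_min = c(10, NA, 5, 7),
  killed_max = c(100, 200, NA, 50)
)

# Explicit form: one expression per column
explicit <- small[, list(killed_min = sum(killed_min, na.rm = TRUE),
                         killed_max = sum(killed_max, na.rm = TRUE)),
                  by = Year]

# .SD form: within each group, .SD holds the columns named in .SDcols,
# and lapply() applies sum(..., na.rm = TRUE) to each of them
via_sd <- small[, lapply(.SD, sum, na.rm = TRUE), by = Year,
                .SDcols = c("killed_min", "killed_max")]

print(explicit)
print(via_sd)
```

The .SD form scales better when there are many columns to aggregate, since only the .SDcols vector has to grow.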