Grouping data with sub-features

https://stackoverflow.com/questions/23320334

grouping
r

10-07-2023

Question

I have data on some events, like this:

Year    Date        killed_min  killed_max  Injured_min  Injured_max
2000    4/3/2000    34          54          31           39
2000    6/4/2000    24          34          11           19
...

I am facing two main problems:

1. Grouping these events by year, or applying clustering. The data has sub-features such as minimum and maximum values. How can I deal with them?
2. There are a lot of missing values in the data, which may affect clustering.

I want to group this data by parameters such as people killed or injured per year, and so on.

Solution

The data.table package is a natural fit for the first problem. (data.table is an evolved version of data.frame with a lot more functionality and speed.)

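For the second problem, R has a family of tools for missing values: the na.rm argument of aggregation functions such as sum() and mean(), functions such as na.omit(), and the na.action argument of modelling functions.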
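
A minimal base-R sketch of how missing values behave in an aggregation (the vector and data frame below are made up for illustration, not taken from the question's data):

```r
# Hypothetical vector with one missing value
killed <- c(34, NA, 24)

total_raw   <- sum(killed)                # NA: a single missing value propagates
total_clean <- sum(killed, na.rm = TRUE)  # 58: NAs are dropped before summing
avg_clean   <- mean(killed, na.rm = TRUE) # 29

# na.omit() drops every row that contains at least one NA
df <- data.frame(Year = c(2000, 2000, 2001), killed_min = killed)
complete <- na.omit(df)
nrow(complete)                            # 2
```

Whether to drop NAs inside each aggregation (na.rm) or to remove incomplete rows up front (na.omit) depends on whether the other columns of an incomplete row should still count.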

Here is a toy example:

library(data.table)

set.seed(12345)
dt <- data.table(
  Year        = sample(1980:2014, 1000, replace = TRUE),
  Date        = sample(1:10000, 1000, replace = TRUE),
  killed_min  = sample(c(15:150, NA), 1000, replace = TRUE),
  killed_max  = sample(c(NA, 250:1500), 1000, replace = TRUE),
  Injured_min = sample(150:250, 1000, replace = TRUE),
  Injured_max = sample(500:4000, 1000, replace = TRUE))

dt # Some killed_min / killed_max entries may be NA (which rows depends on the random draw)

dt[,list(killed_min=sum(killed_min,na.rm=TRUE),
         killed_max=sum(killed_max,na.rm=TRUE)),by=Year]

Hope this helps!!

Alternatively, you can also use .SDcols here with lapply in j as follows:

dt[, lapply(.SD, sum, na.rm=TRUE), by=Year, 
       .SDcols=c("killed_min", "killed_max")]
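
One possible way to deal with the min/max sub-features is to collapse each pair into a single midpoint estimate before grouping. The sketch below assumes that a midpoint is a reasonable summary for your data; the table `events` and the column names `killed_mid` and `n_missing` are hypothetical, introduced only for illustration:

```r
library(data.table)

# Hypothetical miniature of the question's data, with some missing bounds
events <- data.table(
  Year       = c(2000, 2000, 2001, 2001),
  killed_min = c(34, 24, NA, 10),
  killed_max = c(54, 34, 40, 20)
)

# Midpoint of the reported range; NA whenever either bound is missing
events[, killed_mid := (killed_min + killed_max) / 2]

# Per-year total of the midpoints, plus how many rows had no usable range
summary_by_year <- events[, .(
  killed_mid = sum(killed_mid, na.rm = TRUE),
  n_missing  = sum(is.na(killed_mid))
), by = Year]

summary_by_year
```

Reporting `n_missing` next to the total is useful because sum(..., na.rm = TRUE) returns 0 for a group whose values are all NA, which would otherwise look like a genuine zero.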

Licensed under: CC-BY-SA with attribution
