Question

I have a data frame with a POSIXct datetime column and a value column. The values may contain runs of NA, and sometimes there are gaps of an hour or more with no rows at all, e.g.:

t                   v
2014-01-01 20:00:00 1000
2014-01-01 20:15:00 2300
2014-01-01 20:30:00 1330
2014-01-01 20:45:00 NA
2014-01-01 21:00:00 NA
2014-01-01 22:15:00 NA
2014-01-01 22:30:00 1330
2014-01-01 22:45:00 3333

One can easily see that there is a period with no data written at all (21:00 to 22:15). When I now apply

aggregate(data, list(t = cut(data$t, "1 hour")), FUN = sum)

it treats anything missing as zero. When I plot the result with ggplot2 and geom_line, the curve in that region drops from the 1000s to the 10s.

I want aggregate to return NA for every hour that is not fully represented in the data (rows missing, or values that are NA themselves), so that the values are not pulled down towards 0 and the line plot shows a gap in that period (disconnected data points).


Solution

Thanks to @JulienNavarre and @user20650, who each contributed part of the solution. Here is my final version, which also handles data at irregular times and requires at least x values per hour before it aggregates.

data$t <- as.POSIXct(strptime(data$t, "%Y-%m-%d %H:%M:%S"))
x <- 4 # data available x times per hour
h <- 1 # aggregate to every h hours
# The aggregation yields NA unless a bin holds x*h readings, none of them NA.
# Note that aggregate() names the value column "x" in the result.
dataagg <- aggregate(data$v, list(t = cut(data$t, paste(h, "hours"))),
                     function(z) if (length(z) < x * h || any(is.na(z))) NA else sum(z))
# cut() returns a factor, so convert the bin labels back to POSIXct
dataagg$t <- as.POSIXct(strptime(dataagg$t, "%Y-%m-%d %H:%M:%S"))
# Now fill up missing datetimes with NA: aggregate() drops bins that have no
# rows at all, so merge against the complete sequence of bin starts
tdf <- data.frame(t = seq(min(dataagg$t), max(dataagg$t), by = paste(h, "hours")))
dataaggfinal <- merge(dataagg, tdf, by = "t", all.y = TRUE)
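As a rough, self-contained sketch of the approach above: with the sample from the question every hour comes out as NA (each hour either contains an NA or has fewer than four rows), so the data below is a made-up extension with two complete hours. The helper name `aggregate_with_gaps`, the example values, and `tz = "UTC"` are assumptions for illustration, not part of the original answer.

```r
# Hypothetical data: hours 19 and 22 are complete, hour 20 contains an NA,
# and hour 21 has no rows at all. tz = "UTC" keeps the run reproducible.
stamps <- c("19:00", "19:15", "19:30", "19:45",
            "20:00", "20:15", "20:30", "20:45",
            "22:00", "22:15", "22:30", "22:45")
data <- data.frame(
  t = as.POSIXct(paste0("2014-01-01 ", stamps, ":00"), tz = "UTC"),
  v = c(100, 200, 300, 400, 1000, 2300, 1330, NA, 10, 20, 30, 40)
)

# Hypothetical helper: sum v per h-hour bin, NA for incomplete or missing bins.
aggregate_with_gaps <- function(data, x = 4, h = 1) {
  step <- paste(h, "hours")
  agg <- aggregate(data$v, list(t = cut(data$t, step)),
                   function(z) if (length(z) < x * h || anyNA(z)) NA else sum(z))
  names(agg)[2] <- "v"                                # aggregate() calls it "x"
  agg$t <- as.POSIXct(as.character(agg$t), tz = "UTC") # factor label -> POSIXct
  grid <- data.frame(t = seq(min(agg$t), max(agg$t), by = step))
  out <- merge(agg, grid, by = "t", all.y = TRUE)      # re-insert empty bins as NA
  out[order(out$t), ]
}

res <- aggregate_with_gaps(data)
print(res)
```

Here only the 19:00 and 22:00 rows carry a sum; 20:00 is NA because one reading is NA, and 21:00 is NA because the merge re-inserts the hour that aggregate dropped.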

Other tips

It is not entirely clear what you want, but maybe you are looking for a right join, which you can do with merge and all.y = TRUE.

After that, you can compute the grouped sum with aggregate.

> data$t <- as.POSIXct(data$t)
> 
> time.seq <- seq(min(data$t), max(data$t), by = "15 min")
> 
> merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = TRUE)
                     t    v
1  2014-01-01 20:00:00 1000
2  2014-01-01 20:15:00 2300
3  2014-01-01 20:30:00 1330
4  2014-01-01 20:45:00   NA
5  2014-01-01 21:00:00   NA
6  2014-01-01 21:15:00   NA
7  2014-01-01 21:30:00   NA
8  2014-01-01 21:45:00   NA
9  2014-01-01 22:00:00   NA
10 2014-01-01 22:15:00   NA
11 2014-01-01 22:30:00 1330
12 2014-01-01 22:45:00 3333

Also, the x argument of aggregate should be the variable you want to sum, so in this case it is data$v, not data.
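A small sketch of the two steps together, using the sample from the question (the object names `grid`, `merged`, `strict` and `lenient`, and `tz = "UTC"`, are assumptions for illustration). It also shows one way zeros can appear: sum with na.rm = TRUE returns 0 for an hour that contains only NA, while a plain sum keeps the NA.

```r
# Sample data from the question; tz = "UTC" is an assumption for reproducibility.
stamps <- c("20:00", "20:15", "20:30", "20:45", "21:00", "22:15", "22:30", "22:45")
data <- data.frame(
  t = as.POSIXct(paste0("2014-01-01 ", stamps, ":00"), tz = "UTC"),
  v = c(1000, 2300, 1330, NA, NA, NA, 1330, 3333)
)

# Right join onto a regular 15-minute grid so that missing rows become NA.
grid <- data.frame(t = seq(min(data$t), max(data$t), by = "15 min"))
merged <- merge(data, grid, by = "t", all.y = TRUE)
attr(merged$t, "tzone") <- "UTC"  # make sure hourly bins are computed in UTC

hour <- cut(merged$t, "1 hour")

# Pass the value vector (merged$v), not the whole data frame.
strict  <- aggregate(merged$v, list(t = hour), FUN = sum)                # NA propagates
lenient <- aggregate(merged$v, list(t = hour), FUN = sum, na.rm = TRUE)  # NA dropped

print(strict)
print(lenient)
```

With this sample every hour contains at least one NA, so `strict` is NA throughout, whereas `lenient` reports partial sums and a 0 for the 21:00 hour.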

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow