Question

I'm trying to use the cut2() function from the Hmisc package to create a factor based on time periods.

Here's some code:

library(Hmisc)

i.time <- as.POSIXct("2013-07-16 13:55:14 CEST")
f.time <- i.time+as.difftime(1, units="hours")

data.points <- seq(from=i.time, to=f.time, by="1 sec")
cut.points <- seq(from=i.time, to=f.time, by="60 sec")



intervals <- cut2(x=data.points, cuts=cut.points, minmax=TRUE)

I expected the intervals to be created so that every point in data.points was placed in some interval of time. But there is an NA value at the end:

> tail(intervals, 1)
[1] <NA>
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ... [2013-07-16 14:54:14,2013-07-16 14:55:14]

I was expecting that the option minmax=TRUE would make sure that the cuts included all the values in data.points.

Can anyone clarify what's going on here? How can I use the cut2 function to generate a factor that includes all the values in the data?


Solution

The reason I use cut2 in preference to cut is that its default for "right" is the way I expect it to work (left-closed intervals). Looking at the code, I see that when 'cuts' is present in the argument list, the cut function is called with a shifted set of cuts, which has the effect of making the intervals left-closed. The code then relabels the factor to change the "("s to "["s, but it does not use include.lowest = TRUE. This has the effect of turning the last value into <NA>. At first I saw this as a bug, but after looking at it more closely I see that cut2's help page does not promise to handle either Date or date-time objects, so "bug" is too strong. It completely fails with Date objects, and it appears to be only an accident that it is almost correct with POSIXct objects. (This implementation is somewhat surprising to me, in that I always assumed that it was just using cut( ... , right=FALSE, include.lowest=TRUE).)

You can alter the code. One idea I had was to extend the range just past the right end point of the original data by changing this line:

r <- range(x,  na.rm = TRUE)

To this line:

r <- range(c(x,max(x)+min(diff(x.unique))/2),  na.rm = TRUE)

This is not exactly the result I expected: you get a new category at the right end, because the penultimate interval is still open on the right. (Here cut3 is the altered copy of cut2.)

> intervals <- cut3(x=data.points, cuts=cut.points, minmax=TRUE)
> tail(intervals, 1)
[1] 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14) 2013-07-16 14:55:14                      
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...

A different idea gives a more satisfactory result. Change only this line:

y <- cut(x, k2)

To this:

y <- cut(x, k2, include.lowest=TRUE)

This gives the expected final interval, closed on both the left and the right, and no NA:

> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14] [2013-07-16 14:54:14,2013-07-16 14:55:14]
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...

Note: with right=FALSE, include.lowest=TRUE actually acts as "include.highest". I'm still scratching my head about why I get the desired behavior in this case without also having to do something with the 'right' parameter. I sent Frank Harrell a message, and he is willing to consider revisions to the code to handle other cases. I'm working on that.
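
The "include.highest" effect is easy to see on plain numbers with base cut. This is only a minimal sketch: the vectors x and b are made up for illustration and do not come from the question.

```r
# Toy data (hypothetical): the largest value sits exactly on the last break.
x <- c(0, 2, 5, 10)
b <- c(0, 5, 10)

# Left-closed intervals with the last one open on the right:
# the value 10 falls outside every interval and becomes NA.
open_last <- cut(x, breaks = b, right = FALSE)

# With right = FALSE, include.lowest = TRUE closes the LAST interval
# on the right, so 10 is kept.
closed_last <- cut(x, breaks = b, right = FALSE, include.lowest = TRUE)

print(open_last)
print(closed_last)
```

The only difference between the two calls is include.lowest, which is the same one-argument change suggested for the cut(x, k2) line above.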

Why this is an issue: the labeling for cut.POSIXt and cut.Date differs from the labeling of cut.numeric (actually cut.default) results. The strategy of the former two is to report just the beginnings of the intervals, whereas the labels from cut.default include "[" and ")" and both ends of the intervals. Compare the output from these:

levels( cut(0+1:100, 3) )
levels( cut(Sys.time()+1:100, 3) )
levels( cut(Sys.Date()+1:100, 3) )

OTHER TIPS

From ?cut2:

minmax : if cuts is specified but min(x) < min(cuts) or max(x) > max(cuts), augments cuts to include min and max x

Checking your arguments:

x <- data.points
cuts <- cut.points
r <- range(x, na.rm = TRUE)
(r[1] < min(cuts)) | (r[2] > max(cuts))
## [1] FALSE, so there is no need to add the min and max

So here setting minmax doesn't change the result. But here is a result using cut with include.lowest=TRUE:

res <- cut(x=data.points, breaks=cut.points, include.lowest=TRUE)
table(is.na(res))
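
For comparison, here is a self-contained sketch of the base cut approach on the question's data, with and without include.lowest. The tz = "UTC" argument and the names res.open and res.closed are my own additions (the time zone is fixed only so the result does not depend on the local setting).

```r
# Rebuild the question's data; tz = "UTC" is an assumption for reproducibility.
i.time <- as.POSIXct("2013-07-16 13:55:14", tz = "UTC")
f.time <- i.time + as.difftime(1, units = "hours")

data.points <- seq(from = i.time, to = f.time, by = "1 sec")
cut.points  <- seq(from = i.time, to = f.time, by = "60 sec")

# cut.POSIXt defaults to right = FALSE, so the last interval is open on the
# right and the final data point (equal to the last break) becomes NA.
res.open <- cut(data.points, breaks = cut.points)

# include.lowest = TRUE is passed on to cut.default and closes the last interval.
res.closed <- cut(data.points, breaks = cut.points, include.lowest = TRUE)

sum(is.na(res.open))    # number of points left unclassified
sum(is.na(res.closed))  # same count with the last interval closed
nlevels(res.closed)     # one level per 60-second interval
```

Note that the levels produced this way are only the interval start times, not the "[start,end)" labels that cut2 prints.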
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow