Question

I'm trying to use the R function quantcut() to recode a numeric variable as a factor with levels corresponding to quantiles. For example:

> X
[1]  6  4  9  6  1  2  5  3  5  7 10  7  2  7  7  5  6  6  3  4  6  4  2  7  6  7
[27]  4  3  5  3  7  6  8 12  4  4  0  1  7  6  7  4  7  1  1  1  2  3  3  1  1  6
[53]  5  3  1  1  1  3  3  3  1  1  3  1  1  1  3  3  0  1  3  1  8  5  3  0  0  2
[79]  1  3  8  0  1  4  1  1  1  1  1  1  3  2  1  4  1  5  5 12  7  2  6  6  2  6
[105]  0  1  4  1  4  0  7  3  2  1  1  8  5  5  3  0  5  6  2  4  2  2  2  6  4  2
[131]  2  2  2  6  8  5  1  2  8  3  2  7  4  6  6  6  7  5  1  5  5  6  1  4  4  5
[157]  6  2  4  7  2  4 10  6  3  5  2  2  6  6  2  4  5  7  4  5 11  6  6  8  2  4
[183]  4  6 12 16  9  7 14 13 11  5  5  2  2  7  7  6  4  3  4  3  5  4  5  7  9  4
[209]  3 12  4  4  4  8  7  6  1  3  6  7  5  5  6  9  6  4  7  8  5  6  3  6  4  7
[235]  3  3  4  7  5  7  5  9  5  8  3  4  3  2  5  2  4  3  8  4  2  2  1  5  3  5
[261]  8  5  6  4  5  1  1  2  6  2  7  2  4  4  3  3  4 10  5  6 10  2  5  5  0  1
[287]  6  2  5  4  6  6  9  5  5  6  3  8  1  5  1  8  5  2  5  2  4  2  4  4

bins=10
labels = 1:bins
library(gtools)
x2 = quantcut(X, q = seq(0, 1, by=1/bins), labels=labels)

I get the error: "Error in cut.default(x[!flag], breaks = newquant, include.lowest = TRUE, : 'breaks' are not unique". I thought this was because there are ties in the quantiles, but the documentation for quantcut specifically shows an example of how the function can handle ties by using fewer intervals. The error occurs regardless of whether I specify the labels argument.

Any advice would be greatly appreciated.

EDIT: Here is code to enter the variable X:

X = c(6L, 4L, 9L, 6L, 1L, 2L, 5L, 3L, 5L, 7L, 10L, 7L, 2L, 7L, 7L, 
5L, 6L, 6L, 3L, 4L, 6L, 4L, 2L, 7L, 6L, 7L, 4L, 3L, 5L, 3L, 7L, 
6L, 8L, 12L, 4L, 4L, 0L, 1L, 7L, 6L, 7L, 4L, 7L, 1L, 1L, 1L, 
2L, 3L, 3L, 1L, 1L, 6L, 5L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 
3L, 1L, 1L, 1L, 3L, 3L, 0L, 1L, 3L, 1L, 8L, 5L, 3L, 0L, 0L, 2L, 
1L, 3L, 8L, 0L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 4L, 
1L, 5L, 5L, 12L, 7L, 2L, 6L, 6L, 2L, 6L, 0L, 1L, 4L, 1L, 4L, 
0L, 7L, 3L, 2L, 1L, 1L, 8L, 5L, 5L, 3L, 0L, 5L, 6L, 2L, 4L, 2L, 
2L, 2L, 6L, 4L, 2L, 2L, 2L, 2L, 6L, 8L, 5L, 1L, 2L, 8L, 3L, 2L, 
7L, 4L, 6L, 6L, 6L, 7L, 5L, 1L, 5L, 5L, 6L, 1L, 4L, 4L, 5L, 6L, 
2L, 4L, 7L, 2L, 4L, 10L, 6L, 3L, 5L, 2L, 2L, 6L, 6L, 2L, 4L, 
5L, 7L, 4L, 5L, 11L, 6L, 6L, 8L, 2L, 4L, 4L, 6L, 12L, 16L, 9L, 
7L, 14L, 13L, 11L, 5L, 5L, 2L, 2L, 7L, 7L, 6L, 4L, 3L, 4L, 3L, 
5L, 4L, 5L, 7L, 9L, 4L, 3L, 12L, 4L, 4L, 4L, 8L, 7L, 6L, 1L, 
3L, 6L, 7L, 5L, 5L, 6L, 9L, 6L, 4L, 7L, 8L, 5L, 6L, 3L, 6L, 4L, 
7L, 3L, 3L, 4L, 7L, 5L, 7L, 5L, 9L, 5L, 8L, 3L, 4L, 3L, 2L, 5L, 
2L, 4L, 3L, 8L, 4L, 2L, 2L, 1L, 5L, 3L, 5L, 8L, 5L, 6L, 4L, 5L, 
1L, 1L, 2L, 6L, 2L, 7L, 2L, 4L, 4L, 3L, 3L, 4L, 10L, 5L, 6L, 
10L, 2L, 5L, 5L, 0L, 1L, 6L, 2L, 5L, 4L, 6L, 6L, 9L, 5L, 5L, 
6L, 3L, 8L, 1L, 5L, 1L, 8L, 5L, 2L, 5L, 2L, 4L, 2L, 4L, 4L)

Solution

The issue can be traced to the quantiles themselves: as you suspected, there is a tie, and the 70% and 80% quantiles are both 6. quantcut uses quantile internally, so you can see the duplicated break directly:

quantile(X,probs=seq(0,1,0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 0.0  1.0  2.0  3.0  3.6  4.0  5.0  6.0  6.0  8.0 16.0 
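A quick way to check which breakpoints collide, using only base R. This is a minimal sketch on a small made-up vector (not the X from the question); the names x, q and dup are just for illustration.

```r
# Small illustrative vector with heavy ties (not the X from the question)
x <- c(0, 1, 1, 1, 2, 2, 2, 2, 2, 3)

# Quartile breakpoints; names are "0%", "25%", "50%", "75%", "100%"
q <- quantile(x, probs = seq(0, 1, 0.25))

# duplicated() flags every breakpoint that repeats an earlier one
dup <- duplicated(q)
names(q)[dup]   # the quantile(s) that would make 'breaks' non-unique
```

Any name returned here marks a break that cut would reject as non-unique.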

I can't see a way to address this using quantcut itself, but you can combine cut, quantile and unique to sort it out: unique drops the repeated breakpoint, so the tied quantiles collapse into a single interval. From what I can tell, this is what quantcut does internally when there are ties anyway.

result <- cut(X, unique(quantile(X, probs = seq(0, 1, 0.1))), include.lowest = TRUE)

> result[2:10]
 [1] (3.6,4] (8,16]  (5,6]   [0,1]   (1,2]   (4,5]   (2,3]   (4,5]   (6,8]  
#Levels: [0,1] (1,2] (2,3] (3,3.6] (3.6,4] (4,5] (5,6] (6,8] (8,16]
> X[2:10]
 [1]      4      9      6       1       2       5       3       5       7
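If you also want integer labels like the 1:bins in the question, note that the number of labels has to match the number of intervals that survive unique (9 above, not 10). One way to handle that is sketched below. quantile_bins is a hypothetical helper written for this answer, not part of gtools, and the example vector is made up rather than the X from the question.

```r
# Hypothetical helper (not part of gtools): bin x at its quantiles,
# dropping duplicated breakpoints and numbering the surviving bins.
quantile_bins <- function(x, bins = 10) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  # Guard: a constant vector leaves one break, which cut() would misread
  # as a requested number of intervals.
  stopifnot(length(breaks) >= 2)
  # One label per interval; there may be fewer intervals than `bins`
  labels <- seq_len(length(breaks) - 1)
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Small illustrative vector with heavy ties (not the X from the question)
x <- c(0, 1, 1, 1, 2, 2, 2, 2, 2, 3)
binned <- quantile_bins(x, bins = 4)
nlevels(binned)   # fewer than 4 levels when quantiles tie
table(binned)
```

The trade-off is that the bins no longer hold equal shares of the data, which is unavoidable when a single value spans more than one quantile.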
Licensed under: CC-BY-SA with attribution