Question

I have a data frame called "data" with roughly 10 000 rows (the exact number doesn't matter for this illustration). It has a numeric column called "SimDelta" that I want to split into 4 categories (0-0.25, 0.25-0.5, 0.5-0.75, and >0.75), which I do with this piece of code:

data$SimDeltaClass = 
       ifelse(data$SimDelta>0.75, ">0.75",
       ifelse(data$SimDelta<0.75&data$SimDelta>0.5, "0.5-0.75",
       ifelse(data$SimDelta<0.5&data$SimDelta>0.25, "0.25-0.5",
       ifelse(data$SimDelta<0.25&data$SimDelta>0, "0-0.25", "void"))))

This is then plotted as a boxplot of the four categories, and the number of samples in each category is written above its box using:

text(x=1,y=1.07,length(data$rMF[data$SimDeltaClass=="0-0.25"]),cex=0.8,col="black")
text(x=2,y=1.07,length(data$rMF[data$SimDeltaClass=="0.25-0.5"]),cex=0.8,col="black")
text(x=3,y=1.07,length(data$rMF[data$SimDeltaClass=="0.5-0.75"]),cex=0.8,col="black")
text(x=4,y=1.07,length(data$rMF[data$SimDeltaClass==">0.75"]),cex=0.8,col="black")

This expression, length(data$rMF[data$SimDeltaClass=="0-0.25"]), should give the number of samples per group. When the 4 counts are summed I get a value in excess of 14 000, far more than the 10 000 I expected.

Why is this not forming the categories correctly? I based it on a previous piece of code that I wrote, which works perfectly, so I am not sure what R (or I) am struggling with.

Obviously I need to edit the ifelse() section because it assigns values incorrectly, but I don't know what to change.

Note: there are no error messages or warnings, and the str() output is the same as for the version that works.

Solution

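You likely have NAs in SimDelta. ifelse() returns NA for those rows, and subsetting with a logical index that contains NA yields one NA element per NA, which length() then counts in every group: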

> x = c(1, NA)
> x[x==1]
[1]  1 NA
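
To make the inflation concrete, here is a minimal sketch using a made-up data frame (`toy`, with hypothetical values) standing in for the asker's `data`. It assumes the missing values sit in SimDelta; the nested ifelse() is simplified but classifies these values the same way. Two ways of counting that ignore NA are shown alongside the inflated count.

```r
# Hypothetical toy data: two missing SimDelta values out of six rows
toy <- data.frame(SimDelta = c(0.1, 0.3, NA, 0.6, NA, 0.9),
                  rMF      = c(1, 2, 3, 4, 5, 6))
toy$SimDeltaClass <- ifelse(toy$SimDelta > 0.75, ">0.75",
                     ifelse(toy$SimDelta > 0.5,  "0.5-0.75",
                     ifelse(toy$SimDelta > 0.25, "0.25-0.5",
                     ifelse(toy$SimDelta > 0,    "0-0.25", "void"))))

# Logical subsetting keeps one NA element for every NA in the index
inflated <- length(toy$rMF[toy$SimDeltaClass == "0-0.25"])        # 3: one match plus two NAs

# which() drops NA indices; sum(..., na.rm = TRUE) counts TRUE values only
n_which <- length(toy$rMF[which(toy$SimDeltaClass == "0-0.25")])  # 1
n_sum   <- sum(toy$SimDeltaClass == "0-0.25", na.rm = TRUE)       # 1
```

Because every group picks up the same NA rows, the four counts together overshoot the row count by three times the number of NAs.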

Use cut() rather than nested ifelse() calls (the default levels you get when the labels= argument is omitted are arguably better).

set.seed(123); x = c(runif(10, -1, 2), NA)
y = cut(x, c(-Inf, seq(0, .75, .25), Inf), 
        labels=c("void", "0-0.25", "0.25-0.5", "0.5-0.75", ">0.75"))

leading to

> y
 [1] void     >0.75    0-0.25   >0.75    >0.75    void     0.5-0.75 >0.75   
 [9] 0.5-0.75 0.25-0.5 <NA>    
Levels: void 0-0.25 0.25-0.5 0.5-0.75 >0.75
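
One further difference from the nested ifelse() is worth checking: values that fall exactly on a break. The sketch below (the `edge` vector is an illustrative assumption, not the asker's data) compares the two approaches. cut() uses right-closed intervals by default, so each boundary value joins the bin below it, whereas the strict < and > comparisons from the question match none of the ranges.

```r
breaks <- c(-Inf, seq(0, 0.75, 0.25), Inf)
edge   <- c(0, 0.25, 0.5, 0.75)   # values sitting exactly on a break

# Right-closed intervals: 0.25 belongs to (0, 0.25], and so on
by_cut <- as.character(cut(edge, breaks,
              labels = c("void", "0-0.25", "0.25-0.5", "0.5-0.75", ">0.75")))
# "void" "0-0.25" "0.25-0.5" "0.5-0.75"

# The comparisons from the question send every boundary value to "void"
by_ifelse <- ifelse(edge > 0.75, ">0.75",
             ifelse(edge < 0.75 & edge > 0.5,  "0.5-0.75",
             ifelse(edge < 0.5  & edge > 0.25, "0.25-0.5",
             ifelse(edge < 0.25 & edge > 0,    "0-0.25", "void"))))
# "void" "void" "void" "void"

# Without labels=, cut() names each level after its interval
default_levels <- levels(cut(edge, breaks))
```

Printing `default_levels` shows the interval notation, which makes the open and closed ends explicit in a way the hand-written labels do not.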

Use table() to summarize the data.

> table(y)
y
    void   0-0.25 0.25-0.5 0.5-0.75    >0.75 
       2        1        1        2        4 
> table(y, useNA="ifany")
y
    void   0-0.25 0.25-0.5 0.5-0.75    >0.75     <NA> 
       2        1        1        2        4        1 

text() is vectorized, so all four counts can be drawn in a single call.

text(1:4, 1.07, table(y)[2:5])

Complete solution (tested by rg255)

data$SimDeltaClass <- cut(data$SimDelta, c(-Inf, seq(0, .75, .25), Inf),
    labels=c("void", "0-0.25", "0.25-0.5", "0.5-0.75", ">0.75"))
text(x=1:4, y=1.07, table(data$SimDeltaClass[fdr])[2:5], cex=0.8, col="black")
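
For anyone who wants to run the whole thing end to end, here is a self-contained sketch on simulated data. The data frame `sim`, its values, and the helper names `keep`, `plotted` and `counts` are assumptions made for illustration; dropping the "void" rows before plotting is a choice made here so that the four boxes line up with x positions 1 to 4.

```r
set.seed(1)
# Simulated stand-in for the asker's data, including 20 missing values
sim <- data.frame(SimDelta = c(runif(200, -0.2, 1.2), rep(NA, 20)),
                  rMF      = runif(220))
sim$SimDeltaClass <- cut(sim$SimDelta, c(-Inf, seq(0, .75, .25), Inf),
    labels = c("void", "0-0.25", "0.25-0.5", "0.5-0.75", ">0.75"))

# Keep only the four plotted classes; droplevels() removes the unused "void" level
keep    <- !is.na(sim$SimDeltaClass) & sim$SimDeltaClass != "void"
plotted <- droplevels(sim[keep, ])
counts  <- table(plotted$SimDeltaClass)

# One box per class, with the sample count written above each box
boxplot(rMF ~ SimDeltaClass, data = plotted, ylim = c(0, 1.1))
text(x = seq_along(counts), y = 1.07, labels = counts, cex = 0.8, col = "black")
```

Since table() ignores NA by default, the counts sum to the number of plotted rows rather than exceeding the size of the data.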
License: CC-BY-SA with attribution
Not affiliated with StackOverflow