R - cox hazard model not including levels of a factor

https://stackoverflow.com/questions/21367259

03-10-2022

Question

I am fitting a Cox model to some data that is structured like this:

str(test)
'data.frame':   147 obs. of  8 variables:
 $ AGE              : int  71 69 90 78 61 74 78 78 81 45 ...
 $ Gender           : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
 $ RACE             : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
 $ SIDE             : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
 $ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
 $ RUTH.CLASS       : int  3 5 4 5 4 3 3 5 5 5 ...
 $ LESION.TYPE      : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
 $ Primary          : int  1190 1032 166 689 219 840 1063 115 810 157 ...

The RUTH.CLASS variable is actually a factor, so I converted it to one:

> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
 3  4  5  6 
48 56 35  8

great.

after fitting the model

stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
> 
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)

  n= 147, number of events= 147 

              coef exp(coef) se(coef)     z Pr(>|z|)   
RUTH.CLASS4 0.1599    1.1734   0.1987 0.804  0.42111   
RUTH.CLASS5 0.5848    1.7947   0.2263 2.585  0.00974 **
RUTH.CLASS6 0.3624    1.4368   0.3846 0.942  0.34599   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4     1.173     0.8522    0.7948     1.732
RUTH.CLASS5     1.795     0.5572    1.1518     2.796
RUTH.CLASS6     1.437     0.6960    0.6762     3.053

Concordance= 0.574  (se = 0.026 )
Rsquare= 0.045   (max possible= 1 )
Likelihood ratio test= 6.71  on 3 df,   p=0.08156
Wald test            = 7.09  on 3 df,   p=0.06902
Score (logrank) test = 7.23  on 3 df,   p=0.06478

> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"

When I fit more variables in the model, similar things happen:

cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
> 
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION + 
    LESION.TYPE, data = test)

  n= 147, number of events= 147 

                                          coef exp(coef) se(coef)      z Pr(>|z|)  
RUTH.CLASS4                            -0.5854    0.5569   1.1852 -0.494   0.6214  
RUTH.CLASS5                            -0.1476    0.8627   1.0182 -0.145   0.8847  
RUTH.CLASS6                            -0.4509    0.6370   1.0998 -0.410   0.6818  
LESION.INDICATIONEMBOLIC               -0.4611    0.6306   1.5425 -0.299   0.7650  
LESION.INDICATIONISCHEMIA               1.3794    3.9725   1.1541  1.195   0.2320  
LESION.INDICATIONISCHEMIA/CLAUDICATION  0.2546    1.2899   1.0189  0.250   0.8027  
LESION.INDICATIONREST PAIN              0.5302    1.6993   1.1853  0.447   0.6547  
LESION.INDICATIONTISSUE LOSS            0.7793    2.1800   1.0254  0.760   0.4473  
LESION.TYPEOCCLUSION                   -0.5886    0.5551   0.4360 -1.350   0.1770  
LESION.TYPESTEN                        -0.7895    0.4541   0.4378 -1.803   0.0714 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                                       exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4                               0.5569     1.7956   0.05456     5.684
RUTH.CLASS5                               0.8627     1.1591   0.11726     6.348
RUTH.CLASS6                               0.6370     1.5698   0.07379     5.499
LESION.INDICATIONEMBOLIC                  0.6306     1.5858   0.03067    12.964
LESION.INDICATIONISCHEMIA                 3.9725     0.2517   0.41374    38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION    1.2899     0.7752   0.17510     9.503
LESION.INDICATIONREST PAIN                1.6993     0.5885   0.16645    17.347
LESION.INDICATIONTISSUE LOSS              2.1800     0.4587   0.29216    16.266
LESION.TYPEOCCLUSION                      0.5551     1.8015   0.23619     1.305
LESION.TYPESTEN                           0.4541     2.2023   0.19250     1.071

Concordance= 0.619  (se = 0.028 )
Rsquare= 0.137   (max possible= 1 )
Likelihood ratio test= 21.6  on 10 df,   p=0.01726
Wald test            = 22.23  on 10 df,   p=0.01398
Score (logrank) test = 23.46  on 10 df,   p=0.009161

> levels(test$LESION.INDICATION)
[1] "CLAUDICATION"          "EMBOLIC"               "ISCHEMIA"              "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN"             "TISSUE LOSS"          
> levels(test$LESION.TYPE)
[1] ""          "OCCLUSION" "STEN"

truncated output from model.matrix below:

> model.matrix(cox.fit)
    RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1             0           0           0                        0                         0
2             0           1           0                        0                         0

We can see that the first level of each factor is being excluded from the model. Any input would be greatly appreciated. I also noticed that for the LESION.TYPE variable the blank level "" is the one left out, but that is not by design: it should be NA or something similar.

I'm confused and could use some help with this. Thanks.

Solution

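With a factor predictor, R does not estimate one coefficient per level. It codes the factor with contrasts, and the default for unordered factors (contr.treatment) uses the first level as the baseline. Each coefficient you see is the log hazard ratio of that level relative to the baseline, so RUTH.CLASS5 compares class 5 with class 3. A separate coefficient for the baseline level would be redundant: the levels are exhaustive and mutually exclusive for every observation, so an observation in the baseline level is exactly the one where all the other indicators are 0, and that case is already absorbed into the baseline hazard. You can change the default coding by setting the contrasts option.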
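
As a rough illustration of the baseline coding, the sketch below builds the design matrix for a small made-up factor and then switches the reference level with relevel(). The data frame toy and its values are invented here, not taken from the question. As far as I know, coxph() codes factors the same way (minus the intercept column), so the idea should carry over to the Cox model.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(RUTH.CLASS = factor(c(3, 4, 5, 6, 3, 5)))

# Treatment contrasts (the R default for unordered factors), set explicitly
# here so the result does not depend on the session's options.
mm <- model.matrix(~ RUTH.CLASS, data = toy,
                   contrasts.arg = list(RUTH.CLASS = "contr.treatment"))
colnames(mm)
# "(Intercept)" "RUTH.CLASS4" "RUTH.CLASS5" "RUTH.CLASS6"
# Level "3" gets no column: it is the case where all three indicators are 0.

# Make "5" the reference level instead; now "5" is the level without a column.
toy$RUTH.CLASS <- relevel(toy$RUTH.CLASS, ref = "5")
mm5 <- model.matrix(~ RUTH.CLASS, data = toy,
                    contrasts.arg = list(RUTH.CLASS = "contr.treatment"))
colnames(mm5)
# "(Intercept)" "RUTH.CLASS3" "RUTH.CLASS4" "RUTH.CLASS6"
```

Refitting after relevel() gives hazard ratios relative to the new reference level; the fitted model itself is the same, only the parameterization changes.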

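For coefficients expressed as deviations from the average over all levels (sum-to-zero contrasts; one level is still omitted, because its effect is minus the sum of the others):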

options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))

For coefficients relative to a specific reference level (treatment contrasts, which is the default and what you have above):

options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
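
Changing options() affects every model fitted afterwards in the session. As an alternative, here is a minimal sketch that attaches sum-to-zero contrasts to a single factor with contrasts<-. The data frame toy is made up for illustration and is not the asker's data.

```r
# Toy factor with four levels; values are invented for illustration only.
toy <- data.frame(RUTH.CLASS = factor(c(3, 4, 5, 6, 3, 5)))

# Sum-to-zero coding for four levels: the last row is -1 in every column,
# so the last level's effect is minus the sum of the three shown coefficients.
contr.sum(4)

# Attach the coding to this one factor instead of changing options().
contrasts(toy$RUTH.CLASS) <- contr.sum(4)
mm_sum <- model.matrix(~ RUTH.CLASS, data = toy)
colnames(mm_sum)
# "(Intercept)" "RUTH.CLASS1" "RUTH.CLASS2" "RUTH.CLASS3"
```

The columns are numbered 1 to 3 instead of being named after levels, because each coefficient is now a deviation from the overall mean, not a comparison with one reference level.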

As you can see, there are two types of factors in R: unordered (categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree). The contrasts option takes one setting for each type.
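
To make the difference concrete, here is a small sketch with made-up values showing how the same data are coded as an unordered and as an ordered factor under the default settings. Whether RUTH.CLASS should be treated as ordered is a judgement call the answer does not make; the variable names unord and ord are just for illustration.

```r
# Set the defaults explicitly and remember the previous settings.
old <- options(contrasts = c(unordered = "contr.treatment", ordered = "contr.poly"))

# The same made-up values stored as an unordered and as an ordered factor.
x <- c(3, 4, 5, 6, 3, 5)
unord <- factor(x)
ord   <- factor(x, ordered = TRUE)

cn_unord <- colnames(model.matrix(~ unord))
cn_unord
# "(Intercept)" "unord4" "unord5" "unord6"  (one indicator per non-reference level)

cn_ord <- colnames(model.matrix(~ ord))
cn_ord
# "(Intercept)" "ord.L" "ord.Q" "ord.C"     (linear, quadratic, cubic trends)

options(old)  # restore the previous contrasts setting
```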

Licensed under: CC-BY-SA with attribution
