Tree regression plot with classes of the dependent variable

https://stackoverflow.com/questions/23385452

12-07-2023
|

Question

I want to run a tree regression. The data is this format:

           L2         L3        L4        L5        L6 ele         ndvi    nd_var nd_ps ldclas
1 0.010814554 0.11304182 0.1360298 0.2098749 0.2437155 179  0.012483470  286688.2  7361   agri
2 0.010853562 0.10954640 0.1279681 0.1986370 0.2224236 183 -0.005020924  383210.9  7353   agri
3 0.011879258 0.12245614 0.1507865 0.2681184 0.2980641 184  0.005531083 1210329.6  7539   agri
4 0.009947186 0.09288491 0.1018834 0.2433811 0.2778357 193 -0.043884473  372672.2  7189   agri
5 0.010979766 0.10698310 0.1283619 0.2131286 0.2349639 193 -0.022636201  472360.7  7392   agri
6 0.011418039 0.11616439 0.1401070 0.2539036 0.3128864 195 -0.001042468  629364.2  7263   agri

ldclas is the dependent variable. ldclas has 10 levels, namely agri, tea, teak, rubber etc..

output of dput(tt) is

structure(list(L2 = c(0.00912571167754499, 0.00930928144178689, 
0.00934829001668829, 0.0088274108106519, 0.00936205774900643, 
0.00895361502356821, 0.00898573973231054, 0.00755389557122373, 
0.0075997880122842, 0.00758602027996606, 0.00788891039096519, 
0.00775582231188981, 0.00781777710732146, 0.00793250820997264, 
0.00815738117116897, 0.00817114890348711), L3 = c(0.0878981140668165, 
0.0923722488117655, 0.0880612335627261, 0.0763632354274946, 0.0775283746839917, 
0.082748198553099, 0.0864766441738899, 0.0545518285458678, 0.0588628437949073, 
0.0566956847778226, 0.0579540351748395, 0.0588628437949073, 0.0606105526796531, 
0.0575345850425006, 0.0649681734989524, 0.0623116559941389), 
    L4 = c(0.0848333226476736, 0.0903004613645694, 0.088516691528972, 
    0.073088240743156, 0.0761924635739359, 0.0779299017254917, 
    0.0815206072387071, 0.036532542034421, 0.0375518390833337, 
    0.0378298291875827, 0.0388722920785162, 0.0384089752381013, 
    0.0395672673391385, 0.0402622425997609, 0.0436212896927688, 
    0.0423240025396071), L5 = c(0.22561265031896, 0.236273695432274, 
    0.208398062322137, 0.17396888632849, 0.135616814946827, 0.208075000349006, 
    0.217836087108599, 0.118148392542544, 0.198013927471506, 
    0.166792295353943, 0.149716162488461, 0.183937655785095, 
    0.18880666123728, 0.129386334036449, 0.223697354335399, 0.193560287413347
    ), L6 = c(0.177203322015849, 0.200068266889341, 0.190253179119034, 
    0.163732501780303, 0.16849603196228, 0.173259562144258, 0.184647722672334, 
    0.0603306628998872, 0.0772578120116587, 0.0753302439845328, 
    0.0664678622506211, 0.0696583196748293, 0.0774350596463369, 
    0.0615492403883001, 0.0991922068030903, 0.0796728110341496
    ), ele = c(666, 773, 766, 678, 787, 809, 857, 738, 748, 855, 
    500, 612, 588, 397, 261, 258), ndvi = c(-0.0283995447391665, 
    -0.0135402419404802, -0.0395083528567925, -0.0819444409706586, 
    -0.103586067539291, -0.0490366118119649, -0.0288226681221347, 
    -0.17071641510358, -0.136545326259316, -0.154017449391041, 
    -0.16240155229558, -0.146503439773889, -0.136064892814646, 
    -0.168614157809797, -0.122837753698589, -0.144167470536185
    ), nd_var = c(131202.666666667, 433640.666666667, 461440.222222222, 
    210334.888888889, 79202, 4817.55555555556, 55640.6666666667, 
    105110.222222222, 263000.888888889, 63993.5555555556, 95738.8888888889, 
    29214, 34386.8888888889, 74852.6666666667, 63421.5555555556, 
    47259.5555555556), nd_ps = c(7836, 7407, 8644, 7460, 8731, 
    7675, 8202, 8457, 8160, 8152, 7705, 8108, 8016, 7898, 7901, 
    7946), ldclas = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("agri", "coconut", 
    "DDF", "grass", "MDF", "rubber", "tea", "teak", "water", 
    "young rubber"), class = "factor")), .Names = c("L2", "L3", 
"L4", "L5", "L6", "ele", "ndvi", "nd_var", "nd_ps", "ldclas"), row.names = 95:110, class = "data.frame")

I have used the following code:

library(party)
ct <- ctree(ldclas ~ L2 + L3 + L4 + L5 + L6 + ele + ndvi + nd_var + nd_ps, data = tt)

I get the result like:

1) ele <= 637; criterion = 1, statistic = 216.044
  2) L3 <= 0.09185959; criterion = 1, statistic = 187.431
    3) L5 <= 0.05141302; criterion = 1, statistic = 165.797
      4)*  weights = 12

But I am not able to know which class of dependent variable is segregated in the regression tree. Eg: which class of response variable is getting classified for ele >637 and what is the code to get this on the plot?

Solution

I'm not sure I fully understand your question and you haven't provided any reproducible example, so I'll try to wrap it up with a stand alone example and adress your comments on the way

So lets run a classification tree with response variable with 3 different levels

library(party)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct)

enter image description here

So the plot shows us what is the distribution (in percentage) of the explained variable in each leaf (terminal node). For example, you can see that in node number 2, we have 100% of Setosa. The n=50 (you asked about it the comments) means that we have 50 observations in that specific node (not unique, but overal). Now if we want to see the tree structure, we can do:

irisct
##1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
##  2)*  weights = 50 
##1) Petal.Length > 1.9
##  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
##    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
##      5)*  weights = 46 
##    4) Petal.Length > 4.8
##      6)*  weights = 8 
##  3) Petal.Width > 1.7
##    7)*  weights = 46

You can see that in 2)* weights = 50, wieghts = 50, which means we have 50 observations in that node. As we didn't specify the weights parameter in the ctree(), ctree by default sets a weight of 1 to each observation (you can set the weights parameter differntly, see ?ctree). You can also see * at some nodes, which means they are terminal nodes.

Now to get to you main question, you can get the distribution of each level in each node (no metter if it's terminal or not) by using the following code

target <- "Species" # your explained variable, which will be "ldclas" in your case
Node <- 5 # the node you want to investigate
n <- nodes(irisct, Node)[[1]] # retreving the weights of that node
x <- iris[which(as.logical(n$weights)), ] # retreiving all inforamtion for that node
paste(paste(names(table(x[target])), ": ", round((as.numeric(table(x[target]))/nrow(x))*100, 3), "%", sep = ""), collapse = ", ")
## [1] "setosa: 0%, versicolor: 97.826%, virginica: 2.174%"

The output gives you the distribution of each level in that specific node

Hope that was what you needed

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow