Question

I have a dataset of forest polygons and I am trying to compare the Field classifications with the Map classifications using a confusion matrix. The only package I could find that handles more than 2 classes and can compare text values is 'mda', so I am using its 'confusion' function.

The provided example with the package is...

library(mda)
data(iris)
irisfit <- fda(Species ~ ., data = iris)
confusion(predict(irisfit, iris), iris$Species)
                 Setosa       Versicolor       Virginica
Setosa            50              0               0
Versicolor         0             48               1
Virginica          0              2              49

attr(, "error"):
[1] 0.02

I run mine as simply

data(Habitat)
confusion(Habitat$Field,Habitat$Map)

This gives me a confusion matrix similar to (but not nearly as neat as) the example output. At this point I get lost, because my output ends with two attributes instead of one.

attr(,"error")
[1] 0.3448276
attr(,"mismatch")
[1] 0.889313

I understand error. For mismatch, however, I cannot find any explanation online or in the documentation that comes with the package. I doubt that such a high "mismatch" value is good, but I have no idea how to interpret it. This is probably a fairly specific question that only someone who has worked with this package can answer, but if anyone knows, or has a hint on how to find out, I would greatly appreciate it.

Thanks, Ayden

EDIT: I have added a clip of my dataset showing what may be the mismatch (I suspect it means consistent misclassifications). The clip shows the most consistent misclassification.

structure(list(Field = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 8L), .Label = c("Black Spruce ", "Clearcut ", 
"Deciduous ", "Jack Pine ", "Lowland Conifer ", "Marshwillow ", 
"Mixed Conifer ", "Open Muskeg ", "Rock ", "Treed Muskeg ", "Upland Conifer ", 
"Young Conifer", "Young Deciduous"), class = "factor"), Map = structure(c(7L, 
7L, 7L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 13L, 13L, 13L, 6L), .Label = c("Black     Spruce", "Clearcut", "Deciduous", "Jack Pine", "Lowland Conifer", "Marshwillow", 
"Mixed Conifer", "Open Muskeg", "Rock", "Treed Muskeg", "Upland Conifer", 
"Young Conifer", "Young Deciduous"), class = "factor")), .Names = c("Field", 
"Map"), row.names = 143:156, class = "data.frame")

Solution

It seems to mean that the two variables don't share a common set of values. If one is predicting the other, it is predicting values that are not present in the other (or the other way round). Mismatch seems to be the proportion of cases assigned a value that is not present in the levels of the other variable.
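That reading can be sketched in base R without 'mda'. The helper name mismatch_share is made up for illustration, and the calculation is an assumption rather than something taken from the package documentation: it is simply the arithmetic that reproduces the error and mismatch values in the iris outputs in this answer (1/150, 3/149 and 4/150).

```r
# Hypothetical helper, not part of 'mda': cross-tabulate the two factors,
# keep only the labels that appear as levels of BOTH, and report
#   mismatch = share of cases outside that shared block
#   error    = off-diagonal share inside the shared block
mismatch_share <- function(predicted, true) {
  tab <- table(predicted = predicted, true = true)
  common <- intersect(rownames(tab), colnames(tab))
  shared <- tab[common, common, drop = FALSE]
  list(
    mismatch = (sum(tab) - sum(shared)) / sum(tab),
    error = 1 - sum(diag(shared)) / sum(shared)
  )
}

# Toy data: "c" is a level of the predictions only
pred  <- factor(c("a", "a", "b", "c"), levels = c("a", "b", "c"))
truth <- factor(c("a", "b", "b", "a"), levels = c("a", "b"))
res <- mismatch_share(pred, truth)
print(res)
```

In the toy data one of four cases carries a label the other factor lacks, so a mismatch close to 1, like the 0.889 in the question, would suggest that almost no labels match exactly between the two columns.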

In the iris example you posted, we can reproduce this output by introducing a new value into one of the variables in the confusion matrix. Since they're factors, we need to create a new factor level first.

library(mda)
data(iris)
irisfit <- fda(Species ~ ., data = iris)
iris$Predict <- predict(irisfit, iris)
# add a new level 'monsterosa' to the predictions
iris$Predict <- factor(iris$Predict, levels = c("setosa", "versicolor",
      "virginica", "monsterosa"))
iris$Predict[1] <- "monsterosa"  # assign it to one of the observations

Now we can re-run the confusion function and get a mismatch:

confusion(iris$Predict, iris$Species)
            true
predicted    setosa versicolor virginica
  setosa         49          0         0
  versicolor      0         48         1
  virginica       0          2        49
  monsterosa      1          0         0
attr(,"error")
[1] 0.02013423
attr(,"mismatch")
[1] 0.006666667

And if we refactor the other variable to include all levels present in both variables, the mismatch goes away:

iris$Species <- factor(iris$Species, levels = c("setosa", "versicolor",
      "virginica", "monsterosa"))
confusion(iris$Predict, iris$Species)
            true
predicted    setosa versicolor virginica monsterosa
  setosa         49          0         0          0
  versicolor      0         48         1          0
  virginica       0          2        49          0
  monsterosa      1          0         0          0
attr(,"error")
[1] 0.02666667

To track it down, I would compare as.character(unique(Habitat$Field)) with as.character(unique(Habitat$Map)). The as.character is not needed, but it makes the output easier to read.
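One way to make that comparison less error-prone than reading two printed lists is setdiff() on the levels. This is only a sketch: habitat_demo is a small made-up stand-in for the Habitat data, built from labels like the ones in the question.

```r
# Made-up stand-in for Habitat (assumption: the real data has the same shape)
habitat_demo <- data.frame(
  Field = factor(c("Mixed Conifer ", "Open Muskeg ", "Rock ")),
  Map   = factor(c("Mixed Conifer", "Marshwillow", "Rock"))
)

# Labels that exist on one side only; print() quotes each string,
# which makes stray trailing spaces visible
only_in_field <- setdiff(levels(habitat_demo$Field), levels(habitat_demo$Map))
only_in_map   <- setdiff(levels(habitat_demo$Map), levels(habitat_demo$Field))
print(only_in_field)
print(only_in_map)
```

If both results are empty, the two factors use the same labels and no mismatch should be reported.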

Now that you've added data, I see that the issue seems to be trailing spaces at the end of some labels and repeated spaces between words in others.

# see problem
as.character(levels(Habitat$Field))
as.character(levels(Habitat$Map))

# fix problem

# unfactor them for now so we can replace spaces
Habitat$Field<-as.character(Habitat$Field)
Habitat$Map<-as.character(Habitat$Map)

# replace unwanted spaces
Habitat$Field <- gsub("[[:space:]]*$","",Habitat$Field) #gets ending spaces
Habitat$Map <- gsub("[[:space:]]*$","",Habitat$Map) #gets ending spaces
Habitat$Map <- gsub("[[:space:]]{2,}"," ",Habitat$Map) # gets double spaces
Habitat$Field <- gsub("[[:space:]]{2,}"," ",Habitat$Field) # gets double spaces

# factor them again
Habitat$Field <-factor(Habitat$Field)
Habitat$Map<-factor(Habitat$Map)
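The same cleanup can be wrapped in a small function and followed by a check that both factors end up with one shared set of levels. This is a sketch under assumptions: clean_labels is a made-up helper name, trimws() requires R 3.2.0 or later, and the vectors below are made-up stand-ins for the Habitat columns.

```r
# Hypothetical helper: trim both ends, collapse runs of whitespace
# into a single space, then turn the result back into a factor
clean_labels <- function(x) {
  x <- trimws(as.character(x))
  x <- gsub("[[:space:]]+", " ", x)
  factor(x)
}

# Made-up stand-ins with the two problems seen in the question
field <- factor(c("Mixed Conifer ", "Open Muskeg ", "Black Spruce "))
map   <- factor(c("Mixed Conifer", "Marshwillow", "Black     Spruce"))

field_clean <- clean_labels(field)
map_clean   <- clean_labels(map)

# Give both factors the union of their levels so the table is square
all_levels  <- sort(union(levels(field_clean), levels(map_clean)))
field_clean <- factor(field_clean, levels = all_levels)
map_clean   <- factor(map_clean, levels = all_levels)

print(table(Field = field_clean, Map = map_clean))
```

Sharing the levels mirrors the refactoring step in the iris example above: once both variables have the same level set, a label that appears on one side only shows up as an ordinary misclassification rather than as a mismatch.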
License: CC-BY-SA with attribution
Not affiliated with StackOverflow