Question

I am probably making a very simple (and stupid) mistake here, but I cannot figure it out. I am playing with some data from Kaggle (Digit Recognizer) and trying to use SVM with the caret package to do some classification. If I just plug the label values into the function as type numeric, caret's train function seems to default to regression and performance is quite poor. So next I converted the labels to a factor with factor() and tried to run SVM classification. Here is some code where I generate some dummy data and then plug it into caret:

library(caret)
library(doMC)
registerDoMC(cores = 4)

ytrain <- factor(sample(0:9, 1000, replace=TRUE))
xtrain <- matrix(runif(252 * 1000,0 , 255), 1000, 252)

preProcValues <- preProcess(xtrain, method = c("center", "scale"))
transformerdxtrain <- predict(preProcValues, xtrain)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
svmFit <- train(transformerdxtrain[1:10,], ytrain[1:10], method = "svmradial")

I get this error:

Error in kernelMult(kernelf(object), newdata, xmatrix(object)[[p]], coef(object)[[p]]) : 
  dims [product 20] do not match the length of object [0]
In addition: Warning messages:
1: In train.default(transformerdxtrain[1:10, ], ytrain[1:10], method = "svmradial") :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1, X2, X3, X4, X5, X6, X7, X8, X9
2: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,  :
  There were missing values in resampled performance measures.

Can somebody tell me what I am doing wrong? Thank you!


Solution

You have 10 different classes, yet you are only passing 10 cases to train(). This means that individual resamples will frequently be missing some of the 10 classes, so the SVMs fitted on them do not all see the same set of classes. train() then has trouble combining the results of these varying-class fits, which is why you see missing values in the resampled performance measures.

You can fix this by some combination of increasing the number of cases, decreasing the number of classes, or even using a different classifier.
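
For example (a minimal sketch, not from the original answer, reusing the dummy data and object names from the question): train on all 1,000 rows and actually pass the fitControl object, so every resample contains every class. Note that caret's kernlab radial-basis model is registered as "svmRadial", with a capital R:

library(caret)

set.seed(1)
ytrain <- factor(sample(0:9, 1000, replace = TRUE))
xtrain <- matrix(runif(252 * 1000, 0, 255), 1000, 252)

preProcValues <- preProcess(xtrain, method = c("center", "scale"))
transformerdxtrain <- predict(preProcValues, xtrain)

# 10-fold CV on 1000 cases: each training fold has ~900 rows,
# so all 10 classes are present in every resample
fitControl <- trainControl(method = "cv", number = 10)

svmFit <- train(transformerdxtrain, ytrain,
                method = "svmRadial",
                trControl = fitControl)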

OTHER TIPS

I found it challenging to use caret with the digit recognition use case. I think part of the problem is that the label data is numeric. When caret tries to create class-level variable names from it, they end up starting with a digit, which is not a valid R variable name.

In my case, I got around it by recoding the numeric labels to words using dplyr. This assumes your training data has been read into a data frame called "train", with the label in the first column followed by the 784 pixel columns (as in the Kaggle Digit Recognizer data).

# recode the numeric label into a temporary text column label2
train$label2 <- dplyr::recode(train$label,
                              `0` = "zero", `1` = "one", `2` = "two", `3` = "three",
                              `4` = "four", `5` = "five", `6` = "six", `7` = "seven",
                              `8` = "eight", `9` = "nine")

# rearrange the columns so the new label2 sits alongside the original label
train <- train[, c(1, 786, 2:785)]
head(train)

# replace label with the factorized version of the recoded label2
train$label <- factor(train$label2)

# drop label2, since it was only a temporary variable
train$label2 <- NULL

# view the result
head(train)
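
As a further alternative (a sketch, not part of the answer above): the warning in the question already points at a shorter route — base R's make.names() converts the numeric class levels into syntactically valid names (X0, X1, ..., X9) directly on the factor:

ytrain <- factor(sample(0:9, 1000, replace = TRUE))
levels(ytrain) <- make.names(levels(ytrain))   # levels become "X0", "X1", ..., "X9"
head(ytrain)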

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow