Problem

I am having trouble plotting my SVM model in R. The model call is:

svm_linear <- svm(open ~ review_count + recession + duration + count + stars + Freq + avgRev + avgStar, data=yelp_train, cost=100, gamma=1)
plot(svm_linear, data=yelp_train)

I can't figure out why nothing appears after running the plot function. Please help. I've added the dput output below.

I dropped some of the extra columns to keep it short.

newdata <- cleanDataFrame[2:10]
set.seed(10)
(newdata[sample(1:nrow(newdata), 30),])

structure(list(open = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
 1L, 1L, 1L, 0L), review_count = c(3L, 5L, 6L, 38L, 6L, 4L, 5L, 
 23L, 19L, 3L, 22L, 74L, 15L, 38L, 88L, 26L, 9L, 3L, 58L, 4L, 
13L, 117L, 38L, 10L, 5L, 6L, 102L, 108L, 264L, 103L), stars = c(3, 
4, 4.5, 4, 3, 3, 3, 4, 3.5, 3.5, 3.5, 4.5, 4.5, 4, 2.5, 3.5, 
3.5, 3.5, 4, 3, 4.5, 4.5, 4, 3.5, 4, 3.5, 4, 3, 3.5, 4), Freq = c(166L, 
12L, 166L, 15L, 45L, 166L, 66L, 79L, 33L, 58L, 150L, 389L, 150L, 
1L, 389L, 20L, 389L, 389L, 389L, 166L, 74L, 0L, 389L, 32L, 389L, 
161L, 126L, 389L, 98L, 3L), avgRev = c(23.7904191616766, 18.7692307692308, 
23.7904191616766, 98, 78.804347826087, 23.7904191616766, 31.3283582089552, 
64.3375, 23.1764705882353, 23.6949152542373, 60.6490066225166, 
34.1923076923077, 60.6490066225166, 22, 34.1923076923077, 33.1904761904762, 
34.1923076923077, 34.1923076923077, 34.1923076923077, 30.8443113772455, 
27.6533333333333, 117, 34.1923076923077, 30.4545454545455, 34.1923076923077, 
37.2716049382716, 47.3149606299213, 34.1923076923077, 64.3838383838384, 
73.75), avgStar = c(3.53592814371257, 3.92307692307692, 3.53592814371257, 
3.96875, 3.6195652173913, 3.53592814371257, 3.69402985074627, 
3.58125, 3.5, 3.67796610169492, 3.63245033112583, 3.5551282051282, 
3.63245033112583, 4, 3.5551282051282, 3.78571428571429, 3.5551282051282, 
3.5551282051282, 3.5551282051282, 3.48203592814371, 3.72666666666667, 
4.5, 3.5551282051282, 3.65151515151515, 3.5551282051282, 3.43827160493827, 
3.63385826771654, 3.5551282051282, 3.60606060606061, 4.25), count = c(4L, 
2L, 5L, 5L, 0L, 2L, 5L, 0L, 2L, 8L, 3L, 15L, 4L, 3L, 15L, 14L, 
1L, 1L, 0L, 1L, 2L, 0L, 0L, 50L, 1L, 27L, 4L, 51L, 36L, 14L), 
recession = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), duration = c(332L, 427L, 614L, 117L, 1894L, 
1346L, 140L, 1909L, 1100L, 1030L, 1666L, 2096L, 1054L, 352L, 
2145L, 1018L, 1763L, 391L, 2116L, 1567L, 693L, 674L, 1626L, 
301L, 295L, 378L, 649L, 376L, 1028L, 2390L)), .Names = c("open", 
"review_count", "stars", "Freq", "avgRev", "avgStar", "count", 
"recession", "duration"), row.names = c(1439L, 870L, 1210L, 1962L, 
242L, 639L, 777L, 771L, 1741L, 1214L, 1840L, 1603L, 322L, 1681L, 
1010L, 1209L, 148L, 745L, 1124L, 2354L, 2433L, 1731L, 2180L, 
1000L, 1141L, 1985L, 2814L, 674L, 2163L, 999L), class = "data.frame")

Solution

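It looks like you're trying to do classification, but your outcome variable is stored as an integer (run str(yelp_train) to check). With a numeric outcome, svm fits a regression model, and the plot method for svm objects only draws classification models, which is why nothing appears. Turn the outcome into a factor and then try your plot again. For example: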

yelp_train$openF = factor(yelp_train$open)

svm_linear <- svm(openF ~ review_count + recession + duration + count + stars + Freq + avgRev +
                         avgStar, data=yelp_train, cost=100, gamma=1)

plot(svm_linear, formula = review_count ~ Freq, data=yelp_train)
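To see why the outcome's type matters, here is a minimal sketch on simulated data. The `toy` data frame and its values are made up for illustration and are not the asker's data. It assumes the e1071 package and, as far as I know, that the fitted object's `type` component is a numeric code where values below 3 denote the classification types.

```r
library(e1071)  # provides svm() and the plot method for svm objects

# Simulated stand-in for yelp_train (hypothetical values, not the real data)
set.seed(1)
toy <- data.frame(
  open         = rbinom(60, 1, 0.7),  # integer 0/1 outcome, as in the question
  review_count = rpois(60, 30),
  Freq         = rpois(60, 150)
)

# Integer outcome: svm() defaults to a regression model
fit_int <- svm(open ~ review_count + Freq, data = toy)

# Factor outcome: svm() defaults to a classification model
toy_fac <- transform(toy, open = factor(open))
fit_fac <- svm(open ~ review_count + Freq, data = toy_fac)

# Assumption: type codes below 3 are the classification types
is_classifier <- c(integer_outcome = fit_int$type < 3,
                   factor_outcome  = fit_fac$type < 3)
print(is_classifier)

# Only the classification fit is expected to produce a picture
plot(fit_fac, data = toy_fac, formula = review_count ~ Freq)
```

Replacing the outcome column in place (rather than adding a second column) keeps the data frame to the response plus the predictors, which is the layout the plot method appears to expect.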

One other thing. In the portion of the data you provided, recession is always zero. If this is the case with all of the data, then remove recession from your call to svm. I had to do this to avoid an error. Once I removed recession, I was able to run the model and plot several combinations of variables successfully.
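A quick way to check for constant predictors such as recession before fitting is sketched below in base R. The `toy` data frame is a small made-up example, and `is_constant`, `constant_cols` and `toy_clean` are names introduced here for illustration.

```r
# Small made-up data frame; `recession` is constant, as in the posted sample
toy <- data.frame(
  review_count = c(3L, 5L, 38L, 6L),
  recession    = c(0L, 0L, 0L, 0L),
  duration     = c(332L, 427L, 117L, 1894L)
)

# TRUE for every column that holds a single distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))

# Names of the columns that carry no information for the model
constant_cols <- names(toy)[is_constant]
print(constant_cols)

# Keep only the columns that actually vary
toy_clean <- toy[, !is_constant, drop = FALSE]
```

Run the same check on the full yelp_train before deciding to drop a column, since a 30-row sample can make a rare value look absent.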

Question in Comments: Why isn't open the dependent variable in the formula in the plot function?

You're plotting where the decision boundary lies in relation to the values of two of the independent variables (or "features" in machine learning lingo). The predicted value of the dependent variable, open, is given by the fill colors: in this case, one color for open=1 and another for open=0. The boundary between the two colors is the decision boundary that the svm model came up with. The plot also includes points representing the pairs of values of the two features used for the plot. In e1071's plot method, the color of each point shows its actual value of open, and the marker shape distinguishes support vectors ("x" by default) from the other observations ("o" by default). Comparing each point's class with the fill region it falls in shows how many points were properly classified and how many were misclassified by your model.

The full decision boundary is a hyperplane in a multi-dimensional space. For example, if you had 3 features in the model, the features would lie in a 3-dimensional space (imagine a 3D scatterplot) and the decision boundary would be a 2-dimensional hyperplane through that 3D space (which we of course refer to as a "plane" in this case; and in general, the decision boundary has dimension one less than the dimension of the feature space).
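For a linear kernel, the dimension count above can be written compactly. The symbols w (weight vector), b (intercept) and p (number of features) are notation introduced here, not taken from the answer:

```latex
\[
  H = \{\, x \in \mathbb{R}^{p} : w^{\top} x + b = 0 \,\},
  \qquad \dim H = p - 1 .
\]
```

Note that svm in e1071 uses a radial kernel unless told otherwise. In that case the boundary is a hyperplane only in the kernel-induced feature space and is generally curved when drawn in the original features.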

When you plot two features, you're looking at a two-dimensional slice through that multi-dimensional space. The plot function holds the other features at fixed values: according to the e1071 documentation for plot.svm, the defaults are 0 for numeric variables and the first level for factors. You can set those values yourself with the slice argument, a named list of values for the features you are not plotting. That allows you to see how the decision boundary for two particular features varies based on changes in the values of other features.
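The effect of the slice argument can be sketched as follows, again on simulated data. The `toy` data frame, the `score` helper and the chosen duration values are assumptions made for illustration, constructed so that the outcome depends on a feature that is not plotted.

```r
library(e1071)  # provides svm() and the plot method for svm objects

# Simulated data (hypothetical): the outcome depends on review_count and duration
set.seed(1)
n <- 80
toy <- data.frame(
  review_count = rpois(n, 30),
  Freq         = rpois(n, 150),
  duration     = round(runif(n, 100, 2400))
)
score <- as.numeric(scale(toy$review_count)) +
  as.numeric(scale(toy$duration)) + rnorm(n, sd = 0.5)
toy$openF <- factor(as.integer(score > 0))

fit <- svm(openF ~ review_count + Freq + duration, data = toy)

# Same two plotted features, two different fixed values of duration
plot(fit, data = toy, formula = review_count ~ Freq,
     slice = list(duration = 500))
plot(fit, data = toy, formula = review_count ~ Freq,
     slice = list(duration = 2000))
```

Comparing the two plots shows how the filled regions shift as the held-out feature changes. Choosing slice values inside the observed range of each feature is usually more informative than the default of 0.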

You might find the SVM chapter of An Introduction to Statistical Learning useful for additional info (you can download it at no charge).

License: CC-BY-SA with attribution
Not affiliated with StackOverflow