Question

I'm cross-posting this from math.stackexchange.com because I'm not getting any feedback and it's a time-sensitive question for me.


My question pertains to linear separability with hyperplanes in a support vector machine.

According to Wikipedia:

...formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

The linear separation of classes by hyperplanes makes intuitive sense to me, and I think I understand linear separability in two-dimensional geometry. However, I'm implementing an SVM using a popular SVM library (libSVM), and when I experiment with the numbers I can't see how an SVM can draw a curve between classes, or enclose a central cluster of category-1 points inside a circular boundary when they are surrounded by category-2 points, given that a hyperplane in an n-dimensional space V is a "flat" subset of dimension n − 1, i.e. a 1-D line in two-dimensional space.

Here is what I mean:

circularly enclosed class separation for a 2D binary SVM

That's not a hyperplane. That's circular. How does this work? Or are there more dimensions inside the SVM than the two-dimensional input features?


This example application can be downloaded here.


Edit:

Thanks for your comprehensive answers. So the SVM can separate weird data well by using a kernel function. Would it help to linearize the data before sending it to the SVM? For example, one of my input features (a numeric value) has a turning point (e.g. 0) where it neatly fits into category 1, but above and below zero it fits into category 2. Since I know this, would it help classification to send the absolute value of this feature to the SVM?
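For concreteness, here is a minimal sketch of the idea in this edit. The synthetic data, the class layout, and the use of scikit-learn's `SVC` (which is built on libSVM) are assumptions for illustration, not something from the original question:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1-D feature: values near 0 belong to category 1,
# values far from 0 on either side belong to category 2.
rng = np.random.default_rng(0)
x_cat1 = rng.uniform(-0.5, 0.5, 50)
x_cat2 = np.concatenate([rng.uniform(-3.0, -1.5, 25),
                         rng.uniform(1.5, 3.0, 25)])
X = np.concatenate([x_cat1, x_cat2]).reshape(-1, 1)
y = np.array([1] * 50 + [2] * 50)

# Raw feature: no single linear threshold can split "near zero" from "far from zero".
raw_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# |feature|: category 1 now sits below ~1 and category 2 above ~1.5,
# so one threshold is enough for a linear classifier.
X_abs = np.abs(X)
abs_acc = SVC(kernel="linear").fit(X_abs, y).score(X_abs, y)

print(f"linear SVM on raw feature: {raw_acc:.2f}")   # well below 1.0
print(f"linear SVM on |feature|:   {abs_acc:.2f}")   # 1.00
```

The answers below make the related point that a non-linear kernel (e.g. RBF) can often learn such a boundary on its own, without the manual transform.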


Solution

As mokus explained, support vector machines use a kernel function to implicitly map data into a feature space where they are linearly separable:

SVM mapping one feature space into another

Different kernel functions are used for various kinds of data. Note that an extra dimension (feature) is added by the transformation in the picture, although this feature is never materialized in memory.

(Illustration from Chris Thornton, U. Sussex.)
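As a concrete sketch of the same point (the concentric-circles dataset and its parameters are made up here for illustration; scikit-learn's `SVC` is used because it is built on the same libSVM the question mentions): a linear kernel cannot separate a "circle inside a ring" layout like the one in the question, while an RBF kernel handles it by implicitly working in a higher-dimensional feature space.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes, similar to the circular picture in the question.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel training accuracy:", linear_svm.score(X, y))  # around chance
print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```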

OTHER TIPS

Check out this YouTube video that illustrates an example of linearly inseparable points that become separable by a plane when mapped to a higher dimension.


I am not intimately familiar with SVMs, but from what I recall from my studies they are often used with a "kernel function" - essentially, a replacement for the standard inner product that effectively non-linearizes the space. It's loosely equivalent to applying a nonlinear transformation from your space into some "working space" where the linear classifier is applied, and then pulling the results back into your original space, where the linear subspaces the classifier works with are no longer linear.

The Wikipedia article mentions this in the subsection "Non-linear classification", with a link to http://en.wikipedia.org/wiki/Kernel_trick, which explains the technique more generally.
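To make the "replacement for the standard inner product" idea concrete, here is a small numeric check (my own example, not from either answer): the quadratic kernel K(x, z) = (x·z)^2 gives exactly the ordinary dot product of the points after an explicit quadratic feature map into 3-D, which is why the SVM never needs to build that map explicitly.

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def quad_kernel(x, z):
    """Quadratic kernel: square of the inner product in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(quad_kernel(x, z))       # 1.0, computed without ever leaving 2-D
print(np.dot(phi(x), phi(z)))  # 1.0, the same value via the explicit 3-D map
```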

This is done by applying what is known as the [kernel trick](http://en.wikipedia.org/wiki/Kernel_trick). Basically, if something is not linearly separable in the existing input space (2-D in your case), it is projected to a higher dimension where it becomes separable. A kernel function (which can be non-linear) is applied to modify your feature space, and all computations are then performed in this feature space (which can even be infinite-dimensional).

Each point in your input is transformed using this kernel function, and all further computations are performed as if the transformed space were your original input space. Your points may thus be separable in a higher (possibly infinite) dimension, and the hyperplane that is linear in that higher-dimensional space need not be linear in the original dimensions.

For a simple example, consider XOR. If you plot Input1 on the X-axis and Input2 on the Y-axis, the output classes are:

  1. Class 0: (0,0), (1,1)
  2. Class 1: (0,1), (1,0)

As you can observe, it's not linearly separable in 2-D. But if I take these ordered pairs into 3-D (by just moving one point up a dimension), say:

  1. Class 0: (0,0,1), (1,1,0)
  2. Class 1: (0,1,0), (1,0,0)

Now you can easily observe that there is a plane in 3-D to separate these two classes linearly.
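A quick numeric check of that claim (the plane coefficients below are hand-picked here, one of many that work, and are not part of the original answer): the plane x + y + 2z = 1.5 puts both lifted Class 0 points on one side and both Class 1 points on the other.

```python
import numpy as np

# The lifted XOR points from the example above.
class0 = np.array([[0, 0, 1], [1, 1, 0]])
class1 = np.array([[0, 1, 0], [1, 0, 0]])

# Candidate separating plane: w . x = b  (hand-picked for illustration).
w = np.array([1.0, 1.0, 2.0])
b = 1.5

print(class0 @ w - b)  # [ 0.5  0.5]  -> both on the positive side
print(class1 @ w - b)  # [-0.5 -0.5]  -> both on the negative side
```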

Thus if you project your inputs to a sufficiently large dimension (possibly infinite), then you'll be able to separate your classes linearly in that dimension.

One important point to notice here (and maybe I'll answer your other question too) is that you don't have to make a kernel function yourself (like I made one above). The good thing is that the kernel function automatically takes care of your input and figures out how to "linearize" it.
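To illustrate that point (the choice of scikit-learn's `SVC` and an RBF kernel is an assumption here, just to give a runnable sketch): you can feed the original 2-D XOR points to a kernel SVM directly, without writing the 3-D lift yourself, and it still classifies them correctly.

```python
import numpy as np
from sklearn.svm import SVC

# XOR in the original 2-D input space -- not linearly separable there.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf").fit(X, y)  # the kernel does the implicit lifting
print(clf.predict(X))              # [0 0 1 1] -- all four points recovered
```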

For the SVM example in the question, given the 2-D space, let x1 and x2 be the two axes. You can take a transformation F = x1^2 + x2^2 and turn this into a 1-D problem. If you look carefully, you can see that in the transformed space the points are easily linearly separable (a simple threshold on the F axis does it). Here the transformed space is [F], which is one-dimensional. In most cases, though, you would be increasing the dimensionality to obtain a linearly separating hyperplane.
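In code, that F = x1^2 + x2^2 transform might look like the sketch below (the concentric synthetic data and the threshold value are assumptions for illustration): inner points land at small F, outer points at large F, so a single cut on the F axis separates them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Category 1: an inner disc; category 2: an outer ring, as in the question's picture.
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([1] * 100 + [2] * 100)

# The 1-D transformed feature from the answer: F = x1^2 + x2^2.
F = X[:, 0] ** 2 + X[:, 1] ** 2

threshold = 2.0  # any value between 1 (max inner F) and 4 (min outer F) works here
pred = np.where(F < threshold, 1, 2)
print("accuracy with a single threshold on F:", (pred == y).mean())  # 1.0
```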

My answer to a previous question might shed some light on what is happening in this case. The example I give there is very contrived and not really what happens in an SVM, but it should give you some intuition.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow