Question

I have successfully implemented a kernel perceptron classifier that uses an RBF kernel. I understand that the kernel trick maps features to a higher dimension so that a linear hyperplane can be constructed to separate the points. For example, if you have features (x1, x2) and map them to a 3-dimensional feature space, you might get: phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2).

If you plug that into the perceptron decision function w'x + b = 0, you end up with w1*x1^2 + w2*sqrt(2)*x1*x2 + w3*x2^2 + b = 0, which gives you a circular (or more generally, elliptical) decision boundary in the original space.
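For concreteness, here is a tiny numeric check of the map I have in mind (just a sketch; the particular numbers are arbitrary):

    import numpy as np

    # quadratic map from the example above; its inner product equals (x . y)^2
    def phi(v):
        x1, x2 = v
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(phi(a) @ phi(b))   # dot product of the explicit 3-D features
    print((a @ b) ** 2)      # the same number, computed from the 2-D inputs alone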

While the kernel trick itself is very intuitive, I am not able to understand the linear algebra aspect of this. Can someone help me understand how we are able to map all of these additional features without explicitly specifying them, using just the inner product?

Thanks!


Solution

Simple.

Give me the numeric result of (x+y)^10 for some values of x and y.

What would you rather do: "cheat" by summing x + y and then raising that value to the 10th power, or expand out the exact result, writing out

x^10+10 x^9 y+45 x^8 y^2+120 x^7 y^3+210 x^6 y^4+252 x^5 y^5+210 x^4 y^6+120 x^3 y^7+45 x^2 y^8+10 x y^9+y^10

and then compute each term and add them all together? Clearly we can evaluate the dot product in the space of degree-10 features without ever forming those features explicitly.
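A quick numeric check of that point (just a sketch, with arbitrary values of x and y):

    from math import comb

    x, y = 1.3, 0.7

    cheat = (x + y) ** 10    # the "cheat": one addition, one power

    # the expanded form: eleven terms, each with its own powers and binomial coefficient
    expanded = sum(comb(10, k) * x ** (10 - k) * y ** k for k in range(11))

    print(cheat, expanded)   # same number, very different amounts of work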

Valid kernels are dot products where we can "cheat" and compute the numeric result between two points without having to form their explicit feature values. There are many such possible kernels, though only a few are widely used in papers and in practice.
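For example, here are two of the most commonly used ones, written as plain Python functions (a sketch; the parameter names and defaults are my own choice):

    import numpy as np

    def polynomial_kernel(x, y, degree=3, c=1.0):
        # equals the dot product of all monomial features up to the given degree,
        # without ever writing those features out
        return (x @ y + c) ** degree

    def rbf_kernel(x, y, gamma=1.0):
        # corresponds to an inner product in an infinite-dimensional feature space
        return np.exp(-gamma * np.sum((x - y) ** 2))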

Other tips

I'm not sure if I'm answering your question, but as I remember, the "trick" is that you never explicitly calculate the mapped feature vectors. The perceptron finds a straight line that separates the clusters. To get curved lines or even circles, instead of changing the perceptron you can change the space that contains the clusters. This is done with a transformation, usually called phi, that maps coordinates from one space to another. The perceptron algorithm is then applied in the new space, where it produces a straight line; when that line is mapped back to the original space, it can be curved.
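A tiny toy example of that picture (my own choice of phi and data, just a sketch): points inside versus outside a circle cannot be split by a line in 2-D, but after the lift phi(x1, x2) = (x1, x2, x1^2 + x2^2) an ordinary perceptron can find a separating plane.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)   # label: inside vs outside the unit circle

    def phi(X):
        # lift to 3-D: the extra coordinate is the squared radius
        return np.column_stack([X, np.sum(X**2, axis=1)])

    # an ordinary perceptron, run in the lifted space
    Z = phi(X)
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(50):
        for z, t in zip(Z, y):
            if t * (w @ z + b) <= 0:
                w += t * z
                b += t

    print(np.mean(np.sign(Z @ w + b) == y))   # ~1.0: a plane in 3-D separates what no line in 2-D can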

The trick is that the perceptron only needs to know the inner products of the points of the clusters it is trying to separate. This means we only need to be able to calculate the inner product of the transformed points. This is what the kernel provides: K(x, y) = <phi(x), phi(y)>, where <., .> is the inner product in the new space. So there is no need to do all the transformations to the new space and back; we don't even need to know explicitly what the transformation phi is. All that is needed is that K defines an inner product in some space, and we hope that this inner product and space are useful for separating our clusters.
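Here is a minimal sketch of what that looks like for a kernel perceptron (my own toy implementation, not necessarily how yours is written): the dual coefficients alpha replace the explicit weight vector, and only kernel values K(x_i, x_j) are ever evaluated.

    import numpy as np

    def rbf(x, y, gamma=1.0):
        # an inner product in an implicit (infinite-dimensional) feature space
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def train_kernel_perceptron(X, y, kernel, epochs=10):
        n = len(X)
        alpha = np.zeros(n)   # one dual coefficient per training point
        # precompute the Gram matrix of all pairwise kernel values
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        for _ in range(epochs):
            for i in range(n):
                # the decision value uses only kernel evaluations, never phi(x) itself
                if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                    alpha[i] += 1.0
        return alpha

    def predict(X_train, y_train, alpha, kernel, x):
        s = sum(a * t * kernel(xi, x) for a, t, xi in zip(alpha, y_train, X_train))
        return 1 if s > 0 else -1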

I think there is a theorem (Cover's theorem, if I remember correctly) which says that if the space represented by the kernel has a higher dimensionality than the original space, it is more likely that the clusters will be linearly separable there.

There is really not much to it.

The weight vector in the higher-dimensional space is w = sum_i{a_i * Phi(x_i)}

and the input vector in the higher space is Phi(x),

so that the linear classification in the higher space is

w^t * Phi(x) + c > 0

so if you put these together

(sum_i{a_i * Phi(x_i)})^t * Phi(x) + c = sum_i{a_i * Phi(x_i)^t * Phi(x)} + c > 0

The last dot product's computational complexity is linear in the number of dimensions of the feature space, which can be enormous or even infinite (often intractable, or simply not wanted).
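To get a feel for how big that number of dimensions can be (my own example, using a polynomial kernel, and only a sketch of the counting): for the homogeneous polynomial kernel (x^t y)^p the explicit feature space consists of all degree-p monomials of the input coordinates.

    from math import comb

    n, p = 100, 10                      # 100 input dimensions, degree-10 polynomial kernel
    explicit_dims = comb(n + p - 1, p)  # number of degree-10 monomials in 100 variables
    print(explicit_dims)                # about 4.3 * 10**13 features, versus one 100-D dot product and one power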

We solve this by switching to the kernel, the "magic answer to the dot product":

K(x_i, x) = Phi(x_i)^t * Phi(x)

which gives

sum_i{a_i * K(x_i, x)} + c > 0
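In code, that last expression is simply the following (a sketch with my own variable names; here a_i already absorbs the label of x_i):

    def decision(a, c, X_train, x, K):
        # sum_i a_i * K(x_i, x) + c > 0  -- Phi is never computed explicitly
        return sum(a_i * K(x_i, x) for a_i, x_i in zip(a, X_train)) + c > 0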

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow