Question

I have implemented the Perceptron Learning Algorithm in Python as below. Even with 500,000 iterations, it still won't converge.

I have a training data matrix X with target vector Y, and a weight vector w to be optimized.

My update rule is:

import numpy as np
import random

while True:
    # dot product to check every training example for a mistake
    output = [np.sign(np.dot(X[i], w)) == Y[i] for i in range(len(X))]
    if all(output):  # no mistakes left
        break

    # find the index of a mistake (chosen randomly to avoid repeating the same index)
    n = random.randint(0, len(X) - 1)
    while output[n]:  # if this example is classified correctly, choose again
        n = random.randint(0, len(X) - 1)

    # once we have found a mistake, update the weights
    w = w + Y[n] * X[n]

Is this wrong? Or why is it not converging even after 500,000 iterations?


Solution

In their 1969 book "Perceptrons", Minsky and Papert (in)famously demonstrated that the perceptron learning algorithm is not guaranteed to converge for datasets that are not linearly separable.

If you're sure that your dataset is linearly separable, you might try adding a bias to each of your data vectors, as described in the question "Perceptron learning algorithm not converging to 0": adding a bias can help model decision boundaries that do not pass through the origin.
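For example, a minimal sketch of that augmentation (my own illustration, assuming X is a NumPy array with one data vector per row, and that w is re-initialized to match) is to append a constant 1 to every vector, so the bias becomes just one more learned weight:

import numpy as np

X = np.array([[-1.0, 1.0],
              [ 2.0, 0.5]])           # toy data, one vector per row

# Append a constant 1 feature to every row; the matching extra weight in w
# then acts as the bias, so the decision boundary no longer has to pass
# through the origin.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.zeros(X_aug.shape[1])          # one extra weight slot for the bias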

Alternatively, if you'd like to use a variant of the perceptron learning algorithm that is guaranteed to converge to a margin of specified width, even for datasets that are not linearly separable, have a look at the Averaged Perceptron -- PDF. The averaged perceptron is an approximation to the voted perceptron, which was introduced (as far as I know) in a nice paper by Freund and Schapire, "Large Margin Classification Using the Perceptron Algorithm" -- PDF.

Using an averaged perceptron, you make a copy of the parameter vector after each presentation of a training example during training. The final classifier uses the mean of all parameter vectors.
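A minimal sketch of that training loop (my own illustration, not code from the paper; it assumes X holds one feature vector per row, Y holds ±1 labels, and training runs for a fixed number of epochs) could look like this:

import numpy as np

def averaged_perceptron(X, Y, epochs=10):
    w = np.zeros(X.shape[1])      # current perceptron weights
    w_sum = np.zeros(X.shape[1])  # running sum of all weight snapshots
    count = 0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:  # misclassified (or on the boundary)
                w = w + y * x          # ordinary perceptron update
            w_sum += w                 # snapshot after every presentation
            count += 1
    return w_sum / count               # mean of all parameter vectors

Predictions are then made with np.sign(np.dot(x, w_avg)); the averaging damps the oscillations of the raw perceptron weights on data that is not cleanly separable.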

Other tips

The basic issue is that randomly chosen points are not necessarily linearly separable.

However, there is a worse problem in the algorithm:

Even a good reference like Vapnik's "Statistical Learning Theory" will not point you to the biggest problem in the OP's algorithm. The issue is NOT the learning rate parameter, either. It is trivial to prove that the learning rate has no real effect on whether or not the algorithm converges, because it is simply a positive scalar.
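A short sketch of that argument (assuming the weights start at zero and the learning rate $\eta$ is a fixed positive constant):

$$w \leftarrow w + \eta\, y_n x_n,\quad w_0 = 0 \;\;\Longrightarrow\;\; w_t = \eta \sum_{k=1}^{t} y_{n_k} x_{n_k},$$

$$\operatorname{sign}(w_t \cdot x) = \operatorname{sign}\Big(\sum_{k=1}^{t} y_{n_k}\, (x_{n_k} \cdot x)\Big) \quad \text{for every } \eta > 0,$$

so every positive learning rate produces exactly the same sequence of predictions, and therefore the same sequence of mistakes and updates.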

Imagine a set of four points to classify: the corners of a square, each a vector of length sqrt(2) from the origin. The "in class" point is (-1,1). The "out of class" points are (-1,-1), (1,1), and (1,-1). Regardless of the learning rate, the OP's original algorithm will never converge on these points: any weight vector w = (a, b) would have to satisfy both a + b > 0 (so that (-1,-1) is classified "out of class") and a + b < 0 (so that (1,1) is classified "out of class") at the same time, which is impossible.

The original poster's algorithm fails to converge on such data because of the missing bias term (effectively the coefficient of the 0th-dimensional term), which MUST augment the other dimensional terms. Without the bias term, the perceptron is not completely defined. This is trivial to verify by hand-modeling a perceptron in 1D or 2D space.
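As an illustration (my own sketch, not the original poster's code), the same update loop run on those four points, with a constant 1 appended to every vector as the bias component, terminates after a handful of updates:

import numpy as np
import random

# The four corner points, with a constant 1 appended as the bias component.
X = np.array([[-1.0,  1.0, 1.0],   # "in class"
              [-1.0, -1.0, 1.0],   # "out of class"
              [ 1.0,  1.0, 1.0],   # "out of class"
              [ 1.0, -1.0, 1.0]])  # "out of class"
Y = np.array([1.0, -1.0, -1.0, -1.0])

w = np.zeros(3)
while True:
    output = [np.sign(np.dot(X[i], w)) == Y[i] for i in range(len(X))]
    if all(output):
        break
    n = random.randint(0, len(X) - 1)
    while output[n]:
        n = random.randint(0, len(X) - 1)
    w = w + Y[n] * X[n]

print(w)  # first two components separate (-1, 1) from the other corners;
          # the third component is the learned bias

Dropping the third column of X (and the third weight) reproduces the non-convergence described above.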

The bias term is often hand-waved in the literature as a way to "shift" the hyperplane, but that seems to be mostly because perceptrons tend to be taught and learned in 2D space. "Shifting" does not adequately explain why the bias term is needed in high-dimensional spaces (what does it mean to "shift" a 100-dimensional hyperplane?).

One may notice that the literature proving the mean convergence time of the perceptron excludes the bias term. This is because the perceptron equation can be simplified if you assume the perceptron will converge (see Vapnik, 1998, Wiley, p. 377). That is a big (but necessary) assumption for the proof, but one cannot adequately implement a perceptron by coding up that simplified, incomplete form.

Albert B. J. Novikoff's 1962/1963 proofs of perceptron convergence do include this zero-dimensional (bias) term.

Licensed under: CC-BY-SA with attribution