Strictly speaking, you don't need a sigmoid activation function. What you need is a differentiable function that serves as an approximation to the step function. As an alternative to the sigmoid, you could instead use a hyperbolic tangent function.
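The interchangeability of these activations can be sketched as follows. This is an illustrative example (the function names are my own): both the sigmoid and the hyperbolic tangent are smooth approximations to the step function, and the tanh, rescaled to the interval (0, 1), coincides with the sigmoid exactly.

```python
import math

def step(x):
    """The (non-differentiable) step function used by the basic perceptron."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Logistic sigmoid: smooth, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_rescaled(x):
    """Hyperbolic tangent rescaled from (-1, 1) to (0, 1).

    Algebraically, (tanh(x/2) + 1) / 2 equals 1 / (1 + exp(-x)),
    i.e. the sigmoid itself.
    """
    return (math.tanh(x / 2.0) + 1.0) / 2.0

# Far from the origin, both smooth functions agree with the step function:
for x in (-6.0, 6.0):
    print(x, step(x), round(sigmoid(x), 4), round(tanh_rescaled(x), 4))
```

Either choice works for learning; the tanh's output range of (-1, 1) (before rescaling) is sometimes preferred because it is centered on zero.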
For multi-layer perceptron networks, the simple perceptron learning rule provides no means of determining how a weight several layers from the output should be adjusted in response to a given output error. The backpropagation learning rule relies on the fact that the sigmoid function is differentiable, which makes it possible to characterize the rate of change of the output-layer error with respect to a change in a particular weight, even if that weight is several layers away from the output. Note that as the k
parameter of the sigmoid tends toward infinity, the sigmoid approaches the step function, which is the activation function used in the basic perceptron.
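The two points above can be illustrated together. In this sketch (the parameter name k follows the text; the function names are my own), the sigmoid with gain k is sigma_k(x) = 1 / (1 + exp(-k*x)); its derivative, which backpropagation needs, has the closed form k * sigma_k(x) * (1 - sigma_k(x)); and increasing k drives sigma_k toward the step function.

```python
import math

def sigmoid_k(x, k=1.0):
    """Sigmoid with gain parameter k: 1 / (1 + exp(-k*x))."""
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_k_derivative(x, k=1.0):
    """d/dx of sigmoid_k, in the closed form k * s * (1 - s)."""
    s = sigmoid_k(x, k)
    return k * s * (1.0 - s)

# As k grows, sigmoid_k(x) at a fixed x > 0 approaches 1, the value the
# step function takes there; at x < 0 it approaches 0.
for k in (1, 10, 100):
    print(k, round(sigmoid_k(0.5, k), 6), round(sigmoid_k(-0.5, k), 6))
```

The closed-form derivative is what makes backpropagation cheap: once the forward pass has computed sigma_k(x), the gradient at that unit costs only a multiplication, no matter how many layers separate the unit from the output.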