Momentum is not always good, it can help the model to jump out of the a bad valley but may also make the model to jump out of a good valley. Especially when the previous weights update directions is not good.
There are several reasons that make your model not work well. The parameter are not well set, it is always a non-trivial task to set the parameters of the MLP.
An easy way is to first set the learning rate, momentum weight and regularization weight to a big number, but to set the iteration (or epoch) to a very large weight. Once the model diverge, half the learning rate, momentum weight and regularization weight.
This approach can make the model to slowly converge to a local optimal, and also give the chance for it to jump out a bad valley.
Moreover, in my opinion, one output neuron is enough for two class problem. There is no need to increase the complexity of the model if it is not necessary. Similarly, if possible, use a three-layer MLP instead of a four-layer MLP.