Question

I am trying, for the first time, to build and train a multilayer perceptron neural network that correctly predicts which president won in which county. I have the following information as training data.

Total population, Median age, % Bachelor's degree or higher, Unemployment rate, Per capita income, Total households, Average household size, % Owner-occupied housing, % Renter-occupied housing, % Vacant housing, Median home value, Population growth, Household growth, Per capita income growth, Winner

That's 14 columns of training data and the 15th column is what the output should be.

I am trying to use Keras to build a multilayer perceptron neural network, but I need some help understanding a few of its properties and the pros and cons of the different options for each.

  1. ACTIVATION FUNCTION

I know my first step is to come up with an activation function. The neural networks I studied always used sigmoid activation functions. Is a sigmoid activation function the best choice? How do you know which one to use? Keras additionally gives the options of softmax, softplus, relu, tanh, linear, and hard_sigmoid activation functions. I'm okay with using any of them, but I want to understand why, and the pros and cons of each.

  2. PROBABILITY INITIALIZATIONS

I know initializations define the probability distribution used to set the initial random weights of Keras layers. The options Keras gives are uniform, lecun_uniform, normal, identity, orthogonal, zero, glorot_normal, glorot_uniform, he_normal, and he_uniform. How does my selection here impact my end result or model? Shouldn't it not matter, since we are "training" whatever random model we start with and arriving at a more optimal weighting of the layers anyway?


Solution

1) Activation is an architecture choice, which boils down to a hyperparameter choice. You can make a theoretical argument for using any particular function, but the best way to decide is to try several and evaluate them on a validation set. It's also worth remembering that you can mix and match activations across layers.
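For example, here is a minimal sketch (not part of the original answer, written against the current tensorflow.keras API) of comparing a few hidden-layer activations on a held-out validation split. The county-level data loading is stubbed out with random placeholders, and the layer sizes are arbitrary assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_activation):
    """Small MLP over the 14 county features; hidden activation is swappable."""
    model = keras.Sequential([
        layers.Input(shape=(14,)),
        layers.Dense(32, activation=hidden_activation),
        layers.Dense(16, activation="tanh"),        # activations can be mixed per layer
        layers.Dense(1, activation="sigmoid"),      # binary output: probability of the winner
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical placeholders; replace with the real county-level features and labels.
X = np.random.rand(1000, 14).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))

for act in ["relu", "tanh", "sigmoid"]:
    model = build_model(act)
    history = model.fit(X, y, epochs=20, batch_size=32,
                        validation_split=0.2, verbose=0)
    print(act, "best val accuracy:", max(history.history["val_accuracy"]))
```

Whichever activation scores best on the validation split is the one worth keeping; the output layer stays sigmoid (or softmax if there are more than two candidates) because it has to produce a class probability.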

2) In theory, yes: many random initializations would perform the same if your data were extremely well behaved and your network ideal. In practice, though, initializations seek to ensure the gradient starts off at a reasonable scale and the signal can be backpropagated correctly. In this case any of those initializations would likely perform similarly, but the best approach is to try them out and switch if you get undesirable results.
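As a minimal sketch (same hypothetical placeholder data as above): in the current tensorflow.keras API the initializer is passed per layer via kernel_initializer, so comparing several is just another loop over string names:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder features and binary winner labels; replace with the real data.
X = np.random.rand(1000, 14).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))

for init in ["glorot_uniform", "he_uniform", "lecun_uniform"]:
    model = keras.Sequential([
        layers.Input(shape=(14,)),
        layers.Dense(32, activation="relu", kernel_initializer=init),
        layers.Dense(1, activation="sigmoid", kernel_initializer=init),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=20, batch_size=32,
                        validation_split=0.2, verbose=0)
    print(init, "best val accuracy:", max(history.history["val_accuracy"]))
```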

Licensed under: CC-BY-SA with attribution