Question

I am using TensorFlow for experiments, mainly with neural networks. Although I have done quite a few experiments (XOR problem, MNIST, some regression stuff, ...) by now, I struggle with choosing the "correct" cost function for specific problems, because overall I could still be considered a beginner.

Before coming to TensorFlow I coded some fully-connected MLPs and some recurrent networks on my own with Python and NumPy, but mostly I had problems where a simple squared error and simple gradient descent were sufficient.

However, since TensorFlow offers quite a lot of cost functions itself, as well as the ability to build custom cost functions, I would like to know if there is some kind of tutorial, maybe specifically about cost functions for neural networks? (I've already done about half of the official TensorFlow tutorials, but they don't really explain why specific cost functions or learners are used for specific problems - at least not for beginners.)

To give some examples:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_output, y_train))

I guess it applies the softmax function to both inputs so that the sum of each vector equals 1. But what exactly is cross entropy with logits? I thought it sums up the values and calculates the cross entropy... so some kind of metric?! Wouldn't this be very much the same if I normalized the output, summed it up and took the squared error? Additionally, why is this used e.g. for MNIST (or even much harder problems)? When I want to classify, say, 10 or maybe even 1000 classes, doesn't summing up the values completely destroy any information about which class actually was the output?

cost = tf.nn.l2_loss(vector)

What is this for? I thought the L2 loss was pretty much the squared error, but TensorFlow's API says that its input is just one tensor. I don't get the idea at all?!

Besides, I saw this for cross entropy pretty often:

cross_entropy = -tf.reduce_sum(y_train * tf.log(y_output))

...but why is this used? Isn't the cross-entropy loss mathematically this:

-1/n * sum(y_train * log(y_output) + (1 - y_train) * log(1 - y_output))

Where is the (1 - y_train) * log(1 - y_output) part in most TensorFlow examples? Isn't it missing?


I know this question is quite open, but I do not expect to get 10 pages with every single problem/cost function listed in detail. I just need a short summary about when to use which cost function (in general or in TensorFlow, doesn't matter much to me) and some explanation about this topic. And/or some source(s) for beginners ;)


Solution

This answer is on the general side of cost functions, not related to TensorFlow, and will mostly address the "some explanation about this topic" part of your question.

In most examples/tutorials I followed, the cost function used was somewhat arbitrary. The point was more to introduce the reader to a specific method than to the cost function specifically. That should not stop you from following the tutorials to become familiar with the tools, but my answer should help you choose the cost function for your own problems.

If you want answers regarding Cross-Entropy, Logit, L2 norms, or anything specific, I advise you to post multiple, more specific questions. This will increase the probability that someone with specific knowledge will see your question.


Choosing the right cost function for achieving the desired result is a critical point of machine learning problems. The basic approach, if you do not know exactly what you want out of your method, is to use Mean Squared Error (Wikipedia) for regression problems and the percentage of error for classification problems. However, if you want good results out of your method, you need to define what "good" means, and thus define an adequate cost function. This comes both from domain knowledge (what your data is, what you are trying to achieve) and from knowledge of the tools at your disposal.
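As a minimal, hedged illustration of those two defaults, here is a small NumPy sketch (the toy arrays are made up for this example):

    import numpy as np

    # Regression default: Mean Squared Error between predictions and targets.
    def mean_squared_error(y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    # Classification default: percentage of misclassified samples.
    def error_rate(y_pred_class, y_true_class):
        return np.mean(y_pred_class != y_true_class)

    # Made-up toy data, just to show the calls.
    print(mean_squared_error(np.array([2.5, 0.0, 2.1]), np.array([3.0, -0.5, 2.0])))
    print(error_rate(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))  # 0.25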

I do not believe I can guide you through the cost functions already implemented in TensorFlow, as I have very little knowledge of the tool, but I can give you an example of how to write and assess different cost functions.


To illustrate the various differences between cost functions, let us use the example of a binary classification problem, where we want, for each sample $x_n$, to predict its class $f(x_n) \in \{0,1\}$.

Let us start with computational properties, and how two functions measuring the "same thing" can lead to different results. Take the following simple cost function: the percentage of error. If you have $N$ samples, $f(x_n)$ is the predicted class and $y_n$ the true class, you want to minimize

  • $\frac{1}{N} \sum_n \left\{ \begin{array}{ll} 1 & \text{ if } f(x_n) \not= y_n\\ 0 & \text{ otherwise}\\ \end{array} \right. = \frac{1}{N} \sum_n y_n[1-f(x_n)] + [1-y_n]f(x_n)$.

This cost function has the benefit of being easily interpretable. However, it is not smooth: if you have only two samples, the function "jumps" from 0 to 0.5 to 1. This will lead to inconsistencies if you try to use gradient descent on this function. One way to avoid this is to change the cost function to use probabilities of assignment, $p(y_n = 1 | x_n)$. The function becomes

  • $\frac{1}{N} \sum_n y_n p(y_n = 0 | x_n) + (1 - y_n) p(y_n = 1 | x_n)$.

This function is smoother and will work better with a gradient descent approach. You will get a 'finer' model. However, it has another problem: if a sample is ambiguous, say you do not have enough information to say anything better than $p(y_n = 1 | x_n) = 0.5$, then using gradient descent on this cost function will lead to a model that increases this probability as much as possible, and thus, maybe, overfits.

Another problem with this cost function is that a prediction can be confidently wrong: if $p(y_n = 1 | x_n) = 1$ while $y_n = 0$, the model is completely certain of its answer, yet completely wrong, and it pays no more for that than for any other misclassification. To punish such confident mistakes much more heavily, you can take the negative log of the probability assigned to the true class, $-\log p(y_n | x_n)$. As $-\log(0) = \infty$ and $-\log(1) = 0$, the following function does not have this problem:

  • $-\frac{1}{N} \sum_n \left[ y_n \log p(y_n = 1 | x_n) + (1 - y_n) \log p(y_n = 0 | x_n) \right]$.

This should illustrate that, even when optimizing the same underlying quantity, the percentage of error, different formulations can lead to different results, because some are much easier to work with computationally.
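To make the three variants above concrete, here is a small NumPy sketch, assuming the model outputs a probability $p(y_n = 1 | x_n)$ for each sample (the data and variable names are made up):

    import numpy as np

    y = np.array([1, 0, 1, 1])            # true classes
    p1 = np.array([0.9, 0.2, 0.6, 0.55])  # predicted p(y_n = 1 | x_n)

    # 1) Percentage of error: thresholded predictions, jumps in steps of 1/N.
    pred = (p1 >= 0.5).astype(int)
    pct_error = np.mean(pred != y)

    # 2) Expected error probability: the smooth version of the same idea.
    expected_error = np.mean(y * (1 - p1) + (1 - y) * p1)

    # 3) Negative log-likelihood (cross-entropy): confidently wrong
    #    predictions cost arbitrarily much.
    nll = -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))

    print(pct_error, expected_error, nll)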

It is possible for cost functions $A$ and $B$ to measure the same concept, but $A$ might lead your method to better results than $B$.


Now let us see how different cost functions can measure different concepts. In the context of information retrieval, as in Google search (if we ignore ranking), we want the returned results to

  • have high recall (return as many of the relevant results as possible),
  • have high precision (return as few irrelevant results as possible).

Note that if your algorithm returns everything, it will return every possible relevant result and thus have high recall, but it will have very poor precision. On the other hand, if it returns only one element, the one it is most certain is relevant, it will have high precision but low recall.

In order to judge such algorithms, the common cost function is the $F$-score (Wikipedia). The common case is the $F_1$-score, which gives equal weight to precision and recall, but the general case is the $F_\beta$-score, and you can tweak $\beta$ to get

  • Higher recall, if you use $\beta > 1$
  • Higher precision, if you use $\beta < 1$.

In such a scenario, choosing the cost function is choosing what trade-off your algorithm should make.
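As a sketch, the $F_\beta$ trade-off is easy to compute by hand for made-up precision and recall values (scikit-learn also provides fbeta_score, if you prefer not to write it yourself):

    def f_beta(precision, recall, beta):
        # Weighted harmonic mean of precision and recall:
        # beta > 1 weights recall more, beta < 1 weights precision more.
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    precision, recall = 0.9, 0.4  # hypothetical retrieval system

    print(f_beta(precision, recall, 1.0))  # F1: balanced
    print(f_beta(precision, recall, 2.0))  # F2: rewards recall, lower here
    print(f_beta(precision, recall, 0.5))  # F0.5: rewards precision, higher here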

Another example that is often brought up is medical diagnosis: depending on which mistake is preferable, you can choose a cost function that punishes false negatives or false positives more heavily, thereby tolerating either:

  • More healthy people being classified as sick (But then, we might treat healthy people, which is costly and might hurt them if they are actually not sick)
  • More sick people being classified as healthy (But then, they might die without treatment)

In conclusion, defining the cost function is defining the goal of your algorithm. The algorithm defines how to get there.


Side note: some cost functions have nice algorithmic ways of reaching their goals. For example, a nice way to reach the minimum of the hinge loss (Wikipedia) exists, by solving the dual problem of the SVM (Wikipedia).
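For reference, the hinge loss itself is a one-liner; here is a sketch in NumPy (computing the loss only, not solving the SVM dual):

    import numpy as np

    def hinge_loss(scores, labels):
        # labels in {-1, +1}, scores are raw (unbounded) model outputs.
        # A sample stops contributing once it is on the correct side of the
        # margin, i.e. once labels * scores >= 1.
        return np.mean(np.maximum(0.0, 1.0 - labels * scores))

    print(hinge_loss(np.array([2.0, -0.5, 0.3]), np.array([1, -1, 1])))  # 0.4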

OTHER TIPS

To answer your question on cross-entropy, you'll notice that both of the expressions you mentioned are the same thing. The expression

$-\frac{1}{n} \sum \left( y\_train \cdot \log(y\_output) + (1 - y\_train) \cdot \log(1 - y\_output) \right)$

that you mentioned is simply the binary cross-entropy loss, where you assume that $y\_train$ is a 0/1 scalar and that $y\_output$ is again a scalar indicating the probability of the output being 1.

The other expression you mentioned is a more generic variant of that, extending it to multiple classes:

-tf.reduce_sum(y_train * tf.log(y_output)) is the same thing as writing

$-\sum_{i} train\_prob_i \cdot \log (out\_prob_i)$

where the summation runs over the classes and the probabilities are per class. Clearly, in the binary case it is exactly the same as what was mentioned earlier. The $\frac{1}{n}$ factor is omitted because, being a constant, it does not contribute to the loss minimization in any way.
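A quick NumPy check that the multi-class expression reduces to the binary one when a two-class problem is one-hot encoded (the probabilities are made up):

    import numpy as np

    p = 0.8   # predicted probability that the class is 1
    y = 1     # true class

    # Binary form: -(y * log(p) + (1 - y) * log(1 - p))
    binary_ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Multi-class form: -sum(y_train * log(y_output)) with one-hot vectors
    y_train = np.array([0.0, 1.0])    # one-hot encoding of class 1
    y_output = np.array([1 - p, p])   # probabilities for class 0 and class 1
    multiclass_ce = -np.sum(y_train * np.log(y_output))

    print(binary_ce, multiclass_ce)   # both are -log(0.8), about 0.223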

BLUF: iterative trial-and-error with a subset of your data, and matplotlib.

Long Answer:

My team was struggling with this same question not that long ago. All the answers here are great, but I wanted to share with you my "beginner's answer" for context and as a starting point for folks who are new to machine learning.

You want to aim for a cost function that is smooth and convex for your specific choice of algorithm and data set. That's because you want your algorithm to be able to confidently and efficiently adjust the weights to eventually reach the global minimum of that cost function. If your cost function is "bumpy" with local maxima and minima, and/or has no global minimum, then your algorithm might have a hard time converging; its weights might just jump all over the place, ultimately failing to give you accurate and/or consistent predictions.

For example, if you are using linear regression to predict someone's weight (real number, in pounds) based on their height (real number, in inches) and age (real number, in years), then the mean squared error cost function should be a nice, smooth, convex curve. Your algorithm will have no problems converging.

But say instead you are using a logistic regression algorithm for a binary classification problem, like predicting a person's gender based on whether the person has purchased diapers in the last 30 days and whether the person has purchased beer in the last 30 days. In this case, mean squared error might not give you a smooth convex surface, which could be bad for training. And you would find that out by experimentation.

You could start by running a trial using MSE and a small, simple sample of your data, or with mock data that you generated for this experiment. Visualize what is going on with matplotlib (or whatever plotting solution you prefer). Is the resulting error curve smooth and convex? Try again with an additional input variable... is the resulting surface still smooth and convex? Through this experiment you may find that while MSE does not fit your problem/solution, cross-entropy gives you a smooth convex shape that better fits your needs. So you could try that out with a larger sample data set and see if the hypothesis still holds. And if it does, then you can try it with your full training set a few times and see how it performs and whether it consistently delivers similar models. If it does not, then pick another cost function and repeat the process.
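Here is a minimal sketch of that kind of experiment, assuming a one-parameter logistic model on mock data so the loss can be plotted as a simple curve over the single weight (all data and names here are invented for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    # Mock binary-classification data: one feature, one weight, no bias.
    x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
    y = np.array([0, 0, 0, 1, 1, 1])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    weights = np.linspace(-4, 8, 200)
    mse, ce = [], []
    for w in weights:
        p = sigmoid(w * x)
        mse.append(np.mean((p - y) ** 2))
        ce.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Eyeball which loss curve looks smoother / better behaved over the weight.
    plt.plot(weights, mse, label="MSE")
    plt.plot(weights, ce, label="cross-entropy")
    plt.xlabel("weight")
    plt.ylabel("loss")
    plt.legend()
    plt.show()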

This type of highly iterative trial-and-error process has been working pretty well for me and my team of beginner data scientists, and lets us focus on finding solutions to our questions without having to dive deeply into the math theory behind cost function selection and model optimization.

Of course, a lot of this trial and error has already been done by other people, so we also leverage public knowledge to help us filter our choices of promising cost functions early in the process. For example, cross-entropy is generally a good choice for classification problems, whether it's binary classification with logistic regression like the example above or a more complicated multi-label classification with a softmax layer as the output. MSE, on the other hand, is a good first choice for linear regression problems, where you are seeking a scalar prediction rather than the likelihood of membership in a known set of categories; in that case, instead of a softmax layer as your output, you could just have a weighted sum of the inputs plus a bias, without an activation function.
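For concreteness, here is a sketch of those two pairings using the tf.keras API (the layer sizes and input shape are arbitrary placeholders):

    import tensorflow as tf

    # Classification: softmax output layer paired with cross-entropy.
    classifier = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # one probability per class
    ])
    classifier.compile(optimizer="adam", loss="categorical_crossentropy")

    # Regression: a plain weighted sum plus bias (no activation) paired with MSE.
    regressor = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),  # scalar prediction, no activation
    ])
    regressor.compile(optimizer="adam", loss="mse")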

Hope this answer helps other beginners out there without being overly simplistic and obvious.

Regarding your question

Where is the (1 - y_train) * log(1 - y_output) part in most TensorFlow examples? Isn't it missing?

The answer is that most output functions are softmax. That means you don't necessarily need to explicitly reduce the probabilities of the wrong classes, because they will automatically be reduced when you increase the probability of the right one.

For Example:

before optimisation

y_output = [0.2, 0.2, 0.6] and y_train = [0, 0, 1]

after optimisation

y_output = [0.15, 0.15, 0.7] and y_train = [0, 0, 1]

Here, observe that even though we only increased the third term, all the other terms were automatically reduced.
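A small NumPy sketch of that coupling: because the softmax output always sums to 1, raising one class's logit necessarily shrinks the other probabilities (the logits are made up):

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - np.max(logits))  # shift for numerical stability
        return e / e.sum()

    print(softmax(np.array([1.0, 1.0, 2.0])))  # roughly [0.21, 0.21, 0.58]
    print(softmax(np.array([1.0, 1.0, 3.0])))  # raising only the third logit
                                               # shrinks the first two probabilities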

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange