Question

I am trying to implement a WGAN-GP model using TensorFlow and Keras (for the credit card fraud data from Kaggle).

I mostly followed the sample code provided on the Keras website and several other sample codes on the internet (adapting them from image data to my tabular data), and it is pretty straightforward.

But when I update the critic, the gradient of the loss w.r.t. the critic's weights becomes all nan after a few batches. This causes the critic's weights to become nan, then the generator's weights become nan, and so on. Everything becomes nan!


I used tf.debugging.enable_check_numerics and found that the problem arises because a -Inf appears in the gradient after some iterations.

This is directly related to the gradient-penalty term in the loss, because when I remove it, the problem goes away.

Please note that the gp value itself is not nan, but when I compute the gradient of the loss w.r.t. the critic's weights (c_grads in the code below), it contains -Inf and then somehow becomes all nan.

I checked the math and network architecture for possible mistakes (like the possibility of vanishing gradients, etc.), and I checked my code for bugs for hours and hours. But I'm stuck.

I would very much appreciate it if anyone could find the root of the problem.

Note: Bear in mind that the critic's output and loss function are slightly different from the original paper (because I'm trying to make it conditional), but that has nothing to do with the problem because, as I said before, the whole problem goes away when I simply remove the gradient penalty term.

This is my critic:

critic = keras.Sequential([
        keras.layers.Input(shape=(x_dim,), name='c-input'),
        keras.layers.Dense(64, kernel_initializer=keras.initializers.he_normal(), name='c-hidden-1'),
        keras.layers.LeakyReLU(alpha=0.25, name='c-activation-1'),
        keras.layers.Dense(32, kernel_initializer=keras.initializers.he_normal(), name='c-hidden-2'),
        keras.layers.LeakyReLU(alpha=0.25, name='c-activation-2'),
        keras.layers.Dense(2, activation='tanh', name='c-output')
    ], name='critic')

This is my gradient-penalty function:

def gradient_penalty(self, batch_size, x_real, x_fake):
    # get the random linear interpolation of real and fake data (x hat)
    alpha = tf.random.uniform([batch_size, 1], 0.0, 1.0)
    x_interpolated = x_real + alpha * (x_fake - x_real)
    with tf.GradientTape() as gp_tape:
        gp_tape.watch(x_interpolated)
        # Get the critic score for this interpolated data
        scores = 0.5 * (self.critic(x_interpolated, training=True) + 1.0)
    # Calculate the gradients w.r.t. this interpolated data
    grads = gp_tape.gradient(scores, x_interpolated)
    # Calculate the norm of the gradients
    # The gradient penalty pushes the gradient norm towards 1.0 (1-Lipschitz constraint)
    gp = tf.reduce_mean(tf.square(tf.norm(grads, axis=-1) - 1.0))
    return gp

And this is the critic's update code:

# Get random samples from latent space
z = GAN.random_samples((batch_size, self.latent_dim))

# Augment random samples with the class label (1 for class "fraud") for conditioning
z_conditioned = tf.concat([z, tf.ones((batch_size, 1))], axis=1)
# Generate fake data using random samples
x_fake = self.generator(z_conditioned, training=True)

# Calculate the loss and back-propagate
with tf.GradientTape() as c_tape:
    c_tape.watch(x_fake)
    c_tape.watch(x_real)

    # Get the scores for the fake data
    output_fake = 0.5 * (self.critic(x_fake) + 1.0)
    score_fake = tf.reduce_mean(tf.reduce_sum(output_fake, axis=1))
    # Get the scores for the real data
    output_real = 0.5 * (self.critic(x_real, training=True) + 1.0)
    score_real = tf.reduce_mean((1.0 - 2.0 * y_real) * (output_real[:, 0] - output_real[:, 1]))

# Calculate the gradient penalty
gp = self.gp_coeff * self.gradient_penalty(batch_size, x_real, x_fake)
# Calculate critic's loss (added 1.0 so its ideal value becomes zero)
c_loss = 1.0 + score_fake - score_real + gp
# Calculate the gradients
c_grads = c_tape.gradient(c_loss, self.critic.trainable_weights)
# back-propagate the loss
self.c_optimizer.apply_gradients(zip(c_grads, self.critic.trainable_weights))

Also note: As you can see, I don't use cross-entropy or any other hand-written function that risks division by zero.


Solution

So after much more digging around the internet, it turns out that this is caused by the numerical instability of tf.norm (and of some other functions as well).

In the case of the norm function, the problem is that its own value appears in the denominator of its gradient: d(norm(x))/dx = x / norm(x), so at x = 0 this becomes 0 / 0 (the mysterious division by zero I was looking for!).

The problem is that the computational graph sometimes ends up with expressions like a / a where a = 0, which is numerically undefined even though the limit exists. And because of the way TensorFlow computes gradients (term by term using the chain rule), the result is nans or +/-Infs.
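
Here is a minimal snippet that reproduces the issue in isolation (just an illustration, not part of my model): taking the gradient of tf.norm at exactly zero gives nan because of that 0 / 0.

import tensorflow as tf

x = tf.zeros([3])
with tf.GradientTape() as tape:
    tape.watch(x)
    n = tf.norm(x)
# d(norm(x))/dx = x / norm(x), which is 0 / 0 at x = 0
print(tape.gradient(n, x))  # -> [nan nan nan]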

The best solution would probably be for TensorFlow to detect these patterns and replace them with their analytically simplified equivalents. But until that happens, there is another way: using tf.custom_gradient to define our own function with our own gradient (see the related issue on their GitHub).
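
For example, something along these lines (just a sketch of the idea; safe_norm and the epsilon value are my own choices, not a built-in):

import tensorflow as tf

@tf.custom_gradient
def safe_norm(x):
    # forward pass: ordinary Euclidean norm along the last axis
    norm = tf.sqrt(tf.reduce_sum(tf.square(x), axis=-1, keepdims=True))
    def grad(dy):
        # backward pass: x / ||x||, with a small epsilon so the
        # denominator can never be exactly zero
        return dy * x / (norm + 1.0e-12)
    return norm, grad

This keeps the forward value the same as the plain norm while the backward pass stays finite at x = 0.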

In my case there was actually an even simpler solution (although it wasn't simple to find while I didn't know that tf.norm was the culprit):

So instead of:

tf.norm(x)

You can use:

tf.sqrt(tf.reduce_sum(tf.square(x)) + 1.0e-12)

Note: Be careful about dimensions (if x is a matrix or tensor and you need row-wise or column-wise norms, pass the appropriate axis to tf.reduce_sum)! This is just sample code to demonstrate the concept; a per-sample version as it would fit into the gradient penalty above is sketched below.
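
For instance, in the gradient_penalty function above, the row-wise (per-sample) version would look something like this (the epsilon is just an example value):

grads_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1.0e-12)
gp = tf.reduce_mean(tf.square(grads_norm - 1.0))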

Hope this helps someone.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange