Question

I have a dataset that contains text columns. I have used tf-idf to convert those text columns to numerical columns, and I want to reduce the dimensionality of the dataset since tf-idf creates a multitude of new features/columns.

I am thinking of using an autoencoder to reduce the dimensionality of the data: take the output of the encoded layer, concatenate those outputs to my dataframe, and use the result as the training set for a random forest.

My questions are: Do the above steps sound reasonable? In particular, should I train the autoencoder on the whole dataset and then use the output of the encoded layer (after training) as the new columns to feed the random forest? [See code below]

I am using Keras, so after defining the autoencoder model I fit it to the whole dataset, then predict on the whole dataset using only the encoded layer and use those predictions as my updated data.

Here is the code I am using:

import pandas as pd
from keras.layers import Input, Dense
from keras.models import Model

def autoencoder(df):
    # Return the lower-dimensional data to feed directly into train_test_split
    df_copy = df.copy()
    input_dim = Input(shape = (df_copy.shape[1], ))

    # DEFINE THE ENCODER LAYER
    encoded = Dense(int(df_copy.shape[1] / 2), activation = 'relu')(input_dim)

    # DEFINE THE DECODER LAYER
    decoded = Dense(df_copy.shape[1], activation = 'sigmoid')(encoded)

    # COMBINE ENCODER AND DECODER INTO AN AUTOENCODER MODEL
    autoencoder = Model(inputs = input_dim, outputs = decoded)

    # CONFIGURE AND TRAIN THE AUTOENCODER
    autoencoder.compile(optimizer = 'adadelta', loss = 'mean_squared_error')
    autoencoder.fit(df_copy.values, df_copy.values, epochs = 10, batch_size = int(df_copy.shape[1] / 3))

    # THE ENCODER TO EXTRACT THE REDUCED DIMENSION FROM THE ABOVE AUTOENCODER
    encoder = Model(inputs = input_dim, outputs = encoded)
    encoded_out = encoder.predict(df_copy.values)  # Note how I am training on the same data I am predicting on

    return pd.DataFrame(encoded_out)

I am new to autoencoders so any suggestions and help on how to use autoencoders in this context will be appreciated.


Solution

I will repost my answer from Quora here, addressing the "should I train the autoencoder on the whole dataset" part of the question.

TL;DR: NO, you always use ONLY the training set for training, even for unsupervised learning with autoencoders!

What is the goal of using a test set? It is meant to represent data that your model (including both the autoencoder and the random forest) has NEVER seen before.

Autoencoders are unsupervised learning models, but that does not mean they cannot overfit. They can learn latent codes that do not generalize. Imagine an AE that encodes every data point x1, x2, … as a real number (a one-dimensional code):

x1 -> 0.0001, x2 -> 0.0002, etc.

Do you think this autoencoder will generalize well? No, it has simply memorized the entire training set without learning useful representations. That is why we use regularization in autoencoders, in very different forms: undercompleteness, sparsity, denoising, variational autoencoders (VAEs)…
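
For instance, here is a minimal sketch of the denoising variant (the helper name, noise level, and layer sizes are my own illustration, not from your code): the model must reconstruct the clean input from a corrupted copy, which discourages pure memorization.

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

def train_denoising_ae(x_train, code_dim, noise_std=0.1):
    inp = Input(shape=(x_train.shape[1],))
    code = Dense(code_dim, activation='relu')(inp)
    out = Dense(x_train.shape[1], activation='linear')(code)
    ae = Model(inputs=inp, outputs=out)
    ae.compile(optimizer='adam', loss='mean_squared_error')

    # Corrupt the inputs, but keep the reconstruction targets clean
    x_noisy = x_train + noise_std * np.random.randn(*x_train.shape)
    ae.fit(x_noisy, x_train, epochs=10, batch_size=32)
    return Model(inputs=inp, outputs=code)  # return the encoder part only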

In conclusion: if you include the test set when training your autoencoder, it may in some cases overfit, your performance estimates will be biased, and the test set loses its whole purpose. A leakage-free version of your pipeline is sketched below.
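
In code, the workflow looks something like this (a sketch assuming X and y are the arrays from your tf-idf step, and reusing the hypothetical train_denoising_ae helper from above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split FIRST, so the test rows never influence the autoencoder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the autoencoder on the training rows only
encoder = train_denoising_ae(X_train, code_dim=X_train.shape[1] // 2)

Z_train = encoder.predict(X_train)  # codes for data the AE was trained on
Z_test = encoder.predict(X_test)    # codes for data the AE has never seen

rf = RandomForestClassifier(n_estimators=100)
rf.fit(Z_train, y_train)
print(rf.score(Z_test, y_test))  # an honest estimate of generalization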

(don’t hesitate to ask if you want me to detail some aspects of my answer or have more questions on this matter)


Concerning the architecture of your AE:

  1. Using a sigmoid output activation only makes sense if your data takes values between 0 and 1, as a sigmoid squashes the output into the [0, 1] range. So either normalize your data to the [0, 1] range, or use a linear output activation.
  2. With a sigmoid output activation, you should use a binary cross-entropy loss, which is designed for binary variables following Bernoulli distributions. With a linear output, use the mean squared error loss, which is designed for real-valued outputs (following Gaussian distributions).
  3. Maybe the Adam optimizer will converge faster than Adadelta. Try it for yourself. (A sketch combining all three points follows this list.)
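
Putting the three points together, a revised model definition could look like this (a sketch, not the only valid configuration; swap in 'sigmoid' plus binary cross-entropy if you normalize the data to [0, 1] instead):

from keras.layers import Input, Dense
from keras.models import Model

def build_autoencoder(n_features):
    inp = Input(shape=(n_features,))
    encoded = Dense(n_features // 2, activation='relu')(inp)
    decoded = Dense(n_features, activation='linear')(encoded)  # linear output (point 1)

    ae = Model(inputs=inp, outputs=decoded)
    ae.compile(optimizer='adam', loss='mean_squared_error')  # MSE + Adam (points 2 and 3)
    encoder = Model(inputs=inp, outputs=encoded)  # reuse this for dimensionality reduction
    return ae, encoder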