Question

I am working with a large dataset (approximately 55K observations x 11K features) and trying to reduce the dimensionality to about 150 features. So far I have tried PCA, LDA, and an autoencoder. The autoencoder I tried was 12000-8000-5000-100-500-250-150; all layers were Dense with sigmoid activation, except the final layer, which had a linear activation in order to reproduce the continuous input data. The autoencoder loss effectively plateaus after 10-15 epochs, regardless of the learning rate (here I used the ReduceLROnPlateau callback in Keras). For the record, I am z-scoring each feature prior to training. I'm not sure how to keep this loss from plateauing.
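
For reference, a minimal sketch of the preprocessing and learning-rate schedule described above, assuming a Keras/TensorFlow setup; the file name and the commented fit call are placeholders, and the model itself is omitted:

    import numpy as np
    import tensorflow as tf

    # Placeholder for the 55K x 11K matrix of continuous features.
    X = np.load("data.npy")

    # Z-score each feature (small epsilon guards against constant columns).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    # Halve the learning rate whenever the training loss stops improving.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="loss", factor=0.5, patience=5, min_lr=1e-6
    )

    # autoencoder.fit(X, X, epochs=100, batch_size=256, callbacks=[reduce_lr])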

Should my next attempt be a convolutional neural network on this dataset, to see if I can reduce the dimensionality more successfully? Are there any pre-trained convolutional autoencoders that I could use? Training a convolutional autoencoder from scratch seems to require quite a bit of memory and time, but working off a pre-trained CNN autoencoder might save both.


Solution

A convolutional autoencoder only makes sense if you work with images (2D signals) or time series (1D signals). Convolutions identify local patterns in the data; if your data has no such local structure, a convolutional architecture will most likely not solve your problem.

Using a pre-trained AE will only help if it was trained on similar data. Similar data in this case does not refer to the data type, but to what the data represents. An AE trained to compress images of cats will not work well on images of chairs, because cats and chairs do not share the same features. However, if you would like to compress images of dogs, you could use the weights of the cat AE as a starting point (transfer learning).
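
If a pre-trained AE with a compatible architecture were available, the weight reuse could look roughly like the following Keras sketch; the architecture, layer widths, and weights file name are all hypothetical:

    import tensorflow as tf

    def build_autoencoder(input_dim, code_dim=150):
        # The source and target AE must share this architecture
        # for the saved weights to be compatible.
        inputs = tf.keras.Input(shape=(input_dim,))
        x = tf.keras.layers.Dense(1024, activation="relu")(inputs)
        code = tf.keras.layers.Dense(code_dim, activation="relu")(x)
        x = tf.keras.layers.Dense(1024, activation="relu")(code)
        outputs = tf.keras.layers.Dense(input_dim, activation="linear")(x)
        return tf.keras.Model(inputs, outputs)

    new_ae = build_autoencoder(input_dim=11000)
    new_ae.load_weights("pretrained_ae_weights.h5")  # hypothetical file
    new_ae.compile(optimizer="adam", loss="mse")

    # Fine-tune on your own data, typically with a smaller learning rate:
    # new_ae.fit(X, X, epochs=20, batch_size=256)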

What kind of loss are you using, MSE or cross-entropy? In my experience, cross-entropy yields better results (although this is problem dependent). Another issue could be vanishing gradients, which can occur in very deep networks with activation functions like sigmoid. What you can do is reduce the depth of your network, replace the sigmoid with ReLU, and perhaps try a different optimizer.
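
As a rough illustration of those suggestions (the specific layer widths here are placeholders, not a recommendation), a shallower Keras AE with ReLU hidden layers, a 150-dimensional bottleneck, a linear output, MSE loss, and the Adam optimizer could look like this:

    import tensorflow as tf

    input_dim = 11000  # number of input features (assumed)

    inputs = tf.keras.Input(shape=(input_dim,))
    x = tf.keras.layers.Dense(2048, activation="relu")(inputs)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    code = tf.keras.layers.Dense(150, activation="linear", name="bottleneck")(x)
    x = tf.keras.layers.Dense(512, activation="relu")(code)
    x = tf.keras.layers.Dense(2048, activation="relu")(x)
    outputs = tf.keras.layers.Dense(input_dim, activation="linear")(x)

    autoencoder = tf.keras.Model(inputs, outputs)
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

    # After training, the 150-dimensional features come from the encoder half:
    # encoder = tf.keras.Model(inputs, code)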

In any case, PCA is a safe bet. It's linear, deterministic, well studied, and quicker to use than training a NN. Whatever method you use, PCA serves as a benchmark to see whether your method beats it. Although with the size of your data you may run into memory issues.
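
A minimal scikit-learn sketch of that PCA baseline (the file name is a placeholder); if the full matrix does not fit in memory, IncrementalPCA can process it in batches instead:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.load("data.npy")  # placeholder for the 55K x 11K matrix

    pca = PCA(n_components=150)
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_.sum())  # variance kept by 150 components

    # Batched alternative if memory is tight:
    # from sklearn.decomposition import IncrementalPCA
    # ipca = IncrementalPCA(n_components=150, batch_size=1024)
    # X_reduced = ipca.fit_transform(X)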

Licensed under: CC-BY-SA with attribution