Question

I would like to use a dimensionality reduction algorithm in my pipeline. I have 2k features and I'm using XGBoost. My model is rebuilt each day (there are new records that should be included in the training set).

I'm looking for a dimensionality reduction method that doesn't require setting n_components. I know that in PCA it doesn't have to be set. But I'm looking for a method that finds something like clusters in my data, which I can then use to train my model. Of course, I'll use the same flow for prediction.

Do you have an idea of how I should do my data processing for this case?


Solution

It would be helpful to know a bit better what you're trying to achieve and why selecting a specific number of eigenvalues bothers you. From the generic information you gave, it seems you're aiming to train a model on a compressed/dense representation of several features, in which case I would suggest training an autoencoder (or something similar), on top of which you could then train whatever classifier you need. Otherwise, if the problem is only the sheer number of features you have, you could try some feature selection strategies.
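As a rough illustration of the autoencoder route, here is a minimal Keras sketch. The layer sizes, the bottleneck width of 128, and the training hyperparameters are all illustrative assumptions, not values from the original answer; `X_train` stands in for your real 2k-feature matrix.

```python
# A minimal autoencoder sketch (all sizes are assumptions, tune for your data).
from tensorflow import keras
from tensorflow.keras import layers

n_features = 2000
bottleneck = 128  # width of the compressed representation (an assumption)

inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(512, activation="relu")(inputs)
encoded = layers.Dense(bottleneck, activation="relu")(encoded)
decoded = layers.Dense(512, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)   # trained to reconstruct its input
encoder = keras.Model(inputs, encoded)       # reused to compress features

autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=256)

# The compressed features would then feed the downstream classifier, e.g.:
# X_train_compressed = encoder.predict(X_train)
```

Since the model is rebuilt daily, the same fitted `encoder` has to be applied to new records at prediction time, exactly like any other preprocessing step in the pipeline.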

OTHER TIPS

It would help to know more about why you want to do it this way, but one approach I can think of is to first run PCA on your 2000 dimensions and get the explained variance and the cumulative variance per component. Then set a threshold on the cumulative variance, and whichever component count hits that threshold (say 95%) becomes your target number of dimensions to reduce to using PCA or t-SNE. So if you reach 95% cumulative variance at component #654, your target for the reduced dimensionality should be 654.
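A sketch of this thresholding idea is below. Note that scikit-learn's PCA already supports it directly: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. The random `X` is a placeholder for the real 2k-feature data.

```python
# Sketch of choosing the component count by a cumulative-variance threshold.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 2000)  # placeholder for the real 2k-feature data
X_scaled = StandardScaler().fit_transform(X)

# Option 1: let PCA pick the component count for a 95% variance threshold.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Kept {pca.n_components_} components")

# Option 2: the manual version described above -- inspect the cumulative
# explained variance yourself and pick the cutoff index.
full_pca = PCA().fit(X_scaled)
cumvar = np.cumsum(full_pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.95) + 1)
```

Option 1 removes the need to hand-pick n_components, which seems to be what the question is after; the threshold itself (0.95 here) is still a choice you have to make once.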

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange