Question

I am trying to run Fisher's LDA (1, 2) to reduce the number of features of a matrix.

Basically, correct me if I am wrong: given n samples classified into several classes, Fisher's LDA tries to find an axis such that projecting onto it maximizes the value J(w), the ratio of the total sample variance to the sum of the variances within the separate classes.
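For reference, here is a sketch of the criterion in the usual textbook notation (the scatter-matrix symbols are mine, not from the linked references): with S_B the between-class scatter matrix and S_W the within-class scatter matrix,

J(w) = \frac{w^\top S_B w}{w^\top S_W w}

Since the total scatter is S_T = S_B + S_W, maximizing this ratio is equivalent to maximizing w^\top S_T w / w^\top S_W w, which matches the description above.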

I think this can be used to find the most useful features for each class.

I have a matrix X of m features and n samples (m rows, n columns).

I have a sample classification y, i.e. an array of n labels, one per sample.

Based on y I want to reduce the number of features to, for example, the 3 most representative ones.

Using scikit-learn I tried it this way (following this documentation):

>>> import numpy as np
>>> from sklearn.lda import LDA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LDA(n_components=3)
>>> clf.fit_transform(X, y)
array([[ 4.],
       [ 4.],
       [ 8.],
       [-4.],
       [-4.],
       [-8.]])

At this point I am a bit confused: how do I obtain the most representative features?

Solution

The features you are looking for are in clf.coef_ after you have fitted the classifier.

Note that n_components=3 doesn't make sense here, since X.shape[1] == 2, i.e. your feature space has only two dimensions. Moreover, with two classes LDA can produce at most n_classes - 1 = 1 discriminant component, which is why the transform above returns a single column.

You do not need to invoke fit_transform in order to obtain coef_; calling clf.fit(X, y) is enough. See the sketch below.
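As a minimal sketch of that, assuming a recent scikit-learn release (where the class lives in sklearn.discriminant_analysis as LinearDiscriminantAnalysis; the old sklearn.lda.LDA import has since been removed), you could rank the features by the absolute value of their weights:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# With two classes there is a single discriminant, so n_components is at most 1.
clf = LinearDiscriminantAnalysis(n_components=1)
clf.fit(X, y)

# For a binary problem coef_ has shape (1, n_features): one weight per feature.
# Features with larger absolute weights contribute more to the discriminant.
ranking = np.argsort(np.abs(clf.coef_[0]))[::-1]
print(ranking)  # feature indices, most discriminative first

Keep in mind that the magnitudes of the coef_ weights depend on the scale of the features, so standardizing X first makes the ranking more meaningful.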

Licensed under: CC-BY-SA with attribution