Question

I developed an image-processing program that identifies which digit appears in an image of a number. Each image is 27x27 pixels = 729 pixels. I take each R, G and B value, which gives me 2187 variables per image (+1 for the intercept = 2188 in total).

I used the below gradient descent formula:

Repeat {
    θj := θj − (α/m) ∑ (hθ(x) − y) xj
}

Where θj is the coefficient of variable j; α is the learning rate; hθ(x) is the hypothesis; y is the true value; and xj is the value of variable j. m is the number of training examples. hθ(x), y and xj are taken per training example (that's what the summation sign is over). The hypothesis is defined as:

hθ(x) = 1/(1 + e^(−z))
z = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn
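For concreteness, the update rule and hypothesis above can be sketched in vectorized NumPy form (the function names here are illustrative, not from the original program; `X` is assumed to carry a leading column of ones for the intercept):

```python
import numpy as np

def sigmoid(z):
    # hθ(x) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    # X: (m, n) design matrix, first column all ones for the intercept θ0
    # y: (m,) vector of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)       # hypothesis for all m examples at once
        grad = X.T @ (h - y) / m     # (1/m) Σ (hθ(x) − y) xj, vectorized
        theta -= alpha * grad        # simultaneous update of every θj
    return theta
```

Computing the whole gradient as one matrix product, instead of looping over the 2188 variables and m examples, is exactly the "vectorize" advice given in the answer below and is typically orders of magnitude faster.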

With this and 3000 training images, I was able to train my program in just over an hour, and when tested on a cross-validation set it identified the correct digit ~67% of the time.

I wanted to improve on that, so I decided to try a polynomial of degree 2.

However, the number of variables jumps from 2188 to 2,394,766 per image! A single step of gradient descent now takes an hour.
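The 2,394,766 figure is just the count of all monomials of degree at most 2 in 2187 variables, which a few lines of Python confirm:

```python
from math import comb

n = 2187  # raw R, G, B features per 27x27 image

# Monomials of degree <= 2: one intercept, n linear terms, and every
# product of two variables including squares, i.e. C(n + 1, 2) of them.
total = 1 + n + comb(n + 1, 2)
print(total)  # 2394766
```

So the quadratic blow-up is inherent to the degree-2 expansion, not an implementation detail.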

So my question is: how are such vast numbers of variables handled in machine learning? On the one hand, I don't have enough space to store that many variables per training example. On the other hand, even though I currently store only 2188 variables per training example, computing the degree-2 features (the value of each variable multiplied by every other variable) takes O(n^2) work.

So any suggestions / advice is greatly appreciated.


Solution

  • Try some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images).

  • Vectorize your gradient descent; with most math libraries, or in MATLAB etc., it will run much faster.

  • Parallelize the algorithm and run it on multiple CPUs (though your library for multiplying vectors may already support parallel computation).
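The first suggestion can be sketched with a minimal SVD-based PCA in NumPy (the helper name `pca_fit_transform` is hypothetical; in practice a library implementation such as scikit-learn's `PCA` would be used):

```python
import numpy as np

def pca_fit_transform(X, k):
    # Center the data, then project onto the top-k principal directions
    # given by the right singular vectors of the centered matrix.
    mu = X.mean(axis=0)
    Xc = X - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # (m, k) reduced representation
```

Reducing the 2187 raw pixel features to a few hundred principal components before building the degree-2 terms keeps the expanded feature count manageable.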

OTHER TIPS

Along with Jirka-x1's answer, I would first say that this is one of the key differences between working with image data and, say, text data for ML: high dimensionality.

Second, this is a duplicate; see How to approach machine learning problems with high dimensional input space?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow