Question

I built a neural network to predict a certain kind of data (biological sequences). It has 32 features, of which 12 have certain units and 20 are simply positive integers. My positive set has 648 samples and my negative set has over 9,000 samples.

To train my network I took 500 samples of each class, and the rest were used for testing. When I trained and tested the network with 3-fold cross-validation it gave 100% accuracy in all cases, provided I normalised the input data before partitioning it into training and testing sets. Precision and recall were both 100%.

When I don't normalise, the accuracy falls to 65-70% for the same experiment. Precision and recall are 5% and 80% respectively.

The case becomes more peculiar. When I use the first (normalised) model to test several random samples that were present in the training set, without normalising them (since real-world data cannot be normalised this way, because we deal with single instances), it predicts all samples as 1, i.e. positive; it is completely biased towards positives.

When I use the second model (the unnormalised one) it produces more false negatives.

If 'outp' is the output prediction for the training-set positives and 'outn' is the output prediction for the training-set negatives, I calculated the threshold for my network as:

[mean(outp) - std_dev(outp) + mean(outn) + std_dev(outn)] / 2

I got 0.5 for the first model and 0.489 for the second.
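In MATLAB that calculation would look roughly like the sketch below (the variable names threshold and y are just illustrative):

% outp, outn: network outputs on the positive and negative training samples
threshold = (mean(outp) - std(outp) + mean(outn) + std(outn)) / 2;

% a new output y is then called positive when it exceeds the threshold
isPositive = y >= threshold;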

1) Where is the problem? Can someone explain it to me?

2) When we train, it is recommended to normalise the data, but doesn't that mean the classifier will misinterpret the input values provided by a user of the prediction tool, since a single sample cannot be normalised?

3) Also, what is the best method to find the threshold in such problems, or in classification problems in general?

4) I don't know what other information I should provide. Please let me know that too.

I am providing links to the epoch-vs-error plots.

https://www.dropbox.com/s/1gideuvbeje2lip/model2_unnormalised.jpg
https://www.dropbox.com/s/nb4zyt3h370pk8m/model1_normalised.jpg

One more thing I would like to mention: to normalize I used MATLAB's built-in function.

My positive matrix is 32 features by 648 samples (i.e. 32 x 648)

and my negative matrix is 32 features by 9014 samples (i.e. 32 x 9014).

Both matrices were normalized with MATLAB's normr function before any partitioning into training, test, or validation sets.
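For reference, normr rescales each row of its input to unit length, so with the 32 x N layout above each feature row is normalized across all samples at once; roughly (posMatrix and negMatrix are just placeholder names for the two matrices above):

X  = [posMatrix negMatrix];       % 32 x (648 + 9014), features in rows
Xn = normr(X);                    % each row rescaled to unit 2-norm
Xn2 = X ./ vecnorm(X, 2, 2);      % hand-rolled equivalent (needs R2017b+)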


Solution

You can normalize your data, but when you receive new input from a user, you must normalize it using the same 'min' and 'max' you used when you trained your network. Since the built-in function doesn't give you those values, you may want to normalize the matrix by hand and store 'min' and 'max' so you can later normalize user input the same way.

I use this formula, but others exist:

MatNorm = (Mat - min(Mat)) / (max(Mat) - min(Mat))
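A minimal MATLAB sketch of that idea, assuming X is the 32 x N training matrix with features in rows (trainMin, trainMax and xNew are illustrative names, not from the original):

% per-feature min and max over the training samples only
trainMin = min(X, [], 2);                      % 32 x 1
trainMax = max(X, [], 2);                      % 32 x 1

% normalize the training matrix to [0, 1] feature by feature
% (implicit expansion, R2016b+, broadcasts the 32 x 1 vectors across columns)
Xnorm = (X - trainMin) ./ (trainMax - trainMin);

% later, normalize a single user sample (32 x 1) with the stored values
xNewNorm = (xNew - trainMin) ./ (trainMax - trainMin);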

Also, how many positive samples did you use for training?

Other tips

If you are using the standard scaling strategy, apply the same mean and std values obtained from the training data to your validation/test data for normalization. 10-fold cross-validation is also recommended.
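A minimal sketch of standard (z-score) scaling in MATLAB, assuming Xtrain and Xtest have features in rows as in the question (mu and sigma are illustrative names):

% per-feature mean and standard deviation from the training set only
mu    = mean(Xtrain, 2);                   % 32 x 1
sigma = std(Xtrain, 0, 2);                 % 32 x 1

% scale training and test data with the same statistics
XtrainScaled = (Xtrain - mu) ./ sigma;
XtestScaled  = (Xtest  - mu) ./ sigma;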

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow