문제

Hi I am building a simple multilayer network which is trained using back propagation. My problem at the moment is that some attributes in my dataset are nominal (non numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting up how many distinct values there are for each attribute and assigning each an equal number between 0 and 1. For example suppose one of my attributes had values A to E then would the following be suitable?:

A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1

The second part to my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also after I get an output from the network, do I just see which number it is closer to? For example if I got 0.435 as an output and my output attribute values were assigned like this:

x = 0
y = 0.5
z = 1

Do I just find the nearest value to the output (0.435) which is y (0.5)?

도움이 되었습니까?

해결책

You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:

Nominal vs ordinal variables

Suppose A, B, etc stand for colours. These are the values of a nominal variable and can not be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables .

Now suppose A, B, C, etc stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to 40 a chest circumference), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.

One-of-N encoding A better way to go about this is to to use the so called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.

A similar principle applies to the outputs of the network. If we treat garment size as output instead of input, when the network output the vector [0.01 -0.01 0.5 0.0001 -.0002], you interpret that as size medium.

In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because they may end out dominating your discrete variable.

다른 팁

It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.

If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.

EDIT: If the NN was able to accurately perform it's function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see inaccurate results.

Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.

If you've succesfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.

In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top