First of all, welcome to Stack Overflow!
I've never personally dealt with using the Kinect for image recognition, but if it's possible, you should scale the image down to a reasonably small size, such as 100x100, so that it stays manageable.
You should also convert the image to grayscale, as this helps with computational efficiency and development time, and it's much easier to start off with than RGB.
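As a rough sketch of that preprocessing (the 640x480 frame here is a hypothetical stand-in for whatever the Kinect gives you; the luminance weights are the usual ITU-R ones):

```python
import numpy as np

# Hypothetical 640x480 RGB frame (stand-in for a real Kinect capture), values 0-255.
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Convert to grayscale with standard luminance weights.
gray = frame @ np.array([0.299, 0.587, 0.114])

# Crude downscale to 100x100 by sampling evenly spaced rows and columns.
# A real pipeline would average or use a proper resize, but this shows the idea.
rows = np.linspace(0, gray.shape[0] - 1, 100).astype(int)
cols = np.linspace(0, gray.shape[1] - 1, 100).astype(int)
small = gray[np.ix_(rows, cols)]

print(small.shape)  # (100, 100)
```

In practice you'd use an image library's resize function instead of plain sampling, but either way you end up with a single-channel 100x100 array.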
The input layer will not be 1, that's a given. For a 100x100 image, the total number of inputs should be 10,000, one for each pixel. Remember, you're trying to break the data up as fine-grained as you can so the ANN can detect patterns in it.
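Concretely, the 100x100 array just gets flattened into one long vector before it's fed to the network (scaling pixel values to [0, 1] is a common extra step that keeps the inputs well-conditioned):

```python
import numpy as np

# Stand-in for a real 100x100 grayscale frame.
image = np.random.randint(0, 256, size=(100, 100))

# Flatten to a 10,000-element input vector, scaled to [0, 1].
input_vector = image.flatten() / 255.0

print(input_vector.shape)  # (10000,)
```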
The output layer should actually have 2 neurons, and for a good reason. Remember, each output neuron measures the likelihood that the input belongs to its respective class. With 2 neurons, one can represent the positive class ("yes, this is a pen") and the other the negative class ("no, this is not a pen"). That way you get a probability for each class, and you simply choose the higher value as your answer.
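The usual way to turn the two raw output scores into probabilities is a softmax; then the decision is just an argmax. A minimal sketch (the scores here are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical raw scores from the two output neurons:
# index 0 = "pen", index 1 = "not a pen".
outputs = np.array([2.1, 0.3])

probs = softmax(outputs)
labels = ["pen", "not a pen"]
print(labels[int(np.argmax(probs))])  # pen
```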
3 total layers should be sufficient; you'll probably never need more than that. There are some very good articles on choosing the number of layers, such as this one. I hope this helps! Let me know if you have any further questions.
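Putting the pieces together, a 3-layer network (input, one hidden layer, output) is just two weight matrices applied in sequence. This is a bare forward-pass sketch with untrained random weights; the hidden width of 64 is an arbitrary illustrative choice, and you'd still need a training loop (e.g. backpropagation) to make it useful:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3-layer architecture: 10,000 inputs -> 64 hidden units -> 2 outputs.
# The hidden width is a tuning knob, not a prescription.
n_in, n_hidden, n_out = 10000, 64, 2

W1 = rng.normal(scale=0.01, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_out, n_hidden))
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(W1 @ x + b1)   # hidden-layer activations
    z = W2 @ h + b2            # raw output scores
    e = np.exp(z - z.max())    # softmax over the 2 classes
    return e / e.sum()

x = rng.random(n_in)           # stand-in for a flattened 100x100 image
probs = forward(x)
print(probs.shape)  # (2,)
```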