fill missing values of sequence with neural networks

Question 1

In general, if you are training your ANN using backpropagation, you are basically training an input-output map. This means that your training set has to consist of known input-output relations only, so none of your unknown values may appear in it. The ANN then becomes an approximation of the actual relationship between your inputs and outputs.

You can then call x = net.activate([seq]) where seq is the input sequence associated with the unknown value x.

If x is an unknown input sequence for a known result, then you have to call the inverse of the ANN. I do not think there is a simple way of inverting an ANN in pybrain, but you could just train an ANN with the inverse of your original training data. In other words, use your known results as the training inputs, and their associated sequences as the training results.
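As a rough sketch of that swap, in plain Python and independent of pybrain: the helper name `invert_pairs` and the sample data are made up for illustration. With pybrain, each resulting pair would presumably go into a second data set whose input and target dimensions are exchanged, via `ds.addSample(inp, target)`. Keep in mind that an inverse map only makes sense if each result corresponds to a single sequence.

```python
def invert_pairs(pairs):
    """Swap each (inputs, targets) pair so the known results become the inputs."""
    return [(list(targets), list(inputs)) for inputs, targets in pairs]


# Hypothetical forward training data: (sequence, result) pairs.
forward = [([1, 2, 3], [4]), ([2, 3, 4], [5])]

# Training data for the "inverse" network: (result, sequence) pairs.
inverse = invert_pairs(forward)
```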

The main thing to consider is the appropriateness of the tool and the training data for what you are trying to do. If you just want to predict x as a function of the previous number, then I think you are training correctly. I am guessing x is going to be a function of the previous n numbers though, in which case you want to update your data set as:

n = 10
# Assumes ds was created with n inputs and 1 target.
# Slide a window over the list: n consecutive values are the input
# and the value directly after them is the target.
for ind in range(n, len(myList)):
    window = myList[ind-n:ind+1]   # n inputs followed by the target

    # Check that our sequence is valid
    if "x" in window:
        # we have an invalid sequence, skip it
        continue

    # Add valid training sequence to data set
    ds.addSample(window[:n], window[n])
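Once the net is trained on such windows, the unknown entries can be filled by feeding it the n values in front of each "x". Here is a minimal sketch of that step. The function `fill_unknowns` and its `predict` argument are my own illustrative names, not part of pybrain, and the demo uses a toy predictor so it runs without a trained network.

```python
def fill_unknowns(values, n, predict, unknown="x"):
    """Return a copy of values where each unknown entry is replaced by
    predict(previous n values). Unknowns without n known predecessors
    are left untouched."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v != unknown or i < n:
            continue
        window = filled[i - n:i]
        if unknown in window:
            continue  # not enough known context in front of this entry
        filled[i] = predict(window)
    return filled


# Toy predictor for illustration only: "last value plus one".
# With a trained pybrain net this could be: lambda w: net.activate(w)[0]
demo = fill_unknowns([1, 2, 3, 4, "x", 6, 7], n=3, predict=lambda w: w[-1] + 1)
```

Because the list is filled from left to right, a value predicted earlier can serve as context for a later unknown.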

Question 2

What you are describing is a statistical application called Imputation: substituting missing values in your data. The traditional approach does not involve neural networks, but there has certainly been some research in this direction. This is not my area, but I recommend you check the literature.
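To give an idea of what a non-neural baseline can look like, here is a small sketch of linear interpolation between the nearest known neighbours. The choice of technique is mine, purely for illustration; the function name `interpolate_missing` is made up, and gaps at the very start or end of the list are simply left alone.

```python
def interpolate_missing(values, unknown="x"):
    """Fill interior gaps by linear interpolation between the nearest
    known neighbours. Leading and trailing unknowns are left as they are."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v != unknown]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


result = interpolate_missing([1, 2, "x", 4, "x", "x", 10])
```

A baseline like this is also handy for judging whether a trained net does any better on values you hide on purpose.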

Question 3

I cannot give you a specific answer for that Python library, but as I see it, you have a neural net and you give it samples of the form

    [ i0 i1 ... in ] --> [ o0 o1 ... om ]
    (input vector)       (output vector)

At the moment you train the net with sample vectors of length 1. Your net does not know about the order of the numbers presented to it; that order only matters for the outcome of the trained net.

To get a network that knows about the sequence, you could present vectors of consecutive numbers as input and the single number you want to predict as output. Leave out every window that contains the X. Example:

    Sequence: 1 2 3 4 X 2 3 4 5 6 7 8
    Training with input length 3, output length 1:
    [1 2 3] -> 4
    [2 3 4] -> 5 (the second occurrence; the first one is followed by X and is not usable)
    [3 4 5] -> 6
    [4 5 6] -> 7
    [5 6 7] -> 8
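The listing above can be reproduced with a few lines of plain Python. This is only a sketch: `training_windows` is a name I made up, and it assumes the missing entries are marked with the string "X".

```python
def training_windows(seq, n_in, unknown="X"):
    """Return (inputs, target) pairs for every window of n_in + 1
    consecutive values that contains no unknown entry."""
    pairs = []
    for start in range(len(seq) - n_in):
        window = seq[start:start + n_in + 1]
        if unknown in window:
            continue  # skip windows touching a missing value
        pairs.append((window[:-1], window[-1]))
    return pairs


seq = [1, 2, 3, 4, "X", 2, 3, 4, 5, 6, 7, 8]
pairs = training_windows(seq, n_in=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```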

I think that with this, your net can adapt a little to the input sequence. How best to extract the right training sequences as input is something I have to leave to the domain expert (you).