Training to full convergence by backpropagation is often not what one actually wants, the reason being overfitting. To obtain better generalization performance, approaches such as weight decay or early stopping are commonly used.
Against this background, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use common parameters for the learning rate, etc.). You end up with $N$ trained networks $f_i(x)$.
Next, you need a scheme to combine the results. Choose $F(x)=\sum_{i=1}^N \alpha_i f_i(x)$, then use least squares to fit the parameters $\alpha_i$ such that $\sum_{j=1}^M \big(F(x_j) - y_j\big)^2$ is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements $M$ and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
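To make the recombination step concrete, here is a minimal sketch in Python/NumPy. It assumes the $N$ trained networks are available as callables `networks`, each mapping an input array of $M$ samples to an array of $M$ predictions, and that `X` and `y` hold the combined training inputs and targets; these names are illustrative placeholders, not a fixed API.

```python
import numpy as np

def combine_least_squares(networks, X, y):
    # Column i of A holds the predictions f_i(x_j) for all inputs x_j.
    A = np.column_stack([f(X) for f in networks])
    # Minimize sum_j (F(x_j) - y_j)^2 over the alpha_i; np.linalg.lstsq
    # solves this via an SVD internally, which is linear in M here.
    alpha, *_ = np.linalg.lstsq(A, y, rcond=None)
    return alpha

# Combined predictor: F(x) = sum_i alpha_i * f_i(x)
# F = lambda x: np.column_stack([f(x) for f in networks]) @ alpha
```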
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number of your choosing. I would experiment with values roughly between 1 and 20 here. Note that the suggestion above uses infinity, so to speak, but then replaces the recombination step with something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you aim for in the end is a single trained network, and the split data is used to estimate it.
To collect, one often does the following:
(i) In each thread, use the current (global) network weights to estimate the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this only works for a single backpropagation iteration in the threads). Then start again with (i), using the newly calculated global weights in each thread. Repeat until convergence.
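A minimal sketch of this loop, assuming the network weights fit into a single flat NumPy vector. The helper `backprop_deltas` stands in for real backpropagation; here it is stubbed out with the gradient of a linear least-squares model so the sketch runs as-is:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def backprop_deltas(w, X, y):
    # Stand-in for real backpropagation: gradient of the squared error
    # of a linear model. A real implementation would backpropagate
    # through the network instead.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parallel_train(weights, shards, learning_rate=0.01, n_rounds=100):
    # `shards` is a list of (X, y) pairs, one per thread/core.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(n_rounds):
            w = weights  # current global weights, shared by all threads

            # (i) Each thread estimates deltas on its own shard and
            # computes thread-local new weights.
            def local_step(shard):
                X, y = shard
                return w - learning_rate * backprop_deltas(w, X, y)

            local_weights = list(pool.map(local_step, shards))

            # (ii) Average thread-local weights into new global weights.
            weights = np.mean(local_weights, axis=0)
    return weights
```

Note that in CPython, threads only give a real speedup if the backpropagation itself releases the GIL; a process pool would be the more realistic choice, but the structure of the loop stays the same.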
This is a form of iterative optimization. Variations of this algorithm:
- Instead of always using the same split, use random splits at each iteration step (... or at every n-th iteration). Or, in the spirit of random forests, use only a random subset of the data.
- Play around with the number of iterations in a single thread (as mentioned in point 1 above).
- Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training error, or some kind of least squares as above); a sketch of error-based weighting follows this list.
- ... plus many more choices, as with any complex optimization ...
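As one example of the error-based weighting mentioned above, here is a hedged sketch; `local_weights` and `train_errors` are assumed placeholders for the per-thread weight vectors and their training errors, not part of any fixed API:

```python
import numpy as np

def error_weighted_average(local_weights, train_errors, eps=1e-12):
    # Inverse-error weights, normalized to sum to one, so threads that
    # fit their shard better contribute more to the global weights.
    w = 1.0 / (np.asarray(train_errors) + eps)
    w /= w.sum()
    # Weighted average replacing the plain mean from step (ii).
    return np.average(np.asarray(local_weights), axis=0, weights=w)
```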