Tested
I've done some testing of my hypothesis. The two approaches perform almost identically, but the first alternative shows a small, consistent improvement.
Memory complexity of the momentum data structure:
- Approach 1:
O( instances * weights )
- Approach 2:
O( weights )
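The difference between the two layouts can be sketched as follows. This is a minimal illustration, not the actual backprop.py: the function names, the momentum update rule, and the constants are assumptions. It is plain Python so it runs under pypy as well.

```python
# Hypothetical sketch of the two momentum layouts (names and update
# rule are assumptions, not taken from backprop.py).

N_INSTANCES, N_WEIGHTS = 4, 3
ALPHA, MU = 0.1, 0.9  # assumed learning rate and momentum factor

def update_per_instance(weights, momentum, instance, gradient):
    """Approach 1: one momentum vector per instance -> O(instances * weights)."""
    m = momentum[instance]
    for j, g in enumerate(gradient):
        m[j] = MU * m[j] - ALPHA * g
        weights[j] += m[j]

def update_shared(weights, momentum, gradient):
    """Approach 2: a single shared momentum vector -> O(weights)."""
    for j, g in enumerate(gradient):
        momentum[j] = MU * momentum[j] - ALPHA * g
        weights[j] += momentum[j]

weights = [0.0] * N_WEIGHTS
momentum_a1 = [[0.0] * N_WEIGHTS for _ in range(N_INSTANCES)]  # per instance
momentum_a2 = [0.0] * N_WEIGHTS                                # shared

update_per_instance(weights, momentum_a1, 0, [1.0, -1.0, 0.5])
update_shared(weights, momentum_a2, [1.0, -1.0, 0.5])
```

The per-instance table holds `N_INSTANCES * N_WEIGHTS` floats, while the shared vector holds only `N_WEIGHTS` floats, which is where the memory-complexity gap comes from.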
Result:
Each round uses a predefined initial weight set; both versions were trained from the same initial weights.
$ pypy backprop.py # First approach
Round: 1/10 Required epochs: 40995
Round: 2/10 Required epochs: 40997
Round: 3/10 Required epochs: 40996
Round: 4/10 Required epochs: 40997
Round: 5/10 Required epochs: 40997
Round: 6/10 Required epochs: 40997
Round: 7/10 Required epochs: 40999
Round: 8/10 Required epochs: 40996
Round: 9/10 Required epochs: 40996
Round: 10/10 Required epochs: 40997
$ pypy backprop.py # Second approach
Round: 1/10 Required epochs: 41070
Round: 2/10 Required epochs: 41072
Round: 3/10 Required epochs: 41069
Round: 4/10 Required epochs: 41069
Round: 5/10 Required epochs: 41070
Round: 6/10 Required epochs: 41071
Round: 7/10 Required epochs: 41072
Round: 8/10 Required epochs: 41069
Round: 9/10 Required epochs: 41070
Round: 10/10 Required epochs: 41071
As the results show, the second approach (which has lower memory complexity) requires a few more epochs of training to reach the required precision.
Conclusion
The increased memory complexity of the first approach is probably not a worthwhile trade-off for such a minor training improvement.