As to your original question: you are right that no forwarding path avoids a stall between the first lw and the add. The loaded value only exists at the end of lw's MEM stage, so the add has to wait for it and can then receive it by forwarding. Between that add and the following sw, the result is available at the end of EX and can then be forwarded with 1 stall. The same reasoning applies to the remaining instructions.
As to your additional question in the comments: the loop stalls mostly on the lw instructions, because the loaded word is only written to the register file in WB, or is available for forwarding after MEM. The loop loads/stores 4 values, so instead of looping, rename registers and start your code with several consecutive lw into, say, $t0 - $t2. Once a loaded value has been written back, or can be forwarded, do the add on it, and sw the result as soon as it is available. So yes, your code will be much longer, but it will execute faster.
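To make the effect of that reordering concrete, here is a rough Python sketch that counts load-use stalls for two instruction orders. Everything in it is an assumption for illustration: the loop body (lw, add, sw), the registers $t0 to $t3, $a0 and $s0, and the stall rule itself, which only models the classic case where an instruction reads the register that the immediately preceding lw is loading. Any other stalls your particular pipeline has are not counted.

```python
# Each instruction is (opcode, destination register or None, source registers).
# The loop body and all register names are hypothetical, not taken from the question.

def load_use_stalls(program):
    """Count one-cycle bubbles: an instruction reads the register that the
    immediately preceding lw is still loading (classic load-use hazard)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


def in_order(n):
    """n iterations written out one after another: lw, add, sw, lw, add, sw, ..."""
    program = []
    for _ in range(n):
        program.append(("lw", "$t0", ["$a0"]))
        program.append(("add", "$t0", ["$t0", "$s0"]))
        program.append(("sw", None, ["$t0", "$a0"]))
    return program


def rescheduled(n):
    """Same work with renamed registers: all lw first, then all add, then all sw."""
    regs = [f"$t{i}" for i in range(n)]
    loads = [("lw", r, ["$a0"]) for r in regs]
    adds = [("add", r, [r, "$s0"]) for r in regs]
    stores = [("sw", None, [r, "$a0"]) for r in regs]
    return loads + adds + stores


print(load_use_stalls(in_order(4)))     # 4: every add directly follows its lw
print(load_use_stalls(rescheduled(4)))  # 0: each add is several slots after its lw
```

Both orders contain the same 12 instructions; the rescheduled one simply never asks for a loaded value in the slot right after its lw.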
By the way, you seem to be using Patterson/Hennessy. That book has very good diagrams illustrating this; maybe have a look.