In general, your approach is correct. Assuming you do not have a data dependency or transfer of control instruction immediately following the target, you can generally follow the following convention.
You did this, probably without realizing:
Go to the target of the branch, copy the instruction into the delay slot and annul the branch. As you stated, you annul the branch to prevent the instruction from executing if the branch isn't taken. Then move the instruction at the target above the label.
In your code, following the above steps, you would do the following:
I removed your comments so you can explicitly see what I changed.
mov %g0, %l2
cmp %l0, %l1
bge,a END_LOOP
mov %l2, %o0
mov %l0, %o0 ! 3. Move the instruction formerly after the loop
! above the label
LOOP:
[ mov %l0, %o0 ] ! Instruction was here
call .mul
mov %l1, %o1
add %l2, %o0, %l2
inc %l0
cmp %l0, %l1
bl,a LOOP ! 1. Go to the target and copy that instruction into
! they delay slot.
! 2. Annul the branch
mov %l0, %o0 ! Instruction formerly after LOOP:
mov %l2, %o0
END_LOOP:
call .mul
mov %l1, %o1
mov %o0, %l3
If you examine the code carefully, you will see that the logic still holds and there is a systematic way to unwind the optimizations.
Regardless of whether or not you enter the loop, your code will still correctly execute the code following the loop.
This is the general approach to optimizing the code and is similar to what the compiler will do. The key is always to make sure a data dependency does not exist.