There's not much wrong with your code. Basically, you just need to keep proper track of the read and write pointer locations (remember to advance them by the strides). That requires two nested loops one way or another. Also fix the divisor: it should be 4.
I've found the following approach useful: processing one row at a time carries little speed penalty, but makes it easier to plug in different kernels.
iptr = input_image;  in_stride = in_width;
optr = output_image; out_stride = out_width;
for (j = 0; j < out_height; j++) {
    // in_width also serves as in_stride here; the function needs it
    // because it reads the second row at iptr + in_stride
    process_row(iptr, optr, in_width);
    iptr += in_stride * 2;  // two input rows are consumed per output row
    optr += out_stride;
}
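
For completeness, here is a minimal sketch of what such a row kernel could look like for a 2x downscale. The body of `process_row`, the separate `in_stride` parameter, and the `downscale_2x` wrapper are my assumptions, not something taken from your code; it also assumes 8-bit single-channel pixels and simply truncates when dividing by 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row kernel: averages each 2x2 block taken from the input
   row at iptr and the row below it (iptr + in_stride) into one output pixel. */
void process_row(const uint8_t *iptr, uint8_t *optr,
                 size_t in_width, size_t in_stride)
{
    const uint8_t *top = iptr;
    const uint8_t *bottom = iptr + in_stride;
    size_t out_width = in_width / 2;

    for (size_t i = 0; i < out_width; i++) {
        unsigned sum = top[2 * i] + top[2 * i + 1]
                     + bottom[2 * i] + bottom[2 * i + 1];
        optr[i] = (uint8_t)(sum / 4);  /* four samples, so divide by 4 */
    }
}

/* Hypothetical wrapper: the outer loop walks the rows and advances both
   pointers by their strides, the inner loop lives in process_row. */
void downscale_2x(const uint8_t *input_image, uint8_t *output_image,
                  size_t in_width, size_t in_height,
                  size_t in_stride, size_t out_stride)
{
    assert(in_stride >= in_width);
    assert(out_stride >= in_width / 2);

    const uint8_t *iptr = input_image;
    uint8_t *optr = output_image;
    size_t out_height = in_height / 2;

    for (size_t j = 0; j < out_height; j++) {
        process_row(iptr, optr, in_width, in_stride);
        iptr += in_stride * 2;  /* skip the two input rows just consumed */
        optr += out_stride;     /* move to the next output row */
    }
}
```

Passing the stride separately from the width keeps the kernel usable on images with padded rows or on a sub-rectangle of a larger buffer; if your rows are tightly packed, the two values are the same.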