If you just want a solution that works, ZoneMinder and Motion are two packages that run under Linux using the Video4Linux interface.
If you need to roll your own for some reason, there are a lot of techniques you can use. You are largely on the right track with what you've outlined, but you're missing a few important details.
Since the camera is stationary, keep a record of the last N frames and average them to form your "background" image.
http://opencv.willowgarage.com/documentation/cpp/imgproc_motion_analysis_and_object_tracking.html
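As a rough sketch of the background step, something like the following keeps the last N grayscale frames and returns their per-pixel mean. It uses plain NumPy rather than OpenCV, and the class name `RollingBackground` and its parameter `n` are made up for illustration. OpenCV's `accumulateWeighted` gives you a similar (exponentially weighted) running average without having to store the frames.

```python
from collections import deque

import numpy as np


class RollingBackground:
    """Per-pixel mean of the last n frames (hypothetical helper, not an OpenCV API)."""

    def __init__(self, n):
        self.frames = deque(maxlen=n)  # the oldest frame drops out automatically

    def update(self, frame):
        # Store as float so the mean does not truncate or overflow uint8 values.
        self.frames.append(frame.astype(np.float32))
        return self.background()

    def background(self):
        return np.mean(np.stack(list(self.frames)), axis=0)


bg = RollingBackground(n=2)
for value in (0, 10, 20):
    model = bg.update(np.full((2, 2), value, dtype=np.uint8))
```

With `n=2`, only the last two frames (10 and 20) contribute to `model`. A larger N makes the background adapt more slowly to lighting changes, so it is worth treating as a tuning knob.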
Subtract the background from the current image. What you're left with we'll call the foreground.
http://opencv.willowgarage.com/documentation/cpp/core_operations_on_arrays.html#cv-absdiff
Optionally perform dilation or erosion (or both) to remove noise or join nearly connected regions.
http://opencv.willowgarage.com/documentation/image_filtering.html#dilate
Threshold the foreground image to determine what's important and what's not.
http://docs.opencv.org/doc/tutorials/imgproc/threshold/threshold.html
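The subtraction and threshold steps above could be sketched like this, assuming single-channel (grayscale) frames. This is NumPy only; `foreground_mask` is a hypothetical helper and the threshold of 25 is an arbitrary starting value, not a recommendation. In OpenCV the same thing would be `absdiff` followed by `threshold` with `THRESH_BINARY`.

```python
import numpy as np


def foreground_mask(background, frame, thresh=25):
    """Return a 0/255 mask of pixels that differ from the background by more than thresh."""
    # Work in a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return np.where(diff > thresh, 255, 0).astype(np.uint8)


background = np.zeros((3, 3), dtype=np.uint8)
frame = background.copy()
frame[1, 1] = 200  # a pixel that really changed
frame[0, 0] = 10   # small change below the threshold (e.g. sensor noise)
mask = foreground_mask(background, frame)
```

Here only the pixel at `[1, 1]` ends up in the mask; the small change at `[0, 0]` is discarded.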
Optionally use the findContours function to get a description of what's "moved".
http://docs.opencv.org/doc/tutorials/imgproc/shapedescriptors/find_contours/find_contours.html
Once you have the contours you can also find the bounding rectangles if that's more what you're going for.
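To make the bounding rectangle idea concrete, here is a minimal pure-Python/NumPy stand-in that flood-fills each blob in a binary mask and reports its box as `(x, y, w, h)`. The function `bounding_rects` is hypothetical and uses 4-connectivity for simplicity; in practice you would let OpenCV's `findContours` and `boundingRect` do this, since they are much faster.

```python
from collections import deque

import numpy as np


def bounding_rects(mask):
    """Return (x, y, w, h) for each 4-connected blob of nonzero pixels in mask."""
    height, width = mask.shape
    seen = np.zeros(mask.shape, dtype=bool)
    rects = []
    for y in range(height):
        for x in range(width):
            if mask[y, x] == 0 or seen[y, x]:
                continue
            # Flood-fill this blob, tracking its extent as we go.
            queue = deque([(y, x)])
            seen[y, x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            rects.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return rects


mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:3, 1:4] = 255  # blob 1: rows 1-2, columns 1-3
mask[4, 6] = 255      # blob 2: a single pixel
rects = bounding_rects(mask)
```

Filtering `rects` by minimum width and height is a cheap way to ignore specks that survive the threshold.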
This will not be perfect. When debugging or optimizing, you have to look at the output after every step to figure out what's working and what isn't, so spend some time building the infrastructure to make that easier. Once you have source data and most of a working pipeline, tuning to get the results you want is quite doable.
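One possible shape for that infrastructure, as a sketch: run the frame through a list of named stages and keep every intermediate result so you can inspect any of them. The `run_pipeline` helper and the two stage lambdas are made up for this example; swap in your real steps.

```python
import numpy as np


def run_pipeline(frame, stages):
    """Run frame through (name, function) stages and keep every intermediate result."""
    outputs = {"input": frame}
    current = frame
    for name, func in stages:
        current = func(current)
        outputs[name] = current  # kept so it can be displayed or saved later
    return outputs


background = np.zeros((2, 2), dtype=np.int16)
stages = [
    ("foreground", lambda img: np.abs(img.astype(np.int16) - background)),
    ("mask", lambda img: (img > 25).astype(np.uint8) * 255),
]
frame = np.array([[0, 30], [5, 100]], dtype=np.uint8)
outputs = run_pipeline(frame, stages)
for name, image in outputs.items():
    print(name, image.shape, image.min(), image.max())
```

In a real setup you would replace the `print` with something like OpenCV's `imshow` or `imwrite` per stage, so you can see exactly which step is letting noise through or eating the motion you care about.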