1) I would say it is not possible. From a computer vision perspective (without 3D depth information), you might want to have a look at the problems studied in Gestalt psychology (http://en.wikipedia.org/wiki/Gestalt_psychology), where you will find a lot of ambiguity even in very simple images. Analyzing a real-world scene without any background knowledge (and often even with it) is much harder. Sometimes even a human cannot tell whether they are seeing one object or two, or the answer depends on interpretation, and you have to consider that a human has many years of experience and a lot of background knowledge.
3) What you want to do is image segmentation. Without any background knowledge, I would suggest some edge detection (e.g. Canny edge detection) applied to both the depth image and the color/grayscale image, with the two results combined. You will then have to group the detected edges (have a look at Gestalt psychology again) and/or extract contours, but in the end you will still be far from detecting all objects and sub-objects.