The output stream format of Kinect

Question 1

The video stream delivers each frame as 4 bytes per pixel in BGRA order (blue, green, red, alpha), with the pixels stored row by row, left to right and top to bottom. A full uncompressed 640x480 frame therefore occupies 640 x 480 x 4 = 1,228,800 bytes.
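To make the layout concrete, here is a minimal Java sketch of how a pixel could be located in such a buffer. It assumes the frame has already been copied into a plain `byte[]`; the class and method names (`BgraFrame`, `offset`, `argbAt`) are made up for illustration and are not part of any Kinect library.

```java
final class BgraFrame {
    static final int BYTES_PER_PIXEL = 4;

    /** Byte offset of pixel (x, y) in a row-by-row BGRA buffer. */
    static int offset(int x, int y, int width) {
        return (y * width + x) * BYTES_PER_PIXEL;
    }

    /** Packs the four bytes of pixel (x, y) into a 0xAARRGGBB int. */
    static int argbAt(byte[] frame, int x, int y, int width) {
        int i = offset(x, y, width);
        int b = frame[i] & 0xFF;      // byte 0: blue
        int g = frame[i + 1] & 0xFF;  // byte 1: green
        int r = frame[i + 2] & 0xFF;  // byte 2: red
        int a = frame[i + 3] & 0xFF;  // byte 3: alpha
        return (a << 24) | (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        int width = 640, height = 480;
        byte[] frame = new byte[width * height * BYTES_PER_PIXEL];

        // Write a test pixel at (10, 20) so there is something to read back.
        int i = offset(10, 20, width);
        frame[i] = 1;               // blue
        frame[i + 1] = 2;           // green
        frame[i + 2] = 3;           // red
        frame[i + 3] = (byte) 255;  // alpha

        System.out.println("frame bytes: " + frame.length);
        System.out.println("pixel (10,20): 0x"
                + Integer.toHexString(argbAt(frame, 10, 20, width)));
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed.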

The depth stream delivers each frame as 2 bytes per pixel, an unsigned 16-bit value (unsigned short). When player tracking is enabled, the 3 least significant bits hold the index of the player at that pixel, and the remaining 13 bits hold the distance from the camera plane in millimeters, so shift the value right by 3 bits to get the depth. A full uncompressed 320x240 frame occupies 320 x 240 x 2 = 153,600 bytes.
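As a rough sketch, splitting a depth pixel into distance and player index could look like this in Java. Two things here are assumptions rather than facts from the SDK documentation quoted in this answer: the two bytes of each pixel are taken to be little-endian, and the width of the player-index field is passed in as a parameter. The Kinect for Windows SDK 1.x defines that width as 3 bits (`DepthImageFrame.PlayerIndexBitmaskWidth`), which is the value used in the example. The class and method names are hypothetical.

```java
final class DepthPixel {
    static final int BYTES_PER_PIXEL = 2;

    /** Reads the unsigned 16-bit value of pixel (x, y); little-endian is assumed. */
    static int rawAt(byte[] frame, int x, int y, int width) {
        int i = (y * width + x) * BYTES_PER_PIXEL;
        return (frame[i] & 0xFF) | ((frame[i + 1] & 0xFF) << 8);
    }

    /** Distance in millimeters: the bits above the player-index field. */
    static int depthMillimeters(int raw, int playerBits) {
        return raw >>> playerBits;
    }

    /** Player index: the lowest playerBits bits of the raw value. */
    static int playerIndex(int raw, int playerBits) {
        return raw & ((1 << playerBits) - 1);
    }

    public static void main(String[] args) {
        int width = 320, height = 240;
        int playerBits = 3; // assumed field width, see the note above
        byte[] frame = new byte[width * height * BYTES_PER_PIXEL];

        // Encode a test pixel: 1200 mm away, belonging to player 2.
        int raw = (1200 << playerBits) | 2;
        frame[0] = (byte) (raw & 0xFF);
        frame[1] = (byte) (raw >>> 8);

        int value = rawAt(frame, 0, 0, width);
        System.out.println("depth mm: " + depthMillimeters(value, playerBits));
        System.out.println("player:   " + playerIndex(value, playerBits));
    }
}
```

Keeping the field width as a parameter means the same code works if your stream is depth-only (pass 0) or uses a different packing.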

If you work in Java, you can read the frames through a Java library for the Kinect SDK and then compress them with standard image compression algorithms.
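For example, once a color frame is available as a `byte[]`, it could be encoded as PNG with the standard `javax.imageio` classes. This is only a sketch: how you obtain the buffer depends on the Kinect wrapper you use, `FrameCompressor` and `bgraToPng` are hypothetical names, and the alpha byte is dropped on the assumption that it carries no useful information.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class FrameCompressor {
    /** Encodes a row-by-row BGRA buffer as a PNG (lossless), ignoring alpha. */
    static byte[] bgraToPng(byte[] frame, int width, int height) {
        BufferedImage image =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int i = (y * width + x) * 4;
                int b = frame[i] & 0xFF;
                int g = frame[i + 1] & 0xFF;
                int r = frame[i + 2] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int width = 640, height = 480;
        byte[] frame = new byte[width * height * 4]; // all-black test frame
        byte[] png = bgraToPng(frame, width, height);
        System.out.println("raw bytes: " + frame.length);
        System.out.println("png bytes: " + png.length);
    }
}
```

PNG is lossless, which matters if you need exact pixel values back. Passing "jpg" instead of "png" to `ImageIO.write` gives smaller but lossy output. The depth frames are better kept in a lossless format, since lossy compression would corrupt the packed depth and player bits.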

Question 2

The depth data behind the point cloud is an uncompressed 16-bit image: 13 bits of depth, plus user tracking (player index) data in the 3 least significant bits. That packing is specific to the Kinect.

However, there are several image types, and which one you get depends on your configuration: whether you're using near mode, what your video resolution is, and so on:

http://msdn.microsoft.com/en-us/library/nuiimagecamera.nui_image_type.aspx