The operating system's job is to let many components, both hardware and software, play nicely with each other. In general, userland programs can neither manipulate peripherals directly nor interfere with each other. I'm not familiar with the specific setup that you're citing, but it doesn't sound unusual.
The USB camera notifies the operating system that it has a new frame. When the kernel (driver) notices this, it will copy the frame with I/O commands into RAM. Since this RAM was allocated by the driver, userland programs won't be able to see or read it, thanks to virtual memory. To summarise quickly, the address 0x1000 in the kernel and the address 0x1000 in a program are virtual addresses that map to physically distinct locations in RAM. The kernel will then copy the frame into the memory of any process that is expecting input from the camera and then notify it (in this case catusb).
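To make the virtual memory point concrete, here is a minimal sketch in Python, assuming a Unix-like system where `os.fork` is available. The function name `same_address_different_memory` is made up for illustration and has nothing to do with the system in the question. It shows two processes seeing different data at the same virtual address.

```python
import ctypes
import os


def same_address_different_memory():
    """Fork a child and compare one virtual address across both processes."""
    buf = ctypes.create_string_buffer(b"parent", 16)
    parent_addr = ctypes.addressof(buf)

    read_fd, write_fd = os.pipe()  # used only to report the child's view
    pid = os.fork()
    if pid == 0:
        # Child: overwrite the buffer, then report its address and contents.
        # The write lands in the child's own copy of the page.
        try:
            os.close(read_fd)
            buf.value = b"child"
            report = f"{ctypes.addressof(buf)}:{buf.value.decode()}"
            os.write(write_fd, report.encode())
        finally:
            os._exit(0)

    # Parent: collect the child's report, then look at our own buffer.
    os.close(write_fd)
    report = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        report += chunk
    os.close(read_fd)
    os.waitpid(pid, 0)

    child_addr, child_value = report.decode().split(":")
    return parent_addr, int(child_addr), buf.value.decode(), child_value
```

Both processes report the same numeric address, yet the parent still reads "parent" after the child has written "child" there. The same isolation is what keeps a userland program from reading the driver's buffer.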
Likewise, since xform, detect and hdinput exist as separate processes, they must use inter-process communication to pass data along. Since the operating system must keep programs isolated from one another, each process has to go through the kernel (for example via pipes, sockets or shared memory that the kernel sets up) to do so.
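As a rough illustration of one process handing data to another through the kernel, here is a small sketch using pipes, again assuming a Unix-like system. I don't know which IPC mechanism the system in the question actually uses. `run_stage` and `read_all` are hypothetical helpers, and the "frame" is just a short byte string.

```python
import os


def read_all(fd):
    """Read from fd until the writer closes its end."""
    data = b""
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            return data
        data += chunk


def run_stage(transform, payload):
    """Send payload to a child process, which returns transform(payload)."""
    to_child_r, to_child_w = os.pipe()
    from_child_r, from_child_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: plays the role of one processing stage.
        try:
            os.close(to_child_w)
            os.close(from_child_r)
            frame = read_all(to_child_r)              # kernel -> child copy
            os.write(from_child_w, transform(frame))  # child -> kernel copy
        finally:
            os._exit(0)

    # Parent: every transfer below is a system call into the kernel.
    os.close(to_child_r)
    os.close(from_child_w)
    os.write(to_child_w, payload)  # keep payloads small (below the pipe buffer size)
    os.close(to_child_w)           # signals end of input to the child
    result = read_all(from_child_r)
    os.close(from_child_r)
    os.waitpid(pid, 0)
    return result
```

For example, `run_stage(bytes.upper, b"frame 1")` returns `b"FRAME 1"`. Neither process ever touches the other's memory. The data is copied into the kernel by `os.write` and out again by `os.read`, and that copying is where the overhead comes from.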
There's nothing unusual here. I imagine they are just spelling it out because gesture recognition is time-critical and doing it this way adds some overhead from the extra copies and context switches.