According to http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-vector-microarchitecture:
Each VPU has 128 entry 512-bit vector registers divided up among the threads, thus getting 32 entries per thread. These are hard-partitioned. There are eight 16-bit mask registers per thread which are part of the vector register file. The mask registers act as a filter per element for the 16 elements and thus allows one to control which of the 16 32-bit elements are active during a computation. For double precision the mask bits are the bottom 8 bits.
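To make the per-element filtering concrete, here is a minimal portable C model of a write-masked add. It is only a sketch of the behaviour described in the quote, not Intel's API: the names model_mask16, model_mask_add_ps and model_mask_add_pd are hypothetical, and plain arrays stand in for the 512-bit registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit mask register; not an Intel type. */
typedef uint16_t model_mask16;

/* Single precision: 16 lanes of 32 bits, so all 16 mask bits are used.
   Lanes whose mask bit is clear keep their previous value in dst. */
void model_mask_add_ps(float dst[16], const float a[16], const float b[16],
                       model_mask16 k)
{
    for (size_t i = 0; i < 16; ++i) {
        if ((k >> i) & 1u) {
            dst[i] = a[i] + b[i];
        }
    }
}

/* Double precision: 8 lanes of 64 bits, so only the bottom 8 mask bits
   are consulted; bits 8..15 of k are ignored. */
void model_mask_add_pd(double dst[8], const double a[8], const double b[8],
                       model_mask16 k)
{
    for (size_t i = 0; i < 8; ++i) {
        if ((k >> i) & 1u) {
            dst[i] = a[i] + b[i];
        }
    }
}
```

With a mask of 0x8001, the single-precision version updates only lanes 0 and 15. With a mask of 0xFF01, the double-precision version updates only lane 0, because the upper eight bits do not correspond to any lane.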
Intel doesn't provide any intrinsics for operating on __mmask8 types; all of the mask intrinsics take __mmask16. I therefore assume that we're expected to use the __mmask16 intrinsics to manipulate __mmask8 values as well. This seems to work, but I've had very little experience with these so far.
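One thing to watch for when doing this can be shown with ordinary integers. The following is a portable sketch, not the Intel intrinsics themselves: sketch_mask16, sketch_mask8, sketch_kand, sketch_knot and sketch_low8 are hypothetical names that model 16-bit mask operations applied to a mask whose meaningful part is the bottom 8 bits.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two mask widths; not Intel types. */
typedef uint16_t sketch_mask16;
typedef uint8_t  sketch_mask8;

/* 16-bit AND: the low 8 bits of the result depend only on the low 8 bits
   of the inputs, so it is safe for 8-lane masks. */
sketch_mask16 sketch_kand(sketch_mask16 a, sketch_mask16 b)
{
    return (sketch_mask16)(a & b);
}

/* 16-bit NOT: the low 8 bits are correct, but the upper 8 bits become set
   whenever they were clear in the input. */
sketch_mask16 sketch_knot(sketch_mask16 a)
{
    return (sketch_mask16)~a;
}

/* Keep only the bits that correspond to the 8 double-precision lanes. */
sketch_mask8 sketch_low8(sketch_mask16 k)
{
    return (sketch_mask8)(k & 0xFFu);
}
```

In this model, AND-ing 0x00F0 with 0x003C gives 0x0030, which is the same answer an 8-bit AND would give. Negating 0x00FF gives 0xFF00, though, so a test such as "is the mask zero?" on the 16-bit result would say no even though none of the eight double-precision lanes is active. Clearing the upper bits first, as sketch_low8 does, avoids that.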