OP: can anyone poke holes in my theory?
While reading the first half, I worked out a solution using a bit array to record usage, and it turned out to be effectively the same thing you outline in the second half.
So here is the hole: avoid hard-coding a 16-byte block. Allow your bitmap to work with, say, 20- or 24-byte blocks at the beginning of your development. During this time, you may want to put tag information and sentinels on the edges of each block, so you can more readily track down double free(), usage outside the allocation, etc. Of course, the price is a smaller effective pool.
After your debug stage, go with your 16-byte solution with confidence.
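A minimal sketch of what I mean by a non-hard-coded block size with sentinels on the edges. All names here (GUARD, GUARD_BYTE, block_arm, block_guards_ok) are made up for illustration, and the 4-byte guard width is just an assumption that gives a 24-byte debug block. Set GUARD to 0 and you are back to the plain 16-byte block.

```c
#include <stdint.h>
#include <string.h>

enum {
    POOL_BYTES  = 2048,
    PAYLOAD     = 16,  /* bytes the caller actually gets */
    GUARD       = 4,   /* assumed sentinel width per edge; 0 for release */
    BLOCK_BYTES = GUARD + PAYLOAD + GUARD,  /* 24 in debug, 16 in release */
    BLOCK_COUNT = POOL_BYTES / BLOCK_BYTES  /* 85 in debug, 128 in release */
};

#define GUARD_BYTE 0xA5  /* arbitrary sentinel value (assumption) */

static uint8_t pool[POOL_BYTES];

/* Start of raw block i, including its leading guard. */
static uint8_t *block_raw(unsigned i)
{
    return pool + (size_t)i * BLOCK_BYTES;
}

/* Paint both guards of block i and return the payload pointer. */
static void *block_arm(unsigned i)
{
    uint8_t *raw = block_raw(i);
    memset(raw, GUARD_BYTE, GUARD);
    memset(raw + GUARD + PAYLOAD, GUARD_BYTE, GUARD);
    return raw + GUARD;
}

/* Return 1 if both guards are intact, 0 if something wrote past an edge. */
static int block_guards_ok(unsigned i)
{
    const uint8_t *raw = block_raw(i);
    for (unsigned k = 0; k < GUARD; k++) {
        if (raw[k] != GUARD_BYTE || raw[GUARD + PAYLOAD + k] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

You would call block_arm() when handing a block out and block_guards_ok() when it comes back to free(). Because the bitmap only needs BLOCK_COUNT bits, nothing else has to change when the block size does.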
Be sure to keep a running total of allocated bytes, with 0 <= total allocation <= (2048 - overhead), and provide a way to check that total against your bitmap.
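Something along these lines would do for the cross-check. It is only a sketch: the names are hypothetical, and I am assuming the bitmap is kept outside the pool, so OVERHEAD is 0 (adjust it if your bookkeeping lives inside the 2048 bytes). It also assumes the caller never marks the same block twice.

```c
#include <stdint.h>

enum {
    POOL_BYTES  = 2048,
    BLOCK_BYTES = 16,
    OVERHEAD    = 0,  /* assumption: bitmap lives outside the pool */
    BLOCK_COUNT = (POOL_BYTES - OVERHEAD) / BLOCK_BYTES  /* 128 */
};

static uint8_t  bitmap[BLOCK_COUNT / 8];  /* bit set = block in use */
static unsigned bytes_in_use;             /* running total kept by alloc/free */

static void mark_used(unsigned i)
{
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
    bytes_in_use += BLOCK_BYTES;
}

static void mark_free(unsigned i)
{
    bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
    bytes_in_use -= BLOCK_BYTES;
}

/* Count the bits that are set in the bitmap. */
static unsigned bitmap_popcount(void)
{
    unsigned n = 0;
    for (unsigned i = 0; i < BLOCK_COUNT; i++)
        n += (bitmap[i / 8] >> (i % 8)) & 1u;
    return n;
}

/* Return 1 if the running total is in range and agrees with the bitmap. */
static int pool_consistent(void)
{
    return bytes_in_use <= POOL_BYTES - OVERHEAD
        && bytes_in_use == bitmap_popcount() * BLOCK_BYTES;
}
```

Call pool_consistent() from an assert at the top of your alloc and free routines during debug. If the counter and the bitmap ever disagree, something has scribbled on one of them.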
For debug, consider filling each freed block with a pattern such as "0xDEAD" to help expose inadvertent use of freed memory.
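A rough sketch of the fill, assuming the final 16-byte block and a repeating 0xDE 0xAD byte pattern. The function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_BYTES = 16 };

/* Fill a freed block with the repeating bytes 0xDE 0xAD. */
static void poison_block(void *block)
{
    uint8_t *p = block;
    for (size_t k = 0; k < BLOCK_BYTES; k++)
        p[k] = (k & 1) ? 0xAD : 0xDE;
}

/* Return 1 if the block still holds the untouched fill pattern. */
static int block_is_poisoned(const void *block)
{
    const uint8_t *p = block;
    for (size_t k = 0; k < BLOCK_BYTES; k++) {
        if (p[k] != ((k & 1) ? 0xAD : 0xDE))
            return 0;
    }
    return 1;
}
```

Poison in free(), then check block_is_poisoned() just before handing the block out again. A failed check means somebody wrote through a stale pointer. Reads through a stale pointer show up as obviously bogus DEAD values in the debugger.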