Here are a few remarks:
General design: you mention "distributed" processing, but you don't specify whether you would run multiple instances of the Task A and Task B processors. With just one Task A processor and one Task B processor, overall throughput is determined by the slower of the two. I understand that B is slower on average, but it can occasionally be faster, so introducing a buffer between A and B looks like a good idea. So the design is fine if you want or need to stick to a single A and a single B instance, but if B really is the slower stage, you might consider running multiple instances (more B instances than A).
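To make the buffer idea concrete, here is a minimal sketch using plain `java.util.concurrent` (the class and method names are mine, not from your design): one fast A producer feeds a bounded queue, and several slower B instances drain it. When B falls behind, the buffer fills up and A blocks on `put`, which gives you backpressure for free.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BufferedPipeline {
    private static final int STOP = -1; // poison pill telling a B instance to exit

    // One Task A producer feeding `consumers` Task B instances through a
    // bounded buffer; returns how many items the B instances processed.
    static int run(int items, int consumers) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(16);
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);

        pool.submit(() -> { // the single, on-average-faster Task A instance
            try {
                for (int i = 0; i < items; i++) buffer.put(i); // blocks when the buffer is full
                for (int c = 0; c < consumers; c++) buffer.put(STOP);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        for (int c = 0; c < consumers; c++) {
            pool.submit(() -> { // Task B instances drain the buffer concurrently
                try {
                    int v;
                    while ((v = buffer.take()) != STOP) {
                        Thread.sleep(1); // simulate B's slower per-item work
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed=" + run(100, 2));
    }
}
```

In a real distributed setup the "buffer" would be a message queue or similar rather than an in-process `BlockingQueue`, but the sizing argument is the same: the bounded capacity absorbs the moments when B is temporarily faster or slower than A.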
Netty or other frameworks: yes, you could do this in Netty. However, I think you would have to implement the "TCP channel is free" signal yourself. I don't have much experience with them, but I would think that frameworks like http://akka.io/, which implement message passing and the actor model, would be interesting to look at.
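The core idea those frameworks give you can be sketched in a few lines (this is not the Akka API, just the underlying model): each "actor" owns a mailbox queue and a single thread, so it handles one message at a time and never needs shared locks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class MiniActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final Thread loop;

    // The behavior is invoked for one message at a time, in arrival order.
    public MiniActor(Consumer<String> behavior) {
        loop = new Thread(() -> {
            try {
                String msg;
                while (!(msg = mailbox.take()).equals("stop")) behavior.accept(msg);
            } catch (InterruptedException ignored) { }
        });
        loop.start();
    }

    public void tell(String msg) { mailbox.add(msg); } // asynchronous send

    public void join() throws InterruptedException { loop.join(); }

    public static void main(String[] args) throws InterruptedException {
        List<String> handled = new ArrayList<>();
        MiniActor b = new MiniActor(handled::add); // plays the role of the Task B processor
        b.tell("chunk-1");
        b.tell("chunk-2");
        b.tell("stop");
        b.join();
        System.out.println(handled);
    }
}
```

With a real actor framework you would also get supervision, remote actors, and configurable mailboxes on top of this, which is exactly what makes it a good fit for the A/B pipeline question above.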