The video encoder may accept N input frames before producing any output. In some cases N will be 1, and you will see an output frame shortly after providing a single input frame. Other codecs will want to gather up a fair bit of video data before they start producing output. It appears you've resolved your current situation by doubling up frames and discarding half the output, but if portability is a concern, be aware that different devices and different codecs will behave differently.
The codec-specific data (CSD) is provided in a buffer with the BUFFER_FLAG_CODEC_CONFIG flag set. MediaCodec does not document if or when such buffers will appear. (In fact, if you're using VP8, it doesn't appear at all.) For AVC, it arrives in the first buffer. If you're not interested in the CSD, just ignore any buffer with that flag set.
Because the buffer info flags apply to the entire buffer of data, the API doesn't provide a way to return a single buffer that has both CSD and encoded-frame data in it.
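As a rough sketch of the flag check described above: the class and method names here (CsdFilter, isCodecConfig) are made up for illustration, and the constant is copied from android.media.MediaCodec so the snippet compiles off-device. In a real drain loop you would test MediaCodec.BufferInfo.flags against MediaCodec.BUFFER_FLAG_CODEC_CONFIG directly.

```java
// Hypothetical helper, not part of the Android API.
final class CsdFilter {
    // Same value as android.media.MediaCodec.BUFFER_FLAG_CODEC_CONFIG,
    // duplicated here only so the sketch runs without the Android SDK.
    static final int BUFFER_FLAG_CODEC_CONFIG = 2;

    // Returns true if the output buffer holds codec-specific data
    // rather than an encoded frame. Flags are a bit mask, so test the
    // bit instead of comparing for equality.
    static boolean isCodecConfig(int flags) {
        return (flags & BUFFER_FLAG_CODEC_CONFIG) != 0;
    }
}
```

In the output loop, a buffer for which this returns true can be released without being written out if you don't need the CSD.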
Note also that the encoder is allowed to reorder output, so you might submit frames 0, 1, 2 and receive encoded data for 0, 2, 1. The easiest way to keep track is to supply a presentation timestamp (PTS, in microseconds) with each frame that uniquely identifies it. Some codecs will use the PTS value to adjust the encoding quality in an attempt to meet the bit rate goal, so you need to use reasonably "real" values, not a trivial integer counter.
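One way to do the bookkeeping, as a minimal sketch: derive the PTS from the frame index and a frame rate, and remember which index each PTS belongs to. FrameTracker and its methods are hypothetical names, and the sketch assumes a constant integer frame rate and that the encoder hands back the same presentationTimeUs you passed in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the Android API.
final class FrameTracker {
    private final int frameRate;
    // Maps each submitted PTS (microseconds) to its input frame index.
    private final Map<Long, Integer> pending = new HashMap<>();

    FrameTracker(int frameRate) {
        this.frameRate = frameRate;
    }

    // "Real" timestamp in microseconds for a frame index at a constant
    // frame rate. Long arithmetic avoids int overflow.
    static long ptsUs(int frameIndex, int frameRate) {
        return frameIndex * 1_000_000L / frameRate;
    }

    // Call when queueing an input frame; pass the returned value as the
    // presentation time of that input buffer.
    long onFrameSubmitted(int frameIndex) {
        long pts = ptsUs(frameIndex, frameRate);
        pending.put(pts, frameIndex);
        return pts;
    }

    // Call with the PTS of an output buffer; returns the matching input
    // frame index, or -1 if the PTS was never submitted (or was already
    // matched).
    int onOutput(long ptsUs) {
        Integer index = pending.remove(ptsUs);
        return index == null ? -1 : index;
    }
}
```

Because the lookup is keyed on the timestamp, it doesn't matter what order the encoder emits the frames in.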