With bi-directional frames in temporal compression, decoding order (order in which the data needs to be transmitted for sequential decoding) is different from presentation order. This explains the effect you are referring to.
On the picture below, you need data for frame P2 to decode frame B1, so when it comes to transmission, P2 goes ahead.
See more on this: B-Frames in DirectShow
(source: monogram.sk)Since MPEG-2 had appeared a new frame type was introduced - the bi-directionally predicted frame - B frame. As the name suggests the frame is derived from at least two other frames - one from the past and one from the future (Figure 2).
Since the B1 frame is derived from I0 and P2 frames both of them must be available to the decoder prior to the start of the decoding of B1. This means the transmission/decoding order is not the same as the presentation order. That’s why there are two types of time stamps for video frames - PTS (presentation time stamp) and DTS (decoding time stamp).