why DCT transform is preferred over other transforms in video/image compression

Question 1

Because cos(0) is 1, the first (0th) coefficient of DCT-II is the mean of the values being transformed. This makes the first coefficient of each 8x8 block represent the average tone of its constituent pixels, which is obviously a good start. Subsequent coefficients add increasing levels of detail, starting with sweeping gradients and continuing into increasingly fiddly patterns, and it just so happens that the first few coefficients capture most of the signal in photographic images.

Sin(0) is 0, so the DSTs start with an offset of 0.5 or 1, and the first coefficient is a gentle mound rather than a flat plain. That is unlikely to suit ordinary images, and the result is that DSTs require more coefficients than DCTs to encode most blocks.

The DCT just happens to suit. That is really all there is to it.

Question 2

When performing image compression, our best bet is to perform the KLT or the Karhunen–Loève transform as it results in the least possible mean square error between the original and the compressed image. However, KLT is dependent on the input image, which makes the compression process impractical.

DCT is the closest approximation to the KL Transform. Mostly we are interested in low frequency signals so only even component is necessary hence its computationally feasible to compute only DCT.

Also, the use of cosines rather than sine functions is critical for compression as fewer cosine functions are needed to approximate a typical signal (See Douglas Bagnall's answer for further explanation).

Another advantage of using cosines is the lack of discontinuities. In DFT, since the signal is represented periodically, when truncating representation coefficients, the signal will tend to "lose its form". In DCT, however, due to the continuous periodic structure, the signal can withstand relatively more coefficient truncation but still keep the desired shape.

Question 3

The DCT of a image macroblock where the top and bottom and/or the left and right edges don't match will have less energy in the higher frequency coefficients than a DFT. Thus allowing greater opportunities for these high coefficients to be removed, more coarsely quantized or compressed, without creating more visible macroblock boundary artifacts.

Question 4

DCT is preferred over DFT (Discrete Fourier Transformation) and KLT (Karhunen-Loeve Transformation)

1. Fast algorithm
2. Good energy compaction
3. Only real coefficients