Question

I am new to video file formats and am wondering what it would take to parse one. To do that, I would first need to understand what the format looks like, so that's what this question is about.

Could someone briefly outline how a video file format is structured, or what it contains? Any video file format is helpful, but WebM would be preferred since it is open/royalty-free and modern (which I presume means it would be simpler to understand). But whatever file format is simplest for demonstrating the point is best.

The main thing I am looking for is not a detailed specification at this point, just basically what goes into a video file. I am used to "static" content like programming language files or image files, but videos are large, streamed files that have to cope with network issues and synchronization as part of "moving through the file contents". My knowledge amounts to "videos are probably sent as chunks of some sort of binary data", but that's it. I would like to know roughly how the data is organized in these different formats (again, any one is fine, or a generic example is okay too), and what the general features of the file format are.

I would like to be able to take a look at a WebM parser, but I am not sure what the scope of the features is for video files. Basically I am trying to understand how a video file "works".

Solution

Parsing a complete digital film is an immensely complex task. Because you mostly ask about WebM – a container format – I’ll concentrate on that.

You always start with individual streams containing the payload data: video (e.g. H.264, VP9), audio (e.g. AAC, Opus) and subtitles (e.g. SubRip, Blu-ray PGS). Tied to those streams is some metadata needed for correct playback; for example, the streams need to be synchronized properly.

As a simple example imagine a WebM file containing a VP9 video stream and an Opus audio stream.

The WebM container acts as a wrapper around the VP9 and Opus streams that makes it possible to put them into a single file and still access them conveniently. It also holds additional data, such as the types of the contained streams or checksums for error recovery.

Naively, you could store the streams one after the other, each in a single chunk. Obviously that's a horrible layout for streaming, because you'd have to download nearly the complete file before synchronized playback could start. That's one reason why streams are interleaved: the file stores a small chunk of video followed by a small chunk of audio (maybe half a second each) and repeats that pattern throughout the file.
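
As a toy illustration of that interleaving (not any real container's layout; the timestamps and chunk contents are made up), merging the two streams by timestamp yields the alternating video/audio pattern:

    # Toy interleaving sketch: merge video and audio chunks by timestamp so a
    # player only ever needs a small buffer. Chunks are (timestamp_seconds, bytes);
    # the data is fabricated for illustration.
    def interleave(video_chunks, audio_chunks):
        return sorted(video_chunks + audio_chunks, key=lambda chunk: chunk[0])

    video = [(0.0, b"V0"), (0.5, b"V1"), (1.0, b"V2")]
    audio = [(0.0, b"A0"), (0.5, b"A1"), (1.0, b"A2")]

    for timestamp, payload in interleave(video, audio):
        print(timestamp, payload)   # V0, A0, V1, A1, V2, A2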

What do you need to parse such a file? (A short code sketch follows the list.)

  • A WebM parser to process the container and extract the payload streams.
  • A VP9 parser (probably as a part of a full VP9 decoder) to process the video stream.
  • An Opus parser (again probably as a part of a full Opus decoder) to process the audio stream.
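
If you just want to see those three layers working together, here is a rough sketch using the third-party PyAV package (FFmpeg bindings); FFmpeg supplies the Matroska/WebM demuxer and the VP9/Opus decoders, so the layering mirrors the list above. The file name is a placeholder.

    # Container parser -> codec decoders, via PyAV (pip install av).
    import av

    container = av.open("input.webm")    # WebM/Matroska container parser
    for packet in container.demux():     # interleaved compressed packets
        for frame in packet.decode():    # handed to the VP9 or Opus decoder
            print(packet.stream.type, "frame at pts", frame.pts)
    container.close()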

WebM is a subset of Matroska. You can get a full specification of the format on the Matroska website. The parser you link to seems extremely simplistic at first glance, but it might be a good enough starting point. For a complete implementation you should have a closer look at the reference parser, libmatroska. It's used, for example, in the de facto standard Matroska muxing application MKVMerge.
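
At the byte level, Matroska (and therefore WebM) is built on EBML: every element is an ID, a size and a payload, where the ID and size are variable-length integers. Here is a minimal sketch of reading one element header, assuming a well-formed file; a real parser also needs the element ID tables from the Matroska/WebM specifications.

    def read_vint(stream, keep_marker):
        """Read one EBML variable-length integer.

        keep_marker=True  -> element IDs (the length-marker bit stays part of the ID)
        keep_marker=False -> element sizes (the marker bit is stripped)
        Assumes well-formed input (the first byte is never zero).
        """
        first = stream.read(1)[0]
        length = 1
        mask = 0x80
        while not (first & mask):          # leading zero bits give the total length
            length += 1
            mask >>= 1
        value = first if keep_marker else first & (mask - 1)
        for _ in range(length - 1):        # remaining bytes are plain big-endian
            value = (value << 8) | stream.read(1)[0]
        return value

    with open("input.webm", "rb") as f:    # placeholder file name
        element_id = read_vint(f, keep_marker=True)    # 0x1A45DFA3 = EBML header
        size = read_vint(f, keep_marker=False)
        print(hex(element_id), size)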

Btw: “Muxing” is short for “multiplexing”. The long form is rarely used, though.

OTHER TIPS

Video container formats are typically made up of a series of content blocks. A block typically consists of a few marker bytes (important for finding the next block if you get incomplete data while streaming), a type code (metadata, audio data, video data, ...), a length, followed by some amount of data.
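
As a sketch of that generic pattern (the 4-byte marker, the one-byte type codes and the big-endian length field below are invented for illustration; every real container defines its own encoding):

    import struct

    MARKER = b"BLCK"                      # made-up sync marker
    TYPES = {0: "metadata", 1: "video", 2: "audio"}

    def read_blocks(stream):
        while True:
            marker = stream.read(4)
            if len(marker) < 4:
                return                    # end of file
            if marker != MARKER:
                stream.seek(-3, 1)        # resync: slide forward one byte
                continue
            type_code, length = struct.unpack(">BI", stream.read(5))
            payload = stream.read(length)
            yield TYPES.get(type_code, "unknown"), payload

    with open("example.bin", "rb") as f:  # placeholder file name
        for kind, payload in read_blocks(f):
            print(kind, len(payload), "bytes")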

The file would typically start with a metadata block. It contains information about the file: is it a single picture, an audio-only stream, or mixed video and audio? Are there multiple audio tracks (different languages)? What's the resolution of the video? Which audio and video codecs are used?
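
To make that concrete, a parsed metadata block could be represented roughly like this (the field names are invented; every real format defines its own set):

    from dataclasses import dataclass, field

    @dataclass
    class TrackInfo:
        kind: str              # "video", "audio", "subtitle"
        codec: str             # e.g. "VP9", "Opus"
        language: str = "und"

    @dataclass
    class FileMetadata:
        duration_seconds: float
        width: int = 0         # video resolution, if there is a video track
        height: int = 0
        tracks: list = field(default_factory=list)

    meta = FileMetadata(
        duration_seconds=12.5, width=1920, height=1080,
        tracks=[TrackInfo("video", "VP9"), TrackInfo("audio", "Opus", "eng")],
    )
    print(meta)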

After that comes a series of content blocks. Their interpretation is up to the codecs. Audio and video blocks are interleaved so that they can be played back simultaneously. A single video block might contain all the data for one video frame, followed by audio blocks covering the same time range. But this splitting is up to the muxer and might be done differently, too; the audio for several frames might be put into a single audio block, for example, to reduce per-block overhead.
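
On the playback side this boils down to routing each block to the right decoder; interpreting the payload bytes is entirely the codec's job. Here is a sketch with placeholder decoder classes and fabricated blocks (not a real codec API):

    class VideoDecoder:
        def feed(self, payload):
            print(f"decoding one video frame ({len(payload)} bytes)")

    class AudioDecoder:
        def feed(self, payload):
            print(f"decoding audio for one or more frames ({len(payload)} bytes)")

    decoders = {"video": VideoDecoder(), "audio": AudioDecoder()}

    # In a real player these (kind, payload) pairs would come from a block
    # reader like the one sketched above.
    blocks = [("video", b"\x00" * 4000), ("audio", b"\x00" * 300),
              ("video", b"\x00" * 3500), ("audio", b"\x00" * 900)]

    for kind, payload in blocks:
        decoders[kind].feed(payload)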

The codecs themselves are a different, huge topic.

Licensed under: CC-BY-SA with attribution