Digital Media Compression
To reduce the size of digital media we need to use compression. Virtually all the media we consume is compressed to various degrees. Whether it’s video on TV, a Blu-ray disc, streamed over the web, or purchased from the iTunes Store, we’re dealing with compressed formats. Compressing digital media can result in greatly reduced file sizes, but often with little or no perceivable degradation in quality.
Chroma Subsampling
Video data is typically encoded using a color model called Y’CbCr, which is commonly referred to as YUV. The term YUV is technically incorrect, but YUV probably rolls off the tongue better than Y-Prime-C-B-C-R. Most software developers are more familiar with the RGB color model, where every pixel is composed of some value of red, green, and blue. Y’CbCr, or YUV, instead separates a pixel’s luma channel Y (brightness) from its chroma (color) channels UV. Figure 1.7 illustrates the effect of separating an image’s luma and chroma channels.
Figure 1.7 Original image on the left. Luma (Y) in the center. Chroma (UV) on the right.
You can see that all the detail of the image is preserved in the luma channel, leaving us with a grayscale image, whereas in the combined chroma channels almost all the detail is lost. Because our eyes are far more sensitive to brightness than they are to color, clever engineers over the years realized we can reduce the amount of color information stored for each pixel while still preserving the quality of the image. The process used to reduce the color data is called chroma subsampling.
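To make the separation concrete, converting from R’G’B’ to Y’CbCr is just a weighted sum plus two scaled difference channels. The following sketch uses the BT.601 luma coefficients for 8-bit, full-range values; the function name is hypothetical, and in practice Core Video or vImage handles this conversion for you.

```swift
// A minimal sketch of the R'G'B' -> Y'CbCr conversion using BT.601 luma
// coefficients for 8-bit, full-range values. Real pipelines use Core Video
// pixel formats (e.g., kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange)
// rather than converting by hand; this only illustrates the math.
struct YCbCr {
    let y: UInt8   // luma (brightness)
    let cb: UInt8  // blue-difference chroma
    let cr: UInt8  // red-difference chroma
}

func ycbcr(r: UInt8, g: UInt8, b: UInt8) -> YCbCr {
    let (rf, gf, bf) = (Double(r), Double(g), Double(b))

    // Luma is a weighted sum of the gamma-corrected RGB components.
    let y = 0.299 * rf + 0.587 * gf + 0.114 * bf

    // Chroma channels are scaled differences from luma, centered on 128.
    let cb = 128.0 + 0.564 * (bf - y)
    let cr = 128.0 + 0.713 * (rf - y)

    return YCbCr(y: UInt8(y.rounded()),
                 cb: UInt8(min(max(cb, 0), 255).rounded()),
                 cr: UInt8(min(max(cr, 0), 255).rounded()))
}
```

Running a pure gray pixel through this function yields Cb and Cr values of 128, which is why the chroma channels in Figure 1.7 carry so little detail on their own.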
Whenever you see camera specifications or other video hardware or software referring to numbers such as 4:4:4, 4:2:2, or 4:2:0, these values refer to the chroma subsampling being used. They express a ratio of luminance to chrominance samples in the form J:a:b, where
- J: the number of pixels in each row of the reference block (usually 4).
- a: the number of chrominance samples stored for every J pixels in the first row.
- b: the number of additional chrominance samples stored for every J pixels in the second row.
To preserve the quality of the image, every pixel needs to have its own luma value, but it does not need to have its own chroma value. Figure 1.8 shows the common subsampling ratios and the effects of each.
Figure 1.8 Common chroma subsampling ratios
In all forms, full luminance is preserved across all pixels, and in 4:4:4 full color information is preserved as well. In 4:2:2, color information is averaged across every two pixels horizontally, resulting in a 2:1 luma-to-chroma ratio. In 4:2:0, color information is averaged both horizontally and vertically, resulting in a 4:1 luma-to-chroma ratio.
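The savings are easy to quantify. Assuming 8 bits per sample, the hypothetical helper below walks the J:a:b arithmetic described above to compute the average bits stored per pixel for each ratio.

```swift
// Hypothetical helper: average bits per pixel for a J:a:b chroma
// subsampling ratio, assuming 8 bits per sample. Each pixel in the
// J x 2 reference block stores its own luma sample; the (a + b) chroma
// sample positions each store a Cb and a Cr sample shared by the block.
func bitsPerPixel(j: Double, a: Double, b: Double, bitDepth: Double = 8) -> Double {
    let lumaSamples = j * 2            // one luma sample per pixel in the 2-row block
    let chromaSamples = (a + b) * 2    // Cb and Cr samples stored for the block
    return (lumaSamples + chromaSamples) * bitDepth / (j * 2)
}

print(bitsPerPixel(j: 4, a: 4, b: 4))  // 4:4:4 -> 24.0 bits per pixel (no savings)
print(bitsPerPixel(j: 4, a: 2, b: 2))  // 4:2:2 -> 16.0 (one-third smaller)
print(bitsPerPixel(j: 4, a: 2, b: 0))  // 4:2:0 -> 12.0 (half the size)
```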
Chroma subsampling typically happens at the point of acquisition. Some professional cameras capture at 4:4:4, but more commonly they do so at 4:2:2. Consumer-oriented cameras, such as the one found on the iPhone, capture at 4:2:0. A high-quality image can be captured even at significant levels of subsampling, as is evidenced by the quality of video that can be shot on the iPhone. The loss of color becomes more problematic when performing chroma keying or color correction in the post-production process. As the chroma information is averaged across multiple pixels, noise and other artifacts can enter into the image.
Codec Compression
Most audio and video is compressed with the use of a codec, which is short for coder/decoder. A codec uses advanced compression algorithms to encode audio or video data, greatly reducing the size needed to store or deliver digital media. The codec is also used to decode the media from its compressed state into a form suitable for playback or editing.
Codecs can be either lossless or lossy. A lossless codec compresses the media in such a way that it can be perfectly reconstructed upon decompression, making it ideal for editing and production uses, as well as for archiving purposes. We use this type of compression frequently when using utilities like zip or gzip. A lossy codec, as the name suggests, loses data as part of the compression process. Codecs employing this form of compression use advanced algorithms based on human perception. For instance, although we can theoretically hear frequencies between 20Hz and 20kHz, we are particularly sensitive to frequencies between 1kHz and 5kHz. Our sensitivity begins to taper off as we get above or below this range. Using this knowledge, an audio codec can employ filtering techniques to reduce or eliminate certain frequencies in an audio file. This is just one example of the many approaches used, but the goal of lossy codecs is to use psycho-acoustic or psycho-visual models to reduce redundancies in the media in a way that results in little or no perceivable degradation in quality.
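To see the lossless case concretely, the snippet below round-trips a buffer through zlib using Foundation’s NSData compression API (available on recent OS releases); the input data and the choice of zlib are arbitrary. The point is only that decompression reproduces the input exactly, which no lossy audio or video codec guarantees.

```swift
import Foundation

// Lossless compression demonstration: the decompressed bytes are identical
// to the original. Uses NSData's zlib support; command-line tools such as
// zip and gzip behave the same way.
let original = Data(String(repeating: "AVFoundation ", count: 1000).utf8)

do {
    let compressed = try (original as NSData).compressed(using: .zlib) as Data
    let restored = try (compressed as NSData).decompressed(using: .zlib) as Data

    print("original: \(original.count) bytes")
    print("compressed: \(compressed.count) bytes")   // far smaller; the input is highly repetitive
    print("lossless round trip: \(restored == original)")  // true
} catch {
    print("compression failed: \(error)")
}
```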
Let’s look at the codec support provided by AV Foundation.
Video Codecs
AV Foundation supports a fairly limited set of codecs. It supports only those that Apple considers to be the most relevant for today’s media. When it comes to video, that primarily boils down to H.264 and Apple ProRes. Let’s begin by looking at H.264 video.
H.264
When it comes to encoding your video for delivery, I’ll paraphrase Henry Ford by saying AV Foundation supports any video codec you want as long as it’s H.264. Fortunately, the industry has coalesced around this codec as well. It is widely used in consumer video cameras and is the dominant format used for video streaming on the Web. All the video downloaded from the iTunes Store is encoded using this codec as well. The H.264 specification is part of the larger MPEG–4 specification, where it is defined as MPEG–4 Part 10, or Advanced Video Coding (AVC), by the Moving Picture Experts Group (MPEG). H.264 builds on the earlier MPEG–1 and MPEG–2 standards, but provides greatly improved image quality at lower bit rates, making it ideal for streaming and for use on mobile devices and video cameras.
H.264, along with other forms of MPEG compression, reduces the size of video content in two ways:
- Spatially: This compresses the individual video frames and is referred to as intraframe compression.
- Temporally: This compresses redundancies across groups of video frames and is referred to as interframe compression.
Intraframe compression works by eliminating redundancies in color and texture contained within the individual video frames, thereby reducing their size but with minimal loss in picture quality. This form of compression works similarly to that of JPEG compression. It too is a lossy compression algorithm, but can be used to produce very high-quality photographic images at a fraction of the size of the original image. The frames created through this process are referred to as I-frames.
With interframe compression, frames are grouped together into a Group of Pictures (GOP). Within this GOP certain temporal redundancies exist that can be eliminated. If you think about a typical scene in video, there are certain elements in motion, such as a car driving by or a person walking down the street, but the background environment is often fixed. The fixed background represents a temporal redundancy that could be eliminated through compression.
There are three types of frames that are stored within a GOP, as shown in Figure 1.9.
Figure 1.9 Group of Pictures
- I-frames: These are the standalone or key frames and contain all the data needed to create the complete image. Every GOP has exactly one I-frame. Because it is a standalone frame, it is the largest in size but is fastest to decompress.
- P-frames: P-frames, or predicted frames, are encoded from a “predicted” picture based on the closest I-frame or P-frame. P-frames can reference the data in the closest preceding P-frame or the group’s I-frame. You’ll often see these referred to as reference frames, as their neighboring P-frames and B-frames can refer to them.
- B-frames: B-frames, or bidirectional frames, are encoded based on frame information that comes before and after them. They require little space, but take longer to decompress because they are reliant on their surrounding frames.
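When you hand frames to an AVAssetWriter, you can influence this GOP structure through the compression-properties dictionary of the output settings: the maximum keyframe interval bounds how far apart I-frames may be, and disabling frame reordering prevents the encoder from emitting B-frames. The sketch below assumes 1280x720 H.264 output; the dimensions and bit rate are illustrative values, not requirements.

```swift
import AVFoundation

// Illustrative output settings for an AVAssetWriterInput producing H.264.
// The keyframe interval and frame-reordering flag shape the GOP structure
// described above; the dimensions and bit rate are arbitrary examples.
let compressionProperties: [String: Any] = [
    AVVideoAverageBitRateKey: 4_500_000,    // target roughly 4.5 Mbps
    AVVideoMaxKeyFrameIntervalKey: 30,      // at most 30 frames between I-frames
    AVVideoAllowFrameReorderingKey: false   // no B-frames; frames stay in display order
]

let outputSettings: [String: Any] = [
    AVVideoCodecKey: AVVideoCodecType.h264,
    AVVideoWidthKey: 1280,
    AVVideoHeightKey: 720,
    AVVideoCompressionPropertiesKey: compressionProperties
]

let videoInput = AVAssetWriterInput(mediaType: .video, outputSettings: outputSettings)
```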
H.264 additionally supports encoding profiles, which determine the algorithms employed during the encoding process. There are three top-level profiles defined:
- Baseline: This profile is commonly used when encoding media for mobile devices. It provides the least efficient compression, thereby resulting in larger file sizes, but is also the least computationally intensive because it doesn’t support B-frames. If you’re targeting older iOS devices, such as the iPhone 3GS, you should use the baseline profile.
- Main: This profile is more computationally intensive than baseline, because a greater number of its available algorithms are used, but it results in higher compression ratios.
- High: The high profile will result in the highest quality compression being used, but is the most intensive of the three because the full arsenal of encoding techniques and algorithms are used.
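Profile selection is exposed through the same compression-properties dictionary shown earlier, using the profile-level constants AV Foundation defines for the three profiles. The choice of the high profile below is just an example.

```swift
import AVFoundation

// The profile is chosen with AVVideoProfileLevelKey inside the
// compression-properties dictionary. The "AutoLevel" constants let the
// encoder pick an appropriate level for the frame size and frame rate.
let profileLevel = AVVideoProfileLevelH264HighAutoLevel
// Alternatives: AVVideoProfileLevelH264BaselineAutoLevel (older devices, no B-frames)
//               AVVideoProfileLevelH264MainAutoLevel     (middle ground)

let compressionProperties: [String: Any] = [
    AVVideoProfileLevelKey: profileLevel
]
```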
Apple ProRes
AV Foundation supports two flavors of the Apple ProRes codec. Apple ProRes is considered an intermediate or mezzanine codec, because it’s intended for professional editing and production workflows. The Apple ProRes codecs are frame-independent, meaning only I-frames are used, which makes them more suitable for editing. They additionally use variable bit rate encoding, varying the number of bits used to encode each frame based on the complexity of the scene.
ProRes is a lossy codec, but of the highest quality. Apple ProRes 422 uses 4:2:2 chroma subsampling and a 10-bit sample depth. Apple ProRes 4444 uses 4:4:4 chroma subsampling, with the final 4 indicating it supports a lossless alpha channel and up to a 12-bit sample depth.
The ProRes codecs are available only on OS X. If you’re developing only for iOS, H.264 is the only game in town. Apple does, however, provide one variation on typical H.264 encoding that can be used when capturing for editing purposes, called iFrame. This is an I-frame-only variant producing H.264 video more suitable for editing environments. This format is supported within AV Foundation and is additionally supported by a variety of camera manufacturers, such as Canon, Panasonic, and Nikon.
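On OS X, the simplest way to target ProRes is through an export preset. The sketch below checks which presets are compatible with a given asset and uses the ProRes 422 preset when available; the file URLs are placeholders.

```swift
import AVFoundation

// Export an asset to ProRes 422 on OS X. The URLs are placeholders, and the
// preset check guards against platforms (such as iOS) where ProRes encoding
// is unavailable.
let asset = AVAsset(url: URL(fileURLWithPath: "/path/to/source.mov"))

let compatiblePresets = AVAssetExportSession.exportPresets(compatibleWith: asset)
if compatiblePresets.contains(AVAssetExportPresetAppleProRes422LPCM) {
    let session = AVAssetExportSession(asset: asset,
                                       presetName: AVAssetExportPresetAppleProRes422LPCM)
    session?.outputURL = URL(fileURLWithPath: "/path/to/output.mov")
    session?.outputFileType = .mov
    session?.exportAsynchronously {
        print("export status: \(String(describing: session?.status))")
    }
}
```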
Audio Codecs
AV Foundation supports all the audio codecs supported by the Core Audio framework, meaning it has broad support for a variety of formats. However, unless you require uncompressed linear PCM audio, the codec you will most frequently use is AAC.
AAC
Advanced Audio Coding (AAC) is the audio counterpart to H.264 and is the dominant format used for audio streaming and downloads. It greatly improves upon MP3, providing higher sound quality at lower bit rates, which makes it ideal for distribution on the Web. Additionally, AAC doesn’t have the licensing and patent restrictions that have long plagued MP3.
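When writing AAC with AVAssetWriter, the audio counterpart of the video settings shown earlier is a small dictionary built from Core Audio and AV Foundation constants. The sample rate, channel count, and bit rate below are typical values rather than requirements.

```swift
import AVFoundation
import AudioToolbox

// Illustrative output settings for an AVAssetWriterInput producing AAC audio.
// kAudioFormatMPEG4AAC comes from Core Audio; the sample rate, channel count,
// and bit rate are common choices, not requirements.
let audioSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,
    AVSampleRateKey: 44_100,
    AVNumberOfChannelsKey: 2,
    AVEncoderBitRateKey: 128_000
]

let audioInput = AVAssetWriterInput(mediaType: .audio, outputSettings: audioSettings)
```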