Understanding Digital Media
These days it’s easy to take digital media for granted. We buy songs and albums from iTunes, stream movies and TV shows from Netflix and Hulu, and share digital photos by email, text, and on the Web. Using digital media has become second nature for most of us, but have you ever given much thought to how that media became digital in the first place?

We clearly live in a digital age, but we still inhabit an analog world. Every sight that we see and every sound that we hear is delivered to us as an analog signal. The inner structures of our eyes and ears convert these signals into electrical impulses that our brains perceive as sight and sound. Signals in the real world are continuous, constantly varying in frequency and intensity, whereas signals in the digital world are discrete, having a state of either 1 or 0. In order to translate an analog signal into a form that we can store and transmit digitally, we use an analog-to-digital conversion process called sampling.
Digital Media Sampling
There are two primary types of sampling used when digitizing media. The first is called temporal sampling, which enables us to capture variations in a signal over time. For instance, when you record a voice memo on your iPhone, the continuous variations in the pitch and volume of your voice are being captured over the duration of your recording. The second type of sampling is called spatial sampling and is used when digitizing photographs or other visual media. Spatial sampling involves capturing the luminance (light) and chrominance (color) in an image at some degree of resolution in order to create the resulting digital image’s pixel data. When digitizing video, both forms of sampling are used because a video signal varies both spatially and temporally.
Fortunately, you don’t need to have a deep understanding of the complex digital signal processing involved in these sampling processes, because it is handled by the hardware components that perform the analog-to-digital conversion. However, failing to have a basic understanding of these processes and the storage formats of the digital media they produce will limit your ability to utilize some of AV Foundation’s more advanced and interesting capabilities. To get a general understanding of the sampling process, let’s take a look at the steps involved in sampling audio.
Understanding Audio Sampling
When you hear the sound of someone’s voice, the honking of a horn, or the strum of a guitar, what you are really hearing are vibrations transmitted as sound waves through some medium. For instance, when you strum a G chord on a guitar, the pick striking the strings causes each string to vibrate at a certain frequency and amplitude. The speed, or frequency, at which a string vibrates back and forth determines its pitch: low notes come from slow vibrations (low frequencies) and high notes from fast vibrations (high frequencies). The amplitude measures the relative magnitude of the vibration, which roughly correlates to the volume you hear. On a stringed instrument such as a guitar, you can actually see both the frequency and amplitude attributes of the signal when you pluck a string. This vibration causes the surrounding air molecules to move, which in turn push against their neighboring molecules, which push against their neighbors, and so on, continuously transmitting the energy of the initial vibration outward in all directions. When these waves reach your ear, they cause your eardrum to vibrate at the same frequency and amplitude. These vibrations are transmitted to the cochlea in your inner ear, where they are converted into electrical impulses sent to your brain, causing you to think, “I’m hearing a G chord!”
When we record a voice, an acoustic instrument such as a piano or a guitar, or other environmental sounds, we use a microphone. A microphone is a transducer that converts mechanical energy (a sound wave) into electrical energy (voltage). A variety of microphone types are in use, but I’ll discuss the process in terms of one common type: the dynamic microphone. Figure 1.2 shows a high-level view of the internals of a dynamic microphone.
Figure 1.2 Internal view of a dynamic microphone
Contained inside the head case, which is the part you speak into, is a thin membrane called a diaphragm. The diaphragm is attached to a coil of wire that surrounds a magnet. When you speak into the microphone, the diaphragm vibrates in response to the sound waves it receives. This in turn moves the coil of wire within the magnet’s field, generating a current that varies with the frequency and amplitude of the input signal. Using an oscilloscope, we can see the oscillations of this current, as shown in Figure 1.3.
Figure 1.3 Audio signal voltage
Returning to the topic of sampling, how do we convert this continuous signal into its discrete form? Let’s drill a bit further into the essential elements of an audio signal. Using a tone generator, I created two different tones, producing the sine waves shown in Figure 1.4.
Figure 1.4 Sine waves at 1Hz (left) and 5Hz (right)
We’re interested in two aspects of this signal. The first is the amplitude, which indicates the magnitude of the voltage or relative strength of the signal. This can be represented on a variety of scales, but is commonly normalized to a range of –1.0f to 1.0f. The other interesting aspect of this signal is its frequency. The frequency of the signal is measured in hertz (Hz), which indicates how many complete cycles occur in the period of one second. The image on the left in Figure 1.4 shows an audio signal cycling at 1Hz and the one on the right shows a 5Hz signal. Humans have an audible frequency range of 20Hz–20kHz (20,000 Hz), so both signals would be inaudible, but they make for easier illustration.
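To make these two attributes concrete, here is a minimal Swift sketch that evaluates the textbook formula for a pure tone, a(t) = A sin(2πft). The toneValue function and its parameter names are purely illustrative and not part of any AV Foundation API.

```swift
import Foundation

/// Value of a pure tone at time `t` (in seconds), with amplitude normalized
/// to the -1.0...1.0 range and frequency expressed in hertz.
func toneValue(amplitude: Double, frequency: Double, at t: Double) -> Double {
    return amplitude * sin(2.0 * Double.pi * frequency * t)
}

// The 1Hz tone is still partway up its single cycle at t = 0.1s...
print(toneValue(amplitude: 1.0, frequency: 1.0, at: 0.1))  // ≈0.59
// ...while the 5Hz tone has already completed its first half-cycle.
print(toneValue(amplitude: 1.0, frequency: 5.0, at: 0.1))  // ≈0.0
```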
Digitizing audio involves a method of encoding called linear pulse-code modulation, more commonly referred to as Linear PCM or LPCM. This process samples or measures the amplitude of an audio signal at a fixed, periodic rate called the sampling rate. Figure 1.5 shows taking seven samples of this signal over the period of 1 second and the resulting digital representation of the signal.
Figure 1.5 Low sampling rate
Clearly, at a low sampling rate the digital version of this signal bears little resemblance to the original. Playing this digital audio would result in little more than clicks and pops. The problem with the sampling shown in Figure 1.5 is that it isn’t sampling frequently enough to accurately capture the signal. Let’s try this again in Figure 1.6, but this time we’ll increase the sampling rate.
Figure 1.6 Higher sampling rate
This is certainly an improvement, but still not a very accurate representation of the signal. However, what you can surmise from this example is that if you continue to increase the sampling rate, you should be able to produce a digital representation that fairly accurately mirrors the original source. Given the limitations of hardware, we may not be able to produce an exact replica, but is there a sample rate that can produce a digital representation that is good enough? The answer is yes, and it’s called the Nyquist rate. Harry Nyquist was an engineer working for Bell Labs in the 1930s who determined that to accurately capture a particular frequency, you need to sample at a rate of at least twice that frequency. For instance, if the highest frequency in the audio material you want to capture is 10kHz, you need a sample rate of at least 20kHz to provide an accurate digital representation. CD-quality audio uses a sampling rate of 44.1kHz, which means it can capture a maximum frequency of 22.05kHz, just above the 20kHz upper bound of human hearing. A sampling rate of 44.1kHz may not capture the complete frequency range contained in the source material, meaning your dog may be upset that the recording misses some of the nuances of the Abbey Road sessions, but for us human beings, it sounds pristine.
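As a quick sanity check on the Nyquist relationship, the following sketch (the function names are my own, not AV Foundation’s) computes the minimum sample rate for a given frequency and the highest frequency a given sample rate can faithfully capture.

```swift
/// Minimum sample rate (Hz) needed to accurately capture `maxFrequency` (Hz).
func minimumSampleRate(forMaxFrequency maxFrequency: Double) -> Double {
    return maxFrequency * 2.0
}

/// Highest frequency (Hz) that a given sample rate can faithfully represent.
func maxCapturableFrequency(atSampleRate sampleRate: Double) -> Double {
    return sampleRate / 2.0
}

print(minimumSampleRate(forMaxFrequency: 10_000))    // 20000.0 — 10kHz content needs at least 20kHz
print(maxCapturableFrequency(atSampleRate: 44_100))  // 22050.0 — CD audio tops out at 22.05kHz
```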
In addition to the sampling rate, another important aspect of digital audio sampling is how accurately we can capture each audio sample. The amplitude is measured on a linear scale, hence the term Linear PCM. The number of bits used to store the sample value defines the number of discrete steps available on this linear scale and is referred to as the audio’s bit depth. Assigning too few bits results in considerable rounding or quantizing of each sample, leading to noise and distortion in the digital audio signal. Using a bit depth of 8 would provide 256 discrete levels of quantization. This may be sufficient for some audio material, but it isn’t high enough for most audio content. CD-quality audio has a bit depth of 16, resulting in 65,536 discrete levels, and in professional audio recording environments bit depths of 24 or higher are used.
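The following rough sketch, assuming a sample already normalized to the –1.0 to 1.0 range, shows how bit depth determines the number of quantization steps and how coarsely a sample gets rounded. It is an illustration only, not production signal-processing code.

```swift
import Foundation

/// Number of discrete quantization levels available at a given bit depth.
func quantizationLevels(bitDepth: Int) -> Int {
    return 1 << bitDepth  // 2^bitDepth
}

/// Rounds a sample in the -1.0...1.0 range to the nearest step at `bitDepth` bits.
func quantize(_ sample: Double, bitDepth: Int) -> Double {
    let steps = Double(quantizationLevels(bitDepth: bitDepth) - 1)
    let clamped = min(max(sample, -1.0), 1.0)
    // Map -1.0...1.0 onto 0...steps, round to the nearest step, then map back.
    let index = ((clamped + 1.0) / 2.0 * steps).rounded()
    return index / steps * 2.0 - 1.0
}

print(quantizationLevels(bitDepth: 8))    // 256 levels
print(quantizationLevels(bitDepth: 16))   // 65536 levels
print(quantize(0.335, bitDepth: 8))       // ≈0.3333 — the 8-bit grid is off by about 0.0017
print(quantize(0.335, bitDepth: 16))      // ≈0.33501 — the 16-bit grid is accurate to about 0.00001
```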
When we digitize a signal, we are left with its raw, uncompressed digital representation. This is the media’s purest digital form, but it requires significant storage space. For instance, 44.1kHz, 16-bit LPCM audio takes about 10MB per stereo minute. Digitizing a 12-song album with an average song length of 5 minutes would take approximately 600MB of storage. Even with the vast amounts of storage and bandwidth we have today, that is still pretty large.
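Here is that back-of-the-envelope arithmetic in code, using the constants from the text (44.1kHz, 16 bits, two channels); the variable names are simply for illustration.

```swift
// Uncompressed LPCM storage: sampleRate × bytesPerSample × channels × seconds.
let sampleRate = 44_100.0          // samples per second
let bytesPerSample = 2.0           // 16-bit depth
let channels = 2.0                 // stereo
let secondsPerMinute = 60.0

let bytesPerMinute = sampleRate * bytesPerSample * channels * secondsPerMinute
print(bytesPerMinute / 1_048_576)  // ≈10.1MB per stereo minute

// A 12-song album averaging 5 minutes per song:
let albumMinutes = 12.0 * 5.0
print(bytesPerMinute * albumMinutes / 1_048_576)  // ≈606MB — roughly the 600MB figure above
```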
Uncompressed digital audio clearly requires significant storage, but what about uncompressed video? Let’s take a look at the elements of digital video to see if we can determine the amount of storage space it requires.

Video is composed of a sequence of images called frames. Each frame captures a scene at a point in time within the video’s timeline. To create the illusion of motion, a certain number of frames must be played in fast succession. The number of frames displayed in one second is called the video’s frame rate and is measured in frames per second (FPS). Some of the most common frame rates are 24FPS, 25FPS, and 30FPS.
To understand the storage requirements for uncompressed video content, we first need to determine how big each individual frame would be. A variety of common video sizes exist, but these days they usually have an aspect ratio of 16:9, meaning there are 16 horizontal pixels for every 9 vertical pixels. The two most common sizes at this aspect ratio are 1280 × 720 and 1920 × 1080. What about the pixels themselves? If we represent each pixel in the RGB color space using 8 bits per channel, we have 8 bits for red, 8 bits for green, and 8 bits for blue, or 24 bits per pixel. With all the inputs gathered, let’s perform some calculations. Table 1.1 shows the storage requirements for uncompressed 30FPS video at the two most common resolutions, and a short calculation following the table shows where the numbers come from.
Table 1.1 Uncompressed Video Storage Requirements
Color  | Resolution  | Frame Rate | MB/sec    | GB/hour
24-bit | 1280 × 720  | 30FPS      | 79MB/sec  | 278GB/hr
24-bit | 1920 × 1080 | 30FPS      | 178MB/sec | 625GB/hr
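The figures in Table 1.1 follow directly from multiplying frame size, bytes per pixel, and frame rate. This short sketch, with a purely illustrative helper function, reproduces them:

```swift
/// Uncompressed data rate in bytes per second for a given frame size,
/// bytes per pixel, and frame rate.
func bytesPerSecond(width: Double, height: Double,
                    bytesPerPixel: Double, fps: Double) -> Double {
    return width * height * bytesPerPixel * fps
}

let megabyte = 1_048_576.0
let gigabyte = 1_073_741_824.0

let hd720 = bytesPerSecond(width: 1280, height: 720, bytesPerPixel: 3, fps: 30)
print(hd720 / megabyte)            // ≈79MB/sec
print(hd720 * 3600 / gigabyte)     // ≈278GB/hour

let hd1080 = bytesPerSecond(width: 1920, height: 1080, bytesPerPixel: 3, fps: 30)
print(hd1080 / megabyte)           // ≈178MB/sec
print(hd1080 * 3600 / gigabyte)    // ≈626GB/hour (Table 1.1 rounds this down to 625)
```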
Houston, we have a problem. Clearly, as a storage and transmission format, this would be untenable. A decade from now these sizes may seem trivial, but today they aren’t feasible for most uses. Because this isn’t a reasonable way to store and transfer video in most cases, we need to find a way to reduce the size. This brings us to the topic of compression.