Understanding Digital Audio – Part 1

By John Shirley

These days you’re surrounded by devices that do digital audio recording—it’s the norm, and not just in your studio, where you may have one or several of these: a computer—desktop or laptop—that functions as a DAW (digital audio workstation), recording to a hard drive; an older digital tape recorder like a 2-track DAT (digital audio tape) machine or an MDM (modular digital multitracker); maybe a stand-alone digital multitracker, some of which come with their own CD burners. Perhaps you’ve transferred your old vinyl LPs directly into a stand-alone CD recorder. Then there’s your phone answering machine, pocket memo recorder, MP3 player, iPod, MiniDisc player/recorder—even your digital camera records digital audio. Did I say surrounded?

With so many competing digital recording platforms, sample rates, bit depths, data compression schemes, and converter technologies, figuring out exactly what’s going on inside digital audio systems is becoming harder and harder. While most user interfaces have become friendlier, the complexity, power, and feature sets have also increased (as has the accompanying jargon).

 

Digital Audio and You 

The term “digital audio” refers to the fact that acoustic energy is represented by numerical description. How does this happen—sound becoming numbers?

First, acoustic energy (sound vibrating in the air, sound you can hear) gets transduced into electrical energy (like the stuff that runs through the cables in your studio). That electrical representation of sound is called “analog” (electrical events that are analogous to the vibrations of the acoustic energy).

To represent this “analog” energy with numbers (digits), an A/D (analog-to-digital) converter is employed. Its job is to take measurements of the fluctuations of the audio signal and represent those measurements in the numbering system that computers employ. Since the computer chip and its related technologies are the primary means for the storage and manipulation of these descriptions, the numerical system used is binary, or base-two.

 

Bit depth

Each single placeholder in binary, called a bit (a contraction of the term binary digit), can only represent one of two values: off or on, meaning—numerically—a 0 or a 1. Therefore, two places can be used to represent four values: 00 (zero), 01 (one), 10 (two), or 11 (three).

Each time another bit is added, the possible number of values is doubled. Two bits gave us four values: zero, one, two, and three.

Three bits give us eight values:  000 (zero), 001 (one), 010 (two), 011 (three), 100 (four), 101 (five), 110 (six), 111 (seven).

Our normal counting system is decimal or base-ten: we can count ten values (from 0 to 9) with only one digit, then we have to add a second digit to represent 10. As we continue counting, we eventually reach 99 and must add a third digit to represent 100…. In binary you don’t have ten values per digit, you only have two – so you add on digits a lot quicker as you represent larger and larger numbers.

To determine how many values are possible in any binary system, the formula 2^n can be used, where n is the number of bits (called bit depth or word length).

  • Eight-bit systems (2^8) can represent up to 256 values (from 0 to 255)
  • Sixteen-bit systems (2^16) can represent up to 65,536 values (from 0 to 65,535)
  • Twenty-four-bit systems (2^24) can represent up to 16,777,216 values.
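
If you like to tinker, a couple of lines of Python will confirm these figures:

    # Each added bit doubles the number of values a binary word can represent: 2 to the power n.
    for bits in (8, 16, 24):
        values = 2 ** bits
        print(f"{bits}-bit words: {values:,} values (0 to {values - 1:,})")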

 

Amplitude

Each digital word is used to represent the instantaneous amplitude of the audio signal. For a moment, let’s consider the loudspeaker as a good visual model. When no audio signal is going to the speaker, it does not move and the cone stays at its resting position, neither out nor in. When audio is sent to it, the cone moves outward (positively) and inward (negatively) from this resting position. How far it moves is a measure of the amplitude of the arriving signal.

If a pen were attached to the cone and a roll of paper continually scrolled by, a drawing of amplitude over time would be generated. (This is similar to the way both a seismometer and a lie detector work.) The resulting picture of the waveform with its ups and downs would look a lot like what is shown in the edit window of most DAWs.

This graph could be manually digitized by using a ruler to measure the amount of movement at certain points in time and recording these numbers in a table. The new data table can later be used to recreate the graph. How precisely the new graph recreates the original would depend on how accurate the measurements were and how many were taken.

Since the table cannot record an infinite amount of data, and the measurements can never be absolutely exact, the recreated graph cannot ever be “perfect.” That means that digitized sound cannot “perfectly” represent the original acoustic energy.

Quantization

Similarly, since digital audio uses discrete numerical values to represent momentary amplitude, no matter how many bits are used there’s never an infinite supply. Amplitude information must be rounded off to the nearest possible value. This process is known as quantization. (Note to MIDI freaks: MIDI quantization is the rounding off of rhythmic placement; digital audio quantization is the rounding off of amplitude.)

For example, if there are only 256 values available to describe amplitude (which is the case in an 8-bit system), there cannot be a value of, say, 234.569. A signal measurement of that sort would have to be rounded off and represented as a 235. The difference between the original, or “real”, amplitude and the quantized representation is called quantization error.
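
Here’s the same idea as a small Python sketch (the quantize function is just an illustration, not how any particular converter is built):

    # Round a measured amplitude to the nearest value an n-bit word can hold.
    def quantize(measurement, bits=8):
        max_code = 2 ** bits - 1                        # 255 in an 8-bit system
        return min(max(round(measurement), 0), max_code)

    measurement = 234.569
    code = quantize(measurement)
    print(code)                 # 235
    print(code - measurement)   # quantization error of roughly +0.43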

This sort of rounding error occurs across multiple values and creates noise in the signal, referred to as either quantization noise or digital noise. The amount is directly related to the size of the average quantization error and, therefore, the number of bits. This relationship can be used to calculate the theoretical signal-to-noise ratio of a digital audio system. This ratio is an expression (in decibels, abbreviated dB) of the difference in level between the strongest signal and the inherent noise floor. The greater the number, the better.

The following formula can be used to determine the S/N ratios of various bit depths:

  • S/N (in dB) = (6 x n) + 2, where n = number of bits
  • For 8 bits, S/N = 50 dB (6 x 8 = 48, plus 2 = 50)
  • For 12 bits, S/N = 74 dB (6 x 12 = 72, plus 2 = 74)
  • For 16 bits, S/N = 98 dB (6 x 16 = 96, plus 2 = 98)
  • For 24 bits, S/N = 146 dB (6 x 24 = 144, plus 2 = 146)
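
In Python, this rule of thumb works out as follows (textbooks often quote the slightly more precise figure of roughly 6.02 x n + 1.76 dB, but the simple version above is close enough for our purposes):

    # Theoretical signal-to-noise ratio using the rule of thumb S/N = 6 x n + 2 dB.
    def snr_db(bits):
        return 6 * bits + 2

    for bits in (8, 12, 16, 24):
        print(f"{bits}-bit: about {snr_db(bits)} dB")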

The digital audio on regular CDs (Compact Discs) is in 16-bit format, as per the “Red Book” technical industry guidelines that govern their data format. Most newer digital audio production devices, however, allow for 24 bits. Why record and work in 24 bits if the final mix will be to 16 bits for CD?

As soon as anything is changed in a digital audio file (or live stream), the signal is re-quantized. Therefore effects, panning, EQ, compression, crossfades, and mixing all add more quantization noise. By recording and working in 24 bits, a higher S/N ratio is maintained throughout the process. Dither (see below) may be necessary in the final mixdown step to 16 bits.

 

Low-level distortion

Another issue related to quantization and amplitude values occurs with low-level signals. When a signal is strong it moves through most of the available amplitude range and can rely on the majority of available values to represent it. With a quieter signal, the higher values are never reached. The waveform can only use a fraction of the values. Therefore, there is not equal resolution at all levels.

As the signal level gets lower, the waveform is subject to greater misrepresentation, becoming more and more like a pulse wave. This can be heard as a strange zippering noise in softly fading reverb tails, to mention just one instance.

Several methods are now used to combat this obvious perversion of the waveform at lower levels. One of the more popular, known as dither, is to actually add more low-level noise to the signal. This effectively masks (covers over/obscures) the problem, while adding a more natural-sounding noise floor for fade-outs, reverb tails, or quiet musical passages.

Another way manufacturers have been dealing with these issues, and making the most of their systems, is by using various methods of companding (sometimes also called bit-mapping, or shaping). By using various schemes to change the way the amplitude values are spread across the full range, more can be made available to the lower level signals. This effectively lowers the perceived noise floor and the level at which the waveform becomes obviously compromised. These methods work even better in conjunction with dither.
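
Here’s a rough sketch of the dither idea in Python, using simple flat (rectangular) dither of about one quantization step; real converters and DAWs typically use triangular or noise-shaped dither, so treat this as an illustration only:

    import random

    # Requantize a sample (range -1.0 to 1.0) to a coarser bit depth,
    # optionally adding about one step's worth of random noise first.
    def requantize(sample, bits=8, dither=True):
        steps = 2 ** (bits - 1)                     # quantization steps per polarity
        noise = (random.random() - 0.5) / steps if dither else 0.0
        return round((sample + noise) * steps) / steps

    quiet = 0.001                                   # a detail far smaller than one 8-bit step
    print(requantize(quiet, dither=False))          # always 0.0: the detail simply vanishes
    print(sum(requantize(quiet) for _ in range(10000)) / 10000)  # averages near 0.001: preserved as noise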

 

High-level distortion (clipping)

When the level to be digitized exceeds the available values, it can only be rounded down to the highest available value in the system—all bits “set” or “on” at a value of 1. This means that anything above the maximum number will not be represented correctly. In other words, in 8 bits, a 287 would be stored as a 255, as would 257.4 and 301 (and any other number above 255). When the resulting waveform is displayed, it looks as if someone took out the scissors and cut off the tops of the waveforms. This type of distortion can be quite harsh; there is no area of “forgiveness” as there is in analog audio, where engineers like to “push the signal into the red,” which can sound pleasant or even exciting. Not in digital—a digital “over” sounds very obvious and is often unpleasant.
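
In Python, that hard ceiling looks something like this (again, just an illustration):

    # Anything above the highest available code is stored as that maximum ("clipped").
    def clip_to_code(value, bits=8):
        max_code = 2 ** bits - 1            # 255 for 8 bits
        return min(round(value), max_code)

    for v in (287, 257.4, 301, 200):
        print(v, "->", clip_to_code(v))     # 287, 257.4, and 301 all flatten to 255; 200 is untouched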

As stated in last month’s column, it is important to set recording and mixdown levels correctly to avoid the gain-related problems discussed here. Not too hot… or too cold… but just right!

 

Time (to learn about sample rate)

OK, for a while now the focus of this discussion has been on amplitude. While issues arising from bit-depth and quantization concerns are important, they are only half the story. The next part of the equation involves the time domain.

 

Sample Rate

Each time a value is recorded for instantaneous amplitude, it’s called a sample. Many samples must be taken of a waveform in order to capture enough data to accurately recreate the sound. The number of times per second the recording system digitizes incoming sound is known as the sample rate. This determines how accurately a system can capture or recreate time-based events. If the sample rate is not high enough for the event it’s trying to capture, that event will be misrepresented.

This happens quite often in television and film, where the spokes of a bicycle, car, or stagecoach wheel appear to be moving backwards even though the vehicle is moving forward. This is because of the relationship between the film or video’s frame rate (the movie’s equivalent of the sample rate in digital audio) and the rotational speed of the wheel. If a spoke completes just less than a full rotation between frames, each successive picture shows it slightly behind where it was before. The result is that it appears to travel backwards. This misrepresentation of time elements is called aliasing.

The Nyquist Theorem

Recently, I saw a commercial where a man uses his digital camera and a tripod to make a short film of himself levitating around a lobby. He sets up the timer function and jumps into the air over and over in various locations. By timing it right, the camera only takes photos of him while he’s in the air. When the pictures are shown in quick succession, it appears that he’s flying.

While this is a cool effect, it does not represent what really happened. To see that he went both up and down, the camera would need to capture pictures of him in both these states. A minimum of two pictures are necessary during each complete cycle of up/down to properly represent his basic movement.

This same standard in digital audio is known as the Nyquist Theorem, which states that there must be at least two samples taken of each cycle of a sound in order to avoid aliasing. In audio, aliasing is the misrepresentation of frequency information. Frequency is determined by how quickly sound energy moves through a single iteration of its positive and negative amplitude portions, called a cycle, and is labeled in cycles per second or Hertz (abbreviated Hz). Though sometimes inaccurate, frequency is often described as how “high” or “low” a sound is. For musicians: middle C represents a lower fundamental frequency than the C two octaves above it. (They are 261.63 Hz and 1046.5 Hz respectively.)

 

Nyquist in action

The Nyquist Theorem, stated another way, says the highest frequency a system can handle without aliasing is equal to half its sample rate. Greater rates, therefore, translate into the ability to capture higher frequencies. The sample rate for the CD standard is set at 44,100 samples per second (44.1 kHz, or kiloHertz) to adequately represent sound to the upper boundary of human hearing (ca. 20 kHz).
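
A few lines of Python show how a frequency above the Nyquist limit folds back down as an alias:

    # Frequencies above half the sample rate fold back ("alias") into the representable range.
    def alias_frequency(freq_hz, sample_rate_hz):
        nyquist = sample_rate_hz / 2
        folded = freq_hz % sample_rate_hz
        return folded if folded <= nyquist else sample_rate_hz - folded

    print(alias_frequency(20_000, 44_100))   # 20 kHz is below the limit: prints 20000
    print(alias_frequency(30_000, 44_100))   # 30 kHz aliases down: prints 14100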

Sound itself does not follow these pesky rules and many common musical instruments generate frequencies above the Nyquist limit for CDs. To avoid aliasing, analog-to-digital converters use a low-pass filter to remove any frequencies that would be too high to be properly realized. These filter designs can be both complicated and expensive, as the slope of the filter is quite severe. Such drastic filtering often also imparts unwanted phase irregularities. Therefore, the design and implementation of these filters is one of the reasons for the dramatic differences in both the cost and sound of various converter designs.

Greater sample rates are now widely available on recording gear, including 44.1, 48, 88.2, 96, 176.4, and 192 kHz. These rates allow for both greater frequency range and better converter filter designs. When choosing a rate, it’s useful to consider the available disk space, because higher rates and bit depths take up more of it. Another consideration is whether the final mixdown will be at 44.1 kHz for CD. If this is the case, some contend that it’s better to use either a 44.1, 88.2, or 176.4 kHz rate. Since the latter two rates are whole-number multiples of the CD standard, down-sampling is a simple matter mathematically and the results are very accurate.
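
A quick sketch shows why those whole-number ratios are the easy ones (real sample-rate converters also filter and interpolate far more carefully than this):

    # 88.2 and 176.4 kHz divide evenly into 44.1 kHz; 48 and 96 kHz do not.
    for source in (88_200, 176_400, 48_000, 96_000):
        ratio = source / 44_100
        note = "whole-number ratio" if ratio.is_integer() else "fractional ratio: new sample values must be interpolated"
        print(f"{source} Hz -> 44100 Hz (x{ratio:.3f}) {note}")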

When DAT recorders came out, their sample rate was fixed at 48 kHz (at least as the default—on most machines it could be set to 44.1). While many are now quite good, some algorithms that were used to get from 48 (or 96) to 44.1 kHz have caused unwanted artifacts or loss of fidelity. Since the difference in quality or frequency range is so minimal between 88.2 and 96 kHz, I usually play it safe and use the 88.2 rate. This may now be no more than an old superstition. If you have the technology, time, and a fantastic listening environment you can judge for yourself.

So why record at higher sampling rates when you will be down-sampling to 44.1 kHz for CD anyway?

This is a concern similar to the greater bit depth discussion above. Recording to disk at a high sample rate can avoid some of the phasing and other negative side effects of the steep converter filtering. It also maintains a higher quality throughout the working process.

 

A final note

Now that so many companies offer recording platforms capable of both high sample rates and bit depths, it’s a good idea to clarify the meaning of it all.

First, many systems reduce track counts to offer these greater rates/depths. You may decide that you’d rather have the full 24 tracks at 24/44.1 than only 16 tracks at 24/88.2. This is a decision you can make for yourself, using your ears after some trial recordings at both settings and based on your needs for a given project. Just be aware that companies advertise the greatest track counts, highest rates, and greatest bit depths without making it clear that the device may not be able to offer all of these at one time.

Second, since 192 kHz is more than four times as fast as 44.1 kHz, it takes up more than four times the space on a hard drive! Similarly, since 24 bits are 50% more than 16, recordings of this type will take up 1.5 times as much space. You’ll use up 5 Megabytes (MB) per minute (per mono track) when recording 16-bit/44.1 kHz audio, but nearly 33 MB/min. for 24/192! So when it comes to available disk space, more is better….
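
You can check those storage figures yourself; the numbers above work out in binary megabytes:

    # Uncompressed mono audio needs (sample rate x bytes per sample) bytes every second.
    def mb_per_minute(sample_rate, bits):
        bytes_per_second = sample_rate * (bits // 8)
        return bytes_per_second * 60 / (1024 * 1024)   # binary megabytes, as used above

    print(round(mb_per_minute(44_100, 16), 1))   # about 5 MB per minute, per mono track
    print(round(mb_per_minute(192_000, 24), 1))  # nearly 33 MB per minute, per mono track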

Finally, keep in mind your final delivery format. Be sure you can get there easily from the format you record in. Too many (or tricky) conversions are never a good idea, as they can introduce a variety of errors that adversely affect audio.

Next time we’ll discuss topics such as: interfaces, buffers, hard drives, artifacts, latency, clocking, jitter, transfer protocols and file formats.

 

John Shirley is a recording engineer, composer, programmer and producer. He’s also on faculty in the Sound Recording Technology Program at the University of Massachusetts Lowell. Check out his wacky electronic music CD, Sonic Ninjutsu, on iTunes.

 

Supplemental Media Examples

 

AUDIO

Sample Rate and Aliasing

The first six soundfiles are from a recording of a drum kit. Each successive version is resampled at a lower sample rate. Without the aid of an anti-aliasing filter, the aliasing frequencies are obvious.

First the soundfile at 44.1kHz: TCRM3_1a2a5a.wav

Then reduced to roughly a quarter of that, 11,024 Hz: TCRM3_1b.wav

Then at 5kHz: TCRM3_1c.wav

At 2.5kHz: TCRM3_1d2b.wav

At 1.25kHz: TCRM3_1e.wav

And finally at 422Hz: TCRM3_1f.wav

 

ANTI-Aliasing Filters

To avoid aliasing, the audio must be filtered so that nothing above the Nyquist limit (half the sample rate) remains when it is resampled. This is done with a lowpass filter.

First, let’s listen to the drum sample again at the original rate: TCRM3_1a2a5a.wav

Now the resampling at 2.5kHz: TCRM3_1d2b.wav

Now the original audio file is filtered through a steep lowpass filter (set around 1kHz) and then resampled at 2.5kHz again. Aliasing is greatly reduced, but the recording is obviously lacking in high frequency content (it sounds dull): TCRM3_2c.wav

Now let’s listen to another recording for further comparison (this time of an acoustic guitar).

First the original recording (at 44.1kHz): TCRM3_3a.wav

Now resampled at 7402Hz: TCRM3_3b.wav

Again at 7402Hz, but with the inclusion of the anti-aliasing filter: TCRM3_3c.wav 

 

Bit Depth (quantization)

Now that we’ve heard how time/sampling factors affect digital audio, let’s look at the resolution of amplitude, as determined by bit depth.

First, let’s listen to a sax solo at 16-bits/44.1kHz: TCRM3_4a.wav

Now, the same recording requantized at 12-bits. The distortion is subtle: TCRM3_4b.wav

When quantized at 8-bits the distortion is much more obvious: TCRM3_4c.wav

At 6-bits the distortion is joined by some gating (truncation) of the quieter moments: TCRM3_4d.wav

At 4-bits the truncation is quite obvious: TCRM3_4e.wav

Finally, at 3-bits, the sax often has trouble getting above the least significant bit and amplitude is rounded off to nothing: TCRM3_4f.wav

Don’t be misled by these bit reductions: the effects of quantization are the same at any bit depth, just at different levels. To show this, here’s the same sample at 16 bits, but re-recorded 76dB lower. The recording is then boosted by 66dB so that the results are obvious. (This can be done easily using the auto-scaling function in the TCRM_Bit Depth program): TCRM3_4g.wav

Note how this compares to our 4-bit example: TCRM3_4e.wav

So how would a 66dB rerecording and boost sound on the drums you ask…?  TCRM3_4h.wav

 

Dither

The distortion and audible artifacts of quantization can be reduced by the addition of dither (noise) to the signal. The reasons for this are physical, mathematical and perceptual. Again, with the drums: TCRM3_1a2a5a.wav

Requantized at 6-bits: TCRM3_5b.wav

Now at 6-bits with dither noise added. It may be difficult at first, but try to “tune out” the noise mentally and focus on the drums. The quantization errors are masked by the dither: TCRM3_5c.wav

Let’s see how it would sound with less dither: TCRM3_5d.wav

*Special thanks to University of Massachusetts Lowell graduate students Gavin Paddock and Tim Brault for the use (and destruction of) their original recordings.

 

Pictures – Additional Info

Sample Rate

The pictures below demonstrate the effects of a given sample rate (in this case 44.1kHz) on various frequencies. Sine tones at a steady 0dBFS are used to make this more obvious.

 


First, a 2Hz tone. Note how smooth and rounded the curves of the sine are.

 

Now again at 4kHz. The “smooth curve” starts to get a bit bumpy.

 

At 8kHz bumpiness turns into a much more jagged waveform.

 

Finally, at 18kHz, not only has the waveform become more triangular but peak amplitudes and consistency in slopes have also been compromised.

 

Bit Depth (quantization)

OK, now to see what effect quantization has on the waveform. To be certain that we are viewing only amplitude functions, a 2Hz tone is sampled at 44.1kHz.

 

Here, the top sine tone was recorded at 0dBFS: the one below it was recorded at -76dBFS and boosted back to 0dBFS for easier visual comparison. Note the staircase effect of the quantization error.

 

Now take a look at a number of low level conversions.

 
