

Understanding how it all works.
By Dave Moulton

We’re all in this crazy business because we love music. And most of us who have gravitated into the recording part of this crazy business have done so because we are similarly hooked on sound. We really and truly get off on the stuff. We like what it does to us. We wallow in the sensory luxury of the really spectacular sonic magic that comes out of loudspeakers.

Speaking for myself, I have always been fascinated by the “sound character” of all sorts of environments, machines, and other incidental noise sources, as well as music instruments. It seems to me that I mostly perceive things more in terms of their sound than their visual appearance or odor.

I have occasionally even toyed with the idea of writing a detective novel where most of the descriptions would be auditory rather than visual, as in (in my first noir attempt):

“After she slammed the door on her way out, I sank into what was left of stillness in my harshly echoing office, until beneath the rising surface of quiet I could once again hear the insistent rush of traffic below the windows, occasional horn stings and the rapid shrilling of a prowl car bulling its way through pedestrians, red lights and a gridlock as tangled as my mind. The reverberations of her anger slowly tailed off into urban night noise, Fiona Apple in my inner ear with a passing boomvan keeping time.”

Oh well. Even though Spenser has nothing to fear from me, I hope you get the idea. Things and emotions can be perceived in terms of how they sound.

The point is, I hear my world as much as I see it, and usually I am acutely aware of how it sounds at a perceptual level that seems to me to be more conscious than what the average person experiences. There was a stretch, for instance, when I first really got into reverb, during which I felt like I spent all my time hearing the spaces between notes and had the damnedest time keeping the notes themselves in mind.

Clients would be moaning about distortion on the guitar track and all I could hear was the really interesting spatial double decay of the snare hit from the combination of the overheads and the guitar mics! This is, of course, one of the curses of being a recording engineer.

I suspect that for many of our end users, much of the emotional impact of sound is mostly pre-conscious, and they are not as aware as they might be of the effect that the sound of a given situation is having upon them. And that has some interesting implications for people in our line of work, which I’ll get to in a minute.

Meanwhile, I also suspect that many of you are like me, or else you wouldn’t be reading this. All this sound stuff, these emotional meanings that sound has for us, are central elements in our recording craft. Mostly we take all this for granted.

The reason I’m telling you all this is to illuminate just how clear, powerful and effective our sense of hearing is. But what I really want to discuss is how that hearing system works, and how it is that it works so well that we aren’t even aware of it working. And once we know this stuff, we can really expand our craft as recording engineers. Avanti!

How do we hear this stuff?

Our perception of sound is so easy, so seamless, so unequivocal and clear, that it doesn’t seem like there is much to it. Guy/gal makes a cool noise, we hear it. Cool! What could be simpler?

Naturally, when we try to make a recording of said guy/gal’s sound and play it back out of loudspeakers we run into a little trouble, as you may have noticed. Guy/gal makes a cool noise, we record it, play it back. Not quite so cool.

Why is that? The obvious answer that most of us like to fall back on is that our equipment isn’t good enough. And so there’s a lot of blather about accuracy going around these days, and we worry about our gear. We reason that if the gear was “really accurate,” why, it’d sound exactly like that soprano digeridoo we’re trying to overdub!

There is another explanation. What we perceive isn’t exactly what went in our ears. In fact, it isn’t like what went in our ears at all! What we consciously “hear” is far removed from the physical stimulus called “sound waves” that entered our ears.

And because there is such a huge metamorphosis between our ears and our mind, it isn’t reasonable to assume that just because we think we’ve made a physically “accurate” recording of that soprano digeridoo, that our recording is in fact accurate for our perception.

Just because we’ve used “really accurate” microphones, consoles, recorders, and speakers doesn’t guarantee a whole lot, it turns out. We need to consider our hearing mechanism in a little more detail and understand a bit more about what it is doing. And along the way, perhaps we need to redefine what we mean by “accurate.”

So I’d like to devote a long article to the human auditory system and how “the way it works” affects our work as recording engineers and producers. (Long-time readers will know that these definitions are important to me—see my article ‘What Do We Mean By “Audibility”?’ in the January 1999 issue for an earlier attack on a tough subject of this type.)

The fact that most people get their emotional hits from music and sound at a pre-conscious level gives us some powerful mojo. If we can get control of the raw sonic materials that generate those emotions, we can really get our listeners going and they’ll never even have a clue as to why!

Maybe we can even come to rule the world.

What’s really going on here

You all know the basics about hearing. You know, for instance, about the holes on each side of our heads with the funny-looking flaps and the microphones at the inner end of those holes. The holes are called ears, the flaps are called pinnae, and the microphones are called eardrums.

We also know that somehow the air pressure change detected by those microphones gets sent to our brain, and that what came into our two ears gets combined so that we can figure out where the sound is coming from. Probably you also know some other stuff, such as that the limits of our hearing run from 20 Hertz to 20,000 Hertz, that bats and dogs hear much higher, that cats hear softer, yada yada.

If you’ve done your reading a little more carefully, you may know that the softest sound we can hear is called 0 dB Sound Pressure Level and that 120 dB SPL is the loudest sound we can stand, sort of. You may have heard stuff like 1% distortion is inaudible, but also heard that there are guys ’n gals out there that can hear 0.001% distortion. You may even know the implication of those numbers.

But what isn’t really discussed or perhaps even considered is how and why our auditory system works the way it does as a system. We gloss over the difficult bit about how the sound gets from our eardrums to our brain (“Like, it just goes through the nerves, which are just like Monster Cable!”). Nor do we really consider how or why it evolved as it did, other than a tired line or two about cave persons needing to be able to localize the saber-toothed tiger just before being converted into deviled ham.

The auditory system

We’ve got the mechanical sensing system called the ears. Then there is a transducing system in the inner portion of each ear that converts the detected mechanical motion into neurological impulses. These impulses get sent to a portion of the brain called the auditory cortex via bundles of auditory nerves (which deserve some serious consideration all by themselves) and a series of intermediate stages in the nervous system and brain.

During this transmission process a lot happens to transform the neural information that was sent from the ears. Auditory neural impulses from the two ears get integrated together (exactly how, we don’t know).

These auditory neural impulses are also sent to the central nervous system as information to act upon and react to. Sensations of pitch, loudness, and direction are extracted and/or derived from this auditory neural information.

Finally, the evolved and transformed neural information is sent to the frontal lobes of the brain for perceptual activities like speech processing, identification, memorizing, and conscious perception. It is an extraordinarily complex system, and it does not yield to simplistic explanations about how we so easily and seamlessly perceive our beloved soprano digeridoo.

As observed by Zork-11 from Betelgeuse IV

What is this system trying to accomplish? Let’s consider it from the perspective of a visiting alien trying to figure this out.

First off, the physics of it are this: the auditory system is detecting the short-term pressure emissions given off by other organisms and the environment in general as a by-product of their regular activities, over a fairly broad range of frequencies and almost the entire linear range of pressures possible in the gas medium (air) in which we live.

The system permits us to detect, localize, identify, and (sometimes) communicate with a few of these other organisms (we call them Humans and Golden Retrievers) in a three-dimensional space around us. In addition, the system permits us to detect, localize, and identify the environment.

Think of it. We live in this transparent gas called air, and we are really good at detecting short-term patterns of very slight pressure changes in this gas over a huge set of ranges. And not only do we use this pressure-variation detection ability to determine what is going on around us, we also make up and generate little pressure variation patterns in this gas just for the fun of it (which we call music).

And we exchange precious metals between us in return for the fun of detecting such “cute” patterns of gas pressure variation. Whoa!

Zork-11 is impressed.

The things we hear

Let’s make a list.

We detect things happening around us, from all directions, all the time.

We identify the nature of the stuff happening quickly and easily, to a point where we can casually tell the difference between the pressure variations generated by the footsteps of Igor and Samantha as they walk around in the same room we are in.

We detect quickly (and mostly subconsciously) the presence of environmental features of our space, like walls, ceilings, etc., even though they don’t generate gas pressure variations. We do this by detecting the reflections of the pressure variations generated by the footfalls of Igor and/or Samantha. Amazing!

And when either Igor or Samantha (not their real names, by the way) generates pressure variations using the cute looking orifice located at the center of their respective heads, we have learned to quickly come to understand that it is time to (a) put out the garbage, (b) get the car fixed, or perhaps on a good day (c) join Igor or Samantha in ingesting a liquid derived from fermented fruits and/or grains. We call this “communication through the use of spoken language.” It is a totally amazing trick.

And finally, in a truly remarkable and almost totally incomprehensible way for our alien visitor and new friend Zork-11, those of us called recording engineers spend our lives fooling around with all these tiny gas pressure-variations for fun, and on a few rare occasions for precious metals as well!


Obviously, it is not the gas pressure variations that we are interested in. It is, rather, a complex range of information and emotions that is carried by such pressure variations that is what we are really interested in transmitting and perceiving.

It is useful to ask some questions about that information, such as:

– What are the features that allow us to distinguish one sound from another?

– How do we distinguish between the sounds of sources and sounds of reflections?

– How do we extract the sense of multiple pitches from a single complex wave?

– Why don’t we do this for all complex waves?

– How come we don’t get hopelessly confused by all the reflections from the environment? Come to think of it, how come we barely even notice them?

– How can we recognize Samantha’s voice on the telephone when it has been band-limited to a very small percentage of her original sound?

– What makes it possible for us to “hear” that a room has been freshly painted or that we’ve added a sofa?

The fact that all these things occur so naturally and effortlessly in our perception obscures the complexity of the underlying system. When we make recordings, the mechanical ears that we use (microphones) don’t have that same complex underlying system. As a result, they lose a lot.

When we play the resulting audio signals back through loudspeakers, they lose even more. Physically speaking, it is not a pretty picture! That it works at all is miraculous.

If we’re going to be really good at our craft of recording engineering, it behooves us to get a handle on how us humans perceive this stuff and then do what we can to make the sounds our loudspeakers put out as useful, informative, and entertaining as possible for the human auditory systems possessed by our clients and their fans.

This involves some hard thinking. Most of the operation of the hearing system is concealed from us (that’s why I call it pre-conscious). So we have to work on our ability to infer what is going on by observing the relationship between what we perceive and what we know by physical measurement is happening. A lot of it, when we get into it, is pretty spooky.

Hard thinking is mainly a process of asking hard questions. For instance, the process of neural transmission from the ear to the brain isn’t instantaneous—in fact it takes something like 5–10 milliseconds! So we’re always perceiving a delayed version of what happened. Given that that is so, how can musicians play together? And how come we don’t notice the delay?

Another puzzler from the same dismal swamp: we don’t perceive the early reflections of a sound source in a small room for up to about 40 milliseconds; this is part of what is called the Precedence Effect. How come? Is this part of what we call masking, where one sound artifact conceals another?

And speaking of masking, did you know that under certain circumstances a sound can be masked by another sound that comes after it? How can this be?

A big question from the realm of stereophony has to do with the phantom image. How come there is one? Why don’t we get phantom images from two violins playing the same note? Why does this only seem to happen with loudspeakers? Zork-11 doesn’t get it!

How can we hear chords? Why don’t we hear overtones as chords? Why don’t we hear barometric changes (they’re gas pressure variations too)? How come reverb doesn’t confuse the hell out of us? How can we actually like the stuff? Why isn’t an anechoic chamber the best place to play music?

As we begin to pull the answers to these questions together, using what we know about the origins of the hearing mechanism and what it needed to do to help us survive in a Darwinian world of natural selection, we can begin to build up a more robust and sensible understanding of what is really going on with our hearing—and how to use that knowledge for fun and profit.

Ain’t science fun?

The Air Up There

Before we delve into the actual sensory mechanism of the ear, we need to talk a little about the medium in which we sense sound: air. We need to consider briefly the range of magnitudes of various physical characteristics of that air that result in our perception of sound. We need to talk about the nature of air itself and how it supports sound.

This air we live in is a compressible gas. That means the relative density of the air is quite variable, and the number of “air” molecules in any given space, or volume of air, may change. Such density changes as a function of temperature (cold air is denser than warm air) and altitude (“low” air is denser than “high” air). Air density can also be changed as a function of what we call displacement, or the insertion of some other material into some given air space.

When we displace air suddenly it is called “excitation.” The result is that the density of the air around the point of excitation changes. Further, that density change tends to expand away from said point of excitation.

This physical process is what leads to the notion of “sound.” We “excite” the air molecules at some point in space, displacing them. They in turn displace other molecules. A wave of density flowing away from the original point of excitation reaches our ears some distance away, some time later. Voilà. We perceive a sound. That’s how it happens, plain and simple.

Now, there are some limits to this.

To begin with, the medium (air) has certain limits to its density. At the minimum, there are no molecules present. This is called a vacuum, and it happens in outer space (air has weight and is therefore attracted to the surface of the earth) or in an enclosed space from which we have extracted all air by means of a suction pump.

In a vacuum there can be no displacement (because there are no molecules). Therefore, there can be no sound. Period.

At the other end of things, when we make air more dense by compressing it, at some point it changes state from a gas into a liquid. Such a liquid is not viable for humans, so the issue of how sound might be transmitted in liquid oxygen is pretty much irrelevant for most studio work.

Meanwhile, the speed at which waves of air pressure can travel is constant (for any given temperature and altitude, or density). This is called the “speed of sound” and it is approximately 1130 feet per second at sea level and 70° F.
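If you want to check the arithmetic, it’s one division: wavelength is the speed of sound divided by frequency. Here’s a little Python sketch (mine, not anything official; the constant is the article’s sea-level figure):

```python
# Wavelength = speed of sound / frequency, using the figure of roughly
# 1130 feet per second at sea level and 70 degrees F.
SPEED_OF_SOUND_FT_S = 1130.0  # approximate; varies with temperature

def wavelength_ft(frequency_hz: float) -> float:
    """Length in feet of one complete pressure cycle at the given frequency."""
    return SPEED_OF_SOUND_FT_S / frequency_hz

for f in (20, 440, 1000, 20000):
    print(f"{f:>6} Hz -> {wavelength_ft(f):8.3f} ft")
```

At 20 Hz that works out to about 56 feet per cycle, and at 20 kHz to well under an inch, which gives you a feel for the physical scale of the Audio Window.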

These qualities serve to describe something about the medium of air itself. For our purposes it is useful to consider the range of magnitudes within which “sound” as perceived by humans exists. I call this set of ranges The Audio Window.

This “window” is an interesting way to consider sound. In a general way, it frames the three most important “dimensions” within which sound exists and the physical limits of the audibility of sound for us humans. I have also found it useful to consider what lies “outside” the dimensions of the window: the physical behaviors of air that are similar to sound behavior but aren’t audible.

The three dimensions that we will consider here are frequency, amplitude, and time. I usually show them on three axes: horizontal (or X) axis for frequency, vertical (or Y) axis for amplitude, and front-back (or Z) axis for time.

Two of these “dimensions” are related: frequency and time. We consider them separately because of the way our hearing works. “A sound” consists of a group of frequencies occurring over a given time period usually greater than 50 milliseconds.

Because the array of frequencies plays such an important part in our perception of timbre, it is useful to consider them independently from the time dimension itself, where we can consider the relationship of events and their component parts (i.e. spectral, dynamic, and spatial envelopes).

About frequency

As I’m sure you know, the approximate range of frequencies that us humans detect is ten octaves (doublings of frequency), from 20 Hz to 20 kHz. Such frequencies are related to our sensory perceptions of “highness” and “lowness” of sound, and also less directly to our sense of pitch. When we say “frequencies,” what we really mean is “unique rates of change of density.”

Such an expression is the long way of saying “sine wave.” The waveshape known as a sine wave is a particular, very special shape describing the periodic change in density in the air over time. That shape is equivalent to seeing a spiral motion (like a spring, for instance) from the side. It represents energy vibration at a single unique frequency.

All other waveshapes involve multiple frequencies. They can be reduced, through a mathematical technique called Fourier Analysis, into an array of sine waves, each having its own frequency, amplitude, and phase.
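You can run Fourier’s idea in reverse with a few lines of code. This Python toy (mine, a sketch rather than a real analysis tool) builds a square wave out of nothing but sine waves, each with its own frequency and amplitude:

```python
import math

def square_from_sines(t: float, f0: float, n_harmonics: int) -> float:
    """Fourier-series approximation of a square wave: odd harmonics of f0
    (f0, 3*f0, 5*f0, ...) summed at amplitudes 1, 1/3, 1/5, ..."""
    total = 0.0
    for k in range(n_harmonics):
        n = 2 * k + 1  # odd harmonic number
        total += math.sin(2 * math.pi * n * f0 * t) / n
    return (4 / math.pi) * total  # scale so the wave swings about +/-1

# Sample one cycle of a 250 Hz "square" wave built from 50 sine waves.
# The sum sits near +1 through the first half-cycle and near -1 through
# the second, with ripple that shrinks as more harmonics are added.
one_cycle = [square_from_sines(i / 48000, 250.0, 50) for i in range(192)]
```

With one harmonic you get a pure sine; pile on more odd harmonics and the corners square up. That’s all an “array of sine waves” means.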

This concept becomes very important when we consider low frequencies. When we go below a frequency of approximately 20 Hz, we can no longer detect that vibration of air density with our ears. So a sine wave of 10 Hz is inaudible.

However, a square wave of 10 Hz will be audible, as a series of clicks. This leads us to the insight that it is not the frequency of the wave that determines audibility, but rather the rate of change of air pressure.

At very high frequencies the inertia of the eardrum limits our ability to detect sounds. The change in pressure away from average pressure and back happens for such a short period of time that the eardrum does not have time to move in response.

So the range of “unique single frequencies” we can hear is limited at low frequencies by rate of change in air pressure and at high frequencies by the inertia of our eardrum.

Nonetheless, there are frequencies of air density change above and below our hearing range. Frequencies at or below our low frequency limit are called infrasound or sub-audio sound. Frequencies above our hearing limits are called ultrasound.

Infrasound (low frequencies) is fascinating stuff. The stuff between 1 and 40 Hz has some startling effects on humans. When it is at low amplitude levels it is not audible, but at high amplitude levels it is (at least down to 20 Hz).

It also has a related effect of becoming tactile, so that as it becomes loud enough it can be “felt” on the skin. This tactile quality has a very powerful psychological effect on humans. There is a pharmacological effect as well, in that the presence of infrasound stimulates the human organism (that’s us, dude!) to generate adrenaline, leading to a state of heightened tension and readiness for physical activity (leading to the so-called “fight or flight” syndrome). Our evocation of these functions via the use of subwoofers is obvious and spectacularly successful!

At the same time, the volume of air required to generate significant amplitude (and loudness) changes with frequency, so that as the frequency gets lower the amount of air needed to generate an equal amount of air density change (amplitude) goes way, way up. This is why woofers need to be big. A small driver simply cannot displace enough air to generate a meaningful pressure wave that is, say, 56 feet long (at 20 Hz).

Below 1 Hz, we encounter the realm of atmospheric density where the wavelengths are so long (1/5th of a mile at 1 Hz) and the volume of air required is so great that it only occurs as a function of weather related phenomena, such as wind and barometric pressure. Interestingly, wind can be thought of as direct current, and as such it doesn’t have a frequency and we don’t hear it. What we hear is the turbulence generated by it as the wind flows by. When wind howls or whistles, that turbulence has become periodic, like in any wind instrument (now you know why they’re called “wind” instruments, eh?).

Barometric pressure is the lowest frequency we encounter in air, with a period of about a week on average. We don’t hear it and we don’t feel it, although there is some indication that extreme barometric pressures and/or pressure changes do affect mood.

On the ultrasonic end of things, sound as such goes on up to about 1 Megahertz in air. The limiting factor is the size of oxygen molecules. When the wavelength of the density change becomes approximately equal to or shorter than the size of the molecule itself, that wave can no longer be propagated, because the molecule is not itself compressible.

That said, many mammals hear up to 100 kHz, and there is plenty of viable sonic information from musical instruments, etc. in the ultrasonic range. However, humans really don’t seem to make much, if any, use of this information. In spite of the many rumors and speculations, stuff above 20 kHz really does seem to be beyond us.

About amplitude

Meanwhile, while the rate of change is called “frequency,” the amount of change in density in air is called “amplitude.” It is related to our sensory perception of loudness.

Now, there are limits to the amount of variation in density that air is capable of, at least for all practical purposes. For a given space at some given point in time, there is some number of molecules in that space. In the absence of displacement or motion of objects within the space, the density distribution of molecules is approximately equal everywhere in the space. This is the condition we sense as “silence,” which is equivalent to a lack of motion of air molecules and consequently no changes in density that would occur as a result of such motion.

However, the molecules never stop moving completely—there is always random motion of molecules moving about in the gaseous space, colliding with each other with random incidence and direction (this only stops completely when the temperature gets down to what is known as “absolute zero”—about 460° below zero Fahrenheit). This is called “thermal noise.”

At normal room temperatures it is approximately an average fluctuation in density of 50–100 trillionths of an atmosphere. This is a small enough fluctuation that us humans mostly don’t hear it. It is somewhat below our “threshold of hearing” (which is approximately 200–500 trillionths of an atmosphere —200 trillionths of an atmosphere being labeled 0 dB SPL).

What is an atmosphere? It is the average air density at the surface of the earth at a given temperature (typically 70° F). It is expressed in terms of the pressure of such air on the ground due to the force of gravity, and is slightly less than 15 pounds per square inch. In the metric system it is quite elegantly expressed as almost exactly “one bar.” I like it!

This leads us to the maximum amount of pressure fluctuation that can occur. For a sine wave that level occurs when 100% modulation occurs or when the density swing is equal to the average pressure, or one atmosphere (bar). This level is approximately 195 dB SPL. At this level, the “negative density” (rarefaction part of the wave) is effectively equal to zero, or a vacuum, so above this level the air will distort.

(Actually, it begins to distort at much lower levels. Some years ago I was told by an engineer doing noise reduction work on jet engines for Pratt & Whitney that air non-linearities begin around 125 dB SPL, while engineers at Bang & Olufsen tell me they have measured it at a somewhat higher level, around 160 dB SPL for 3% harmonic distortion).

There can be sounds louder than 195 dB SPL, of course—they simply can’t be sine waves. And in reality we can consider 195 dB SPL to be an upper limit for amplitude. Just so you know, the acoustic power needed to generate such a sound is somewhere around 1 Gigawatt!

In any case, this range of amplitudes is huge: 6 billion to one. Us humans find any amplitude greater than 2 ten-thousandths of an atmosphere (120 dB SPL) to be quite painful. We don’t go there often.
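For the numerically inclined, the dB arithmetic behind all these figures fits in a few lines of Python (my sketch, using the standard 20-micropascal reference for 0 dB SPL; with that reference a full atmosphere of pressure swing lands near 194 dB, in the same ballpark as the figure above):

```python
import math

P_REF_PA = 20e-6          # 20 micropascals: the standard 0 dB SPL reference
ATMOSPHERE_PA = 101325.0  # one atmosphere, in pascals

def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to 20 micropascals."""
    return 20 * math.log10(pressure_pa / P_REF_PA)

print(spl_db(P_REF_PA))          # 0 dB SPL, the threshold of hearing
print(spl_db(20.0))              # ~120 dB SPL, the threshold of pain
print(spl_db(ATMOSPHERE_PA))     # ~194 dB SPL, 100% modulation of a sine
print(ATMOSPHERE_PA / P_REF_PA)  # pressure ratio of about 5 billion to one
```

Every 20 dB is a factor of ten in pressure, which is how ten or so orders of magnitude get squeezed into a number line we can actually talk about.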

On the soft end of things, there has developed over the past century a miasma of low-level noise pollution throughout the civilized world (and thanks to airplanes, much of the rest of the world). As a result, we simply don’t encounter “still air” very often. Mostly, the air sort of rumbles and bumbles along at an average amplitude of about 40 dB SPL, which is about 20 billionths of an atmosphere.

Fortunately, such low-level turbulence and noise consists of low frequencies (less than 100 Hz), to which we humans are less sensitive than high frequencies.

About time and its relationship to frequency

The time axis is quite peculiar, and its implications for perception of sound are poorly understood in general. Nonetheless, it is an extraordinarily important dimension for us humans, and a rudimentary understanding of it is necessary if we are going to understand much at all about how we really hear.

The shortest sound we can hear is an impulse about 20 microseconds long. This is related, of course, to the upper limit of frequencies, about 20,000 Hz. We can theoretically hear sounds that are indefinitely long. However, from a perceptual standpoint they cease to be perceived as events and instead become “a continuum.”

I sort of arbitrarily put that longer limit at about ten minutes, though most of the time it is a good bit shorter. In a practical sense, a musical note that goes on for 30 seconds is about as long as we can stand before we begin to get really bored! Meanwhile, the sound events (like musical notes and spoken words) that we work with in audio are generally between 50 milliseconds (1/20th of a second) and five seconds long.

There are two other primary things to know about the time dimension. The first is that there is a fundamental neurological boundary for humans at about 50 milliseconds. Events occurring more quickly than that are perceived as a single or continuous event, not as a series of events. Events occurring less quickly are perceived as separate individual events. This holds true for vision as well (hence the 24–30 frames-per-second rates of film and video). So as I noted earlier, a 10 Hz square wave will be heard as a series of clicks, while a 50 Hz square wave will be heard as a continuous tone.
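That boundary is simple enough to sketch as a toy rule in Python (the function name and the hard 50 ms cutoff are mine; real perception is fuzzier than this):

```python
# Toy sketch of the ~50 ms fusion boundary: pulses arriving faster than
# about one every 50 ms (above roughly 20 per second) fuse into a
# continuous tone; slower pulse trains are heard as separate clicks.
FUSION_PERIOD_MS = 50.0  # approximate neurological boundary

def perceived_as(pulse_rate_hz: float) -> str:
    period_ms = 1000.0 / pulse_rate_hz
    return "continuous tone" if period_ms < FUSION_PERIOD_MS else "separate clicks"

for rate in (5, 10, 50):
    print(f"{rate} Hz pulse train -> {perceived_as(rate)}")
```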

The other time phenomenon worth keeping in mind is our integration time. When we perceive “a sound,” we integrate all of the versions of that sound that occur within the first 50 milliseconds of the sound into a single holistic perception. So we sort of average all versions of a sound that occur during that period to produce our conscious perception of the sound source. We will discuss this at considerable length in later articles.

Finally, we have to keep in mind that frequencies are a subset of the time dimension. Individual pitched (periodic) sounds consist of an array of frequencies, as I mentioned above. The lowest such frequencies (fundamentals) usually fall within a four-octave range from about 60 Hz to 1 kHz. All frequencies above 1 kHz can be regarded as harmonics that enable us to determine timbre and differentiate sounds from each other.

What the Audio Window is and isn’t

So there you have it. All sounds that humans hear exist within this range of frequencies, amplitudes and times. There are other frequencies, amplitudes, and time periods that exist in air, but humans don’t perceive them as sound and they have little or no relevance for our audio production work.

So in a simple way you can think of this space, this confluence of physical dimensions, as our audio sandbox or playpen. Here’s where our work is done.

At the same time, keep in mind that there’s much more to it. Within this space there are many features and facets of sounds and our auditory system that make our actual hearing dramatically richer, more powerful, and more moving.

Zork-11: curious about ‘pitch’

When we talk about pitch, we are talking about musical notes like Middle C, A-440, or G-sharp two octaves below Middle C. Pitches have sensations of highness and lowness. Melodies are made up of sequences and patterns of higher and lower pitches. Chords are made up of stacks of pitches. We have twelve chromatic pitches to the octave. The human voice rarely exceeds four octaves worth of pitches, from Low C (ca. 65 Hz) to High C (ca. 1040 Hz). Pitches are the very stuff of music.
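If Zork-11 wants actual numbers, equal-tempered pitch frequencies all come from one little formula: each semitone multiplies frequency by the twelfth root of two. A quick Python sketch (the names are mine):

```python
A4_HZ = 440.0  # concert A, our tuning reference

def note_hz(semitones_from_a4: int) -> float:
    """Equal-tempered frequency n semitones above (or below) A-440."""
    return A4_HZ * 2 ** (semitones_from_a4 / 12)

print(round(note_hz(-9), 2))   # Middle C: 261.63 Hz
print(round(note_hz(-33), 2))  # Low C: 65.41 Hz (the "ca. 65 Hz" above)
print(round(note_hz(15), 2))   # High C: 1046.5 Hz (the "ca. 1040 Hz" above)
```

Twelve semitones gets you exactly a doubling, which is why octaves line up no matter where you start counting.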

So what are these so-called pitches? Zork-11, our imaginary friendly alien visitor from the fourth planet of that amazing giant red star Betelgeuse, would like to know. Not having ears himself (he perceives everything as pulse trains of photons, hmmm…), he would really like to understand.

Our simple 8th grade answer is that pitch is like frequency. You know, you’ve got your A-440, which is, like, 440 vibrations a second, only we call it A. Know what I mean?

Unfortunately that doesn’t really cut it. You see, pitches aren’t really frequencies. They are subjective perceptions. They don’t exist in the physical realm, only in our brains. And we can hear an A-440 when there is NO frequency present anywhere near 440 Hz, the frequency we associate with that particular A.
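This is the famous “missing fundamental” effect, and you can build such a signal in a few lines. Here’s a Python sketch (mine) that sums only the 2nd, 3rd, and 4th harmonics of 440 Hz, with no energy at 440 Hz at all. The composite waveform still repeats every 1/440th of a second, and it is that period our hearing pins the pitch to:

```python
import math

F0 = 440.0  # the pitch we will hear, though no energy exists at 440 Hz

def missing_fundamental(t: float) -> float:
    """Sum of only the 2nd, 3rd, and 4th harmonics of 440 Hz."""
    return sum(math.sin(2 * math.pi * n * F0 * t) for n in (2, 3, 4))

# The spectrum contains 880, 1320, and 1760 Hz, and nothing at 440 Hz,
# yet the composite waveform repeats every 1/440th of a second. Our
# hearing latches onto that period and reports the pitch A-440.
period = 1.0 / F0
assert abs(missing_fundamental(0.0123) - missing_fundamental(0.0123 + period)) < 1e-6
```

Play such a signal and most listeners will happily sing you back an A, not any of the frequencies actually present.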

There’s a related confusion. Musical notes (pitches) have timbre, which is derived from overtones among other things. Now, overtones are stacks of frequencies, just like chords. In fact, the first seven overtones are a dominant 7th chord.

How can we tell overtones from chords? What’s the difference? And how come the overtones don’t foul up our perception of chords? And what does this have to do with pitches?

Zork-11 is fascinated and very curious.

The physics of waveforms

In order to get at this particular thicket of auditory perception, we first need to consider physical waveforms. These are periodic cycles of air pressure change. Last month we mentioned the simplest one, the sine wave, which has energy at only one frequency. Take a look at the inventory of waveshapes in Figure 1, stolen from my book Total Recording (thanks to KIQ Productions for permission to reprint these pictures here).

Now, these are shapes of changing pressure over time. If the period of any of these waveforms is 1/440th of a second, we will say that the waveform has a pitch of A-440. What gives the wave its "A-440"-ness is the length of the period, not the presence of energy at 440 Hz.

Each of these waveforms, except for the sine waves, has energy at multiple frequencies (this is what overtones are). Look at the same set of waveforms viewed as energy distributed across the spectrum in Figure 2. For graphical ease I’ve expressed them with a period of 4 ms (250 Hz).

In each of these spectrographs (except for the 500 Hz sine wave) there is plenty of energy at 250 Hz, which is an out-of-tune B natural below Middle C. So it makes sense to think that the pitch should be related to 250 Hz. We can imagine that these waveforms could all very well have the pitch of a 250 Hz out-of-tune B.

But what about a B just below Middle C pitch that doesn’t have any 250 Hz in it? Let’s take a look at some waveforms and spectrographs taken from a TurboSynth window. Here the overtones aren’t labeled by frequency, just by harmonic number.

First, let’s look at our old friend the sine wave to get a feel for this (figure 3). Figure 4 is the expression of that same sine wave as a spectrograph.

Note that the only energy is at the first harmonic, called the fundamental. For a 250 Hz tone this fundamental is at 250 Hz. Each harmonic above this will be 250 Hz higher than the previous one.

Now look at the waveform (figure 5) and spectra (figure 6) for a complex wave.

Notice that there is no fundamental frequency present, just harmonics 2 through 6. So there’s energy at 500, 750, 1000, 1250, and 1500 Hz, but none at 250.

What’s interesting about this (and what Zork-11 doesn’t understand) is that this also has a pitch of that out-of-tune B whose fundamental is 250 Hz. It sounds just as high (or low) as the 250 Hz sine wave, just, er, brighter! How can this be? The nearest energy is at 500 Hz!

Figure 7 shows another 250 Hz waveform, where the lowest energy is at 1 kHz. Its associated spectrograph is in figure 8.

This waveform also has a pitch of that same out-of-tune B. What should be clear from the above examples is that pitch seems much more closely related to the length of the repeating period than to the energy present.
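A quick numerical sketch makes the point (Python, my own illustration rather than anything taken from the figures): build a tone out of harmonics 2 through 6 of 250 Hz only, and the summed waveform still repeats every 4 ms, even though there is no energy at 250 Hz at all.

```python
import math

def missing_fundamental(t, f0=250.0, harmonics=range(2, 7)):
    """Sum of sines at harmonics 2-6 of f0 -- no energy at f0 itself."""
    return sum(math.sin(2 * math.pi * n * f0 * t) for n in harmonics)

period = 1.0 / 250.0  # 4 ms

# Despite the absent fundamental, the waveform repeats every 4 ms,
# and that repeating period is what our ears latch onto as the pitch:
for t in (0.0001, 0.0005, 0.0023):
    assert abs(missing_fundamental(t) - missing_fundamental(t + period)) < 1e-9
```

The 4 ms period survives with the fundamental removed, which is exactly the behavior the figures show.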

But, Zork-11 asks, how do you hear this? He can understand how higher frequencies would sound higher, but how can a group of higher frequencies sound just as high (or low) as a lower frequency? How does the ear do this?

The answer is, Zork-11, we don’t really know.

However, we have some ideas.

Detecting pitch

The inner ear, called the cochlea, contains a remarkable thin, compliant membrane stretched along its length. This membrane is called the basilar membrane, and it has two really interesting features. First, it is resonant at different frequencies at different points along its surface, and second, it is infused with thousands of nerve endings (called hair cells) that are attached to the auditory nerve going to the brain.

The net effect of these two features is that different frequencies excite different hair cells, so that in a general (if incomplete) way we can think of the mechanism as causing each hair cell to be excited by a very specific range of frequencies. This leads us to a concept called the "place" theory of pitch detection.

The idea is that "pitch" is related to a "place" on the basilar membrane. A given place has a given pitch, or so the theory goes.
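As a rough sketch of "place," here's the Greenwood function, a standard empirical fit for the human cochlea (my own illustration, not anything from this article): it maps a position along the basilar membrane to the frequency that resonates there.

```python
def greenwood_hz(x):
    """Greenwood's empirical frequency-position fit for the human cochlea.
    x is the fractional distance from the apex (0.0) to the base (1.0);
    returns the approximate resonant frequency, in Hz, at that place."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

# The apex resonates near the bottom of our hearing range, the base near
# the top -- the whole audio window laid out along one membrane:
low = greenwood_hz(0.0)    # roughly 20 Hz
high = greenwood_hz(1.0)   # roughly 20,000 Hz
```

So under place theory, assigning a pitch amounts to noticing *where* along this frequency map the membrane is vibrating.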

Obviously, there’s a problem with this theory. In our examples above, different places on the basilar membrane would be excited, but all yielding the same pitch sensation. Nonetheless, it’s worth asking the question: what do all of these different waveforms have in common, on the basilar membrane?

As I wrote in Total Recording (sort of edited here, with apologies for the blatant self-promotion): "Complex sounds consist of multiple frequencies, which in turn cause multiple points on the basilar membrane to vibrate… Here’s where the concept of neural templates comes in handy. If we think of each pitch as a unique array of "places" on the basilar membrane that vibrate at a single time, it makes a little more sense. Think of that array of places being stored in our memory as a single neural pattern. Then when we hear that pattern, we are able to successfully compare it to the template and say something like ‘Ah-hah! That’s neural pitch template B-below-middle-C!’"

And here’s the cool part: "We don’t need the ‘array of places’ to be exactly like the neural template in order to make the correct pitch identification; we simply need it to be a close enough ‘fit’ that we feel that it has the same quality of ‘pitch.’ In fact, the perceived array can be significantly different from the template and we can still identify the pitch as the same. It only falls apart when ‘places’ are excited on the membrane that are ‘incompatible’ with our pitch template (i.e. being offset just a little, out of the pattern, or having a significant portion of the array seem to resemble a different pitch template).

"Using this concept, then, we can think of a pitch as an ‘array of particular places’ on the basilar membrane. If we hear only a sine wave, we are exciting only one of those "places," and intuitively, using this concept, such a sensation is probably going to be a little vague. And the reality supports this: the sine wave is comparatively difficult to assign pitch to."

So, Zork-11, that is roughly how we think we detect this stuff we call pitch.

"But," Zork-11 quickly interrupts, "what about these things you call overtones and chords? How can they both exist at the same time? How can your basilar membrane tell the difference between a pitch template and a chord? They are both stacks of frequencies, right?" Zork-11 is getting agitated.

Why is a waveform not a chord?

A fundamental concept related to frequency is phase, which can be thought of as the progression of the waveform through its period. Now, any two vibrating sources (except loudspeakers, curiously enough) will vibrate at different frequencies, even when they are extremely closely tuned. The result is that they will go in and out of phase over time.

This leads to the phenomenon of beating, and it is a primary musical behavior. It is also inescapable: two separate vibrating sources will always vibrate at at least slightly different frequencies.

Meanwhile, a single vibrating source will have some waveform, which in turn is constituted of a stack of frequencies. Now, and this is the key part, Zork baby, for that waveform to be periodic, to repeat over and over, all of the frequencies in that waveform have to be locked in phase.

If they aren’t locked the waveform falls apart and constantly changes. When we consider a square wave, for instance, all of the frequencies cross zero going positive at exactly the same instant in time, for each cycle of the wave. We call this behavior "phase-locked." It is an essential ingredient of the concept of the waveform.

So when we have a complex wave, with a stack of frequencies, they are all phase-locked so that for each cycle of the wave, they all maintain the same phase relationships. This phase-locked behavior is the unique signature of a "single" sound source—it is how we know that we are hearing "one" sound, and not a group.

Exactly how we do this, Zorkarino, we don’t know either. My guess is that it occurs in the auditory nerve as a function of the transmission from the basilar membrane to the brain.

Anyway, the effect is profound. Try manually tuning a bunch of analog sine wave oscillators to mimic a stack of overtones: 250, 500, 750, 1,000, etc. It will be painfully obvious that you are listening to a "bunch" of oscillators. The difference between those oscillators and a single sawtooth waveform at 250 Hz is huge. The sawtooth wave is obviously a single sound, while the bank of oscillators is oh so obviously not (even when they all emit from the same loudspeaker)!

So regardless of exactly how we do it, here is the straight skinny, Zorko, about the difference between chords and overtones. Overtones are phase-locked, and they represent multiple frequencies from a single source, while chords represent a group of sources, each performing a different pitch in the chord. Obviously, if all of the sources have complex waves, it all gets even easier to understand and hear out the chord.

That relates to pitch?

Pitch is detected as a pattern of phase-locked frequencies on the basilar membrane, a range of "places" that are "correctly spaced." We memorize templates for these pitches—perfect pitch can be thought of as really good long-term memory of very exact "places" and "spacings."

If the frequencies are not phase-locked, they aren’t included in the template. For us they’re "other" sounds.

So, Zork-11, there you have the basics. Pitch is a subjective sensation of the highness or lowness of a sound. Depending on the relative loudness of the various overtones, it can be brighter or darker. Regardless of its timbre, it has a given "height" in the musical scheme of things.

This all works because of the remarkable combined capabilities of the basilar membrane to detect individual frequencies, the auditory nerve to detect phase-locked vs. free-running frequencies, and our brains to make coherent sense of what is, after all, a very confusing situation.


Zork-11, our friendly visiting alien, demands to know about loudness. Recently he’s been homesick for Betelgeuse, and that makes him, well, a little impatient! Cranky, you might say. So it’s important that I try to humor him and explain about loudness before he does a photon discharge.

Well, Zork-11, the quick answer is: loudness is the subjective perception of the magnitude of air pressure fluctuations in the air, roughly speaking.

It is how we humans perceive the energy and power that are being transmitted through the air. The more energy and power, the louder the sound.

Got that, Zork-11?

Pressure, power, amplitude, and loudness

What happens physically is this. Some kind of mechanical physical motion excites the air. The more energy and power that are expended by that physical motion, the greater the swings in air pressure from compression to rarefaction.

This in turn causes greater motion of our eardrums. The result is a change in what we hear, a change in the particular quality of sound that we call “loudness.”

There is an approximate correlation between the subjective quantity loudness, the physical force called pressure, and the amount of work accomplished per unit of time, called power. We can make the following generalizations:

- Power changes as the square of the changing pressure (doubling the pressure quadruples the power).

- Tripling the pressure multiplies the power by about 9, which is close to 10 (roughly the same as adding 10 dB, by the way).

- As the power is multiplied by ten, the sensation of loudness approximately doubles (it sounds “twice as loud”).

- Therefore, tripling the pressure approximately doubles the apparent loudness.
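Those rules of thumb fall straight out of the standard decibel definitions (20·log10 for pressure ratios, 10·log10 for power ratios); here's a quick check, a sketch of my own:

```python
import math

def db_from_pressure_ratio(r):
    """Decibels corresponding to a pressure (amplitude) ratio."""
    return 20.0 * math.log10(r)

def db_from_power_ratio(r):
    """Decibels corresponding to a power ratio."""
    return 10.0 * math.log10(r)

# Doubling the pressure quadruples the power -- both come out at +6 dB:
assert round(db_from_pressure_ratio(2.0), 1) == 6.0
assert round(db_from_power_ratio(4.0), 1) == 6.0

# Tripling the pressure multiplies the power by 9, close to 10x,
# so it lands near +10 dB (about +9.5 dB, to be exact):
assert round(db_from_pressure_ratio(3.0), 1) == 9.5
assert round(db_from_power_ratio(9.0), 1) == 9.5
```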

Pretty straightforward, eh, Zork? He warbles in agreement.

Zork-11 ponders the hugeness of it all

It gets interesting when we begin to contemplate the huge range of loudnesses we can perceive.

Truth is, we can actually hear something like 15 or 16 doublings of loudness. So the loudest sound we can stand sounds about 50,000 times as loud to us as the softest sound we can hear. And that represents, say, 14 or 15 triplings of pressure, or a pressure range of 8 million to 1!

It may not seem like a lot to you, Zork, coming from Betelgeuse, but that’s a big pressure range for us humans. Why, we have trouble even imagining it!

And the power ratio is really big! 8 million squared is, like, 64 trillion! That’s a heap of watts!

So the key issue with loudness isn’t how we perceive it, but how we manage to perceive it reliably over such a huge range! Back when we looked at the audio window, if you recall from our January issue, we spent some time considering the physical implications of that range.

At the soft end we are very close to being able to detect the sound of a single molecule bouncing into our eardrum, while at the loud end of things we can almost tolerate the sound of airwaves creating vacuums. So in terms of loudness, we can detect levels pretty much all the way across the useful range of sound pressure levels. And that range is huge!

What makes it really remarkable is how constant the rate of perceived loudness change is. A change in amplitude (or power) of 1 dB (12% amplitude change) is roughly the smallest loudness change humans can pick out. And that 12% threshold holds up pretty well across the whole 8,000,000:1 range! Wow!
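That 12% figure comes directly from the decibel definition; a quick one-liner (my sketch) confirms it:

```python
# Amplitude ratio corresponding to a 1 dB step:
ratio = 10 ** (1.0 / 20.0)

# A 1 dB change is roughly a 12% change in amplitude:
percent = (ratio - 1.0) * 100.0   # about 12.2
assert 12.0 < percent < 12.5
```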

How we perceive loudness

The multiple processes via which we perceive changing loudnesses are highly complex and definitely non-linear. There is the eardrum to consider, where sound is converted into mechanical motion. Then there is a linkage of bones in the middle ear that serve to amplify the mechanical signal and transmit it to the inner ear (the cochlea), where the mechanical signal is converted into a compound array of neurological signals being sent to the brain.

We discussed roughly how this works in February 2001 (‘Hearing Highs and Lows’). Suffice to say that on the basilar membrane in the cochlea, sound is detected along a resonant membrane, so that different parts of the audible spectrum are detected at different places in an array of about 30 critical bands. Meanwhile, the basilar membrane is infused with thousands of nerve endings that respond to motion in their vicinity.

As sounds increase in amplitude, two things happen. First, the rate at which the nerve endings fire neural impulses increases, so that loudness increases (roughly) with the general firing rate of the nerve endings. Second, the area of the basilar membrane that resonates will become greater, and therefore more nerves will fire. So at the basilar membrane, loudness correlates to the neural firing rate and the total number of neurons firing.

“But wait a minute,” says Zork-11. “How many nerve endings are there, really? You said thousands, right?”

“Right,” I reply. “Around 30,000, I think, but a lot of them actually send nervous impulses to the ear from the brain, so I’m not sure they all count. Figure 15,000 maybe?”

“But you said that humans can hear amplitude over an 8,000,000:1 range,” Zork-11 exclaims, his hydrochlorinator rasping slightly. “How can you do this with only 15,000 nerve endings? 30,000 even?” Zork-11 has been paying attention even though he didn’t always appear to be.

Getting complicated

As amplitude increases, the number of nerve endings affected increases geometrically, so a small change at high levels involves a much greater extra number of nerve endings than a similar small change at low levels. But the number of nerve endings on the basilar membrane and the range of firing rates is insufficient to do the whole job of distinguishing loud from soft across the dynamic range we are talking about. Other mechanisms must be at work as well.

Part of the answer has to do with the concept of critical bands. These are the various regions on the basilar membrane (each about 1/3rd of an octave wide) that seem to function approximately as a unit, so that each such band carries independent information.

The sensation of loudness, it turns out, is affected by the number of such critical bands that are excited by a sound source. So as the spectrum of a sound broadens, its loudness increases too, even if its amplitude doesn’t. Pink noise at 80 dB SPL will sound louder than a sine wave at 80 dB SPL.

Similarly, time affects loudness as well. A very short impulse will sound about 20 dB softer (i.e. roughly 1/4 as loud) than sustained noise at the same level.

A third issue, one that I’ve been fooling with a lot recently, is distortion. It is a truism in audio that perceived loudness increases as a function of distortion. This is probably related to the critical band process mentioned above, but I suspect it’s something more as well. The perceived onset of even quite low-level distortion components seems to increase loudness quite a bit, to a point where I’ve speculated about the concept of “virtual loudness” based on the addition of some non-linearities (er, distortion) to the audio signal.

What all this means, of course, is that loudness varies in some really complex ways. Time, spectrum, and linearity all affect how loud a sound is perceived, independent of its amplitude. So that simple correlation between amplitude and loudness that I started out by describing falls apart pretty quickly, and can only be thought of as true for signals with similar spectra, linearity, and impulse content (we call it “crest factor”).

Getting really complicated

It turns out that in addition to all the complexity of the basilar membrane related to number of nerves firing, etc., we have two compressors in each ear! First, the eardrum, its tension varied by a small muscle, serves as a slow-acting compressor by changing its compliance in response to different perceived amplitude levels.

Second, the three bones that transmit motion from the eardrum to the cochlea function both as an amplifier stage and also as a comparatively fast-acting limiter at high levels: the bones slip apart but are held approximately in place by cartilage. In combination these two mechanisms provide something up to 30–40 dB of variable level reduction, depending on the nature of the signal.

What’s wild about these compressors is that we have real trouble hearing them work! Even though they can turn down the mechanical level reaching the basilar membrane by up to 30–40 dB, they don’t seem to reduce the loudness we perceive!

I’ve tried to hear them, believe me, listening to all kinds of signals in all kinds of environments. So I suspect our brain gets into the act. My guess is that as the compressors in our ears pump away, regulating levels on an on-going basis, our brain compensates for this regulation so the “apparent” or “perceived” loudness doesn’t change.

This implies some sort of elaborate interactive feedback/feedforward process that turns down the mechanical level in our ears while it turns up the “mental” level in our consciousness! Whoa, Zorko!

Zork-11 shriggles his pleasure and approval of the concept.

Not only that, but the brain also feeds back neural information to the basilar membrane to condition and “sharpen” low level signals, helping us to hear them and reducing “neural noise” on the membrane.

The net result is that we have an apparently effortless and seamless perception of loudness from the level of single hydrogen molecules bouncing in space right up to the onset of Armageddon III. This perception is smooth, compelling, and extraordinarily reliable. In many respects our hearing blows test gear away, for its speed, reliability, and noise reduction—not to mention bandwidth and resolution.

Levels in the acoustic/analog/digital realms

This is a good part of why audio gives us such fits. Our hearing capacity, although complex, variable, and multifaceted, gives every appearance of being very smooth and straightforward over a truly gigantic range. To help put this in perspective, I like to think about audio signals in relation to our human hearing limits.

We have this lowest level that we can hear (actually, that level varies a great deal with frequency—we’re talking about 1 kHz in this case) called the “threshold of hearing,” which is approximately 2/10,000ths of one millionth of an atmosphere. It’s also called 0 dB SPL.

Let’s take a look at what would happen if we aligned the noise floors of analog and digital signals to coincide with that threshold.

The noise floor for analog signals is about –100 dBu (actually it may be easily 10 dB higher if we factor in the equivalent noise floor of a good, very quiet microphone!). This would mean that 0 dBu would yield a level of 100 dB SPL. Audio systems with power supplies of ±24V can reproduce peak levels of up to +27 dBu, which means they could reproduce a source signal of 127 dB SPL. Not too shabby.

If we fix the Least Significant Bit (dithered of course) of a digital signal to that same 0 dB SPL/-100 dBu level, a 16-bit system will run out of bits at -4 dBu and 96 dB SPL. A 20-bit system will make it all the way to 120 dB SPL and +20 dBu, while a 24-bit system will make it to 144 dB SPL (4 dB above the Threshold of PAIN!!!) and +44 dBu (about 123 volts RMS!).

Assuming loudspeakers with a sensitivity of 1 Watt yielding 90 dB SPL, the 16-bit signal will only call for 6 dBW (4 Watts) for playback, but the 20-bit signal will require 30 dBW (1 kiloWatt) and the 24-bit signal will require 54 dBW (250,000 Watts)!
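The arithmetic behind those figures is easy to lay out (a sketch under the same assumptions as above: the dithered LSB pinned to 0 dB SPL, roughly 6 dB of range per bit, and a speaker producing 90 dB SPL from 1 watt):

```python
NOISE_FLOOR_SPL = 0.0        # threshold of hearing, where we pinned the LSB
SPEAKER_SENSITIVITY = 90.0   # dB SPL produced by 1 watt

def max_spl(bits):
    """Peak SPL available when the LSB sits at the 0 dB SPL noise floor."""
    return NOISE_FLOOR_SPL + 6.0 * bits   # roughly 6 dB of range per bit

def watts_needed(bits):
    """Amplifier power required to reach that peak through our speaker."""
    return 10.0 ** ((max_spl(bits) - SPEAKER_SENSITIVITY) / 10.0)

for bits in (16, 20, 24):
    print(f"{bits}-bit: {max_spl(bits):.0f} dB SPL, {watts_needed(bits):,.0f} W")
# 16-bit:  96 dB SPL, about 4 watts
# 20-bit: 120 dB SPL, about 1,000 watts
# 24-bit: 144 dB SPL, about 250,000 watts
```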

At present we don’t tend to align the relationships between these levels formally. In many respects we leave them badly misaligned, for some practical reasons.

The closest we come to a formal relationship between acoustical and electrical levels is occasionally to try and calibrate our speakers so that an RMS noise signal at some “nominal” production level (usually thought of as “0 VU”) yields 85 dB SPL (this is a film industry standard).

However, we haven’t really fixed a relationship between that “nominal” production level and 0 dBFS (the maximum level yielded by the Most Significant Bit). Relationships vary between 0 VU = -20 dBFS and 0 VU = -6 dBFS! What we have done is to align 0 dBFS with the onset of analog clipping (or, more elegantly put, the limits of the power supply), so that 0 dBFS, when converted to analog, is at the maximum level the power supply of any given piece of gear will attain without distortion.

When we consider the implications of this, it becomes obvious that we throw away much of the goodness of high resolution. Assume that 85 dB SPL is aligned to -12 dBFS (which is approximately +15 dBu with a 48 volt power supply). That means the maximum system level (for one speaker) is 97 dB SPL, regardless of how many bits we’re using. Pretty shabby!

Meanwhile, the Least Significant Bit of 16-bit signals correlates well with our threshold of hearing at 1 dB SPL, while the LSB of 20-bit signals comes in at –23 dB SPL and the LSB of 24-bit signals comes in at –47 dB SPL. Both of those are way below our threshold of hearing, probably to the point of silliness. As I said, misaligned.

The elusiveness of amplitude and loudness measurements

In reality, Zork, we can’t ever even measure the amplitude of a sound, much less its loudness. All we can do is approximate it by some sort of averaging over time.

Pressure, or amplitude, is a static physical value like pounds per square inch or dynes per square centimeter, and it simply exists at one or many points in time. Sound comes from changing the pressure or amplitude over time. Whatever measurement we make will either be inaccurate in the short term (if we make it accurate for the longer term) or vice versa.

So when we bandy about these various levels like dB SPL or dBFS, remember that they are time-based approximations. Some of our approximations (like VU meters and their kin) are fairly long-term, while others (peak-reading meters, for instance) are short-term. The same sound will have different measured levels depending on the temporal behavior of the measuring system. Which is the “correct” level? None of them!
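For example (a toy sketch of my own), feed the very same sine wave to a peak detector and to an RMS averager and you already get two different "levels," about 3 dB apart:

```python
import math

# One full cycle of a sine wave, sampled:
N = 1000
samples = [math.sin(2 * math.pi * i / N) for i in range(N)]

peak = max(abs(s) for s in samples)               # what a peak meter sees
rms = math.sqrt(sum(s * s for s in samples) / N)  # what an averaging meter sees

# Same signal, two "correct" levels, differing by about 3 dB:
difference_db = 20 * math.log10(peak / rms)
assert 2.9 < difference_db < 3.1
```

And for signals with a higher crest factor than a sine wave, the two meters disagree by much more.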

Meanwhile, loudness is really subjective. It can’t be calibrated with any precision, Zork, and when we make the statement that a doubling of loudness is equal to ten times the power, well, that’s a gross oversimplification; we only sort of know what we’re talking about.

Loudness for humans

The subjective quality “loudness” has emotional “meaning” for us. It relates to issues of the emotional intensity of sound, as well as its nearness to us. Variations in loudness (dynamics) carry much, perhaps most, of the ebb and flow of emotional intensity in musical performance. While melody, harmony, tempo, and timbre may tell us much about the emotional quality of the music, loudness and dynamics suggest the “intensity” or “magnitude” of the emotional feeling at any given moment.

Recorded acoustic music seems to vary over about a 40–50 dB range (perhaps a 30:1 range of loudness), while a symphony orchestra might make it to 60 dB. Pop electric music has a much narrower range (15 dB?) and seems to focus on the LOUD end of things. Emotionally, pop music focuses on INTENSE! It’s a bit limited in that regard.

If you want to make music come out loudspeakers, Zork-11, loudness will be one of your primary tools. To manage it well, you’ve got to come to grips with amplitude in all of its forms—including time and spectrum, as well as human expectation.

Happy crescendos!

Dave Moulton is alive and well in Groton, MA, trying to determine the loudness of one hand clapping. Complain to him about anything at



Copyright 2014 Music Maker Online LLC