What Is Perfect Sound and How Can I Hear It?

October 20, 2021

In the audio industry, nearly every company claims to deliver the best sound, but rarely, if ever, is there a clear explanation of what that means. This raises the question: how does one define “perfect sound”?

This article will outline objective standards for what perfect sound means and, most importantly, how it will impact both your work as a music producer and your enjoyment as a listener.

While some of the concepts detailed below may seem quite complicated, it is important to note that there are software products like Neptune that do the heavy lifting for you.

Table of Contents
Perfect Sound in a Perfect World
Industry Standards for Professional Audio
1. The most important factor is frequency response.
The problem with speakers
Headphones can be corrected more accurately
2. Crossfeed is an essential part of human hearing
Other factors that might be affecting your audio system
1. Distortion Characteristics
2. Phase and Time Response
Benefits of a Perfect Audio System
1. Perfect Mix Translation
2. Accurate Listening
3. Easy Collaboration
Conclusion

Perfect Sound in a Perfect World

The simplest definition of a perfect audio system is a system that accurately reproduces the source audio.

Ideally, the audio waveform from your computer should be perfectly delivered by your speakers or headphones to your ears.

Given the nature of stereo audio and modern studio design, our target sound system will be slightly different from the “perfect” system described above.

This brings us to the commonly accepted industry standards for professional audio systems.

Industry Standards for Professional Audio

As it turns out, there are industry standards that have been outlined by the EBU (European Broadcasting Union). These standards are used by most acousticians as a guide for creating professional studios.

Most of the commercial audio you’ve heard, whether that be a popular song, sound effects in a video game, or a movie score, was either created, mixed, or mastered (or a combination of the three) in a studio with these standards.

So, in order to hear the audio as the artist, producer or engineers heard it, we need to recreate the sound that would be heard in that setting.

If you are interested in the details, I recommend you read through the standards laid out by the EBU, but I will go through some of the key defining features of the “ideal listening setup” here.

1. The most important factor is frequency response.

The single most important factor for a high-quality sound system is a flat frequency response. A flat frequency response implies that the amplitudes (or loudness) of all frequencies will perfectly match the source audio.

This can be measured using a calibrated omnidirectional microphone that listens to the system’s output as it plays a sine wave sweep and plots each frequency’s relative amplitude. Ideally, the graphical representation of this would be a perfectly flat line extending from the lowest audible frequency (20Hz) to the highest (20kHz).
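To make "flat" concrete, here is a minimal sketch of how a set of sweep readings could be summarized as a single deviation figure. The function name and the readings are hypothetical, and using the mean level as the reference is my own simplifying assumption rather than anything prescribed by the EBU.

```python
def deviation_from_flat(levels_db):
    """Return (reference level, largest deviation from it) for dB readings.

    Assumption: the reference for "flat" is the mean of the readings.
    """
    reference = sum(levels_db) / len(levels_db)
    worst = max(abs(level - reference) for level in levels_db)
    return reference, worst


# Hypothetical readings (frequency in Hz -> level in dB SPL), not real data.
measured = {20: 78.0, 100: 82.0, 1000: 80.0, 10000: 79.0, 20000: 81.0}
reference, worst = deviation_from_flat(list(measured.values()))
print(f"reference {reference:.1f} dB, within +/-{worst:.1f} dB of flat")
```

A perfectly flat system would report a deviation of zero; anything else is coloration that your ears (and your mix decisions) will follow.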

So why does having a flat frequency response matter?

Having a flat frequency response is important primarily because it gives your mix the highest chance of translating to other systems.

For example, if you're mixing in a studio with speakers that don't produce a lot of bass, you will likely overcompensate by adding too much bass to your mix. When your now bass-heavy song is played in a bass-heavy environment, the bass will sound overpowering. The reverse is also true. If you're mixing in a studio with too much bass, you will likely turn it down too much. In this scenario, your mix will sound thin on systems with good bass response and very thin on systems that don't produce much bass.
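The arithmetic behind this example can be sketched in a few lines. This assumes, purely for illustration, that the engineer compensates exactly for the monitoring error; the function name and the dB figures are hypothetical.

```python
def perceived_bass_db(monitor_offset_db, playback_offset_db):
    """Bass level heard on playback, relative to what was intended.

    Assumption: the engineer adjusts the mix by exactly the opposite of the
    monitoring system's bass error, so the mix sounds right in the studio.
    """
    mix_adjustment_db = -monitor_offset_db
    return mix_adjustment_db + playback_offset_db


# Bass-shy monitors (-6 dB), then played on a bass-heavy system (+4 dB).
print(perceived_bass_db(-6.0, 4.0))
# Bass-heavy monitors (+6 dB), then played on a bass-shy system (-4 dB).
print(perceived_bass_db(6.0, -4.0))
# Flat monitors: only the playback system's own coloration remains.
print(perceived_bass_db(0.0, 4.0))
```

The errors add up when the studio and the playback system are wrong in opposite directions, which is why a neutral reference gives the smallest worst-case error.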

Although the example above specifically references bass response, this same effect will manifest at all frequencies across the audible range.

A perfectly flat frequency response is nearly impossible to achieve with speakers. The reason for this is that a room’s reverberant qualities inevitably influence the sound that travels directly from the speakers to your ears. Although it is possible to dampen or attenuate a room’s reverberance, doing so in a cost-effective way can be challenging, to say the least.

The problem with speakers

According to the EBU specifications, the frequency response target is +/-3dB from flat when measured with 1/3-octave smoothing. It is important to note that a system can fit this criterion and still have a raw (unsmoothed) response that deviates dramatically from this range.

It is equally important to realize that when you are listening to your speakers, you are hearing the raw response and not the averaged response.
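To see how smoothing can hide what you actually hear, here is a rough sketch of fractional-octave smoothing applied to a synthetic response. The function name is hypothetical, the ripple data is invented for illustration, and averaging dB values directly is a simplification (measurement tools typically average power).

```python
def smooth_fractional_octave(freqs_hz, levels_db, fraction=3):
    """Average each point with its neighbours inside a 1/fraction-octave window.

    Simplification: dB values are averaged directly rather than as power.
    """
    edge_ratio = 2 ** (1 / (2 * fraction))  # half the window on each side
    smoothed = []
    for centre in freqs_hz:
        low, high = centre / edge_ratio, centre * edge_ratio
        window = [db for f, db in zip(freqs_hz, levels_db) if low <= f <= high]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Synthetic response: 24 points per octave from 20Hz, alternating +/-5dB.
freqs = [20.0 * 2 ** (i / 24) for i in range(240)]
raw = [5.0 if i % 2 == 0 else -5.0 for i in range(240)]
smoothed = smooth_fractional_octave(freqs, raw)

raw_peak = max(abs(v) for v in raw)            # 5dB by construction
smoothed_peak = max(abs(v) for v in smoothed)  # far smaller after averaging
print(f"raw +/-{raw_peak:.1f} dB, smoothed +/-{smoothed_peak:.1f} dB")
```

In this contrived case the smoothed curve passes a +/-3dB criterion comfortably even though every raw point sits 5dB away from flat.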

For example, here is a snapshot of the frequency response of my current setup, both smoothed and unsmoothed.

As you can see, the smoothed graph falls well within the recommended range as defined by the EBU. The raw response, however, deviates on average +/-5dB from flat.

At the time of writing this article, I have not seen a better raw frequency response measurement. In any case, this system performs exceptionally well for a speaker system.

Therefore, a raw response of +/-5dB from flat is probably the best you can ever hope to achieve with speakers. This response is only possible because of a combination of high-quality room treatment, large room volume, and most importantly, digital signal processing correction. For reference, here is the response before DSP.

To reiterate the main point, it is basically impossible for even the highest quality speaker systems to measure perfectly due to the effects of reverb. However, now that we have an understanding of the standards and limitations of the frequency response for speakers, we can focus on alternative ways to increase listening clarity through headphones.

Headphones can be corrected more accurately

One major advantage of headphones is that they propagate sound directly to your ears without any external influence from the room. By switching to headphones from speakers, we have already eliminated essentially all of the problems associated with reverberation.

All that is left to do is make sure that what is arriving at the eardrums of the listener is correct. We will discuss the headphone measurement process in a future article, but long story short, we are looking to normalize the response so that the headphones deliver a flat frequency response.
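As a rough illustration of what "normalizing" a response means, the sketch below computes the gain needed at each measured frequency to reach a flat target. The function name, the measurement values, and the boost limit are all hypothetical; this is not a description of how Neptune or any particular product computes its correction.

```python
def correction_gains_db(measured_db, target_db=0.0, max_boost_db=6.0):
    """Gain per frequency that moves a measured response onto the target.

    Assumption: boosts are capped at max_boost_db to protect headroom;
    cuts are applied in full.
    """
    gains = {}
    for freq_hz, level_db in measured_db.items():
        gain = target_db - level_db
        gains[freq_hz] = min(gain, max_boost_db)
    return gains


# Hypothetical headphone measurement (Hz -> dB relative to target).
measured = {50: -9.0, 200: 1.5, 3000: 4.0, 10000: -2.0}
gains = correction_gains_db(measured)
for freq_hz, gain_db in gains.items():
    print(f"{freq_hz} Hz: {gain_db:+.1f} dB")
```

Capping the boost is a design choice: a deep bass roll-off that would need a very large boost is often better left partly uncorrected than pushed into distortion.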

Here is an example of what this looks like graphically for Apple EarPods 3.5mm.

What you are seeing here is, quite literally, a perfect frequency response (albeit with some HPF below 45Hz). The difference between the headphone’s response above and the speaker’s response from earlier is that the raw response of the EarPods is significantly flatter. As we know from our discussion earlier, the discrepancy is due to the absence or presence of reverb.

At this point, it should be clear that achieving a perfect frequency response with speakers is not a realistic expectation, but it is possible with in-ear headphones.

Now that we have addressed the frequency response problem, let’s look at the fundamental difference between stereo presentation on speakers compared to headphones.

2. Crossfeed is an essential part of human hearing

The largest fundamental difference between speaker and headphone listening is the presence or absence of crossfeed. Crossfeed is a phenomenon that occurs naturally when listening on speakers, whereby sound from each speaker travels to both ears, with sound reaching the opposite ear after a slight time delay and only at mid and low frequencies.

To reference an important point made earlier, most professionally produced audio is created in control rooms with speakers. Therefore, if we want to hear what the artist, producer, and engineer heard on speakers, we need to replicate the effect of crossfeed on headphones.

According to the EBU, the recommended listening setup places the speakers in an equilateral triangle with the listener at a distance of 6-12ft. In order to accommodate both near and far-field styles of listening, our crossfeed uses a distance of 6ft. With some basic geometry along with the speed of sound constant, we can figure out the difference between how long it takes sound to travel to each respective ear.
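The geometry can be sketched as follows. This is a simplified model under my own assumptions: the speakers sit 30 degrees either side of straight ahead (which is what an equilateral triangle implies), the ears are points 8.75cm either side of the head's center, sound travels in a straight line to each ear (ignoring the path around the head), and the speed of sound is 343 m/s. The head radius and function name are illustrative values, not figures from the EBU.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C
FEET_TO_METRES = 0.3048


def interaural_delay_s(distance_ft=6.0, head_radius_m=0.0875):
    """Extra travel time from the left speaker to the far (right) ear.

    Assumptions: equilateral triangle layout, point ears, straight-line paths.
    """
    d = distance_ft * FEET_TO_METRES
    angle = math.radians(30.0)
    # Listener's head center at the origin, facing +y; left speaker at -x.
    speaker_x, speaker_y = -d * math.sin(angle), d * math.cos(angle)
    near = math.hypot(speaker_x + head_radius_m, speaker_y)  # to left ear
    far = math.hypot(speaker_x - head_radius_m, speaker_y)   # to right ear
    return (far - near) / SPEED_OF_SOUND_M_S


delay = interaural_delay_s()
print(f"{delay * 1000:.3f} ms")
```

With these assumptions the delay comes out at a fraction of a millisecond. A model that accounts for sound bending around the head would give a somewhat larger figure.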

The delayed signal also needs to be filtered in order to simulate the shadow effect of the head. The majority of research on this subject points towards an optimal cutoff frequency that lies anywhere from 800Hz to 1600Hz. From our beta testers we found that 1600Hz provides the most realistic simulation.
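Put together, a bare-bones crossfeed could look like the sketch below. It is only an illustration of the idea (delay, low-pass, mix into the opposite channel), not Neptune's actual implementation. The one-pole filter, the 0.25ms delay, and the crossfeed gain of 0.5 are all assumptions I have chosen for the example; only the 1600Hz cutoff comes from the text above.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """Very simple low-pass filter standing in for the head-shadow effect."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out


def crossfeed(left, right, sample_rate=48000, delay_s=0.00025,
              cutoff_hz=1600.0, gain=0.5):
    """Mix a delayed, low-passed copy of each channel into the other.

    Assumptions: delay_s and gain are placeholder values for illustration.
    """
    delay = round(delay_s * sample_rate)

    def shadowed(channel):
        filtered = one_pole_lowpass(channel, cutoff_hz, sample_rate)
        return ([0.0] * delay + filtered)[:len(channel)]

    to_right, to_left = shadowed(left), shadowed(right)
    out_left = [l + gain * x for l, x in zip(left, to_left)]
    out_right = [r + gain * x for r, x in zip(right, to_right)]
    return out_left, out_right


# A click hard-panned to the left: silence on the right before processing.
left = [1.0] + [0.0] * 63
right = [0.0] * 64
out_left, out_right = crossfeed(left, right)
```

After processing, the right channel stays silent for the length of the delay and then carries a quieter, duller copy of the click, which is the behavior the text describes for hard-panned material.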

The effects of a proper crossfeed implementation can be either subtle or pronounced depending on the input. For example, a song that contains a lot of stereo information (and especially hard-panned elements) will sound drastically different with crossfeed processing whereas a song that is mostly mono will not sound much different. The perceptible changes can be understood by the two graphics below.

Other factors that might be affecting your audio system

1. Distortion Characteristics

The next most important measurable characteristic of a sound system is its distortion characteristics. Distortion is defined as any additional sound that is produced that is not inherent to the audio itself.

Distortion can be categorized as either linear or nonlinear and further classified as either harmonic or inharmonic. Distortion has many causes, including speaker design, amplifier limitations, and digital-to-analog conversion. Acceptable levels of distortion, as defined by the EBU, are less than 3% for frequencies below 250Hz and less than 1% for frequencies from 250Hz up to 16kHz. Ideally, for perfect sound, there would be zero distortion. However, because speakers and headphones have physical limitations and DSP (as of now) cannot correct for distortion, it is in our best interest to choose a system with the lowest relative distortion levels.
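To give those percentages some context, here is a small sketch of total harmonic distortion (THD), one common way of expressing distortion as a percentage, along with the conversion to decibels. The harmonic amplitudes are made up for illustration, and THD captures only the harmonic part of the distortion described above.

```python
import math


def thd_percent(fundamental, harmonics):
    """THD as a percentage: RMS sum of harmonics relative to the fundamental."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental


def percent_to_db(percent):
    """Express a distortion percentage as a level relative to the signal."""
    return 20.0 * math.log10(percent / 100.0)


# Hypothetical amplitudes: fundamental of 1.0 with two small harmonics.
thd = thd_percent(1.0, [0.006, 0.008])
print(f"THD {thd:.2f}% = {percent_to_db(thd):.1f} dB")
print(f"1% = {percent_to_db(1.0):.1f} dB, 3% = {percent_to_db(3.0):.1f} dB")
```

Seen this way, a 1% limit means the distortion products sit about 40dB below the signal, and a 3% limit about 30dB below.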

As it turns out, this eliminates all speakers from the conversation. Because the physical demands placed on speakers are far higher than those placed on headphones, speakers tend to have higher levels of distortion. By the same logic, the physical demands on over-ear headphones are higher than those on in-ear headphones, so many well-designed in-ear headphones perform better than their over-ear counterparts.

This exact phenomenon can be observed with Apple EarPods which, according to measurements taken by RTINGS.com, have lower distortion levels than almost all of the other headphones that they have measured. As so many companies use low distortion levels as a primary justification for their sometimes outlandish headphone prices, Apple EarPods are, by comparison, exceptionally undervalued.

2. Phase and Time Response

Although the phase and time response of a system is the least important variable for perfect sound, it provides another example of why headphone listening is more accurate than speaker listening.

Most free field setups feature two loudspeakers each with a tweeter (producing high frequencies >2000Hz), one or two woofers (producing low-mid frequencies 40Hz - 2000Hz), and one or two subwoofers (producing sub frequencies <40Hz).

In the example of a three-way system with two subwoofers, there exist eight different sound propagating locations, with the contents of each arriving at the listening position at different times.

There is not much relevant research regarding how perceptible these time differences are, but for perfect sound reproduction, it is clear that we want all frequencies arriving at the eardrum at the same time. With headphones, most of this work is done for us because all frequencies are propagated by the same driver.
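For a sense of scale, the sketch below converts path-length differences into arrival-time offsets. The driver distances are invented for illustration, the speed of sound is assumed to be 343 m/s, and crossover and DSP delays (which also shift timing in real systems) are ignored.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C


def arrival_offsets_ms(distances_m):
    """Arrival time of each source relative to the earliest one, in ms."""
    times = {name: d / SPEED_OF_SOUND_M_S * 1000.0
             for name, d in distances_m.items()}
    earliest = min(times.values())
    return {name: t - earliest for name, t in times.items()}


# Hypothetical distances from each driver to the listening position.
distances = {"tweeter": 1.80, "woofer": 1.83, "subwoofer": 2.50}
offsets = arrival_offsets_ms(distances)
for name, offset in offsets.items():
    print(f"{name}: +{offset:.2f} ms")
```

Even a modest difference in placement, such as a subwoofer sitting well behind the main speakers, translates into an offset of a couple of milliseconds. With a single headphone driver per ear, that offset is zero by construction.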

Benefits of a Perfect Audio System

1. Perfect Mix Translation

In order to maximize your mix’s clarity and legibility on all devices, you need to work in a neutral mixing environment.

The reasoning for this is that if your mix sounds good with a flat frequency response, it will sound relatively good on a system that is colored (bass boosted club, bright car, etc).

2. Accurate Listening

A flat frequency response brings with it improved detail retrieval and dynamics and, when combined with proper crossfeed, enhanced spatial perception. When working in an appropriate environment, it is easier to choose the correct reverb or the right snare.

3. Easy Collaboration

When working with a remote team, it is incredibly important that all contributors are hearing the same thing. When your team is using software such as Neptune, you can be confident that you are not only hearing a perfect reproduction of the source audio, but that your team is hearing exactly what you are.

Conclusion

Hopefully, this article has shed some light on what variables are relevant in the pursuit of perfect sound reproduction. It is one thing to read about it, but another to experience it for yourself.

If you are interested in what perfect actually sounds like, try Neptune for 14 days free.
