Here are some files for research purposes. Please be aware that in some form or other, this is all copyrighted material.
I’ve long had a problem with pan-potted stereo. This is the common way of mixing popular styles of music, in which an image (usually stereo) is constructed out of multiple tracks of audio. The left/right intensity of each signal is adjusted to spread the instruments across a soundstage so that the total result is musically appropriate. This is all well-known and is pretty much instinctive for a mixer.
The problem is that this sort of mix only works for a relatively small ‘sweet spot’. If the listener moves very far off the center line between speakers, the image will collapse in the direction of the closer speaker. This is the Haas Effect (or law of first arrival, or precedence effect). The relative delay between sources has a greater perceptual effect than the relative level. There’s really only one solution and that’s adding some delay to the more distant speaker. This comes with all sorts of problems, most of which become painfully obvious when the mix is heard in mono.
There’ve been a number of attempts to make panners that give a more stable image when the listener is off-center. All that I’m aware of accomplish this by making a small room and then placing the signal in this room. Under the hood, there are a number of FIRs and a convolution process that moves the source around in this virtual room. The problem is that this ‘room’ is pretty obvious and rarely is appropriate for the material being mixed. Is there some way to introduce a ΔT without being obvious? Furthermore, can this be done in a way that feels natural to a mixer and fits easily into existing workflow?
I set up some rules for myself:
The signal must always sound ‘dry’, with no added sense of room.
The strongest part of the signal must always be undelayed, so that the image does not appear to move forward and back, relative to other things in the mix.
The source must be rapidly-pannable, with no obvious artifacts.
The result must be reasonably mono-compatible.
From the mix position, there must be little or no obvious difference from the sound of a traditional panner.
There should be no perceivable ‘slap’ effect (as the result of too much delay).
The image must be more robust when the listener moves off the center line.
I also used my experience as a classical recording engineer for a few ideas. There are lots of ways to hang microphones and they all have an effect on the robustness of an image. In truth there are combined techniques that give the best of all worlds, but let’s simplify to techniques that use only two mics. Here they are:
Coincident techniques - X/Y, M/S, Blumlein. Conceptually all mic capsules are at a single point. There’s no ΔT between channels, so there’s not much of value here.
ORTF (you’ll know this one, François). - A pair of cardioids at an angle of 110 degrees and separated by 17cm. Enough time information to create a more robust image, but still pretty mono-compatible. If I have only 2 mics, this is the technique I use.
Decca tree - There are actually 3 mics here, but the important thing is that the widest spacing is 2 meters. This gives a real sense of width and is very robust.
Wider omni mic spacing - Even more width, but real problems if used without other mics.
One of the things I discovered is that you can’t really match the geometry of these mics and an ensemble. And of course these placements all introduce a room sound into the recording. But they’re still instructive.
Two parts of the panner - Intensity and delay.
The intensity part of the panner is identical to the most common panner and uses a sin/cosine law:
Where 0° <= ϴ <= 90° (0° is hard left, 90° is hard right)
Gain L = cos(ϴ)
Gain R = sin(ϴ)
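A minimal Python sketch of the sin/cosine law above may help. The function name is illustrative and not taken from the plugin. It also shows the usual reason this law is so common: the summed power of the two gains stays constant across the pan range.

```python
import math

def intensity_gains(theta_deg):
    """Sine/cosine pan law: theta_deg runs from 0 (hard left) to 90 (hard right)."""
    if not 0.0 <= theta_deg <= 90.0:
        raise ValueError("theta_deg must lie in [0, 90]")
    theta = math.radians(theta_deg)
    return math.cos(theta), math.sin(theta)  # (Gain L, Gain R)

for theta_deg in (0.0, 22.5, 45.0, 67.5, 90.0):
    gain_l, gain_r = intensity_gains(theta_deg)
    power = gain_l ** 2 + gain_r ** 2  # cos^2 + sin^2, so constant power
    print(f"{theta_deg:5.1f} deg  L={gain_l:.3f}  R={gain_r:.3f}  power={power:.3f}")
```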
The delay part of the panner is shown by this diagram:
In other words, if the pan position is on the extreme left, right delay is at a maximum (but the gain is at 0). This matches what happens in an intensity pan. If pan position is in the center, then the delay of both channels is 0. As pan moves off-center, delay in the weak channel is introduced.
The max delay value is variable and depends on the inter-mic distance specified in the plugin.
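The delay rule just described can be sketched in a few lines of Python. Two things here are assumptions rather than facts about the plugin: the maximum delay is taken to be the time sound needs to travel the 'mic' distance (distance divided by an assumed 343 m/s), and the pan position is normalized to -1 (hard left) through +1 (hard right). The linear mapping from pan position to delay is the simplest reading of the description. All names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 °C

def max_delay_seconds(mic_distance_m):
    """Assumed mapping: time for sound to travel the inter-mic distance."""
    return mic_distance_m / SPEED_OF_SOUND_M_S

def channel_delays(pan, mic_distance_m):
    """Return (delay_left, delay_right) in seconds.

    pan is -1.0 for hard left, 0.0 for center, +1.0 for hard right.
    Only the weaker channel is delayed; the stronger one stays at 0.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must lie in [-1, 1]")
    weak_delay = abs(pan) * max_delay_seconds(mic_distance_m)
    if pan < 0.0:
        return 0.0, weak_delay  # panned left: right channel is the weak one
    if pan > 0.0:
        return weak_delay, 0.0  # panned right: left channel is the weak one
    return 0.0, 0.0             # center: no delay on either channel

for mic_distance_m in (0.0, 0.17, 1.0, 2.0):
    delay_l, delay_r = channel_delays(-0.5, mic_distance_m)
    print(f"{mic_distance_m:4.2f} m  L={delay_l * 1000:.3f} ms  R={delay_r * 1000:.3f} ms")
```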
How does it meet our criteria from above?
Signal sounds dry. Yes.
Strongest part of the signal is undelayed. Yes.
Source must be rapidly pannable. There’s a compromise here. The intensity part of the pan is very quick, since we’re only gliding coefficients. The delay part runs a little behind. Depending on how far we pan from previous position, it may take a couple of seconds to catch up. We can’t use a crossfading delay, since there’s obvious comb filtering as the crossfade happens. So we use a gliding (interpolating) tap. There can be a very drastic Doppler effect—much larger than we can stand—if the delay changes rapidly. If we keep it to a couple of cents, it’s pretty hard to notice. I’m not sure there’s any other way to solve this. In practice I don’t know that it’s a problem. Most things in a mix aren’t dynamically panned, so a delay doesn’t matter. If there is dynamic panning, then the mix is likely to be so busy it’s hard for us to notice a little cheat.
Mono-compatibility. Reasonable. When we’re near the center line, the gain is stronger, but the delay is shorter. This pushes cancellation/reinforcement effects to higher frequencies (out of the way). As we move off the center line, delay increases, but the gain of the delayed tap goes down. There’s a mono switch on the plugin so you can play with it.
From the mix position, it still sounds like a traditional panner. I think so.
No perceivable ‘slap’ effect. None with the maximum delays I allow in the plug. The image still fuses. It doesn’t take much more delay before it starts to pull apart. This is probably acceptable in other applications, as in the input to a reverb. But for dry panning, this is about as far as it can go.
Robust image when the listener is off the center line. Not dramatic, but still noticeable. It largely depends on speaker separation and the position of the listener relative to those speakers. If separation is low, the effect is small because the delay time between speakers is low. As that separation increases, the effect of a more stable image becomes apparent.
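The trade-off under the rapid-panning criterion above can be put into numbers. This is a sketch under a simple model, not the plugin's actual glider: reading a delay line while its length changes at a rate r (seconds of delay per second) plays the signal back at a ratio of 1 - r, so capping the pitch shift in cents caps how fast the delay may move. The function names and the 343 m/s used for the example delay are assumptions.

```python
def max_slew(cents_limit):
    """Largest |d(delay)/dt| (seconds per second) that keeps the pitch
    shift within cents_limit in both directions (delay growing or shrinking)."""
    return 1.0 - 2.0 ** (-cents_limit / 1200.0)

def glide_time(delay_change_s, cents_limit):
    """Time needed to move the delay tap by delay_change_s at the capped rate."""
    return abs(delay_change_s) / max_slew(cents_limit)

def glide_step(current_delay_s, target_delay_s, cents_limit, sample_rate_hz):
    """Advance the delay by one sample toward the target without exceeding the cap."""
    step = max_slew(cents_limit) / sample_rate_hz
    diff = target_delay_s - current_delay_s
    return current_delay_s + max(-step, min(step, diff))

# Example: the delay has to move by the travel time over 1 m (assumed 343 m/s).
delay_change_s = 1.0 / 343.0
for cents in (1.0, 2.0, 5.0):
    print(f"{cents:.0f} cents limit: {glide_time(delay_change_s, cents):.2f} s to catch up")
```

With a limit of a couple of cents, the catch-up time for a large pan move comes out on the order of a couple of seconds, which agrees with the behavior described.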
I’ve set up a few examples with the pans in the positions shown in the diagram. The physical positions of the listener should be roughly as shown, although this will depend on the listener and the distance between the speakers. In my room, the speakers are 2.2 meters apart. You’re probably not going to hear much if the speakers are closer than a meter or so. The larger distances are required for the Haas effect to take precedence over intensity differences.
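To get a feel for why speaker separation matters, here is a rough geometric sketch in Python. It assumes point-source speakers, a listener at a fixed distance from the line between them (2 m here, a made-up figure), and a speed of sound of 343 m/s. It computes how much earlier the nearer speaker's sound arrives as the listener moves sideways.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed

def arrival_difference_ms(listener_offset_m, speaker_separation_m, listener_distance_m):
    """Arrival-time difference between the two speakers, in milliseconds.

    listener_offset_m is the sideways shift from the center line (positive = right).
    A positive result means the right speaker's sound arrives first.
    """
    half = speaker_separation_m / 2.0
    path_left = math.hypot(listener_offset_m + half, listener_distance_m)
    path_right = math.hypot(listener_offset_m - half, listener_distance_m)
    return (path_left - path_right) / SPEED_OF_SOUND_M_S * 1000.0

LISTENER_DISTANCE_M = 2.0  # hypothetical listening distance
for separation_m in (1.0, 2.2):
    for offset_m in (0.0, 0.25, 0.5, 1.0):
        diff_ms = arrival_difference_ms(offset_m, separation_m, LISTENER_DISTANCE_M)
        print(f"separation {separation_m:.1f} m, offset {offset_m:.2f} m: {diff_ms:.2f} ms")
```

For the same sideways move, the wider pair produces the larger arrival-time difference, which is the delay the panner is trying to counter.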
I’ve set up examples with ‘mic’ distances of 0 (same as a regular intensity pan), 17cm (ORTF), 1 meter and 2 meters. The source material is panned midway between left and center and then midway between center and right.
Distance 0 cm
The Haas effect should be easily noticeable.
Width 17 cm
The Haas effect is still present, but less strong. To my ear, the image is a little wider. I also notice that the image appears panned a little farther from the center, even though intensity is the same. This is clearly related to the delay. I’ll address this in a couple of summary examples.
Width 100 cm
In my room, this is the mic width at which the image remains on the proper side. The exaggerated distance from center still exists, but appears to be no worse than at 17 cm.
Width 200 cm
This might be the widest practical distance, although the plugin allows another meter. Positioning is better than at 100 cm.
Non-linear pan at 100 cm
In all the examples above, the delay amount was linear, relative to pan position. This seems to have moved the apparent position a little too far. Here’s an example with a slight exponent to the delay length and I think it’s more accurate.
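The difference between the linear and the non-linear delay curve can be sketched as follows. The actual exponent is not stated, so the 1.5 used here is purely a placeholder, and the sketch assumes the exponent is applied to the normalized distance from center (0 to 1). An exponent above 1 shortens the delay at intermediate positions while leaving the hard-left and hard-right values alone. The 343 m/s figure is also an assumption.

```python
def weak_channel_delay(pan, max_delay_s, exponent=1.0):
    """Delay for the weaker channel.

    pan: -1.0 (hard left) to +1.0 (hard right).
    exponent: 1.0 gives the linear law; values above 1.0 reduce the delay
    at intermediate positions (hypothetical curve shape).
    """
    offset = abs(pan)
    if offset > 1.0:
        raise ValueError("pan must lie in [-1, 1]")
    return max_delay_s * offset ** exponent

MAX_DELAY_S = 1.0 / 343.0  # 100 cm 'mic' distance at an assumed 343 m/s
for pan in (0.0, 0.25, 0.5, 0.75, 1.0):
    linear_ms = weak_channel_delay(pan, MAX_DELAY_S) * 1000.0
    curved_ms = weak_channel_delay(pan, MAX_DELAY_S, exponent=1.5) * 1000.0
    print(f"pan {pan:.2f}: linear {linear_ms:.3f} ms, exponent 1.5 {curved_ms:.3f} ms")
```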
Here’s a slow pan at a mic distance of 200 cm, then the same pan in mono. Cancellation effects are pretty noticeable, but I’ve heard a lot worse. For each halving of mic distance, the effect moves up about an octave.
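The cancellation heard in mono is a comb filter: the strong, undelayed channel summed with the weaker, delayed one. A small Python sketch of that sum follows, assuming the sin/cosine gains, a linear delay law, and a maximum delay equal to the 'mic' distance divided by 343 m/s. None of those three is confirmed as the plugin's exact behavior, and the names are illustrative.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed

def mono_sum_comb(gain_strong, gain_weak, delay_s):
    """Return (first_notch_hz, ripple_db) for strong + delayed weak signal.

    The first notch sits where the delay equals half a period; the ripple
    is the level difference between comb peaks and troughs.
    """
    if delay_s <= 0.0:
        return None, 0.0  # no delay, so no comb filtering
    first_notch_hz = 1.0 / (2.0 * delay_s)
    peak = gain_strong + gain_weak
    trough = abs(gain_strong - gain_weak)
    ripple_db = math.inf if trough == 0.0 else 20.0 * math.log10(peak / trough)
    return first_notch_hz, ripple_db

def panned_right(theta_deg, mic_distance_m):
    """Gains and weak-channel delay for a source between center (45) and hard right (90)."""
    theta = math.radians(theta_deg)
    gain_l, gain_r = math.cos(theta), math.sin(theta)
    delay_s = (theta_deg - 45.0) / 45.0 * mic_distance_m / SPEED_OF_SOUND_M_S
    return gain_r, gain_l, delay_s  # (strong gain, weak gain, delay on weak channel)

for mic_distance_m in (2.0, 1.0, 0.5):
    strong, weak, delay_s = panned_right(67.5, mic_distance_m)
    notch_hz, ripple_db = mono_sum_comb(strong, weak, delay_s)
    print(f"{mic_distance_m:.1f} m: first notch {notch_hz:.1f} Hz, ripple {ripple_db:.1f} dB")
```

Under these assumptions, halving the mic distance halves the delay and doubles the first notch frequency, which is the octave shift noted above. The ripple depends only on the two gains, so it shrinks as the pan moves toward either extreme.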
Play with the Plugin
Download here. Plugin supports Stereo, LCR, Quad, 5.0, 5.1, 7.0 and 7.1. It’s currently AAX-only, so you’ll need to run it under Pro Tools. You’ll also have to install by hand. It goes in /Library/Application Support/Avid/Audio/Plug-Ins. You may have to give yourself permission to write to that folder. For tests, you should create a mono track and then map the output to stereo. Pop some material in and launch the plugin. Make sure that Microphone Width is non-zero (at 0 it’s just an intensity panner). Pan Speed controls how fast the delay glider goes. Mono Sum can be turned on so you can easily hear any compatibility issues.
This plug will also work in several surround formats. In that case, you’ll see an additional Front/Back pan knob. The surround stuff needs some more tuning and I haven’t done anything with immersive formats yet.
Immersive vs. Stereo vs. Binaural
I’m interested in an enveloping experience that ideally works in both speakers and headphones/earbuds. For now, I’m going to use an excerpt from one of my own pieces. This is the 4th movement (of 8) from Variations. It is mixed in Atmos, so the real version requires a room with both surround and height speakers. The stereo mixes here are generated from the Atmos renderer. There is no way to tune the binaural output, so it’s generic and its effectiveness will vary greatly among individuals. If it works perfectly, you should hear music around and overhead.
These are “high-quality” mp3s (320 kbps). Please give them a listen on speakers and with earbuds. I’m very interested in your observations. I’ll share my own after the links.
On speakers (stereo) the stereo downmix is very good. It’s wide and stable. Considering this is an automatic downmix, the balance is really very good. I could improve on it manually, but it’s close enough that I wouldn’t bother. The binaural mix is good on speakers, but it still sounds a little peculiar. Levels aren’t quite right and there’s something odd about it. Could be some crosstalk cancellation or something. It’s perhaps a little more open, but on speakers I’ll stick with the stereo.
On headphones, the stereo downmix also works well. It’s evenly spread left to right. The binaural mix is interesting. I don’t get any height sensation at all. I do hear a more open stereo effect and occasionally get some bits that are a little in the rear. The effect is most pronounced on reverb tails. I think that I do prefer the binaural mix on phones. Note: Dave Griesinger did a little work in this area 30 years ago. I never had the slightest sense of surround from his work. He told me (with a rather disparaging tone) it’s because I had a big skull and small pinnae. The technical details of my head shape are correct and that’s probably why I’ve rarely heard the binaural effect others rave about.
New test with FB360 Ambisonics
As promised, I’ve taken the 12-channel speaker feed from my Atmos mix and run it through a 3rd-order Ambisonics process. This used Pro Tools ambisonic busses and encoded them through the FB360 plugins installed with Pro Tools. Other than the busses (which carry B-format), this is not native to Pro Tools. The FB360 plugins are actually provided by Facebook, but they appear to be standard for many ambisonic workflows. I do block their ability to ‘phone home’, since I don’t trust FB with any information. There are very helpful templates in Pro Tools to help get started.
One of the outputs of the process is a stereo binaural downmixer. This is what I used, since I’m not particularly interested in the speaker output. I first experimented with a noise source, panning it through a 3D space. As with most binaural processes, I get very little sense of height and only a little sense of signal in the rear. I don’t know that it’s a flaw in the system: more likely simply that my head shape does not match the generic HRTF used for binaural conversion.
Since I was using the 7.1.4 output as source, I set up the soundfield to match my speaker configuration as closely as possible. Virtual distance to each was 2 meters and the 4 height speakers were at an elevation of 67 degrees. I know that François has concerns that I don’t do a full remix in ambisonics. But it’s about 100 tracks and I’m not going there. I think that this experiment tells me what I need to know about making a stereo equivalent that’s appropriate for headphones and—ideally—speakers. So here ‘tis.
On speakers, the stereo downmix works very well. I still don’t get any height sensation, but the image is wide and in many ways sounds better than the Dolby downmix. It may be that my ears are tired, but it sounds a little crushed. There could be some internal limiting or there may be some minor spectral changes from FIRs that the FB plugins add in the virtual room. They say—without detail—that they add some reflections but ‘you won’t hear them’. That of course makes me a little uneasy, but this still works very well. It could also be because of the filtering that’s inherent to ambisonics. I don’t understand the process fully, but in my reading it appears that the nature of the channel extraction filter must change when the wavelength is shorter than the diameter of the human head. This is in the range of a kilohertz and is a very sensitive part of our hearing.
On headphones, the downmix works well but I don’t find it to be superior or inferior to the Dolby process. It’s just a little different.