ONLINE - ISSUE 9
Reprinted from Audiophile Audition
Alternatives to 5.1 for Multichannel Music
"Resolved: That the speaker-based 5.1-Channel ITU Standard is fine for movie surround sound but inappropriate for SSfM (Surround Sound for Music). A better alternative would be... what?"
(Intro from John Sunier: I requested a cross-section of opinions from experts in SSfM who participate in the generally highly technical sursounders' mailing list email@example.com about the controversy touched on by many of us relating to better approaches existing for the reproduction of multichannel music than the same 5.1 channel system of Dolby Digital and DTS used for home theater. Many of the members work with and support the more sophisticated Ambisonics multichannel format. I asked them for their ideas about possible alternatives and to explain in simple language. I thought these ideas would also appeal to independent-thinking Positive Feedback Online readers.)
First some definitions:
B-format Ambisonics Four audio channels provide all the information required to recreate a soundfield in three dimensions: one is "left minus right (Y component)," one "front minus back (X)," one "up minus down (Z)," all using closely spaced figure-eight mics. The fourth or (W) channel is the mono pickup from an omnidirectional mic which is the overall sound pressure at the same point. An Ambisonics decoder translates these signals to feed any number of loudspeakers from four up. Ambisonics is therefore not "speaker-based".
UHJ The compatible stereo matrixed mixdown of multichannel B-format Ambisonic recordings. The entire Nimbus catalog is UHJ, as are releases from some other UK labels.
Ambiophonics Attempts to place the listener into the same space as the performers on standard stereo recordings of stage music by accommodating to individual external ear and head characteristics, minimizing interaural correlation, abandoning the traditional stereo loudspeaker equilateral triangle, recreating early reflections and reverberant fields via DSP, eliminating all front-loudspeaker crosstalk, and reducing the home music theater wideband reverberation time to less than 0.2 seconds.
Binaural Recordings made with a human head replica with accurate outer ears and an omni mic in each ear. Designed for headphone-only playback, and when successful can give listeners a "you are there" soundfield impression beyond any speaker-based surround system.
TriField An "unmixing" algorithm which provides balanced left, center and right speaker feeds from any stereo recording.
I think the basic problem with 5.1 is that it favors one direction over others. It assumes that there is a "front" and "rear", and that you should be sitting looking at a stage or screen. That is far too limiting for contemporary music and audio. Ambisonics has the advantage that it treats all directions as equal in importance and needing equal sound quality. The number of channels is not the same as the number of speakers.
I believe B+ ambisonics plus a stereo channel is the delivery medium of choice. A single medium can be produced and this can be delivered to any number of speakers. A record manufacturer and studio only have to make one mix and one medium. This can then be played in stereo, 5.1 or ambisonically and even ambiophonically.
The minimum number of speakers is set by the need to place them no more than 60 degrees apart. With a spacing greater than 60 degrees we are not able to fuse phantom images between speakers, and we get a hole in the middle. I do not find it pleasant to hear the holes that 5.1 leaves in reproducing the soundfield. The greater the number of speakers, the better the fusion of the images, until we come to WFS (Wave Field Synthesis), which will use many speakers.
For the reproduction of music I find that the addition of height information helps to eliminate the hearing of speakers and reduces the confinement that occurs when this dimension is not present. B+ Ambisonics also allows the listener to adjust the soundfield for his/her room and to adjust the ambience to their taste.
Thomas Chen ThomasChen@aol.com
Minimally-realistic horizontal-only reproduction of 360-degree sound fields (music or otherwise) for people with heads that can turn will require at least six identical full range speakers that are equidistant from the center of the listening area, and equally-spaced from one another. Full spherical surround reproduction will require a minimum of eight identical full range speakers that are mounted on the surface of an imaginary sphere that is centered around the listening area, and are equally spaced from one another. Ambisonics provides a convenient but not exclusive way to record signals that can be optimally reproduced in these ways.
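The two layouts described above can be sketched numerically. This is only an illustration under stated assumptions: the function names, the 30-degree starting azimuth, and the choice of a cube as the "eight equally spaced speakers on a sphere" are mine, not the author's.

```python
import math

def horizontal_ring(count, radius_m=1.0, first_azimuth_deg=30.0):
    """Return (azimuth_deg, x, y) for `count` equally spaced speakers on a circle.

    Azimuth 0 is straight ahead and increases counterclockwise (toward the left).
    The starting azimuth is an arbitrary assumption.
    """
    step = 360.0 / count
    ring = []
    for i in range(count):
        az = (first_azimuth_deg + i * step) % 360.0
        x = radius_m * math.cos(math.radians(az))  # forward component
        y = radius_m * math.sin(math.radians(az))  # leftward component
        ring.append((az, x, y))
    return ring

def cube_layout(radius_m=1.0):
    """Return (x, y, z) for eight speakers at the corners of a cube inscribed in a sphere."""
    a = radius_m / math.sqrt(3.0)  # half the cube edge, so every corner is radius_m from center
    return [(sx * a, sy * a, sz * a)
            for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]

hexagon = horizontal_ring(6)   # six speakers, 60 degrees apart
cube = cube_layout()           # eight speakers, all equidistant from the listener
```

Every speaker in both layouts sits at the same distance from the listening position, which is the condition the paragraph above insists on.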
The 5.1 "movie" speaker layout can be used for music with excellent, albeit sub-optimal, results if: 1) there is only one listener and his/her location relative to all speakers (including distance and coordinates) is exactly known; 2) electronic means, such as delays, EQ, ambisonic transcoding, and level adjustments, are used on all speakers to achieve exact phase alignment, level and EQ matching of all channels; and 3) the listener always only faces forward while listening.
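As a rough sketch of the delay and level part of condition 2, the trims for speakers at unequal distances can be computed as follows. The speed of sound, the free-field inverse-distance law, the function name, and the room distances are all assumptions for illustration; real rooms also need the EQ matching the author mentions, which is not shown.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C

def align_speakers(distances_m):
    """Map speaker name -> (delay_ms, gain_db) so every arrival matches the farthest speaker."""
    farthest = max(distances_m.values())
    trims = {}
    for name, distance in distances_m.items():
        # Nearer speakers are delayed so their sound arrives with the farthest one.
        delay_ms = (farthest - distance) / SPEED_OF_SOUND_M_S * 1000.0
        # Nearer speakers are attenuated (inverse-distance law, free-field assumption).
        gain_db = 20.0 * math.log10(distance / farthest)
        trims[name] = (delay_ms, gain_db)
    return trims

# Hypothetical room: the center speaker sits much closer than the left speaker.
trims = align_speakers({"left": 3.43, "center": 1.715})
```

In this made-up case the center speaker would need about 5 ms of delay and about 6 dB of attenuation to line up with the left speaker.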
Acoustic design of concert halls has shown that a "good" hall generates for the audience a sound field with three distinct facets. To create a convincing illusion in your living room, an audio system must be able to reproduce all three facets.
One facet of the sound field is the direct sound that travels from the musicians to the listener - stereo can be good at reproducing this (as long as the musicians are placed between the speakers). Another facet is the reverberation, not perceived as originating from any particular direction; 5.1 speaker-based channels can be good at reproducing this. The third facet is discrete reflections, also called lateral reflections; reproducing this facet is much more challenging.
Discrete reflections travel from the musicians to the listener via the floor, ceiling or a wall; some travel via two or three reflections. Discrete reflections can arrive at the listener from any direction. Those that arrive within about 30 ms of the direct sound are not perceived as separate sounds but, instead, provide the listener with a sense of "space". Those that arrive after about 30 ms are perceived as echoes and provide the listener with a sense of "envelopment".
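To put the 30 ms figure in terms of hall geometry, the delay of a reflection follows from how much longer its path is than the direct path. The sketch below assumes a speed of sound of 343 m/s and uses hypothetical path lengths; the 30 ms threshold is the approximate value quoted above, not a hard limit.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C
FUSION_LIMIT_MS = 30.0      # approximate threshold quoted in the text

def reflection_delay_ms(direct_path_m, reflected_path_m):
    """Delay of a reflection relative to the direct sound, in milliseconds."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND_M_S * 1000.0

def classify_reflection(direct_path_m, reflected_path_m):
    """Label a reflection with the perceptual role the text assigns to it."""
    delay = reflection_delay_ms(direct_path_m, reflected_path_m)
    return "space" if delay <= FUSION_LIMIT_MS else "envelopment"

# Hypothetical paths: a nearby side wall versus a distant rear wall.
near_wall = classify_reflection(10.0, 13.43)
rear_wall = classify_reflection(10.0, 25.0)
```

At this speed of sound, 30 ms corresponds to a reflected path roughly 10 m longer than the direct one.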
Because discrete reflections can arrive from any direction, to reproduce them a surround sound system must be able to reproduce sounds from every direction. Given a limited number of speakers, what separates the surround sound wheat from the chaff is the ability to reproduce sounds from between the speakers. This is not, as many think, determined by the channel separation, but by the mixing style. To date all movie sound tracks have been mixed using the pair-wise mixing style. It is not 5.1 speaker-based channels that are the problem; the culprit is this pair-wise mixing style!
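For readers unfamiliar with the term, pair-wise mixing can be sketched as a pan law that only ever drives the two speakers adjacent to the intended direction. The code below is a minimal illustration, assuming a constant-power sine/cosine law and the commonly quoted ITU angles for the five main speakers (center 0, left/right at 30, surrounds at 110 degrees); the function name and angle convention are mine.

```python
import math

# Assumed layout, azimuths counterclockwise from straight ahead:
# center, left, left surround, right surround, right.
ITU_5_0_AZIMUTHS = [0.0, 30.0, 110.0, 250.0, 330.0]

def pairwise_pan(source_az_deg, speaker_az_deg):
    """Constant-power pan between the two speakers that bracket the source.

    `speaker_az_deg` must hold at least two distinct azimuths sorted ascending in [0, 360).
    All other speakers receive nothing, which is the defining trait of pair-wise mixing.
    """
    count = len(speaker_az_deg)
    gains = [0.0] * count
    src = source_az_deg % 360.0
    for i in range(count):
        a = speaker_az_deg[i]
        b = speaker_az_deg[(i + 1) % count]
        span = (b - a) % 360.0       # angular width of this speaker pair
        offset = (src - a) % 360.0   # how far the source lies past speaker a
        if offset <= span:
            frac = offset / span
            gains[i] = math.cos(frac * math.pi / 2.0)
            gains[(i + 1) % count] = math.sin(frac * math.pi / 2.0)
            break
    return gains

rear_center = pairwise_pan(180.0, ITU_5_0_AZIMUTHS)
```

A source panned directly behind the listener ends up as equal signals in the two surrounds only, so its image depends entirely on a phantom formed across a 140-degree gap.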
To understand why pair-wise mixing is a bad idea in surround sound, please perform the following very simple experiment. Play, in stereo, a pair-wise mixed CD that has good phantom images (almost all CDs use this mixing style). Turn your chair through 90 degrees. If you still hear stable phantom images when both speakers are to one side then you are a space alien because humans cannot do this. Pair-wise mixing did not work in the quadraphonic era and it will not work now with 5.1. Such an absolute statement can be made because the way that humans localize sound has not changed.
Note that you performed this experiment with a CD. CDs have complete separation between the two channels, so clearly good separation is not sufficient for accurate imaging in surround sound. The lesson here, I would suggest, is that surround is not just stereo with more speakers. What is needed is an entirely different approach. There are several possible approaches (Ambisonics, Ambiophonics, Transaural, binaural) which each have advantages and disadvantages. Personally, I believe that Ambisonics strikes the best balance between the pros and cons.
Of course in an ideal world we'd all have something like at least 8 channels of audio, with a speaker in each corner of the room (top and bottom) for periphonic sound. But realistically speaking, the problem with 5.1 is less the speaker layout, than the way the surround sound is mixed and produced, as well as the lack of proper calibration and setup of the average system. Surround sound can be pretty darn nice even on a 5.1 system, if you're playing back something like Ambisonic G-Format on a properly calibrated system. Not perfect, but very enjoyable, and a lot better than simple stereo (all else being equal).
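So the two main issues are the way surround sound is mixed and produced, and the lack of proper calibration and setup in the average system.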
At today's DSP prices, every surround receiver should come with a microphone plug where you can plug in a measuring microphone. Then the system should run a frequency-sweep impulse test out of each speaker, and do proper room and speaker phase correction for the system. If this were done in a mass market product, it could be fairly affordable. Then if the sound sources were G-Format Ambisonic mixes, we'd have made a huge step forward.
Ronald C.F. Antony
It's not hard to see why a great many people are of the opinion that while 5.1 may be fine for movie theaters, it is not the best solution for music surround. It should not be forgotten that 5.1 was developed to deal with certain problems that beset analog movie theaters, and these problems simply are not present in a modern digital distribution chain such as is available with technologies like DVD-Audio. The Center Front was required in movie theaters because front speakers too far apart led to a hole in the middle and loss of movie dialog. The LFE was required to carry low frequency EFFECTS, such as T.Rex footfalls and asteroids crashing into the Earth - and these do not occur in music: not even in "heavy rock" (sorry, I couldn't resist). The effects channel was necessary to avoid intermodulation effects that would have occurred if sub-bass had been added to other channels. It is arguably beneficial to use the LFE in a digital distribution environment, for sub-bass movie effects only, to avoid problems with headroom, but as every channel in a DVD-Audio disc, for example, is capable of delivering the entire audible range, there should be no need for it in normal musical applications.
So we can say that the LFE is not required for music and could instead be used for something else more interesting - and some record companies such as Telarc, Chesky, MDG and Divox, use it for height information. We have seen that the CF is, also, not particularly necessary. Indeed many music engineers, brought up on a virtual front stage delivered by two speakers at sixty degrees, prefer the "virtual center" that this configuration provides over stuffing a signal up the center front so that it sticks out like a sore thumb, because the trouble with simply panning sounds to a CF channel is that they are no longer integrated with the front stage. A third SPEAKER, not a third CHANNEL, at center front can, however, be used to decode a stereo front stage more accurately, and in a completely integrated fashion, using a technology such as Trifield, either in the studio or in the playback system.
But all this is tap-dancing around the fundamental problem. Certainly, it must be remembered that whatever better ideas we may think we have, we are stuck with 5.1 and its descendants for the foreseeable future, and we need to learn to live with it and remain compatible with it until it goes away, which may be some time. The real problem is that we have become used to something much more insidious than mere 5.1: we have come to believe that 'one-to-one mapping' is all there is. 5.1 is an example of this, but it is only a special case.
Since the days of quad, we have been led to believe that the ultimate in surround sound involved capturing sound with n microphones, transferring their signals via n channels, and replaying them with n loudspeakers placed in something like the directions the mics were pointed in. What the world was waiting for, according to this idea, was the availability of distribution media with n high-resolution channels to do the job properly. This, I am afraid, is complete rubbish.
What we really ought to be doing is to capture and/or mix the sound in the most artistically and technologically satisfactory way possible. This might involve one mic or many, depending on the project and the intent of the production team. The resulting signals should be transmitted in the most effective and efficient way possible: this does not mean one channel per microphone, it means representing a multi-dimensional soundfield in the most efficient way. And finally at the other end, these efficient distribution channels need to be decoded to drive an appropriate number of speakers, the feed for each speaker being derived as a function of its location in the listening environment.
The most obvious and best-known method of doing this is Ambisonics, though it is no doubt not the only method of achieving this goal. However, using simple first-order Ambisonics as an example, we can see how this concept works in practice. To begin with, envisage the three-dimensional soundfield being captured by a suitable microphone array or multitrack recording with multi-channel panpots or a combination of the two.
The resulting signals are now encoded into a series of sum and difference signals, essentially similar to a three-dimensional development of the Blumlein X-Y technique: one channel carries the sum of all the dimensional signals, i.e. Left+Right+Front+Back+Up+Down, while other channels carry the differences: L-R, F-B, U-D. This signal set is called B-Format, and is an extremely efficient way of representing the surround-sound we hear - note that it only uses four channels to carry everything needed to recreate a full three-dimensional (with height) soundfield.
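The encoding step can be illustrated with the conventional first-order panning equations, in which a mono source at a given direction is distributed over the four B-Format channels. This is a minimal sketch: the function name and the angle convention (azimuth counterclockwise from straight ahead) are assumptions, and the 1/sqrt(2) weighting on W follows the traditional B-Format convention rather than anything stated in this article.

```python
import math

def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into first-order B-Format (W, X, Y, Z).

    Azimuth is measured counterclockwise from straight ahead (positive = left);
    elevation is measured upward from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omnidirectional pressure (traditional weighting)
    x = sample * math.cos(az) * math.cos(el)  # front minus back
    y = sample * math.sin(az) * math.cos(el)  # left minus right
    z = sample * math.sin(el)                 # up minus down
    return w, x, y, z

front = encode_b_format(1.0, 0.0)         # source straight ahead
left = encode_b_format(1.0, 90.0)         # source hard left
overhead = encode_b_format(1.0, 0.0, 90.0)  # source directly above
```

Note that nothing in these four signals refers to any loudspeaker position, which is the point the author is making about separating the transmitted channels from the speaker feeds.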
At the receiving end, the B-Format signals are decoded to suit a multi-speaker array that is practical for the listening environment. More speakers may be better; height might be nice; but essentially given some basic ground-rules, with a reasonable number of speakers in reasonable places, you can get excellent results in an ordinary living room, or a home theater, or even in a movie theater or auditorium. The incoming B-Format signals are decoded for the speaker array, and the speakers can be more or less where you like, within reason.
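To show how one set of B-Format signals can feed different speaker arrays, here is a deliberately simple horizontal decode for regular (equally spaced) layouts. It is a basic projection decode chosen for clarity, not the decoder of any particular product; practical Ambisonic decoders add refinements that are not shown. The function name and the example layouts are assumptions.

```python
import math

def decode_horizontal(w, x, y, speaker_azimuths_deg):
    """Derive one feed per speaker from horizontal B-Format (W, X, Y).

    Each feed depends only on that speaker's own azimuth, so the same
    transmitted signals serve any regular array.
    """
    count = len(speaker_azimuths_deg)
    feeds = []
    for az_deg in speaker_azimuths_deg:
        az = math.radians(az_deg)
        feed = (math.sqrt(2.0) * w + x * math.cos(az) + y * math.sin(az)) / count
        feeds.append(feed)
    return feeds

# B-Format for a unit source at 45 degrees left of center (W carries the 1/sqrt(2) weighting).
w = 1.0 / math.sqrt(2.0)
x = math.cos(math.radians(45.0))
y = math.sin(math.radians(45.0))

square = [45.0, 135.0, 225.0, 315.0]
hexagon = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
feeds_square = decode_horizontal(w, x, y, square)
feeds_hexagon = decode_horizontal(w, x, y, hexagon)
```

With the square array the speaker at 45 degrees is loudest, its two neighbors contribute at half that level, and the opposite speaker is silent; with the hexagon the same input is spread over six feeds without any change to the transmitted signals.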
This is all very well, but what about the 5.1 compatibility I mentioned earlier? This is taken care of, thanks to the work behind a paper presented by R&D staff from Meridian Audio Ltd at the recent AES Banff conference. Built into MLP, the lossless packing system at the heart of DVD-Audio, is the ability to code hierarchical surround information (such as Ambisonic information derived from B-Format) and flag it in the metadata, such that the result can be played direct into a 5.1 loudspeaker system with no special equipment whatsoever and be completely compatible. However, with a suitable decoder, switched in automatically by a flag in the datastream if desired, the same information can be decoded for the listener's specific speaker array, using Ambisonic technology. The transmitted signal delivers a hierarchy of information, that can be decoded according to the equipment available, or simply used as standard 5.1 speaker feeds. The best of both worlds.
A simple version of this concept exists in the Trifield technology I mentioned earlier. Trifield, based on Ambisonic research by Dr Geoff Barton and Michael Gerzon, essentially decodes a 2-channel stereo mix for three loudspeakers. As in the more sophisticated system described above, the original multichannel/multimic or whatever source, mixed to stereo, is carried via two channels, but decoded to three loudspeakers that work together to generate a fully integrated stereo sound stage that is superior in imaging and image stability to either two-channel, two-speaker stereo or three-channel "panpotted mono". But Trifield is simply the beginning. Because with hierarchical coding, you can have your 5.1 cake and eat it too, right now.
(Richard Elen has contributed to leading professional and consumer audio journals. Formerly Marketing VP at Apogee Electronics, he is now head of Creative Services at Meridian Audio Ltd in the UK. This article does not necessarily represent his employers' view on the subject.)
I proclaim these psychoacoustic truths to be self-evident: Human sound localization is determined in three and only three ways (not counting bone conduction):
1. Time differences between the ears (ITD), including phase and transient-edge differences. Includes the precedence effect.
2. Amplitude differences between the ears (ILD).
3. Single and double pinna direction-finding effects.
Each of these mechanisms is only effective in a specific frequency range, but they overlap, and the predominance of one over the other also depends on genetics, the nature of the signal, i.e. sine wave, pink noise, music, venue, etc.
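To give a feel for the size of the first cue, the interaural time difference for a distant source can be estimated with Woodworth's classic spherical-head approximation. This formula is not from the author; the head radius and speed of sound are assumed typical values, and the result is only a rough guide.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average adult head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed: dry air at about 20 degrees C

def woodworth_itd_us(azimuth_deg):
    """Approximate ITD in microseconds for a distant source 0 to 90 degrees off center.

    Woodworth's spherical-head model: ITD = (r / c) * (theta + sin(theta)).
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta)) * 1e6

itd_center = woodworth_itd_us(0.0)    # source straight ahead
itd_speaker = woodworth_itd_us(30.0)  # source at a standard stereo speaker position
itd_side = woodworth_itd_us(90.0)     # source directly to one side
```

Under these assumptions the ITD grows from zero straight ahead to roughly two thirds of a millisecond for a source directly to one side, which is the full range the time cue has to work with.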
For a full range complex sound such as music experienced live, all three mechanisms are always in play and normally agree. By definition such an experience is said to be realistic. If the three mechanisms are not consistent then we often make errors in localization such as in most headphone listening where the interference with the pinna mechanism usually results in internalization even if the ILD and ITD are perfect.
Let me discuss the relative strengths of the three mechanisms listed above. Snow and Moir in their classic papers showed that localization of complex signals in the pinna range above 1000 Hz was superior by a few degrees to localization that relied solely on lower frequencies. That is, their subjects could localize at high frequencies to within one half a degree but only to one or two degrees at lower frequencies. The accuracy of localization in general declines with frequency until at 90 Hz or so it goes to zilch. Remember this when we get to discuss crosstalk.
It is important for understanding the workings of stereophony or 5.1 that you are convinced that all three mechanisms are significant and I would suggest, with Snow and Moir, that the pinnae are first among equals. You should satisfy yourself on some of this by running water in a sink to get a nice complex high frequency source. Close your eyes to avoid bias, block one ear to reduce ILD and ITD and see if you can localize the water sound with just the one open ear. Point to the sound, open your eyes, and like most people you will be pointing correctly within a degree or so. With both ears you should be right on despite having a signal too high in frequency to have much ITD or ILD. But with your two pinnae agreeing and the zero ILD clue, the localization is easily accurate.
Again, if a system such as stereo or 5.1 cannot deliver the ITD, ILD and pinna cues intact without large errors it cannot ever deliver full localization verisimilitude for signals like music. If the cues are inconsistent localization may occur but it is fragile, it may vary with the note or instrument played, and such localization is usually accompanied by a sense that the music is canned, lacks depth, presence, etc. Mere localization is no guarantee of fidelity.
Let us now look at the stereo triangle in reproduction and the microphones used to make such recordings and see what happens to the three localization cues. Basically stereophonics is an audible illusion, like an optical illusion. In an optical illusion the artist uses two dimensional artistic tricks to stimulate the brain into seeing a third dimension, something not really there. The Blumlein stereo illusion is similar in that most brains perceive a line of sound between two isolated dots of sound. Like optical illusions, where one is always aware that they are not real, one would never confuse the stereophonic illusion with a live binaural experience. For starters, the placement of images on the line is nonlinear as a function of ITD and ILD, and the length of the line is limited to the angle between the speakers.
I want to get to the ILD/ITD phantom imaging issue. But let us first get the pinna issue tucked away. No matter where you locate a speaker, high frequencies above 1000 Hz can be detected by the pinna and the location of the speaker will be pinpointed unless other competing cues override or confuse this mechanism. In the case of the stereo triangle the pinna and the ILD/ITD agree near the location of the speakers. Thus in 5.1 LCR triple mono also sounds fine especially for movie dialog. In stereo, for central sounds, the pinna angle impingement error is overridden by the brain because the ITD and the ILD are consistent with a centered sound illusion since they are equal at each ear. The brain also ignores the bogus head shadow since its coloration and attenuation is symmetrical for central sources and not large enough to destroy the stereo sonic illusion. Likewise, the comb-filtering due to crosstalk, in the pinna frequency region, interferes with the pinna direction finding facility thus forcing the brain to rely on the two remaining lower frequency cues. All these discrepancies are consciously or subconsciously detected by golden ears who spend time and treasure striving to eliminate them and make stereo perfect. Similarly, the urge to perfect 5.1 is now manifest.
Consider just the three front speakers in 5.1. Unless we are talking about three channel mono, we really have two stereo systems side by side. Remember, stereo is a rather fragile illusion. If you listen to your standard equilateral stereo system with your head facing one speaker and the other speaker moved in 30 degrees, you won't be thrilled. The ILD is affected since the head shadows are not the same with one speaker causing virtually no head shadow and the other a 30 degree one. Similarly the pinna functions are quite dissimilar. (In the LCR arrangement the comb-filtering artifacts now are at their worst in two locations at plus and minus 15 degrees instead of just around 0 degrees as in stereo.) Thus for equal amplitudes (such as L&C) where a signal is centered at 15 degrees, as in our little experiment, the already freakish stereo illusion is being strained. Finally, the ITD is still okay and partly accounts for the fact that despite the center speaker there is still a sweet spot in almost all home 5.1 systems. Various and quite ingenious 5.1 recording systems try to compensate for some of these errors but the results are highly subjective and even controversial.
Surround Sound Localization
Let us consider surround sound localization. Obviously, if a mono signal is placed at 110 degrees it can be localized using pinna, ILD, and ITD even when facing forward. Between the two rear surround speakers you have effectively a stereo pair spanning 140 degrees. In such a situation, if there is a lot of high frequency energy, the pinna will localize to the speakers and it will be difficult for some individuals to hear sound directly behind or in the central rear region. (The new rear surround channel can fix this, but the LCR anomalies as above will then apply.) However, if there is a real ITD and a real ILD between the rear speakers it is theoretically possible to hear a wide stage to the rear as in the stereo illusion. However the crosstalk, and thus the comb-filtering, is extreme at this angle and it starts at a lower frequency thus interfering with the ILD at 800 Hz or lower. If there is an ITD this can help but then the speakers must be properly placed or delay adjusted. Obviously, if this was a good way to make a stereo stage, front or rear, it would have been done this way long before now.
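The claim that crosstalk comb-filtering starts at a lower frequency for a widely spaced pair can be checked with a rough estimate. When both speakers carry the same signal, each ear receives the near speaker's sound plus a delayed copy from the far speaker; the first cancellation falls where that delay equals half a period. The sketch below assumes Woodworth's spherical-head delay formula, a typical head radius, and equal-level crosstalk with no head-shadow attenuation, none of which come from the author.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average adult head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed: dry air at about 20 degrees C

def crosstalk_delay_s(speaker_offset_deg):
    """Extra travel time to the far ear for a speaker this many degrees off the median plane."""
    theta = math.radians(speaker_offset_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))

def first_comb_notch_hz(speaker_offset_deg):
    """Lowest frequency at which direct sound and crosstalk arrive half a period apart."""
    return 1.0 / (2.0 * crosstalk_delay_s(speaker_offset_deg))

front_pair_notch = first_comb_notch_hz(30.0)  # stereo pair spanning 60 degrees
rear_pair_notch = first_comb_notch_hz(70.0)   # surround pair spanning 140 degrees
```

Under these assumptions the first notch sits near 1.9 kHz for a 60-degree front pair and near 900 Hz for a 140-degree rear pair, which is in line with the author's point that the rear pair is affected at much lower frequencies.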
Finally, let us see what happens when we try to image from the front side speaker to a speaker at 110 degrees on the same side while facing forward. In the case of the pinnae, the pinna facing the speakers can localize to each speaker discretely if the signals are different. If they are correlated or identical the brain will use some other cue to localize. There may be some gifted individuals who can localize high frequency phantoms between the speakers using one pinna but I can't do it. The higher frequencies also go around the head to produce a head shadow and this at least allows the brain to decide the source is at the loud side.
If there is a time difference, then the two signals from each speaker reach the exposed ear canal and add together to produce garbage and a head shadowed version of this time garbage also reaches the far ear. Basically, regardless of the recorded TD, the ITD the brain perceives is always the ITD based on one's ear spacing. However this is sufficient to localize to the louder side but makes localization between the speakers wishful thinking.
Deja vu. If there is a level difference, then the two signals from each speaker reach the exposed ear canal and add together to produce garbage and a head shadowed version of this level difference garbage also reaches the far ear. Basically regardless of the recorded LD, the ILD the brain perceives is always the ILD based on one's head shadow. However this is sufficient to localize to the louder side but makes localization between the speakers wishful thinking.
That the above scenario is more or less correct is attested to by the fact that the industry keeps adding more speakers to correct these defects. We have the rear speaker, now height speakers, and the 7.1 and 10.2 proposals, etc. There are now several psychoacoustically valid reproduction and even recording methods to avoid all the physiological problems enumerated above. Among them are Ambisonics, Wavefield Synthesis and Ambiophonics. Of these Ambiophonics is most suited to small rooms and one-couple listening and can play existing LPs, CDs, DVDs and SACDs.
It's tough for me to respond to a statement like "5.1 speaker-based channels are fine for movies but inappropriate for music-in-surround. Instead we should be employing six (or less) channels in... configuration." In my humble opinion, a preferable way to phrase it would go something like: Which speaker/channel configuration delivers the most realistic reproduction of an actual sonic event (if that is one's goal)?
As a music lover, recording engineer and audiophile record label owner, I disagree with your assessment of 5.1 channel surround sound. I wouldn't think of changing my production process. And not because I don't believe in any of the other formats you mention but rather because I produce recordings with a completely different set of assumptions. My recordings/releases are not meant to recreate a live event in a documentary fashion. I have deliberately chosen to use high-resolution recording technology, a multichannel recorder, lots of stereo pairs of microphones and the ITU 5.1 channel standard for delivering music to consumers because I can go so far beyond what happens during a live event.
In short, the musical experience is enhanced through technology so that the listener hears something that is hyper-real... detailed, dynamic, immersive and engaging. As you know I have done a lot of binaural and minimally-miked recording. In comparing the very open, distant, reverberant sound of that type of tracking to the close yet deep sonics that I capture on our DVD-Audio discs, I much prefer the musicality and intimacy of the latter. 5.1 channel systems, when properly set up and calibrated, deliver an absolutely incredible music experience... with the right software.
Mark Waldrep, Ph.D.
For those who after reading this are interested in the Sursound mailing list:
Several hi-res multichannel labels have been experimenting with alternatives to typical 5.1, including the old quadraphonic display of 4.0 channels, using either the center channel or LFE channel or both for side or height information, and the most extreme alternative: using the center and LFE to feed two front speakers directly over the present front L & R, creating an equal vertical square pattern (while retaining the surrounds in their normal positions). These "2+2+2" DVD-As from MDG & Divox will be reviewed in Audiophile Audition shortly and shared with Positive Feedback Online readers.