Positive Feedback ISSUE 65
january/february 2013

Mountains and Fog: the Sound of Digital Converters, Part 1
by Lynn Olson

I've been listening to the latest round of delta-sigma DACs, at substantial price points of more than $6000, and for reasons I don't completely understand, they sound alike, and quite different than the $1100 Monarchy N24 that I've had for the last ten years. The technologies are pretty different: the N24 uses the TI/Burr-Brown PCM1704 ladder/R-2R converter, passive current-to-voltage conversion, with zero-feedback vacuum-tube amplification. (This is NOT the same as adding a "tube buffer" stage to a conventional voltage-output DAC chip; that approach just adds 2nd-harmonic syrup to a signal already damaged by a mediocre opamp inside the DAC chip.)

The delta-sigma DACs I've been auditioning use a completely different architecture: delta-sigma chips from Crystal Semiconductor, Analog Devices, and the ESS Sabre 9018, along with all-solid-state electronics. Some of these use opamps, some discrete FETs and bipolar transistors, but they all have a common sound, what I'm beginning to think of as the "delta-sigma" sound.

What's going on here? I hear it in my iPod Touch driving Sennheiser HD580 headphones, I hear it in my Marantz AV8003 pre-pro that I use for movies, and I hear it in the latest group of DACs. Yes, they all sound modern and up-to-date; smooth, pleasant, free of the grit-and-grain that plagued early digital, and play 88.2, 96, and 192 kHz digital with the greatest of ease. But there's something missing. I don't have any fancy audiophile words for it, but it's an absence of life, of sparkle, of that intangible sense of being right in the room with the performer.

If a singer is singing right at the ragged edge of their vocal range (and doing it intentionally to create a sense of tension), it is much less noticeable with DACs using delta-sigma chipsets. The impression of physical texture (a hand lightly brushing across the head of a drum, the sensation of wood and steel and weight from a grand piano, the odd and fascinating tonal meanderings of an oboe) is diminished or absent. In the highest-quality DAC with the ESS Sabre 9018 converter, it's subtle, and takes a quick comparison with the PCM-1704 DAC to hear the difference. In most other delta-sigma DACs, the loss is not subtle at all, and the performers sound bored, like they're just phoning in the performance. The nuances and over-the-top aspects are absent, replaced by a sort of monotone quality.

After a while, I kind of get used to this sound—the audiophile virtues of clarity, dimension and smoothness are certainly all there with the best of the high-end DACs—but then switch back to the Monarchy using the Burr-Brown PCM-1704 DAC and go "AH!". All the vividness and tone color suddenly returns, and the performers, instruments and hall-space sound real again.

After several weeks of this, I started to wonder. What's going on here? Is this just me? Have my tastes gotten so idiosyncratic I've completely departed from the mainstream of audio? Well, no.

Karna hears the same thing, just without the high-falutin' audiophile language I use. She thinks my setup (with the Monarchy N24) sounds "live", like a performance right in front of you. The others might be "accurate" as audiophiles understand it, but they don't sound "live"; they sound canned, electronic, recorded, not real. She thinks, and I have to agree 100%, that audiophiles have it all wrong with this obsession with "accuracy". The so-called "accurate" systems almost never sound "live"; they mark all the tick-boxes on a checklist, but they just don't sound like real performers playing real instruments in a real space. (This was underlined by hearing two different sets of performers playing outdoors in the Pearl Street Mall in Boulder, Colorado. What we heard didn't sound like an audiophile system at all; the musicians playing a wide variety of acoustical instruments sounded loud, exciting, thrilling, real music, with real people standing around and applauding.)

With my sense of "live" versus "accurate" re-calibrated by Karna and what we both heard in Boulder, I carefully re-read the glowing reviews that the top-of-the-line DACs received in the audiophile press, trying to find a common thread.

I started to wonder if it's the associated equipment used by the other reviewers. After digging through trivia like cables and stands, I looked at the amps and speakers they were using. Bingo! THAT was the difference. All the reviewers that wrote the glowing reviews were using high-powered (200 watts or more) Class AB transistor amps with plenty of feedback (very low distortion, high damping factor, etc.).

My amplifiers are quite different. I'm not claiming "better" (that comes down to taste and power demands of the loudspeakers), but very different. My electronics are all-triode throughout, have zero local or global feedback, and where regulation is used, it is shunt regulation that also does not use feedback. In the transistor world, series regulation is more common, and the series regulators typically have very high degrees of internal feedback. The forward path of the transistor amplifiers relies on local, or more commonly, global feedback (around the whole amplifier) to linearize the Class AB output devices, decrease output impedance (raising the damping factor), extend bandwidth, and stabilize gain.

Rather than get into a religious argument about feedback, which has been going on in the audio community for fifty years now, let's just say that non-feedback and feedback amplifiers operate in fundamentally different ways. In non-feedback amplifiers, static THD specifications range from quite poor to fairly good, while THD in feedback amplifiers ranges from moderately low to unmeasurable. In non-feedback amplifiers, particularly vacuum-tube amplifiers with direct-heated triodes, the emphasis is on device linearity, as well as circuit design to optimize operating points for lowest distortion and best HF performance. Class AB operation is not an option in the non-feedback world; the sharp transition in gain as the output devices switch on and off results in unacceptable distortion at low levels.

The (inevitable) tradeoff is much lower power levels. A 20-watt amplifier is a big amplifier in the direct-heated-triode world. A 60-watt amplifier is at the outer limit of the possible; beyond that, all you have are transmitter triodes that operate at 1kV or more plate voltage, and very likely forced-air cooling. The common audiophile trick of using parallel tubes has the problem that the transfer curves of the paralleled tubes do not exactly match, resulting in small lumps in the overall transfer curve (which translates to high-order distortion products).

In other words, when you see a lot of transistors or tubes (more than two per channel), it's done for power, not linearity. Manufacturers like to brag about device matching, but that is matching for DC parameters, not dynamic matching, which is very difficult, since exact pairs are very rare. The more pairs of devices an amplifier has, the less likely it is they will dynamically match (have the same transfer curves). Whether we like it or not, in the absence of feedback, the most linear amplifiers will have two output devices per channel—whether the amplifier is MOSFET, bipolar transistor, pentode, or direct-heated-triode.

A maximum output of 20 to 30 watts/channel, along with the requirement for four matched 300B's (two for left channel, two for right channel), is not a tradeoff that most audiophiles want to make. From the perspective of most audiophiles, it's too much money for too few watts, no matter how pristine those watts might be.

Speakers that have true (Thiele/Small) efficiencies in the 82 to 88dB/watt/meter range, which is the vast majority of speakers on the market, are not a good match for a 20-watt amplifier. The Ariels, which are 92dB efficient, are just about efficient enough for the 20-to-30 watt Karna amplifiers, providing peaks of 105dB at the listening position. If the Ariels were 10dB less efficient (and most highly-reviewed audiophile speakers are), they would need 200 watts to play at the same levels, and direct-heated triodes would be out of the question.
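The 200-watt figure follows from the decibel rule that every 10dB of lost efficiency costs ten times the power. A minimal Python sketch of that arithmetic is below; the function names are hypothetical, and the SPL helper is an idealized one-speaker, one-meter, no-room estimate rather than a measurement at the listening position.

```python
import math

def required_power(watts: float, efficiency_drop_db: float) -> float:
    """Power needed to play equally loud after losing efficiency_drop_db of efficiency."""
    return watts * 10 ** (efficiency_drop_db / 10)

def spl_at_one_meter(efficiency_db: float, watts: float) -> float:
    """Idealized SPL at 1 m for one speaker (assumes no room gain, no compression)."""
    return efficiency_db + 10 * math.log10(watts)

# A 20-watt amplifier driving a speaker that is 10 dB less efficient
print(required_power(20, 10), "watts")
# 92 dB/W/m speaker driven with 20 watts
print(round(spl_at_one_meter(92, 20), 1), "dB SPL at 1 m")
```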

Only transistor amps can provide that kind of power, and they are typically feedback amplifiers with Class AB output stages. Most of the so-called "Class A" transistor amps are actually Class AB amplifiers with sliding bias; a true Class A transistor amplifier with a 200-watt output would require an equipment rack all its own, along with forced-air cooling. If the heatsink doesn't burn your hand, it's a Class AB amplifier, no matter what the manufacturer says.

Each of my 20-to-30 watt Karna amplifiers dissipates 80 to 100 watts of heat from the pair of 300B plates; scale that by ten times, and you have a pair of large room heaters that have to be powered by 240-volt AC power. See why real Class A operation is not very popular, and is widely abused as a marketing slogan when applied to transistor amplifiers?

60-watt Class AB pentode-with-feedback amplifiers are kind of midway between the two groups, although subjectively, they sound a bit different than either. Pentodes have 3 to 5 times as much high-order harmonic distortion as direct-heated triodes, requiring the intervention of feedback to bring distortion down to reasonable levels, as well as to increase the damping factor. Bipolar transistors and MOSFETs have even higher levels of upper-harmonic distortion, as well as much sharper Class AB switching, but are easier to use with feedback since the circuitry is direct-coupled and transformerless, which greatly simplifies application of feedback. (In transformer-coupled tube amplifiers, the combined phase shifts of the output transformer and internal RC-coupling limit overall feedback to 20dB or less. By contrast, pentode OTL amplifiers may have as much as 40 to 50dB of feedback.)

Direct-heated triode amplifiers are off in their own little oddball corner of audiophilia; the whole debate over single-ended versus push-pull is small change compared to more basic things like device physics and circuit design. The physics of the amplifying device dominates the sound, like it or not. Everything else is seasoning and garnish. The same is true for loudspeakers; the transducer, and its underlying principle of operation, dominates the sound.

And now, finally, we come to the DAC. My feeling is the converter itself, and its principle of operation, is what dominates the sound of digital sources. Of course lossy-compression digital sounds (much) worse; what do you expect when 90% of the sound is discarded and thrown away at the record end, and has to be guessed (interpolated) on final reconstruction?

And of course Sony and Philips set the bar too low with a sample rate of 44.1kHz and a bit depth of 16 bits as the hard-and-fast standard for Compact Discs. The problems with the 44.1kHz sample rate are well-known; 22.05kHz is much too close to the upper edge of the CD bandwidth of 20kHz, requiring an absurd lowpass filter with 96dB of attenuation in a small fraction of an octave. Since analog filters like this are nearly impossible to build (and tune accurately), designers of CD players adopted the principle of 4x (or higher) oversampling and digital filtering by the late Eighties, which relaxed the analog filter to 1st through 3rd-order lowpassing.

A more serious problem is low bit depth; although (barely) acceptable for a consumer medium, 16-bit resolution proved unacceptable in the studio, since each track of a multitrack master is recorded, and later mixed, at different levels, and 16-bit systems have no extra resolution to throw away when track 1 is -3dB down, track 2 is -10dB down, and so on. You get a distorted mess when they are all combined. By the mid-to-late Eighties, studio practice went first to 20-bit systems, and then by the early Nineties, to 24-bit systems.

The difference between 16, 20, and 24-bit doesn't sound like much, but it is huge. Let's look at the output of a standard CD player or DAC. Maximum output level, by industry convention, is 2 volts RMS with full-modulation signal on the disc. 2V rms has a plus peak at +2.828 volts, and a minus peak at -2.828 volts, for a full-swing dynamic range of 5.656 volts.

That's the maximum signal. What's the smallest? Unlike analog, there is a defined minimum signal that the disc can record. Below that, there is nothing: digital zero. 16 bits provides a total of 65,536 levels (2 to the 16th power). Divide 5.656 volts by 65,536 levels, and you get a smallest-possible level of 86.3 microvolts. Ain't nothin' below that, Jack. Just zero.

(By way of comparison, 1 bit = 2 levels = 6dB, 16 bits = 65,536 levels = 96dB, 20 bits = 1,048,576 levels = 120dB, 24 bits = 16,777,216 levels = 144dB. In practice, real-world ADC and DAC chips fall short of the claimed resolution by 2 to 5 bits.)
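The step size and the levels-to-decibels figures can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the function names are hypothetical, and the 2-volt RMS full-scale level is the convention quoted above.

```python
import math

FULL_SCALE_RMS = 2.0  # volts RMS at full modulation, per the convention cited in the text

def peak_to_peak(v_rms: float) -> float:
    """Peak-to-peak swing of a sine wave with the given RMS voltage."""
    return 2.0 * math.sqrt(2.0) * v_rms

def step_size(bits: int, v_rms: float = FULL_SCALE_RMS) -> float:
    """Voltage of one quantization step when the full swing is split into 2**bits levels."""
    return peak_to_peak(v_rms) / (2 ** bits)

def dynamic_range_db(bits: int) -> float:
    """Ratio of the number of levels to one level, expressed in decibels."""
    return 20.0 * math.log10(2 ** bits)

for bits in (1, 16, 20, 24):
    print(f"{bits:2d} bits: {2 ** bits:>10,} levels, "
          f"{dynamic_range_db(bits):6.1f} dB, "
          f"step = {step_size(bits) * 1e6:.4g} microvolts")
```

Each added bit doubles the number of levels, which is where the roughly 6dB-per-bit rule of thumb comes from.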

At first glance, it would seem like 96dB of dynamic range is plenty. A superb studio-quality master tape deck delivers a S/N ratio of about 80dB, and LP's are less than that. At the time of its introduction in 1982, the Philips/Sony Compact Disc outmeasured all consumer media by 20 to 30dB. Perfect Sound Forever, right?

Not so fast. The digital steps get coarser and coarser as the signal level goes down, exactly the opposite of any competently designed analog system. The coarseness of the steps corresponds to a rapid rise in distortion, and not harmless 2nd-harmonic, but ugly high-order distortion, turning sine waves into ragged stair-stepped square waves.

Many of the earliest digital recordings were sourced from analog mastertapes, with built-in tape hiss, as well as gently controlled overload limiting combined with very effective lowpass filtering. By accident, analog master tape provides nearly perfect signal conditioning for first-generation digital systems. The tape-recorder preconditioning provides smooth, truly random noise at the lowest levels, overload protection at the highest levels, and protection from slewing with a very effective lowpass filter around 20kHz.

By the mid-Eighties, it was realized that random-noise dithering was a requirement for all digital-audio systems. At the minor expense of about 3dB of noise, the low-level distortion can be not just masked, but completely eliminated! How is this possible?

(For a more detailed discussion of dither and how it removes low-level distortion, look here: http://izotope.fileburst.com/guides/Dithering_With_Ozone.pdf )

Let's look at the smallest bit (commonly called the Least Significant Bit, or LSB). This is the 86.3 microvolts mentioned earlier. It's either on or off; there's nothing in between. So how do we modulate what is effectively a fractional bit? Bits aren't fractional, after all—they're discrete numbers, no fractions allowed. A single bit is very precisely quantized in both time and magnitude.

This is where dithering comes in. On the record end—and it MUST be done during the encode process—the LSB is randomly toggled on and off. If we now add a fractional signal level, like 43.15 microvolts, it changes the duty cycle of those random toggling events. Instead of the LSB being on 50% of the time and off 50% of the time, it's now on 75% of the time, and off 25% of the time. Thanks to dithering (adding noise), intensity is translated into the percentage of time the LSB is on or off, giving fractional modulation.

This is a clever form of pulse-width modulation. At the level of the LSB, the system smoothly glides from PCM (pulse code modulation) to a type of quantized PWM (pulse width modulation). It is almost, but not quite, a free lunch.
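The idea that dither turns a sub-LSB level into a duty cycle can be tried out numerically. The sketch below is a deliberately simplified model, not a description of any real encoder: it assumes a constant input between 0 and 1 LSB, uniform random dither of plus or minus half an LSB, and a rounding quantizer. In this particular model the long-run duty cycle equals the input level expressed as a fraction of an LSB; other dither definitions give different percentages. All names are hypothetical.

```python
import math
import random

def quantize(level_lsb: float) -> int:
    """Round to the nearest whole LSB (halves round up)."""
    return math.floor(level_lsb + 0.5)

def duty_cycle(level_lsb: float, n_samples: int, rng: random.Random) -> float:
    """Fraction of samples in which the LSB is on, for a constant input of 0..1 LSB."""
    on_count = 0
    for _ in range(n_samples):
        dither = rng.random() - 0.5          # assumed dither: uniform, +/- 0.5 LSB
        on_count += quantize(level_lsb + dither)
    return on_count / n_samples

rng = random.Random(1)  # fixed seed so the run is repeatable
for level in (0.1, 0.25, 0.5, 0.75):
    print(f"input {level:.2f} LSB: undithered output {quantize(level)}, "
          f"dithered duty cycle {duty_cycle(level, 20_000, rng):.3f}")
```

Without dither, every input below half an LSB collapses to zero; with dither, the average of the toggling bit tracks the input, at the cost of added noise.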

Instead of digital silence, digital recordings now have a floor of random noise at the bottom of the dynamic range. More recent innovations in dithering use what is called a "triangular noise function", in other words, noise that is filtered so the center of energy is around the Nyquist frequency of 22.05kHz, which provides the maximum rate-of-toggling of the LSB—in effect, creating a PWM carrier centered around 22.05kHz to convey information below the LSB level.

However… dither + PCM systems require averaging to make them work. This aspect is frequently overlooked in discussions of dithering. If you average over 1 second, as is frequently done in published S/N measurements, you're averaging over 44,100 samples, which does a really good job of smoothing things out.

But… the human ear does not use averaging over 1 second. That is merely a convenient number that makes the measurements look good. It is not actually how we hear. A more likely number for subjective averaging might be somewhat less, perhaps in a 10 millisecond to 50 millisecond range. To the ear, 1 second is a very long time; it translates to a reflection from an object 573 feet away, and is heard as a discrete echo.

10 milliseconds corresponds to 441 samples at the CD rate, and 50 milliseconds to 2,205 samples. There's not as much averaging going on as published specs for dithered systems indicate. This might be one of the subtler reasons that high-rate (88.2kHz, 176.4kHz, etc.) systems sound better; there's more averaging going on, allowing the PWM aspect of the modulation system to work more efficiently. In perceptual terms, the more efficient the averaging-over-time, the lower the distortion will be—and a two or four-fold reduction in low-level distortion is a big deal.
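The sample counts are simple to tabulate for other rates. A small Python sketch, with a hypothetical helper name, using the 10 and 50 millisecond windows suggested above:

```python
def samples_in_window(sample_rate_hz: int, window_ms: int) -> int:
    """Number of samples that fall inside an averaging window of window_ms milliseconds."""
    return sample_rate_hz * window_ms // 1000

for rate in (44_100, 88_200, 176_400):
    print(f"{rate:>7} Hz: {samples_in_window(rate, 10):>5} samples in 10 ms, "
          f"{samples_in_window(rate, 50):>5} samples in 50 ms")
```

Doubling or quadrupling the sample rate doubles or quadruples the number of samples available for averaging inside the same window.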

It was discovered in the late Eighties that random dither noise MUST be statistically independent for each channel. This was discovered the hard way when digital consoles were introduced in the professional market. When the engineer lowered the faders, the soundstage shrank down to mono, destroying the illusion of a real space. To save money, the digital mix-board used the same random-number generator for all the tracks. When each channel got an independent random-number generator—not cheap back in the late Eighties—the mono-ing problem disappeared.

The statistically-independent problem crops up elsewhere in the digital-audio chain. Dither is required for nonsynchronous sample-rate conversion (for example, 96 to 44.1kHz), as well as digital equalization and other types of signal processing. This isn't just at the record end. This is anywhere in the signal chain between the original ADC to the final DAC. Whenever digital dither is added to a stereo signal, it MUST be statistically independent for each channel. Otherwise, as the signal fades down in level, the stereo impression will fade to mono.
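One way to see the difference between shared and independent dither is to measure the correlation between the two channels' noise. The Python sketch below is only an illustration under simple assumptions: uniform dither, a seeded pseudo-random generator standing in for each channel's noise source, and hypothetical function names. It says nothing about how any particular console was built.

```python
import random

def dither(n_samples: int, rng: random.Random) -> list[float]:
    """Uniform dither of +/- 0.5 LSB from the given generator."""
    return [rng.random() - 0.5 for _ in range(n_samples)]

def correlation(a: list[float], b: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# One generator feeding both channels: the noise is identical left and right.
shared = dither(10_000, random.Random(1))
print("shared generator:      ", round(correlation(shared, shared), 3))

# One generator per channel: the noise is (very nearly) uncorrelated.
left = dither(10_000, random.Random(2))
right = dither(10_000, random.Random(3))
print("independent generators:", round(correlation(left, right), 3))
```

Noise that is identical in both channels images dead center, which is consistent with the collapse toward mono described above once the dither dominates the signal.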

Although dither takes performance from unacceptable levels to good-to-excellent, the bit-resolution of the complete digital system is still a key parameter of overall performance. Analog, although plagued by noise and distortion, has no definable bit-resolution, although low-level distortions like Class AB transitions can greatly compromise performance.

Remove the noise from analog, and it is better; remove the noise from digital (this includes PCM and DSD) and performance becomes unacceptable. That's not a small difference. It's night and day.

Here's a poetic metaphor: think of analog noise like a layer of fog over mountains. You're flying over the layer of fog, and the mountains show through every now and then, and are sometimes right above the layer of fog. The mountains are reality: the motion of the microphone diaphragm, before it is amplified, has resolution limited only by quantum events, far, far below electronic noise in the amplifier chain. As far as we are concerned, the downward resolution is unlimited.

We can't see it all, of course, because it is hidden by noise. But noise in a well-designed analog system is completely uncorrelated to signal; it is independently generated by various analog noise sources. So the fog of noise floats above the mountains, hiding the details, but the mountains are really there, and details come through now and then.

One of the biggest differences between instrumentation and human perception is we have memory and expectation. We not only remember the sound of a singer, a piano, and a saxophone, we have an expectation of what they should sound like, and are disappointed when they sound "wrong". We can reconstruct the sound from fragments; the visual system does this all the time, and so does the audio perceptual system. If we only see glimpses of the mountain range through the fog, we have a good feel for what's underneath the fog, particularly if we've seen the mountains without fog—as in a live performance, for example, with no electronic intervention at all.

This is why we can hear 10 or 20dB "into the noise", while a measurement instrument cannot. The instrument has no memory, and no expectation. It only records the top of the fog layer. If there are mountains underneath, that's just a ripple in the noise level. This is why dither is so essential to digital systems; it lets the listener hear a distance into the fog.

What distinguishes analog systems is the independence of reality—the acoustic, analog reality of real sounds in a physical space—from electronically generated noise. The quiet hiss from the master-tape, the rush of noise from the record surface, are just an overlay of fog that masks a real event, with the analog proportionality retained all the way through from microphone to loudspeaker.

Digital is not the same. Without dither, not only would there not be fog, but a glistening white floor, with blocky objects coming out of it. Dither removes the hard floor and the blockiness, and replaces it with the fog of noise. To some degree, like analog, the dither can resolve things below the noise floor, but we are not talking about the nearly infinite resolution of analog. No. This is the area where the differences between dither implementations come in, and the additional requirement of noise-shaping for some types of converters.

Ladder (or R-2R) ADCs and DACs originally came out of the NASA and aerospace missile-telemetry world; Thomas Stockham re-purposed a rack-sized converter from Honeywell for the first-generation "SoundStream" 50/16 system in the Seventies.

http://en.wikipedia.org/wiki/Thomas_Stockham

As device integration reduced costs and improved performance in the Eighties, prices for 16-bit converters dropped, and 20-bit converters of much higher performance were introduced in the late Eighties.

The drawback of building a high-performance ladder converter is cost: it requires extremely precise resistor arrays that have to be laser-trimmed in production, which slows down the pace of the production line. Ladder DACs also require multiple power supplies, and the higher-performance versions required additional external analog circuits—current-to-voltage converters, lowpass filters, buffers, etc. All of this added to cost, and makes them impractical for low-power, single-supply (3, 5 or 12-volt) applications.

At the same time, Sony and Panasonic designed radically different "1-bit" ADCs and DACs for the portable, battery-powered market (remember the Sony Discman?), and then introduced the converters into the broader audio market. Instead of an expensive ladder of resistors (which need to have parts-per-million accuracy) associated with an array of FET switches (16 switches for a 16-bit DAC, 20 for a 20-bit DAC, and so on), there's just one switch, and it only has two levels, ON and OFF. There's no resistor ladder at all. How does a single switch deliver 16, 18, or 20-bit accuracy? How do you get from 2 levels to a million or more?

If the converter is many times faster than the original sample rate, you can improve the resolution. The Sony 1-bit converter, by toggling the switch on and off 64 times within a single CD/Red Book 22.7 microsecond sample, can provide 64 possible levels (within that single sample), or put another way, 6-bits of resolution.

If we look inside that single 22.7 microsecond sample, the maximum possible level is represented by (spaces inserted for readability):

… 1111 1111 …

The lowest possible level is:

… 0000 0000 …

Both PCM and the single-bit technique look the same at the highest and lowest levels. What about the exact middle level (in analog, zero signal)?

… 0101 0101 …

… 1010 1010 …

This is where it parts company with PCM. In PCM, those two numbers would represent completely different values; in the technique used by single-bit DACs, the two numbers are the same. All the DAC does is simply add all the ones together in a single sample, regardless of order. In a single 22.7 microsecond sample, if there are 64 possible ones, there are 64 possible levels, no more, no less. The switch is either on or off, with no intermediate values; all we can do is arrange the modulation pattern of the ones and zeroes.
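The contrast between the two readings of the same bit pattern is easy to show in Python. This is a sketch with hypothetical function names, using the 8-bit excerpts printed above rather than a full 64-bit frame.

```python
def pcm_value(pattern: str) -> int:
    """Read the pattern as a binary number, where bit position matters (PCM-style)."""
    return int(pattern.replace(" ", ""), 2)

def one_bit_level(pattern: str) -> int:
    """Read the pattern the single-bit way: count the ones, ignoring their order."""
    return pattern.replace(" ", "").count("1")

for pattern in ("1111 1111", "0000 0000", "0101 0101", "1010 1010"):
    print(f"{pattern}: PCM value {pcm_value(pattern):3d}, "
          f"single-bit level {one_bit_level(pattern)}")
```

The two mid-level patterns differ as PCM numbers but add up to the same count of ones, so the single-bit converter treats them as the same level.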

Well, 6-bits is pretty far short of 16-bits. Where do the other ten come from, or to put it another way, where is the finer resolution going to come from?

Well, we could run the single-bit DAC even faster. How much faster? Unfortunately, for 16-bit resolution, that would be 65,536 times faster, or 2.89GHz. That's pretty fast. Maybe the switch can go that fast, but the data storage size is comparable to uncompressed HDTV. For 20-bit resolution, it gets really ridiculous, 46.24GHz, deep into microwave territory, and too fast for silicon integrated circuits.
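The clock rates quoted here come from multiplying the Red Book sample rate by the number of levels needed. A short Python sketch of that arithmetic (the function name is hypothetical):

```python
RED_BOOK_RATE_HZ = 44_100

def brute_force_clock_hz(bits: int, sample_rate_hz: int = RED_BOOK_RATE_HZ) -> int:
    """Switch rate needed if every one of the 2**bits levels gets its own time slot."""
    return (2 ** bits) * sample_rate_hz

for bits in (6, 16, 20):
    print(f"{bits:2d} bits: {brute_force_clock_hz(bits) / 1e9:.4g} GHz")
```

The 6-bit case works out to 64 times 44.1kHz, or about 2.8MHz, which matches the 64 toggles per sample described earlier.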

So just raising speed isn't going to do what we want. We run out of switch speed, and more importantly, data storage. This is where "noise shaping" comes in.

I am not clear on how noise shaping really works. I welcome any and all comments from the PFO community. Please enlighten me. The AES papers are filled with digital jargon and impenetrable math, and the white papers from the manufacturers don't make a lot of sense to me. I'm a tech writer by trade, and when something doesn't make sense, I'll struggle along until I can find somebody that can explain in human-understandable terms.

To the best of my understanding, noise shaping is digital feedback wrapped around the switch (or switch array), aided by dither-noise. The digital feedback forces the switch to generate a pattern of ones and zeroes that accumulates to the correct analog value. 8th-order noise-shaping is actually 8 layers of feedback wrapped around that switch, with each layer improving the S/N by 6dB. The "noise" (which is actually all of the error terms in the conversion process) is pushed above the audio band.

http://en.wikipedia.org/wiki/Noise_shaping
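For readers who want to see feedback around a switch in its simplest form, here is a textbook-style first-order loop in Python. It is a sketch under stated assumptions, not the 8th-order design found in commercial chips: a single accumulator, inputs scaled to the range 0 to 1, no dither, and hypothetical function names. The feedback is the subtraction of each emitted one from the running total, so the leftover error is carried forward instead of being thrown away.

```python
def first_order_modulator(samples: list[float], oversampling: int) -> list[int]:
    """Turn each input (0.0 to 1.0) into `oversampling` one-bit outputs."""
    accumulator = 0.0
    bits = []
    for x in samples:
        for _ in range(oversampling):
            accumulator += x            # integrate the input
            if accumulator >= 1.0:      # the "switch": only on or off
                bits.append(1)
                accumulator -= 1.0      # feedback: subtract what was just output
            else:
                bits.append(0)
    return bits

def average(bits: list[int]) -> float:
    """Crude stand-in for the lowpass filter that follows a real converter."""
    return sum(bits) / len(bits)

stream = first_order_modulator([0.3] * 100, oversampling=64)
print("first 32 bits:", "".join(str(b) for b in stream[:32]))
print("average level:", round(average(stream), 4))
```

The ones are spread through the stream rather than bunched together, and their running average converges on the input level, which is the behavior the paragraph above describes as a pattern that "accumulates to the correct analog value".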

As an old analog guy, I can easily imagine a lot can go wrong with many layers of feedback wrapped around the most nonlinear element possible, a switch, which is either "on" or "off". Except for the extremely brief switching transition, there is nothing in between. Conventional feedback theory assumes a moderate degree of linearity in the element to be servo-corrected; otherwise, the feedback can go into limit states, or the comparator at the input of the system can get saturated with a large error term.

Apparently odd things do happen in single-bit converters; these are innocently described as "idle tones", although they can appear in the middle of musical stimuli, happen at unexplained times (and are difficult to reproduce), and are severe enough that commercial single-bit DACs used shutoff switches when digital silence was detected (which made THD+Noise measurements look better).

One of the things that makes digital difficult to understand is that the same words mean different things in analog and digital engineering. In the analog world, "noise" is assumed to be random and uncorrelated with the signal. If it is correlated, it is usually categorized as a form of distortion, or noise modulation. In the digital world, noise can have many different origins, may or may not be signal-correlated, and if correlated, may have a deterministic yet chaotic relationship to the desired signal.

Dither noise is essential for many parts of digital signal processing, most importantly, at the encoding end, as well as any subsequent resampling or equalization. It is essential that the dither noise is uncorrelated with the signal, and not only that, but independent for each channel. In practical circuits, dither-noise is created by pseudo-random noise algorithms, but these are carefully chosen to be very long (and not audible) and as close to random as possible.

The "noise" from noise-shaping appears to be rather different. It's the accumulated errors from forcing the switch to attain desired analog values within a given period of time. From our earlier example of 64 samples within the single 22.7 microsecond Red Book sample, there are only a finite number of possible patterns of ones and zeroes. It's a large number, but not infinite.

The discrepancy between the desired analog value, and the pattern the feedback system actually generates, is called "noise", but it's not the kind of noise we see in analog systems. It has a causal relationship to the analog signal, but a very complex one; in mathematical terms, it is a chaotic system, deterministic but inherently unpredictable. Chaotic behavior is the signature of high-order feedback systems; it is seen in weather systems, population dynamics, and financial markets (which is why quantification of "market risk" is inherently unreliable).

Professor M.O.J. Hawksford at the University of Essex has published a series of AES papers on the mathematical and real-world problems of single-bit ADCs and DACs.

http://www.essex.ac.uk/csee/research/audio_lab/malcolms_publications.html

The idle-tone and overload-stability problems of single-bit converters were sufficiently intractable they were abandoned by the early to mid-Nineties, and replaced by hybrid converters using 5- to 6-bit switch/resistor ladders with PWM modulation techniques and noise shaping. These were called "delta-sigma" converters, or "sigma-delta" converters. Confusingly, both names mean the same thing. The greater linearity of the underlying 5- to 6-bit converter (compared to a single-bit switch) relaxed the demands on the digital-feedback noise-shaping system.

http://www.scalatech.co.uk/papers/jaes496.pdf

http://www.diyaudio.com/forums/digital-source/15439-how-does-delta-sigma-dac-work.html#post179844

The requirement for feedback-based noise shaping is still there, though: the ESS Sabre 9018, which is probably the most advanced delta-sigma converter in current production, most likely operates at 11.2896MHz, or 256fs. (Published data on the internals of the 9018 is not readily available; the 256fs speed is a best-guess. It might go all the way up to 45.1584MHz.) The ESS 9018 uses a 6-bit internal switch array (again, best guess from scanty information), so 6-bits + 8-bits from 256fs operation gives 14-bit resolution for each 22.7 microsecond Red Book sample. Add noise shaping, and the performance goes up to the 20-bit level (THD+Noise specified at -120dB). ESS specifies the Dynamic Range of the 9018 converter at -135dB, for a total range of 22.5-bits.

http://www.esstech.com/PDF/ES9018%20ES9012%20Product%20Brief.pdf
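The bit-counting in the paragraph above can be reproduced in Python. The sketch only restates the text's arithmetic: the 6-bit switch array and 256fs clock are the article's own best guesses, the function names are hypothetical, and the decibel-to-bits conversion uses the usual figure of about 6.02dB per bit.

```python
import math

RED_BOOK_RATE_HZ = 44_100
DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB

def bits_from_oversampling(ratio: int) -> float:
    """Extra bits gained by counting time slots within one Red Book sample."""
    return math.log2(ratio)

def bits_from_db(db: float) -> float:
    """Equivalent bit depth for a given dynamic range in decibels."""
    return db / DB_PER_BIT

print("256fs clock: ", 256 * RED_BOOK_RATE_HZ / 1e6, "MHz")
print("1024fs clock:", 1024 * RED_BOOK_RATE_HZ / 1e6, "MHz")
print("6-bit array + 256fs:", 6 + bits_from_oversampling(256), "bits")
print("120 dB is about", round(bits_from_db(120), 1), "bits")
print("135 dB is about", round(bits_from_db(135), 1), "bits")
```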

The Philips TDA154x series and Burr-Brown PCM 63, 1702, and 1704 operate in a completely different way from the delta-sigma DACs that dominate the market today. They are "flash" DACs with single-pass conversion; there's no feedback of any kind. The signal goes straight through the switch array and that's it, off to the analog world for amplification. Once the difficult part of current-to-voltage conversion is accomplished, the rest is easy—basically, a high-quality microphone preamp and line driver, not the most difficult thing to do in the analog world.

The TI/Burr-Brown PCM 1704, which is the pinnacle of non-feedback ladder DACs, has performance at the 17-bit level (THD+Noise specified at 0.0008%, or -102dB). TI/Burr-Brown specifies the Dynamic Range at -112dB, for a total range of 19 bits.

http://www.ti.com/lit/ds/symlink/pcm1704.pdf
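Converting a THD+Noise percentage into decibels is a one-line calculation, sketched below in Python with a hypothetical function name and the figure quoted above.

```python
import math

def percent_to_db(percent: float) -> float:
    """Express a distortion ratio given in percent as decibels relative to the signal."""
    return 20 * math.log10(percent / 100)

thd_db = percent_to_db(0.0008)
print(f"0.0008% THD+N = {thd_db:.1f} dB")
print(f"equivalent to about {abs(thd_db) / (20 * math.log10(2)):.1f} bits")
```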

That looks bad for the PCM 1704, doesn't it? Although both are "24-bit" parts, the ESS 9018 measures about 3 bits better. Not only that, the 9018 is a stereo DAC that costs about $40, while the 1704 is a mono DAC (you need two) that costs $75—and is not recommended for new designs.

http://www.ti.com/product/pcm1704

The 9018 can be used in voltage mode, with only a simple buffer to the outside world (although best performance is obtained with a current-to-voltage converter), while the 1704 requires a current-to-voltage converter, and a very high-performance one at that, since the PCM 63, 1702, and 1704 have square-wave components that extend beyond 20 MHz (Matt Kamna and I have measured that for ourselves; it's real). That makes the requirement for the PCM 1704 current-to-voltage converter far more severe than anything the one-size-fits-all 5532 or 797 opamps seen in many high-end DACs can meet. It actually requires a slew rate of 1000V/uSec at low distortion.

Delta-sigma DACs rely on digital feedback; even with very high operating speeds in the MHz range, they would fail to meet 16-bit performance without noise-shaping systems. Spend a little time with the Hawksford papers describing noise-shaping:

http://www.essex.ac.uk/csee/research/audio_lab/malcolms_publications.html

http://peufeu.free.fr/audio/extremist_dac/files/1-Bit-Is-Bad.pdf

http://sjeng.org/ftp/SACD.pdf
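
For readers who want a feel for what noise shaping buys, here is a deliberately minimal sketch. It compares plain 6-bit quantization at 256fs with a first-order error-feedback loop, then averages each block of 256 samples as a crude stand-in for the decimation filter. Everything here is an assumption chosen for illustration: the loop is first-order (real chips use far higher orders), clipping is not modelled, and the names quantize and inband_rms_error are made up for this example.

```python
import math

OSR = 256               # oversampling ratio (the 256fs best guess), an assumption
STEP = 2.0 / 64         # step size of a 6-bit quantizer spanning -1..+1
N_BLOCKS = 64           # number of 1fs output samples to simulate

def quantize(v):
    """Round to the nearest quantizer level (round half up, no clipping)."""
    return math.floor(v / STEP + 0.5) * STEP

def inband_rms_error(signal, output):
    """RMS error after averaging each block of OSR samples (crude low-pass)."""
    block_errors = []
    for start in range(0, len(signal), OSR):
        pairs = zip(signal[start:start + OSR], output[start:start + OSR])
        block_errors.append(sum(o - s for s, o in pairs) / OSR)
    return math.sqrt(sum(e * e for e in block_errors) / len(block_errors))

n = OSR * N_BLOCKS
signal = [0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]  # one slow sine cycle

# Plain quantization: each sample is rounded independently.
plain = [quantize(x) for x in signal]

# First-order error feedback: subtract the previous rounding error before rounding.
shaped = []
err = 0.0
for x in signal:
    v = x - err
    y = quantize(v)
    err = y - v          # this error is pushed toward high frequencies
    shaped.append(y)

plain_rms = inband_rms_error(signal, plain)
shaped_rms = inband_rms_error(signal, shaped)
print(f"in-band RMS error, plain 6-bit quantizer : {plain_rms:.2e}")
print(f"in-band RMS error, first-order shaping   : {shaped_rms:.2e}")
```

With feedback, the rounding errors inside each block largely cancel, so the averaged error can never exceed one quantizer step divided by 256. The plain quantizer has no such guarantee. That cancellation is the whole trick, and it is also why the result depends on the behavior of the loop rather than on the converter alone.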

Noise-shapers are extremely complex and difficult to analyze under dynamic conditions, far more complex than is possible in the analog world, where parts-tolerance errors and assorted stray capacitances at high frequencies would make the system go unstable. There's a reason that analog engineers don't get too complicated with feedback systems; what works in the simulator will go crazy in the real world. This keeps both the magnitude and complexity of analog feedback systems to a realistic minimum. In digital, though, the sky's the limit, with astonishingly high-order systems that change dynamically with signal.

Much of the ESS presentation at the 2011 Rocky Mountain Audio Festival detailed the strange and unexpected behavior of high-order noise-shaping and interpolation in (competing) delta-sigma DACs. I was at that presentation, and was stunned at all the bizarre artifacts that ESS had documented, like 20dB jumps in the "noise" floor at certain, very precise DC levels as the delta-sigma DAC was slowly swept across the full dynamic range.

http://www.youtube.com/watch?v=1CkyrDIGzOE

If you haven't seen this presentation, you should. It'll knock any complacent ideas that "DACs are audibly perfect" out of your head. There are lots of things wrong with delta-sigma DACs, most stemming from the fact that they are low-bit devices using extremely complex feedback systems to synthesize high-bit performance.

That puts the 3-bit difference between the class-leading ESS 9018 and the class-leading Burr-Brown 1704 in a different light. The performance of the 1704 is the native performance, akin to a zero-feedback amplifier. What you hear is not an algorithm, but the device itself.

The ESS 9018, like all the other delta-sigma converters out there (including the latest Burr-Brown products), realizes its performance with extremely complex digital-feedback algorithms. You are hearing the algorithm, not the 5- or 6-bit converter, and a lot of very strange things can happen with that algorithm. ESS spent several years and a lot of engineer-hours trying to find out what the "golden ears" were hearing, and found, measured, and then corrected several different problems. Given the complexity of noise-shaping techniques, though, there could still be some surprises to be discovered.

What I can say on a subjective basis is that the ESS 9018 is the closest of the delta-sigma family to the sound of the best ladder converters, but it still isn't quite the same. Perhaps it isn't the converter itself; maybe the residual difference comes down to the current-to-voltage conversion. I'm not familiar with the RF emission spectra of the ESS 9018. It operates at much higher speeds than the sub-MHz PCM1704, but there could be internal components that take the sharp edges off the output of the DAC.

[End of Part I. Stay tuned for Part 2 in Issue 66 of PFO, March/April of 2013!]