Ruminations on “Absolute Fidelity”, Stereo Imaging, and Beyond High Fidelity

Revised January 17, 2015

ABSTRACT

An audio system’s playback volume directly affects the perceived depth of virtual stereo images. A stereo system’s virtual images don’t duplicate the real images of live performances. Virtual stereo images are generally more delineated and focused than real images, but that may actually enhance one’s ability to connect emotionally with the individual performers and the music.

Absolute Fidelity

If one defines the “Absolute Fidelity” of a playback system as its ability to make a critical listener believe they are sitting at the recorded live event, then I think judging the fidelity of a playback system is akin to a blind man looking into a black hole and reporting what he sees. To begin with, even if the microphones were placed where the listener sat at the live performance so that the listener could compare the live sound with the recorded sound, there’d be no way for the listener to know how much of the quality of the reproduced sound was due to the quality of the recording and media and how much was due to the fidelity of the playback system to the recorded media.

Many authorities on high fidelity believe that only recordings made with a binaural dummy-head (DH) microphone setup, with two microphones whose pickup patterns match that of the human ear, and played through headphones really allow one to approach the experience of being at a live performance. Potentially, digital signal processing (DSP) that emulated the functionality of the Xbox 360 Kinect Sensor, or other technologies, could track the movements of the listener's head and adjust the sound heard in the headphones to how it would change if the listener were moving their head at the live performance. However, that doesn't mean that dummy-head recordings can't sound quite good on stereo speakers placed in the usual position in front of the listener, and without prior knowledge of the DH microphone setup most listeners probably wouldn't know that the recording wasn't true stereo. Recordings made with more than two microphones, such as pan-pot multi-mono, can’t possibly be faithful to the goal of placing the listener at the position of the microphones at a live performance, because a listener’s head and two ears can only be at one point in space at any given instant in time. The word stereo comes from a Greek word meaning solid, and by the dictionary definition of a stereo image the degree of 3-D (stereo/holographic imaging) experienced, whether visual or auditory, is determined by the amount of difference between what is seen by the left and right eyes or heard by the left and right ears respectively. Therefore, neither distant objects nor distant sound sources convey much stereo information, because each eye or ear is seeing or hearing practically the same thing.

The sound sources at live performances are essentially point sources that are real images and don’t change their apparent positions laterally or vertically very much with relatively small movements of one’s head, but a stereo pair of speakers presents the listener only with virtual images, which are subject to relatively greater changes with relatively small lateral or vertical head movements. The pickup patterns of most microphones differ significantly from the pickup pattern of the human ear and are major contributors to the differences between real and virtual images. During playback over a stereo pair of speakers, the ambience of the performance venue and its audience, if any, is heard as being all in front of the listener as opposed to around and behind the listener, and the acoustics of the playback venue are superimposed on it, which is another reason that the virtual stereo image over a pair of stereo speakers differs significantly from the real images of live performances. But binaural recording and playback, when done correctly, can truly reproduce the acoustics and image of the live event. I believe speakers with drivers that are omnidirectional, dipolar, bipolar, and/or aimed at or fed electronically processed signals so that they interact with the acoustics of the arbitrary listening venue, in order to provide the listener with a synthetic sense of the ambience of the live performance's venue, are not conducive to recreating the absolute sound, because the acoustics of any particular playback venue have no relationship to the acoustics of the live performance's venue. For any or all of the foregoing reasons, one shouldn’t be surprised that virtual stereo images with their hyper-focus/localization differ from live images.

Our localization of sound sources is also affected by our view (eye-brain mechanism) of the positions of the performers and sound sources at a live performance (even when amplified), through what I call the “ventriloquist effect.” Our visual cues seem to dominate where we localize sound sources, as they do when a ventriloquist “throws his voice” to the moving mouth of his dummy, i.e., the apparent source of the voice we hear depends solely upon whose mouth we see moving. If one listens with their eyes closed, the source of the dummy’s voice may still seem to have moved to the dummy once we’ve “seen” it move with our eyes open, which is an example of just how much more our visual cues dominate our localization of a sound source than our aural sense does. The point I’d like to make is that the traditional, simplistic, idealist notion of “the absolute sound” (TAS) of live performances of unamplified acoustic instruments and voices as a reference for the fidelity of an audio system is impossible to realize in the real world of music reproduction, regardless of how much we’d like to believe it to be true, at least where virtual stereo imaging is concerned. There have been many attempts to heighten image specificity through the use of crosstalk cancellation over stereo speakers, because it seems that for many listeners image specificity is the holy grail. On the other hand, there have been attempts to make headphones seem to image more in front of the listener, as if they were speakers. The only reference for how much virtual stereo imaging (holographic imaging) is reproducible from a given recording is listening to the recording on the system that makes it sound the most stereo, i.e., live performance images aren’t direct references for the hyper-focused virtual stereo images of recorded music.

However, having said that, I believe that many other aspects of the absolute sound of live performances are directly relevant to the fidelity and musicality of reproduced live performances, again with the caveat that we have no way to distinguish the quality of the media from the quality of the playback, although we can judge the net result of recording, media, and playback. It’s only that with any particular recording we can’t distinguish which aspect of the playback fidelity is due to which. I think the audio system that makes the most recordings sound most enjoyable probably has the highest fidelity and the greatest musicality, with the greatest ability to emotionally connect the listener to the performance. Because the deviations of recordings and media from perfect fidelity tend to be random (except for inverted polarity*), on average the highest fidelity audio system should be the one that makes the most recordings sound musical and emotionally involving, but since the only thing that matters is what happens when the music meets your ears, it’s still a subjective standard at best.

Some aspects of fidelity are absolute polarity, phase response, frequency response, dynamic range, harmonic distortion, intermodulation distortion, Doppler effects, time-alignment, transient intermodulation distortion, noise, timing errors, etc., but not the rough, harsh sound of comb filtering. I’ve excluded comb filtering effects because live performances of multiple sound sources frequently have a natural amount of comb filtering that’s virtually impossible to distinguish from the potential comb filtering of the playback of their recording, unless one was present at the live performance, sitting where a “coincident pair of stereo microphones” was placed, to serve as a reference, because only microphones that are the same distance from a sound source won’t add comb filtering to the recorded sound. Recordings of relatively small groups of musicians used for judging fidelity usually have less natural comb filtering than recordings of massed instruments and voices that may simultaneously sound the same notes at different distances from the microphones or at slightly different times and pitches. A listener needs to be the same distance from the left and right channels or he or she will hear comb filtering of the images of stereo sound sources.

Using depth cues to judge the distance of a sound source is quite a different thing from the actual experience of real depth via the two different sonic images of a sound source at each of our two ears, which is analogous to our stereo vision via two different images at our eyes that our brain fuses into a single stereo image with depth (again, the word stereo is taken from a Greek word that means solid or 3-dimensional). And whether or not speakers are against a wall has nothing to do with their ability to present the depth of a stereo image. In fact my speakers are against the front wall and have the best "virtual depth/stereo image" that I've heard. Once one has seen speakers against a wall, then perhaps closing their eyes can overcome their disbelief that an image of the sound sources can seem to be originating from behind the wall. I can prove that by having listeners who've never seen or heard my stereo first listen blindfolded.

Most audiophiles are aware of the Fletcher-Munson Curve/Effect, which describes the effect of a sound’s loudness on its perceived frequency response. That’s the reason some components have loudness compensation switches that increase the amount of the bass frequencies, and in some cases the high frequencies, relative to the midrange frequencies in order to compensate for playback at lower sound pressure levels than live, because the sensitivity of our hearing tends to be attenuated at both frequency extremes as sound pressure is reduced. Using loudness compensation is a matter of taste, but it alters the frequency response of the reproduced music relative to the frequency response of a live performance whose sound pressure level was the same as that of the reproduced music. Similarly, early stereo recordings that were pan-potted (“pan-pot stereo” is a series of monophonic tracks that are panned to place them anywhere from the extreme left to the extreme right on the virtual soundstage) aren’t true stereo at all, and would never be confused with 3-D images any more than a panorama photograph would be confused with a 3-D photograph. Inferred depth perspective is analogous to a one-eyed person who judges the distance of a familiar object by its size and by how their eye muscles accommodate to focus, but who nevertheless doesn’t experience the object’s real depth. Even people with two eyes have very little depth perception of distant objects because each eye is seeing almost the same thing, which is why binoculars with widely spaced objective lenses increase the depth perception of distant objects over that of the naked eye. Those same principles apply to distant versus close sound sources at a live musical performance, because each ear receives nearly the same sound from a distant sound source.
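The shrinking of stereo information with distance can be put in rough numbers. The sketch below is only an illustration under simple assumptions: it treats the two eyes (or two lenses) as points a fixed baseline apart and computes the angle that baseline subtends at an object straight ahead. The 65 mm eye spacing and the 130 mm lens spacing are assumed values, and `parallax_deg` is a name invented here. The same geometry could be applied to the ears with a different baseline, although that ignores the shadowing of the head.

```python
import math

def parallax_deg(baseline_m: float, distance_m: float) -> float:
    """Angle in degrees that a baseline subtends at a source straight ahead."""
    return math.degrees(2.0 * math.atan(baseline_m / (2.0 * distance_m)))

EYE_SPACING_M = 0.065       # assumed typical spacing between the eyes
BINOCULAR_SPACING_M = 0.13  # hypothetical widely spaced objective lenses

for distance_m in (1.0, 10.0, 100.0):
    eyes = parallax_deg(EYE_SPACING_M, distance_m)
    wide = parallax_deg(BINOCULAR_SPACING_M, distance_m)
    print(f"{distance_m:6.1f} m: eyes {eyes:.4f} deg, wide baseline {wide:.4f} deg")
```

The angle falls roughly in proportion to distance, and doubling the baseline roughly doubles it, which is the sense in which widely spaced objective lenses restore some depth perception at a distance.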

Unless one is up close to a small group at a live performance, or perhaps to a large orchestra, each ear is hearing very close to the same thing, just as each eye is seeing pretty much the same thing when viewing objects at a distance. Thus neither the sonic nor the visual experience offers much stereo imaging, because in either case the amount of perceived stereo depends upon the differences between what’s heard at the left and right ears or seen by the left and right eyes. If you doubt this proposition, then next time you’re at a live performance close your eyes and try to place the various sound sources. At home two stereo speakers create a virtual soundstage, whereas a live performance is its own soundstage with sound sources that are essentially point sources. At home with two stereo speakers both reproducing the sound sources of stereo microphone recordings, relatively small movements of one’s head toward one or the other speaker may dramatically change the apparent lateral position of the virtual images of the various sound sources relative to where they appear when the listener is equidistant from both stereo speakers. This is due to the relative amplitude changes of the sound source from each speaker, because we judge the positions of sound sources on the soundstage by their relative loudness in each ear, e.g., when the sound pressure of a sound source is equal at both ears its image is in the middle of the virtual soundstage. Also, when one isn’t the same distance from both speakers, the sound of a source will arrive at different times from each speaker and thus may sound harsh or rough due to comb filtering.
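To give a feel for the size of these off-center effects, here is a minimal sketch under idealized assumptions: both speakers are treated as point sources in a free field playing the same centered signal at the same level, the speed of sound is taken as 343 m/s, and the room is ignored. The layout (speakers 3 m apart, listener 3 m back) and the function names are hypothetical. The first comb filter notch is estimated as one over twice the arrival delay, which strictly holds only when the two arrivals are equal in level and sum at a single point.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees C

def speaker_distances(listener_x_m, listener_y_m, half_spacing_m):
    """Distances to speakers placed at (-half_spacing, 0) and (+half_spacing, 0)."""
    left = math.hypot(listener_x_m + half_spacing_m, listener_y_m)
    right = math.hypot(listener_x_m - half_spacing_m, listener_y_m)
    return left, right

def off_center_effects(listener_x_m, listener_y_m, half_spacing_m):
    """Return (level difference in dB, arrival delay in s, first notch in Hz or None)."""
    left, right = speaker_distances(listener_x_m, listener_y_m, half_spacing_m)
    # Inverse-distance law: positive means the right speaker is louder at the listener.
    level_diff_db = 20.0 * math.log10(left / right)
    delay_s = abs(left - right) / SPEED_OF_SOUND_M_S
    # Two equal copies offset by delay_s cancel first at f = 1 / (2 * delay_s).
    first_notch_hz = 1.0 / (2.0 * delay_s) if delay_s > 0 else None
    return level_diff_db, delay_s, first_notch_hz

centered = off_center_effects(0.0, 3.0, 1.5)
shifted = off_center_effects(0.5, 3.0, 1.5)  # head moved 0.5 m toward the right speaker
print("centered:", centered)
print("shifted: ", shifted)
```

In this hypothetical layout a half-meter shift toward one speaker produces a level difference of a little over 1 dB and an arrival delay of a little over 1 ms, while the centered seat shows neither.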

Image Depth Perspective

I’d like to bring to your attention a rarely considered element of fidelity, specifically how the playback loudness level affects the perceived depth perspective of virtual images, i.e., the relative distance of near and far sound sources from the listener, because I believe that a system’s playback level directly affects the perception of an image’s depth perspective. There are several kinds of aural depth cues that we use to determine the relative distance of a sound source: a sound source’s loudness relative to other sound sources, its frequency response at different distances from the listener, reverberation, and echo times/effects. Therefore, even a mono recording may have aural depth cues that permit us to infer the relative distance of one sound source to another. One effect of playing at a level different from that picked up at the microphone(s) (mono recordings may have multiple tracks from multiple microphones mixed into a single mono track) arises because the Fletcher-Munson Effect alters our perception of frequency balance and therefore our perception of the distance of a sound source. When I was in the tenth grade, I lived in San Diego, California, Point Loma specifically, west of San Diego Bay across from downtown San Diego proper. Sometimes at dusk I’d view the moon rising over the downtown skyline through my 10 power binoculars. The 10 power binoculars made the moon seem 10 times closer and likewise made the downtown skyline appear 10 times closer. But that dramatically exaggerated the size of the moon relative to the skyline, because if instead of viewing the skyline through the binoculars I’d physically moved ten times closer to the skyline and viewed it without the binoculars, it would have appeared 10 times closer (bigger), as it did through the binoculars, but the moon wouldn’t have appeared any bigger.
The effect of viewing the skyline and moon through the binoculars was to foreshorten the apparent visual distance between the skyline and the moon by a factor of 10, because the 10 power binoculars magnified everything equally. And had I been looking through the wrong end of the binoculars, the exact opposite would have occurred, i.e., the moon’s apparent distance from the skyline would have appeared to be 10 times farther than with the naked eye. Because the spacing of the two video camera lenses is usually wider than the spacing between the average person's two eyes, there's also a tendency for 3D video and cinema images, like the images seen through the widely spaced objective lenses of some binoculars, to have exaggerated 3D effects that tend to improve the visual perception of details, which enhances the overall viewing experience.

Volume controls on amplifiers have exactly the same effect on the depth perspective of sound sources’ virtual images, because just as binoculars magnify everything equally, volume controls make everything louder or softer by an equal amount. That is not analogous to physically moving 10 times closer to the skyline of San Diego. For example, when one moves to a seat twice as close to the front of the stage at a live performance, they haven’t also moved twice as close to the rear of the stage. Therefore, I believe that every recording should have a test tone generated by a speaker at the front of the soundstage, with a stated dB level (sound pressure) as picked up at the microphones, that could be matched with a sound pressure meter at the listening seat of an audio system. That would help listeners match the playback level at their listening position to the sound pressure level at the microphones and thereby hear the depth perspective that they’d have heard if they’d sat where the microphones were.
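The difference between turning up the volume and moving to a closer seat can be sketched with the inverse-distance law. This is only a rough illustration: it assumes point sources in a free field with no reverberation, so real halls will deviate from it, and the distances (a front source at 5 m, a rear source at 15 m, a move of 2.5 m toward the stage) are hypothetical.

```python
import math

def level_change_db(old_distance_m: float, new_distance_m: float) -> float:
    """Level change of a free-field point source when the listener moves."""
    return 20.0 * math.log10(old_distance_m / new_distance_m)

front_m, rear_m = 5.0, 15.0  # hypothetical distances to front and rear sources
step_m = 2.5                 # hypothetical move toward the stage

# Moving closer: each source gains by a different amount.
front_gain_by_moving = level_change_db(front_m, front_m - step_m)
rear_gain_by_moving = level_change_db(rear_m, rear_m - step_m)

# Volume control: every source gains by the same amount.
volume_gain = front_gain_by_moving

print(f"moving closer: front +{front_gain_by_moving:.2f} dB, rear +{rear_gain_by_moving:.2f} dB")
print(f"volume knob:   front +{volume_gain:.2f} dB, rear +{volume_gain:.2f} dB")
```

Under these assumptions, halving the distance to the front source raises it by about 6 dB but raises the rear source by only about 1.6 dB, whereas the volume control raises both by the same 6 dB and so leaves the front-to-rear level relationship unchanged.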

I want to digress for just a moment and mention another aspect of imaging whose fine details are beyond the scope of this piece. When one reads a review of speakers that states the speakers can image outside of the left and right space between the speakers, that’s a bit of smoke and mirrors, because on a true stereo recording sound sources don’t have inverted polarity or phase distortions to trick the listener’s ear-brain into hearing virtual images that appear anywhere except at one or the other speaker, between the speakers, or within a cone with its apex at the listener’s ear whose angle subtends the topmost, bottommost, leftmost, and rightmost speaker drivers, unless one's two ears don't have the same phase/amplitude response, in which case all bets are off. Therefore, stereo-miked recordings may seem to have sound to the left and right of and above and below the speakers if the image is behind the plane of the speakers. Accidental or deliberate phase and time distortions can result in some very interesting aural effects, such as virtual images to the sides of or behind one’s head, that can make for a quite different and interesting listening experience. But so far, if only two stereo speakers are used, those effects invariably alter the sound’s timbre and cause comb filtering effects. QSound and Roland Sound Space recordings are the only two of the many different proprietary ways of altering stereo images in the manner just described that I’ve heard. If in fact speakers could image outside of themselves, then where’s the single speaker that can image outside itself or, for that matter, sounds like stereo? There are also two other artistic choices, known as the Phil Spector Wall of Sound and the Reverse Wall of Sound, that alter stereo imaging and timbre.
The idea of the former is to put everything except the lead instrumentalist(s)/vocalist(s) out of polarity so that the lead performer(s) will stand out in bas-relief against a two-dimensional “Wall of Sound”, and for the latter the reverse is done to place the virtual images of the lead performer(s) further back in the virtual soundstage. However, having said that, I’m not going to argue with the artistic choices of record producers, but fidelity to live acoustic performances it’s not, except perhaps in the case of Phil Spector’s “happenings.”

Since the speed of sound is the same at all frequencies, all frequencies emanating simultaneously from a given sound source will arrive in sync at the listener’s ears, although they are comb filtered for sound sources that are not equally distant from the listener’s left and right ears (not on a midline between the listener’s ears). For an even more realistic virtual image, if one wanted to duplicate at home the relative geometry of the stereo microphones to the soundstage, one would need to adjust one’s listening distance relative to the distance between the speakers so as to match the angle subtended at the microphones by the left and right boundaries of the soundstage. However impractical that would be, it’s even more problematic due to the non-point-source radiation pattern of most speakers with non-coincident drivers, because there’s only one point in space (the proverbial sweet spot) where the time-alignment of such multiple-driver speakers is in sync, and everywhere else there’s comb filtering, except for sound sources whose frequency range is reproduced by a single driver in each channel. Speakers that don’t have all their crossed-over drivers time-aligned will comb filter the frequencies in the overlapping region of their crossover(s). A further requirement to minimize the comb filtering potential of two-channel playback is a listening position that’s equidistant from both channels. Multi-driver non-time-aligned speakers with all drivers producing the same frequencies will comb filter all frequencies. I wouldn’t be surprised if the rough and harsh sound of comb filtering/inverted polarity accounted for most of what’s frequently objected to by those who prefer analog.
And in some instances, the high frequency phase shift and/or high frequency out-of-phase crosstalk of some analog recordings may partially mitigate the early arrival time of mis-time-aligned tweeters, as well as the effects of some microphone techniques, such as crossed figure-eight Blumlein pair recordings, that produce out-of-phase crosstalk of the ambient sound field and detract from Absolute Fidelity. This may be one reason for the popularity of single driver speakers that act as point sources. The result is that judging the fidelity of digital versus analog media is made more difficult than it might otherwise be by the comb filtering of non-time-aligned speakers and some microphone setups. If one adds a subwoofer or subwoofers, low-passed at the relatively long wavelengths of bass frequencies, to a single driver system or a time-aligned system, a small amount of mis-time-alignment is much less audible and may not be that detrimental to perceived fidelity, while being quite helpful in augmenting the relatively limited bass response of most single driver speakers or speakers with restricted bass response.
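The geometric matching described above, in which the listener sits so that the speakers subtend the same angle that the soundstage subtended at the microphones, reduces to a little trigonometry. The sketch below assumes a listener and microphones on the center line and uses hypothetical dimensions (an 8 m wide soundstage recorded from 6 m back, speakers 2.4 m apart); the function names are invented for the illustration.

```python
import math

def subtended_angle_deg(width_m: float, distance_m: float) -> float:
    """Angle subtended by a width seen from a point on its center line."""
    return math.degrees(2.0 * math.atan(width_m / (2.0 * distance_m)))

def listening_distance_m(speaker_spacing_m: float, target_angle_deg: float) -> float:
    """Distance back from the speaker line at which the speakers subtend the target angle."""
    return speaker_spacing_m / (2.0 * math.tan(math.radians(target_angle_deg) / 2.0))

stage_width_m, mic_distance_m = 8.0, 6.0  # hypothetical recording geometry
speaker_spacing_m = 2.4                   # hypothetical playback setup

mic_angle = subtended_angle_deg(stage_width_m, mic_distance_m)
seat = listening_distance_m(speaker_spacing_m, mic_angle)
print(f"soundstage angle at microphones: {mic_angle:.1f} deg")
print(f"matching listening distance:     {seat:.2f} m")
```

With these numbers the matching seat is simply the speaker spacing scaled by the ratio of microphone distance to stage width, which shows how quickly the required position can become impractical in a real room.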

And even a coincident pair of stereo microphones that picks up true stereo can’t capture what a listener who sat where the microphones were would hear, because of their spacing and because there’s nothing in between them to affect the sound the way the listener’s head does. The pickup pattern of most coincident microphones that are configured with an angle between them of 90 to 135 degrees results in a greater amplitude difference between the left and right microphones for sound sources that aren’t equidistant from them than the amplitude difference at the ears of a listener substituted for the microphones. And again, a binaural dummy-head microphone setup, with the microphones situated on opposite sides of a dummy head where a listener’s ears would be, or on the real head of a listener, can record what a listener who sat where the microphones were would have heard. Binaural playback requires headphones or speakers placed opposite the listener’s ears to reproduce what the microphones picked up, so that the listener is virtually transported back to the time and place of the original performance.

The main aspects of a sound source that our ear-brain mechanism uses to determine where a sound in front of us is coming from are the difference between its level at each ear, its high frequencies, which are heard as more directional than low frequencies, and the difference in the sound’s arrival time at each ear. That means that transients (whose leading edges are high frequencies) and first arrival sounds (the Haas Effect) impart more directional information to our ear-brain than steady state sounds. The major difference between the real images of live performances and virtual stereo images is that live sound sources are essentially point sources, whereas the virtual images of sound sources presented by stereo speakers are created from only two sound sources, speakers or headphones. When one moves their head left or right at a live performance there’s not much of a change in the perceived image placement, because at the usual listening distances the sound sources are essentially point sources that present the same sound to both the left and right ears unless the sound comes from the extreme sides. As a consequence, if one closes their eyes at a live performance, it will be difficult to localize the lateral position of sound sources. But not so with a virtual stereo image, because the left and right balance of the virtual image of every sound source, which determines its relative position between the speakers, is determined by the ratio of the sound pressure between the left and right speakers, which changes much more when one moves their head toward one or the other of the speakers. The shadowing and comb filter effects of a listener’s head vary with the position of point sources relative to the listener.
All real sound sources are picked up or recorded by two or more microphones that, even if the microphones are coincident, will have some degree of comb filtering and crosstalk, such that when the sound is reproduced by a stereo pair of speakers as a virtual stereo image it creates secondary shadowing and comb filter effects at a listener’s ears. That magnifies the effect of a listener’s head movements relative to the speakers that produce the virtual stereo images, compared with the lesser effect on the real image of moving one’s head relative to the point sources of live performances, which is just one more of the reasons why virtual stereo images differ from the real images of live performances.

Here’s a list of some of the attributes of audio systems I think are important to a listener’s ability to feel connected to the virtual sound stage’s imaging, all of which are affected by the recording techniques, media, and playback system, i.e., the localization-perspective of the sound sources in the virtual stereo 3-dimensional sound stage: stability, density/weight, holography, delineation/specificity, and polarity*. The Absolute Reality of at-the-ear Absolute Fidelity is that there are only two things that don’t affect the sound of an audio system, the brand of batteries in its remote control and whether or not you’re wearing a digital watch, but I’m not really sure about the latter.

I liken the heightening of virtual stereo imaging over live sound images, with its additional sonic detail, to the heightened depth of field of some still camera lenses and HD video, wherein the viewer simultaneously sees objects in the foreground and distant objects in focus, with additional visual detail that they couldn’t see in focus simultaneously with the naked eye. Heightened depth of field may simulate 3-D images, especially when an object in the foreground moves rapidly against the sharply focused background. I think that’s one of the main reasons that most people like HD video. And given the lack of visual images for sound-only playback (as opposed to DVD-A that has sound accompanied by video), the heightened localization of virtual stereo images over most live images allows one to hear individual performers more easily than at live performances and thereby may enhance one’s emotional involvement with the individual performers and music. Could that be the main reason that many listeners prefer virtual stereo images, with their hyper-focus and increased sonic detail that’s Beyond High Fidelity, over the relatively less detailed and more monophonic presentation of most live music? I believe that high fidelity playback must faithfully reproduce the media it's playing, but after that it's up to listeners, if they desire and when possible, to adjust their system's sound to ameliorate poorly recorded media and to tune their system's sound and musicality to their particular tastes.

Hear the Music Not Just Notes™, not just a slogan, but an Essential Quality of High Fidelity Music Systems for Music-Loving Audiophiles Who Seek the Greatest Possible Emotional Involvement With Music

By way of an introduction, I feel the need to state what many others have probably expressed far better. For most music-lovers, content trumps practically everything else when making an emotional connection with music. But sometimes a truly musical audio system not only encourages a closer emotional involvement with one’s favorite music, but also may serendipitously result in the discovery of new music genres that expand one’s musical horizons and taste.

I believe there are many necessary but not necessarily sufficient qualities that an audio system’s individual components must synergistically engender for it to truly produce emotionally involving music. My list isn’t in any particular order of importance, because ordering it would require an entirely different discussion that would depend upon each individual audiophile’s hierarchical list of musical priorities. So by way of a list of sorts, I believe there are many “amazingly spectacular sounding” audio systems that reproduce music with a startlingly etched image that seems to materialize out of the dark matter in intergalactic space (as described in my Ruminations … above), amazing clarity, proper image scale, stunning transients with both macro and micro dynamics, flat frequency response, extremely low distortion, vanishingly low noise, great “transparency”, deeply extended, authoritatively quick and controlled bass, detailed high frequencies, a smoothness that’s seductive, and last but not least, well recorded media. And yet no matter how impressively, amazingly spectacular those “hi fi systems” might sound initially, with specifications on paper and sonics that would seem to do everything that any audiophile would want (especially to non-audiophiles who’ve only rarely or maybe never heard what a high quality sound system can do and are easily captivated by the system’s Wow Factor!), somehow those systems ultimately become emotionally uninvolving. Music-loving audiophiles, who really appreciate music’s raison d'être, know that the ability of an audio system to help them make the closest possible emotional connection to music, perhaps even, if only rarely, to the point of experiencing goosebumps, is all that really matters! (Except its cost, of course…)

So what’s missing that makes such a potentially great system reproduce what ultimately seems to be only a collection of notes that fails to convey the emotional intention and purpose of the music, rather than highly emotionally involving music? The first time I became aware of that effect was when a friend brought over Sony’s first 1-bit CD player and after hearing it I remarked, “I hear the notes but where’s the Music?”

Some music-lovers call it “continuousness.” Some call it “pace and rhythm or timing” and some simply refer to it as music’s “Boogie Factor.” It seems to me that some of those “hi fi systems” truncate the attack and decay of individual notes, which makes individual notes stand out in an unnatural bas-relief against the exaggerated silence between the notes but obscures the natural pace and rhythm or timing of the music, which may make for a more spectacular sounding but ultimately less emotionally involving system. I think that it probably couldn’t be said any better than by the title of the Duke Ellington song It Don’t Mean a Thing if it Ain’t Got that Swing. This is an easy trap, especially for a tweaking music-loving audiophile, because frequently a tweak or “upgrade” can make a rather ordinary sounding system sound quite spectacular, and it’s only with extended listening that one might become aware that musicality is lacking or missing entirely. I know from first-ear experience that the more spectacular my system sounds, the easier it is to be seduced into thinking that I’ve made great improvements to its musicality when in reality it may sound even less musically involving than it did before the tweak or “upgrade.”

* Thirty Years of Digital and the 92% Solution

* Polarity Think Piece:  A Speculation on Perception of Detail

By George S. Louis, Esq., CEO Digital Systems & Solution, President San Diego Society (SDAS), Phone: 619-401-9876, 888-588-9542 toll free, Email: AudioGeorge@AudioGeorge.com, Website: http://audiogeorge.com

Perfect Polarity Pundit™, September 15, 2011, Revised April 29, 2014
Copyright © 2014 George S. Louis