The Human Ear

This is a research summary I wrote a long time ago for A-level Biology, but I thought it was very relevant to piano tuning.



How does a piano tuner or musician distinguish between two notes? Pitch is defined by the American National Standards Institute as “that auditory attribute of sound according to which sounds can be ordered on a (musical) scale from low to high.” This page describes how the human central nervous system (CNS) orders these different sounds on a musical scale, or more objectively, how the CNS distinguishes between a “higher” sound and a “lower” sound.

The terms “pitch” and “frequency” of a sound are used to describe two different things in psychoacoustics. The frequency of a sound is the number of sound pressure waves per second of the sound stimulus, given in Hertz. The pitch of the sound is what the CNS perceives when given that stimulus, as ordered on a musical scale. Hence the use of the term “auditory attribute” and not just ‘attribute of sound’ in the above definition.

This principle is not unique to the auditory system. For example, referred pain is where a stimulus in one part of the body (e.g. the liver or heart) is perceived by a person as pain in another (e.g. the head or the arm respectively). One is conscious only of what the CNS relays to the frontal lobes (or wherever one’s consciousness is), which may or may not make one aware of the exact nature of the stimulus. After an introduction to the principal elements of the auditory system, we will begin to unfold models relating to how the auditory system processes what we perceive as pitch.


Figure 1 – Diagram of the Human ear.
A diagram of the human ear
(Taken from Washington University in St. Louis)

  • The outer ear acts as a passive amplifier by funnelling sounds into the middle ear. The pinna also modifies sounds slightly depending on which direction they come from relative to the head; we have learnt to associate these modifications with sound arriving from different directions.
  • The middle and inner ear are labelled in the diagram below. The middle ear is separated from the outer ear by the tympanic membrane. It consists of the malleus, incus and stapes, which are bones that conduct the vibrations of the tympanic membrane to the oval window.
  • The oval window is a window into the cochlea, and the vibrations of the stapes are transduced into vibrations of the cochlear fluid which displaces the basilar membrane. The diagram below (figure 2) attempts to illustrate this, the fuzzy arrows being the vibrations of air/fluid. Hence the vibrations are passed from the air (outer ear) through both the solid (middle ear) and liquid phases (cochlear fluid), and back into a solid in the basilar membrane.
  • The inner ear is the cochlea. This is a spiral-like structure (photographed in fig. 3) consisting of the elements labelled below.


Figure 2 – The Middle and Inner Ear, showing how vibrations are conducted through to the cochlear fluid and cause displacement of the basilar membrane.



(Modified from Bristol University)

Figure 3 – Photograph of cochlea with different compartments filled with coloured fluid.



(Picture from Washington University in St. Louis)


  • Stereocilia on cells that are joined by small fibres to the tectorial membrane (labelled in figures 4 and 5) oscillate in time with the displacement of the basilar membrane. When they’re moved in a certain direction, potassium ion channels on the tips of the cilia open and the cells, known as inner hair cells, depolarise. This releases vesicles at the other end of the cell across a synaptic cleft, causing a nerve impulse to be sent along an afferent neuron (figure 5):


Figure 4 – Cross-sectional illustration of one turn of the spiral of the cochlea, showing position of hair cells in the organ of Corti:



(Modified from McMaster University – page no longer exists)

Figure 5 – Electron Micrograph showing stereocilia on hair cells:



(Taken from McMaster University – page no longer exists)

Figure 6 – Diagram showing how movement of stereocilia opens ion channels (actin filaments joining adjacent cilia are pulled taut and open ‘trapdoors’ that block ion channels):



(Diagram taken from Bristol University)


  • The outer hair cells also have stereocilia, but so far we only understand that they’re responsible for amplifying the vibration of the basilar membrane by shortening and lengthening (Xue S, Mountain DC, Hubbard AE, 1997).
  • The basilar membrane (henceforth, ‘BM’) decreases in thickness and increases in width towards the apex of the cochlea, so that higher frequencies make the basal end vibrate most, and lower frequencies the more apical end.
  • There are approximately 3,500 inner hair cells in the human cochlea, all lined up along the BM. Because of the gradation in thickness and width of the basilar membrane, it is tuned: each of these hair cells responds maximally to a different frequency, referred to as the characteristic frequency of the hair cell or of its afferent nerve.
  • The axons of these afferent neurons form the auditory nerve, and their cell bodies make up the spiral ganglion.
  • The information is relayed to the cochlear nuclei, then to the superior olivary nuclei, via the lateral lemniscus on to the inferior colliculi (fig. 8), to the medial geniculate nucleus (fig. 9) and then via the auditory radiation to the primary auditory cortex in the temporal gyrus (fig. 10). This mouthful of a pathway is illustrated in figure 7 below.


Figure 7 – Pathway of sensory information from the cochlea:



(Diagram taken from Bristol University)

Figure 8 – View from underneath the human brain showing location of the inferior colliculi:



(Picture taken from Bristol University)

Figure 9 – Dissected brain from below, showing location of medial geniculate nucleus:



(Picture taken from Bristol University)

Figure 10 – Coronal section of brain showing position of auditory cortex:



(Picture taken from Bristol University)

The unique characteristic frequency of any given point on the BM and the hair cell on that bit of membrane could be said, then, to be the tool used by the CNS to define at what frequency an external stimulus was given. For example, a pure tone stimulus given at 500 Hz would maximally excite the hair cell closest to or at the point of the BM that has the correct thickness and width to vibrate maximally at precisely 500 Hz. Figure 11 represents the BM uncoiled from the cochlea, illustrating that different frequencies excite different places along the BM.

Figure 11 – Representation of the BM uncoiled from the cochlea, illustrating that different frequencies excite different places along the BM.




(Taken from Tohoku University)

Figure 12 – Graph showing which frequencies (x) cause most resonance at different positions of the BM (y):



(Taken from a paper by Worrall, 1998)


Therefore at two different pitches, two different points on the BM will vibrate maximally and therefore two different neurones will fire maximally.

A pure tone is a sound wave whose pressure variation is a sinusoidal function of time. In other words the pressure at a fixed point in space rises and falls smoothly and periodically, repeating the same sine-shaped cycle at a single frequency.

The hair cells always induce firing of an action potential in the afferent nerve fibre at the same phase of the waveform, so that the intervals between successive nerve impulses are integer multiples of the period of the waveform. For example, a 500 Hz pure tone (which has a sinusoidal waveform) has a period of 2 msec, therefore the nerve may fire at intervals of 2, 4, 6, 8, etc. msec. This is known as phase locking. Phase locking occurs for frequencies up to 4 or 5 kHz, after which the timing variability with which the nerve impulse is linked to a particular phase of the cycle becomes comparable to the period of the waveform, and the firing can therefore no longer be faithful to it. When this happens the ‘spike’ pattern (pattern of single action potentials), which normally represents repetitions of a particular phase in the waveform, is obscured because firing begins to occur at any phase of the waveform.
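The arithmetic behind the 500 Hz example can be sketched in a few lines of Python. This is only an illustration of the period calculation, not a model of a real neuron; the function names are made up for this page.

```python
def period_ms(frequency_hz: float) -> float:
    """Duration of one cycle of a pure tone, in milliseconds."""
    return 1000.0 / frequency_hz


def possible_spike_intervals_ms(frequency_hz: float, count: int = 4) -> list:
    """Intervals at which a phase-locked fibre may fire.

    Spikes always fall on the same phase of the waveform, so the gap
    between two spikes is a whole number of periods (the fibre may skip
    cycles, but it cannot fire halfway through one).
    """
    period = period_ms(frequency_hz)
    return [n * period for n in range(1, count + 1)]


print(possible_spike_intervals_ms(500))  # [2.0, 4.0, 6.0, 8.0]
print(period_ms(5000))                   # 0.2
```

At 5 kHz the period is only 0.2 msec, which gives a feel for how little timing error the fibre can tolerate before phase locking breaks down.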

A difference limen for frequency (DLF) is the smallest difference in frequency between two tones that the auditory system can detect. A frequency modulation detection limen (FMDL) is the same thing, but this time the two pitches in question are present in the one tone, which modulates from one to the other. DLF’s and FMDL’s are determined by temporal information (i.e. phase locking) at frequencies up to 4 or 5 kHz and at low frequency modulation rates, and by a place mechanism at higher frequencies or high modulation rates. This could be expected, because the exact frequency can be deduced from the temporal spacing of the phase-locked nerve firings only up to 4 or 5 kHz. Above that, the only possible method of deducing the frequency of the stimulus relates to where the point of maximal excitation is on the BM, in other words, purely to which hair cells are most excited.

Experimentally, the smallest DLF’s we can detect are at around 500 Hz, and are of the order of 1 to 3 Hz (about 1/30 of a musical ‘tone’ interval at mid-treble frequencies, which is more than enough frequency resolution for a piano tuner). Earlier models (Zwicker, 1970) for frequency discrimination of pure tones rely on changes in the excitation pattern. The excitation pattern is the pattern of activity of the neurones (i.e. afferents from the cochlea) as a function of the characteristic frequency of the neurones being excited. Zwicker’s model is a place model, in that the perceived pitch corresponds to the places of maximal excitation along the BM.
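To get a feel for how small a 1 to 3 Hz difference is in musical terms, it can be converted into cents (hundredths of an equal-tempered semitone, so a whole tone is 200 cents). The sketch below assumes equal temperament and uses a hypothetical helper name; it is a back-of-envelope check rather than anything taken from the papers cited.

```python
import math


def interval_in_cents(f_low_hz: float, f_high_hz: float) -> float:
    """Size of the interval between two frequencies in cents.

    1200 cents make an octave (a frequency ratio of 2), so the interval
    is 1200 * log2 of the frequency ratio.
    """
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


# A 1 Hz and a 3 Hz difference at 500 Hz, expressed in cents
# and as a fraction of a 200-cent whole tone.
for delta_hz in (1, 3):
    cents = interval_in_cents(500, 500 + delta_hz)
    print(f"{delta_hz} Hz at 500 Hz = {cents:.1f} cents = 1/{200 / cents:.0f} of a tone")
```

At 500 Hz this works out to roughly 3.5 and 10.4 cents, i.e. somewhere between about 1/60 and 1/20 of a tone, which brackets the 1/30 figure quoted above.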

The auditory filters are bandpass filters assumed to exist in the peripheral auditory system. The critical bandwidth of an auditory filter is the width of the band of frequencies over which the filter is effective. If Zwicker’s place model were correct:



  • DLF’s would change with frequency in the same way as the critical bandwidth of the auditory filter, i.e. how small the DLF is would depend on how sharply tuned the auditory filter is. This is not the case: DLF’s vary more with frequency than the equivalent rectangular bandwidth (the ERB is a measure of the critical bandwidth).
  • Randomised changes in level (stimulus amplitude) should substantially increase DLF’s. This also is not the case.
  • The shorter in duration tone pulses become, the wider the range of frequencies the energy is spread over. Therefore, according to this model, DLF’s should increase markedly with decreasing duration of short pulse tones below a certain critical duration. This is the case, but experimental evidence (Moore, 1972, 1973a) shows that the increase in DLF due to the decrease in tone pulse duration is much less than predicted.



This is likely to be because DLF’s are determined by phase locking, the precision of which decreases with increasing frequency above 1 or 2 kHz, and is absent above 5 kHz (hence Zwicker’s model is more accurate for pure tones above 5 kHz).

Moore and Sek showed between 1992 and 1995 that the FMDL increases with increasing frequency of modulation. They proposed that (in the words of Moore himself) “the mechanism for decoding the phase-locking information is ‘sluggish’ and can not follow rapid oscillations in frequency. Hence, it plays little role for high modulation rates.” Where this is the case (i.e. above about 10 Hz FM), or where phase-locking can’t occur (i.e. above about 5 kHz), temporal analysis can’t be used to define FMDL’s, and therefore they must be defined by changes in the excitation pattern.


In general:


  • The pitch of tones below 2 kHz decreases with increasing sound level.
  • The pitch of tones above 4 kHz increases with increasing sound level.



Verschuure and van Meeteren showed in 1975 that these changes in pitch are normally less than 1%, but can reach up to 5% the further the stimulus frequency lies above 4 kHz or below 2 kHz.


Complex tone: “A tone composed of a number of sinusoids at different frequencies.” (B.C.J. Moore). Almost all naturally occurring sounds are complex tones. A note played on a piano or other instrument, or sung, will consist of a sound wave at the fundamental frequency, which we’d refer to as the actual ‘note,’ but also another sound wave at twice the frequency, and others at three times, four times, five times etc. the fundamental frequency. These other sound waves are referred to musically as ‘harmonics,’ and in psychoacoustics as partials (i.e. parts) of a complex tone. For example, if the fundamental frequency is at 440 Hz (musically, A below top C), then the harmonics will be 880 Hz (‘top’ A), 1320 Hz (E above that), 1760 Hz (next A up), 2200 Hz (C# above), and so on. Interestingly, the above notes form the chord of A major. We find musical intervals whose frequencies are related by simple ratios pleasing to the ear, and it was on this mathematical basis that Pythagoras built the tuning system from which the one we use today is descended.
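The harmonic series of A 440 and the note names given above can be checked with a short Python sketch. It assumes equal temperament with A = 440 Hz as the reference; the function names and the note table are illustrative choices, not part of any standard library.

```python
import math

# Note names counted in semitones upwards from A.
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def harmonics(fundamental_hz: float, count: int = 5) -> list:
    """The fundamental and its first few integer multiples."""
    return [n * fundamental_hz for n in range(1, count + 1)]


def nearest_note(frequency_hz: float, reference_a_hz: float = 440.0):
    """Nearest equal-tempered note name and the deviation from it in cents."""
    semitones = 12 * math.log2(frequency_hz / reference_a_hz)
    nearest = round(semitones)
    cents_off = 100 * (semitones - nearest)
    return NOTE_NAMES[nearest % 12], cents_off


for frequency in harmonics(440.0):
    name, cents_off = nearest_note(frequency)
    print(f"{frequency:7.1f} Hz is closest to {name} ({cents_off:+.1f} cents)")
```

The names come out as A, A, E, A and C#, matching the A major chord. The fifth harmonic sits about 14 cents flat of the equal-tempered C#, a small reminder that the harmonic series and the tempered scale do not line up exactly.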

When the stimulus to the ear is a complex tone, it has been discovered that the frequency of the fundamental is not determined by the excitation of the point on the BM that has the characteristic frequency of the fundamental. Instead it is determined by the frequencies of the partials. For this reason, even if the fundamental is missing in the stimulus, it can still be perceived – in this case it is referred to as a residue pitch. There are two models for how partials define a fundamental:


  • Pattern Recognition. The brain receives information on all the different partials that are detected by the hair cells. It then compares this pattern of partials to a map that has assigned each different fundamental to a given pattern of partials, or several patterns of partials. The perceived fundamental/residue is the one whose harmonics coincide most closely with the partials detected on the BM. For example, the stimulus might be a complex tone of fundamental 400 Hz. Ignoring the fundamental, the brain will recognise the partials at 800, 1200, 1600, 2000 Hz, etc. The brain has learnt that these partials correspond to a fundamental of 400 Hz, so one perceives a note of pitch 400 Hz.
  • Temporal Theory. This proposes that residue pitches are produced (and fundamentals defined) by upper harmonics that are not very well resolved but interfere on the BM. The value of this pitch is determined by the time pattern of the waveform at the point on the BM where the partials interfere. For example, a pure tone of frequency 3000 Hz would have a constant, sinusoidal waveform. In contrast, a 3000 Hz partial of a complex tone of fundamental 500 Hz will cause the BM to vibrate at 3000 Hz with an amplitude that rises and falls, dropping to almost nothing, at a rate of 500 Hz, the fundamental. Phase locking allows measurement of the time intervals between amplitude peaks in the waveform, which will now be approximately 333 µs, 667 µs, 1000 µs, 2000 µs, 2333 µs, 2667 µs, and so on, where 2000 µs is the period of the 500 Hz fundamental. This is the time pattern that can be used to define the pitch of the fundamental, or a residue.
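The pattern recognition idea in the first model above can be illustrated with a toy Python sketch: try a range of candidate fundamentals and keep the one whose harmonic series fits the detected partials best. The scoring rule, the candidate range and the tie-break (preferring the highest candidate that fits equally well, so that 400 Hz wins over 200 Hz or 100 Hz) are all assumptions made for the illustration, not a description of what the brain actually does.

```python
def mismatch(partials_hz, candidate_hz):
    """How badly the partials fit the harmonic series of a candidate.

    Each partial is compared with the nearest whole-number multiple of
    the candidate fundamental; the relative errors are summed.
    """
    total = 0.0
    for partial in partials_hz:
        ratio = partial / candidate_hz
        nearest_harmonic = max(1, round(ratio))
        total += abs(ratio - nearest_harmonic) / nearest_harmonic
    return total


def estimate_fundamental(partials_hz, candidates_hz=range(50, 1001)):
    """Candidate with the smallest mismatch; ties go to the highest candidate."""
    return min(
        candidates_hz,
        key=lambda c: (round(mismatch(partials_hz, c), 6), -c),
    )


# The fundamental itself (400 Hz) is absent from the stimulus.
print(estimate_fundamental([800, 1200, 1600, 2000]))  # 400
```

Note that the 400 Hz component never appears in the input, yet it is the answer returned, which is the residue pitch described above.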


Experiments have proved that neither model can account fully for the properties of pitch perception of complex tones at all audible frequencies. The temporal theory as an absolute model fails to explain that residue pitches can be heard when there is no possibility of the partials of a complex tone interacting in any part of the peripheral auditory system. If the definition of pitch of a complex tone were purely due to temporal patterns, the auditory system would still perceive a residue pitch from a complex tone with indistinguishable partials. The pattern recognition model can’t stand alone because when the harmonics of a complex tone are too high to be resolved, a residue pitch can still be heard.

Therefore psychoacoustics now takes an integrated approach to pitch determination, involving both temporal and place pattern information. It incorporates the place theory in that individual harmonics that excite different parts of the BM cause different neurones to fire, and hence the many components of the stimulus are separated into different channels of information along the auditory nerve. In brief, the neuronal spike intervals (i.e. time between one action potential and the next in the afferents) are analysed separately for each channel, to find the most common interspike intervals. Then the time intervals from different channels are compared. The time interval that recurs most often across channels is usually the period of the fundamental, i.e. the reciprocal of the fundamental frequency. The interspike intervals that are best represented are fed into some form of decision mechanism, which finally outputs one interval from all those it receives. The decision mechanism may be influenced by memory, attention, preceding stimuli, context, conditions of presentation, and other factors. The perceived pitch will then be the reciprocal of the chosen interspike interval.
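A heavily simplified Python sketch of this integrated scheme is given below. It assumes idealised, perfectly phase-locked channels (one per partial) in which every whole multiple of the period can occur as an interspike interval, and it stands in for the decision mechanism with a simple rule: pick the interval shared by the most channels, and among those the shortest. Both the simplification and the rule are assumptions for illustration only; the function names are invented.

```python
from collections import Counter
from fractions import Fraction


def channel_intervals(frequency_hz: int, max_interval_s: Fraction) -> set:
    """Interspike intervals available in one phase-locked channel.

    Exact fractions of a second are used so that, for example, 2/800 s
    and 3/1200 s are recognised as the same interval.
    """
    period = Fraction(1, frequency_hz)
    intervals = set()
    n = 1
    while n * period <= max_interval_s:
        intervals.add(n * period)
        n += 1
    return intervals


def perceived_pitch_hz(partials_hz, max_interval_s=Fraction(1, 100)):
    """Reciprocal of the interval chosen by the toy decision rule."""
    votes = Counter()
    for frequency in partials_hz:
        # Each channel votes once for every interval it contains.
        votes.update(channel_intervals(frequency, max_interval_s))
    most_votes = max(votes.values())
    chosen = min(i for i, v in votes.items() if v == most_votes)
    return 1 / chosen


print(perceived_pitch_hz([800, 1200, 1600]))  # 400
```

The three channels only agree on intervals of 2.5 msec and its multiples, so the chosen interval is 2.5 msec and the perceived pitch is its reciprocal, 400 Hz, even though no channel is tuned to 400 Hz.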

Summarising the above model, the components of the complex tone are channelled according to the characteristic frequency of the neurones they excite along the BM. The interspike intervals are then analysed separately for each channel, then across channels, and finally one is selected, the reciprocal of which is the perceived pitch of the fundamental. Of course, neither the piano tuner nor the musician is conscious of such calculations. The results are passed on to higher levels in the brain together with information about the nature of the complex tone (which helps the listener distinguish timbre, as in what instrument is being played or what is creating the sound), temporal information on a much broader time scale (which gives rhythm), and information about volume. All of these factors are processed, analysed, put in context and compared to stored information. The result is what we’re conscious of, whether it be a piece of music for the listener, a mistake to be corrected by a musician or the cue for a piano tuner to turn a tuning pin one way or another.

5 thoughts on “The Human Ear”

  1. I agree with you about tuning pianos to A 432 Hz, although unfortunately as tuners we are generally expected to tune pianos to A 440 concert pitch as this is what the public wants. I tune my own pianos to A 432 but very seldom would a client request such!

    Great blog thoroughly enjoyed the article.

    Jamie Fox
    Dublin Piano Tuner

    • One chap suggested A 432 to me once, I was skeptical about his reasons but interested. He seemed to think 432 Hz was the resonant frequency of water molecules whereas 440 was the devil’s pitch or something. Any good resources on this?

      • Your friend is right … as evidenced by the fact that the devil has all the best music! If we switched to 432 everything would become frightfully dull.
