The first template-based system for audio chord recognition
was developed by Fujishima [2]. It was the first method to consider chords not merely as sets of individual notes, but as entities whose structure is determined by one root and one chord type. The chord transcription process relies on extracting Pitch Class Profiles (PCPs), also called chroma vectors, from the signal. Chroma vectors are 12-dimensional vectors
where each component represents the energy or salience of
one of the 12 semitones of the chromatic scale, regardless
of the octave. The temporal sequence of these chroma vectors
is called the chromagram; it has been widely used in the literature
for chord and key estimation [5], [7]. In Fujishima's approach,
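As an illustration, a chroma vector can be obtained from a short-time magnitude spectrum by folding the energy of each frequency bin onto its pitch class. This is a minimal sketch only: the function name, frequency range, and squared-magnitude energy weighting are assumptions, not Fujishima's exact PCP formula.

```python
import numpy as np

def chroma_vector(magnitudes, freqs, fmin=55.0, fmax=2000.0):
    """Fold spectral energy into 12 pitch classes (C=0, ..., B=11)."""
    chroma = np.zeros(12)
    for mag, f in zip(magnitudes, freqs):
        if f < fmin or f > fmax:
            continue
        # MIDI-style pitch number; A4 = 440 Hz = MIDI 69 (pitch class 9 = A)
        midi = int(round(69 + 12 * np.log2(f / 440.0)))
        chroma[midi % 12] += mag ** 2  # accumulate energy per semitone class
    return chroma
```

Stacking one such vector per analysis frame yields the chromagram; note that energy at 440 Hz and 880 Hz lands in the same bin, which is exactly the octave invariance described above.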
324 chords are detected, each of them modeled by a binary
Chord Type Template (CTT). Chord detection is performed
by first computing a score for every combination of root and
chord type, then selecting the best-scoring pair. The scores are
computed from the chroma vectors and hand-tuned variants of
the original CTTs. Two matching methods between PCP and
CTT are tested: the Nearest Neighbor Method (the Euclidean
distance between the chroma vector and the hand-tuned CTT)
and the Weighted Sum Method (the dot product between the
chroma vector and the hand-tuned CTT).
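Both matching rules can be sketched on a toy template set. This is a minimal illustration under stated assumptions: only two binary chord types are shown (Fujishima's system covered 324 chords with hand-tuned templates), and the function and dictionary names are mine.

```python
import numpy as np

# Binary chord type templates over pitch classes, root at index 0
# (illustrative subset; the actual system used many more types).
CHORD_TYPES = {
    "maj": np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], float),
    "min": np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], float),
}
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def recognize(chroma, method="ws"):
    """Score every (root, chord type) pair and return the best label."""
    best_name, best_score = None, None
    for root in range(12):
        for name, tpl in CHORD_TYPES.items():
            ctt = np.roll(tpl, root)  # rotate the template to the root
            if method == "ws":
                # Weighted Sum: dot product, higher is better
                score = float(chroma @ ctt)
                better = best_score is None or score > best_score
            else:
                # Nearest Neighbor: Euclidean distance, lower is better
                score = float(np.linalg.norm(chroma - ctt))
                better = best_score is None or score < best_score
            if better:
                best_name, best_score = PITCH_CLASSES[root] + ":" + name, score
    return best_name
```

Rotating a single per-type template through the 12 roots is what lets one template serve all transpositions of a chord type.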
The hand-tuning is done by trial and error and accounts for the
probability of each chord type and the number of notes it
contains. Two postprocessing methods are introduced to take
the temporal structure of the chord sequence into account.
The first smooths the past chroma vectors, both to reduce
noise and to exploit the fact that a chord usually lasts for
several frames. The second detects chord changes by monitoring
the direction of the chroma vectors.
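Both postprocessing ideas can be sketched as follows. This is a hedged approximation: the causal moving average, the cosine-based direction test, the window span, and the threshold are my choices for illustration, not Fujishima's published settings.

```python
import numpy as np

def smooth_chromagram(chromagram, span=4):
    """Average each frame with its preceding frames (causal smoothing)."""
    out = np.empty_like(chromagram)
    for t in range(len(chromagram)):
        out[t] = chromagram[max(0, t - span + 1):t + 1].mean(axis=0)
    return out

def change_points(chromagram, threshold=0.9):
    """Flag frames where the chroma direction turns sharply.

    The cosine similarity between consecutive chroma vectors stays
    near 1 while the chord is stable and drops at a chord change.
    """
    flags = []
    for t in range(1, len(chromagram)):
        a, b = chromagram[t - 1], chromagram[t]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cos = (a @ b) / denom if denom else 1.0
        if cos < threshold:
            flags.append(t)
    return flags
```

Smoothing only within a detected chord segment, rather than across change points, is what keeps the two postprocessing steps complementary.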