SLIDE 1

GCT535- Sound Technology for Multimedia Tonal Analysis

Graduate School of Culture Technology KAIST Juhan Nam

SLIDE 2

Outline

§ Pitch Perception

– Perceptual Pitch Scale
– Log-Scaled Spectrum

§ Tonal Analysis

– Chroma Feature
– Key Estimation
– Chord Recognition

SLIDE 3

Frequency Scale in Spectrogram

§ Linear frequency scale

– Great for seeing the harmonic structure of a single tone
– However, not the most intuitive way to visualize musical signals


Piano (Chromatic Scale) Beatles “Hey Jude”

[Figure: linear-frequency spectrograms, time in seconds vs. frequency in Hz (piano up to 4 kHz; "Hey Jude" up to 10 kHz)]

slide-4
SLIDE 4

Human Pitch Perception

§ Human ears are sensitive to frequency changes in a log scale

– Pitch resolution: the just noticeable difference (JND) increases as frequency goes up
– Place theory: resonance position along the basilar membrane in the cochlea


Response of the basilar membrane to a pair of tones

From CCRMA Music 150 slides (Thomas Rossing)

SLIDE 5

Critical Bandwidth

§ The frequency bandwidth within which one tone interferes with the perception of another tone through auditory masking

– Roughly constant at low frequencies but growing linearly with frequency at high frequencies

From CCRMA Music 150 slides (Thomas Rossing)

SLIDE 6

[Figure: normalized ERB, Mel, and Bark scales plotted against frequency up to 2.5 × 10^4 Hz]

Psychoacoustical Pitch Scales

§ Mel scale

– Based on the pitch ratio of tones (mel from "melody")

§ Bark scale

– Critical-band measurement by auditory masking

§ Equivalent Rectangular Bandwidth (ERB) rate

– Critical-band measurement using the notched-noise method

m = 2595 · log10(1 + f / 700)

Bark = 13 · arctan(0.00076 f) + 3.5 · arctan((f / 7500)^2)

ERBS = 21.4 · log10(1 + 0.00437 f)

Comparison of Pitch Scales: plotted using the Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/
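As a rough illustration (not part of the original slides), the three scale formulas can be sketched in Python; the Bark constant 0.00076 follows the commonly quoted approximation:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: based on perceived pitch ratios of pure tones."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale: critical-band rate from masking experiments."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    """ERB-rate scale: critical bands from the notched-noise method."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

# Normalize each scale to [0, 1] for a comparison like the slide's plot
f = np.linspace(0.0, 25000.0, 1000)
curves = {name: fn(f) / fn(f[-1])
          for name, fn in [("Mel", hz_to_mel), ("Bark", hz_to_bark),
                           ("ERB", hz_to_erbs)]}
```

All three grow roughly linearly at low frequencies and logarithmically above, which is why they look similar when normalized.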

SLIDE 7

Musical Pitch Scale

§ Equal temperament

– 1 : 2^(1/12) frequency ratio between two adjacent notes
– Music note number (m) and frequency (f) in Hz:

f = 440 · 2^((m − 69) / 12)

m = 12 · log2(f / 440) + 69

https://newt.phys.unsw.edu.au/jw/notes.html
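The two conversion formulas above can be sketched directly (a minimal Python illustration, not from the slides):

```python
import numpy as np

def note_to_hz(m):
    """MIDI note number -> frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_note(f):
    """Frequency in Hz -> (fractional) MIDI note number."""
    return 12.0 * np.log2(f / 440.0) + 69.0
```

For example, middle C (note 60) maps to about 261.63 Hz, and 880 Hz maps back to note 81 (A5).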

SLIDE 8


Frequency Mapping Using Spectrogram

§ Mapping linear scale to a perceptual (log-like) scale

– Locate center frequencies according to the frequency mapping
– Linearly interpolate around each center frequency with the corresponding bandwidth skirt


Log-Frequency Spectrogram vs. Linear-Frequency Spectrogram

SLIDE 9

§ The mapping can be formed as matrix multiplication

– Each column of the mapping matrix contains the interpolation coefficients

§ Limitation

– Simple, but the time-frequency resolution is still constrained by the STFT


Frequency Mapping Using Spectrogram


Y = M ⋅ X

(M: mapping matrix, X: spectrogram, Y: scaled spectrogram)

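A minimal sketch of building such a mapping matrix M (parameter values and the triangular one-semitone skirt are assumptions for illustration, not prescribed by the slides):

```python
import numpy as np

def note_mapping_matrix(sr=22050, n_fft=2048, note_min=21, note_max=108):
    """Build M so that Y = M @ X maps a linear-frequency spectrogram X
    onto MIDI-note bins. Each row holds interpolation coefficients: a
    triangular skirt (one semitone wide on each side) centered on the
    note's center frequency."""
    fft_freqs = np.arange(1 + n_fft // 2) * sr / n_fft
    notes = np.arange(note_min, note_max + 1)
    centers = 440.0 * 2.0 ** ((notes - 69) / 12.0)
    lo = centers * 2.0 ** (-1.0 / 12.0)   # lower skirt edge
    hi = centers * 2.0 ** (1.0 / 12.0)    # upper skirt edge
    rise = (fft_freqs[None, :] - lo[:, None]) / (centers - lo)[:, None]
    fall = (hi[:, None] - fft_freqs[None, :]) / (hi - centers)[:, None]
    return np.maximum(0.0, np.minimum(rise, fall))

M = note_mapping_matrix()
# Y = M @ X   # X: (n_fft//2 + 1, frames) -> Y: (88 notes, frames)
```

Note that at low frequencies several adjacent notes share the same few FFT bins, which is exactly the resolution limitation mentioned above.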

SLIDE 10

Mel-Frequency Spectrogram

§ The Mel scale is a popular choice

– Example: MFCC


Linear-Frequency Spectrogram Mel-Frequency Spectrogram


SLIDE 11

Constant-Q transform

§ Use a set of sinusoidal kernels with:

– Logarithmically spaced center frequencies
– Constant Q = frequency / bandwidth

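A minimal sketch of such kernels (the window choice, fmin, and bin count are illustrative assumptions; practical CQT implementations add many refinements):

```python
import numpy as np

def cqt_kernels(sr=22050, fmin=55.0, n_bins=48, bins_per_octave=12):
    """Sinusoidal analysis kernels with logarithmically spaced center
    frequencies and constant Q = frequency / bandwidth, so the kernel
    (window) length shrinks as the center frequency rises."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    kernels = []
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / bins_per_octave)
        n = int(round(Q * sr / fk))              # ~Q cycles per kernel
        t = np.arange(n) / sr
        kernels.append(np.hanning(n) * np.exp(2j * np.pi * fk * t) / n)
    return kernels

# One CQT frame = inner product of the signal with each kernel
```

Because every kernel spans the same number of cycles, frequency resolution is constant on a log scale, unlike the fixed-window STFT.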

SLIDE 12

Comparison of Different Time-Frequency Representations

§ Spectrogram (short window)
§ Spectrogram (long window)
§ Mel Spectrogram
§ Constant-Q Transform

[Figure: the four time-frequency representations, time on the horizontal axis and frequency on the vertical axis]

SLIDE 13

Example of Constant-Q transform


Log-Frequency Spectrogram (mapping) Log-Frequency Spectrogram (Constant-Q transform)


SLIDE 14

Chord Recognition in MIR

§ Identifying the chord progression of tonal music
§ A challenging task (even for humans)

– Chords are not explicit in the music
– Non-chord notes or passing notes
– Key changes and chromaticism require in-depth knowledge of music theory
– In audio, multiple musical instruments are mixed

  • Relevant: harmonically arranged notes
  • Irrelevant: percussive sounds (though they can help in detecting chord changes)

§ What kind of audio features can be extracted to recognize chords in a robust way?


SLIDE 15

Pitch Helix

§ The basic assumption in tonal harmony is that octave-distance notes belong to the same pitch class

– No dissonance among them
– As a result, there are 12 pitch classes

§ Shepard represented the octave equivalence with “pitch helix”

– Chroma: represents the inherent circularity of pitch organization
– Height: increases naturally, one octave per rotation of the helix


Pitch Helix and Chroma (Shepard, 2001)

SLIDE 16

Chroma

§ Chroma is independent of the height

– Shepard tone: a single pitch class across octave-spaced harmonics
– Creates the illusion of constantly rising or falling pitch

§ Chroma captures the relative distribution of pitch classes, while pitch height is a noisy variation for chord recognition

– Thus, chroma is considered to be well-suited for analyzing harmony.


Optical illusion stairs / Shepard tone: https://vimeo.com/34749558

SLIDE 17

Chroma Features

§ Chroma features are audio feature vectors that contain the chroma characteristics

– Ideally obtained by polyphonic note transcription, but that is too expensive
– In addition, as notes are more harmonized, separating polyphonic notes becomes harder

§ In practice, chroma features are obtained by projecting all time-frequency energy onto 12 pitch classes
§ Used not only for chord recognition but also for key estimation, segmentation, synchronization, and cover-song detection


SLIDE 18

Chroma Features: FFT-based approach

§ Compute spectrogram and mapping matrix

– Convert frequency to the musical pitch scale and take the pitch class
– Set one at the corresponding pitch class and zero elsewhere
– Adjust the non-zero values so that low-frequency content has more weight
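The steps above can be sketched as follows (a minimal illustration; the 1/f weighting is one assumed way to emphasize low-frequency content):

```python
import numpy as np

def chroma_from_spectrogram(S, sr, n_fft):
    """Project spectrogram energy onto 12 pitch classes.
    S: magnitude spectrogram, shape (1 + n_fft // 2, n_frames)."""
    freqs = np.arange(1, 1 + n_fft // 2) * sr / n_fft     # skip DC bin
    notes = np.round(12 * np.log2(freqs / 440.0) + 69).astype(int)
    pitch_class = np.mod(notes, 12)
    w = 1.0 / freqs          # low-frequency bins get more weight
    C = np.zeros((12, S.shape[1]))
    for pc in range(12):
        mask = pitch_class == pc
        C[pc] = (w[mask, None] * S[1:, :][mask]).sum(axis=0)
    return C
```

Feeding in a spectrogram of a pure A4 tone (~440 Hz) should concentrate the energy in pitch class 9 (A).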


SLIDE 19

Improvements

§ Blurring

– An intrinsic problem with the STFT
– Solution: find amplitude peaks and use only them

§ De-tuning

– Notes can deviate from the reference tuning
– Compute 36-bin chroma features: add two neighboring bins to each pitch class
– Use only the peak value among the three bins per pitch class

§ Normalization

– Divide each frame's chroma features by the local maximum or mean to regularize volume changes
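The de-tuning and normalization steps can be sketched as follows (the bin ordering of the 36-bin chroma is an assumption for illustration):

```python
import numpy as np

def fold_36_to_12(C36):
    """36-bin chroma (assumed ordered as 3 adjacent tuning bins per
    pitch class) -> 12 bins, keeping only the peak of each triple so
    a slightly detuned note still lands on its pitch class."""
    return C36.reshape(12, 3, -1).max(axis=1)

def normalize_chroma(C, eps=1e-8):
    """Divide each frame by its maximum to regularize volume changes."""
    return C / (C.max(axis=0, keepdims=True) + eps)
```

Taking the per-triple peak rather than the sum is what makes the feature robust to a consistent tuning offset.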


SLIDE 20

Chroma Features: Filter-bank approach

§ Alternatively, a filter-bank can be used to get a log-scale time-frequency representation

– Center frequencies are arranged over the 88 piano notes
– Bandwidths are set to have constant Q and to be robust to ±25 cents of detuning

§ The outputs that belong to the same pitch class are wrapped and summed.


(Müller, 2011)

SLIDE 21

Beat-Synchronous Chroma Features

§ Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)
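Given beat positions (from a beat tracker), beat synchronization amounts to averaging the chroma frames within each beat interval; a minimal sketch:

```python
import numpy as np

def beat_sync_chroma(C, beat_frames):
    """Average chroma frames within each beat interval so every beat
    gets one homogeneous chroma vector.
    C: (12, n_frames); beat_frames: frame indices of beat boundaries."""
    edges = np.concatenate(([0], beat_frames, [C.shape[1]]))
    return np.stack([C[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:]) if b > a],
                    axis=1)
```

Besides smoothing out transients, this also makes the feature sequence tempo-invariant, which helps the matching tasks listed earlier.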


(From Ellis’ slides)

SLIDE 22

Key Estimation Overview

§ Estimate music key from music data

– One of 24 keys: 12 pitch classes (C, C#, D, …, B) × major/minor

§ General framework (Gómez, 2006)

[Diagram: Chroma Features → Average → Similarity Measure against each Key Template → Key Strength → estimated key (e.g., G major)]

SLIDE 23

Key Template

§ Probe tone profile (Krumhansl and Kessler, 1982)

– Relative stability or weight of tones
– Listeners rated which tones best completed the first seven notes of a major scale

  • For example, in C major key, C, D, E, F, G, A, B, … what?


Probe Tone Profile - Relative Pitch Ranking

SLIDE 24

Key Estimation

§ Compute the similarity by cross-correlating the chroma features with each key template
§ Find the key that produces the maximum correlation
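A minimal sketch of this procedure, using the Krumhansl–Kessler probe-tone profiles as key templates (the profile values are the commonly quoted ones, included here as an assumption):

```python
import numpy as np

# Krumhansl–Kessler probe-tone profiles (major / minor)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(chroma_avg):
    """Correlate the averaged chroma with all 24 rotated key templates
    and return (tonic_pitch_class, 'major'|'minor') of the best match."""
    best, best_r = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            template = np.roll(profile, tonic)   # rotate to each tonic
            r = np.corrcoef(chroma_avg, template)[0, 1]
            if r > best_r:
                best, best_r = (tonic, mode), r
    return best
```

Using correlation (rather than a raw dot product) makes the measure invariant to the overall loudness of the piece.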


SLIDE 25

Chord Recognition

§ Estimate chords from music data

– Typically one of 24 chords: 12 root pitch classes + major/minor
– Often, diminished chords are added (36 chords)

§ General Framework

[Diagram: Audio → Transform → Chroma Features → Decision Making → Chords, where decision making uses either Chord Templates (template matching) or Models (HMM, SVM)]

SLIDE 26

Template-Based Approach

§ Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches
§ Chord Templates


(from Bello’s Slides)

SLIDE 27

Template-Based Approach

§ Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value in each frame


(from Bello’s Slides)
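A minimal sketch of the template-based approach with binary major/minor templates (normalized dot products stand in for cross-correlation; the naming scheme is an illustrative choice):

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary templates: major = (root, +4, +7), minor = (root, +3, +7)."""
    names, T = [], []
    for root in range(12):
        for quality, ivs in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in ivs]] = 1.0
            names.append(NOTE_NAMES[root] + quality)
            T.append(t)
    return names, np.array(T)

def recognize(chroma_frames):
    """Pick, per frame, the template with the highest similarity."""
    names, T = chord_templates()
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = chroma_frames / (np.linalg.norm(chroma_frames, axis=0,
                                         keepdims=True) + 1e-8)
    scores = Tn @ Cn                      # (24 chords, n_frames)
    return [names[i] for i in scores.argmax(axis=0)]
```

A frame containing only C, E, and G matches the C-major template perfectly, while A-minor and E-minor (which share two notes) score lower, which previews why the hard binary assignment discussed next is fragile.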

SLIDE 28

Limitations

§ The template approach is too simplistic

– The binary templates are hard assignments

§ Temporal dependency of chords is not considered

– The majority of tonal music has certain types of chord progressions

§ The recognized chords are not smooth

– Some post-processing (smoothing) is necessary


SLIDE 29

Demo

§ Chordify: https://chordify.net


SLIDE 30

References

§ P. R. Cook (Ed.), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
§ C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
§ M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
§ E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive versus Machine Learning Modelling Strategies", 2004
§ M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
§ T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999
