

SLIDE 1

GCT535- Sound Technology for Multimedia Pitch Analysis

Graduate School of Culture Technology KAIST Juhan Nam

SLIDE 2

Outlines

§ Introduction

– Definition of Pitch
– Information in Pitch

§ Monophonic Pitch Detection Algorithms

– Time-Domain Approaches
– Frequency-Domain Approaches
– Psychoacoustic Model Approaches

§ Pitch Tracking

§ Applications

SLIDE 3

Definition of Pitch

§ Pitch

– Defined as the auditory attribute of sound according to which sounds can be ordered on a scale from low to high (ANSI, 1994)
– One way of measuring pitch is finding the frequency of a sine wave that matches the target sound in a psychophysical experiment
– Thus pitch is subjective and can vary across listeners: e.g., tone-deafness

§ Fundamental Frequency

– Physical attribute of sound measured from its periodicity
– Often called F0

§ Pitch should be distinguished from F0:

– In practice, however, the two terms are used interchangeably.

SLIDE 4

Information in Pitch

§ Music

– Notes or melody
– Tonality (in polyphony)
– Size (or register) of musical instruments: bass, cello, violin

§ Speech

– Context (prosody): question, mood, attitude
– Speaker: gender, age, identity
– Meaning: lexical tone in tonal languages, e.g., Mandarin Chinese

§ Others

– Vocalization of animals (e.g. bird’s chirp, whale): size and types, communication

SLIDE 5

Pitch and Musical Instruments

§ Pitch is determined by the spectral characteristics of musical instruments

– Not all musical instruments have pitch

§ Type of musical Instruments by harmonicity

– Harmonic and steady: guitar, flute
– Harmonic and dynamic: violin, organ, singing voice (vowels)
– Inharmonic: piano, vibraphone
– Non-harmonic: drum, percussion, singing voice (consonants)


[Figure: inharmonicity in piano and vibraphone. From Klapuri's slides]

SLIDE 6

Pitch Detection Algorithms

§ Time-Domain Approaches

– Periodicity in time

§ Frequency-Domain Approaches

– Periodicity in frequency

§ Psychoacoustic Model Approaches

– Both time and frequency

[Figure: waveform (amplitude vs. time in ms) and spectrum (magnitude in dB vs. frequency in Hz) of a periodic sound]

SLIDE 7

Time-Domain Approach

§ Basic Ideas

– Periodicity: x(t) = x(t + T)
– Measure the similarity (or distance) between two adjacent segments
– Find the period T that gives the closest match

§ Two main approaches

– Auto-correlation function (ACF): distance by inner product
– Average magnitude difference function (AMDF): distance by difference (e.g., L1 or L2 norm)

SLIDE 8

Auto-Correlation Function (ACF)

§ Measuring self-similarity by

r_t(l) = \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1

[Figure: singing-voice waveform (time in samples) and its auto-correlation (lag in samples) (Sondhi, 1968)]
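As a sketch, the ACF above can be computed directly and used to pick the pitch period; the test signal, sample rate, and peak-picking rule below are illustrative, not part of the slide:

```python
import numpy as np

# Illustrative signal: 200 Hz fundamental plus its 2nd harmonic at fs = 8 kHz.
fs = 8000
f0 = 200.0
n = np.arange(1024)
x = np.sin(2 * np.pi * f0 * n / fs) + 0.5 * np.sin(2 * np.pi * 2 * f0 * n / fs)

def acf(x, max_lag):
    # r(l) = sum_{n=0}^{N-1-l} x(n) * x(n+l)
    return np.array([np.dot(x[: len(x) - l], x[l:]) for l in range(max_lag)])

r = acf(x, 400)
# Skip the zero-lag lobe by searching only past the first negative value,
# then take the highest remaining peak as the period.
start = int(np.argmax(r < 0))
period = start + int(np.argmax(r[start:]))
print(fs / period)  # ≈ 200.0 Hz
```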

SLIDE 9

Auto-Correlation Function (ACF)

§ Biased auto-correlation

§ Unbiased auto-correlation

r_{\text{biased},t}(l) = \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1

r_{\text{unbiased},t}(l) = \frac{1}{N-l} \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1

[Figure: unbiased auto-correlation of the singing-voice example (lag in samples)]
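A minimal numerical sketch of the two normalizations (signal and lag range are illustrative): the raw (biased) sum shrinks with lag because fewer terms overlap, while dividing by N − l keeps periodic peaks at comparable height.

```python
import numpy as np

N = 512
x = np.sin(2 * np.pi * 32 * np.arange(N) / N)    # period of 16 samples

def acf(x, max_lag):
    return np.array([np.dot(x[: len(x) - l], x[l:]) for l in range(max_lag)])

r_biased = acf(x, 64)                            # raw sum: tapers as l grows
r_unbiased = r_biased / (N - np.arange(64))      # divide by the number of terms

# Peaks fall at lags 16, 32, 48; the biased peak at lag 48 is smaller than
# the one at lag 16, while the unbiased peaks stay at the same height.
print(r_biased[48] < r_biased[16])      # True
print(r_unbiased[16], r_unbiased[48])   # both ≈ 0.5
```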

SLIDE 10

Pitch Detection by ACF


[Figure: spectrogram and ACF over time, each with maximum values tracked]

SLIDE 11

Interpretation of ACF in Frequency Domain

§ By the convolution theorem, the auto-correlation can be computed in the frequency domain, and efficiently using the FFT

§ Thus, the ACF can be computed as

\sum_{n=0}^{N-1-l} x(n) \, x(n+l) = \text{FFT}^{-1}\big(X(k)\,X^{*}(k)\big) = \text{FFT}^{-1}\big(|X(k)|^2\big)

r(l) = \frac{1}{N-l} \, \text{real}\Big(\text{FFT}^{-1}\big(|X(k)|^2\big)\Big), \qquad X(k) = \text{FFT}\big(x(n)\big)
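The identity above can be checked numerically; note that obtaining the linear (non-circular) ACF this way requires zero-padding the FFT to at least 2N, a detail the sketch below makes explicit:

```python
import numpy as np

N = 1000
x = np.random.default_rng(0).standard_normal(N)

# Frequency-domain route: r = IFFT(|X|^2), zero-padded to avoid circular wrap.
X = np.fft.fft(x, 2 * N)
r_fft = np.fft.ifft(np.abs(X) ** 2).real[:N]

# Direct O(N^2) reference: r(l) = sum_n x(n) x(n+l)
r_direct = np.array([np.dot(x[: N - l], x[l:]) for l in range(N)])
print(np.allclose(r_fft, r_direct))  # True
```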

SLIDE 12

Interpretation of ACF in Frequency Domain

§ This is equivalent to

r(l) = \frac{1}{N-l} \sum_{k=0}^{K-1} \cos\!\left(\frac{2\pi l k}{K}\right) |X(k)|^2

§ The ACF is thus a simple template-based approach in the frequency domain

– Positive weights for (harmonic) peaks and negative weights for valleys

[Figure: power spectrum with the cosine weighting template overlaid (magnitude vs. frequency bin)]

SLIDE 13

Problems in ACF

§ Biased toward the large peak around zero lag

§ Not robust to octave errors, particularly lower octaves

– ACF is sensitive to amplitude changes

§ Equal weights for all harmonic partials

– In general, low-numbered harmonic partials are more important in determining pitch

SLIDE 14

Average Magnitude Difference Function (AMDF)

§ Measuring self-similarity by

d_t(l) = \sum_{n=0}^{N-1-l} \big| x_t(n) - x_t(n+l) \big|^{p}, \quad l = 0, 1, 2, \ldots, L-1

§ In YIN, p is set to 2:

d_t(l) = \sum_{n=0}^{N-1-l} \big(x_t(n) - x_t(n+l)\big)^2 = \sum_{n=0}^{N-1-l} x_t(n)^2 - 2\,x_t(n)\,x_t(n+l) + x_t(n+l)^2 = r_t(0) - 2\,r_t(l) + r_{t+l}(0)

– Minimizing d_t amounts to minimizing the negative ACF plus a lag-dependent term (de Cheveigné & Kawahara, 2002)

§ And the AMDF is normalized as

\hat{d}_t(l) = \begin{cases} 1 & l = 0 \\ d_t(l) \Big/ \Big[ \frac{1}{l} \sum_{u=1}^{l} d_t(u) \Big] & \text{otherwise} \end{cases}
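The difference function and its cumulative-mean normalization can be sketched as below; the signal, frame length, and threshold value (0.1) are illustrative, and the descent to the local minimum follows YIN's absolute-threshold rule:

```python
import numpy as np

fs = 8000
f0 = 160.0                                    # true period = fs/f0 = 50 samples
n = np.arange(2048)
x = np.sin(2 * np.pi * f0 * n / fs) + 0.3 * np.sin(2 * np.pi * 3 * f0 * n / fs)

# Difference function d(l) with p = 2.
L = 400
d = np.array([np.sum((x[: len(x) - l] - x[l:]) ** 2) for l in range(L)])

# Cumulative-mean normalization: d'(0) = 1, d'(l) = d(l) / ((1/l) sum_{u<=l} d(u)).
d_norm = np.ones(L)
d_norm[1:] = d[1:] * np.arange(1, L) / np.cumsum(d[1:])

tau = int(np.argmax(d_norm < 0.1))            # first lag under a fixed threshold
while tau + 1 < L and d_norm[tau + 1] < d_norm[tau]:
    tau += 1                                  # descend to the local minimum
print(fs / tau)  # ≈ 160 Hz
```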

SLIDE 15

Average Magnitude Difference Function (AMDF)


[Figure: AMDF and normalized AMDF of the singing-voice example]

SLIDE 16

Why YIN (AMDF) works better


§ Robust to changes in amplitude

– The difference (instead of the correlation) absorbs amplitude changes
– This reduces octave errors

§ The zero-lag bias is avoided by the normalized AMDF

§ The normalized AMDF allows using a fixed threshold

– Can choose multiple candidates and refine peaks

SLIDE 17

Example of AMDF (YIN)


SLIDE 18

Frequency-Domain Approach

§ Basic Ideas

– Periodic in the time domain → harmonic in the frequency domain
– Measure how harmonic the spectrum is
– Find the F0 that best explains the harmonic pattern (harmonic partials)

§ Algorithms

– Pattern Matching
– Cepstrum
– Harmonic Product Spectrum (HPS)

SLIDE 19

Pattern Matching: Comb-filtering

§ Use sharp harmonic sieves that pass only the harmonic peak regions

– Compute pitch saliency for F0 candidates


(Puckette et al. 1998)

SLIDE 20

Pattern Matching: Cross-correlation

§ Cross-correlation with an ideal template on a log-scale spectrogram


[From Ellis’ e4896 course slides]

SLIDE 21

[Figure: magnitude spectrum (dB vs. frequency in Hz) and its cepstrum (vs. quefrency)]

Cepstrum

§ The real cepstrum is defined as

c_x(l) = \text{real}\Big(\text{FFT}^{-1}\big(\log |\text{FFT}(x)|\big)\Big)

§ Basic ideas

– Harmonic partials are periodic in the frequency domain
– The inverse FFT finds this periodicity
– Liftering selects the quefrency range of interest

(Noll, 1967)
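A sketch of cepstral pitch detection (the harmonic test signal and the quefrency search range are illustrative):

```python
import numpy as np

fs = 8000
f0 = 200.0                                     # period = 40 samples
N = 2000                                       # chosen so harmonics fall on bins
n = np.arange(N)
x = sum(np.sin(2 * np.pi * k * f0 * n / fs) for k in range(1, 8))

# Real cepstrum: c = real(IFFT(log |FFT(x)|)).
spec = np.abs(np.fft.fft(x * np.hanning(N)))
ceps = np.fft.ifft(np.log(spec + 1e-12)).real

# Lifter: search a plausible quefrency range (here 20-400 samples) for the
# first rahmonic peak; its quefrency is the pitch period.
q = 20 + int(np.argmax(ceps[20:400]))
print(fs / q)  # ≈ 200 Hz
```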

SLIDE 22

Harmonic Product Spectrum (HPS)

§ The Harmonic Product Spectrum (HPS) is obtained by multiplying the magnitude spectrum with copies of itself decimated by integer factors

\text{HPS}(k) = \prod_{m=1}^{M} |X(mk)|

(Noll, 1969)
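A minimal HPS sketch (the signal, M = 4, and the FFT size are illustrative); decimating the magnitude spectrum by m maps the m-th harmonic onto the F0 bin, so the product peaks at the fundamental:

```python
import numpy as np

fs = 8000
f0 = 250.0
N = 4096                                       # f0 falls exactly on bin 128
n = np.arange(N)
x = sum((0.8 ** k) * np.sin(2 * np.pi * k * f0 * n / fs) for k in range(1, 6))

X = np.abs(np.fft.rfft(x * np.hanning(N)))
M = 4
K = len(X) // M
hps = np.ones(K)
for m in range(1, M + 1):
    hps *= X[::m][:K]                          # X decimated by factor m

k0 = int(np.argmax(hps))
print(k0 * fs / N)  # ≈ 250 Hz
```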

SLIDE 23

Auditory Filter Bank

§ A filter bank that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea

§ Correlogram

– Formed by concatenating the ACFs of the individual hair-cell (HC) outputs
– A 3-D representation (time-channel-lag), also called "auditory images"


[Diagram: cochlear filter bank (oval-window input, high- to low-frequency channels) → hair-cell (HC) models → per-channel auto-correlation functions → stabilized and combined into the summary ACF; the stacked per-channel ACFs form the correlogram]

SLIDE 24

Types of Auditory Filter Banks

§ Gamma-tone Filter banks

– Gamma-tone impulse response:
– Used in Patterson's auditory filter bank, with bandwidths based on the ERB scale

§ Pole-Zero Filter Cascade (Lyon)

g(t) = a \, t^{\,n-1} e^{-2\pi b t} \cos(2\pi f t + \varphi) \, u(t)
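The impulse response above can be generated directly; the order, center frequency, and the ERB-based bandwidth constants below are illustrative values in the style of Patterson's filter bank:

```python
import numpy as np

fs = 16000
f = 1000.0                         # center frequency [Hz]
order = 4                          # n in the formula
erb = 24.7 + 0.108 * f             # ERB(f) approximation (Glasberg & Moore)
b = 1.019 * erb                    # bandwidth parameter
phi = 0.0

t = np.arange(int(0.05 * fs)) / fs                      # 50 ms of support
g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)
g /= np.max(np.abs(g))             # normalize peak amplitude (a is arbitrary)
# The magnitude response of g peaks near f, with an ERB-scaled bandwidth.
```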

SLIDE 25

Hair-Cell

§ (Inner) Hair-cell

– Transform mechanical movement into neural spikes

§ Modeled as cascade of

– Half-wave rectification
– Compression
– Low-pass filtering

§ This processing is non-linear

– Generates new harmonic partials
– Associated with the perception of missing fundamentals
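The three-stage cascade can be sketched as below (the cutoff frequency and compression exponent are illustrative); the rectification is what creates the DC term and the new harmonics associated with missing-fundamental pitch:

```python
import numpy as np

fs = 16000
n = np.arange(1024)
y = np.sin(2 * np.pi * 400 * n / fs)      # a band-passed "channel" signal

h = np.maximum(y, 0.0)                    # 1) half-wave rectification
h = h ** 0.3                              # 2) static compression
a = np.exp(-2 * np.pi * 1000.0 / fs)      # 3) one-pole low-pass, ~1 kHz cutoff
out = np.zeros_like(h)
prev = 0.0
for i, v in enumerate(h):
    prev = (1 - a) * v + a * prev
    out[i] = prev

# The input averages to ~0, but the rectified output has a positive mean:
print(np.mean(y), np.mean(out))
```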


SLIDE 26

Pitch Analysis Using Auditory Model


Summary ACF

§ Summary ACF is computed by summing the ACF across all channels

– The peaks in the summary ACF represent periodicity features
– This is known to be robust to band-limited noise
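A toy sketch of the summary ACF (two hand-made "channels" stand in for rectified filter-bank outputs; the frequencies and lag range are illustrative): each channel carries a different harmonic of 200 Hz, yet the summed ACF peaks at the common period.

```python
import numpy as np

def acf(x, max_lag):
    return np.array([np.dot(x[: len(x) - l], x[l:]) for l in range(max_lag)])

fs = 12000
n = np.arange(2048)
# Rectified harmonics of a 200 Hz fundamental that is itself absent.
ch1 = np.maximum(np.sin(2 * np.pi * 400 * n / fs), 0)   # period 30 samples
ch2 = np.maximum(np.sin(2 * np.pi * 600 * n / fs), 0)   # period 20 samples

sacf = acf(ch1, 200) + acf(ch2, 200)
start = 10                                  # skip the zero-lag lobe
lag = start + int(np.argmax(sacf[start:]))
print(fs / lag)  # ≈ 200 Hz: the missing fundamental
```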

SLIDE 27

Pitch Tracking

§ Pitch is usually continuous over time

– Once a pitch with strong harmonicity is detected in a frame, the following frames tend to form a smooth pitch contour

§ Pitch tracking methods

– Post-processing: first detect pitch in a frame-by-frame manner, then find a continuous path by smoothing.

  • Median Filtering
  • Dynamic Programming (Talkin, 1995)

– Probabilistic approach: detect multiple pitch candidates in every frame and find the best path

  • Viterbi-decoding: Probabilistic YIN (Mauch, 2014)
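The median-filter post-processing step can be sketched as follows (the toy pitch track with two octave errors is illustrative):

```python
import numpy as np

# Frame-wise F0 estimates [Hz] with one octave-up and one octave-down error.
track = np.array([200.0, 201.0, 199.0, 400.0, 200.0, 202.0, 100.0, 201.0, 200.0])

def median_smooth(f0, width=3):
    half = width // 2
    padded = np.pad(f0, half, mode="edge")   # repeat edge frames
    return np.array([np.median(padded[i : i + width]) for i in range(len(f0))])

smoothed = median_smooth(track)
print(smoothed)  # the 400 Hz and 100 Hz outliers are replaced by neighbors
```

A wider window suppresses longer error bursts at the cost of smearing fast pitch changes, which is one reason dynamic programming or Viterbi decoding is preferred when multiple candidates per frame are available.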


SLIDE 28

Applications

§ Sound Modification

– Time-stretching using PSOLA
– Auto-tune: pitch correction or the "T-Pain effect"

§ Music Performance

– Tuning musical instruments
– Pitch-based sound control
– Score-following and auto-accompaniment

§ Query-by-humming

– Relative pitch change might be more important

§ Singing evaluation (e.g. karaoke) and visualization


SLIDE 29

References

§ A. de Cheveigné and H. Kawahara, "YIN, a Fundamental Frequency Estimator for Speech and Music," 2002.
§ A. M. Noll, "Cepstrum Pitch Determination," 1967.
§ A. M. Noll, "Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a Maximum Likelihood Estimate," 1969.
§ M. Puckette, T. Apel, and D. Zicarelli, "Real-time Audio Analysis Tools for Pd and MSP," 1998.
§ M. Sondhi, "New Methods of Pitch Extraction," 1968.
§ D. Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)," 1995.
§ M. Mauch and S. Dixon, "pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions," 2014.
