GCT535 Sound Technology for Multimedia: Timbre Analysis



slide-1
SLIDE 1

GCT535 Sound Technology for Multimedia: Timbre Analysis

Graduate School of Culture Technology, KAIST
Juhan Nam

slide-2
SLIDE 2

Outline

§ Timbre Analysis

– Definition of Timbre
– Timbre Features

  • Zero-crossing rate
  • Spectral summary features
  • Mel-Frequency Cepstral Coefficient (MFCC)

slide-3
SLIDE 3

What is timbre?

§ Definition

– Attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar (ANSI)
– Tone color or quality that defines a particular sound

§ Associated with classifying or identifying sound sources

– Class: piano, guitar, singing voice, engine sound
– Identity: Steinway Model D, Fender Stratocaster, Michael Jackson, Harley-Davidson

§ Also used to holistically describe polyphonic sounds

– For example, music or environmental sounds
– Associated with genre, mood or other high-level descriptions

slide-4
SLIDE 4

What is timbre?

§ Timbre is a very vague concept

– There is no single quantitative scale like loudness or pitch.
– There are actually multiple attributes.

§ Different aspects of the multiplicity

– Acoustic attributes: temporal or spectral factors
– Timbre space: perceptual similarity/dissimilarity
– Semantic attributes: textual descriptions

slide-5
SLIDE 5

Acoustic Attributes in Timbre Perception

§ Acoustic Attributes (Schouten, 1968)

– Harmonicity: the range between tonal and noise-like character
– Time envelope (ADSR)
– Spectral envelope
– Changes of spectral envelope and fundamental frequency
– The onset of a sound differing notably from the sustained vibration

[Figures: ADSR time envelope; changes of spectral envelope]

slide-6
SLIDE 6

Acoustic Attributes in Timbre Perception

§ Sound design problem?


slide-7
SLIDE 7

Timbre Space

§ Perceptual multi-dimensional attributes based on measuring similarity

– Ask listeners to hear pairs of sounds and rate the similarity of each pair with a score
– The similarity matrix is processed using multi-dimensional scaling (MDS), a dimensionality reduction algorithm which determines the timbre space

§ Acoustic correlates of the three (reduced) dimensions

– Spectral energy distribution
– Attack and decay time
– Amount of inharmonic sound in the attack

(Grey, 1977)
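As a rough illustration of the MDS step, the sketch below applies classical (Torgerson) MDS to a small dissimilarity matrix with NumPy. This is only one MDS variant and not necessarily the procedure used in the cited study; the function name `classical_mds` and the toy data are assumptions made for this example. Similarity scores would first be converted to dissimilarities (for example, maximum score minus score).

```python
import numpy as np

def classical_mds(dissim, n_dims=3):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Eigen-decompose and keep the n_dims largest components.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data (hypothetical): 5 sounds placed in a hidden 3-D space.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 3))
dissim = np.linalg.norm(hidden[:, None, :] - hidden[None, :, :], axis=-1)

coords = classical_mds(dissim, n_dims=3)
# Pairwise distances in the recovered space, for comparison with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates are only defined up to rotation and reflection, so the axes have to be interpreted afterwards by correlating them with acoustic measurements.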

slide-8
SLIDE 8

Semantic attributes

§ Describe different characteristics of timbre in words

Dull ______|______ Brilliant
Cold ______|______ Warm
Pure ______|______ Rich

(Pratt and Doak, 1976)

Dull ______|______ Sharp
Compact ______|______ Scattered
Full ______|______ Empty
Colorful ______|______ Colorless

(von Bismarck, 1974)

(T. Rossing’s music150 slides)

slide-9
SLIDE 9

Timbre Feature Extraction

§ Extracting acoustic features from signals
§ Low-level Acoustic Features

– Zero-crossing rate
– Spectral summaries
– Spectral envelope: MFCC

slide-10
SLIDE 10

Zero-Crossing Rate (ZCR)

§ ZCR is low for harmonic (voiced) sounds and high for noisy (unvoiced) sounds
§ For simple periodic signals, it is related to the F0

[Figure: waveforms of a voiced and an unvoiced sound]
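A minimal sketch of the zero-crossing rate, assuming it is measured as the fraction of adjacent sample pairs whose signs differ. The sampling rate, the test signals and the function name `zero_crossing_rate` are assumptions for illustration only.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(np.asarray(frame, dtype=float))
    return np.mean(signs[1:] != signs[:-1])

sr = 8000                      # assumed sampling rate in Hz
t = np.arange(sr) / sr         # one second of sample times
tone = np.sin(2 * np.pi * 100 * t + 0.3)               # 100 Hz sine (periodic)
noise = np.random.default_rng(0).standard_normal(sr)   # white noise

zcr_tone = zero_crossing_rate(tone)
zcr_noise = zero_crossing_rate(noise)

# A sine crosses zero twice per period, so F0 is roughly ZCR * sr / 2.
f0_estimate = zcr_tone * sr / 2
```

The F0 relation only holds for simple waveforms; a signal with strong upper partials can cross zero several times per period.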

slide-11
SLIDE 11

Spectral Summary Features

§ Spectral Centroid: “Center of gravity” of the spectrum

– Associated with the brightness of sounds

§ Spectral Roll-off: the frequency below which 85% or 95% of the spectral energy is concentrated

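Spectral centroid:

$$SC(t) = \frac{\sum_{k} f_k X_t(k)}{\sum_{k} X_t(k)}$$

Spectral roll-off $R_t$ (85% case):

$$\sum_{k=1}^{R_t} X_t(k) = 0.85 \sum_{k=1}^{N} X_t(k)$$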
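The two definitions above can be sketched in NumPy as follows. This assumes X_t(k) is the magnitude spectrum of one windowed frame and f_k is the center frequency of bin k; the frame length, sampling rate and test signal are hypothetical.

```python
import numpy as np

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency of one frame."""
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Lowest bin frequency where the cumulative magnitude reaches `fraction` of the total."""
    cumulative = np.cumsum(mag)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]

sr, n_fft = 16000, 1024        # assumed sampling rate and frame length
t = np.arange(n_fft) / sr
# Test frame: a 500 Hz partial plus a weaker 4000 Hz partial.
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 4000 * t)

mag = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)

sc = spectral_centroid(mag, freqs)      # lies between the two partials
ro = spectral_rolloff(mag, freqs, 0.85)
```

Raising the level of the upper partial pulls the centroid upward, which matches its association with brightness.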

slide-12
SLIDE 12

Spectral Summary Features

§ Spectral Spread (SS): a measure of the bandwidth of the spectrum
§ Spectral Flatness (SF): a measure of the noisiness of the spectrum

– The ratio between the geometric and arithmetic means
– Examples: white noise → 1, pure tone → 0

$$SS(t) = \frac{\sum_{k} (f_k - SC(t))^2 X_t(k)}{\sum_{k} X_t(k)}$$

$$SF(t) = \frac{\left( \prod_{k} X_t(k) \right)^{1/K}}{\frac{1}{K} \sum_{k} X_t(k)}$$
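A small sketch of spread and flatness on idealized spectra, again assuming X_t(k) is a magnitude spectrum. The spread is computed here as a weighted variance (in Hz squared); taking its square root gives a bandwidth in Hz. The `eps` offset and the two test spectra are assumptions for illustration.

```python
import numpy as np

def spectral_spread(mag, freqs):
    """Magnitude-weighted variance of frequency around the centroid (Hz^2)."""
    centroid = np.sum(freqs * mag) / np.sum(mag)
    return np.sum((freqs - centroid) ** 2 * mag) / np.sum(mag)

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean of the magnitudes divided by their arithmetic mean."""
    mag = np.asarray(mag, dtype=float) + eps   # eps avoids log(0); an assumption
    geometric = np.exp(np.mean(np.log(mag)))   # K-th root of the product
    return geometric / np.mean(mag)

freqs = np.linspace(0, 8000, 513)   # hypothetical bin frequencies
flat = np.ones(513)                 # idealized white-noise spectrum
peak = np.zeros(513)
peak[64] = 1.0                      # idealized pure tone at 1000 Hz

sf_flat = spectral_flatness(flat)   # close to 1
sf_peak = spectral_flatness(peak)   # close to 0
ss_flat = spectral_spread(flat, freqs)
ss_peak = spectral_spread(peak, freqs)
```

Computing the geometric mean through the mean of logarithms avoids the numerical underflow that a direct product over hundreds of bins would cause.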

slide-13
SLIDE 13

Examples of Spectral Centroids

[Figure: two plots of frequency [Hz] (up to 10000) against time [sec] (up to 4.5). Classical: “Beethoven String Quartet”; Pop: “Video killed the radio star”]

slide-14
SLIDE 14

Mel-Frequency Cepstral Coefficient (MFCC)

§ The most widely used audio feature for extracting the spectral envelope from an audio frame

– Standard audio feature in speech recognition
– Introduced in the music domain by Logan in 2000

§ Computation Steps

DFT (audio frame) → Mapping freq. scale to mel → Log magnitude → DCT

slide-15
SLIDE 15

Mel-Frequency Spectrogram

§ Convert linear frequency to mel scale
§ Usually reduces the dimensionality of the spectrum

[Figure: spectrum and mel-scaled spectrum]
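One common way to carry out this mapping is a bank of triangular filters spaced evenly on the mel scale. The sketch below assumes the widely used formula mel = 2595 log10(1 + f/700); the slides do not state which mel formula or filter shape was used, and the sampling rate, frame length and function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

sr, n_fft, n_mels = 22050, 1024, 60    # assumed analysis settings
fb = mel_filterbank(n_mels, n_fft, sr)

frame = np.random.default_rng(0).standard_normal(n_fft)   # stand-in audio frame
spectrum = np.abs(np.fft.rfft(frame))
mel_spectrum = fb @ spectrum           # 513 linear bins -> 60 mel bands
```

Because the filters widen with frequency, the mel spectrum keeps fine resolution at low frequencies and averages over many bins at high frequencies.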

slide-16
SLIDE 16

Discrete Cosine Transform

§ Real-valued transform: similar to DFT

– De-correlate the mel-scaled log spectrum and reduce the dimensionality again

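[Figure: mel-scaled spectrum and the resulting MFCC]

$$X_{DCT}(k) = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x(n) \cos\left( \frac{\pi k}{N} (n - 0.5) \right), \quad k = 0, \dots, N-1$$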
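The sketch below builds a DCT matrix of this form, keeps the first 13 coefficients of a 60-band log mel spectrum, and inverts the truncated coefficients to obtain a smoothed spectrum. The additional scaling of row 0 (to make the matrix orthonormal) and the synthetic log mel spectrum are assumptions for illustration.

```python
import numpy as np

def dct_matrix(size):
    """DCT-II matrix: row k holds sqrt(2/N) * cos(pi * k / N * (n - 0.5)), n = 1..N."""
    k = np.arange(size)[:, None]
    n = np.arange(1, size + 1)[None, :]
    c = np.sqrt(2.0 / size) * np.cos(np.pi * k / size * (n - 0.5))
    c[0] /= np.sqrt(2.0)   # extra scaling of row 0 (an assumption) makes c orthonormal
    return c

n_mels, n_mfcc = 60, 13
# Hypothetical smooth log mel spectrum standing in for a real audio frame.
band = np.arange(n_mels)
log_mel = -0.05 * band + np.cos(2 * np.pi * band / n_mels)

c = dct_matrix(n_mels)
mfcc = (c @ log_mel)[:n_mfcc]           # keep only the first 13 coefficients
reconstructed = c[:n_mfcc].T @ mfcc     # inverse DCT from the truncated coefficients
error = np.max(np.abs(reconstructed - log_mel))
```

Dropping the higher coefficients removes fine detail such as individual harmonics while keeping the slowly varying spectral envelope.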

slide-17
SLIDE 17

Reconstructed Frequency Spectrum from MFCC

[Figure panels: Frequency spectrum (512 bins); Frequency spectrum (mel-scaled, 60 bins); MFCC (13 dim); Reconstructed frequency spectrum (mel-scaled); Reconstructed frequency spectrum]

slide-18
SLIDE 18

Comparison of Spectrogram and MFCC

[Figure panels: Spectrogram; Mel-frequency spectrogram; MFCC; Reconstructed spectrogram from MFCC]

slide-19
SLIDE 19

Sound Examples of MFCC

§ Original: [audio example]
§ MFCC reconstruction (using white noise as a source): [audio example]

slide-20
SLIDE 20

Post-processing

§ Adding temporal dynamics

– Short-term dynamics of features are characterized with delta or double-delta
– 39 MFCCs in speech recognition: 13 MFCCs + 13 delta + 13 double-delta

§ Normalization

– Cepstral Mean Subtraction (CMS): subtract the mean over surrounding frames
– Standardization: subtract the mean and divide by the standard deviation

Delta and double-delta:

$$\Delta x(n) = \frac{x(n) - x(n-h)}{h}, \qquad \Delta\Delta x(n) = \frac{\Delta x(n) - \Delta x(n-h)}{h}$$
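A minimal sketch of these post-processing steps on a frames-by-coefficients matrix. It assumes the first frame is repeated to fill the missing history, takes the CMS mean over all frames rather than a sliding window, and divides by the standard deviation when standardizing (the usual convention). The MFCC matrix is random stand-in data.

```python
import numpy as np

def delta(features, h=1):
    """Backward difference (x[n] - x[n-h]) / h along axis 0 (time)."""
    x = np.asarray(features, dtype=float)
    # Repeat the first frame h times so the output keeps its length (an assumption).
    padded = np.concatenate([np.repeat(x[:1], h, axis=0), x], axis=0)
    return (padded[h:] - padded[:-h]) / h

def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean, taken here over all frames."""
    x = np.asarray(features, dtype=float)
    return x - x.mean(axis=0, keepdims=True)

def standardize(features, eps=1e-12):
    """Zero mean and unit standard deviation for each coefficient."""
    x = np.asarray(features, dtype=float)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

# Hypothetical MFCC matrix: 100 frames x 13 coefficients.
mfcc = np.random.default_rng(0).standard_normal((100, 13)) + 5.0

d1 = delta(mfcc, h=2)                  # delta
d2 = delta(d1, h=2)                    # double-delta
stacked = np.hstack([mfcc, d1, d2])    # 13 + 13 + 13 = 39 dimensions
centered = cepstral_mean_subtraction(mfcc)
normalized = standardize(stacked)
```

In practice delta features are often computed with a regression over several neighboring frames, which is less sensitive to noise than a single difference.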

slide-21
SLIDE 21

Applications

§ Music

– Musical instrument classification
– Music genre/mood classification
– Similarity-based audio retrieval

§ Speech

– Speech recognition
– Speaker recognition

slide-22
SLIDE 22

References

§ J. Grey, “Multidimensional Perceptual Scaling of Musical Timbre”, 1977
§ D. Wessel, “Timbre Space as a Musical Control Structure”, 1979
§ S. Donnadieu, “Mental Representation of the Timbre of Complex Sounds”, book chapter (ch. 8) in “Analysis, Synthesis and Perception of Musical Sounds”, ed. J. Beauchamp, 2007
§ B. Logan, “Mel Frequency Cepstral Coefficients for Music Modeling”, 2000