
SLIDE 1

E6820 SAPR - Dan Ellis L06 - Music A & S 2002-03-04 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 6: Music analysis and synthesis

1. Music and nonspeech
2. Music synthesis techniques
3. Sinewave synthesis
4. Music analysis
5. Transcription

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/

SLIDE 2

Music & nonspeech

  • What is ‘nonspeech’?
  • according to research effort: a little music
  • in the world: most everything

attributes?


[Diagram: sounds arranged on two axes. Origin: natural to man-made; Information content: low to high. Examples: wind & water, animal sounds, speech, music, machines & engines, contact/collision sounds]

SLIDE 3

Sound attributes

  • Attributes suggest model parameters
  • What do we notice about ‘general’ sound?
  • psychophysics: pitch, loudness, ‘timbre’
  • bright/dull; sharp/soft; grating/soothing
  • sound is not ‘abstract’: the tendency is to describe it by source-events

  • Ecological perspective
  • what matters about sound is ‘what happened’

  • our percepts express this more-or-less directly
SLIDE 4

Aside: Sound textures

  • What do we hear in:
  • a city street
  • a symphony orchestra
  • How do we distinguish:
  • waterfall
  • rainfall
  • applause
  • static
  • Levels of ecological description...

[Spectrograms: Applause04 and Rain01, 0-5 kHz over 4 s]

SLIDE 5

Motivations for modeling

  • Describe/classify
  • cast a sound into a model because we want to use the resulting parameters

  • Store/transmit
  • the model implicitly exploits the limited structure of the signal

  • Resynthesize/modify
  • model separates out interesting parameters

[Diagram: sound mapped into a model parameter space]

SLIDE 6

Analysis and synthesis

  • Analysis is the converse of synthesis:
  • Can exist apart:
  • analysis for classification
  • synthesis of artificial sounds
  • Often used together:
  • encoding/decoding of compressed formats
  • resynthesis based on analyses
  • analysis-by-synthesis

[Diagram: Sound → Analysis → Model / representation → Synthesis → Sound]

SLIDE 7

Outline

1. Music and nonspeech
2. Music synthesis techniques
  • Framework
  • Historical development
3. Sinewave synthesis
4. Music analysis
5. Transcription

elements?

SLIDE 8

Music synthesis techniques

  • What is music?
  • could be anything

→ flexible synthesis needed!

  • Key elements of conventional music:
  • instruments → note-events (time, pitch, accent level) → melody, harmony, rhythm
  • patterns of repetition & variation
  • Synthesis framework:
  • instruments: a common framework for many notes
  • score: a sequence of (time, pitch, level) note events


SLIDE 9

The nature of musical instrument notes

  • Characterized by instrument (register), note, loudness (emphasis), articulation... distinguish how?

[Spectrograms: single notes on Piano, Violin, Clarinet, and Trumpet, 0-4 kHz]

SLIDE 10

Development of music synthesis

  • Goals of music synthesis:
  • generate realistic / pleasant new notes
  • control / explore timbre (quality)
  • Earliest computer systems in 1960s

(voice synthesis, algorithmic)

  • Pure synthesis approaches:
  • 1970s: analog synths
  • 1980s: FM (Stanford/Yamaha)
  • 1990s: physical modeling, hybrids

  • Analysis-synthesis methods:
  • sampling / wavetables
  • sinusoid modeling
  • harmonics + noise (+ transients)
  • others?
SLIDE 11

Analog synthesis

  • The minimum to make an ‘interesting’ sound
  • Elements:
  • harmonics-rich oscillators
  • time-varying filters
  • time-varying envelope
  • modulation: low frequency + envelope-based
  • Result:
  • time-varying spectrum, independent pitch

[Block diagram: pitch → oscillator → time-varying filter (cutoff freq, gain) → envelope → sound, with vibrato and trigger inputs]
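The oscillator-plus-envelope part of the chain above can be sketched in a few lines of numpy (a simplified stand-in: the time-varying filter stage is omitted, and all parameter values here are arbitrary choices, not from the slide):

```python
import numpy as np

# Analog-synth-style note: a harmonics-rich oscillator (naive sawtooth)
# shaped by a piecewise-linear ADSR amplitude envelope.
fs = 8000                                    # sample rate, Hz
n = np.arange(int(fs * 0.5))                 # 0.5 s note
f0 = 220.0
saw = 2.0 * ((f0 * n / fs) % 1.0) - 1.0      # sawtooth in [-1, 1)

def adsr(N, a=0.05, d=0.1, s=0.7, r=0.1):
    """Attack/decay/sustain/release envelope; a, d, r are fractions of N."""
    A, D, R = int(a * N), int(d * N), int(r * N)
    return np.concatenate([np.linspace(0.0, 1.0, A),   # attack
                           np.linspace(1.0, s, D),     # decay
                           np.full(N - A - D - R, s),  # sustain
                           np.linspace(s, 0.0, R)])    # release

env = adsr(len(n))
y = env * saw                                # the rendered note
```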

SLIDE 12

FM synthesis

  • Fast frequency modulation → sidebands:

    cos(ωc·t + β·sin(ωm·t)) = Σn=−∞..∞ Jn(β)·cos((ωc + n·ωm)·t)

  • a harmonic series if ωc = r·ωm
  • Jn(β) is a Bessel function:
    → complex harmonic spectra by varying β

[Figure: Bessel functions J0..J4 versus modulation index β; Jn(β) ≈ 0 for β < n − 2. Spectrogram of an FM tone with ωc = 2000 Hz, ωm = 200 Hz]

what use?
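The sideband series can be checked numerically: an FM tone with fc = 2000 Hz and fm = 200 Hz (the slide's example) should put all its energy on a 200 Hz harmonic grid at fc + n·fm. Sample rate and modulation index below are arbitrary choices:

```python
import numpy as np

# FM tone; with 1 s of signal the FFT bins fall exactly on 1 Hz multiples,
# so each sideband lands in a single bin with magnitude |Jn(beta)| / 2.
fs = 16000
fc, fm, beta = 2000.0, 200.0, 3.0
t = np.arange(fs) / fs                       # exactly 1 second
y = np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

spec = np.abs(np.fft.rfft(y)) / len(y)       # normalized magnitude spectrum
freqs = np.fft.rfftfreq(len(y), 1 / fs)
peaks = freqs[spec > 0.01]                   # sidebands with visible energy
```

For β = 3 the significant sidebands run out to about n = 5 either side of the carrier, i.e. 1000-3000 Hz, matching the Jn(β) ≈ 0 for β < n − 2 rule of thumb.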

SLIDE 13

Sampling synthesis

  • Resynthesis from real notes

→ vary pitch, duration, level

  • Pitch: stretch (resample) waveform
  • Duration: loop a ‘sustain’ section
  • Level: cross-fade different examples
  • need to ‘line up’ source samples
[Figure: waveform details for sampling synthesis: resampling between 596 Hz and 894 Hz, aligned loop points, and a velocity-dependent mix of soft and loud examples]

good & bad?
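The pitch-by-resampling step can be sketched as reading the stored waveform faster than normal with linear interpolation; every frequency scales by the read-rate ratio, and the duration shrinks by the same factor (which is why a sustain loop is needed). The 200 Hz "sample" and the 1.5× ratio are made-up illustration values:

```python
import numpy as np

# Sampling-synthesis pitch shift via wavetable read-rate.
def resample_ratio(wave, ratio):
    """Read wave at `ratio` times normal speed with linear interpolation."""
    idx = np.arange(0, len(wave) - 1, ratio)
    return np.interp(idx, np.arange(len(wave)), wave)

fs = 8000
n = np.arange(800)                           # 0.1 s stored note
note = np.sin(2 * np.pi * 200 * n / fs)
up = resample_ratio(note, 1.5)               # plays back at 300 Hz

# Locate the spectral peak of the resampled note:
peak_hz = np.fft.rfftfreq(len(up), 1 / fs)[np.argmax(np.abs(np.fft.rfft(up)))]
```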

SLIDE 14

Outline

1. Music and nonspeech
2. Music synthesis techniques
3. Sinewave synthesis (detail)
  • Sinewave modeling
  • Sines + residual ...
4. Music analysis
5. Transcription

SLIDE 15

Sinewave synthesis

  • If patterns of harmonics are what matter,

why not generate them all explicitly:

  • particularly powerful model for pitched signals
  • Analysis (as with speech):
  • find peaks in STFT |S[ω,n]| & track
  • or track fundamental ω0 (harmonics / autocorrelation)
    & sample STFT at k·ω0 → set of Ak[n] to duplicate the tone:

  • Synthesis via bank of oscillators

    s[n] = Σk Ak[n]·cos(k·ω0[n]·n)

[Figure: waveform, spectrogram, and sampled harmonic magnitudes Ak[n] for an analyzed tone]
SLIDE 16

Steps to sinewave modeling - 1

  • The underlying STFT:

What value for N (FFT length & window size)? What value for H (hop size: n0 = r·H, r = 0, 1, 2...)?

  • STFT window length determines freq. resol’n:
  • Choose N long enough to resolve harmonics

→ 2-3x longest (lowest) fundamental period

  • e.g. 30-60 ms = 480-960 samples @ 16 kHz
  • choose H ≤ N/2
  • N too long → lost time resolution
  • limits sinusoid amplitude rate of change

    X[k, n0] = Σn=0..N−1 x[n + n0]·w[n]·exp(−j·2πkn/N)

    Xw(e^jω) = X(e^jω) * W(e^jω)

SLIDE 17

Steps to sinewave modeling - 2

  • Choose candidate sinusoids at each time

by picking peaks in each STFT frame:

  • Quadratic fit for peak, lin. interp. for phase:

+ linear interp. of unwrapped phase

[Figure: spectrogram with picked peaks; close-ups of log-magnitude and unwrapped phase around a peak. The fitted parabola y = ax(x − b) peaks at x = b/2 with height ab²/4]
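The quadratic fit can be sketched directly: fit a parabola through the log-magnitude at the peak bin and its two neighbours and read off the sub-bin peak position and height (a standard formulation; the function name is ours):

```python
import numpy as np

# Parabolic interpolation of a spectral peak from three magnitude samples.
def quad_peak(mag, k):
    """Return (fractional bin, interpolated height) for a peak at bin k."""
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    p = 0.5 * (a - c) / (a - 2 * b + c)      # offset from bin k, in (-0.5, 0.5)
    return k + p, b - 0.25 * (a - c) * p

# Check on an exact parabola peaking at x = 10.3 with height 5.0:
x = np.arange(20, dtype=float)
mag = 5.0 - (x - 10.3) ** 2
pos, height = quad_peak(mag, 10)
```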

SLIDE 18

Steps to sinewave modeling - 3

  • Which peaks to pick?

Want ‘true’ sinusoids, not noise fluctuations

  • ‘prominence’ threshold above the smoothed spectrum
  • Sinusoids exhibit stability...
  • of amplitude in time
  • of phase derivative in time

→compare with adjacent time frames to test?

[Figure: one spectral frame (level / dB vs. frequency) with the prominence threshold above the smoothed spectrum]

SLIDE 19

Steps to sinewave modeling - 4

  • ‘Grow’ tracks by appending newly-found peaks

to existing tracks:

  • ambiguous assignments possible
  • Unclaimed new peak
  • ‘birth’ of new track
  • backtrack to find earliest trace?
  • No continuation peak for existing track
  • ‘death’ of track
  • or: reduce peak threshold for hysteresis

[Diagram: existing tracks meeting new peaks in time-frequency, showing a track ‘death’ and a ‘birth’]
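One simple matching policy for growing tracks is greedy nearest-frequency assignment within a tolerance; unmatched new peaks are 'births', unmatched tracks 'deaths' (a sketch of one possible policy, with made-up frequencies and tolerance):

```python
# Greedy frequency matching for sinusoid track continuation.
def match_tracks(track_freqs, peak_freqs, tol=30.0):
    """Return (matches, deaths, births): index pairs and leftover indices."""
    matches, used = [], set()
    for i, tf in enumerate(track_freqs):
        best, best_d = None, tol
        for j, pf in enumerate(peak_freqs):
            if j not in used and abs(pf - tf) <= best_d:
                best, best_d = j, abs(pf - tf)
        if best is None:
            continue                         # no peak close enough
        used.add(best)
        matches.append((i, best))
    deaths = [i for i in range(len(track_freqs))
              if i not in {m[0] for m in matches}]
    births = [j for j in range(len(peak_freqs)) if j not in used]
    return matches, deaths, births

# Three tracks meet three new peaks: one continuation each for the first
# and third track, one death (880 Hz), one birth (2000 Hz).
m, d, b = match_tracks([440.0, 880.0, 1320.0], [445.0, 1335.0, 2000.0])
```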

SLIDE 20

Resynthesis of sinewave models

  • After analysis, each track defines contours in frequency and amplitude, fk[n], Ak[n] (+ phase?)
  • use these to drive a bank of sinewave oscillators & sum up
  • ‘Regularize’ to exactly harmonic fk[n] = k·f0[n]

[Figure: track contours fk[n], Ak[n] driving oscillators Ak[n]·cos(2πfk[n]·t); measured vs. regularized harmonic frequency tracks]

what to do?
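When driving an oscillator from analysis contours, accumulating phase from the frequency contour avoids discontinuities (evaluating cos(2πf[n]·n) directly jumps whenever f changes). A one-oscillator sketch; the gliding-frequency and decaying-amplitude contours are made-up examples:

```python
import numpy as np

# One oscillator driven by time-varying contours f[n] (Hz) and A[n].
fs = 8000
N = 4000
f = np.linspace(400.0, 500.0, N)             # frequency contour, Hz
A = np.linspace(1.0, 0.2, N)                 # amplitude contour
phase = 2 * np.pi * np.cumsum(f) / fs        # accumulated phase, radians
y = A * np.cos(phase)                        # click-free gliding sinusoid
```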

SLIDE 21

Modification in sinewave resynthesis

  • Change duration by warping timebase
  • may want to keep onset unwarped
  • Change pitch by scaling frequencies
  • either stretching or resampling envelope
  • Change timbre by interpolating params

[Figure: spectrogram of a modified resynthesis; spectral envelopes before and after frequency scaling]

SLIDE 22

Sinusoids + residual

  • Only ‘prominent peaks’ became tracks
  • remainder of spectral energy was noisy?

→ model residual energy with noise!

  • How to obtain ‘non-harmonic’ spectrum?
  • zero-out spectrum near extracted peaks?
  • or: resynthesize (exactly) & subtract waveforms

.. must preserve phase!

  • Can model residual signal with LPC

→flexible representation of noisy residual

    es[n] = s[n] − Σk Ak[n]·cos(2π·fk[n]·n)

[Figure: spectra of the original signal, the extracted sinusoids, the residual, and its LPC fit]
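The "must preserve phase" point is easy to demonstrate: if the analysis recovers amplitude, frequency, and phase, subtracting the resynthesized sinusoid leaves only the noise floor. The signal here is synthetic (one sinusoid plus weak white noise):

```python
import numpy as np

# Sinusoid + residual: subtract an exact resynthesis, measure what's left.
fs = 8000
n = np.arange(4000)
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(len(n))
s = 0.8 * np.cos(2 * np.pi * 300 * n / fs + 0.5) + noise

sine = 0.8 * np.cos(2 * np.pi * 300 * n / fs + 0.5)   # 'perfect' analysis
resid = s - sine                                      # e_s[n]

# Residual energy relative to the sinusoid, in dB:
resid_db = 10 * np.log10(np.mean(resid ** 2) / np.mean(sine ** 2))
```

A phase error of even a fraction of a cycle would instead leave a residual comparable to the sinusoid itself.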

SLIDE 23

Sinusoids + noise + transients

  • Sound represented as sinusoids and noise:

Parameters are {Ak[n], fk[n]}, hn[n]

  • Separate out abrupt transients in residual?
  • more specific → more flexible

    s[n] = Σk Ak[n]·cos(2π·fk[n]·n) + hn[n] * b[n]

[Spectrograms: original split into the sinusoid part {Ak[n], fk[n]} and the residual es[n] modeled by hn[n]]

    es[n] = Σk tk[n] + hn[n] * b[n]

SLIDE 24

Outline

1. Music and nonspeech
2. Music synthesis techniques
3. Sinewave synthesis
4. Music analysis
  • Instrument identification
  • Pitch tracking
5. Transcription

SLIDE 25

Music analysis

  • What might we want to get out of music?
  • Instrument identification
  • different levels of specificity
  • ‘registers’ within instruments
  • Score recovery
  • transcribe the note sequence
  • extract the ‘performance’
  • Ensemble performance
  • ‘gestalts’: chords, tone colors
  • Broader timescales
  • phrasing & musical structure
  • artist / genre clustering and classification


SLIDE 26

Instrument identification

  • Research looks for perceptual ‘timbre space’
  • Cues to instrument identification
  • onset (rise time), sustain (brightness)
  • Hierarchy of instrument families
  • strings / reeds / brass
  • optimize features at each level

[Diagram: hierarchical classifier: bright/dull → low/high flux → low/high attack]

procedure?

SLIDE 27

Pitch tracking

  • Fundamental frequency (→ pitch)

is a key attribute of musical sounds →pitch tracking as a key technology

  • Pitch tracking for speech
  • voice pitch & spectrum highly dynamic
  • speech is both voiced and unvoiced

ground truth?
  • Applications
  • voice coders (excitation description)
  • harmonic modeling
SLIDE 28

Pitch tracking for music

  • Pitch in music
  • pitch is more stable (although vibrato)
  • but: multiple pitches
  • Applications
  • harmonic modeling
  • music transcription (→ storage, resynthesis)
  • source separation
  • Approaches: “place” & “time”

[Spectrogram: polyphonic music, 0-4 kHz over 5 s]

??

SLIDE 29

Meddis & Hewitt pitch model

  • Autocorrelation (time) based pitch extraction
  • fundamental period → peak(s) in autocorrelation
  • Compute separately in each frequency band

& ‘summarize’ across (perceptual) channels

    x(t) ≈ x(t + T)  ⟹  rxx(T) = ⟨x(t)·x(t + T)⟩ ≈ max

[Figure: waveform x[n] and its autocorrelation rxx[l]; autocorrelogram over channels (CF 80-4000 Hz) vs. lag, with summary ACG]

[Block diagram: sound → bandpass filters → rectification & low-pass filter → per-channel periodicity detection → cross-channel sum → summary ACG]
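The core periodicity-detection step can be sketched in one (full-band) channel: the fundamental period shows up as the strongest non-zero-lag peak of rxx. The 200 Hz two-harmonic test signal and the lag search range are made-up values:

```python
import numpy as np

# Autocorrelation pitch extraction: find the lag of the strongest peak.
fs = 8000
f0 = 200.0                                   # true fundamental: 40 samples
n = np.arange(2048)
x = (np.cos(2 * np.pi * f0 * n / fs)
     + 0.5 * np.cos(2 * np.pi * 2 * f0 * n / fs))

r = np.correlate(x, x, mode='full')[len(x) - 1:]   # r_xx[l] for l >= 0
lag = 20 + np.argmax(r[20:200])              # search 40-400 Hz, skip lag 0
f0_est = fs / lag
```

Note the lag-0 region must be excluded, or the trivial maximum rxx(0) always wins; the upper-octave harmonic does not fool the estimate because its lag-20 peak is negative here.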

SLIDE 30

Tolonen & Karjalainen simplification

  • Multiple frequency channels can have different dominant pitches...
  • But equalizing (flattening) the spectrum works:
    → summary autocorrelation (SACF) as a function of time
  • ‘Enhancement’ = cancel subharmonics

[Block diagram: sound → pre-whitening → lowpass @ 1 kHz plus highpass @ 1 kHz (rectified & low-passed) → periodicity detection → SACF → enhancement → ESACF]

[Figure: periodogram of a male/female voice mix; summary autocorrelation at t = 0.775 s with peaks at 0.005 s (200 Hz) and 0.008 s (125 Hz)]

lag vs. freq?
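The 'enhancement' step can be sketched on a toy SACF: clip to positive values, then subtract copies of the SACF stretched along the lag axis by 2, 3, ..., which cancels the peaks at multiples of the true period. The spiky synthetic SACF below is illustrative, not derived from real audio:

```python
import numpy as np

# ESACF-style subharmonic cancellation on a synthetic SACF.
L = 200
sacf = np.zeros(L)
sacf[[40, 80, 120, 160]] = 1.0               # true period 40 + its multiples

esacf = np.clip(sacf, 0, None)
for m in (2, 3, 4):
    # Stretch lag axis by m: stretched[l] ~ sacf[l / m]
    stretched = np.interp(np.arange(L) / m, np.arange(L), sacf)
    esacf = np.clip(esacf - np.clip(stretched, 0, None), 0, None)
```

After enhancement only the lag-40 peak survives; the peaks at 80, 120, and 160 (which would read as subharmonic pitches) are cancelled.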

SLIDE 31

Post-processing of pitch tracks

  • Remove outliers with median filtering
  • Octave errors are common:
  • if x(t) ≈ x(t + T ) then x(t) ≈ x(t + 2T ) etc.

→ dynamic programming/HMM

  • Validity
  • “is there a pitch at this time?”
  • voiced/unvoiced decision for speech
  • Event detection
  • when does a pitch slide indicate a new note?

[Diagram: 5-point median filter applied along the time axis]
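Median filtering of a pitch track is a few lines: isolated octave errors (single-frame doublings or halvings) are replaced by the local median, while genuine level changes survive. A minimal sketch with a made-up track:

```python
import numpy as np

# 5-point median filter along time, edge-padded so length is preserved.
def median5(track):
    t = np.asarray(track, dtype=float)
    padded = np.pad(t, 2, mode='edge')
    return np.array([np.median(padded[i:i + 5]) for i in range(len(t))])

raw = [220, 220, 440, 220, 220, 220, 110, 220, 220]   # two octave outliers
smooth = median5(raw)
```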

SLIDE 32

Outline

1. Music and nonspeech
2. Music synthesis techniques
3. Sinewave synthesis
4. Music analysis
5. Transcription
  • Bottom-up and top-down
  • Transcription from sinewave models

SLIDE 33

Transcription

  • Basic idea: Recover the score
  • Is it possible? Why is it hard?
  • music students do it

... but they are highly trained; know the rules

  • Motivations
  • for study: what was played?
  • highly compressed representation (e.g. MIDI)
  • the ultimate restoration system...

[Spectrogram: polyphonic music, 0-4 kHz over 5 s]

SLIDE 34

Transcription framework

  • Recover discrete events to explain signal
  • analysis-by-synthesis?
  • Exhaustive search?
  • would be possible given exact note waveforms
  • .. or just a 2-dimensional ‘note’ template?

but superposition is not linear in |STFT| space

  • Inference depends on all detected notes
  • is this evidence ‘available’ or ‘used’?
  • full solution is exponentially complex

[Diagram: note events {tk, pk, ik} → synthesis → observations X[k,n]; alternatively, a note template applied by 2-D convolution]
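The 2-D 'note template' idea can be sketched as a cross-correlation over a toy magnitude spectrogram (everything here, sizes, template shape, the planted note, is illustrative; real spectrograms break this because note magnitudes do not add linearly):

```python
import numpy as np

# Toy spectrogram (freq bins x time frames) with one 3-harmonic note.
spec = np.zeros((40, 30))
for h in (1, 2, 3):
    spec[5 * h, 10:20] = 1.0                 # harmonics at bins 5, 10, 15

# Matching harmonic template, 5 frames wide.
template = np.zeros((16, 5))
for h in (1, 2, 3):
    template[5 * h, :] = 1.0

# Valid-mode 2-D cross-correlation, then pick the best alignment.
F, T = template.shape
score = np.array([[np.sum(spec[i:i + F, j:j + T] * template)
                   for j in range(spec.shape[1] - T + 1)]
                  for i in range(spec.shape[0] - F + 1)])
fi, ti = np.unravel_index(np.argmax(score), score.shape)
```

The peak score lands at frequency offset 0, time frame 10, i.e. the planted note's position and pitch.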

SLIDE 35

Bottom-up versus top-down

  • Bottom-up: observation directly gives description
  • e.g. peaks in 2-D convolution
  • but: few domains are that ‘linear’
  • Top-down: pursue & confirm hypotheses
  • e.g. analysis-by-resynthesis matching
  • but: need to limit search space
  • Generally, need to do both:
  • bottom-up guides & limits search
  • top-down resolves ambiguities in low-level

how to transcribe?

[Diagram: raw observations → data-driven analyses → hypothesis search, with abstract constraints supplying guidance and tests]

SLIDE 36

Transcription from sinewave models

  • Form sinusoid model
  • as with synthesis, but the signal is more complex
  • Break tracks
  • need to detect a new ‘onset’ at a single frequency
  • Group by onset & common harmonicity
  • find sets of tracks that start around the same time
  • + stable harmonic pattern
  • Pass on to constraint-based filtering...

[Figure: sinusoid tracks for a music excerpt, 0-3000 Hz over 4 s, with onset-grouping detail]

bu/td? mistakes?

SLIDE 37

Problems for transcription

  • Music is practically worst case!
  • note events are often synchronized

→ defeats common onset

  • notes have harmonic relations (2:3 etc.)

→ collision/interference between harmonics

  • variety of instruments, techniques, ...
  • Listeners are very sensitive to certain errors
  • .. and impervious to others
  • Apply further constraints
  • like our ‘music student’
  • maybe even the whole score (Scheirer)!
SLIDE 38

Summary

  • ‘Nonspeech audio’
  • i.e. sound in general
  • characteristics: ecological
  • Music synthesis
  • control of pitch, duration, loudness, articulation
  • evolution of techniques
  • sinusoids + noise + transients
  • Music analysis
  • different aspects: instruments, pitches, performance
  • transcription complications: representation, octaves, onsets, ...

  • rely on high-level structural constraints

and beyond?