E6820 SAPR - Dan Ellis L06 - Music A & S 2002-03-04 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 6: Music analysis and synthesis
1 Music and nonspeech
2 Music synthesis techniques
3 Sinewave synthesis
4 Music analysis
5 Transcription
Dan Ellis
Music & nonspeech
- What is ‘nonspeech’?
- according to research effort: a little music
- in the world: almost everything
attributes?
[Figure: sounds arranged by Origin (natural vs. man-made) and Information content (low vs. high): wind & water, animal sounds, contact/collision, machines & engines, speech, music]
Sound attributes
- Attributes suggest model parameters
- What do we notice about ‘general’ sound?
- psychophysics: pitch, loudness, ‘timbre’
- bright/dull; sharp/soft; grating/soothing
- sound is not ‘abstract’:
the tendency is to describe it by source events
- Ecological perspective
- what matters about sound is ‘what happened’
→ our percepts express this more-or-less directly
Aside: Sound textures
- What do we hear in:
- a city street
- a symphony orchestra
- How do we distinguish:
- waterfall
- rainfall
- applause
- static
- Levels of ecological description...
[Spectrograms: Applause04 and Rain01, 0-5000 Hz over 4 s]
Motivations for modeling
- Describe/classify
- cast sound into a model because we want to use the
resulting parameters
- Store/transmit
- the model implicitly exploits the limited structure of the
signal
- Resynthesize/modify
- model separates out interesting parameters
[Diagram: Sound → Model → parameter space]
Analysis and synthesis
- Analysis is the converse of synthesis:
- Can exist apart:
- analysis for classification
- synthesis of artificial sounds
- Often used together:
- encoding/decoding of compressed formats
- resynthesis based on analyses
- analysis-by-synthesis
[Diagram: Sound ↔ Analysis / Synthesis ↔ Model / representation]
Outline
1 Music and nonspeech
2 Music synthesis techniques
- Framework
- Historical development
3 Sinewave synthesis
4 Music analysis
5 Transcription
elements?
Music synthesis techniques
- What is music?
- could be anything
→ flexible synthesis needed!
- Key elements of conventional music
- instruments
→ note-events (time, pitch, accent level) → melody, harmony, rhythm
- patterns of repetition & variation
- Synthesis framework:
- instruments: common framework for many notes
- score: sequence of (time, pitch, level) note events
The nature of musical instrument notes
- Characterized by instrument (register),
note, loudness (emphasis), articulation...
distinguish how?
[Spectrograms: Piano, Violin, Clarinet, and Trumpet notes, 0-4000 Hz]
Development of music synthesis
- Goals of music synthesis:
- generate realistic / pleasant new notes
- control / explore timbre (quality)
- Earliest computer systems in 1960s
(voice synthesis, algorithmic)
- Pure synthesis approaches:
- 1970s:
Analog synths
- 1980s:
FM (Stanford/Yamaha)
- 1990s:
Physical modeling, hybrids
- Analysis-synthesis methods:
- sampling / wavetables
- sinusoid modeling
- harmonics + noise (+ transients)
- others?
Analog synthesis
- The minimum to make an ‘interesting’ sound
- Elements:
- harmonics-rich oscillators
- time-varying filters
- time-varying envelope
- modulation: low frequency + envelope-based
- Result:
- time-varying spectrum, independent pitch
[Diagram: Oscillator → Filter → Envelope → Sound, with Pitch, Vibrato, Trigger, Cutoff freq, and Gain controls]
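The envelope element above can be sketched as a piecewise-linear attack/decay/sustain/release (ADSR) generator; the function name and the segment durations here are illustrative defaults, not values from the lecture:

```python
def adsr(n, sr, a=0.01, d=0.05, s_level=0.7, r=0.1):
    """Piecewise-linear ADSR envelope of n samples: the 'time-varying
    envelope' element of an analog-style synth voice. a/d/r are segment
    durations in seconds, s_level the sustain gain."""
    na, nd, nr = int(a * sr), int(d * sr), int(r * sr)
    ns = max(0, n - na - nd - nr)                               # sustain fills the rest
    env = [i / na for i in range(na)]                           # attack: 0 -> 1
    env += [1.0 - (1.0 - s_level) * i / nd for i in range(nd)]  # decay: 1 -> sustain
    env += [s_level] * ns                                       # sustain
    env += [s_level * (1.0 - i / nr) for i in range(nr)]        # release: sustain -> 0
    return env[:n]
```

Multiplying this envelope into the filtered oscillator output gives the time-varying spectrum the slide describes.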
FM synthesis
- Fast frequency modulation
→ sidebands:
- a harmonic series if ωc = r·ωm
- Jn(β) is a Bessel function:
cos(ωc·t + β·sin(ωm·t)) = Σn=-∞..∞ Jn(β)·cos((ωc + n·ωm)·t)
→ complex harmonic spectra by varying β
[Plot: Bessel functions J0-J4 vs. modulation index β; Jn(β) ≈ 0 for β < n - 2]
[Spectrogram of an FM tone, 0-4000 Hz over 0.8 s]
what use?
ωc = 2000 Hz, ωm = 200 Hz
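The scheme above fits in a few lines; this sketch uses the slide's carrier and modulator frequencies, with an illustrative modulation index:

```python
import math

def fm_tone(f_c, f_m, beta, dur=0.5, sr=16000):
    """FM synthesis: y[n] = cos(2*pi*f_c*t + beta*sin(2*pi*f_m*t)).
    With f_c = r*f_m (integer r) the sidebands at f_c + n*f_m form a
    harmonic series; beta (modulation index) sets spectral richness,
    since Jn(beta) is roughly zero for beta < n - 2."""
    return [math.cos(2 * math.pi * f_c * i / sr
                     + beta * math.sin(2 * math.pi * f_m * i / sr))
            for i in range(int(dur * sr))]

# The slide's example: carrier 2000 Hz, modulator 200 Hz (so r = 10)
tone = fm_tone(2000.0, 200.0, beta=3.0)
```

Sweeping beta over the note's duration is the classic way FM instruments vary brightness through a note.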
Sampling synthesis
- Resynthesis from real notes
→ vary pitch, duration, level
- Pitch: stretch (resample) waveform
- Duration: loop a ‘sustain’ section
- Level: cross-fade different examples
- need to ‘line up’ source samples
[Waveform plots: a 596 Hz note resampled to 894 Hz; a looped ‘sustain’ section; Soft and Loud examples combined by a velocity-dependent mix]
good & bad?
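The pitch step above can be sketched as linear-interpolation resampling (the function name is illustrative):

```python
def resample_pitch(wave, ratio):
    """Read the source waveform `ratio` times faster, interpolating
    linearly between samples: pitch rises by `ratio`, and a plain
    (un-looped) sample shortens by the same factor."""
    out, pos = [], 0.0
    while pos < len(wave) - 1:
        i = int(pos)
        frac = pos - i
        out.append(wave[i] * (1.0 - frac) + wave[i + 1] * frac)
        pos += ratio
    return out
```

For the slide's example, ratio = 894/596 = 1.5 maps the 596 Hz source note to 894 Hz; this time-shortening side effect is exactly why the sustain loop is needed for duration control.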
Outline
1 Music and nonspeech
2 Music synthesis techniques
3 Sinewave synthesis (detail)
- Sinewave modeling
- Sines + residual ...
4 Music analysis
5 Transcription
Sinewave synthesis
- If patterns of harmonics are what matter,
why not generate them all explicitly:
- particularly powerful model for pitched signals
- Analysis (as with speech):
- find peaks in STFT |S[ω,n]| & track
- or track fundamental ω0 (harmonics / autocorrelation)
& sample STFT at k·ω0 → set of Ak[n] to duplicate the tone:
- Synthesis via bank of oscillators
s[n] = Σk Ak[n]·cos(k·ω0[n]·n)
[Spectrogram and extracted harmonic amplitudes, 0-8000 Hz over 0.2 s]
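The bank-of-oscillators synthesis above can be sketched as follows; for brevity this holds each frame's values constant over the hop, where a real resynthesis would interpolate Ak and phase (names are illustrative):

```python
import math

def additive_synth(f0_track, amp_tracks, hop, sr=16000):
    """Oscillator-bank resynthesis of s[n] = sum_k A_k[n] cos(k w0[n] n).
    f0_track gives the per-frame fundamental (Hz); amp_tracks[k-1] gives
    harmonic k's per-frame amplitude. Phase is accumulated so frequency
    changes between frames do not click."""
    n_harm = len(amp_tracks)
    phase = [0.0] * n_harm
    out = []
    for frame, f0 in enumerate(f0_track):
        for _ in range(hop):
            s = 0.0
            for k in range(n_harm):
                phase[k] += 2.0 * math.pi * (k + 1) * f0 / sr  # harmonic k+1 at (k+1)*f0
                s += amp_tracks[k][frame] * math.cos(phase[k])
            out.append(s)
    return out
```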
Steps to sinewave modeling - 1
- The underlying STFT:
What value for N (FFT length & window size)?
What value for H (hop size: n0 = r·H, r = 0, 1, 2...)?
- STFT window length determines frequency resolution:
- Choose N long enough to resolve harmonics
→ 2-3x longest (lowest) fundamental period
- e.g. 30-60 ms = 480-960 samples @ 16 kHz
- choose H ≤ N/2
- N too long → lost time resolution
- limits sinusoid amplitude rate of change
X[k, n0] = Σn=0..N-1 x[n + n0]·w[n]·exp(-j2πkn/N)
Xw(e^jω) = X(e^jω) * W(e^jω)
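A direct (deliberately slow) implementation of one analysis frame; the Hann window is a common choice here, though the lecture leaves w[n] generic:

```python
import math

def stft_frame(x, n0, N):
    """X[k, n0] = sum_{n=0..N-1} x[n0+n] * w[n] * exp(-j*2*pi*k*n/N),
    computed as a direct DFT of the Hann-windowed frame starting at n0."""
    frame = []
    for k in range(N):
        re = im = 0.0
        for n in range(N):
            w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / N)  # Hann window
            ang = 2.0 * math.pi * k * n / N
            v = x[n0 + n] * w
            re += v * math.cos(ang)
            im -= v * math.sin(ang)
        frame.append(complex(re, im))
    return frame
```

Production code would use an FFT, but the loop makes the definition explicit: longer N gives narrower analysis bins (better harmonic resolution) at the cost of time resolution, exactly the trade-off above.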
Steps to sinewave modeling - 2
- Choose candidate sinusoids at each time
by picking peaks in each STFT frame:
- Quadratic fit for peak frequency & amplitude
+ linear interpolation of unwrapped phase
[Plots: STFT frame with candidate peaks (level / dB and phase / rad vs. freq / Hz)]
[Quadratic fit: y = ax(x - b), extremum at x = b/2 of height -ab²/4]
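The quadratic-fit step is the standard three-point parabolic interpolation on log magnitudes; a sketch (function name illustrative):

```python
def refine_peak(mags_db, i):
    """Parabolic refinement of a spectral peak at bin i: fit a parabola
    through the dB magnitudes at bins i-1, i, i+1 and return the
    fractional peak bin and the interpolated peak level."""
    a, b, c = mags_db[i - 1], mags_db[i], mags_db[i + 1]
    d = 0.5 * (a - c) / (a - 2.0 * b + c)  # fractional offset in (-0.5, 0.5)
    return i + d, b - 0.25 * (a - c) * d
```

The fractional bin times the bin spacing (sample rate / N) gives the sinusoid's frequency to much better than one-bin accuracy.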
Steps to sinewave modeling - 3
- Which peaks to pick?
Want ‘true’ sinusoids, not noise fluctuations
- ‘prominence’ threshold above smoothed spec.
- Sinusoids exhibit stability...
- of amplitude in time
- of phase derivative in time
→ compare with adjacent time frames to test?
[Spectrum with ‘prominence’ threshold above smoothed spectrum (level / dB vs. freq / Hz)]
Steps to sinewave modeling - 4
- ‘Grow’ tracks by appending newly-found peaks
to existing tracks:
- ambiguous assignments possible
- Unclaimed new peak
- ‘birth’ of new track
- backtrack to find earliest trace?
- No continuation peak for existing track
- ‘death’ of track
- or: reduce peak threshold for hysteresis
[Diagram: existing tracks meeting new peaks in time-frequency, showing track ‘death’ and ‘birth’]
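A greedy sketch of the growing step; the 50 Hz matching window is an illustrative threshold, not a value from the lecture:

```python
def continue_tracks(tracks, new_peaks, max_jump=50.0):
    """Extend each live sinusoid track with the nearest new peak (in Hz).
    tracks: list of frequency lists; new_peaks: peak frequencies in this
    frame. Unmatched peaks start a new track ('birth'); a track with no
    continuation ends ('death', marked here by appending None)."""
    unclaimed = list(new_peaks)
    for trk in tracks:
        if trk[-1] is None:                  # track already dead
            continue
        best = min(unclaimed, key=lambda f: abs(f - trk[-1]), default=None)
        if best is not None and abs(best - trk[-1]) <= max_jump:
            trk.append(best)                 # continuation
            unclaimed.remove(best)
        else:
            trk.append(None)                 # death
    for f in unclaimed:                      # births
        tracks.append([f])
    return tracks
```

Resolving the slide's "ambiguous assignments" properly needs a joint (e.g. Hungarian-algorithm) match rather than this greedy pass.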
Resynthesis of sinewave models
- After analysis, each track defines contours in
frequency, amplitude fk[n], Ak[n] (+ phase?)
- use to drive a bank of sinewave oscillators & sum up
- ‘Regularize’ to exactly harmonic fk[n] = k·f0[n]
[Plots: track frequencies fk[n] and amplitudes Ak[n] driving oscillators Ak[n]·cos(2πfk[n]·t); regularized harmonic tracks]
what to do?
Modification in sinewave resynthesis
- Change duration by warping timebase
- may want to keep onset unwarped
- Change pitch by scaling frequencies
- either stretching or resampling envelope
- Change timbre by interpolating params
[Spectrogram of modified resynthesis, 0-5000 Hz; spectral envelopes (level / dB vs. freq / Hz) before and after modification]
Sinusoids + residual
- Only ‘prominent peaks’ became tracks
- remainder of spectral energy was noisy?
→ model residual energy with noise!
- How to obtain ‘non-harmonic’ spectrum?
- zero-out spectrum near extracted peaks?
- or: resynthesize (exactly) & subtract waveforms
.. must preserve phase!
- Can model residual signal with LPC
→flexible representation of noisy residual
es[n] = s[n] - Σk Ak[n]·cos(2πn·fk[n])
[Spectra: original, extracted sinusoids, residual, and LPC fit to residual (mag / dB vs. freq / Hz)]
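The subtraction route above can be sketched directly from the residual formula; this assumes the resynthesis is phase-matched to the original, which, as the slide warns, is the hard part in practice:

```python
import math

def sinusoid_residual(s, amps, freqs, sr=16000):
    """Residual e_s[n] = s[n] - sum_k A_k[n] cos(2*pi*n*f_k[n]/sr).
    amps/freqs: per-sample amplitude and frequency envelopes for each
    track k. Cancellation only works if the resynthesized phase matches
    the original signal's phase."""
    res = []
    for n, x in enumerate(s):
        synth = sum(a[n] * math.cos(2 * math.pi * n * f[n] / sr)
                    for a, f in zip(amps, freqs))
        res.append(x - synth)
    return res
```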
Sinusoids + noise + transients
- Sound represented as sinusoids and noise:
Parameters are {Ak[n], fk[n]}, hn[n]
- Separate out abrupt transients in residual?
- more specific → more flexible
s[n] = Σk Ak[n]·cos(2πn·fk[n]) + hn[n] * b[n]
[Spectrograms: sinusoid part {Ak[n], fk[n]} and residual es[n], modeled by noise filter hn[n]]
es[n] = Σk tk[n] + hn[n] * b[n]
Outline
1 Music and nonspeech
2 Music synthesis techniques
3 Sinewave synthesis
4 Music analysis
- Instrument identification
- Pitch tracking
5 Transcription
Music analysis
- What might we want to get out of music?
- Instrument identification
- different levels of specificity
- ‘registers’ within instruments
- Score recovery
- transcribe the note sequence
- extract the ‘performance’
- Ensemble performance
- ‘gestalts’: chords, tone colors
- Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
Instrument identification
- Research looks for perceptual ‘timbre space’
- Cues to instrument identification
- onset (rise time), sustain (brightness)
- Hierarchy of instrument families
- strings / reeds / brass
- optimize features at each level
[Feature hierarchy: bright vs. dull, low vs. high flux, low vs. high attack]
procedure?
Pitch tracking
- Fundamental frequency (→ pitch)
is a key attribute of musical sounds
→ pitch tracking as a key technology
- Pitch tracking for speech
- voice pitch & spectrum highly dynamic
- speech is voiced and unvoiced
ground truth?
- Applications
- voice coders (excitation description)
- harmonic modeling
Pitch tracking for music
- Pitch in music
- pitch is more stable (although vibrato)
- but: multiple pitches
- Applications
- harmonic modeling
- music transcription (→ storage, resynthesis)
- source separation
- Approaches: “place” & “time”
[Spectrogram of polyphonic music, 0-4000 Hz over 5 s]
Meddis & Hewitt pitch model
- Autocorrelation (time) based pitch extraction
- fundamental period → peak(s) in autocorrelation
- Compute separately in each frequency band
& ‘summarize’ across (perceptual) channels
x(t) ≈ x(t + T) → rxx(T) = ∫ x(t)·x(t + T) dt ≈ max
[Plots: waveform x[n] and autocorrelation rxx[l]; autocorrelogram across channels (CF 80-4000 Hz) with summary ACG]
[Diagram: sound → bandpass filters → rectification & low-pass filter → periodicity detection → cross-channel sum → summary ACG]
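The core periodicity detection can be sketched as follows, applied here to a single full-band signal for brevity (the model runs it per cochlear channel and sums across channels; the search range is an illustrative choice):

```python
def autocorr_pitch(x, sr, fmin=80.0, fmax=500.0):
    """Pick the fundamental period as the lag T maximizing the
    autocorrelation r_xx(T) = sum_n x[n] x[n+T] over a plausible
    period range, and return the corresponding frequency in Hz."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, min(hi, len(x) - 1) + 1):
        r = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sr / best_lag
```

Note the octave hazard the later slide raises: r_xx also peaks at 2T, 3T, ..., so a slightly favored double-period lag produces an octave-down error.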
Tolonen & Karjalainen simplification
- Different pitches can dominate in different
frequency channels...
- But equalizing (flattening) the spectrum works:
→ Summary AC as a function of time:
- ‘Enhancement’ = cancel subharmonics
[Diagram: sound → pre-whitening → lowpass @ 1 kHz and (highpass @ 1 kHz → rectify & low-pass) branches → periodicity detection → sum → SACF → enhance → ESACF]
[Periodogram for M/F voice mix; summary autocorrelation at t = 0.775 s with peaks at 0.005 s (200 Hz) and 0.008 s (125 Hz)]
lag vs. freq?
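The ‘enhancement’ step can be sketched as one round of clip-stretch-subtract: time-stretch the positive-clipped SACF by 2 along the lag axis and subtract it, cancelling the peak a subharmonic places at double the lag (a fuller ESACF implementation repeats this for factors 3, 4, ...; this single-pass version is a simplification):

```python
def enhance_sacf(sacf):
    """One enhancement pass: stretch the positive-clipped summary
    autocorrelation by a factor of 2 in lag (linear interpolation),
    subtract it from the original, and clip the result positive."""
    out = []
    last = len(sacf) - 1
    for i, v in enumerate(sacf):
        j = i / 2.0                     # read position in stretched-by-2 SACF
        lo = int(j)
        frac = j - lo
        stretched = (max(0.0, sacf[lo]) * (1.0 - frac)
                     + max(0.0, sacf[min(lo + 1, last)]) * frac)
        out.append(max(0.0, max(0.0, v) - stretched))
    return out
```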
Post-processing of pitch tracks
- Remove outliers with median filtering
- Octave errors are common:
- if x(t) ≈ x(t + T ) then x(t) ≈ x(t + 2T ) etc.
→ dynamic programming/HMM
- Validity
- “is there a pitch at this time?”
- voiced/unvoiced decision for speech
- Event detection
- when does a pitch slide indicate a new note?
[Diagram: pitch track cleaned by 5-pt median filter]
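The median step takes only a few lines (5-point window as in the slide):

```python
def median_filter(pitches, width=5):
    """Median smoothing of a pitch track: isolated outliers (e.g. an
    octave error lasting a frame or two) are replaced by the median of
    the surrounding window, while steady values pass through."""
    half = width // 2
    out = []
    for i in range(len(pitches)):
        window = sorted(pitches[max(0, i - half): i + half + 1])
        out.append(window[len(window) // 2])
    return out
```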
Outline
1 Music and nonspeech
2 Music synthesis techniques
3 Sinewave synthesis
4 Music analysis
5 Transcription
- Bottom-up and top-down
- Transcription from sinewave models
Transcription
- Basic idea: Recover the score
- Is it possible? Why is it hard?
- music students do it
... but they are highly trained; know the rules
- Motivations
- for study: what was played?
- highly compressed representation (e.g. MIDI)
- the ultimate restoration system...
[Spectrogram of polyphonic music, 0-4000 Hz over 5 s]
Transcription framework
- Recover discrete events to explain signal
- analysis-by-synthesis?
- Exhaustive search?
- would be possible given exact note waveforms
- .. or just a 2-dimensional ‘note’ template?
but superposition is not linear in |STFT| space
- Inference depends on all detected notes
- is this evidence ‘available’ or ‘used’?
- full solution is exponentially complex
[Diagram: note events {tk, pk, ik} → synthesis → observations X[k, n]; alternatively, a 2-D note template matched by convolution]
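The 2-D template idea can be sketched as plain cross-correlation of a magnitude spectrogram with a note template; as the slide warns, the scores of overlapping notes do not simply add because superposition is not linear in |STFT| (names here are illustrative):

```python
def template_score(spec, template):
    """Cross-correlate a 2-D 'note' template against a magnitude
    spectrogram at every (freq, time) offset. spec and template are
    lists of rows (frequency) of columns (time); returns the score map,
    whose local maxima are candidate note placements."""
    F, T = len(spec), len(spec[0])
    f_t, t_t = len(template), len(template[0])
    return [[sum(spec[f + i][t + j] * template[i][j]
                 for i in range(f_t) for j in range(t_t))
             for t in range(T - t_t + 1)]
            for f in range(F - f_t + 1)]
```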
Bottom-up versus top-down
- Bottom-up: observation directly gives description
- e.g. peaks in 2-D convolution
- but: few domains are that ‘linear’
- Top-down: pursue & confirm hypotheses
- e.g. analysis-by-resynthesis matching
- but: need to limit search space
- Generally, need to do both:
- bottom-up guides & limits search
- top-down resolves ambiguities in the low-level analysis
how to transcribe?
[Diagram: raw observations → data-driven analyses → hypothesis search, with abstract constraints supplying guidance and tests]
Transcription from sinewave models
- Form sinusoid model
- as with synthesis, but signal
is more complex
- Break tracks
- need to detect new ‘onset’
at single frequencies
- Group by onset & common
harmonicity
- find sets of tracks that start
around the same time
- + stable harmonic pattern
- Pass on to constraint-
based filtering...
[Sinusoid tracks for a polyphonic passage, 0-3000 Hz over 4 s, with onset detail]
bu/td? mistakes?
Problems for transcription
- Music is practically the worst case!
- note events are often synchronized
→ defeats common onset
- notes have harmonic relations (2:3 etc.)
→ collision/interference between harmonics
- variety of instruments, techniques, ...
- Listeners are very sensitive to certain errors
- .. and impervious to others
- Apply further constraints
- like our ‘music student’
- maybe even the whole score (Scheirer)!
Summary
- ‘Nonspeech audio’
- i.e. sound in general
- characteristics: ecological
- Music synthesis
- control of pitch, duration, loudness, articulation
- evolution of techniques
- sinusoids + noise + transients
- Music analysis
- different aspects: instruments, pitches,
performance
- transcription complications:
representation, octaves, onsets, ...
- rely on high-level structural constraints