


SLIDE 1

Music Audio Information - Ellis 2007-11-02

  • 1. Motivation: Music Collections
  • 2. Music Information
  • 3. Music Similarity
  • 4. Music Structure Discovery

Extracting and Using Music Audio Information

Dan Ellis

Laboratory for Recognition and Organization of Speech and Audio

  • Dept. Electrical Engineering, Columbia University, NY USA

http://labrosa.ee.columbia.edu/

SLIDE 2

LabROSA Overview

[Diagram: Machine Learning + Signal Processing → Information Extraction, applied to Speech / Music / Environment, for Recognition / Retrieval / Separation]

SLIDE 3

  • 1. Managing Music Collections
  • A lot of music data available

e.g. 60G of MP3 ≈ 1000 hr of audio, 15k tracks

  • Management challenge

how can computers help?

  • Application scenarios

personal music collection; discovering new music; “music placement”

SLIDE 4

Learning from Music

  • What can we infer from 1000 h of music?

common patterns: sounds, melodies, chords, form; what is and what isn’t music

  • Data driven musicology?
  • Applications

modeling/description/coding; computer-generated music; curiosity...

[Figure: scatter of PCA dimensions 3:6 of 12x16 beat-chroma features]

SLIDE 5

The Big Picture

.. so far

Music audio → tempo and beat → low-level features → {classification and similarity, music structure discovery, melody and notes, key and chords}

→ browsing, discovery, production, modeling, generation, curiosity

SLIDE 6
  • 2. Music Information
  • How to represent music audio?
  • Audio features

spectrogram, MFCCs, bases

  • Musical elements

notes, beats, chords, phrases; requires transcription

  • Or something in between?
  • Optimized for a certain task?

[Figure: spectrogram, time 0-5 s, frequency 0-4 kHz]

SLIDE 7

Transcription as Classification

  • Exchange signal models for data

transcription as pure classification problem:

Poliner & Ellis ‘05,’06,’07

Classification:

  • N binary SVMs (one for each note).
  • Independent frame-level classification on a 10 ms grid.
  • Distance to class boundary as posterior.

Temporal Smoothing:

  • Two-state (on/off) independent HMM for each note; parameters learned from training data.
  • Find the Viterbi sequence for each note.

Training data and features:

  • MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 minutes of training audio).
  • Normalized magnitude STFT.
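The smoothing stage is the easiest part to illustrate. Below is a minimal numpy sketch of the two-state (off/on) Viterbi smoothing idea; the posteriors are invented stand-ins rather than real SVM outputs, and the stay probability is an illustrative choice, not the trained parameters.

```python
import numpy as np

def viterbi_smooth(posteriors, p_stay=0.9):
    """Smooth per-frame note-on posteriors with a two-state (off/on) HMM
    and return the Viterbi-optimal binary sequence."""
    n = len(posteriors)
    logA = np.log([[p_stay, 1 - p_stay],     # transition matrix
                   [1 - p_stay, p_stay]])
    logB = np.log(np.clip(np.stack([1 - posteriors, posteriors], axis=1),
                          1e-12, None))      # emission terms per frame
    delta = np.zeros((n, 2))
    psi = np.zeros((n, 2), dtype=int)
    delta[0] = logB[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]
    states = np.zeros(n, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):           # backtrace
        states[t] = psi[t + 1, states[t + 1]]
    return states

# a brief posterior dip mid-note is bridged by the smoother
post = np.array([0.1, 0.2, 0.8, 0.6, 0.9, 0.3, 0.85, 0.2, 0.1])
print(viterbi_smooth(post))
```

Note how the frame with posterior 0.3 is kept "on" because switching states twice costs more than one weak emission.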

SLIDE 8

Polyphonic Transcription

  • Real music excerpts + ground truth


MIREX 2007

Frame-level transcription

Estimate the fundamental frequency of all notes present on a 10 ms grid

Note-level transcription

Group frame-level predictions into note-level transcriptions by estimating onset/offset

[Figure: MIREX 2007 results: frame-level precision, recall, accuracy, and errors (Etot, Esubs, Emiss, Efa); note-level precision, recall, average F-measure, average overlap]
SLIDE 9

Beat Tracking

  • Goal: One feature vector per ‘beat’ (tatum)

for tempo normalization, efficiency

  • “Onset Strength Envelope”

O(t) = Σ_f max(0, Δ_t log |X(t, f)|)

  • Autocorr. + window → global tempo estimate

[Figure: mel spectrogram, onset strength envelope, and windowed autocorrelation; tempo estimate 168.5 BPM]

Ellis ’06,’07
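The onset-strength and tempo steps can be sketched in a few lines of numpy; this is a simplified illustration using a synthetic impulse-train "spectrogram" in place of real audio, and the 120 BPM log-Gaussian preference window is an assumed prior, not the slide's exact parameters.

```python
import numpy as np

def onset_strength(log_spec):
    """Onset strength envelope O(t) = sum_f max(0, diff_t log|X(t, f)|).
    log_spec: (frames, bands) log-magnitude spectrogram."""
    return np.maximum(0.0, np.diff(log_spec, axis=0)).sum(axis=1)

def tempo_estimate(ose, frame_rate=250.0, bpm_center=120.0, bpm_width=1.0):
    """Global tempo: autocorrelate the onset envelope and weight lags by
    a log-Gaussian preference window (the 120 BPM prior is an assumption)."""
    ose = ose - ose.mean()
    ac = np.correlate(ose, ose, mode='full')[len(ose) - 1:]   # lags >= 0
    lags = np.arange(1, len(ac))
    bpm = 60.0 * frame_rate / lags
    w = np.exp(-0.5 * (np.log2(bpm / bpm_center) / bpm_width) ** 2)
    return 60.0 * frame_rate / lags[np.argmax(w * ac[1:])]

# synthetic log spectrogram: an energy burst every 0.5 s (120 BPM) at 250 frames/s
spec = np.full((2500, 40), 1e-3)
spec[::125, :] = 1.0
ose = onset_strength(np.log(spec))
print(round(tempo_estimate(ose)))
```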

SLIDE 10

Beat Tracking

  • Dynamic Programming finds beat times {t_i}
  • Optimizes Σ_i O(t_i) + Σ_i W((t_{i+1} - t_i - p)/β)

where O(t) is the onset strength envelope (local score), W(t) is a log-Gaussian window (transition cost), and p is the default beat period per the measured tempo; incrementally find the best predecessor at every time, then backtrace from the largest final score to get the beats

C*(t) = γ O(t) + (1 - γ) max_τ { W((t - τ - p)/β) C*(τ) }
P(t) = argmax_τ { W((t - τ - p)/β) C*(τ) }
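A minimal numpy sketch of the DP recursion and backtrace; the predecessor search window and the `tightness` weight are illustrative choices, not the published parameters.

```python
import numpy as np

def beat_track(ose, period, tightness=100.0):
    """DP beat tracker: maximize summed onset strength plus a
    log-Gaussian penalty on deviations from the ideal beat spacing."""
    n = len(ose)
    score = ose.astype(float).copy()
    backlink = -np.ones(n, dtype=int)
    for t in range(n):
        # candidate predecessors roughly half to twice a period back
        lo, hi = t - int(2 * period), t - int(period / 2)
        if hi <= 0:
            continue                      # no predecessor yet
        prev = np.arange(max(lo, 0), hi)
        # transition cost: log-Gaussian window around the ideal spacing
        txcost = -tightness * np.log((t - prev) / period) ** 2
        best = np.argmax(score[prev] + txcost)
        backlink[t] = prev[best]
        score[t] += score[prev[best]] + txcost[best]
    beats = [int(np.argmax(score))]       # backtrace from the best end
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return beats[::-1]

# impulse onsets every 50 frames with one missing: the DP bridges the gap
ose = np.zeros(300)
ose[::50] = 1.0
ose[150] = 0.0
beats = beat_track(ose, period=50)
print(beats)
```

The missing onset at frame 150 still gets a beat, which is exactly the "DP will bridge gaps" behavior described on the next slide.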

SLIDE 11

Beat Tracking

  • DP will bridge gaps (non-causal)

there is always a best path ...

  • 2nd place in MIREX 2006 Beat Tracking

compared to McKinney & Moelants human data

[Figure: McKinney & Moelants human tapping data (test 2, Bragg); Alanis Morissette, “All I Want”: gap bridged by beats]

SLIDE 12

Chroma Features

  • Chroma features convert spectral energy

into musical weights in a canonical octave

i.e. 12 semitone bins

  • Can resynthesize as “Shepard Tones”

all octaves at once

[Figure: piano chromatic scale: spectrogram, IF chroma, Shepard-tone resynthesis, and 12 Shepard-tone spectra]
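Folding a spectrum into chroma bins can be sketched as follows; this is a simplified version that assigns each FFT bin to its nearest semitone class (the instantaneous-frequency refinement used in the figure is omitted).

```python
import numpy as np

def chroma(mag_spec, sr, n_fft):
    """Fold a magnitude spectrum into 12 semitone (chroma) bins by
    assigning each FFT bin to its nearest pitch class."""
    freqs = np.arange(1, n_fft // 2 + 1) * sr / n_fft      # skip DC
    # semitone number relative to A440, folded into one octave
    pitch_class = np.round(12 * np.log2(freqs / 440.0)).astype(int) % 12
    out = np.zeros((mag_spec.shape[0], 12))
    for pc in range(12):
        out[:, pc] = mag_spec[:, 1:][:, pitch_class == pc].sum(axis=1)
    return out

# a pure A (440 Hz) frame lands in chroma bin 0 (A)
sr, n_fft = 8000, 1024
t = np.arange(n_fft) / sr
frame = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t)))[None, :]
c = chroma(frame, sr, n_fft)
print(c.argmax())
```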

SLIDE 13

[Figure: aligned chroma and global model across Beatles “Revolver” tracks: Taxman; Eleanor Rigby; I'm Only Sleeping; She Said She Said; Good Day Sunshine; And Your Bird Can Sing; Love You To; Yellow Submarine]

Key Estimation

  • Covariance of chroma reflects key
  • Normalize by transposing for best fit

single Gaussian model of one piece; find ML rotation

  • For other pieces

model all transposed pieces; iterate until convergence


Ellis ICASSP ’07
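The transpose-to-best-fit idea can be illustrated with plain correlation against a reference profile; this is a toy sketch, not the paper's single-Gaussian ML model, and the "major profile" below is an invented stand-in rather than a measured one.

```python
import numpy as np

def best_transposition(chroma_mean, reference):
    """Rotation (in semitones) that best aligns a piece's average chroma
    with a reference profile; plain correlation stands in for the
    paper's single-Gaussian ML fit."""
    scores = [np.dot(np.roll(chroma_mean, -r), reference) for r in range(12)]
    return int(np.argmax(scores))

# invented major-key profile rooted on C (a stand-in, not a real profile)
ref = np.array([1, 0, .5, 0, .5, .5, 0, 1, 0, .5, 0, .5])
piece = np.roll(ref, 2) + 0.01      # the same profile, now rooted on D
print(best_transposition(piece, ref))
```

Iterating this over a collection (re-fit the global model, re-rotate each piece, repeat) is the convergence loop described above.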

SLIDE 14

Chord Transcription

  • “Real Books” give chord transcriptions

but no exact timing .. just like speech transcripts

  • Use EM to simultaneously

learn and align chord models


# The Beatles - A Hard Day's Night # G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G D G C7 G F6 G C7 G F6 G C D G C9 G Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G C9 G Cadd9 Fadd9

[Figure: EM training loop (speech-recognition analogy): model inventory; uniform initialization of alignments; repeat until convergence: E-step computes probabilities of the unknowns p(q_n | X_1..N, Θ_old); M-step maximizes over parameters Θ: max E[log p(X, Q | Θ)]]

Sheh & Ellis ’03

SLIDE 15

Chord Transcription

  • Needed more training data...


MFCCs are poor (can overtrain); PCPs better (ROT helps generalization)

(random ~3%)

Frame-level Accuracy:

Feature   Recog.   Alignment
MFCC       8.7%     22.0%
PCP_ROT   21.7%     76.0%

[Figure: true vs. aligned vs. recognized chord sequences, with chroma intensity by pitch class; Beatles, “Eight Days a Week” (Beatles For Sale), 4096-point frames]

SLIDE 16
  • 3. Music Similarity
  • The most central problem...

motivates extracting musical information; supports real applications (playlists, discovery)

  • But do we need content-based similarity?

compete with collaborative filtering; compete with fingerprinting + metadata

  • Maybe ... for the Future of Music

connect listeners directly to musicians


SLIDE 17

Discriminative Classification

  • Classification as a proxy for similarity
  • Distribution models...
  • vs. SVM

[Figure: two artist-ID pipelines: per-frame MFCCs modeled by GMMs and compared with KL divergence (min over artists), vs. song-level features classified with a DAG SVM]

Mandel & Ellis ‘05

SLIDE 18

Segment-Level Features

  • Statistics of spectra and envelope

define a point in feature space

for SVM classification, or Euclidean similarity...


Mandel & Ellis ‘07

SLIDE 19

MIREX’07 Results

  • One system for similarity and classification

[Figure: MIREX 2007 Audio Music Similarity and Audio Classification results (genre ID, hierarchical genre ID, raw mood ID, composer ID, artist ID)]

PS = Pohle, Schnitzer; GT = George Tzanetakis; LB = Barrington, Turnbull, Torres, Lanckriet; CB = Christoph Bastuck; TL = Lidy, Rauber, Pertusa, Iñesta; ME = Mandel, Ellis; BK = Bosteels, Kerre; PC = Paradzinets, Chen; IM = IMIRSEL M2K; KL = Kyogu Lee; CL = Laurier, Herrera; GH = Guaus, Herrera

SLIDE 20

Active-Learning Playlists

  • SVMs are well suited to “active learning”

solicit labels on items closest to current boundary

  • Automatic player

with “skip” = Ground truth data collection

active-SVM automatic playlist generation


SLIDE 21

Cover Song Detection

  • “Cover Songs” = reinterpretation of a piece

different instrumentation, character; no match with “timbral” features

  • Need a different representation!

beat-synchronous chroma features

[Figure: “Let It Be”, The Beatles vs. Nick Cave: verse spectrograms, frame-level chroma, and beat-synchronous chroma features]

Ellis & Poliner ‘07

SLIDE 22


Beat-Synchronous Chroma Features

  • Beat + chroma features on 30 ms frames

→ average chroma within each beat

compact; sufficient?

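Averaging chroma frames within each inter-beat interval is essentially a one-liner; a minimal sketch, assuming beat times are already available as frame indices.

```python
import numpy as np

def beat_sync_chroma(chroma_frames, beat_frames):
    """Average 12-bin chroma frames within each inter-beat interval,
    yielding one compact chroma vector per beat."""
    pairs = zip(beat_frames[:-1], beat_frames[1:])
    return np.array([chroma_frames[a:b].mean(axis=0) for a, b in pairs])

# 100 frames of 12-bin chroma with a chord change on the third beat
frames = np.tile(np.eye(12)[0], (100, 1))
frames[50:75] = np.eye(12)[5]
sync = beat_sync_chroma(frames, [0, 25, 50, 75])
print(sync.shape)
```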

SLIDE 23

Matching: Global Correlation

  • Cross-correlate entire beat-chroma matrices

... at all possible transpositions; implicit combination of match quality and duration

  • One good matching fragment is sufficient...?

[Figure: cross-correlation of beat-chroma matrices, Elliott Smith vs. Glen Phillips, “Between the Bars”: peak at the matching skew (beats) and transposition (semitones)]
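The all-transpositions cross-correlation can be sketched as follows; the per-row mean removal is an added normalization (not necessarily the paper's), and the "cover" here is a synthetic transposed, time-offset copy rather than a real recording.

```python
import numpy as np

def cover_match(a, b):
    """Peak cross-correlation of two beat-chroma matrices (12 x beats)
    over all 12 transpositions and all relative time skews."""
    a = a - a.mean(axis=1, keepdims=True)   # added normalization
    b = b - b.mean(axis=1, keepdims=True)
    best = -np.inf
    for rot in range(12):                   # every transposition
        br = np.roll(b, rot, axis=0)
        # sum row-wise correlations over all skews, keep the peak
        xc = sum(np.correlate(a[c], br[c], mode='full') for c in range(12))
        best = max(best, xc.max())
    return best

rng = np.random.default_rng(0)
song = rng.random((12, 60))
cover = np.roll(song, 3, axis=0)[:, 10:]    # transposed, time-offset "cover"
other = rng.random((12, 60))
print(cover_match(song, cover) > cover_match(song, other))
```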

SLIDE 24

MIREX 06 Results

  • Cover song contest

30 songs × 11 versions of each (!) (data has not been disclosed); metric: # true covers in top 10; 8 systems compared (4 cover song + 4 similarity)

  • Found 761/3300

= 23% recall

next best: 11% guess: 3%

[Figure: MIREX 06 Cover Song Results: # covers retrieved per query song (rows 1-30) per system (CS, DE, KL1, KL2, KWL, KWT, LR, TP; cover-song and similarity systems); color = correct matches retrieved]

SLIDE 25

Cross-Correlation Similarity

  • Use cover-song approach to find similarity

e.g. similar note/instrumentation sequence may sound very similar to judges

  • Numerous variants

try on chroma (melody/harmony) and MFCCs (timbre); try full search (xcorr)

  • or landmarks (indexable)

compare to random, segment-level stats

  • Evaluate by subjective tests

modeled after MIREX similarity


SLIDE 26

Cross-Correlation Similarity

  • Human web-based judgments

binary judgments for speed; 6 users × 30 queries × 10 candidate returns

  • Cross-correlation inferior to baseline...

... but is getting somewhere, even with ‘landmark’

Algorithm                      Similar count (of 180)
(1) Xcorr, chroma              48/180 = 27%
(2) Xcorr, MFCC                48/180 = 27%
(3) Xcorr, combo               55/180 = 31%
(4) Xcorr, combo + tempo       34/180 = 19%
(5) Xcorr, combo at boundary   49/180 = 27%
(6) Baseline, MFCC             81/180 = 45%
(7) Baseline, rhythmic         49/180 = 27%
(8) Baseline, combo            88/180 = 49%
Random choice 1                22/180 = 12%
Random choice 2                28/180 = 16%

SLIDE 27

Cross-Correlation Similarity

  • Results are not overwhelming

.. but database is only a few thousand clips


SLIDE 28

“Anchor Space”

  • Acoustic features describe each song

.. but from a signal, not a perceptual, perspective .. and not the differences between songs

  • Use genre classifiers to define new space

prototype genres are “anchors”

[Figure: audio inputs from two classes converted to n-dimensional “anchor space” vectors (p(a1|x) ... p(an|x)) via GMM modeling, then compared by KL divergence, EMD, etc.]

Berenzweig & Ellis ‘03

SLIDE 29


“Anchor Space”

  • Frame-by-frame high-level categorizations

compare to raw features? properties in distributions? dynamics?

[Figure: madonna vs. bowie frames plotted in cepstral feature space (third vs. fifth cepstral coefficients) and in anchor space (Country vs. Electronica)]

SLIDE 30


‘Playola’ Similarity Browser

SLIDE 31


Ground-truth data

  • Hard to evaluate Playola’s ‘accuracy’

user tests... ground truth?

  • “Musicseer” online

survey/game:

ran for 9 months in 2002; > 1,000 users, > 20k judgments

http://labrosa.ee.columbia.edu/projects/musicsim/ (Ellis et al. ’02)

SLIDE 32

“Semantic Bases”

  • Describe segment in human-relevant terms

e.g. anchor space, but more so

  • Need ground truth...

what words to people use?

  • MajorMiner

game:

400 users; 7,500 unique tags; 70,000 taggings; 2,200 10-sec clips used

  • Train classifiers...


SLIDE 33
  • 4. Music Structure Discovery
  • Use the many examples to map out the

“manifold” of music audio

... and hence define the subset that is music

  • Problems

alignment/registration of data; factoring & abstraction; separating parts?


[Figure: test-track log-likelihoods under artist models (32 GMMs on 1000 MFCC20s) for 18 artists: aerosmith, beatles, bryan_adams, creedence_clearwater_revival, dave_matthews_band, depeche_mode, fleetwood_mac, garth_brooks, genesis, green_day, madonna, metallica, pink_floyd, queen, rolling_stones, roxette, tina_turner, u2]
SLIDE 34


Eigenrhythms: Drum Pattern Space

  • Pop songs built on repeating “drum loop”

variations on a few bass, snare, hi-hat patterns

  • Eigen-analysis (or ...) to capture variations?

by analyzing lots of (MIDI) data, or from audio

  • Applications

music categorization; “beat box” synthesis; insight

Ellis & Arroyo ‘04

SLIDE 35


Aligning the Data

  • Need to align patterns prior to modeling...

tempo (stretch): by inferring BPM & normalizing; downbeat (shift): correlate against ‘mean’ template

SLIDE 36


Eigenrhythms (PCA)

  • Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
  • Eigenrhythms both add and subtract
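Eigen-analysis of the aligned patterns is ordinary PCA; a toy sketch via SVD on invented two-loop data (not the real MIDI training set), just to show the mechanics.

```python
import numpy as np

def eigenrhythms(patterns, k):
    """PCA of aligned drum patterns: each row is one flattened
    (instrument x time) pattern.  Returns the mean pattern, the top-k
    principal components ("eigenrhythms"), and the singular values."""
    mean = patterns.mean(axis=0)
    # SVD of the centered data; rows of vt are the eigenrhythms
    _, s, vt = np.linalg.svd(patterns - mean, full_matrices=False)
    return mean, vt[:k], s

# toy data: noisy mixtures of two invented loops (not real MIDI patterns)
rng = np.random.default_rng(1)
loop1 = np.zeros(48); loop1[::8] = 1.0     # "bass drum" every 8 ticks
loop2 = np.zeros(48); loop2[4::8] = 1.0    # "snare" on the off-beats
data = np.array([a * loop1 + b * loop2 + 0.01 * rng.standard_normal(48)
                 for a, b in rng.random((100, 2))])
mean, comps, s = eigenrhythms(data, 2)
# fraction of variance captured by the first two components
var_ratio = (s[:2] ** 2).sum() / (s ** 2).sum()
print(var_ratio > 0.95)
```

The centered components can go negative, which is the "both add and subtract" property noted above.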
SLIDE 37


Posirhythms (NMF)

  • Nonnegative: only adds beat-weight
  • Capturing some structure

[Figure: Posirhythms 1-6: bass drum (BD), snare (SN), and hi-hat (HH) weights over 4 beats (samples @ 200 Hz, beats @ 120 BPM)]
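The nonnegative decomposition can be sketched with Lee & Seung multiplicative updates; again toy data, and `k`, the iteration count, and the initialization are illustrative choices, not the published setup.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Nonnegative matrix factorization V ~ W @ H via Lee & Seung
    multiplicative updates; the basis patterns in H only ever add
    beat-weight, never subtract."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ (H @ H.T) + 1e-9)
    return W, H

# toy pattern matrix: 100 nonnegative mixtures of two invented loops
rng = np.random.default_rng(2)
loops = np.zeros((2, 48))
loops[0, ::8] = 1.0      # "bass drum" pattern
loops[1, 4::8] = 1.0     # "snare" pattern
V = rng.random((100, 2)) @ loops
W, H = nmf(V, 2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 0.1)
```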

SLIDE 38


Eigenrhythm BeatBox

  • Resynthesize rhythms from eigen-space
SLIDE 39


Melody Clustering

  • Goal: Find ‘fragments’ that recur in melodies

.. across large music database .. trade data for model sophistication

  • Data sources

pitch tracker, or MIDI training data

  • Melody fragment representation

DCT(1:20) - removes average, smoothes detail

[Figure: pipeline: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
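The DCT(1:20) representation can be written directly from the DCT-II basis. Dropping coefficient 0 makes the feature invariant to transposition, which the sketch below verifies on an invented contour.

```python
import numpy as np

def melody_fragment_features(pitch_contour, n_coef=20):
    """DCT coefficients 1..n_coef of a pitch contour: dropping
    coefficient 0 removes the average (transposition invariance);
    truncating at 20 smooths away fine detail."""
    n = len(pitch_contour)
    k = np.arange(1, n_coef + 1)[:, None]
    t = (np.arange(n) + 0.5)[None, :]
    basis = np.cos(np.pi * k * t / n)        # DCT-II basis, rows 1..n_coef
    return basis @ pitch_contour

# an invented 5-second contour (MIDI pitch); transposing it up 5
# semitones leaves the feature unchanged
contour = 60 + 4 * np.sin(np.linspace(0, 3 * np.pi, 250))
f1 = melody_fragment_features(contour)
f2 = melody_fragment_features(contour + 5)
print(np.allclose(f1, f2))
```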

SLIDE 40


Melody Clustering

  • Clusters match underlying contour:
  • Some interesting matches: e.g. Pink + Nsync

SLIDE 41

Beat-Chroma Fragment Codebook

  • Idea: Find the very popular music fragments

e.g. perfect cadence, rising melody, ...?

  • Clustering a large enough database should

reveal these

but: registration of phrase boundaries, transposition

  • Need to deal with really large datasets

e.g. 100k+ tracks, multiple landmarks in each; but Locality Sensitive Hashing can help: it quickly finds ‘most’ points within a certain radius
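Random-hyperplane LSH is one simple instance of the idea; a toy sketch that uses bit signatures directly as hash keys (a real system would add banding or multi-probe lookups to trade recall against bucket size).

```python
import numpy as np
from collections import defaultdict

def lsh_index(points, n_bits=16, seed=0):
    """Random-hyperplane LSH: points with the same bit signature are
    likely cosine-near neighbors; buckets give candidate matches fast."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, points.shape[1]))
    table = defaultdict(list)
    for i, sig in enumerate(points @ planes.T > 0):
        table[sig.tobytes()].append(i)
    return table, planes

def lsh_query(q, table, planes):
    """Return the indices hashed to the query's bucket."""
    return table.get((planes @ q > 0).tobytes(), [])

# toy beat-chroma fragments flattened to 96-dim vectors
rng = np.random.default_rng(3)
db = rng.standard_normal((1000, 96))
table, planes = lsh_index(db)
q = db[42] + 1e-4 * rng.standard_normal(96)   # near-duplicate of item 42
print(42 in lsh_query(q, table, planes))
```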

  • Experiments in progress...


SLIDE 42


Conclusions

  • Lots of data

+ noisy transcription + weak clustering ⇒ musical insights?

Music audio → tempo and beat → low-level features → {classification and similarity, music structure discovery, melody and notes, key and chords}

→ browsing, discovery, production, modeling, generation, curiosity