SLIDE 1

  • 1. Personal and Consumer Audio
  • 2. Segmenting & Clustering
  • 3. Special-Purpose Detectors
  • 4. Generic Concept Detectors
  • 5. Challenges & Future

Analysis of Everyday Sounds

Dan Ellis and Keansub Lee

Laboratory for Recognition and Organization of Speech and Audio

  • Dept. Electrical Eng., Columbia Univ., NY USA

dpwe@ee.columbia.edu

SLIDE 2

LabROSA Overview

[Diagram: LabROSA work combines Information Extraction, Machine Learning, and Signal Processing, applied to speech, music, and environmental audio, for recognition, retrieval, and separation.]

SLIDE 3

  • 1. Personal Audio Archives
  • Easy to record everything you hear

< 2 GB / week @ 64 kbps (see the arithmetic sketch at the end of this slide)

  • Hard to find anything

how to scan? how to visualize? how to index?

  • Need automatic analysis
  • Need minimal impact
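A quick sanity check on the "< 2 GB / week" figure; a minimal sketch, assuming roughly 8 recorded hours per day (the duty cycle is our assumption, not stated on the slide):

```python
# Back-of-envelope storage for personal audio captured at 64 kbps.
bitrate_bps = 64_000                     # 64 kbps compressed audio
hours_per_day = 8                        # assumed recording hours per day
seconds_per_week = hours_per_day * 3600 * 7
bytes_per_week = (bitrate_bps / 8) * seconds_per_week
print(f"{bytes_per_week / 1e9:.2f} GB/week")   # -> 1.61 GB/week, under 2 GB
```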


SLIDE 4

Personal Audio Applications

  • Automatic appointment-book history

fills in when & where of movements

  • “Life statistics”

how long did I spend in meetings this week? most frequent conversations? favorite phrases?

  • Retrieving details

what exactly did I promise? privacy issues...

  • Nostalgia
  • ... or what?


SLIDE 5

Consumer Video

  • Short video clips as the evolution of snapshots

10-60 sec, one location, no editing; how to browse?

  • More information for indexing...

video + audio; foreground + background


SLIDE 6

Information in Audio

  • Environmental recordings contain info on:

location – type (restaurant, street, ...) and specific
activity – talking, walking, typing
people – generic (2 males), specific (Chuck & John)
spoken content ... maybe

  • but not:

what people and things “looked like”
day/night ...
... except when correlated with audible features


SLIDE 7

A Brief History of Audio Processing

  • Environmental sound classification

draws on earlier sound classification work

as well as source separation...

[Diagram: soundtrack & environmental recognition draws on speech recognition (GMM-HMMs), speaker ID, music audio (genre & artist ID), and source separation (one-channel model-based and cue-based, and multi-channel).]

SLIDE 8

  • 2. Segmentation & Clustering
  • Top-level structure for long recordings:

Where are the major boundaries?

e.g. for diary application; support for manual browsing

  • Length of fundamental time-frame

60 s rather than 10 ms?
background more important than foreground
average out uncharacteristic transients

  • Perceptually-motivated features

.. so results have perceptual relevance
broad spectrum + some detail


SLIDE 9

MFCC Features

  • Need “timbral” features:

Mel-Frequency Cepstral Coeffs (MFCCs)

auditory-like frequency warping
log domain
discrete cosine transform = orthogonalization
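A minimal sketch of this pipeline, assuming librosa is available (the file name and parameter choices are illustrative, not the talk's):

```python
import librosa  # assumed available; any MFCC implementation would do

y, sr = librosa.load("clip.wav", sr=16000)   # hypothetical input file
# mel filterbank energies: the auditory-like frequency warping
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
# log compression, then DCT across bands = orthogonalization
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=20)
print(mfcc.shape)  # (20, n_frames)
```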


SLIDE 10

Long-Duration Features

  • Capture both average and variation
  • Capture a little more detail in subbands...

[Figure: six panels of long-duration features over time (freq / bark vs. time / min): Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation (dB), Average Spectral Entropy, Spectral Entropy Deviation (bits).]
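A minimal sketch of the idea, assuming frame-level subband energies are already computed (the array shapes and the 60 s window are assumptions):

```python
import numpy as np

def long_duration_stats(band_db, frames_per_window):
    """Collapse frame-level subband energies (n_frames x n_bands, in dB)
    into per-window mean and deviation, like the panels above."""
    n = (band_db.shape[0] // frames_per_window) * frames_per_window
    blocks = band_db[:n].reshape(-1, frames_per_window, band_db.shape[1])
    avg_log_energy = blocks.mean(axis=1)   # cf. "Average Log Energy"
    log_energy_dev = blocks.std(axis=1)    # cf. "Log Energy Deviation"
    return avg_log_energy, log_energy_dev

# e.g. with 10 ms frames and 60 s windows: frames_per_window = 6000
```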


SLIDE 11

Spectral Entropy

  • Auditory spectrum:
  • Spectral entropy ≈ ‘peakiness’ of each band:

$$H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk}\,X[n,k]}{A[n,j]} \log \frac{w_{jk}\,X[n,k]}{A[n,j]}, \qquad A[n,j] = \sum_{k=0}^{N_F} w_{jk}\,X[n,k]$$
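A direct numpy transcription of the formula; a sketch, assuming X is one STFT magnitude frame and W holds the auditory (e.g. Bark) band weights w_jk:

```python
import numpy as np

def spectral_entropy(X, W):
    """Per-band spectral entropy H[n, j] of one spectral frame.
    X: (n_fft_bins,) magnitudes; W: (n_bands, n_fft_bins) band weights.
    Returns entropy in bits, one value per band."""
    num = W * X[None, :]                   # w_jk * X[n, k]
    A = num.sum(axis=1, keepdims=True)     # band energies A[n, j]
    p = num / np.maximum(A, 1e-12)         # normalized in-band distribution
    return -np.sum(p * np.log2(np.where(p > 0, p, 1.0)), axis=1)
```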

[Figure: FFT spectral magnitude (energy / dB vs. freq / Hz), the warped Auditory Spectrum (bands centered 30-7380 Hz), and the per-band Spectral Entropies (rel. entropy / bits).]

SLIDE 12

BIC Segmentation

  • BIC (Bayesian Info. Crit.) compares models:

$$\log \frac{L(X_1; M_1)\,L(X_2; M_2)}{L(X; M_0)} \;\gtrless\; \frac{\lambda}{2} \log(N)\,\Delta\#(M)$$
[Figure: average log auditory spectrum for one afternoon (13:30-16:00) with the BIC score trace; each candidate boundary is tested within the current context, back to the last segmentation point, and either passes or is retried with a shorter context; models L(X; M0) vs. L(X1; M1) and L(X2; M2).]

SLIDE 13

BIC Segmentation Results

  • Evaluate: 62 hr hand-marked dataset

8 days, 139 segments, 16 categories
measure: Correct Accept % @ False Accept = 2%:

Feature               Correct Accept
μdB                   80.8%
μH                    81.1%
σH/μH                 81.6%
μdB + σH/μH           84.0%
μdB + σH/μH + μH      83.6%
MFCC                  73.6%

[Figure: ROC curves (Sensitivity vs. 1 - Specificity) for the same feature combinations.]

SLIDE 14

Segment Clustering

  • Daily activity has lots of repetition:

Automatically cluster similar segments

‘affinity’ of segments as KL2 distances
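A sketch of the symmetrized-KL ("KL2") distance between two segments, each summarized by a full-covariance Gaussian over its features (the exp(-d) affinity mapping at the end is an assumption):

```python
import numpy as np

def kl2_gauss(mu1, S1, mu2, S2):
    """Symmetrized KL divergence between two Gaussian segment models."""
    def kl(mu_a, Sa, mu_b, Sb):
        d = len(mu_a)
        iSb = np.linalg.inv(Sb)
        dm = mu_b - mu_a
        return 0.5 * (np.trace(iSb @ Sa) + dm @ iSb @ dm - d
                      + np.linalg.slogdet(Sb)[1] - np.linalg.slogdet(Sa)[1])
    return kl(mu1, S1, mu2, S2) + kl(mu2, S2, mu1, S1)

# affinity between segments i, j, e.g.: A[i, j] = np.exp(-kl2_gauss(...))
```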

[Figure: segment-affinity matrix with hand-labeled categories; the recoverable labels include lecture1, lecture2, break, billiard, barber, karaoke, meeting, supermkt, campus, library, restaurant, street, bowling, and home.]

SLIDE 15

Spectral Clustering

  • Eigenanalysis of affinity matrix: A = U•S•V′

eigenvectors vk give cluster memberships
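A minimal sketch of that eigenanalysis (taking the scaled top-k singular vectors as soft memberships; choosing k is the open question on this slide):

```python
import numpy as np

def cluster_memberships(A, k):
    """SVD of the affinity matrix A = U S V'; rows of the scaled top-k
    singular vectors act as soft cluster memberships."""
    U, s, Vt = np.linalg.svd(A)
    return U[:, :k] * s[:k]     # one column per 'anonymous class'
```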

  • Number of clusters?

[Figure: ~900x900 segment affinity matrix and its first four SVD components u_k s_kk v_k', k = 1..4.]

SLIDE 16

Clustering Results

  • Clustering of automatic segments gives

‘anonymous classes’

BIC criterion to choose number of clusters
make best correspondence to 16 ground-truth clusters

  • Frame-level scoring gives ~70% correct

errors when same ‘place’ has multiple ambiences


SLIDE 17

Browsing Interface

  • Browsing / Diary interface

links to other information (diary, email, photos)
synchronize with note taking? (Stifelman & Arons)
audio thumbnails

  • Release Tools + “how to” for capture

[Figure: diary-view calendar of one week (2004-09-13 to 2004-09-17, 08:00-17:00), with automatically clustered segments labeled e.g. preschool, cafe, lecture, office, outdoor, lab, group, meeting2, seminar, and annotations of people met.]

SLIDE 18

  • 3. Special-Purpose Detectors: Speech
  • Speech emerges as most interesting content
  • Just identifying speech would be useful

goal is speaker identification / labeling

  • Lots of background noise

conventional Voice Activity Detection inadequate

  • Insight: Listeners detect pitch track (melody)

look for voice-like periodicity in noise

[Figure: spectrogram of a coffeeshop excerpt (0-4.5 s, 0-4 kHz).]

SLIDE 19

Voice Periodicity Enhancement

  • Noise-robust subband autocorrelation

15 min test set: 88% accuracy (without suppression: 79%)
also useful for enhancing speech by harmonic filtering


  • Subtract local average

suppresses steady background, e.g. machine noise
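A loose sketch of the idea, assuming framed audio and an 80-400 Hz voice-pitch range; here the "local average" is taken as the mean correlogram over the excerpt, which cancels steady machine-like backgrounds:

```python
import numpy as np

def voice_salience(frames, sr, fmin=80.0, fmax=400.0):
    """Per-frame voice-pitch salience from normalized autocorrelation,
    minus the average correlogram so steady backgrounds cancel.
    frames: (n_frames, frame_len) array; parameter values illustrative."""
    lo, hi = int(sr / fmax), int(sr / fmin)    # lag range for voice pitch
    ac = []
    for x in frames:
        r = np.correlate(x, x, mode="full")[len(x) - 1:]
        ac.append(r[:hi] / (r[0] + 1e-12))
    ac = np.asarray(ac)
    ac -= ac.mean(axis=0, keepdims=True)       # subtract (local) average
    return ac[:, lo:].max(axis=1)              # high => voice-like period
```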

SLIDE 20

Detecting Repeating Events

  • Recurring sound events can be informative

indicate similar circumstance... but:
define “event” – sound organization
define “recurring event” – how similar?
.. and how to find them – tractable?

  • Idea: Use hashing (fingerprints)

index points to other occurrences of each hash; intersection of hashes points to match

  • much quicker search

use a fingerprint insensitive to background?
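A minimal sketch of the hash-index idea: each (time, hash) landmark votes for clips that share it at a consistent time offset, so matches are found by intersection rather than exhaustive comparison (the min_hits threshold is an assumption):

```python
from collections import defaultdict

index = defaultdict(list)          # hash -> [(clip_id, time), ...]

def add_clip(clip_id, hashes):
    for t, h in hashes:            # hashes: iterable of (time, hash)
        index[h].append((clip_id, t))

def query(hashes, min_hits=3):
    votes = defaultdict(int)
    for t, h in hashes:
        for clip_id, t2 in index[h]:
            votes[(clip_id, round(t2 - t, 2))] += 1   # consistent offset
    return {k: v for k, v in votes.items() if v >= min_hits}
```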


with Jim Ogle

SLIDE 21

Shazam Fingerprints

  • Prominent spectral onsets are landmarks;

Use relations {f1, f2, t} as hashes

intrinsically robust to background noise
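A rough sketch of landmark hashing in the Shazam style; real systems use careful peak picking, but a per-frame argmax is enough to show the {f1, f2, Δt} hash structure (the fan-out and FFT size are assumptions):

```python
import numpy as np
from scipy.signal import stft

def landmark_hashes(x, sr, fan_out=3):
    """Pick one prominent peak per STFT frame as a landmark, then hash
    pairs of nearby landmarks as (f1, f2, dt), anchored at frame t1."""
    f, t, Z = stft(x, fs=sr, nperseg=512)
    mag = np.abs(Z)
    peaks = [(ti, int(mag[:, ti].argmax())) for ti in range(mag.shape[1])]
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1: i + 1 + fan_out]:  # pair with neighbors
            hashes.append((t1, (f1, f2, t2 - t1)))
    return hashes
```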

[Figure: Shazam fingerprint landmarks overlaid on a phone-ring spectrogram (0-3 s, 0-4 kHz).]

SLIDE 22

Exhaustive Search for Repeats

  • More selective hashes →

few hits required to confirm a match (faster; better precision)
but less robust to background (reduced recall)


  • Works well when exact structure repeats

recorded music, electronic alerts
no good for “organic” sounds, e.g. a garage door

SLIDE 23

Music Detector

  • Two characteristic features for music

strong, sustained periodicity (notes)
clear, rhythmic repetition (beat)
at least one should be present!

  • Noise-robust pitch detector

looks for high-order autocorrelation

  • Beat tracker

.. from Music IR work

[Diagram: music audio feeds two paths, a pitch-range subband autocorrelation with a local stability measure, and a rhythm-range envelope autocorrelation with a perceptual rhythm model; both feed a fused music classifier.]
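A sketch of the rhythm branch: autocorrelate a crude onset envelope and look for a strong peak at beat-range lags (the hop size and the 0.25-2 s period range are assumptions):

```python
import numpy as np

def rhythm_strength(x, sr, hop=512):
    """Onset-envelope autocorrelation peak in the beat range; a strong
    peak suggests clear rhythmic repetition."""
    n = len(x) // hop * hop
    env = np.abs(x[:n]).reshape(-1, hop).mean(axis=1)   # coarse envelope
    env = np.maximum(np.diff(env, prepend=env[0]), 0)   # half-wave rect.
    r = np.correlate(env, env, mode="full")[len(env) - 1:]
    r /= r[0] + 1e-12
    fps = sr / hop                                      # envelope rate
    lo, hi = int(0.25 * fps), int(2.0 * fps)            # beat-period lags
    return r[lo:hi].max()
```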

SLIDE 24

  • 4. Generic Concept Detectors
  • Consumer Video application: how to assist browsing?

system automatically tags recordings
tags chosen by usefulness, feasibility

  • Initial set of 25 tags defined:

“animal”, “baby”, “cheer”, “dancing” ...
human annotation of 1300+ videos
evaluate by average precision

  • Multimodal detection

separate audio + visual low-level detectors (then fused...)


SLIDE 25

MFCC Covariance Representation

  • Each clip/segment → fixed-size statistics

similar to speaker ID and music genre classification

  • Full Covariance matrix of MFCCs

maps the kinds of spectral shapes present

  • Clip-to-clip distances for SVM classifier

by KL divergence between the 2nd-order (mean + covariance) Gaussian clip models
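A sketch of the representation, assuming MFCC frames per clip; the kernel mapping and its gamma are assumptions, with kl2_gauss as in the clustering sketch earlier:

```python
import numpy as np

def clip_signature(mfcc):
    """Fixed-size clip statistics: mean and full covariance of the
    clip's MFCC frames (mfcc: (n_frames, n_dims))."""
    return mfcc.mean(axis=0), np.cov(mfcc, rowvar=False)

def svm_kernel(sig_a, sig_b, gamma=0.05):
    """Turn the clip-to-clip distance into an SVM kernel value."""
    return np.exp(-gamma * kl2_gauss(*sig_a, *sig_b))
```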

[Figure: a video soundtrack spectrogram (freq / kHz vs. time / sec), its MFCC features (MFCC bin vs. time), and the resulting 20x20 MFCC covariance matrix.]

SLIDE 26

GMM Histogram Representation

  • Want a more ‘discrete’ description

.. to accommodate nonuniformity in MFCC space
.. to enable other kinds of models...

  • Divide up feature space with a single Gaussian Mixture Model

.. then represent each clip by the components used
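A minimal sketch with scikit-learn; all_clip_mfccs (a hypothetical list of per-clip (n_frames, n_dims) arrays) and the 64-component size are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# one global GMM tiles MFCC space across the whole collection
gmm = GaussianMixture(n_components=64, covariance_type="diag")
gmm.fit(np.vstack(all_clip_mfccs))       # all_clip_mfccs: hypothetical list

def clip_histogram(mfcc):
    """Represent a clip by the GMM components its frames fall into."""
    counts = np.bincount(gmm.predict(mfcc), minlength=gmm.n_components)
    return counts / counts.sum()
```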

[Figure: MFCC frames scattered in (MFCC(0), MFCC(1)) space, the global Gaussian Mixture Model tiling that space, and per-category histograms of mixture-component counts over 15 components.]

SLIDE 27

Latent Semantic Analysis (LSA)

  • Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’

.. each clip may have several things going on

  • Topic sets optimized through EM

$$p(\mathrm{ftr} \mid \mathrm{clip}) = \sum_{\mathrm{topics}} p(\mathrm{ftr} \mid \mathrm{topic})\; p(\mathrm{topic} \mid \mathrm{clip})$$

use p(topic | clip) as per-clip features
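A compact EM sketch for that factorization; the iteration count, smoothing constant, and random initialization are assumptions:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """counts: (n_ftrs, n_clips) histogram matrix. Returns p(ftr|topic)
    and p(topic|clip), each with columns summing to 1."""
    rng = np.random.default_rng(seed)
    F, C = counts.shape
    p_f_t = rng.random((F, n_topics)); p_f_t /= p_f_t.sum(0)
    p_t_c = rng.random((n_topics, C)); p_t_c /= p_t_c.sum(0)
    for _ in range(n_iter):
        R = counts / (p_f_t @ p_t_c + 1e-12)      # E-step ratio
        p_f_t *= R @ p_t_c.T;  p_f_t /= p_f_t.sum(0)
        R = counts / (p_f_t @ p_t_c + 1e-12)
        p_t_c *= p_f_t.T @ R;  p_t_c /= p_t_c.sum(0)
    return p_f_t, p_t_c
```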

[Diagram: the (GMM histogram features x AV clips) matrix p(ftr | clip) factors as p(ftr | topic) times p(topic | clip).]

SLIDE 28

Audio-Only Results

  • Wide range of results:

audio concepts (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes

[Figure: per-concept Average Precision (roughly 0.1-0.8) over the concepts Animal, Baby, Beach, Birthday, Boat, Crowd, Group of 3+, Group of 2, Museum, Night, One person, Park, Picnic, Playground, Show, Sports, Sunset, Wedding, Dancing, Parade, Singing, Cheer, Music, and Ski, comparing 1G + KL, 1G + Mah, GMM Hist. + pLSA, and guessing.]

SLIDE 29

How does it ‘feel’?

  • Browser impressions: How wrong is wrong?

[Figure: top 8 hits for “Baby”.]

SLIDE 30

Confusion analysis

  • Where are the errors coming from?

[Figure: (a) matrix of overlapped manual labels and (b) confusion matrix of classified labels over the concept set, from Animal through Music.]

SLIDE 31

Fused Results - AV Joint Boosting

  • Audio helps in many classes

[Figure: per-concept AP (roughly 0.1-0.9) plus MAP for Random Baseline, Video Only, Audio Only, and A+V Fusion over all concepts, animal through wedding.]

SLIDE 32

  • 5. Future: Temporal Focus
  • Global vs. local class models

tell-tale acoustics may be ‘washed out’ in statistics
try iterative realignment of HMMs:
“background” (bg) model shared by all clips

[Figure: a YouTube “baby” clip spectrogram (0-15 s, 0-4 kHz) labeled two ways: old way, all frames contribute to each label (voice, baby, laugh); new way, labels get limited temporal extents with shared bg segments in between.]

SLIDE 33

Handling Sound Mixtures

  • MFCCs of mixtures ≠ mix of MFCCs

recognition despite widely varying background?
factorial models / Nonnegative Matrix Factorization
sinusoidal / landmark techniques
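A minimal sketch of the NMF route, assuming a magnitude spectrogram of the mixture; which basis columns belong to the voice is an assumption that in practice needs pre-trained or labeled bases:

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(stft_of_mixture)            # (n_freq, n_frames); hypothetical
model = NMF(n_components=16, init="nndsvd", max_iter=400)
W = model.fit_transform(V)             # spectral bases
H = model.components_                  # per-frame activations
n_voice = 8                            # assumed number of voice bases
V_voice = W[:, :n_voice] @ H[:n_voice] # estimated voice-only spectrogram
```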

[Figure: spectrograms (0-4 kHz) of a solo voice and a M+F voice mix, each alongside its MFCC noise resynthesis; audio examples crm-11737.wav, crm-11737-noise.wav, crm-11737+16515.wav, crm-11737+16515-noise.wav.]

SLIDE 34

Larger Datasets

  • Many detectors are visibly data-limited

getting data is ~hard
labeling data is expensive

  • Bootstrap from YouTube etc.

lots of web video is edited/dubbed...

  • need a “consumer video” detector?
  • Preliminary YouTube results disappointing

downloaded data needed extensive clean-up
models did not match Kodak data

  • (Freely available data!)


SLIDE 35

Conclusions

  • Environmental sound contains information

.. that’s why we hear!
.. computers can hear it too

  • Personal audio can be segmented, clustered

find specific sounds to help navigation/retrieval

  • Consumer video can be ‘tagged’

.. even in unpromising cases
audio is complementary to video

  • Interesting directions for better models
