Introduction to Speaker Diarization - Dr. Gerald Friedland (PowerPoint PPT presentation)


SLIDE 1

Introduction to Speaker Diarization

Dr. Gerald Friedland
International Computer Science Institute, Berkeley, CA
friedland@icsi.berkeley.edu

Monday, May 21, 12

SLIDE 2

Speaker Diarization...

  • tries to answer the question: “who spoke when?”
  • using a single or multiple microphone inputs
  • without prior knowledge of anything (number of speakers, language, text, etc.)

SLIDE 3

Visualization

Estimate “who spoke when” with no prior knowledge of speakers, number of speakers, words, or language spoken.

(Figure: an audio track with its segmentation and clustering shown as aligned timelines.)

SLIDE 8

Speaker Diarization is NOT

  • Speaker ID (Speaker ID is supervised and needs prior training)
  • Speaker Verification (is supervised and returns a yes/no answer)
  • Beamforming (as this requires multiple mics, even though beamforming can be used to support diarization)

SLIDE 12

Why Diarization?

  • Important basic technology for various semantic audio analysis tasks
  • Meeting retrieval, video conferencing, speaker-adaptive ASR, video retrieval, etc.
  • Let’s take a look at some examples

SLIDE 16

Application: Meeting Browsing

SLIDE 17

Application: Semantic Navigation

  • G. Friedland, L. Gottlieb, A. Janin: “Joke-o-mat: Browsing Sitcoms Punchline by Punchline”, Proceedings of ACM Multimedia, Beijing, China, October 2009.

SLIDE 18

Application: Video Duplicate Detection

SLIDE 19

Other Applications

(Speaker) Diarization is often used as underlying support for...

  • Beamforming
  • Visual Localization
  • Video Analysis: Object Detection, Event Detection, Scene Detection
  • Behavior-level analysis tasks, such as dominance detection
  • Robotics applications (e.g. addressing people)
  • Support for adaptive speech recognition

SLIDE 26

Main Drive: NIST RT Eval

  • Speaker Diarization was evaluated as part of the NIST Rich Transcription Evaluation (since about 2002)
  • Idea: create “Rich Transcripts” of broadcast news, later meetings
  • Evaluated on real-world data

SLIDE 30

Typical Component Composition for RT

(Diagram: the audio signal feeds a set of components, each answering a question.)

  • Speaker Diarization: “who spoke when”
  • Speech Recognition: “what was said”
  • Speaker Attribution: “who said what”
  • Relevant Web Scraping: “what’s relevant to this”
  • Summarization, Question Answering, and other higher-level analysis: “what are the main points”
  • Indexing, Search, Retrieval

SLIDE 32

Speaker Diarization: General Overview

(Diagram: Audio Signal -> Feature Extraction (MFCC) -> Speech/Non-Speech Detector (speech only) -> Diarization Engine (Segmentation + Clustering) -> Metadata.)

SLIDE 33

Output Format of Diarization

  • RTTM files (as defined by NIST)
  • Example:

SPEAKER soupnazi 1 40.0 2.5 <NA> <NA> George <NA>
SPEAKER soupnazi 1 42.5 2.5 <NA> <NA> Jerry <NA>
SPEAKER soupnazi 1 45.0 2.5 <NA> <NA> female <NA>

  • Many tools are available for working with these files.
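RTTM is a plain whitespace-delimited format, so the example records above can be read with a few lines of Python. This is a sketch, not one of the official NIST tools; the field positions follow the RTTM layout (type, file, channel, onset, duration, ortho, subtype, name, confidence):

```python
def parse_rttm(lines):
    """Return (file_id, start, end, speaker) tuples for SPEAKER records."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip other record types and blank lines
        file_id = fields[1]
        onset, dur = float(fields[3]), float(fields[4])
        name = fields[7]
        segments.append((file_id, onset, onset + dur, name))
    return segments

rttm = [
    "SPEAKER soupnazi 1 40.0 2.5 <NA> <NA> George <NA>",
    "SPEAKER soupnazi 1 42.5 2.5 <NA> <NA> Jerry <NA>",
]
print(parse_rttm(rttm))
```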

SLIDE 41

Error Measurement

  • US NIST defines the error metric and evaluates speaker diarization on a regular basis
  • The error metric is called ‘Diarization Error Rate’ (DER)
  • All tools are available as open source

SLIDE 45

Error Measurement

DER = the total time a speaker has been assigned wrongly (speaker error), missed (missed speech), detected when there is none (false alarm), or counted as a single speaker when more than one is talking, relative to the length of the audio.
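That definition can be sketched at the frame level in a few lines of Python. This is not NIST’s md-eval scorer: it assumes single-speaker frames, marks non-speech as None, and finds the best hypothesis-to-reference speaker mapping by brute force:

```python
from itertools import permutations

def der(ref, hyp):
    """Frame-level DER sketch: (missed + false alarm + speaker error)
    divided by the total amount of reference speech."""
    assert len(ref) == len(hyp)
    ref_spk = sorted({r for r in ref if r is not None})
    hyp_spk = sorted({h for h in hyp if h is not None})
    total = sum(r is not None for r in ref)
    if total == 0:
        return 0.0
    # pad so every reference speaker can receive a mapping
    pads = hyp_spk + [None] * max(0, len(ref_spk) - len(hyp_spk))
    best_err = None
    for perm in permutations(pads, len(ref_spk)):
        mapping = dict(zip(perm, ref_spk))  # hypothesis label -> reference label
        err = 0
        for r, h in zip(ref, hyp):
            if r is None and h is None:
                continue                     # agreed non-speech
            if r is None or h is None:
                err += 1                     # false alarm or missed speech
            elif mapping.get(h) != r:
                err += 1                     # speaker error
        best_err = err if best_err is None else min(best_err, err)
    return best_err / total

ref = ["A", "A", "B", "B", None]
hyp = ["x", "x", "x", "y", None]
print(der(ref, hyp))  # 1 wrong frame out of 4 reference speech frames -> 0.25
```

Note that hypothesis labels are arbitrary cluster names; the mapping search is what makes diarization scoring different from classification accuracy.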

SLIDE 46

Segmentation & Clustering

  • Originally: segment first, cluster later

Chen, S. S. and Gopalakrishnan, P., “Clustering via the Bayesian information criterion with applications in speech recognition,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, Vol. 2, Seattle, USA, pp. 645-648.

  • More efficient: top-down and bottom-up approaches

SLIDE 49

Segmentation: Secret Sauce

  • How do you distinguish speakers?
  • Combination of MFCC+GMM+BIC seems unbeatable!
  • Can be generalized to audio percepts

SLIDE 53

MFCC: Idea

(Diagram: Audio Signal -> Pre-emphasis -> Windowing -> FFT -> Mel-Scale Filterbank -> Log-Scale -> DCT -> MFCC, i.e. the power cepstrum of the signal.)

SLIDE 54

MFCC: Mel Scale

SLIDE 55

MFCC: Result

SLIDE 56

Gaussian Mixtures

SLIDE 57

Training of Mixture Models

Goal: find the mixture weights a_i (and the Gaussian means and variances) that maximize the likelihood of the data.

Expectation: for each frame x_t, compute each component’s responsibility gamma_ti = a_i N(x_t | mu_i, sigma_i^2) / sum_j a_j N(x_t | mu_j, sigma_j^2).

Maximization: re-estimate a_i = (1/T) sum_t gamma_ti, mu_i = sum_t gamma_ti x_t / sum_t gamma_ti, and sigma_i^2 analogously.
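A minimal one-dimensional version of this EM loop (scalar variances and quantile-based initialization are my simplifying assumptions):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Train a 1-D Gaussian mixture with EM (sketch)."""
    w = np.full(k, 1.0 / k)                        # mixture weights a_i
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # Expectation: responsibility of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = w * dens
        gamma /= gamma.sum(axis=1, keepdims=True) + 1e-300
        # Maximization: re-estimate weights, means, variances
        nk = gamma.sum(axis=0)
        w = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return w, mu, var
```

In diarization each cluster holds one such GMM over MFCC frames, usually with several multivariate components rather than this 1-D toy.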

SLIDE 58

Bayesian Information Criterion

BIC = log L(X | Θ) - λ (K/2) log N

where X is the sequence of features for a segment,
Θ are the parameters of the statistical model for the segment,
K is the number of parameters for the model,
N is the number of frames in the segment,
λ is an optimization parameter.
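Under those definitions, a BIC score for a segment, plus the merge test used in segmentation, might look like the following sketch (λ = 1 and a single full-covariance Gaussian per segment are my assumptions; the deck’s engines use GMMs):

```python
import numpy as np

def bic(x, lam=1.0):
    """BIC of one full-covariance Gaussian fit to segment x (frames x dims)."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    # log-likelihood of the maximum-likelihood Gaussian
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov)) + d)
    k = d + d * (d + 1) / 2          # mean + covariance parameters
    return loglik - lam * 0.5 * k * np.log(n)

def delta_bic(x, y, lam=1.0):
    """Merge score: BIC(joint) - BIC(x) - BIC(y).
    Positive favours merging (one model explains both segments)."""
    return bic(np.vstack([x, y]), lam) - bic(x, lam) - bic(y, lam)
```

Because the penalty term grows with the number of parameters, splitting into two models only wins when the likelihood gain outweighs the extra model complexity.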

SLIDE 59

Bayesian Information Criterion: Explanation

  • BIC penalizes the complexity of the model (as measured by the number of parameters in the model).
  • BIC measures the efficiency of the parameterized model in terms of predicting the data.
  • BIC is therefore used to choose the number of clusters according to the intrinsic complexity present in a particular dataset.

SLIDE 63

Bayesian Information Criterion: Properties

  • BIC is a minimum description length criterion.
  • BIC is independent of the prior.
  • It is closely related to other penalized likelihood criteria such as RIC and the Akaike information criterion.

SLIDE 67

Bottom-Up Algorithm

Start with too many clusters (initialized randomly). Purify clusters by comparing and merging similar clusters. Resegment and repeat until no more merging is needed.

(Diagram: Initialization -> (Re-)Training -> (Re-)Alignment -> Merge two Clusters? If yes, retrain and realign; if no, end.)
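The merge loop can be sketched as greedy agglomerative clustering over segments with a BIC-style merge test. This is a simplification: one diagonal Gaussian per cluster stands in for a GMM, and the (re-)alignment/resegmentation step is omitted:

```python
import numpy as np

def gauss_bic(x, lam=1.0):
    """BIC of one diagonal Gaussian fit to a cluster x (frames x dims)."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-6
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)
    return loglik - lam * 0.5 * (2 * d) * np.log(n)   # d means + d variances

def bottom_up(segments, lam=1.0):
    """Start with one cluster per segment; repeatedly merge the pair whose
    merged BIC beats the two separate BICs; stop when no merge helps."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = np.vstack([clusters[i], clusters[j]])
                gain = (gauss_bic(merged, lam)
                        - gauss_bic(clusters[i], lam)
                        - gauss_bic(clusters[j], lam))
                if gain > best:
                    best, pair = gain, (i, j)
        if pair is None:          # no merge improves BIC -> stop
            break
        i, j = pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

The stopping rule is what lets the algorithm estimate the number of speakers: it keeps merging only while a single model explains two clusters better than two separate models.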

SLIDE 79

ICSI’s Speaker Diarization

  • Speaker Diarization research @ ICSI since 2001
  • Various versions of diarization engines developed over the years
  • Status: research code, but stable for some applications that are error tolerant

SLIDE 83

ICSI’s Speaker Diarization Engine Variants

  • Basic (single mic, easy installation)
  • Fast (single mic, multiple CPU cores)
  • Super fast (single mic, multiple GPUs)
  • Accurate but slow (multi mic, additional preprocessing)
  • Audio/Visual (single and multi mic, for localization)
  • Online (single mic, “who is speaking now”)

SLIDE 90

Basic Speaker Diarization: Facts

  • Input: 16 kHz mono audio
  • Features: MFCC19, no delta or delta-delta
  • Speech/Non-Speech Detector external
  • Runtime: ~realtime (1 h of audio needs 1 h of processing on a single CPU, excluding speech/non-speech)

SLIDE 95

Multi-CPU Speaker Diarization: Facts

  • Same as Basic Speaker Diarization
  • Runtime: depends on the number of CPUs used. Example: with 8 cores, runtime = 14.3 x realtime, i.e. 14 minutes of audio need 1 minute of processing.
  • Runtime bottleneck usually: Speech/Non-Speech Detector

SLIDE 99

GPU Speaker Diarization: Facts

  • Same as Basic Speaker Diarization
  • Runtime: 250 x realtime, i.e. 1 h of audio is processed in 14.4 s!
  • Uses the current NVIDIA CUDA framework as backend
  • Frontend: Python!
  • Runtime bottleneck usually: Speech/Non-Speech Detector, Feature Extraction

SLIDE 105

Demo: 1 CPU vs 8 CPU vs GPU

SLIDE 106

Most Accurate Speaker Diarization: Overview

(Diagram: the audio passes through Wiener filtering, beamforming, and dynamic range compression; short-term features (MFCC), long-term features (prosodics), and delay features are extracted from the speech-only portions; EM clustering yields initial segments for the diarization engine’s segmentation and clustering, which outputs “who spoke when”.)

SLIDE 107

Audio/Visual Speaker Diarization: Overview

(Diagram: the audio signal goes through feature extraction (MFCC) and a speech/non-speech detector into the diarization engine (segmentation and clustering), which outputs “who spoke when”. In parallel, feature extraction on the video signal yields video activity events for the speech regions only; inverting the visual models gives “where the speaker was”.)

SLIDE 108

Video Feature Extraction

(Diagram: MPEG-4 video -> divide frames into n regions -> detect skin blocks -> average motion vectors per region -> n-dimensional activity vector. Window size: 400 ms.)
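A toy version of that region-activity feature: frame differencing stands in for the MPEG-4 motion vectors, skin-block detection is omitted, and the 3x3 grid is an assumption of this sketch:

```python
import numpy as np

def activity_vector(prev_frame, frame, grid=(3, 3)):
    """Split the frame into grid regions and take the mean absolute pixel
    difference per region, giving an n-dimensional activity vector."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    gh, gw = grid
    h, w = diff.shape
    vec = np.empty(gh * gw)
    for r in range(gh):
        for c in range(gw):
            block = diff[r * h // gh:(r + 1) * h // gh,
                         c * w // gw:(c + 1) * w // gw]
            vec[r * gw + c] = block.mean()  # average activity in this region
    return vec
```

Averaging these vectors over a 400 ms window, as the slide indicates, yields one visual observation per audio analysis window.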

SLIDE 109

Audio/Visual Speaker Diarization: Facts

  • One engine for audio and video
  • Scales with n cameras
  • Robust against visual changes such as different clothing, occlusions, etc. “A voiceprint does not care about somebody dimming the light.”

SLIDE 113

Audio/Visual Diarization: Example Video

SLIDE 114

In a perfect world...

  • There is no overlapped speech
  • The signal is clean
  • No environmental noise
  • Limited number of speakers (4 or so)
  • Speakers are well-distinguishable by their voice (e.g. male vs. female, young vs. old)
  • Speakers are non-emotional
  • Recording is at 16 kHz
  • Recording is 15-60 minutes in length

SLIDE 123

Current Results using Different Inputs

12 meeting recordings from the AMI corpus

System                      | Diarization Error Rate | Relative Improvement | Core Speed (x realtime)
Basic: 1 audio stream       | 32.09%                 | baseline             | 1.0
8 audio streams             | 27.55%                 | 14%                  | 2.2
1 audio stream + 1 camera   | 27.52%                 | 14%                  | 1.4
1 audio stream + 4 cameras  | 24.00%                 | 25%                  | 1.3

SLIDE 124

Most Accurate Results

12 meetings from the AMI corpus (“VACE Meetings”)

System                    | Diarization Error Rate | Relative Improvement | Core Speed (x realtime)
MFCC only (basic system)  | 32.09%                 | baseline             | 1.0
Full System               | 20.33%                 | 36%                  | 2.5
Full System + One Camera  | 18.98%                 | 41%                  | 2.9

SLIDE 125

Top Error Sources

  • Overlapped Speech
  • Short Speech Segments (<2s)
  • Environmental Noise
  • Low SNR
  • Bad Speech/Non-Speech Detector performance based on training data mismatch
  • Parameter mismatch, e.g. too few initial clusters

SLIDE 132

Optimal Performance is achieved when...

  • There is no overlapped speech
  • The signal is clean
  • No environmental noise
  • Limited number of speakers (4 or so)
  • Speakers are well-distinguishable by their voice (e.g. male vs. female, young vs. old)
  • Speakers are non-emotional
  • Recording is at 16 kHz or higher

SLIDE 140

Future Work!

SLIDE 141

Thank You!

Questions?

Some of the presented work was performed together with: Mary Knox, Katya Gonina, Adam Janin and others.