Workshop Programme Introduction: VoxCeleb, VoxConverse & VoxSRC - PowerPoint PPT Presentation





SLIDE 1
SLIDE 2

Workshop Programme

19:00 Introduction: “VoxCeleb, VoxConverse & VoxSRC” – Arsha Nagrani, Joon Son Chung & Andrew Zisserman
19:25 Keynote: “X-vectors: Neural Speech Embeddings for Speaker Recognition” – Daniel Garcia-Romero
20:00 Speaker verification: leaderboards and winners for Tracks 1-3
20:05 Participant talks from Tracks 1, 2 and 3; live Q&A
20:50 Coffee break
21:10 Keynote: “Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition” – Shinji Watanabe
21:40 Diarization: leaderboards and winners for Track 4
21:42 Participant talks from Track 4; live Q&A
22:00 Wrap-up discussions and closing

SLIDE 3

Organisers

Andrew Zisserman, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitch McLaren, Doug Reynolds, Arsha Nagrani


SLIDE 5
Introduction

  • 1. Data: VoxCeleb and VoxConverse
  • 2. Challenge mechanics: new tracks, rules and metrics

SLIDE 6
VoxCeleb datasets

  • Multi-speaker environments
  • Varying audio quality and background channel noise
  • Freely available

Settings range from studio interviews to outdoor and pitch interviews and red carpet interviews.

SLIDE 7

VoxCeleb - automatic pipeline

Transferring labels from vision to speech: input video -> face + landmarks detection -> active speaker detection -> face verification against known identities (e.g. matching Felicity Jones) -> VoxCeleb.

SLIDE 8

VoxCeleb Statistics

               Train      Validation
# Speakers     5,994      1,251
# Utterances   1,092,009  153,516

  • VoxCeleb2 dev set -> primary data for speaker verification
  • Validation toolkit for scoring
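Not from the slides: a minimal sketch of how a verification trial is typically scored, assuming each utterance has already been mapped to a fixed-dimensional speaker embedding (the embedding values below are illustrative, not from any real model).

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings.

    Higher scores indicate the two utterances are more likely
    to come from the same speaker."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A trial list pairs two utterances; comparing the score to a
# threshold decides "same speaker" vs. "different speaker".
same = cosine_score([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
diff = cosine_score([0.2, 0.9, 0.1], [-0.8, 0.1, 0.6])
```

A real system scores every pair in the official trial list this way (often after embedding normalisation), then the validation toolkit turns the score list into the challenge metrics.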
SLIDE 9

A more challenging test set - VoxMovies

  • Hard samples found from VoxCeleb identities speaking in movies
  • Playing characters, showing strong emotions, background noise

[Figure: paired VoxCeleb and VoxMovies clips illustrating accent change, background music and emotion]


SLIDE 11

A more challenging test set - VoxMovies

  • Audio dataset, but we use visual methods to collect it (the VoxCeleb automatic pipeline)

[Figure: VoxCeleb and VoxMovies samples of Steve Martin]

SLIDE 12

Audio speaker diarization

  • Solving “who spoke when” in multi-speaker video.

[Figure: multi-speaker video clip featuring Steve Martin]
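Not from the slides: diarization output ("who spoke when") is conventionally exchanged in the RTTM format, one SPEAKER line per segment. A minimal writer sketch, with illustrative file IDs, times and labels:

```python
def to_rttm(file_id, segments):
    """Format diarization output as RTTM: one SPEAKER line per segment.

    segments: iterable of (start_seconds, end_seconds, speaker_label)."""
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

# Example: two speakers taking turns, with a short overlap at 3.9-4.2 s.
print(to_rttm("clip001", [(0.0, 4.2, "spk0"), (3.9, 7.5, "spk1")]))
```

RTTM records onset and duration (not end time), which is why the writer emits `end - start`; overlapping speech simply becomes overlapping segments from different labels.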

SLIDE 13

Diarization - The VoxConverse dataset
http://www.robots.ox.ac.uk/~vgg/data/voxconverse/

  • 526 videos from YouTube
  • Mostly debates, talk shows, news segments

SLIDE 14

Automatic audio-visual diarization method

Input video -> face detection & face track clustering -> active speaker detection -> audio-visual source separation -> speaker verification -> VoxConverse.

Chung, Joon Son, et al. "Spot the conversation: speaker diarisation in the wild." INTERSPEECH (2020).

SLIDE 15

The VoxCeleb Speaker Recognition Challenge

SLIDE 16
VoxSRC-2020 tracks

Two NEW tracks this year!

  • Track 1: Supervised speaker verification (closed)
  • Track 2: Supervised speaker verification (open)
  • Track 3: Self-supervised speaker verification (closed)
  • Track 4: Speaker diarization (open)

SLIDE 17

New Tracks

Track 3: Self-supervised speaker verification

  • No speaker labels allowed
  • Can use future frames, visual frames, or any other objective from the video itself

Track 4: Speaker diarization

  • Solving “who spoke when” in multi-speaker video.
  • Speaker overlap, challenging background conditions
SLIDE 18

Mechanics

  • Metrics (Tracks 1-3): DCF and EER, following NIST SRE 2018
  • Metrics (Track 4): DER and JER; overlapping speech is counted, with a collar of 0.25 s
  • Only 1 submission per day, 5 in total
  • Submissions via CodaLab
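Not from the slides: a minimal sketch of computing the EER, the threshold-free operating point where the false-accept rate equals the false-reject rate (the trial scores and labels below are illustrative; challenge submissions are scored with the official toolkit).

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: sweep thresholds over the trial scores and
    return the point where false-accept rate == false-reject rate.

    scores: similarity score per trial (higher = more likely same speaker)
    labels: 1 for target (same-speaker) trials, 0 for impostor trials"""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    target = labels == 1
    # Accept a trial when its score reaches the threshold.
    far = np.array([(scores[~target] >= t).mean() for t in thresholds])
    frr = np.array([(scores[target] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Perfectly separated scores give an EER of 0.
eer = compute_eer([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
```

DCF weights the two error rates by application-dependent costs and priors instead of equating them, and DER/JER for Track 4 additionally require aligning hypothesis and reference segments with the 0.25 s collar, which is usually left to dedicated diarization scoring tools.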
SLIDE 19
Test Sets

  • New, more difficult test sets
  • Manual verification of all speech segments
  • In addition, annotators pay particular attention to examples whose speaker embeddings are far from cluster centres