Introduction: VoxCeleb, VoxConverse & VoxSRC
Arsha Nagrani, Joon Son Chung & Andrew Zisserman
Workshop Programme
19:00  Introduction: “VoxCeleb, VoxConverse & VoxSRC” – Arsha Nagrani, Joon Son Chung & Andrew Zisserman
19:25  Keynote: “X-vectors: Neural Speech Embeddings for Speaker Recognition” – Daniel Garcia-Romero
20:00  Speaker verification: leaderboards and winners for Tracks 1-3
20:05  Participant talks from Tracks 1, 2 and 3, live Q&A
20:50  Coffee break
21:10  Keynote: “Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition” – Shinji Watanabe
21:40  Diarization: leaderboards and winners for Track 4
21:42  Participant talks from Track 4, live Q&A
22:00  Wrap-up discussions and closing
Organisers
Andrew Zisserman, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitch McLaren, Doug Reynolds, Arsha Nagrani
Introduction
1. Data: VoxCeleb and VoxConverse
2. Challenge mechanics: new tracks, rules and metrics
VoxCeleb datasets
- Multi-speaker environments
- Varying audio quality and background channel noise
- Freely available
[Example settings: studio interviews, outdoor and pitch interviews, red carpet interviews]
VoxCeleb – automatic pipeline
Transferring labels from vision to speech:
Input video → face + landmark detection → active speaker detection → face verification against known identities (e.g. Felicity Jones) → matched VoxCeleb segments
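The stages above can be read as one sequential filter over a video. A minimal sketch, where every helper name is a hypothetical placeholder standing in for the corresponding stage, not a published API:

```python
# High-level sketch of the automatic labelling pipeline described above.
# All helper functions are hypothetical placeholders for the named stages.

def label_speech_segments(video, celebrity_gallery):
    """Transfer identity labels from vision to speech for one video."""
    faces = detect_faces_and_landmarks(video)             # 1. face + landmark detection
    tracks = build_face_tracks(faces)                     # group detections over time
    segments = []
    for track in tracks:
        if not is_active_speaker(track, video.audio):     # 2. active speaker detection
            continue
        identity = verify_face(track, celebrity_gallery)  # 3. face verification
        if identity is not None:                          # keep only confident matches
            segments.append((track.time_span, identity))
    return segments  # (time span, speaker identity) pairs for the audio
```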
VoxCeleb Statistics
              Train      Validation
# Speakers    5,994      1,251
# Utterances  1,092,009  153,516
- VoxCeleb2 dev set -> primary data for speaker verification
- Validation toolkit for scoring (a minimal scoring sketch follows below)
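To make the verification task concrete, here is a minimal sketch of trial-list scoring: cosine-score each pair of utterance embeddings. The trial layout mirrors the public VoxCeleb test lists; the speaker encoder producing the embeddings is assumed, not provided.

```python
import numpy as np

def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) /
                 (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def score_trials(trials, embeddings):
    """trials: iterable of (label, utt1, utt2), where label is 1 for a
    same-speaker pair and 0 otherwise. embeddings maps utterance id to a
    NumPy vector from any speaker encoder (assumed, not provided here)."""
    labels, scores = [], []
    for label, utt1, utt2 in trials:
        labels.append(label)
        scores.append(cosine_score(embeddings[utt1], embeddings[utt2]))
    return np.array(labels), np.array(scores)
```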
A more challenging test set - VoxMovies
- Hard samples of VoxCeleb identities speaking in movies
- Playing characters, showing strong emotions, background noise
[Figure: paired VoxCeleb vs. VoxMovies clips illustrating accent change, background music and emotion]
A more challenging test set - VoxMovies
- Audio dataset, but we use visual methods to collect it (the VoxCeleb automatic pipeline)
[Figure: Steve Martin in VoxCeleb vs. VoxMovies]
Audio speaker diarization
- Solving “who spoke when” in multi-speaker video.
http://www.robots.ox.ac.uk/~vgg/data/voxconverse/
Diarization - The VoxConverse dataset
- 526 videos from YouTube
- Mostly debates, talk shows, news segments
VoxConverse – automatic audio-visual diarization method
Input video → face detection & face track clustering → active speaker detection → audio-visual source separation → speaker verification
Chung, Joon Son, et al. "Spot the conversation: speaker diarisation in the wild." INTERSPEECH (2020).
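A high-level sketch of how those stages fit together, under the recipe described in the citation above. Every helper name is a hypothetical placeholder for the corresponding component, not a real API:

```python
# Sketch of the audio-visual diarization pipeline (after Chung et al., 2020).
# All helper functions are hypothetical placeholders for the named stages.

def diarize(video):
    """Assign a speaker label to every speech region: 'who spoke when'."""
    tracks = cluster_face_tracks(detect_faces(video))      # face detection + track clustering
    segments = []
    for track in tracks:
        for region in active_speaker_regions(track, video.audio):  # when this face talks
            # Isolate this speaker's voice where speech overlaps.
            audio = separate_source(region, track, video)          # A/V source separation
            segments.append((region, speaker_embedding(audio)))    # for speaker verification
    # Link segments of the same voice across the video (e.g. off-screen speech).
    return assign_speaker_labels(segments)
```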
The VoxCeleb Speaker Recognition Challenge
VoxSRC-2020 tracks – TWO NEW tracks this year!
- Track 1: Supervised speaker verification (closed)
- Track 2: Supervised speaker verification (open)
- Track 3: Self-supervised speaker verification (closed)
- Track 4: Speaker diarization (open)
New Tracks
Track 3: Self-Supervised
- No speaker labels allowed
- Can use future frames, visual frames, or any other objective from the video itself (one label-free objective is sketched after this list)
Track 4: Speaker Diarization
- Solving “who spoke when” in multi-speaker video
- Speaker overlap, challenging background conditions
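For Track 3, one common label-free objective is contrastive learning: two segments cropped from the same clip form a positive pair, and every other pairing in the batch acts as a negative. This is an illustrative assumption, not the challenge's prescribed method; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """NT-Xent-style loss. emb_a[i] and emb_b[i] embed two segments drawn
    from the same clip (a positive pair); all other rows in the batch act
    as negatives. No speaker labels are required anywhere."""
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    logits = emb_a @ emb_b.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(emb_a.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Hypothetical usage, with `encoder` mapping waveform crops to embeddings:
#   loss = contrastive_loss(encoder(crop_a), encoder(crop_b))
```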
Mechanics
- Metrics (Tracks 1-3): DCF and EER, following NIST SRE 2018 (an EER sketch follows below)
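A minimal sketch of how EER falls out of the trial scores: sweep the decision threshold and find where the two error rates cross. minDCF comes from the same sweep, weighted with the NIST SRE 2018 costs; only EER is shown here to stay short.

```python
import numpy as np

def compute_eer(labels, scores):
    """Equal error rate: the operating point where the false-acceptance
    and false-rejection rates cross. labels: 1 = same-speaker trial,
    0 = different-speaker trial; scores: higher = more likely same."""
    order = np.argsort(scores)[::-1]          # accept trials from highest score down
    labels = np.asarray(labels)[order]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    far = np.cumsum(1 - labels) / n_non       # false-acceptance rate at each cut
    frr = 1 - np.cumsum(labels) / n_tar       # false-rejection rate at each cut
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2
```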
- Metrics (Track 4): DER and JER; overlapping speech is scored, with a collar of 0.25 s (see the sketch below)
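A sketch of DER scoring using pyannote.metrics, one common open-source scorer (an assumption for illustration, not necessarily the official challenge tooling). Note that pyannote's collar argument is the total width removed around each reference boundary, so 0.5 corresponds to 0.25 s on each side.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and system output; spk_B overlaps spk_A from 8 s to 10 s.
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(8.0, 15.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 9.5)] = "spk_1"
hypothesis[Segment(9.5, 15.0)] = "spk_2"

# skip_overlap=False: overlapping speech is scored, as in the challenge.
metric = DiarizationErrorRate(collar=0.5, skip_overlap=False)
print(f"DER = {metric(reference, hypothesis):.3f}")
```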
- Only 1 submission per day, 5 in total
- Submissions via CodaLab
- New, more difficult test sets
- Manual verification of all speech segments
- In addition, annotators pay particular