
Workshop Programme Introduction: VoxCeleb, VoxConverse & VoxSRC



  1. Workshop Programme
     19:00 Introduction: “VoxCeleb, VoxConverse & VoxSRC” – Arsha Nagrani, Joon Son Chung & Andrew Zisserman
     19:25 Keynote: “X-vectors: Neural Speech Embeddings for Speaker Recognition” – Daniel Garcia-Romero
     20:00 Speaker verification: leaderboards and winners for Tracks 1-3
     20:05 Participant talks from Tracks 1, 2 and 3, live Q&A
     20:50 Coffee break
     21:10 Keynote: “Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition” – Shinji Watanabe
     21:40 Diarization: leaderboards and winners for Track 4
     21:42 Participant talks from Track 4, live Q&A
     22:00 Wrap-up discussions and closing

  2. Organisers: Andrew Zisserman, Joon Son Chung, Arsha Nagrani, Jaesung Huh, Andrew Brown, Mitch McLaren, Doug Reynolds, Ernesto Coto, Weidi Xie

  3. Workshop Programme (repeat of Slide 1)

  4. Introduction: 1. Data: VoxCeleb and VoxConverse; 2. Challenge mechanics: new tracks, rules and metrics

  5. VoxCeleb datasets • Multi-speaker environments • Varying audio quality and background/channel noise • Freely available [Example clips: red carpet interviews, studio interviews, outdoor and pitch interviews]

  6. VoxCeleb – automatic pipeline: transferring labels from vision to speech (see the sketch below). [Pipeline diagram, illustrated with Felicity Jones: input video → face + landmark detection → face verification (match) → active speaker detection → VoxCeleb]
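The pipeline amounts to a filter over detected face tracks: keep an audio segment only if the face matches the target identity and that face is the one speaking. Below is a minimal sketch of this label-transfer idea; the face-track dicts, embeddings and threshold are hypothetical placeholders, not the authors' actual models or code.

    import numpy as np

    def cosine(a, b):
        # cosine similarity between two embedding vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def transfer_labels(face_tracks, celeb_embedding, face_thresh=0.7):
        """face_tracks: list of dicts with a face 'embedding', an
        'active_speaker' flag from audio-visual sync, and a time 'interval'."""
        segments = []
        for track in face_tracks:
            if cosine(track["embedding"], celeb_embedding) < face_thresh:
                continue  # face verification failed: not the target identity
            if not track["active_speaker"]:
                continue  # face visible but not currently speaking
            segments.append(track["interval"])  # vision label -> speech label
        return segments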

  7. VoxCeleb statistics • VoxCeleb2 dev set -> primary data for speaker verification • Validation toolkit for scoring
                     Train      Validation
      # Speakers     5,994      1,251
      # Utterances   1,092,009  153,516

  8. A more challenging test set – VoxMovies • Hard samples found from VoxCeleb identities speaking in movies • Playing characters, showing strong emotions, background noise [Paired audio examples, VoxCeleb vs. VoxMovies: accent change, background music, emotion]

  9. A more challenging test set – VoxMovies (repeat of Slide 8)

  10. A more challenging test set – VoxMovies • Audio dataset, but collected with visual methods (the VoxCeleb automatic pipeline) [Example, VoxCeleb vs. VoxMovies: Steve Martin]

  11. Audio speaker diarization • Solving “who spoke when” in multi-speaker video (see the sketch below). [Example video: Steve Martin]
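To make “who spoke when” concrete, the sketch below writes a hypothetical diarization output as time-stamped speaker segments in RTTM, the plain-text format commonly used to exchange diarization references and hypotheses; the file name, times and speaker labels are invented for illustration.

    # (speaker, start_sec, end_sec) -- made-up segments; note the overlap
    segments = [
        ("spk00", 0.00, 4.31),
        ("spk01", 4.10, 9.80),   # overlaps spk00: overlapping speech is allowed
        ("spk00", 9.80, 12.50),
    ]

    # one RTTM line per segment: type, file, channel, onset, duration,
    # then placeholder fields and the speaker label
    with open("hypothesis.rttm", "w") as f:
        for spk, start, end in segments:
            f.write(f"SPEAKER video01 1 {start:.2f} {end - start:.2f} "
                    f"<NA> <NA> {spk} <NA> <NA>\n")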

  12. Diarization - The VoxConverse dataset ● 526 videos from YouTube ● Mostly debates, talk shows, news segments http://www.robots.ox.ac.uk/~vgg/data/voxconverse/

  13. Automatic audio-visual diarization method [Pipeline diagram: input video → face detection & face-track clustering → active speaker detection → audio-visual source separation → speaker verification → VoxConverse] Chung, Joon Son, et al. “Spot the conversation: speaker diarisation in the wild.” INTERSPEECH (2020).

  14. The VoxCeleb Speaker Recognition Challenge

  15. VoxSRC-2020 tracks (TWO NEW tracks this year!) • Track 1: Supervised speaker verification (closed) • Track 2: Supervised speaker verification (open) • Track 3: Self-supervised speaker verification (closed) • Track 4: Speaker diarization (open)

  16. New tracks • Track 3: Self-supervised – no speaker labels allowed; can use future frames, visual frames, or any other objective from the video itself (one possible label-free objective is sketched below) • Track 4: Speaker diarization – solving “who spoke when” in multi-speaker video; speaker overlap, challenging background conditions
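One common label-free objective, shown as a sketch below (an assumption about what a Track 3 entry might do, not the challenge's prescribed method): draw two clips from the same video as a positive pair, treat clips from other videos as negatives, and minimise a contrastive loss over the embeddings. The array shapes and temperature are illustrative.

    import numpy as np

    def contrastive_loss(emb_a, emb_b, temperature=0.1):
        """emb_a[i] and emb_b[i] are embeddings of two clips from the same video."""
        emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
        emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
        logits = emb_a @ emb_b.T / temperature          # all pairwise similarities
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))              # pull matching clips together

    rng = np.random.default_rng(0)
    loss = contrastive_loss(rng.normal(size=(8, 192)), rng.normal(size=(8, 192)))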

  17. Mechanics • Metrics (Tracks 1-3): DCF, EER, following NIST-SRE 2018 (sketched below) • Metrics (Track 4): DER, JER; overlapping speech counted, collar of 0.25 s • Only 1 submission per day, 5 in total • Submissions via CodaLab
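For concreteness, here is a back-of-the-envelope sketch of the Track 1-3 metrics: the equal error rate (EER), where miss and false-alarm rates cross, and a NIST-SRE-style normalised minimum detection cost. The cost parameters below (p_target, c_miss, c_fa) are illustrative; the actual values should be taken from the challenge evaluation plan.

    import numpy as np

    def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
        """scores: similarity per trial; labels: 1 = same speaker, 0 = different."""
        order = np.argsort(scores)[::-1]            # sweep the threshold high -> low
        labels = np.asarray(labels)[order]
        n_tar, n_non = labels.sum(), (1 - labels).sum()
        fnr = 1 - np.cumsum(labels) / n_tar         # miss rate at each threshold
        fpr = np.cumsum(1 - labels) / n_non         # false-alarm rate
        eer = fnr[np.argmin(np.abs(fnr - fpr))]     # point where the two rates cross
        dcf = c_miss * p_target * fnr + c_fa * (1 - p_target) * fpr
        norm = min(c_miss * p_target, c_fa * (1 - p_target))
        return float(eer), float(dcf.min() / norm)

    scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
    labels = np.array([1, 1, 0, 1, 0, 0])
    print(eer_and_min_dcf(scores, labels))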

  18. Test sets • New, more difficult test sets • Manual verification of all speech segments • In addition, annotators pay particular attention to examples whose speaker embeddings are far from their cluster centres (sketched below)
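A minimal sketch of that outlier check, assuming a matrix of speaker embeddings for the segments attributed to one identity: segments whose embedding lies far from the identity's centroid are flagged for manual re-checking. The 95th-percentile cutoff is an arbitrary illustrative choice.

    import numpy as np

    def flag_outliers(embeddings, percentile=95):
        """embeddings: (n_segments, dim) for one identity; returns suspect indices."""
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        centroid = unit.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        dist = 1.0 - unit @ centroid                # cosine distance to the centre
        return np.where(dist > np.percentile(dist, percentile))[0]

    rng = np.random.default_rng(1)
    print(flag_outliers(rng.normal(size=(100, 192))))   # indices to re-annotate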
