LipNet: End-to-End Sentence-level Lipreading, by Yannis Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas (PowerPoint presentation)




SLIDE 1

LipNet

End-to-End Sentence-level Lipreading

Yannis Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas

NVIDIA GTC San Jose 2017

SLIDE 2

Outline

  • 1. Introduction
  • 2. Background
  • 3. LipNet
  • 4. Analysis
SLIDE 3

How easy do you think lipreading is?

  • McGurk effect (McGurk & MacDonald, 1976)
  • Phonemes and Visemes (Fisher, 1968)
  • Human lipreading performance is poor

We can improve it…

  • 1. Introduction
SLIDE 4

https://goo.gl/hyFBVQ

  • 1. Introduction
SLIDE 5

Why is lipreading important?

Among others:

  • Improved hearing aids
  • Speech recognition in noisy environments (e.g. cars)
  • Silent dictation in public spaces
  • Security
  • Biometric identification
  • Silent-movie processing

  • 1. Introduction
SLIDE 6

https://goo.gl/RTXh9Q

  • 1. Introduction
SLIDE 7

Automated lipreading

  • Most existing work does not employ deep learning
  • Heavy preprocessing
  • Open problems:
      • generalisation across speakers
      • extraction of motion features

  • 1. Introduction
SLIDE 8

End-to-end supervised learning using NNs

  • 1. Hierarchical, expressive, differentiable function
  • 2. Adjust parameters to maximise probability of data with gradient descent

[Figure: input → Layer 1 → Layer 2 → … → Layer L → predictive distribution]

  • 2. Background
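The two points above can be sketched in a few lines. This is an illustrative toy (not the talk's code): a tiny differentiable model whose parameters are adjusted by gradient descent to maximise the probability of the data, i.e. to minimise the negative log-likelihood.

```python
import numpy as np

# Toy end-to-end supervised learning: logistic regression trained by
# gradient descent on the negative log-likelihood of the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # binary labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predictive distribution
    grad_w = X.T @ (p - y) / len(y)           # gradient of mean NLL w.r.t. w
    grad_b = np.mean(p - y)                   # gradient w.r.t. b
    w -= lr * grad_w                          # gradient descent step
    b -= lr * grad_b

accuracy = np.mean((p > 0.5) == y)
```

The same recipe scales up: LipNet replaces the single logistic layer with a deep stack of differentiable layers, but the training principle is identical.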

SLIDE 9

Convolutional Neural Networks

  • Model: deep stacks of local operations
  • Good for: relationships over space (2D)
  • Also good for time (1D)
  • Or, in our case, space & time (3D): every layer can model either or both, letting the optimisation decide what's best

  • 2. Background

deeplearning.net
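To make the "space & time (3D)" point concrete, here is a minimal sketch of a single spatiotemporal convolution over a video volume. The shapes are illustrative assumptions, not LipNet's actual layer configuration.

```python
import numpy as np

# A single-channel "valid" 3D convolution over a (T, H, W) video volume.
def conv3d_valid(video, kernel):
    t, h, w = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value pools a local t x h x w patch, so it can
                # capture motion (time) and appearance (space) jointly.
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

video = np.random.rand(75, 50, 100)   # e.g. 75 frames of 50x100 mouth crops
kernel = np.random.rand(3, 5, 5)      # spatiotemporal filter: 3 frames, 5x5 pixels
features = conv3d_valid(video, kernel)
```

Setting the kernel's time extent to 1 recovers a purely spatial (2D) convolution; the optimisation is free to use the temporal dimension or not.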

SLIDE 10

Recurrent Neural Networks

  • Model: carry information over time using a state
  • Good for: sequences
  • Often used to predict classes at each timestep
  • But what if inputs/outputs are unequal length, or aren't aligned?

  • 2. Background
SLIDE 11

Recurrent Neural Networks

  • If inputs/outputs aren't aligned, CTC (Graves, 2006) efficiently marginalises over all alignments
  • To do this, let the RNN output blanks or duplicates
  • Sum over every way to output the same sequence:

p(am) = p(aam) + p(amm) + p(_am) + p(a_m) + p(am_)

  • 2. Background
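The sum above can be reproduced by brute force in a toy setting (this enumerates alignments explicitly; a real CTC implementation marginalises efficiently with dynamic programming): collapse every length-3 alignment over {a, m, _} by merging consecutive duplicates and then dropping blanks, and collect those that yield "am".

```python
import itertools

# CTC-style collapse: merge consecutive repeats, then remove blanks ("_").
def collapse(alignment):
    out = []
    for ch in alignment:
        if out and out[-1] == ch:
            continue            # merge consecutive duplicates
        out.append(ch)
    return "".join(c for c in out if c != "_")

alignments = ["".join(t) for t in itertools.product("am_", repeat=3)]
matches = [a for a in alignments if collapse(a) == "am"]
# matches contains exactly the five alignments on the slide:
# aam, amm, _am, a_m, am_
```

p(am) is then the sum of the model's probabilities over exactly these five alignments, which is what CTC computes without ever enumerating them.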
SLIDE 12

LipNet

  • Monosyllabic vs compound words (Easton & Basala, 1982)
  • Spatiotemporal features
  • End-to-end, sentence-level
  • GRID corpus: 33,000 sentences

  • 3. LipNet
SLIDE 13

GRID corpus

  • 3. LipNet
SLIDE 14

Preprocessing

  • Facial landmarks
  • Crop the mouth
  • Affine-transform the frames
  • Smooth using a Kalman filter
  • Temporal augmentation

  • 3. LipNet
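The Kalman-smoothing step can be sketched as follows. This is a hedged illustration under assumed settings (a 1D random-walk state model applied to one landmark coordinate across frames), not the authors' actual pipeline.

```python
import numpy as np

# Smooth one noisy landmark coordinate across frames with a 1D Kalman filter.
def kalman_smooth(z, process_var=1e-2, meas_var=1.0):
    x, p = z[0], 1.0           # state estimate and its variance
    out = [x]
    for zt in z[1:]:
        p = p + process_var     # predict: state drifts with small variance
        k = p / (p + meas_var)  # Kalman gain
        x = x + k * (zt - x)    # update with the new measurement
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

t = np.linspace(0, 1, 75)
true_track = 50 + 5 * np.sin(2 * np.pi * t)   # smooth mouth motion over 75 frames
noisy = true_track + np.random.default_rng(0).normal(0, 1.0, 75)
smoothed = kalman_smooth(noisy)               # less frame-to-frame jitter
```

Smoothing the landmark tracks keeps the mouth crop stable from frame to frame, so the network sees lip motion rather than detector jitter.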
SLIDE 15

Model Architecture

  • 3. LipNet
SLIDE 16

Baselines

  • Hearing-Impaired People

3 students from the Oxford Students’ Disability Community

  • Baseline-LSTM

Replicates the previous state-of-the-art architecture (Wand et al., 2016)

  • Baseline-2D

Spatial-only convolutions

  • Baseline-NoLM

Language model disabled

  • 3. LipNet
SLIDE 17

Lipreading Performance

                      Unseen Speakers       Overlapped Speakers
                      CER      WER          CER      WER
  Hearing Impaired     –       47.7%         –        –
  Baseline-LSTM       38.4%    52.8%        15.2%    26.3%
  Baseline-2D         16.2%    26.7%         4.3%    11.6%
  Baseline-NoLM        6.7%    13.6%         2.0%     5.6%
  LipNet               6.4%    11.4%         1.9%     4.8%

  • 3. LipNet
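CER and WER in the table are edit-distance-based error rates. As an illustration (the exact evaluation script is not shown in the slides), WER is the Levenshtein distance between word sequences, normalised by the reference length; CER is the same computation over characters.

```python
# Levenshtein edit distance between two sequences.
def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(ref_sentence, hyp_sentence):
    ref, hyp = ref_sentence.split(), hyp_sentence.split()
    return edit_distance(ref, hyp) / len(ref)

# One substituted word in a 6-word GRID-style sentence:
score = wer("bin blue at f two now", "bin blue at s two now")
# score == 1/6
```

So LipNet's 11.4% unseen-speaker WER means roughly one word in nine is inserted, deleted, or substituted relative to the reference transcript.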
SLIDE 18

Learned Representations

  • 4. Analysis
SLIDE 19

Viseme Confusions

  • 4. Analysis
SLIDE 20

Thank you!

SLIDE 21

Thank you NVIDIA!

DGX-1