Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition



slide-1
SLIDE 1

Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition

Junfu Pu, Wengang Zhou, Houqiang Li

CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, EEIS Department, University of Science and Technology of China
pjh@mail.ustc.edu.cn, zhwg@ustc.edu.cn, lihq@ustc.edu.cn
July 2018

slide-2
SLIDE 2

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-3
SLIDE 3

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-4
SLIDE 4

Background

• What is Sign Language?

◼ A language used primarily by deaf people to communicate
◼ Conveys meaning through several channels, such as the hands and face

• Why Sign Language?

◼ More than 20 million people with hearing impairment
◼ Algorithms applicable to human-machine interaction
◼ Social impact: AI techniques can improve quality of life for people with disabilities

slide-5
SLIDE 5

Background

[Diagram: the real-world problem is communication difficulty caused by hearing and language impairment; the research topic is a recognition (translation) system that takes a sign video as input and produces translated text as its result.]

slide-6
SLIDE 6

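Background

• Problem Formulation

➢ Continuous SLR: the input is a sign video $\mathbf{V}$ and the output is a sentence $\mathbf{s}$ of signs from the vocabulary $\mathcal{V}$ (example output: MOEGLICH HEUTE NACHT FROST GLATT VORSICHT FLUSS MOEGLICH PLUS ACHT).

$$\mathbf{s} = \{s_i\}_{t=1}^{T}, \qquad \{\, s_i \in \mathcal{V} \mid i = 1, 2, \dots, K \,\}$$

$$\hat{\mathbf{s}} = \arg\max_{\mathbf{s} \in \mathbf{s}^{*}} \; p(\mathbf{s} \mid \mathbf{V})$$

➢ Isolated SLR: the input is a sign video $\mathbf{V}$ and the output is a single class (example output: Democracy).

$$\hat{c} = \arg\max_{i} \; p(c_i \mid \mathbf{V}), \quad i = 1, 2, \dots, K$$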

slide-7
SLIDE 7

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-8
SLIDE 8

Contribution

• Develop a new framework based on a 3D residual network and dilated convolutions for continuous sign language recognition
• Propose an iterative optimization strategy with Connectionist Temporal Classification (CTC) for the sign language recognition system
• Outperform state-of-the-art methods on the RWTH-PHOENIX-Weather dataset

slide-9
SLIDE 9

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-10
SLIDE 10

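Proposed Architecture

• Overall Framework

➢ Visual Feature Extractor: 3D-ResNet
➢ Sequence Learning Model: Dilated Conv. Net with CTC

$$\mathbf{X} = \{x_t\}_{t=1}^{T}$$

$$\mathbf{V}_N = \{v_t\}_{t=1}^{N}$$

$$\mathbf{F}_N = \{\Phi_{\Theta}(v_t)\}_{t=1}^{N}$$

$$z = \tanh\big(\mathcal{C}_d(h_t^{(i-1)})\big) \odot \sigma\big(\mathcal{C}_d(h_t^{(i-1)})\big)$$

$$o_t^{(i)} = \tanh\big(\mathcal{C}_{1 \times 1}(z)\big)$$

$$h_t^{(i)} = h_t^{(i-1)} + o_t^{(i)}$$

$$o_t = \sum_{i}^{\text{all-blocks}} o_t^{(i)}$$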

slide-11
SLIDE 11

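Proposed Architecture

• 3D ResNet

$$\mathbf{X} = \{x_t\}_{t=1}^{T}$$

$$\mathbf{V}_N = \{v_t\}_{t=1}^{N}$$

$$\mathbf{F}_N = \{\Phi_{\Theta}(v_t)\}_{t=1}^{N}$$

• Dilated Cell

$$z = \tanh\big(\mathcal{C}_d(h_t^{(i-1)})\big) \odot \sigma\big(\mathcal{C}_d(h_t^{(i-1)})\big)$$

$$o_t^{(i)} = \tanh\big(\mathcal{C}_{1 \times 1}(z)\big)$$

$$h_t^{(i)} = h_t^{(i-1)} + o_t^{(i)}$$

$$o_t = \sum_{i}^{\text{all-blocks}} o_t^{(i)}$$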
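The dilated-cell equations on this slide can be sketched in plain Python. This is a minimal single-channel illustration and not the authors' implementation. The kernel weights (`w_filter`, `w_gate`, `w_out`), the zero padding, and the use of separate filter and gate kernels are assumptions made for the example; the slide writes both convolutions as C_d.

```python
import math


def dilated_conv1d(seq, weights, dilation):
    """Same-length 1D dilated convolution over a scalar sequence (zero padding)."""
    center = len(weights) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - center) * dilation  # taps are `dilation` steps apart
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dilated_cell(h_prev, w_filter, w_gate, w_out, dilation):
    """One gated cell: returns (h_i, o_i) for every time step."""
    f = dilated_conv1d(h_prev, w_filter, dilation)
    g = dilated_conv1d(h_prev, w_gate, dilation)
    # z = tanh(C_d(h)) * sigma(C_d(h)), element-wise
    z = [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]
    # With one channel, the 1x1 convolution reduces to a scalar multiply.
    o = [math.tanh(w_out * v) for v in z]
    # Residual update: h_i = h_{i-1} + o_i
    h = [hp + ov for hp, ov in zip(h_prev, o)]
    return h, o


def dilated_stack(h0, cells):
    """Run the cells in order and sum their outputs o_i (the skip sum)."""
    h = list(h0)
    total = [0.0] * len(h0)
    for w_filter, w_gate, w_out, dilation in cells:
        h, o = dilated_cell(h, w_filter, w_gate, w_out, dilation)
        total = [a + b for a, b in zip(total, o)]
    return h, total


if __name__ == "__main__":
    h0 = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
    # Hypothetical weights; dilations follow the doubling pattern 1, 2, 4.
    cells = [([0.2, 0.5, -0.3], [0.1, -0.4, 0.6], 0.8, d) for d in (1, 2, 4)]
    h, o_sum = dilated_stack(h0, cells)
    print(h)
    print(o_sum)
```

Because every cell adds its output to the running state, the final state equals the input plus the summed cell outputs, which is a convenient sanity check on the residual path.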

slide-12
SLIDE 12

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-13
SLIDE 13

Iterative Optimization

➢ Step 1: Optimize the dilated convolutional network with the CTC loss and generate pseudo labels.

$$\mathcal{L}_{\text{CTC}} = -\ln p(\mathbf{s} \mid \mathbf{V})$$

$$\ell_i = \arg\max_{j} P_{i,j}$$

➢ Step 2: Fine-tune the 3D-ResNet with a category loss using the pseudo labels.
➢ Step 3: Extract improved C3D features for sequence learning.

Alternate between Step 1 and Step 2 until convergence.
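The two formulas in Step 1 can be illustrated with a small pure-Python sketch. It computes the CTC loss with the standard forward algorithm and takes pseudo labels as the row-wise argmax of a probability matrix. The blank index of 0, the layout of `probs` (one row of class probabilities per time step), and working in the probability domain rather than the log domain are assumptions for readability; a practical implementation would use log-space arithmetic to avoid underflow on long sequences.

```python
import math


def ctc_loss(probs, target, blank=0):
    """Return -ln p(target | probs) via the CTC forward algorithm.

    probs[t][c] is the probability of class c at time step t.
    target is a list of class indices without blanks.
    """
    # Extended label sequence: blank, s_1, blank, s_2, ..., blank
    ext = [blank]
    for s in target:
        ext += [s, blank]
    n_steps, n_states = len(probs), len(ext)

    alpha = [0.0] * n_states
    alpha[0] = probs[0][blank]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, n_steps):
        new = [0.0] * n_states
        for k in range(n_states):
            total = alpha[k]
            if k >= 1:
                total += alpha[k - 1]
            # Skipping a blank is allowed only between two different labels.
            if k >= 2 and ext[k] != blank and ext[k] != ext[k - 2]:
                total += alpha[k - 2]
            new[k] = total * probs[t][ext[k]]
        alpha = new

    p = alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)
    return math.inf if p == 0.0 else -math.log(p)


def pseudo_labels(P):
    """l_i = argmax_j P[i][j]: the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in P]


if __name__ == "__main__":
    # Two time steps, two classes (blank and one sign), uniform probabilities.
    # Three of the four length-2 paths collapse to [1], so p = 0.75.
    uniform = [[0.5, 0.5], [0.5, 0.5]]
    print(ctc_loss(uniform, [1]))
    print(pseudo_labels([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]))
```

In the alternating scheme described on this slide, labels produced this way would serve as per-clip targets when fine-tuning the feature extractor; how the slide's matrix P is obtained is not specified here.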

slide-14
SLIDE 14

Outline

• Background
• Contribution
• Proposed Architecture
• Iterative Optimization
• Experimental Results
• Conclusions

slide-15
SLIDE 15

Experiments

• Dataset and Evaluation

◼ Continuous SLR Dataset: RWTH-PHOENIX-Weather
◼ Evaluation Metric: Word Error Rate (WER)
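WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. A minimal sketch follows; the function name and the whitespace tokenization are choices made for this example, not details from the slides.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "MOEGLICH HEUTE NACHT FROST"
    hyp = "MOEGLICH NACHT FROST GLATT"
    # One deletion (HEUTE) and one insertion (GLATT) over 4 reference words.
    print(word_error_rate(ref, hyp))  # 0.5
```

Since insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.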

• 3D-ResNet Setups and Initialization

◼ Image crops: 224 × 224
◼ Sliding window: length 8, step 4 (50% overlap)
◼ Pre-trained on an isolated Chinese SLR dataset
◼ Batch size 5, learning rate 0.001, weight decay $5 \times 10^{-5}$
◼ Pooling-5b activations for clip representation
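The sliding-window setting (length 8, step 4) determines how a video of T frames is cut into overlapping clips. A minimal sketch, assuming that trailing frames which do not fill a complete window are dropped; the slides do not say how the tail is handled.

```python
def sliding_clips(num_frames, length=8, step=4):
    """Return (start, end) frame index pairs for each clip, end exclusive."""
    return [(s, s + length) for s in range(0, num_frames - length + 1, step)]


if __name__ == "__main__":
    # A 20-frame video yields 4 clips; consecutive clips share 4 frames.
    print(sliding_clips(20))  # [(0, 8), (4, 12), (8, 16), (12, 20)]
```

Under this assumption the number of clips is N = floor((T - 8) / 4) + 1 for T >= 8, and a video shorter than one window produces no clips.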

• Dilated Convolutional Network Setups

◼ Dilations for each layer: 1, 2, 4, 8, 16
◼ Size of blocks: 5
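Doubling the dilation at each layer makes the temporal receptive field grow quickly with depth. For stacked stride-1 convolutions the receptive field is 1 plus the sum of (kernel size - 1) × dilation over the layers. The sketch below applies that formula to the listed dilations; the kernel size is not given on the slide, so the value 3 is purely an assumption for illustration.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in time steps) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    dilations = [1, 2, 4, 8, 16]
    # With the assumed kernel size of 3: 1 + 2 * (1 + 2 + 4 + 8 + 16) = 63
    print(receptive_field(dilations))
```

Each time step here corresponds to one clip feature, so the span in video frames also depends on the sliding-window length and step.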


slide-16
SLIDE 16

Experimental Results

• Iterative Results

• Comparison

slide-17
SLIDE 17

Experimental Results

• An example of iterative optimization

slide-18
SLIDE 18

Conclusions

• A novel framework with dilated convolutions for continuous sign language recognition.
• An iterative optimization strategy that trains the proposed architecture by generating pseudo labels.
• Performs well in both accuracy and speed.

slide-19
SLIDE 19