Attention-based Model and Its Application in Scene Text Recognition



SLIDE 1

Attention-based Model and Its Application in Scene Text Recognition

Baoguang Shi (石葆光) Huazhong University of Science and Technology March 23, 2016

SLIDE 2

Part 1: Introduction to Attention-Based Models

SLIDE 3

Introduction

  • The problem to solve: predicting a sequence given an input, with deep neural networks.

  • Input can be: image, speech, sentence, etc.
  • Why does it matter?
  • Speech recognition: speech signal sequence => transcription sequence
  • Image captioning: image => word sequence
  • Machine translation: word sequence (in one language) => word sequence (in another)

SLIDE 4

Main Difficulties

  • Outputs are variable-length sequences
  • Inputs may also have an unfixed number of dimensions

SLIDE 5

Attention-based models [1-2]

  • An encoder-decoder framework
  • At each step,
  • Select relevant contents in the representation (attend)
  • Generate a token (see the sketch below)

[1] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
[2] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015).


[Diagram: input → Encoder (RNN, CNN, etc.) → representation → Decoder (RNN) → output sequence]
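
Read as a program, the framework is: encode the input once, then loop, attending and emitting one token per step until an end-of-sequence symbol. A minimal Python sketch of that control flow follows; every component is a toy stand-in (the uniform weights and the emission rule are assumptions for illustration, not a real model):

```python
import numpy as np

VOCAB = ["S", "A", "L", "E", "<EOS>"]

def encode(inputs):
    """Toy encoder: stands in for the RNN/CNN producing the representation."""
    return np.asarray(inputs, dtype=float)          # shape (U, dim)

def decode_step(h, step):
    """Toy decoder step: attend over h, then generate one token."""
    weights = np.ones(len(h)) / len(h)              # attend (uniform here; learned in practice)
    glimpse = weights @ h                           # a real decoder conditions the token on this
    return VOCAB[min(step, len(VOCAB) - 1)]         # toy emission rule, for illustration only

h = encode(np.random.randn(7, 16))                  # representation of a length-7 input
tokens, step = [], 0
while True:                                         # one token per step
    token = decode_step(h, step)
    tokens.append(token)
    step += 1
    if token == "<EOS>":
        break
print(tokens)                                       # ['S', 'A', 'L', 'E', '<EOS>']
```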

SLIDE 6

First Look

  • ℎ: input sequence
  • 𝑧: output sequence


[1] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015)

SLIDE 7

Key: Attention Weights

  • At each step,
  • Calculate a vector of attention weights (non-negative, sum to 1)
  • Linearly combine the input vectors into a glimpse vector
  • Convert variable-length inputs into a fixed-dimensional vector (sketched in code below)


[Diagram: input contents h1 … hU are weighted by attention, summed into a fixed-dim glimpse vector, and fed to the decoder RNN, which emits the output sequence]
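
The three bullets above fit in a few lines of NumPy. In this sketch the scores are random stand-ins for what a trained model computes; the point is that the softmax makes the weights non-negative and sum to 1, and the weighted combination has a fixed size however long the input is:

```python
import numpy as np

U, dim = 7, 16                        # input length and feature size (made up)
h = np.random.randn(U, dim)           # input contents h1 ... hU
scores = np.random.randn(U)           # relevance scores (learned in a real model)

alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights (softmax)
assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)

glimpse = alpha @ h                   # linear combination of the input vectors
print(glimpse.shape)                  # (dim,) -- fixed size regardless of U
```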

SLIDE 8

Different Weights at Every Step


  • Attend to different contents

[Diagram: attention weights at step u = 1 over input contents h1 … hU]

SLIDE 9

Different Weights at Every Step


  • Attend to different contents

[Diagram: attention weights at step u = 2 over input contents h1 … hU]

SLIDE 10

Different Weights at Every Step

  • Attend to different contents


[Diagram: attention weights at step u = 3 over input contents h1 … hU; the decoder emits <EOS> to terminate]

SLIDE 11

Detailed Architecture

SLIDE 12

The Attention Mechanism

  • Allows us to predict a sequence from input contents.
  • Allows the model to be trained end-to-end.

SLIDE 13

What do attention weights tell us?

  • Indicate the importance of inputs for each output token
  • Provide soft alignment between inputs and outputs

SLIDE 14

Attention weights (2D)


[1] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015

[Figure: 2D attention maps highlighting image regions for caption words such as “woman”, “park”, and “Frisbee” [1]]

SLIDE 15

Part 2: Attention-based Models for Scene Text Recognition

SLIDE 16

Scene Text Recognition

  • A problem of image to sequence learning


[Example scene text images: “Tango”, “ATM”, “Hotel”, “BLACK”]

SLIDE 17

Traditional Approaches

  • Character detection
  • Character recognition
  • Generate word from characters

SLIDE 18

Our Previous Work: CRNN

  • Convolutional Recurrent Neural Network (a minimal sketch follows below)
  • Convolutional layers
  • Bidirectional LSTM
  • CTC layer
  • Code & model released at https://github.com/bgshih/crnn
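
For orientation, below is a minimal PyTorch sketch of a CRNN-style stack (convolutional features, map-to-sequence, bidirectional LSTM, per-column character logits for CTC). All layer sizes and the class count are illustrative assumptions; the released Torch7 code at the link above is the authoritative implementation:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """CRNN-style sketch: conv features -> sequence -> BiLSTM -> CTC logits."""
    def __init__(self, num_classes=37):         # 26 letters + 10 digits + blank (assumption)
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=128 * 8, hidden_size=256,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)

    def forward(self, images):                   # images: (N, 1, 32, W)
        f = self.cnn(images)                     # (N, 128, 8, W // 4)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)   # map-to-sequence
        out, _ = self.rnn(seq)                   # context over the column sequence
        return self.fc(out)                      # per-column logits, fed to a CTC loss

logits = TinyCRNN()(torch.randn(2, 1, 32, 100))
print(logits.shape)                              # torch.Size([2, 25, 37])
```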

SLIDE 19

Our approach

  • Scheme: sequence-to-sequence learning
  • From a sequence-based image representation to a character sequence
  • Encoder: Convolutional layers + LSTM, extracts the sequence-based representation
  • Decoder: Attention-based RNN, generates the character sequence


[Example: the word image “SALE” decodes to [‘S’, ‘A’, ‘L’, ‘E’, <EOS>]]

SLIDE 20

Encoder

  • Extract a sequence-based representation of the image
  • Structure: Convolutional layers + Bidirectional-LSTM
  • Convolutional layers extract feature maps of size 𝐷 × 𝐼 × 𝑋
  • Split the feature maps along columns into 𝑋 vectors of 𝐷𝐼 dimensions (map-to-sequence conversion, sketched below)
  • A Bidirectional-LSTM models the context within the sequence
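
A NumPy sketch of the map-to-sequence conversion just described; the concrete sizes D = 128, I = 4, X = 25 are made-up assumptions:

```python
import numpy as np

D, I, X = 128, 4, 25                  # channels, height, width (illustrative)
feature_maps = np.random.randn(D, I, X)

# Split along columns: column x becomes one D*I-dimensional vector.
sequence = [feature_maps[:, :, x].reshape(D * I) for x in range(X)]

print(len(sequence), sequence[0].shape)          # X vectors, each of D*I dims
```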

SLIDE 21

Decoder

  • Attention-based RNN, whose cells are Gated Recurrent Units (GRU); a sketch follows below


[Diagram: at each step, attention selects relevant contents from the encoder outputs]
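
A hedged PyTorch sketch of such a decoder: one GRU cell driven by attention glimpses, decoded greedily. The additive-style scoring function, the sizes, and the omission of the previously emitted character as an input are simplifying assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

enc_dim, hid_dim, num_classes = 256, 256, 38     # 37 characters + <EOS> (assumption)
EOS = 37

attn_score = nn.Linear(enc_dim + hid_dim, 1)     # scores each input against the state
cell = nn.GRUCell(enc_dim, hid_dim)
readout = nn.Linear(hid_dim, num_classes)

def decode(encoder_outputs, max_steps=25):
    """Greedy decoding over encoder_outputs of shape (U, enc_dim)."""
    U = encoder_outputs.size(0)
    h = torch.zeros(hid_dim)
    tokens = []
    for _ in range(max_steps):
        pairs = torch.cat([encoder_outputs, h.expand(U, hid_dim)], dim=1)
        alpha = torch.softmax(attn_score(pairs).squeeze(1), dim=0)  # attention weights
        glimpse = alpha @ encoder_outputs        # relevant contents, fixed size
        h = cell(glimpse.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        token = readout(h).argmax().item()       # generate one character
        tokens.append(token)
        if token == EOS:                         # stop at end-of-sequence
            break
    return tokens

print(decode(torch.randn(30, enc_dim)))          # character indices (untrained)
```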

SLIDE 22

Sequence Recognition Network: The Whole Structure

  • Components
  • Convolutional layer
  • Bidirectional-LSTM
  • Attention-based decoder

SLIDE 23

However…

  • This scheme does not work well on irregular text

SLIDE 24

Rectification + Recognition

  • Rectifying images using a Spatial Transformer Network (STN) [1].
  • Recognizing rectified images using the network mentioned above (SRN).
  • STN and SRN are trained jointly.


[1] Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu: Spatial Transformer Networks. CoRR abs/1506.02025 (2015)

SLIDE 25

Rectification with STN

  • Given an input image,
  • regress the locations of 20 fiducial points on the input image
  • calculate a TPS transformation
  • transform the input image (see the sketch below)
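
To make the rectification step concrete, here is a hedged PyTorch sketch of a spatial transformer. For brevity it regresses a 6-parameter affine transform instead of the 20-fiducial-point TPS transformation used in the paper, and the localization network is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Simplified STN: affine instead of the paper's TPS (illustration only)."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(                # localization network (assumption)
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 4 * 4, 6),
        )
        # Start from the identity transform so early training is stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                        # x: (N, 1, H, W)
        theta = self.loc(x).view(-1, 2, 3)       # regress transform parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # rectified image

print(AffineSTN()(torch.randn(2, 1, 32, 100)).shape)
```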

SLIDE 26

End-to-end trainable

  • No need to label the fiducial points manually; the STN learns them by itself

SLIDE 27

Performance

  • Significant improvement on datasets that focus on irregular text

Datasets: SVT-Perspective (perspective text), CUTE80 (curved text)

SLIDE 28

Performance

  • State-of-the-art, or highly competitive, results on general text recognition datasets

SLIDE 29

Some Results

SLIDE 30

Recognition & Character Localization

  • Row 𝑢 is the vector of attention weights at step 𝑢 (see the sketch below)


[Figure: attention maps and recognized words: “door”, “billiards”, “hertz”, “restaurant”, “central”, “everest”]
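
Because each row of attention weights peaks near the image region the decoder looked at, character positions fall out of the weight matrix for free. A tiny NumPy sketch with made-up numbers:

```python
import numpy as np

T, X = 4, 20                          # decoding steps, image columns (made up)
alpha = np.random.rand(T, X)          # toy attention-weight matrix
alpha /= alpha.sum(axis=1, keepdims=True)   # each row sums to 1

char_columns = alpha.argmax(axis=1)   # peak of row u localizes character u
print(char_columns)                   # approximate column of each character
```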

SLIDE 31

Advantages of the Proposed Model

  • Globally trainable learning system
  • Learning representation from data
  • End-to-end trainable
  • Handles images of arbitrary sizes, and text of arbitrary length
  • The encoder accepts images of arbitrary widths
  • For the decoder, both input and output sequences can have arbitrary lengths
  • Robust to irregular text

SLIDE 32

Takeaways

  • Attention-based models predict sequences given input images/speech/sentences/etc.
  • Attention weights provide a soft alignment between inputs and outputs
  • The rectification + recognition scheme is effective for scene text recognition

SLIDE 33

Thanks!

  • Paper: Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, Xiang Bai: Robust Scene Text Recognition with Automatic Rectification. Accepted to CVPR 2016.

  • Preprint available at http://arxiv.org/abs/1603.03915
  • CRNN code & model: https://github.com/bgshih/crnn
