Attention-based Model and Its Application in Scene Text Recognition
Baoguang Shi (石葆光) Huazhong University of Science and Technology March 23, 2016
Part 1: Introduction to Attention-Based Models
3/23/2016 VALSE Panel 2
deep neural networks.
another)
[1] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
[2] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015)
Figure: encoder-decoder pipeline. Input → Encoder (RNN, CNN, etc.) → Representation → Decoder (RNN) → Output sequence
[1] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. CoRR abs/1506.07503 (2015)
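Figure: attention over the input contents h_1, ..., h_U. The attention weights are multiplied with the input contents, and the result, a fixed-dim vector, is fed to the RNN.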
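The diagram above can be read as a soft selection over the input contents. The following is a minimal sketch in plain Python, assuming a dot-product score between a decoder state and each h_u. The scoring function and the names `attend`, `state`, and `contents` are assumptions made for illustration; the cited papers compute the scores with a small learned network.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(state, contents):
    """Return (weights, context) for one decoding step.

    state    : decoder state, a list of floats (hypothetical name)
    contents : input contents h_1..h_U, each a list of floats
    """
    # Assumption: dot-product scoring; a learned scoring network is also common.
    scores = [sum(s * h for s, h in zip(state, h_u)) for h_u in contents]
    weights = softmax(scores)
    dim = len(contents[0])
    # The weighted sum yields one fixed-dimensional vector,
    # no matter how many input positions U there are.
    context = [sum(w * h_u[d] for w, h_u in zip(weights, contents))
               for d in range(dim)]
    return weights, context

contents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], contents)
```

In this toy input, the first and third contents align with the state and so receive equal, larger weights than the second.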
Figure: decoding steps u = 1, 2, 3. At each step, attention weights over the input contents h_1, ..., h_U feed an LSTM; one LSTM step is added per u, and the last step emits <EOS>.
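The step-by-step decoding shown above can be sketched as a loop that runs until the end-of-sequence symbol appears. This is only an illustration: `step` is a hypothetical callback standing in for one LSTM step with attention, and `toy_step` replays a fixed script instead of running a trained model.

```python
EOS = "<EOS>"

def greedy_decode(step, max_steps=20):
    """Run a decoder one step at a time until it emits <EOS>.

    step(u, prev_symbol) -> next symbol. It stands in for one LSTM step
    with attention (hypothetical callback, not a real library API).
    """
    output = []
    prev = None
    for u in range(1, max_steps + 1):
        symbol = step(u, prev)
        if symbol == EOS:
            break  # stop as soon as the decoder signals the end
        output.append(symbol)
        prev = symbol  # the emitted symbol is fed back at the next step
    return output

def toy_step(u, prev):
    """Toy decoder: emits two symbols, then <EOS> at u = 3."""
    script = ["a", "b", EOS]
    return script[u - 1]

decoded = greedy_decode(toy_step)
```

The `max_steps` cap is a common safeguard so that decoding terminates even if the decoder never emits `<EOS>`.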
[1] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Figure: image regions attended when generating the words "woman", "park", and "Frisbee".
Figure: scene text examples ("Tango", "ATM", "Hotel", "BLACK").
Network
https://github.com/bgshih/crnn
representation
[‘S’,‘A’,‘L’,‘E’,<EOS>]
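The target sequence above is the word's characters followed by an end-of-sequence token. A small sketch of that conversion in both directions, where the helper names `to_target` and `from_output` are made up for this example:

```python
EOS = "<EOS>"

def to_target(word):
    """Split a word into characters and append the end-of-sequence token."""
    return list(word) + [EOS]

def from_output(symbols):
    """Join predicted symbols back into a word, stopping at the first <EOS>."""
    chars = []
    for s in symbols:
        if s == EOS:
            break
        chars.append(s)
    return "".join(chars)

target = to_target("SALE")
word = from_output(target)
```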
to-sequence conversion)
Attention: select relevant contents
(SRN).
[1] Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu: Spatial Transformer Networks. CoRR abs/1506.02025 (2015)
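The cited spatial transformer has two differentiable pieces after the localization step: a grid generator and a sampler. The sketch below illustrates both in plain Python, assuming an affine transform and bilinear sampling for simplicity. The function names are hypothetical, and the transform used by the model in this talk is not shown in these slides and may differ.

```python
import math

def affine_grid(theta, out_h, out_w):
    """Map each output pixel to a source location in normalized [-1, 1] coords.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]].
    """
    grid = []
    for i in range(out_h):
        y = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Sample image (a list of rows of floats) at the grid's source locations."""
    h, w = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = int(math.floor(px)), int(math.floor(py))
            value = 0.0
            # Blend the four neighbouring pixels by their distance to (px, py).
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:  # zero outside the image
                        wgt = (1 - abs(px - xx)) * (1 - abs(py - yy))
                        value += wgt * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
flip_x = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same = bilinear_sample(image, affine_grid(identity, 2, 3))
flipped = bilinear_sample(image, affine_grid(flip_x, 2, 3))
```

With the identity matrix the output reproduces the input, and negating the x scale mirrors each row, which is a quick sanity check for the coordinate conventions.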
itself
SVT-Perspective (perspective text)
CUTE80 (curved text)
recognition datasets
“door” “billiards” “hertz” “restaurant” “central” “everest”
images/speeches/sentences/etc.
recognition
Bai, Robust Scene Text Recognition with Automatic Rectification. Accepted to CVPR 2016.