CS6501: Deep Learning for Visual Recognition: Seq2Seq Model & Text-to-Image Synthesis


SLIDE 1

CS6501: Deep Learning for Visual Recognition

Seq2Seq Model & Text-to-Image Synthesis

Presenter: Fuwen Tan

SLIDE 2
Today’s Class
  • Mini-batch training of the RNN model
  • Special “End-of-Sequence” token: <end>
  • Padding
  • Sequence-to-sequence model
  • Neural Machine Translation [1]
  • Text-to-Image Synthesis [2]

[1] Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu Pham, and Christopher D. Manning. EMNLP 2015.
[2] Text2Scene: Generating Compositional Scenes from Textual Descriptions. Fuwen Tan, Song Feng, Vicente Ordonez. CVPR 2019.

SLIDE 3

An RNN model will never stop on its own

“Hello”, “world”, “!”, “!”, “!”, “!”, “!”, …

SLIDE 4

Unless: we set the maximum length beforehand

Sample 1: “hello” “world” “java” “is” “better”
Sample 2: “hello” “hoos” “I” “like” “python”
Sample 3: “one” “plus” “eight” “equals” “to”

I want sentences of 5 words

SLIDE 5

Or: learn to predict the END.

Ground-truth: “Hello”, “world”, “<end>”
Training: learn to generate the ground-truth sequence with “<end>”.
Testing: generate the sequence until an “<end>” is predicted.
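A minimal sketch of this decode loop (toy code, not from the lecture): `step` stands in for one RNN step, mapping (hidden state, previous token) to (new state, next token); decoding stops when “<end>” is predicted, with a maximum-length safety cap.

```python
def decode(step, h0, start="<start>", end="<end>", max_len=20):
    """Greedy decoding: generate tokens until `end` is predicted
    (or `max_len` is hit, as a safety cap)."""
    tokens, h, prev = [], h0, start
    for _ in range(max_len):
        h, word = step(h, prev)
        if word == end:        # the learned stopping condition
            break
        tokens.append(word)
        prev = word
    return tokens

# Toy stand-in "RNN" that deterministically emits: Hello world <end>
def toy_step(h, prev):
    table = {"<start>": "Hello", "Hello": "world", "world": "<end>"}
    return h, table.get(prev, "<end>")

print(decode(toy_step, h0=None))  # ['Hello', 'world']
```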

SLIDE 6

Computing the loss: what if the ground-truth and predicted lengths differ?

Ground-truth: “hello” “world” “<end>”
Prediction 1: “hello” “<end>” “foo”
Prediction 2: “hello” “how” “are” “you” “<end>”

(Per-step losses loss1, loss2, loss3, one per ground-truth position.)

SLIDE 7

Mini-batch training: padding

Sample 1: “hello” “how” “are” “you” “today” “<end>”
Sample 2: “a” “dog” “is” “driving” “<end>” “<pad>”
Sample 3: “hello” “world” “<end>” “<pad>” “<pad>” “<pad>”

SLIDE 8

Mini-batch training: padding

Sample 1: “hello” “how” “are” “you” “today” “<end>”
Sample 2: “a” “dog” “is” “driving” “<end>” “<pad>”
Sample 3: “hello” “world” “<end>” “<pad>” “<pad>” “<pad>”

Mask 1: 1.0 1.0 1.0 1.0 1.0 1.0
Mask 2: 1.0 1.0 1.0 1.0 1.0 0.0
Mask 3: 1.0 1.0 1.0 0.0 0.0 0.0
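The padding scheme above can be sketched in plain Python (an illustrative sketch, not the lecture’s code): pad every sequence to the batch maximum, build the 0/1 mask, and use it so padded positions contribute nothing to the loss.

```python
def pad_batch(seqs, pad="<pad>"):
    """Pad every sequence to the batch maximum and build a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad] * (max_len - len(s)) for s in seqs]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs]
    return padded, mask

def masked_mean_loss(losses, mask):
    """Average per-token losses, ignoring padded positions."""
    total = sum(l * m for ls, ms in zip(losses, mask) for l, m in zip(ls, ms))
    count = sum(m for row in mask for m in row)
    return total / count

batch = [
    ["hello", "how", "are", "you", "today", "<end>"],
    ["a", "dog", "is", "driving", "<end>"],
    ["hello", "world", "<end>"],
]
padded, mask = pad_batch(batch)
# padded[2] == ["hello", "world", "<end>", "<pad>", "<pad>", "<pad>"]
# mask[2]   == [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```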

SLIDE 9

Generating text that makes sense: Language Model

Outputs: “je” “suis” “un” “étudiant” “<end>”
Inputs: “<start>” “je” “suis” “un” “étudiant”
Hidden states: h0 → h1 → h2 → h3 → h4 → h5
Unconditional: h0 = 0

SLIDE 10

Generating text with a goal: Machine Translation

Encoder inputs: “I” “am” “a” “student”; encoder states h̄1 → h̄2 → h̄3 → h̄4
Decoder outputs: “je” “suis” “un” “étudiant” “<end>”
Decoder inputs: “<start>” “je” “suis” “un” “étudiant”
Decoder states: h0 → h1 → h2 → h3 → h4 → h5
Conditional: h0 = h̄4

[3] Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc V. Le. NeurIPS 2014.
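A toy numpy sketch of this conditioning (the tanh cells, weight matrices, and embedding function are illustrative assumptions, not the lecture’s model): the encoder consumes the source sentence, and its final state becomes the decoder’s initial state h0.

```python
import numpy as np

H = 8                                    # hidden size (assumed)
rng = np.random.default_rng(0)
W_enc = 0.1 * rng.normal(size=(H, H))    # toy encoder recurrence weights
W_dec = 0.1 * rng.normal(size=(H, H))    # toy decoder recurrence weights

def embed(word):
    """Deterministic toy word embedding (placeholder for a learned one)."""
    v = np.zeros(H)
    for i, ch in enumerate(word):
        v[i % H] += ord(ch) / 100.0
    return v

def rnn_step(W, h, x):
    return np.tanh(W @ h + x)

# Encoder: states h̄1 … h̄4 over "I am a student"; keep the final state.
h_bar = np.zeros(H)
for word in ["I", "am", "a", "student"]:
    h_bar = rnn_step(W_enc, h_bar, embed(word))

h0 = h_bar                               # conditional: h0 = h̄4
h1 = rnn_step(W_dec, h0, embed("<start>"))
```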

SLIDE 11

Seq2Seq model

Seq2Seq: “suis” ← f(h2) = softmax(W h2)

SLIDE 12

Seq2Seq model with perfect word alignments

Seq2Seq: “suis” ← f(h2)
Ideally: “suis” ← f_a(h2, “am”)

SLIDE 13

Seq2Seq model with perfect word alignments

Seq2Seq: “suis” ← f(h2)
Ideally: “suis” ← f_a(h2, “am”)
Or: “suis” ← f_a(h2, h̄2)

SLIDE 14

Seq2Seq model with attention

Ideally: “suis” ← f_a(h2, h̄2)
In practice: “suis” ← f_a(h2, c2), where c2 = Σ_{s=1}^{4} a_{2,s} h̄_s
Pray that a_{2,2} = 1 and a_{2,s} = 0 for s ≠ 2, or train the model such that this is almost true.

SLIDE 15

Seq2Seq model with attention

[1] Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu Pham, and Christopher D. Manning. EMNLP 2015

a_{t,s} = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))

score(h_t, h̄_s) = h_t^T W h̄_s

Key assumption: the decoder and encoder states are comparable, i.e. h_t ≈ h̄_s for aligned words (e.g. the state at “je” resembles the state at the aligned source word).
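These weights can be computed directly in numpy under the bilinear (“general”) score from [1]; the vectors and the matrix W below are random placeholders, for illustration only.

```python
import numpy as np

def attention_weights(h_t, h_bars, W):
    """a_{t,s} = softmax over s of score(h_t, h̄_s), with the
    bilinear score h_t^T W h̄_s."""
    scores = np.array([h_t @ W @ h_s for h_s in h_bars])
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
H = 4
h_t = rng.normal(size=H)                          # decoder state
h_bars = [rng.normal(size=H) for _ in range(4)]   # encoder states h̄1 … h̄4
W = rng.normal(size=(H, H))

a_t = attention_weights(h_t, h_bars, W)   # a distribution over 4 source words
```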

SLIDE 16

Seq2Seq model with attention

[1] Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu Pham, and Christopher D. Manning. EMNLP 2015

“suis” ← f_a(h2, c2) = softmax(W_o tanh(W_c [h2; c2]))
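Putting the pieces together in numpy (a sketch; the vocabulary size and all weights are illustrative assumptions): the context vector c_t mixes the encoder states with the attention weights, and the output layer is softmax(W_o tanh(W_c [h_t; c_t])).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentional_output(h_t, h_bars, a_t, W_c, W_o):
    """Luong-style output: c_t = sum_s a_{t,s} * h̄_s, then
    softmax(W_o tanh(W_c [h_t; c_t]))."""
    c_t = (a_t[:, None] * h_bars).sum(axis=0)      # context vector
    h_tilde = np.tanh(W_c @ np.concatenate([h_t, c_t]))
    return softmax(W_o @ h_tilde)                  # distribution over the vocab

rng = np.random.default_rng(2)
H, V = 4, 10                       # hidden size, toy vocabulary size
h_t = rng.normal(size=H)
h_bars = rng.normal(size=(4, H))   # encoder states h̄1 … h̄4
a_t = np.full(4, 0.25)             # uniform attention, for illustration
W_c = rng.normal(size=(H, 2 * H))
W_o = rng.normal(size=(V, H))

p = attentional_output(h_t, h_bars, a_t, W_c, W_o)
```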

SLIDE 17

Performs much better on long sequences

[1] Effective Approaches to Attention-based Neural Machine Translation. Thang Luong, Hieu Pham, and Christopher D. Manning. EMNLP 2015

SLIDE 18

Also very helpful in image captioning

[4] Show, attend and tell: neural image caption generation with visual attention. Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015

SLIDE 19

ECCV 2018 accepted 776 papers

38 of them with “attention” in their titles

SLIDE 20

Seq2Seq vs Text-to-Image Synthesis

Sentence: composition of words
Image: composition of patches

SLIDE 21

Can we do this?

Input sentence: “Mike holds a hotdog” (encoder states h̄1 → h̄2 → h̄3 → h̄4)
Decoder states: h0 → h1 → h2

SLIDE 22

Challenges

Machine Translation: “I am a student” → “je suis un étudiant”
Text-to-Image Synthesis: “A person is holding a surfboard”

SLIDE 23

Challenges: in each step

Machine Translation: student → étudiant

Text-to-Image Synthesis:
  • object category: person, surfboard
  • location: somewhere in the 2D world
  • attributes: size, pose, expression, …

“A person is holding a surfboard”

SLIDE 24

Challenges: in each step

Text-to-Image Synthesis:

  • object category: person
  • location: somewhere in the 2D world
  • attributes: size, pose, expression, …

Learning the distributions of categories, locations, and attributes from the training samples

“A person is holding a surfboard”

SLIDE 25

Input: “Mike holds a hotdog” (encoder states h̄1 → h̄2 → h̄3 → h̄4; decoder states h0 → h1 → h2)

Step 1 (h1):
  • object: clip-art of “Mike”
  • location: somewhere on the ground
  • attributes: pose: hold; size, …

Step 2 (h2):
  • object: clip-art of “hotdog”
  • location: in Mike’s hand
  • attributes: size: smaller than Mike

SLIDE 26

Task 1: Abstract Scene Generation

“Mike is surprised at the duck. The duck is standing on the grill. Jenny is running towards Mike and the duck.”

SLIDE 27

Task 1: Abstract Scene Generation

Object category: 58 clip-art objects
Location: 28 × 28 grid
Attributes: 3 sizes, 2 orientations, 7 poses and 5 expressions for “Mike” and “Jenny”
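One way to picture a single decoding step for this task (a hypothetical sketch, not Text2Scene’s actual architecture): separate softmax heads over the output spaces the slide lists, i.e. 58 categories, a 28 × 28 location grid, and the attribute vocabularies. The head weights below are made-up placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Output spaces from the slide; the head weights below are illustrative.
HEAD_SIZES = {
    "category": 58,        # clip-art objects
    "location": 28 * 28,   # discretized 2D grid
    "size": 3,
    "orientation": 2,
    "pose": 7,
    "expression": 5,
}

def predict_step(h_t, heads):
    """Map one decoder state to a distribution per output head."""
    return {name: softmax(W @ h_t) for name, W in heads.items()}

rng = np.random.default_rng(3)
H = 16                              # hidden size (assumed)
heads = {name: rng.normal(size=(n, H)) for name, n in HEAD_SIZES.items()}
out = predict_step(rng.normal(size=H), heads)
# out["location"] is a distribution over the 784 grid cells
```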

SLIDE 28

Task 2: Scene Layout Generation

“A guy on a motorcycle with some people watching.”

SLIDE 29

Task 2: Scene Layout Generation

Object category: 80 object categories from COCO: “person”, “car”, “chair”, …
Location: 28 × 28 grid
Attributes: 17 sizes, 17 aspect-ratios

SLIDE 30

Task 3: Composite Image Generation

“Several elephants walking together in a line near water.”

SLIDE 31

Task 3: Composite Image Generation

Object category: 95 object & stuff categories from COCO: “person”, “grass”, “sky”, …
Location: 32 × 32 grid
Attributes: 17 sizes, 17 aspect-ratios, and a feature vector for patch retrieval

SLIDE 32
SLIDE 33

Step-by-step generation of Abstract Scene

SLIDE 34

Step-by-step generation of composite image

SLIDE 35

Step-by-step generation of composite image

SLIDE 36

More examples

SLIDE 37

More examples

SLIDE 38

More examples

SLIDE 39

More examples

SLIDE 40

Questions?
