vision & language: CS 685, Fall 2020, Introduction to Natural Language Processing



slide-1
SLIDE 1

vision & language

CS 685, Fall 2020

Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs685/

Mohit Iyyer

College of Information and Computer Sciences University of Massachusetts Amherst

some slides adapted from Vicente Ordonez, Fei-Fei Li, and Jacob Andreas

slide-2
SLIDE 2

Next week

  • Tues (11/3): exam review, will go over some important topics, quiz questions, prev. exam questions
  • Thu (11/5): no class, work on your exams
  • We’ll release an Overleaf link
  • You’re highly encouraged to type out your answers (in LaTeX or with some word processing software); we will also accept hand-written answers if necessary
  • Exam will be released at 8AM Thursday, due 8AM Saturday (US Eastern time) on Gradescope

slide-3
SLIDE 3

a red truck is parked on a street lined with trees

image captioning

slide-4
SLIDE 4

visual question answering

  • Is this truck considered “vintage”?
  • Does the road look new?
  • What kind of tree is behind the truck?

slide-5
SLIDE 5

we’ve seen how to compute representations of words and sentences. what about images?
slide-6
SLIDE 6

CS6501: Vision and Language

What we see | What a computer sees

grayscale images are matrices

what range of values can each pixel take?
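
A minimal sketch of the "grayscale images are matrices" idea, using nested Python lists. It assumes the common 8-bit encoding, where each pixel is an integer from 0 (black) to 255 (white); other bit depths and float encodings exist, and the pixel values below are made up.

```python
# A tiny 3x4 grayscale "image": one intensity value per pixel.
# Assumption: 8-bit encoding, so every value is an integer in [0, 255].
gray = [
    [0,   64, 128, 255],
    [32,  96, 160, 224],
    [255, 128,  64,   0],
]

height = len(gray)
width = len(gray[0])
lo = min(min(row) for row in gray)
hi = max(max(row) for row in gray)

# Models often take floats in [0, 1] rather than raw byte values.
normalized = [[p / 255 for p in row] for row in gray]
```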

slide-7
SLIDE 7

color images are tensors

channel × height × width

Channels are usually RGB: Red, Green, and Blue

Other color spaces: HSV, HSL, LUV, XYZ, Lab, CMYK, etc.
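
A small sketch of a color image as a channel × height × width tensor, again with nested lists. The R, G, B channel order and the 8-bit values are assumptions for illustration; the helper name `pixel` is hypothetical.

```python
# Build a channel x height x width tensor as nested lists.
# Assumption: 3 channels in R, G, B order with 8-bit values.
height, width = 2, 3
red = [[255] * width for _ in range(height)]
green = [[0] * width for _ in range(height)]
blue = [[0] * width for _ in range(height)]
image = [red, green, blue]  # a pure-red image

shape = (len(image), len(image[0]), len(image[0][0]))


def pixel(img, row, col):
    """Return the (R, G, B) triple at one spatial location."""
    return tuple(channel[row][col] for channel in img)
```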

slide-8
SLIDE 8

Convolution operator

Image Credit: http://what-when-how.com/introduction-to-video-and-image-processing/neighborhood-processing-introduction-to-video-and-image-processing-part-1/

$k(x, y)$

$$g(x, y) = \sum_{v} \sum_{u} k(u, v)\, f(x - u,\, y - v)$$
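
The convolution sum on this slide, read as g(x, y) = Σ_v Σ_u k(u, v) f(x − u, y − v), can be sketched directly in plain Python. The sketch assumes f is the input image indexed as `f[y][x]`, k is an odd-sized square kernel whose center is (u, v) = (0, 0), and pixels outside the image count as zero; none of these conventions are stated on the slide.

```python
def convolve2d(f, k):
    """g(x, y) = sum_v sum_u k(u, v) * f(x - u, y - v).

    f: image as a list of rows, indexed f[y][x].
    k: odd-sized square kernel as a list of rows; its center is (u, v) = (0, 0).
    Pixels outside the image are treated as 0, so the output keeps the input size.
    """
    height, width = len(f), len(f[0])
    r = len(k) // 2  # kernel radius
    g = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for v in range(-r, r + 1):
                for u in range(-r, r + 1):
                    yy, xx = y - v, x - u
                    if 0 <= yy < height and 0 <= xx < width:
                        total += k[v + r][u + r] * f[yy][xx]
            g[y][x] = total
    return g


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
# A single 1 at (u, v) = (1, 0) gives g(x, y) = f(x - 1, y): a shift to the right.
shift_right = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]

same = convolve2d(image, identity)
shifted = convolve2d(image, shift_right)
```

Many deep learning libraries compute cross-correlation (the same sum without flipping the kernel) under the name "convolution"; since the filter weights are learned, the difference rarely matters in practice.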

slide-9
SLIDE 9

(filter, kernel)

?

slide-10
SLIDE 10

demo: http://setosa.io/ev/image-kernels/

slide-11
SLIDE 11

Convolutional Layer (with 4 filters)

Input: 1x224x224
Output: 4x224x224 (if zero padding, and stride = 1)
Weights: 4x1x9x9

slide-12
SLIDE 12

Convolutional Layer (with 4 filters)

Input: 1x224x224
Output: 4x112x112 (if zero padding, but stride = 2)
Weights: 4x1x9x9
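
The shapes on the two convolutional-layer slides follow from the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below checks both cases; it assumes "zero padding" means 4 zeros on every side (the amount that keeps a 9x9 filter's output at the input size when stride = 1), and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_layer_shapes(in_channels, size, num_filters, kernel, stride, padding):
    """Return (output shape, weight shape) for a square input and square filters."""
    out = conv_output_size(size, kernel, stride, padding)
    output_shape = (num_filters, out, out)
    weight_shape = (num_filters, in_channels, kernel, kernel)
    return output_shape, weight_shape


# Assumption: padding = 4, i.e. (9 - 1) / 2 zeros on each side.
stride1 = conv_layer_shapes(1, 224, 4, 9, stride=1, padding=4)
stride2 = conv_layer_shapes(1, 224, 4, 9, stride=2, padding=4)

# Number of filter weights (biases not counted): 4 * 1 * 9 * 9.
num_weights = 4 * 1 * 9 * 9
```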

slide-13
SLIDE 13

pooling layers also used to reduce dimensionality

Convolutional Layers: slide a set of small filters over the image

Pooling Layers: reduce dimensionality of representation

image: https://cs231n.github.io/convolutional-networks/

why reduce dimensionality?
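
One common pooling choice is max pooling over non-overlapping windows; the slide does not say which variant is meant, so treat this as an illustrative sketch. It assumes the feature map's height and width are divisible by the window size.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling (stride equal to the window size).

    Assumption: height and width are divisible by `size`.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height, size):
        row = []
        for left in range(0, width, size):
            window = [
                feature_map[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # 16 values shrink to 4
```

Here a 4x4 map becomes 2x2, so every later layer has a quarter as many positions to process.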

slide-14
SLIDE 14

AlexNet (Krizhevsky et al., 2012)

the paper that started the deep learning revolution!

slide-15
SLIDE 15

image classification

Classify an image into 1000 possible classes, e.g. Abyssinian cat, Bulldog, French Terrier, Cormorant, Chickadee, red fox, banjo, barbell, hourglass, knot, maze, viaduct, etc.

Example output: cat, tabby cat (0.71); Egyptian cat (0.22); red fox (0.11); …

train on the ImageNet challenge dataset, ~1.2 million images

slide-16
SLIDE 16

AlexNet

https://www.saagie.com/fr/blog/object-detection-part1

slide-17
SLIDE 17

AlexNet

https://www.saagie.com/fr/blog/object-detection-part1

conv+pool → conv+pool → conv → conv → conv → linear → linear → linear+softmax

slide-18
SLIDE 18

What is happening?

https://www.saagie.com/fr/blog/object-detection-part1

slide-19
SLIDE 19

Slide by Mohammad Rastegari

slide-20
SLIDE 20
slide-21
SLIDE 21

CNN( [image] ) = [vector]

softmax: predict ‘truck’

at the end of the day, we generate a fixed size vector from an image and run a classifier over it
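
A minimal sketch of "run a classifier over a fixed-size vector": a linear layer followed by softmax. The 4-dimensional image vector, the three labels, and the hand-picked weights are all made up for illustration; real CNN features have hundreds or thousands of dimensions and the weights are learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


# Hypothetical stand-in for the vector a CNN would produce for one image.
labels = ["truck", "cat", "tree"]
image_vector = [0.9, 0.1, 0.4, 0.0]
weights = [
    [2.0, 0.0, 1.0, 0.0],  # truck
    [0.0, 2.0, 0.0, 1.0],  # cat
    [0.0, 0.0, 0.5, 2.0],  # tree
]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(weights, bias, image_vector))
prediction = labels[probs.index(max(probs))]
```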

slide-22
SLIDE 22

CNN( [image] ) = [vector]

key insight: this vector is useful for many more tasks than just image classification! we can use it for transfer learning

slide-23
SLIDE 23

simple visual QA

  • i = CNN(image) > use an existing network trained for image classification and freeze weights
  • q = RNN(question) > learn weights
  • answer = softmax(linear([i;q]))

why isn’t this a good way of doing visual QA?
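
The three steps of this simple visual QA model can be sketched as follows. The vectors `i` and `q` are made-up stand-ins for the outputs of a frozen CNN and an RNN, the answer vocabulary is hypothetical, and the linear layer uses random rather than learned weights.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def simple_vqa(i, q, weights, bias):
    """answer distribution = softmax(linear([i; q]))."""
    joint = i + q  # list concatenation plays the role of [i; q]
    return softmax(linear(weights, bias, joint))


# Stand-ins: `i` would come from a frozen image-classification CNN and
# `q` from an RNN over the question; both are made-up vectors here.
rng = random.Random(0)
i = [0.2, 0.7, 0.1]
q = [0.5, -0.3]
answers = ["yes", "no", "2"]
weights = [[rng.uniform(-1, 1) for _ in range(len(i) + len(q))] for _ in answers]
bias = [0.0] * len(answers)

answer_probs = simple_vqa(i, q, weights, bias)
```

Note that `i` is computed without ever seeing the question, so the same image summary has to serve every possible question about that image.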

slide-24
SLIDE 24

How many benches are shown?

slide-25
SLIDE 25

visual attention

  • Use the question representation q to determine where in the image to look

How many benches are shown?

slide-26
SLIDE 26

How many benches are shown?

0.2 0.2 0.3 0.1 0.05 0.05 0.05 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

softmax: predict answer

attention over final convolutional layer in network: 196 boxes, captures color and positional information

slide-27
SLIDE 27

How many benches are shown?

0.2 0.2 0.3 0.1 0.05 0.05 0.05 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

softmax: predict answer

attention over final convolutional layer in network: 196 boxes, captures color and positional information

how can we compute these attention scores?
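
One common way to compute such scores (not necessarily the one used in the lecture) is dot-product attention: score each region by its dot product with q, then normalize with softmax. The sketch assumes q and the region vectors share a dimensionality; otherwise a learned projection would be needed. The four 2-dimensional regions are made up, whereas the slide's network has 196.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, regions):
    """Soft attention: weight each region by softmax(q . region)."""
    scores = [sum(qi * ri for qi, ri in zip(q, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # The context vector is the attention-weighted average of the regions.
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context


# Four made-up region vectors standing in for the final conv layer's boxes.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
q = [2.0, 0.0]
weights, context = attend(q, regions)
```

The resulting `context` vector can then replace the single image vector as input to the answer classifier.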

slide-28
SLIDE 28

How many benches are shown?

0.0 0.0 1.0 0.0 0.05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

softmax: predict answer

attention over final convolutional layer in network: 196 boxes, captures color and positional information

we can use reinforcement learning to focus on just one box

hard attention
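
A rough sketch of the selection step in hard attention: instead of averaging over boxes, sample exactly one box according to the attention distribution. Only the sampling is shown; the reinforcement learning update that the slide mentions is not implemented here, and the eight weights are an illustrative subset rather than the full grid.

```python
import random


def sample_hard_attention(weights, rng):
    """Pick exactly one box, with probability given by the attention weights."""
    index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    one_hot = [1.0 if j == index else 0.0 for j in range(len(weights))]
    return index, one_hot


soft_weights = [0.2, 0.2, 0.3, 0.1, 0.05, 0.05, 0.05, 0.05]
rng = random.Random(0)
index, one_hot = sample_hard_attention(soft_weights, rng)
```

Because sampling an index is not differentiable, gradients cannot flow through this choice, which is why a reinforcement learning style objective is used to train it.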

slide-29
SLIDE 29

Grounded question answering

yes

Is there a red shape above a circle?

Slide credit: Jacob Andreas

slide-30
SLIDE 30

Neural nets learn lexical groundings

yes

[Iyyer et al. 2014, Bordes et al. 2014, Yang et al. 2015, Malinowski et al., 2015]

Is there a red shape above a circle?

Slide credit: Jacob Andreas

slide-31
SLIDE 31

Semantic parsers learn composition

yes

[Wong & Mooney 2007, Kwiatkowski et al. 2010, Liang et al. 2011, A et al. 2013]

Is there a red shape above a circle?

Slide credit: Jacob Andreas

slide-32
SLIDE 32

Neural module networks learn both!

yes

Is there a red shape above a circle?

Slide credit: Jacob Andreas

slide-33
SLIDE 33

Neural module networks

Is there a red shape above a circle?

[figure: modules red (↦), exists (↦ true), above (↦)]

Slide credit: Jacob Andreas

slide-34
SLIDE 34

Neural module networks

Is there a red shape above a circle?

[figure: modules red (↦), exists (↦ true), above (↦); assembled network: circle, red, above, exists, and]

Slide credit: Jacob Andreas

slide-35
SLIDE 35

Neural module networks

yes

Is there a red shape above a circle?

[figure: modules red (↦), exists (↦ true), above (↦); assembled network: circle, red, above, exists, and]

Slide credit: Jacob Andreas

slide-36
SLIDE 36

Is there a red shape above a circle?

exists(and(red, above(circle)))

Sentence meanings are computations
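
To make "sentence meanings are computations" concrete, here is a symbolic toy that composes the modules as exists(and(red, above(circle))) over a hand-written scene. This is only an analogy: in neural module networks each module is a learned neural network operating on image features and attention maps, not a set operation, and the scene format and function names below are hypothetical.

```python
def find(scene, key, value):
    """Indices of shapes whose attribute `key` equals `value`."""
    return {i for i, shape in enumerate(scene) if shape[key] == value}


def above(scene, targets):
    """Shapes located above at least one target shape (smaller row = higher)."""
    return {
        i
        for i, shape in enumerate(scene)
        if any(shape["row"] < scene[t]["row"] for t in targets)
    }


def and_(a, b):
    """Shapes selected by both inputs."""
    return a & b


def exists(a):
    """True if anything is selected."""
    return len(a) > 0


def answer(scene):
    """Is there a red shape above a circle? -> exists(and(red, above(circle)))"""
    red = find(scene, "color", "red")
    circle = find(scene, "kind", "circle")
    return exists(and_(red, above(scene, circle)))


scene = [
    {"kind": "square", "color": "red", "row": 0},
    {"kind": "circle", "color": "blue", "row": 1},
    {"kind": "triangle", "color": "green", "row": 2},
]
flipped = [
    {"kind": "square", "color": "red", "row": 2},
    {"kind": "circle", "color": "blue", "row": 1},
]

result = answer(scene)
result_flipped = answer(flipped)
```

The question's structure fixes how the modules are wired together, while what each module computes is the part a neural model would learn.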

Slide credit: Jacob Andreas

slide-37
SLIDE 37

NLVR2: natural language for visual reasoning! (Suhr et al., 2018)

TRUE OR FALSE: the left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.

slide-38
SLIDE 38

image captioning

slide-39
SLIDE 39
slide-40
SLIDE 40
slide-41
SLIDE 41
  • this is our ImageNet CNN, now used as a feature extractor

slide-42
SLIDE 42
  • this is our ImageNet CNN, now used as a feature extractor

slide-43
SLIDE 43
slide-44
SLIDE 44
  • let’s use the image features to create a conditional LM
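
A toy sketch of a conditional language model: the image features set the initial hidden state of a simple recurrent network, so every next-word distribution depends on the image. This particular design (initializing the state from the image, a tanh RNN, greedy decoding) is an assumption, since the slide's diagram is not available; the class name, vocabulary, and feature vectors are hypothetical, and the weights are random, so the generated words are meaningless.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class ToyCaptioner:
    """Untrained RNN language model conditioned on image features."""

    def __init__(self, vocab, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.w_img = rand_matrix(hidden_dim, feat_dim, rng)    # image -> h0
        self.w_hh = rand_matrix(hidden_dim, hidden_dim, rng)   # h -> h
        self.embed = rand_matrix(len(vocab), hidden_dim, rng)  # word vectors
        self.w_out = rand_matrix(len(vocab), hidden_dim, rng)  # h -> logits

    def init_state(self, image_features):
        return [math.tanh(z) for z in matvec(self.w_img, image_features)]

    def step(self, hidden, word_id):
        """Consume one word; return the new state and p(next word | history, image)."""
        pre = [a + b for a, b in zip(matvec(self.w_hh, hidden), self.embed[word_id])]
        hidden = [math.tanh(z) for z in pre]
        return hidden, softmax(matvec(self.w_out, hidden))

    def greedy_caption(self, image_features, max_len=5):
        hidden = self.init_state(image_features)
        word_id = self.vocab.index("<s>")
        words = []
        for _ in range(max_len):
            hidden, probs = self.step(hidden, word_id)
            word_id = probs.index(max(probs))
            if self.vocab[word_id] == "</s>":
                break
            words.append(self.vocab[word_id])
        return words


vocab = ["<s>", "</s>", "a", "red", "truck"]
model = ToyCaptioner(vocab, feat_dim=3, hidden_dim=4)
feats_a = [1.0, 0.0, 0.5]   # made-up CNN features for image A
feats_b = [0.0, 1.0, -0.5]  # made-up CNN features for image B

_, probs_a = model.step(model.init_state(feats_a), vocab.index("<s>"))
_, probs_b = model.step(model.init_state(feats_b), vocab.index("<s>"))
caption = model.greedy_caption(feats_a)
```

With trained weights, the same loop would be fit to maximize the probability of reference captions given each image's features.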

slide-45
SLIDE 45
slide-46
SLIDE 46
slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54
slide-55
SLIDE 55

ViLBERT (vision and language BERT)

Lu et al., 2019 (“ViLBERT”)

slide-56
SLIDE 56

Suhr et al., 2019 (“CEREALBAR”)

slide-57
SLIDE 57

Suhr et al., 2019 (“CEREALBAR”)