vision & language CS 685, Fall 2020 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs685/ Mohit Iyyer College of Information and Computer Sciences University of Massachusetts Amherst some slides adapted from Vicente Ordonez, Fei-Fei Li, and Jacob Andreas

Next week • Tues (11/3): exam review, will go over some important topics, quiz questions, prev. exam questions • Thu (11/5): no class, work on your exams • We’ll release an overleaf link • You’re highly encouraged to type out your answers (in LaTeX or with some word processing software); we will also accept hand-written answers if necessary • Exam will be released at 8AM Thursday, due 8AM Saturday (US Eastern time) on Gradescope

image captioning: a red truck is parked on a street lined with trees

visual question answering • Is this truck considered “vintage”? • Does the road look new? • What kind of tree is behind the truck?

we’ve seen how to compute representations of words and sentences. what about images?

grayscale images are matrices (figure: what we see vs. what a computer sees) what range of values can each pixel take? CS6501: Vision and Language

color images are tensors: channel x height x width. Channels are usually RGB: Red, Green, and Blue. Other color spaces: HSV, HSL, LUV, XYZ, Lab, CMYK, etc.
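As a minimal sketch of the two slides above, the snippet below builds a tiny grayscale matrix and a three-channel tensor with NumPy. The pixel values and the 8-bit (`uint8`) storage are assumptions chosen for illustration; under that assumption each pixel takes a value from 0 to 255.

```python
import numpy as np

# A tiny 2x3 grayscale image: one intensity per pixel (assumed 8-bit storage).
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A color image adds a channel axis: channel x height x width.
# The three channels here are made-up stand-ins for R, G and B.
color = np.stack([gray, gray // 2, 255 - gray])

# Range of values an 8-bit pixel can take.
gray_range = (int(np.iinfo(gray.dtype).min), int(np.iinfo(gray.dtype).max))

print(gray.shape, color.shape, gray_range)
```

Images stored as floats are often rescaled to the interval [0, 1] instead; which convention applies depends on the library and file format.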

Convolution operator: $g(x, y) = \sum_{v} \sum_{u} k(u, v)\, f(x - u, y - v)$, with kernel $k(x, y)$. Image Credit: http://what-when-how.com/introduction-to-video-and-image-processing/neighborhood-processing-introduction-to-video-and-image-processing-part-1/
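The convolution sum on this slide can be written out directly. The sketch below is a naive NumPy version that treats the image f as zero outside its bounds and indexes the kernel from 0; the function name `conv2d_full` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def conv2d_full(f, k):
    """g[x, y] = sum_u sum_v k[u, v] * f[x - u, y - v], with f zero outside its bounds."""
    H, W = f.shape
    kh, kw = k.shape
    g = np.zeros((H + kh - 1, W + kw - 1))
    for u in range(kh):
        for v in range(kw):
            # Adding k[u, v] times f shifted by (u, v) realizes the f[x - u, y - v] term.
            g[u:u + H, v:v + W] += k[u, v] * f
    return g

f = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])
g = conv2d_full(f, k)
print(g)
```

Deep learning libraries usually implement cross-correlation (no kernel flip) under the name "convolution"; because the filter weights are learned, the difference does not matter in practice.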

(filter, kernel) ?

demo: http://setosa.io/ev/image-kernels/

Convolutional Layer (with 4 filters): input: 1x224x224; weights: 4x1x9x9; output: 4x224x224 if zero padding, and stride = 1

Convolutional Layer (with 4 filters): input: 1x224x224; weights: 4x1x9x9; output: 4x112x112 if zero padding, but stride = 2
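The output sizes on the two slides above follow from the standard formula floor((n + 2p - k) / s) + 1. The sketch below checks both cases, assuming "zero padding" means padding each side by (k - 1) / 2 = 4 pixels for the 9x9 filters; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

kernel = 9
padding = (kernel - 1) // 2  # assumed "same"-style zero padding: 4 pixels per side

out_stride1 = conv_output_size(224, kernel, stride=1, padding=padding)
out_stride2 = conv_output_size(224, kernel, stride=2, padding=padding)

# 4 filters x 1 input channel x 9 x 9 weights (biases not counted).
num_weights = 4 * 1 * kernel * kernel

print(out_stride1, out_stride2, num_weights)
```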

pooling layers also used to reduce dimensionality. Convolutional Layers: slide a set of small filters over the image. Pooling Layers: reduce dimensionality of representation. why reduce dimensionality? image: https://cs231n.github.io/convolutional-networks/
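As a rough illustration of how pooling reduces dimensionality, here is a sketch of non-overlapping 2x2 max pooling in NumPy. Max pooling with a 2x2 window is an assumption (the slide does not name a pooling type), and `max_pool2d` is a hypothetical helper.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over a 2D array (stride equals window size)."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]  # drop rows/columns that do not fill a window
    # Split each axis into (blocks, within-block) and take the max inside each block.
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
pooled = max_pool2d(x)
print(pooled)
```

Each 2x2 window collapses to a single value, so a 4x4 map becomes 2x2 and later layers have a quarter as many activations to process.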

AlexNet: the paper that started the deep learning revolution!

image classification: Classify an image into 1000 possible classes, e.g. Abyssinian cat, Bulldog, French Terrier, Cormorant, Chickadee, red fox, banjo, barbell, hourglass, knot, maze, viaduct, etc. Example output: cat, tabby cat (0.71); Egyptian cat (0.22); red fox (0.11); … train on the ImageNet challenge dataset, ~1.2 million images

AlexNet https://www.saagie.com/fr/blog/object-detection-part1

AlexNet layers: conv+pool, conv+pool, conv, conv, conv, linear, linear, linear+softmax https://www.saagie.com/fr/blog/object-detection-part1

What is happening? https://www.saagie.com/fr/blog/object-detection-part1

Slide by Mohammad Rastegari

at the end of the day, we generate a fixed size vector from an image and run a classifier over it: CNN(image) = vector, then softmax: predict ‘truck’

key insight: this vector is useful for many more tasks than just image classification! we can use it for transfer learning: CNN(image) = vector

simple visual QA • i = CNN (image) > use an existing network trained for image classification and freeze weights • q = RNN (question) > learn weights • answer = softmax(linear([i;q])) why isn’t this a good way of doing visual QA?
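The three steps of this simple visual QA model can be sketched as a single forward pass. Everything below is a stand-in: the vectors `i` and `q` are random placeholders for the outputs of a frozen CNN and a learned RNN, and the dimensions and answer count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_img, d_q, n_answers = 8, 6, 4  # hypothetical sizes

i = rng.normal(size=d_img)  # placeholder for CNN(image); weights would be frozen
q = rng.normal(size=d_q)    # placeholder for RNN(question); weights would be learned

# Linear layer over the concatenation [i; q], followed by softmax over answers.
W = rng.normal(size=(n_answers, d_img + d_q)) * 0.1
b = np.zeros(n_answers)
answer_probs = softmax(W @ np.concatenate([i, q]) + b)

print(answer_probs)
```

Note that the whole image is squeezed into the single vector `i` before the question is seen, so the model cannot look at different image regions for different questions.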

How many benches are shown?

visual attention • Use the question representation q to determine where in the image to look How many benches are shown?

softmax: predict answer
attention over final convolutional layer in network: 196 boxes, captures color and positional information
0.0   0.0   0.0   0.0   0.0   0.0
0.0   0.05  0.05  0.0   0.0   0.0
0.2   0.1   0.05  0.0   0.0   0.0
0.3   0.2   0.05  0.0   0.0   0.0
How many benches are shown?

softmax: predict answer
attention over final convolutional layer in network: 196 boxes, captures color and positional information
0.0   0.0   0.0   0.0   0.0   0.0
0.0   0.05  0.05  0.0   0.0   0.0
0.2   0.1   0.05  0.0   0.0   0.0
0.3   0.2   0.05  0.0   0.0   0.0
how can we compute these attention scores?
How many benches are shown?
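One common way to compute such attention scores is a dot product between the question vector and each box's feature vector, followed by a softmax. The sketch below assumes that scoring function (the lecture may use a different one), random features, and a hypothetical feature size of 16; only the count of 196 boxes comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

n_boxes, d = 196, 16                      # 196 boxes (a 14x14 grid); d is hypothetical
regions = rng.normal(size=(n_boxes, d))   # placeholder for final conv layer features
q = rng.normal(size=d)                    # placeholder for the question representation

# Score each box against the question, then normalize with a softmax.
scores = regions @ q
scores = scores - scores.max()            # numerical stability
attn = np.exp(scores) / np.exp(scores).sum()

# Attention-weighted average of box features, passed on to the answer classifier.
attended = attn @ regions

print(attn.reshape(14, 14).round(2))
```

Because every step is differentiable, the scoring function can be trained end to end with the answer classifier.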

hard attention
softmax: predict answer
attention over final convolutional layer in network: 196 boxes, captures color and positional information
0.0   0.0   0.0   0.0   0.0   0.0
0.0   0.0   0.0   0.0   0.0   0.0
0.0   0.0   0.0   0.0   0.0   0.0
1.0   0.0   0.05  0.0   0.0   0.0
we can use reinforcement learning to focus on just one box
How many benches are shown?
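Hard attention can be sketched as sampling a single box from the attention distribution rather than averaging over all boxes. The distribution and features below are made up for illustration, and the training procedure itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attention distribution over 8 boxes (sums to 1).
attn = np.array([0.0, 0.05, 0.05, 0.2, 0.1, 0.3, 0.2, 0.1])
regions = rng.normal(size=(attn.size, 4))  # placeholder box features

# Sample exactly one box; the resulting attention map is one-hot.
idx = rng.choice(attn.size, p=attn)
one_hot = np.zeros_like(attn)
one_hot[idx] = 1.0

# Only the chosen box's features reach the answer classifier.
attended = regions[idx]

print(idx, one_hot)
```

The sampling step is not differentiable, which is why this variant is typically trained with reinforcement learning methods such as REINFORCE rather than plain backpropagation.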

Grounded question answering: Is there a red shape above a circle? yes. Slide credit: Jacob Andreas

Neural nets learn lexical groundings: Is there a red shape above a circle? yes. [Iyyer et al. 2014, Bordes et al. 2014, Yang et al. 2015, Malinowski et al., 2015] Slide credit: Jacob Andreas

Semantic parsers learn composition: Is there a red shape above a circle? yes. [Wong & Mooney 2007, Kwiatkowski et al. 2010, Liang et al. 2011, A et al. 2013] Slide credit: Jacob Andreas

Neural module networks learn both! Is there a red shape above a circle? yes. Slide credit: Jacob Andreas

Neural module networks: Is there a red shape above a circle? (figure: modules red, exists ↦ true, above) Slide credit: Jacob Andreas

Neural module networks: Is there a red shape above a circle? layout: exists(and(red, above(circle))) (figure: modules red, exists ↦ true, above) Slide credit: Jacob Andreas

Neural module networks: Is there a red shape above a circle? layout: exists(and(red, above(circle))) (figure: modules red, exists ↦ true, above) answer: yes. Slide credit: Jacob Andreas

Sentence meanings are computations: Is there a red shape above a circle? exists(and(red, above(circle))) Slide credit: Jacob Andreas
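To make "sentence meanings are computations" concrete, here is a toy sketch in which each word on the slide (exists, and, red, above, circle) becomes a function over attention maps on a 3x3 grid. The nesting exists(and(red, above(circle))) is read off the slide's layout and is an assumption. The modules are hand-written stand-ins: in an actual neural module network each one is a small neural network with learned parameters. The scene is hypothetical.

```python
import numpy as np

# Hypothetical 3x3 scene: a red shape in the top-left cell, a circle in the bottom-left.
is_red = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [0, 0, 0]], dtype=float)
is_circle = np.array([[0, 0, 0],
                      [0, 0, 0],
                      [1, 0, 0]], dtype=float)

def red():
    return is_red  # attention map over red cells

def circle():
    return is_circle  # attention map over circle cells

def above(att):
    """Attend to every cell in a row above some attended cell (any column)."""
    out = np.zeros_like(att)
    rows = np.nonzero(att.any(axis=1))[0]
    if rows.size:
        out[:rows.max()] = 1.0
    return out

def and_(a, b):
    return np.minimum(a, b)  # cells attended by both maps

def exists(att):
    return bool(att.max() > 0.5)  # is anything attended at all?

# "Is there a red shape above a circle?"
answer = exists(and_(red(), above(circle())))
# Swapping the arguments asks "Is there a circle above a red shape?"
swapped = exists(and_(circle(), above(red())))

print("yes" if answer else "no")
```

The same modules can be reassembled for other questions, which is the sense in which the approach combines lexical grounding with composition.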

NLVR2: natural language for visual reasoning! (Suhr et al., 2018) TRUE OR FALSE: the left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.

image captioning
