CMP784 Deep Learning, Lecture #12: Self-Supervised Learning (Aykut Erdem, Hacettepe University, Spring 2020)


SLIDE 1

Lecture #12 – Self-Supervised Learning

Aykut Erdem // Hacettepe University // Spring 2020

CMP784

DEEP LEARNING

photo by unsplash user @tuvaloland

SLIDE 2

Previously on CMP784

  • Motivation for Variational Autoencoders (VAEs)
  • Mechanics of VAEs
  • Separability of VAEs
  • Training of VAEs
  • Evaluating representations
  • Vector Quantized Variational Autoencoders (VQ-VAEs)

latent by Tom White

SLIDE 3

Lecture Overview

  • Predictive / Self-supervised learning
  • Self-supervised learning in NLP
  • Self-supervised learning in vision

Disclaimer: Much of the material and slides for this lecture were borrowed from
—Andrej Risteski's CMU 10707 class
—Jimmy Ba's UToronto CSC413/2516 class

SLIDE 4

Unsupervised Learning

  • Learning from data without labels.
  • What can we hope to do:

– Task A: Fit a parametrized structure (e.g. clustering, low-dimensional subspace, manifold) to data to reveal something meaningful about the data. (Structure learning)
– Task B: Learn a (parametrized) distribution close to the data-generating distribution. (Distribution learning)
– Task C: Learn a (parametrized) distribution that implicitly reveals an “embedding”/“representation” of data for downstream tasks. (Representation/feature learning)

  • Entangled! The “structure” and “distribution” often reveal an embedding.

SLIDE 5

Self-Supervised/Predictive Learning

  • Given unlabeled data, design supervised tasks that induce a good representation for downstream tasks.
  • No good mathematical formalization, but the intuition is to “force” the predictor used in the task to learn something “semantically meaningful” about the data.

SLIDE 6

Self-Supervised/Predictive Learning

► Predict any part of the input from any other part.
► Predict the future from the past.
► Predict the future from the recent past.
► Predict the past from the present.
► Predict the top from the bottom.
► Predict the occluded from the visible.
► Pretend there is a part of the input you don’t know and predict that.

Slide by Yann LeCun

SLIDE 7

How Much Information Does the Machine Need to Predict?

► “Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples.
► Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample.
► Unsupervised/Predictive Learning (cake): The machine predicts any part of its input for any observed part. Predicts future frames in videos. Millions of bits per sample.

(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up.)

  • LeCun’s original cake analogy slide, presented at his keynote speech at NIPS 2016.

SLIDE 8

How Much Information is the Machine Given during Learning?

► “Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples.
► Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample.
► Self-Supervised Learning (cake génoise): The machine predicts any part of its input for any observed part. Predicts future frames in videos. Millions of bits per sample.

  • Updated version presented at ISSCC 2019, where he replaced “unsupervised learning” with “self-supervised learning”.

SLIDE 9

Self-Supervised Learning in NLP


SLIDE 10

Word Embeddings

  • Semantically meaningful vector representations of words.

[Figure: example word vectors for “Tiger”, “Lion”, “Table”.]

Example: Inner product (possibly scaled, i.e. cosine similarity) correlates with word similarity.
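A toy numeric illustration of this claim (the three 3-d vectors below are made up for the example, not taken from a real embedding model):

```python
import numpy as np

emb = {
    "tiger": np.array([0.9, 0.8, 0.1]),
    "lion":  np.array([0.8, 0.9, 0.2]),
    "table": np.array([0.1, 0.0, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["tiger"], emb["lion"]))   # ~0.99: similar words
print(cosine(emb["tiger"], emb["table"]))  # ~0.16: unrelated words
```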

SLIDE 11

Word Embeddings

  • Semantically meaningful vector representations of words.

Example: Can use embeddings to do sentiment classification by training a simple (e.g. linear) classifier


"The service is great, fast and friendly!"

SLIDE 12

Word Embeddings

  • Semantically meaningful vector representations of words.

Example: Can train a “simple” network that, if fed word embeddings for two languages, can effectively translate.

English: “It’s raining outside”.
German: “Es regnet draussen”.

SLIDE 13

Word Embeddings via Predictive Learning

  • Basic task: predict the next word, given a few previous ones. In other words, optimize for

$$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$$

Example: “I am running a little ????” → Late: 0.9, Early: 0.05, Tired: 0.04, Table: 0.01
SLIDE 14

Word Embeddings via Predictive Learning

  • Basic task: predict the next word, given a few previous ones.

$$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$$

Inspired by classical assumptions in NLP that the underlying distribution is Markov – that is, $x_t$ only depends on the previous few words. (Of course, this is violated if you wish to model long texts like paragraphs / books.)

The main issue: The trivial way of parametrizing $p_\theta$ is a “lookup table” with $V^L$ entries (with $V$ the vocabulary size).

SLIDE 15

Word Embeddings via Predictive Learning

[Bengio-Ducharme-Vincent-Janvin ‘2003]: A neural parametrization of the above probabilities. Main ingredients:

  • Embeddings: A word embedding $v(w)$ for all words $w$ in the dictionary.
  • Non-linear transforms: Potentially deep network taking as inputs $v(x_{t-1}), v(x_{t-2}), \ldots, v(x_{t-L})$, and outputting some vector $o$. Can be a recurrent net too.
  • Softmax: Softmax distribution for $x_t$ with parameters given by $o$.

  • Basic task: predict the next word, given a few previous ones.

$$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$$
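A minimal PyTorch sketch of these three ingredients (layer sizes and the context length are illustrative, not the values used by Bengio et al.):

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab=10_000, dim=64, context=4, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)      # word embeddings v(w)
        self.mlp = nn.Sequential(                # non-linear transform
            nn.Linear(context * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab),            # outputs the vector o
        )

    def forward(self, prev_words):               # (batch, context) word ids
        e = self.emb(prev_words).flatten(1)      # concatenated embeddings
        return self.mlp(e).log_softmax(-1)       # softmax over the next word

lm = NeuralLM()
logp = lm(torch.randint(0, 10_000, (8, 4)))      # log p(x_t | previous 4)
```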
SLIDE 16

Word Embeddings via Predictive Learning

  • Related: predict middle word in a sentence, given surrounding ones.

CBOW (Continuous Bag of Words): proposed by Mikolov et al. ‘13

$$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-L}, \ldots, x_{t-1}, x_{t+1}, \ldots, x_{t+L})$$

Parametrization is chosen s.t.

$$p_\theta(x_t \mid x_{t-L}, \ldots, x_{t-1}, x_{t+1}, \ldots, x_{t+L}) \propto \exp\Big(\Big\langle v_{x_t}, \sum_{i=t-L,\, i \neq t}^{t+L} w_{x_i} \Big\rangle\Big)$$

(Two sets of parameters: vectors $v$ and vectors $w$.)

SLIDE 17

Word Embeddings via Predictive Learning

  • Related: predict middle word in a sentence, given surrounding ones.

Skip-Gram: also proposed by Mikolov et al. ‘13

$$\max_\theta \sum_t \sum_{i=t-L,\, i \neq t}^{t+L} \log p_\theta(x_i \mid x_t)$$

Parametrization is chosen s.t.

$$p_\theta(x_i \mid x_t) \propto \exp\big(\langle v_{x_i}, w_{x_t} \rangle\big)$$

(Two sets of parameters: vectors $v$ and vectors $w$.)

In practice, lots of other tricks are tacked on to deal with the slowest part of training: the softmax distribution (the partition function sums over the entire vocabulary). Common ones are negative sampling, hierarchical softmax, etc. (see the sketch below).
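A minimal sketch of one skip-gram training step with negative sampling, the first of the tricks mentioned above (vocabulary size, dimension, and the number of negatives are illustrative):

```python
import torch
import torch.nn.functional as F

V, D, K = 10_000, 100, 5                     # vocab, dim, negatives per pair
v = torch.randn(V, D, requires_grad=True)    # "vectors v" (center words)
w = torch.randn(V, D, requires_grad=True)    # "vectors w" (context words)

def sgns_loss(center, context):
    neg = torch.randint(0, V, (K,))          # randomly sampled negatives
    pos_score = w[context] @ v[center]       # scalar
    neg_scores = w[neg] @ v[center]          # (K,)
    # push the true pair together, the sampled pairs apart
    return -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_scores).sum())

loss = sgns_loss(center=42, context=1337)
loss.backward()          # gradients are nonzero only for the rows touched
```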
SLIDE 18

Word Embeddings via Predictive Learning

  • Related: predict middle word in a sentence, given surrounding ones.

Skip-Gram: also proposed by Mikolov et al. ‘13

$$\max_\theta \sum_t \sum_{i=t-L,\, i \neq t}^{t+L} \log p_\theta(x_i \mid x_t)$$

(Two sets of parameters: vectors $v$ and vectors $w$.)
SLIDE 19

Evaluating Word Embeddings

  • First variant (predict next word, given previous ones) can be used as a generative model for text. (Also called a language model.) The other ones cannot.
  • In the former case, a natural measure is the cross-entropy:

$$-\mathbb{E}_{x_1, x_2, \ldots, x_T} \log p_\theta(x_{\le T}) = -\mathbb{E}_{x_1, x_2, \ldots, x_T} \sum_t \log p_\theta(x_t \mid x_{<t})$$

  • For convenience, we often take the exponential of this (called perplexity).
  • If we do not have a generative model, we have to use indirect means.
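A small sketch of the cross-entropy/perplexity relationship (the per-token log-probabilities are made up for illustration):

```python
import math

log_probs = [-2.1, -0.3, -1.7, -0.9]     # log p(x_t | x_<t) for each token
cross_entropy = -sum(log_probs) / len(log_probs)
perplexity = math.exp(cross_entropy)
print(cross_entropy, perplexity)         # 1.25  ~3.49
```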
SLIDE 20

Evaluating Word Embeddings

  • Intrinsic tasks: Test performance of word embeddings on tasks measuring their “semantic” properties. Examples include solving “which is the most similar word” queries, analogy queries (i.e. “man is to woman as king is to ??”).
  • Extrinsic tasks: How well can we “finetune” the word embeddings to solve some (supervised) downstream task. “Finetune” usually means train a (relatively small) feedforward network. Examples of such tasks include:

– Part-of-Speech Tagging (determine whether a word is noun/verb/...),
– Named Entity Recognition (recognizing named entities like persons, places), e.g. label a sentence as Picasso[person] died in France[country],
– many others.

SLIDE 21

Semantic Similarity

  • Observation: similar words tend to have larger (renormalized) inner products (also called cosine similarity).
  • Precisely, if we look at the word embeddings for words i, j,

$$\Big\langle \frac{w_i}{\|w_i\|}, \frac{w_j}{\|w_j\|} \Big\rangle = \cos(w_i, w_j)$$

tends to be larger for similar words i, j.

  • To solve a semantic similarity query like “which is the most similar word to”, output the word with the highest cosine similarity.

Example: the nearest neighbors to “Frog” look like:

  0. frog
  1. frogs
  2. toad
  3. litoria
  4. leptodactylidae
  5. rana
  6. lizard
  7. eleutherodactylus
SLIDE 22

Semantic Clustering

  • Consequence: clustering word embeddings should give “semantically” relevant clusters.

t-SNE projection of word embeddings for artists (clustered by genre). Image from https://medium.com/free-code-camp/learn-tensorflow-the-word2vec-model-and-the-tsne-algorithm-using-rock-bands-97c99b5dcb3a

SLIDE 23

Analogies

  • Observation: You can solve analogy queries by linear algebra. Precisely, $w$ = queen will be the solution to:

$$\arg\min_w \|v_w - v_{\text{king}} - (v_{\text{woman}} - v_{\text{man}})\|^2$$

[Figure: Man/Woman and King/Queen form a parallelogram in embedding space.]
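A minimal sketch of solving the argmin above by brute force over the vocabulary (the embeddings are random stand-ins; with trained vectors the query returns “queen”):

```python
import numpy as np

def analogy(a, b, c, emb, words):
    # solve a : b :: c : ?, e.g. analogy("man", "woman", "king", ...)
    target = emb[words.index(c)] + emb[words.index(b)] - emb[words.index(a)]
    dists = np.linalg.norm(emb - target, axis=1)
    for i in np.argsort(dists):               # closest candidates first
        if words[i] not in (a, b, c):         # exclude the query words
            return words[i]

words = ["man", "woman", "king", "queen", "table"]
emb = np.random.rand(len(words), 20)          # stand-in embeddings
print(analogy("man", "woman", "king", emb, words))
```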

SLIDE 24

Language Models (LMs)

  • A statistical model that assigns probabilities to the words in a sentence.
  • Most commonly: Given previous words, what should the next one be?
  • Neural language model: Model the probability of words given others using neural networks.

SLIDE 25

Recurrent Architectures for LM

  • We can use recurrent architectures.
  • LSTM, GRU ...
  • Great for variable length inputs, like sentences.

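A minimal PyTorch sketch of such a recurrent language model (sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab=10_000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):                 # (batch, seq_len) word ids
        h, _ = self.lstm(self.emb(tokens))     # one hidden state per step
        return self.out(h).log_softmax(-1)     # log p(next word) per step

lm = RNNLM()
logp = lm(torch.randint(0, 10_000, (4, 12)))   # (4, 12, vocab)
```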

SLIDE 26

Recurrent Architectures for LM

  • What are some of the problems with recurrent architectures?

– Not parallelizable across time steps.
– Cannot model long-range dependencies.
– Optimization difficulties (vanishing gradients).

  • Attention to the rescue!

SLIDE 27

Transformers

Properties of the transformer architecture:

  • Fully feed-forward.
  • Equivariance properties of scaled dot-product attention (important):

– How does the output change if we permute the order of queries? (equivariance)
– How does the output change if we permute the key-value pairs in unison? (invariance)
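A minimal numpy check of both properties (toy sizes, single-head attention without projections):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # scaled dot products
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K, V = rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
pq, pkv = rng.permutation(5), rng.permutation(7)

out = attention(Q, K, V)
assert np.allclose(attention(Q[pq], K, V), out[pq])    # query equivariance
assert np.allclose(attention(Q, K[pkv], V[pkv]), out)  # key-value invariance
```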

SLIDE 28

Performance Comparison


SLIDE 29

Pretraining Language Models

  • Can we use large amounts of text data to pretrain language models?
  • Considerations:

► How can we fuse both left-right and right-left context?
► How can we facilitate non-trivial interactions between input tokens?

  • Previous approaches:

► ELMo (Peters et al., 2017): Bidirectional, but shallow.
► GPT (Radford et al., 2018): Deep, but unidirectional.
► BERT (Devlin et al., 2018): Deep and bidirectional!

SLIDE 30

BERT Workflow

  • The BERT workflow includes:

► Pretrain on generic, self-supervised tasks, using large amounts of data (like all of Wikipedia).
► Fine-tune on specific tasks with limited, labelled data.

  • The pretraining tasks (will talk about these in more detail later):

► Masked Language Modelling (to learn contextualized token representations)
► Next Sentence Prediction (summary vector for the whole input)

SLIDE 31

BERT Architecture


SLIDE 32

BERT Architecture

Properties:

  • Two input sequences.

► Many NLP tasks have two inputs (question answering, paraphrase detection, entailment detection, etc.)

  • Computes embeddings

► Token, position, and segment embeddings.
► Special start and separation tokens.

  • Architecture

► Basically the same as the transformer encoder.

  • Outputs:

► Contextualized token representations.
► Special tokens for context.

SLIDE 33

BERT Embeddings

  • How we tokenize the inputs is very important!
  • BERT uses the WordPiece tokenizer (Wu et al., 2016)
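A toy greedy longest-match-first subword tokenizer in the spirit of WordPiece (the vocabulary below is made up for illustration; real WordPiece vocabularies are learned from data):

```python
VOCAB = {"play", "##ing", "##ed", "un", "##able", "token", "##izer", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:                    # try the longest piece first
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                      # nothing matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("playing"))    # ['play', '##ing']
print(wordpiece("tokenizer"))  # ['token', '##izer']
```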

SLIDE 34

(Aside) Tokenizers

  • Tokenizers have to balance the following:

– Being comprehensive (rare words? translation to different languages)
– Total number of tokens
– How semantically meaningful each token is.

  • This is an active area of research.

SLIDE 35

Pretraining tasks

  • Masked Language Modelling, i.e. Cloze Task (Taylor, 1953)
  • Next sentence prediction


SLIDE 36

Masked Language Modelling

  • Mask 15% of the input tokens (i.e. replace them with a dummy masking token).
  • Run the model, obtain the embeddings for the masked tokens.
  • Using these embeddings, try to predict the missing token.
  • ”I love to eat peanut ___ and jam.” Can you guess what’s missing?
  • This procedure forces the model to encode context information in the features of all of the tokens.
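A minimal sketch of the masking step (simplified: real BERT also sometimes keeps the token or swaps in a random one; the [MASK] id and ignore label below follow common conventions and are assumptions here):

```python
import torch

MASK_ID, IGNORE = 103, -100        # dummy mask token id, loss-ignore label

def mask_tokens(input_ids, mask_prob=0.15):
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = IGNORE                  # score only masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID               # replace with the dummy token
    return corrupted, labels

ids = torch.randint(1000, 2000, (2, 10))      # fake token ids
corrupted, targets = mask_tokens(ids)
# loss: F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-100)
```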

SLIDE 37

Next Sentence Prediction

  • Goal is to summarize the complete context (i.e. the two segments) in a single feature vector.
  • Procedure for generating data (see the sketch below):

► Pick a sentence from the training corpus and feed it as ”segment A”.
► With 50% probability, pick the following sentence and feed that as ”segment B”.
► With 50% probability, pick a random sentence and feed it as ”segment B”.

  • Using the features for the context token, predict whether segment B is the following sentence of segment A.
  • Turns out to be a very effective pretraining technique!
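A minimal sketch of that data-generation procedure (the function name and corpus layout are illustrative):

```python
import random

def make_nsp_example(sentences):
    i = random.randrange(len(sentences) - 1)
    seg_a = sentences[i]
    if random.random() < 0.5:
        seg_b, is_next = sentences[i + 1], 1          # the true next sentence
    else:
        seg_b, is_next = random.choice(sentences), 0  # a random sentence
    return seg_a, seg_b, is_next
```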

SLIDE 38

Fine Tuning

Procedure:

  • Add a final layer on top of BERT representations.
  • Train the whole network on the fine-tuning dataset.
  • Pre-training time: on the order of days on TPUs.
  • Fine-tuning: takes only a few hours at most.
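A minimal PyTorch sketch of the fine-tuning setup: a new final layer on top of a pretrained encoder's summary vector (`encoder` stands in for pretrained BERT and is an assumption here; it is taken to return one hidden vector per token, context token first):

```python
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder                          # pretrained weights
        self.head = nn.Linear(hidden_dim, num_classes)  # the new final layer

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)   # (batch, seq_len, hidden_dim)
        summary = hidden[:, 0]             # context ([CLS]-style) token
        return self.head(summary)          # whole network trains end-to-end
```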

SLIDE 39

Fine Tuning


SLIDE 40


Self-Supervised Learning in Vision

SLIDE 41

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting

SLIDE 42

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting
  • Much trickier than in NLP: As we have seen, meaningful losses for vision are much more difficult to design. The choice of region to mask out is much more impactful.

Architecture: An encoder E takes a part of the image and constructs a representation. A decoder D takes the representation and tries to reconstruct the missing part.
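A minimal sketch of that encoder-decoder setup (layer sizes are illustrative, not Pathak et al.'s exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                       # E: masked image -> code
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(                       # D: code -> reconstruction
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

img = torch.rand(8, 3, 64, 64)
mask = torch.zeros(8, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0                 # central region to inpaint

recon = decoder(encoder(img * (1 - mask)))     # predict from the context
loss = F.mse_loss(recon * mask, img * mask)    # L2 only on the missing part
```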

SLIDE 43

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting
  • If the reconstruction loss is L2: tendency to produce blurry images. Remember: one of the useful things about GANs is that they provide a better loss for images.

SLIDE 44

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting
  • If the reconstruction loss is L2: tendency to produce blurry images.
  • Remember: one of the useful things about GANs is that they provide a better loss for images.

[Figure: the full objective combines the mask, a DC-GAN objective, and the composition of encoder+decoder.]

SLIDE 45

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting

SLIDE 46

Inpainting

  • The most obvious analogy to word embeddings: predict parts of an image from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting
  • How to choose the region? The task should be “solvable”, but not “too easy”.

  • Fixed (central region): tends to produce less generalizable representations.
  • Random blocks: slightly better, but square borders still hurt.
  • Random silhouettes (fully random doesn’t make sense – the prediction task is too ill-defined): even better!

SLIDE 47

Jigsaw puzzles

  • In principle, what we want is a task “hard enough” that any model that does well on it should learn something “meaningful” about the task. Doersch et al. ’15: Unsupervised Visual Representation Learning by Context Prediction
  • Task: Predict the ordering of two randomly chosen pieces from the image.

Representation: penultimate layer of a neural net used to solve the task. Intuition: understanding the relative positioning of pieces of an image requires some understanding of how images are composed.


SLIDE 48

Jigsaw puzzles

  • In principle, what we want is a task “hard enough” that any model that does well on it should learn something “meaningful” about the task. Doersch et al. ’15: Unsupervised Visual Representation Learning by Context Prediction
  • Quite finicky: one needs to make sure the predictor cannot take any obvious “shortcuts”.

– Boundary texture continuity is a big clue: include gaps between tiles.
– Long lines spanning tiles are a clue: jitter the location of tiles.
– Chromatic aberration (some cameras tend to focus different wavelengths at different positions – e.g. green shifts towards the center of the image): randomly drop 2 of the 3 channels.
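A minimal sketch of building one training pair for this relative-position task, with the gap trick from the list above (patch and gap sizes are illustrative):

```python
import random
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]    # the 8 neighbor positions

def relative_position_pair(img, patch=32, gap=8):
    h, w = img.shape[:2]
    step = patch + gap              # the gap defeats texture-continuity cues
    cy = random.randint(step, h - 2 * step)
    cx = random.randint(step, w - 2 * step)
    label = random.randrange(8)     # which neighbor did we pick?
    dy, dx = OFFSETS[label]
    center = img[cy:cy + patch, cx:cx + patch]
    neighbor = img[cy + dy * step:cy + dy * step + patch,
                   cx + dx * step:cx + dx * step + patch]
    return center, neighbor, label

img = np.random.rand(256, 256, 3)
a, b, y = relative_position_pair(img)
```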
SLIDE 49

Predicting rotations

  • In principle, what we want is a task “hard enough” that any model that does well on it should learn something “meaningful” about the task. Gidaris et al. ’18: Unsupervised representation learning via predicting image rotations
  • Task: predict one of 4 possible rotations of an image.
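A minimal sketch of generating the four-way rotation task from a batch of images (the rotation index is the label):

```python
import torch

def rotation_batch(images):                  # images: (B, C, H, W)
    rotated, labels = [], []
    for k in range(4):                       # k quarter-turns: 0/90/180/270
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.rand(8, 3, 32, 32)
inputs, targets = rotation_batch(x)          # 32 images, labels in {0,1,2,3}
```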

SLIDE 50

Predicting rotations

  • In principle, what we want is a task “hard enough” that any model that does well on it should learn something “meaningful” about the task. Gidaris et al. ’18: Unsupervised representation learning via predicting image rotations
  • Task: predict one of 4 possible rotations of an image.

– Representation: penultimate layer of a neural net used to solve the task.
– Intuition: a rotation is a global transformation. ConvNets are much better at capturing local transformations (as convolutions are local), so there is no obvious way to “cheat”.

SLIDE 51

Predicting rotations

  • In principle, what we want is a task “hard enough” that any model that does well on it should learn something “meaningful” about the task. Gidaris et al. ’18: Unsupervised representation learning via predicting image rotations
  • Task: predict one of 4 possible rotations of an image.

– Less finicky to get right: no obvious artifacts the model can make use of to cheat.
– The 90 deg. rotations also don’t introduce any additional artifacts due to discretization.

SLIDE 52

Contrastive divergence

  • Another natural idea: if features are “semantically” relevant, a “distortion” of an image should produce similar features. Some instances of distortions:

SLIDE 53

Contrastive divergence

  • Another natural idea: if features are “semantically” relevant, a “distortion” of an image should produce similar features. Some instances of distortions:

  • Contrastive divergence framework: For every training sample, produce multiple augmented samples by applying various transformations. Train an encoder E (i.e. a map that produces features) to predict whether two samples are augmentations of the same base sample. A common way is to train E to make ⟨E(x), E(x′)⟩ big if x, x′ are two augmentations of the same sample, small otherwise, e.g.

$$\ell_{x,x'} = -\log\left(\frac{\exp\big(\tau \langle E(x), E(x') \rangle\big)}{\sum_{x,x'} \exp\big(\tau \langle E(x), E(x') \rangle\big)}\right), \qquad \min \sum_{x,x' \text{ augments of each other}} \ell_{x,x'}$$
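A minimal PyTorch sketch of such a loss (an NT-Xent/InfoNCE-style variant: one positive per anchor, row i of z1 and row i of z2 are assumed to be augmentations of the same image; it divides by a temperature rather than multiplying by τ, which is equivalent up to reparametrization):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                 # (B, B) scaled similarities
    targets = torch.arange(z1.shape[0])      # positives on the diagonal
    return F.cross_entropy(logits, targets)  # -log softmax of the positives

z1 = torch.randn(16, 128)                    # E(x) for 16 augmentations
z2 = torch.randn(16, 128)                    # E(x') for their partners
loss = contrastive_loss(z1, z2)
```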
SLIDE 54

Contrastive divergence

  • Another natural idea: if features are “semantically” relevant, a “distortion” of an image should produce similar features. Some instances of distortions:

  • Many works follow this framework, starting with Oord ‘18: Representation Learning with Contrastive Predictive Coding.
  • Current state of the art for self-supervised learning is in fact using this framework: Chen, Kornblith, Norouzi, Hinton ‘20: A Simple Framework for Contrastive Learning of Visual Representations

Several tricks are needed to gain this improvement. The most important one seems to be that the augmentations that work best are compositions of a geometric one (e.g. crop/rotation/...) and an appearance one (color distortion/blur/...).

SLIDE 55

Troubling fact: Architecture Matters

  • Kolesnikov et al. ’19: Revisiting Self-Supervised Visual Representation Learning


SLIDE 56

Troubling fact: Architecture Matters

  • Kolesnikov et al. ’19: Revisiting Self-Supervised Visual Representation Learning
