Lecture #12 – Self-Supervised Learning
Aykut Erdem // Hacettepe University // Spring 2020
CMP784
DEEP LEARNING
photo by unsplash user @tuvaloland
Previously on CMP784
Variational Autoencoders (VAEs)
Vector Quantized Variational Autoencoders (VQ-VAEs)
2
latent by Tom White
Lecture Overview
Disclaimer: Much of the material and slides for this lecture were borrowed from
—Andrej Risteski's CMU 10707 class —Jimmy Ba's UToronto CSC413/2516 class
3
Unsupervised Learning
– Task A: Fit a parametrized structure (e.g. clustering, low-dimensional subspace, manifold) to data to reveal something meaningful about the data. (Structure learning)
– Task B: Learn a (parametrized) distribution close to the data-generating distribution. (Distribution learning)
– Task C: Learn a (parametrized) distribution that implicitly reveals an “embedding”/“representation” of data for downstream tasks. (Representation/feature learning)
4
Self-Supervised/Predictive Learning
► Use a prediction task defined on unlabeled data to learn a representation for downstream tasks.
► The hope: solving the prediction task forces the predictor used in the task to learn something “semantically meaningful” about the data.
5
Self-Supervised/Predictive Learning
► Predict any part of the input from any other part.
► Predict the future from the past.
► Predict the future from the recent past.
► Predict the past from the present.
► Predict the top from the bottom.
► Predict the occluded from the visible.
► Pretend there is a part of the input you don’t know and predict that.
6
Slide by Yann LeCun
7
How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples.
Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample.
Unsupervised/Predictive Learning (cake): The machine predicts any part of its input for any observed part, e.g. predicts future frames in videos. Millions of bits per sample.
(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up.)
Yann LeCun’s cake analogy slide, presented at his keynote speech at NIPS 2016.
8
How Much Information is the Machine Given during Learning?
“Pure” Reinforcement Learning (cherry): The machine predicts a scalar reward given once in a while. A few bits for some samples.
Supervised Learning (icing): The machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10→10,000 bits per sample.
Self-Supervised Learning (cake génoise): The machine predicts any part of its input for any observed part, e.g. predicts future frames in videos. Millions of bits per sample.
In later versions of this slide, LeCun replaced “unsupervised learning” with “self-supervised learning”.
9
Word Embeddings
vector representations of words
10
Tiger, Lion, Table
Example: Inner product (possibly scaled, i.e. cosine similarity) correlates with word similarity.
Word Embeddings
vector representations of words
11
Example: Can use embeddings to do sentiment classification by training a simple (e.g. linear) classifier
Semantically meaningful
"The service is great, fast and friendly!"
Word Embeddings
vector representations of words
12
Example: Can train a “simple” network that, fed word embeddings for two languages, can effectively translate.
Semantically meaningful
English: "It’s raining outside."
German: "Es regnet draussen."
Word Embeddings via Predictive Learning
13
Task: predict the next word, given a few previous ones.
"I am running a little ????"
Late: 0.9  Early: 0.05  Tired: 0.04  Table: 0.01
In other words, optimize for
$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$
Word Embeddings via Predictive Learning
14
Inspired by classical assumptions in NLP that the underlying distribution is Markov – that is, $x_t$ only depends on the previous few words. (Of course, this is violated if you wish to model long texts like paragraphs / books.)
$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$

The main issue: the trivial way of parametrizing $p_\theta$ is a “lookup table” with $V^L$ entries (for vocabulary size $V$).
Word Embeddings via Predictive Learning
[Bengio-Ducharme-Vincent-Janvin ‘2003]: A neural parametrization of the above probabilities. Main ingredients:
– An embedding vector $E(x)$ for every word $x$ in the dictionary.
– A neural network taking as inputs $E(x_{t-1}), E(x_{t-2}), \ldots, E(x_{t-L})$ and outputting some vector $o$. Can be a recurrent net too.
– The distribution over the next word given by (a softmax over) $o$.
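The ingredients above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact model: all sizes are made up, the hidden layer is a single tanh, and in practice $E$, $W_1$, $W_2$ are learned by gradient ascent on the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L, h = 50, 8, 3, 16   # vocab size, embedding dim, context length, hidden units

E = rng.normal(0, 0.1, (V, d))        # word embeddings, one row per word
W1 = rng.normal(0, 0.1, (L * d, h))   # concatenated context -> hidden
W2 = rng.normal(0, 0.1, (h, V))       # hidden -> vocabulary scores

def next_word_probs(context):
    """p_theta(x_t | x_{t-1}, ..., x_{t-L}) for a list of L word ids."""
    x = np.concatenate([E[w] for w in context])   # concatenate the L embeddings
    o = np.tanh(x @ W1)                           # the vector o from the slide
    scores = o @ W2
    scores -= scores.max()                        # numerical stability
    p = np.exp(scores)
    return p / p.sum()                            # softmax over the vocabulary

probs = next_word_probs([3, 17, 42])
```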
15
$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_{t-L})$
Word Embeddings via Predictive Learning
16
CBOW (Continuous Bag of Words): proposed by Mikolov et al. ‘13
$\max_\theta \sum_t \log p_\theta(x_t \mid x_{t-L}, \ldots, x_{t-1}, x_{t+1}, \ldots, x_{t+L})$

Parametrization is chosen s.t.

$p_\theta(x_t \mid x_{t-L}, \ldots, x_{t-1}, x_{t+1}, \ldots, x_{t+L}) \propto \exp\Big(\big\langle v_{x_t}, \sum_{i=t-L,\, i \neq t}^{t+L} w_{x_i} \big\rangle\Big)$

with a pair of vectors $v$ and vectors $w$ for each word.
Word Embeddings via Predictive Learning
17
Skip-Gram: also proposed by Mikolov et al. ‘13
Parametrization is chosen as before (with a pair of vectors $v$ and vectors $w$ for each word). In practice, lots of other tricks are tacked on to deal with the slowest part of training: the softmax distribution (whose partition function sums over the entire vocabulary). Common ones are negative sampling, hierarchical softmax, etc.
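The negative-sampling trick mentioned above replaces the full softmax with a binary discrimination between an observed (center, context) pair and $k$ randomly drawn "negative" words. A minimal numpy sketch (all sizes are illustrative, and the uniform negative sampler is a simplification: word2vec samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
v = rng.normal(0, 0.1, (V, d))   # "output" / context-word vectors
w = rng.normal(0, 0.1, (V, d))   # "input" / center-word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(center, context, k=5):
    """-log sigma(<v_context, w_center>) - sum over k negatives of log sigma(-<v_neg, w_center>)."""
    negs = rng.integers(0, V, size=k)                  # uniform negatives (simplified)
    pos = -np.log(sigmoid(v[context] @ w[center]))     # pull the true pair together
    neg = -np.log(sigmoid(-(v[negs] @ w[center]))).sum()  # push random pairs apart
    return float(pos + neg)

loss = neg_sampling_loss(center=7, context=12)
```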
$\max_\theta \sum_t \sum_{i=t-L,\, i \neq t}^{t+L} \log p_\theta(x_i \mid x_t)$

$p_\theta(x_i \mid x_t) \propto \exp(\langle v_{x_i}, w_{x_t} \rangle)$
Word Embeddings via Predictive Learning
18
Skip-Gram: also proposed by Mikolov et al. ‘13
vectors v vectors w
$\max_\theta \sum_t \sum_{i=t-L,\, i \neq t}^{t+L} \log p_\theta(x_i \mid x_t)$
Evaluating Word Embeddings
► One way: the trained predictor is a generative model for text (also called a language model), so we can evaluate the likelihood it assigns to held-out text. The other way is to measure what the embeddings themselves capture about word meanings.
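Held-out likelihood is usually reported as perplexity: the exponentiated average negative log-likelihood per token. A minimal sketch (the toy numbers are illustrative):

```python
import numpy as np

def nll_and_perplexity(log_probs):
    """log_probs[t] = log p_theta(x_t | x_{<t}) for each held-out token."""
    nll = -float(np.mean(log_probs))   # average negative log-likelihood per token
    return nll, float(np.exp(nll))     # perplexity = exp(NLL)

# a uniform model over a 4-word vocabulary assigns log(1/4) to every token,
# so its perplexity equals the vocabulary size
nll, ppl = nll_and_perplexity(np.log(np.full(10, 0.25)))
```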
19
$-\mathbb{E}_{x_1, x_2, \ldots, x_T} \log p_\theta(x_{\le T}) = -\mathbb{E}_{x_1, x_2, \ldots, x_T} \sum_t \log p_\theta(x_t \mid x_{<t})$
Evaluating Word Embeddings
► One way: measuring their “semantic” properties. Examples include solving “which is the most similar word” queries, analogy queries (i.e. “man is to woman as king is to ??”), etc.
► Another way: “finetune” the embeddings to solve some (supervised) downstream task. “Finetune” usually means train a (relatively small) feedforward network on top. Examples of such tasks include:
– Part-of-Speech Tagging (determine whether a word is a noun/verb/...),
– Named Entity Recognition (recognizing named entities like persons, places – e.g. label a sentence as Picasso[person] died in France[country]), and many others.
20
Semantic Similarity
Observation: similar words tend to have larger (renormalized) inner products (also called cosine similarity).
Example: the nearest neighbors to “Frog” look like
► To answer a “which word is the most similar to ...” query, output the word with the highest cosine similarity.
21
$\Big\langle \frac{w_i}{\|w_i\|}, \frac{w_j}{\|w_j\|} \Big\rangle$ tends to be larger for similar words $i, j$.
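The nearest-neighbor query is a few lines of numpy once embeddings are in hand. The three toy vectors below are made up purely for illustration:

```python
import numpy as np

def cosine(u, v):
    """Renormalized inner product <u/||u||, v/||v||>."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical 3-d embeddings; real ones are learned and much higher-dimensional
emb = {
    "frog":  np.array([0.9, 0.1, 0.0]),
    "toad":  np.array([0.8, 0.2, 0.1]),
    "table": np.array([0.0, 0.1, 0.9]),
}

def most_similar(word):
    """Answer a "which word is the most similar to ..." query by cosine similarity."""
    others = [w for w in emb if w != word]
    return max(others, key=lambda w: cosine(emb[word], emb[w]))

neighbor = most_similar("frog")   # "toad" under these toy vectors
```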
Semantic Clustering
► Observation: embeddings of semantically related words tend to group into relevant clusters.
22
t-SNE projection of word embeddings for artists (clustered by genre). Image from https://medium.com/free-code-camp/learn-tensorflow-the-word2vec-model-and-the-tsne-algorithm-using-rock-bands-97c99b5dcb3a
Analogies
Observation: You can solve analogy queries by linear algebra. Precisely, $w$ = queen will be the solution to:
23
$\operatorname{argmin}_w \| v_w - v_{\text{king}} - (v_{\text{woman}} - v_{\text{man}}) \|_2$

(Figure labels: Man, King, Queen, Woman)
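The argmin recipe can be run directly. The toy 3-d vectors below are chosen so the parallelogram is exact; real embeddings only satisfy the analogy approximately:

```python
import numpy as np

# hypothetical embeddings constructed so that king - man + woman = queen exactly
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def solve_analogy(a, b, c):
    """Return argmin_w ||v_w - v_c - (v_b - v_a)||_2, excluding the query words."""
    target = emb[c] + (emb[b] - emb[a])
    candidates = [w for w in emb if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(emb[w] - target))

answer = solve_analogy("man", "woman", "king")   # man : woman :: king : ?
```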
Language Models (LMs)
► A language model assigns probabilities to sentences.
► Given the words so far, what should the next word be?
24
Recurrent Architectures for LM
25
Recurrent Architectures for LM
– Not parallelizable across time steps.
– Cannot model long dependencies.
– Optimization difficulties (vanishing gradients).
26
Transformers
Properties of the transformer architecture:
– Built on (scaled) dot-product attention (important):
– How does the output change if we permute the order of queries? (equivariance) – How does the output change if we permute the key-value pairs in unison? (invariance)
27
Performance Comparison
28
Pretraining Language Models
►How can we fuse both left-right and right-left context? ►How can we facilitate non-trivial interactions between input tokens?
►ELMO (Peters et al., 2017): Bidirectional, but shallow.
►GPT (Radford et al., 2018): Deep, but unidirectional.
►BERT (Devlin et al., 2018): Deep and bidirectional!
29
BERT Workflow
► Pretrain on generic, self-supervised tasks, using large amounts of data (like
all of Wikipedia)
► Fine-tune on specific tasks with limited, labelled data.
►Masked Language Modelling (to learn contextualized token representations) ►Next Sentence Prediction (summary vector for the whole input)
30
BERT Architecture
31
BERT Architecture
Properties:
►Many NLP tasks have two inputs (question answering, paraphrase detection,
entailment detection etc. )
► Token, position and segment embeddings.
► Special start and separation tokens.
►Basically the same as transformer encoder.
► Contextualized token representations. ► Special tokens for context.
32
BERT Embeddings
33
(Aside) Tokenizers
– Being comprehensive (rare words? translation to different languages)
– Total number of tokens
– How semantically meaningful each token is.
34
Pretraining tasks
35
Masked Language Modelling
► Replace a random subset of the input tokens with a special [MASK] token.
► Train the model to predict the masked-out tokens from the features of all of the tokens.
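A minimal sketch of this masking recipe; the 15%/80%/10%/10% split follows the published BERT recipe, while the vocabulary below is a made-up placeholder:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: ~15% of positions become prediction
    targets; of those, 80% are replaced by [MASK], 10% by a random token,
    and 10% are left unchanged."""
    rng = random.Random(seed)
    vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # placeholder vocab
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok              # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # random replacement
            # else: keep the token unchanged
    return masked, targets
```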
36
Next Sentence Prediction
Goal: learn a summary vector for the whole input (a single feature vector).
► Pick a sentence from the training corpus and feed it as “segment A”.
► With 50% probability, pick the following sentence and feed it as “segment B”.
► With 50% probability, pick a random sentence and feed it as “segment B”.
► Train the model to predict whether segment B is the following sentence of segment A.
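The sampling recipe above can be sketched as follows (a simplification: in BERT the random “segment B” is drawn from anywhere in the corpus, typically a different document):

```python
import random

def nsp_example(sentences, i, rng):
    """Build one Next Sentence Prediction training pair from a list of
    consecutive sentences, following the 50/50 recipe above."""
    seg_a = sentences[i]
    if rng.random() < 0.5 and i + 1 < len(sentences):
        return seg_a, sentences[i + 1], True   # label: IsNext
    j = rng.randrange(len(sentences))
    return seg_a, sentences[j], False          # label: NotNext

corpus = ["s0", "s1", "s2", "s3"]
pairs = [nsp_example(corpus, 1, random.Random(s)) for s in range(10)]
```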
37
Fine Tuning
Procedure:
38
Fine Tuning
39
40
Inpainting
Task: mask out a region of an image and predict it from the remainder of the image. Pathak et al. ’16: Context Encoders: Feature Learning by Inpainting
41
Inpainting
Much trickier than in NLP: As we have seen, meaningful losses for vision are much more difficult to design. Choice of region to mask out is much more impactful.
42
Architecture: An encoder E takes a part of the image and constructs a representation. A decoder D takes the representation and tries to reconstruct the missing part.
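A sketch of the resulting training signal, assuming the decoder's output is scored only on the masked-out region (array shapes and names are illustrative):

```python
import numpy as np

def masked_recon_loss(original, reconstruction, mask):
    """L2 reconstruction loss computed only over the masked-out region
    (mask == 1 where pixels were removed and must be inpainted)."""
    diff = (original - reconstruction) * mask
    return (diff ** 2).sum() / mask.sum()

img = np.ones((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                     # central region to inpaint
recon = np.zeros((8, 8))                 # a (bad) all-zero reconstruction

assert masked_recon_loss(img, recon, mask) == 1.0  # every masked pixel off by 1
assert masked_recon_loss(img, img, mask) == 0.0    # perfect reconstruction
```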
Inpainting
A plain reconstruction loss tends to produce blurry images. Remember: one of the uses of GANs is to provide a better loss for images.
43
Inpainting
44
(Figure: the mask, a DC-GAN adversarial objective, and the composition of encoder + decoder.)
Inpainting
45
Inpainting
46
Task should be “solvable”, but not “too easy”:
– Too easy: the model learns shortcuts, not generalizable representations.
– Masking a central block: artifacts at the borders still hurt.
– Too hard: predictions stop making sense (the prediction task is too ill-defined).
– Masking random regions – even better!
Jigsaw puzzles
Hope: the task is “hard enough” that any model that does well on it should learn something “meaningful” about the data. Doersch et al. ’15: Unsupervised Visual Representation Learning by Context Prediction
47
Representation: penultimate layer of a neural net used to solve the task. Intuition: understanding the relative positioning of pieces of an image requires some understanding of its content.
In principle we want a task hard enough that any model that does well on it learns something meaningful. Doersch et al.: predict the relative ordering of two random patches (crops).
Jigsaw puzzles
Danger: the network can solve the task via low-level “shortcuts”.
48
– Boundary texture continuity is a big clue: include gaps between tiles.
– Long lines spanning tiles are a clue: jitter the location of tiles.
– Chromatic aberration (some cameras tend to focus different wavelengths at different positions – e.g. green shifts towards the center of the image): randomly drop 2 of the 3 color channels.
Predicting rotations
Hope: the task is “hard enough” that any model that does well on it should learn something “meaningful” about the task. Gidaris et al. ’18: Unsupervised Representation Learning by Predicting Image Rotations
Task: predict one of 4 possible rotations of an image.
49
Predicting rotations
50
– Representation: penultimate layer of the network.
– Intuition: a rotation is a global transformation of the image, while networks are much better at capturing local transformations (as convolutions are local), so there is no obvious way to “cheat”.
Predicting rotations
51
– Less finicky to get right: no low-level artifacts the network can make use of to cheat.
– The 90° rotations also don’t introduce any additional artifacts due to discretization.
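Generating training data for this pretext task is trivial; a sketch using `np.rot90` (names are illustrative):

```python
import numpy as np

def rotation_example(img, rng):
    """One training example for the rotation pretext task: rotate the
    image by k * 90 degrees; the classification label is k in {0,1,2,3}."""
    k = int(rng.integers(4))
    return np.rot90(img, k=k), k

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
rotated, label = rotation_example(img, rng)

# Rotating back by -label * 90 degrees recovers the original image.
assert np.array_equal(np.rot90(rotated, k=-label), img)
```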
Contrastive Learning
Main idea: different distortions (augmentations) of the same image should produce similar features. Some instances of distortions:
52
Contrastive Learning
For every training sample, produce multiple augmented samples by applying various transformations. Train an encoder E (i.e. map that produces features) to predict whether two samples are augmentations of the same base sample. A common way is to train E to make ⟨"(#) ,"(#′)⟩ big if #, #′ are two augmentations from same sample, small otherwise, e.g.
53
$$\ell_{x,x'} = -\log\left(\frac{\exp\big(\tau\,\langle E(x),E(x')\rangle\big)}{\sum_{x,x'}\exp\big(\tau\,\langle E(x),E(x')\rangle\big)}\right), \qquad \min \sum_{x,x'\ \text{augments of each other}} \ell_{x,x'}$$
Contrastive Learning
Representation Learning with Contrastive Predictive Coding (Oord et al. ’18).
SimCLR framework: Chen, Kornblith, Norouzi, Hinton ‘20: A Simple Framework for Contrastive Learning of Visual Representations
54
Several tricks are needed to gain this improvement. The most important one seems to be that the augmentations that work best are compositions of a geometric one (e.g. crop/rotation/..) and an appearance one (color distortion/blur/..)
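A self-contained sketch of the contrastive objective above for a batch of paired augmentations. It follows the slide's formula, with τ scaling the inner product; the L2-normalization of features is a SimCLR-style choice, and all names are illustrative:

```python
import numpy as np

def contrastive_loss(z, tau=1.0):
    """Contrastive loss sketch for 2N feature vectors where z[2i] and
    z[2i+1] are two augmentations of the same sample. Features are
    L2-normalized (a SimCLR-style choice); tau scales the inner product."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = tau * (z @ z.T)
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.arange(len(z)) ^ 1              # partner index: 2i <-> 2i+1
    return -log_prob[np.arange(len(z)), pos].mean()

# Perfectly aligned pairs with orthogonal negatives give a small loss.
z = np.repeat(np.eye(4, 16), 2, axis=0)
loss = contrastive_loss(z, tau=5.0)
```

Minimizing this loss pushes features of augmentations of the same sample together and all other pairs apart.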
Troubling fact: Architecture Matters
55
Troubling fact: Architecture Matters
56