SLIDE 1

A tour of machine learning ...

… guided by a complete amateur

Thomas Dullien, Google

SLIDE 2

Topics to cover

  • 1. Logistic regression
  • 2. Word embeddings
  • 3. t-SNE
  • 4. Deep Networks (and some transfer learning)
  • 5. Hidden Markov Models for sequence tagging
  • 6. Conditional Random Fields for sequence tagging
  • 7. Reinforcement learning
  • 8. Approximate NN and k-NN methods
  • 9. Tree ensemble methods
SLIDE 3

Logistic Regression

  • Also known as “maximum entropy modelling”
  • Mathematically simple, easy to diagnose / inspect
  • Idea: Approximate a conditional probability distribution from (labeled) training data
  • Consider k output classes and n features
SLIDE 4

Logistic Regression

  • Parameters that are learnt are a k x n matrix of weights
  • Easily diagnosable: For each decision, the contribution of each feature can easily be read off
  • Features need to be provided / engineered
  • Various subtleties need to be observed:
    ○ Lots of correlated features can make training convergence arbitrarily slow
    ○ Features with arbitrary values can be permitted
    ○ Various optimization algorithms: “Iterative Scaling”, L-BFGS, SGD
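A minimal sketch of the above using scikit-learn (the library choice is an assumption; the deck links other implementations on the next slide). The learnt weights form the k x n matrix, so per-feature contributions can be read off directly:

```python
# Minimal logistic regression sketch (scikit-learn; an assumption, the deck
# prescribes no library). Toy data: n = 5 features, k = 2 classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)          # toy labels

clf = LogisticRegression(solver="lbfgs").fit(X, y)   # L-BFGS, as on the slide
print(clf.coef_)                  # weight matrix (1 x n here; k x n for k > 2)
print(clf.predict_proba(X[:3]))   # approximated conditional probabilities
```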

SLIDE 5

Logistic Regression

Example implementations: Maxent Toolkit:

https://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html

Tensorflow Tutorial:

https://www.tensorflow.org/get_started/mnist/beginners

SLIDE 6

Word embeddings

  • Extracting “meaning” from a word is difficult
  • Words in a language are often related, but this relationship is not easily inferred from the written form of the word
  • Letter-by-letter similarity does not imply any semantic similarity
  • Is it possible to build a dictionary that maps words into a space where some semantic relationships are represented?
  • Yes - word2vec et al.
SLIDE 7

Word embeddings

  • Idea: Try to train a model that predicts contexts for a given word
  • Train in a way that produces a vector representation of the word
  • Vector representations are then used as a stand-in for the written word in further applications

SLIDE 8

Word embeddings: Word2Vec

“The quick brown fox jumped over the lazy dog”

[Diagram: one word in the sentence marked as the target word, with the words around it marked as context]

SLIDE 9

Word embeddings: Word2Vec

Let $w_1, \dots, w_T$ be the training sequence of words, each paired with the context words in a window of size $k$ around it. Then optimize the following function:

$$\max \; \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-k \le j \le k \\ j \neq 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\big(v_{\mathrm{out}}(w_O)^{\top} v_{\mathrm{in}}(w_I)\big)}{\sum_{w} \exp\big(v_{\mathrm{out}}(w)^{\top} v_{\mathrm{in}}(w_I)\big)}$$

SLIDE 10

Word embeddings: Word2Vec

“For each word, find two vectors v_in and v_out so that the performance of the prediction of the words surrounding it is maximized.”

Words used in similar contexts are “close” in the embedding.

Strange results of the embedding: the vectors were successfully used for solving analogies (the classic example being king - man + woman ≈ queen). Some controversy exists about how much semantics is extracted, and whether the strange linear relationships are better explained by “noise”.
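A hedged sketch of training such an embedding with the gensim library (an assumption; the deck links a word2vec implementation on the next slide):

```python
# Skip-gram word2vec sketch with gensim (an assumed library choice).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    # ... a real corpus would contain millions of sentences
]
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

vec = model.wv["fox"]          # the learnt vector standing in for the word
# On a large corpus, analogies reduce to vector arithmetic:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```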

SLIDE 11

Word embeddings: Word2Vec

Example implementation: https://github.com/dav/word2vec

SLIDE 12

t-SNE

  • Common problem in ML: Understanding relationships between high-dimensional vectors
  • Difficult to plot :-)
  • t-SNE: Commonly used algorithm to visualize high-dimensional data in 2D or 3D
  • Attempts to optimize a mapping so that nearby points are close in the projection, and non-near points are at a distance in the projection
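A minimal sketch with scikit-learn's t-SNE (an assumed library choice; the deck links the Barnes-Hut C++ implementation on the next slide):

```python
# t-SNE sketch: project toy 50-dimensional vectors down to 2D for plotting.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 50))    # toy high-dim data
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X_2d.shape)    # (500, 2) -- ready for a 2D scatter plot
```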

SLIDE 13

t-SNE

Example implementation: https://github.com/lvdmaaten/bhtsne/

SLIDE 14

Deep Neural Networks

  • Big hype since Hinton’s 2006 breakthrough results
  • Didn’t work for decades, started working in 2006
  • Reasons why they started working are still poorly understood

SLIDE 15

Deep Neural Networks

  • Big hype since Hinton’s 2006 breakthrough results
  • Didn’t work for decades, started working in 2006
  • Reasons why they started working are still poorly understood

Last layer is just logistic regression

SLIDE 16

Deep Neural Networks

Last layer is just logistic regression. Lower layers can be viewed as feature extractors for the last-layer logistic regression.

SLIDE 17

Deep Neural Networks

  • Mathematically essentially iterated matrix multiplication with an interleaved nonlinear function
  • Each layer is of the form: $x_{l+1} = \sigma(W_l x_l + b_l)$
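A minimal NumPy sketch of the layer equation above (the toy layer sizes are my choice), with the final layer feeding a softmax, i.e. the logistic regression of the previous slides:

```python
# Iterated matrix multiplication with an interleaved nonlinearity (ReLU),
# ending in a softmax layer. Layer sizes are arbitrary toy choices.
import numpy as np

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(64, 128)), np.zeros(64)),   # (W_1, b_1)
          (rng.normal(size=(10, 64)), np.zeros(10))]    # (W_2, b_2)

def forward(x):
    for W, b in layers[:-1]:
        x = np.maximum(W @ x + b, 0.0)   # x_{l+1} = relu(W_l x_l + b_l)
    W, b = layers[-1]
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()                   # softmax: the "logistic regression" layer

print(forward(rng.normal(size=128)).sum())   # probabilities sum to 1.0
```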
SLIDE 18

Deep Neural Networks

  • Structure of the DNN is encoded in restrictions on the shape of the matrices
  • Convolutional NNs also force many weights in the lower layers to be the same (translation invariance, locality)
  • Modern DNNs often use ReLU or some other non-linear option instead of the sigmoid

SLIDE 19

Deep Neural Networks

  • Huge success in areas where feature engineering was traditionally very hard:
    ○ Image processing tasks
    ○ Speech recognition tasks
    ○ ...
  • Data-hungry: Many parameters to estimate; clearly one needs a fair amount of data to estimate them well
  • Good way to think about non-recurrent DNNs: Sophisticated feature extractors for logistic regression.

SLIDE 20

Deep Neural Networks

Lots of competing implementations now. Simply google “deep learning framework”: Tensorflow, Keras, Torch, Caffe, etc.

SLIDE 21

Transfer learning

  • Lower layers of a DNN extract structure from the input
  • Image processing example: Edge detection, shapes etc.
  • Low-level features for task A may be useful features for task B, too
  • Transfer learning: Take a DNN trained on task A, then try to re-train it to perform task B (see the sketch below)
  • Example: Google Inception NN, Hotdog / Not Hotdog app, another example later
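A hedged Keras sketch of that recipe (the framework and the two-class head are assumptions; the deck only names the Inception network): freeze the task-A feature extractor and train a new head for task B.

```python
# Transfer learning sketch: reuse Inception's task-A (ImageNet) features
# for a new task B. Keras and the 2-class head are assumed, not from the deck.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet",
                                         include_top=False, pooling="avg")
base.trainable = False                        # keep the task-A features fixed

model = tf.keras.Sequential([
    base,                                     # pre-trained feature extractor
    tf.keras.layers.Dense(2, activation="softmax"),  # new head, e.g. hotdog / not hotdog
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(task_b_images, task_b_labels)     # task-B data assumed available
```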

SLIDE 22

HMMs for sequence tagging

  • Consider the problem of assigning a sequence of syllables to an audio sample
  • Space to classify over grows exponentially with the sequence length
  • Think of a person’s voice as a state machine
SLIDE 23

HMMs for sequence tagging

  • Depending on what syllable is currently pronounced, the audio spectrum changes
  • Voice probabilistically transitions between states
  • Training an HMM:
    ○ Specify the structure of the state machine
    ○ Provide labeled data to infer …
      ■ Transition probabilities between states
      ■ Distribution of data emitted at each state
  • Inference in HMMs:
    ○ Provide a data sequence to infer …
      ■ The most likely path through the state machine that would have produced the data sequence
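A minimal sketch using the hmmlearn library (an assumption; the deck links GHMM two slides on). Note that hmmlearn's fit() estimates the transition and emission parameters by unsupervised Baum-Welch rather than from labeled data:

```python
# Fit a 3-state HMM with Gaussian emissions, then run Viterbi decoding to
# recover the most likely state path. Toy features stand in for audio spectra.
import numpy as np
from hmmlearn import hmm

X = np.random.default_rng(0).normal(size=(1000, 4))   # toy feature vectors

model = hmm.GaussianHMM(n_components=3, n_iter=50)
model.fit(X)                          # learns transition & emission parameters
logprob, states = model.decode(X)     # Viterbi: most likely state sequence
print(states[:20])
```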

SLIDE 24

HMMs for sequence tagging

  • Limitation: Independence assumptions:
    ○ Only the current state determines the data distribution
    ○ Only the current state determines the transition probabilities to the next state
  • Generative model:
    ○ Easy to “sample” from the distribution the model learnt
    ○ Everybody has seen Markov Twitter bots?

SLIDE 25

HMMs for sequence tagging

Example implementation: http://ghmm.sourceforge.net/ghmm-python-tutorial.html

Rabiner’s very accessible HMM tutorial: https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf

SLIDE 26

CRFs for sequence tagging

  • HMM independence assumption for state transitions is often not true in practice
  • Example: Part-of-speech tagging
    ○ Probability of a word being of a particular type depends on the type assigned to the previous word
  • HMMs model the joint distribution, but we normally want the conditional distribution
  • CRFs are the sequence form of logistic regression (see the sketch after this list)
  • Linear-chain CRFs are computationally tractable
  • More complex dependencies can make them intractable
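A minimal linear-chain CRF sketch using sklearn-crfsuite (an assumed library; the deck links Wapiti on the next slide), tagging a toy part-of-speech sequence:

```python
# Linear-chain CRF for part-of-speech tagging: each token is a feature dict,
# each sentence a list of tokens; labels are per-token tags.
import sklearn_crfsuite

X_train = [[{"word": "the", "is_cap": False},
            {"word": "fox", "is_cap": False},
            {"word": "jumped", "is_cap": False}]]
y_train = [["DET", "NOUN", "VERB"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # [['DET', 'NOUN', 'VERB']] on the toy data
```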

SLIDE 27

CRFs for sequence tagging

Pretty high-performance example implementation: https://wapiti.limsi.fr/

Corresponding paper: “Practical very large scale CRFs”, http://www.aclweb.org/anthology/P10-1052

SLIDE 28
SLIDE 29

Approximate Nearest Neighbor Search

Consider a family H of hash functions from the domain you wish to search to some hash domain. H is locality-sensitive if there are distances $d_1 < d_2$ and probabilities $p_1 > p_2$ such that, for $h$ drawn uniformly at random from H:

  • $d(x, y) \le d_1 \Rightarrow \Pr[h(x) = h(y)] \ge p_1$
  • $d(x, y) \ge d_2 \Rightarrow \Pr[h(x) = h(y)] \le p_2$

SLIDE 30

What does this mean?

“For similar objects, the odds of a randomly drawn hash function evaluating to the same value should be higher than for dissimilar objects.”
SLIDE 31

LSH for similarity search

  • Often a matter of designing a good hash function family for your domain
  • Rest of the implementation is mostly “pluggable”
  • For Euclidean and angular distance, several good, public, FOSS libraries exist that can be used off-the-shelf
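A hedged sketch of one such family for angular distance: random-hyperplane hashing (SimHash). The choice of family is mine, not the deck's:

```python
# Locality-sensitive hash family for angular distance: each hash function is
# a set of random hyperplanes; the bucket id is the pattern of sign bits.
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim, n_bits=16):
    planes = rng.normal(size=(n_bits, dim))   # one random hyperplane per bit
    return lambda x: tuple(planes @ x > 0)    # sign pattern = bucket id

h = make_hash(dim=64)
x = rng.normal(size=64)
y = x + 0.05 * rng.normal(size=64)            # a near neighbor of x
print(h(x) == h(y))                           # nearby points likely collide
```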

SLIDE 32

ANNoy and FalcoNN

ANNoy

  • Partition space into halves by random sampling & centroids
  • Build a tree structure out of these halves
  • Build N such trees

FalcoNN

  • Use a particular polytope hash

Both work pretty well -- FOSS C++ libraries, easy-to-use Python bindings.

SLIDE 33

Geometric intuition behind ANNoy

SLIDE 34

Pick two random points to start

SLIDE 35

Pick a new random point

SLIDE 36

Measure distance to initial points

SLIDE 37

Pick closer element

SLIDE 38

Calculate average

SLIDE 39

Repeat with new point

SLIDE 40

Result: Two “centroids”

SLIDE 41

Split space in the middle between the two centroids

SLIDE 42

Repeat on both sides until buckets small

SLIDE 43

Repeat on both sides until buckets small

SLIDE 44

Result: Tree tiling of our space

SLIDE 45

Each color: Tree-leaf / hash bucket

SLIDE 46

ANNoy intuition

  • Each tree is a “hash function” (maps a point to a bucket)
  • Easy to generate a new tree (sample random points, compute two centroids, etc.)
  • Nearby points have a higher probability of ending up in the same bucket than far-away points
  • ⇒ A family of locality-sensitive hashes
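A minimal usage sketch of the Annoy library itself (calls as documented in the github.com/spotify/annoy README):

```python
# Build N random-projection trees over toy vectors, then query neighbors.
import random
from annoy import AnnoyIndex

dim = 40
index = AnnoyIndex(dim, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)                        # N = 10 trees (more trees, better recall)
print(index.get_nns_by_item(0, 5))     # 5 approximate nearest neighbors of item 0
```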
SLIDE 47

Example: Image similarity search...

… in < 100 lines of Python.

  • How to best turn pictures into vectors of reals?
  • Image-classification Deep Neural Networks do this - if you just cut off the last layer
  • Step 1: Convert image files to real vectors by using a pre-trained image classification CNN and “cutting off” the last layer

SLIDE 48

Example: Image similarity search...

Different classes of images, pre-trained by Google on massive data and compute.

SLIDE 49

Example: Image similarity search...

Different classes of images, pre-trained by Google on massive data and compute.

SLIDE 50

Example: Image similarity search...

Different classes of images, pre-trained by Google on massive data and compute.

Vector of 2048 real numbers

SLIDE 51

Example: Image similarity search...

  • Example of “Transfer learning” - repurposing pre-trained neural networks
  • Input to the classification layer is a real vector of 2048 numbers
  • Use ANNoy to build an index
  • Change-resilient image similarity search in one afternoon (see the sketch below)
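A hedged sketch of the pipeline (Keras and InceptionV3 stand in for the pre-trained CNN with the last layer cut off; the deck's actual code is linked two slides on, and the image file names here are hypothetical):

```python
# Embed images with a pre-trained CNN, classification layer removed, then
# index the 2048-dim vectors with Annoy for similarity search.
import numpy as np
import tensorflow as tf
from annoy import AnnoyIndex

cnn = tf.keras.applications.InceptionV3(weights="imagenet",
                                        include_top=False, pooling="avg")

def embed(path):
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(
        np.array(img, dtype=np.float32)[None, ...])
    return cnn.predict(x)[0]               # 2048-dim feature vector

index = AnnoyIndex(2048, "angular")
for i, path in enumerate(["a.jpg", "b.jpg"]):          # hypothetical image files
    index.add_item(i, embed(path))
index.build(20)
print(index.get_nns_by_vector(embed("query.jpg"), 2))  # most similar images
```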
SLIDE 52

[Result images: query, best match, 2nd best match]

SLIDE 53

Example: Image similarity search...

ANNoy Library: https://github.com/spotify/annoy

Example code:
https://gist.github.com/thomasdullien/79d38da49cb4f4a511d74d780e53743a
(short URL: http://goo.gl/TCG34i)

SLIDE 54

Tree Ensembles

  • Decision trees: Classifiers where …
    ○ Leaves are classes (or values, or linear functions)
    ○ Inner nodes test for particular properties

[Example tree diagram with inner-node tests “Ear length > 10cm” and “Height > 30cm” and leaves Donkey, Hare, Tortoise]

SLIDE 55

Tree Ensembles

  • Popular algorithms to build decision trees:
    ○ C4.5 and CART
  • Discussion only of C4.5 here:
  • Recursively partition the training set via tests
  • Choose partitions that maximize information gain:
    $\mathrm{IG}(T, a) = H(T) - \sum_{v} \frac{|T_v|}{|T|} H(T_v)$
  • Favor partitions that reduce entropy in the parts
  • Balance against the complexity of the partition (C4.5 normalizes the gain by the split’s own entropy, the “gain ratio”; see the worked example of information gain below)
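A small worked example of the information-gain computation, in plain Python (the toy data echoes the donkey / hare / tortoise tree from the earlier slide):

```python
# Entropy of a label multiset, and the information gain of one boolean test.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ["donkey", "donkey", "hare", "tortoise", "tortoise", "tortoise"]
test   = [True,     True,     True,   False,      False,      False]  # e.g. "ear length > 10cm"

left  = [l for l, t in zip(labels, test) if t]
right = [l for l, t in zip(labels, test) if not t]
gain = entropy(labels) - (len(left) / len(labels)) * entropy(left) \
                       - (len(right) / len(labels)) * entropy(right)
print(gain)   # entropy removed by partitioning on this test
```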
SLIDE 56

Tree Ensembles: Random forests

  • Subsample training data randomly
  • Build decision tree
  • Repeat until N trees are constructed
  • Classify by “voting” of N trees
  • Provides probabilities (fraction of trees that voted for a given class)
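A minimal random-forest sketch with scikit-learn (an assumed library; the deck names none here):

```python
# N randomly-grown trees vote; predict_proba reports the aggregated votes
# (scikit-learn averages per-tree probabilities, playing the role of the
# vote fraction described above).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)  # N = 100 trees
print(forest.predict_proba(X[:3]))   # per-class probabilities for 3 samples
```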

SLIDE 57

Tree Ensembles: Tree Boosting

  • Construct a tree of given complexity, this time for numerical output
  • Calculate the “residual”:
    ○ Difference between predicted value and actual value, for each element
  • Construct the next tree to approximate the residual
  • Add the trees together

(Boosting: Constructing a sequence of classifiers that are trained on the “mistakes” of the previous classifiers)
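A hedged sketch of boosting on residuals, built from plain scikit-learn regression trees (an assumption; the deck points at xgboost on the next slide):

```python
# Each new tree fits the current ensemble's residual ("mistakes"); the
# ensemble prediction is the (shrunken) sum of all trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

trees, pred = [], np.zeros_like(y)
for _ in range(50):
    residual = y - pred                       # mistakes of the ensemble so far
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    pred += 0.1 * t.predict(X)                # add the new tree, shrunk by 0.1

print(np.mean((y - pred) ** 2))               # training error shrinks per round
```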

SLIDE 58

Tree Ensembles: Tree Boosting

  • Kaggle competitions are routinely won by ensembles of xgboost (a tree boosting implementation) + X (other classifiers)
  • Random Forests & Tree boosting are widely regarded as the most “fire and forget” classifiers available

Implementation: https://github.com/dmlc/xgboost
Paper: https://arxiv.org/abs/1603.02754

SLIDE 59

Topics to cover

  • 1. Logistic regression
  • 2. Word embeddings
  • 3. t-SNE
  • 4. Deep Networks (and some transfer learning)
  • 5. Hidden Markov Models for sequence tagging
  • 6. Conditional Random Fields for sequence tagging
  • 7. Reinforcement learning
  • 8. Approximate NN and k-NN methods
  • 9. Tree ensemble methods
SLIDE 60

Questions? :-)