A tour of machine learning ... guided by a complete amateur
Thomas Dullien, Google
Topics to cover
- 1. Logistic regression
- 2. Word embeddings
- 3. t-SNE
- 4. Deep Networks (and some transfer learning)
- 5. Hidden Markov Models for sequence tagging
- 6. Conditional Random Fields for sequence tagging
- 7. Reinforcement learning
- 8. Approximate NN and k-NN methods
- 9. Tree ensemble methods
Logistic Regression
- Also known as “maximum entropy modelling”
- Mathematically simple, easy to diagnose / inspect
- Idea: Approximate a conditional probability
distribution from (labeled) training data
- Consider k output classes and n features
Logistic Regression
- Parameters that are learnt are a k x n matrix of
weights
- Easily diagnosable: for each decision, the contribution of
each feature can be easily read off
- Features need to be provided / engineered
- Various subtleties need to be observed:
○ Lots of correlated features can make training convergence arbitrarily slow
○ Features with arbitrary values can be permitted
○ Various optimization algorithms: “Iterative Scaling”, L-BFGS, SGD
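To make the diagnosability point concrete, here is a minimal sketch using scikit-learn (the toy data and feature count are made up for illustration):

```python
# A minimal logistic regression sketch; the toy data here is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # 200 samples, n = 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # 2 output classes

clf = LogisticRegression().fit(X, y)        # default solver is L-BFGS

# The learnt weights are the per-class, per-feature weight matrix;
# each entry shows how strongly a feature pushes towards a class.
print(clf.coef_)
print(clf.predict_proba(X[:3]))             # conditional class probabilities
```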
Logistic Regression
Example implementations: Maxent Toolkit:
https://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
Tensorflow Tutorial:
https://www.tensorflow.org/get_started/mnist/beginners
Word embeddings
- Extracting “meaning” from a word is difficult
- Words in a language are often related, but this relationship is not easily inferred from the written form of the word
- Letter-by-letter similarity does not imply any semantic similarity
- Is it possible to build a dictionary
that maps words into a space where some semantic relationships are represented?
- Yes - word2vec et al.
Word embeddings
- Idea: Try to train a model that predicts contexts for a
given word
- Train in a way that produces a vector representation of the word
- Vector representations are then used as stand-in for
the written word in further applications
Word embeddings: Word2Vec
“The quick brown fox jumped over the lazy dog”
[Diagram: a sliding window over the sentence, pairing each target word with its surrounding context words]
Word embeddings: Word2Vec
Let the training data be pairs of target words and their contexts, drawn from a sliding window of size $c$ over the corpus $(w_1, \dots, w_T)$. Then optimize the following (skip-gram) objective:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$
Word embeddings: Word2Vec
- “For each word, find two vectors v_in and v_out so that the performance of the prediction of the words surrounding it is maximized.”
- Words used in similar contexts are “close” in the embedding.
- Strange results of the embedding: the vectors were successfully used for solving analogies.
- Some controversy exists about how much semantics are extracted, and whether the strange linear relationships are better explained by “noise”.
Word embeddings: Word2Vec
Example implementation: https://github.com/dav/word2vec
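A minimal sketch of training embeddings with the gensim library instead (assumes gensim 4.x; the tiny corpus is a placeholder, far too small for meaningful vectors):

```python
# Train skip-gram word2vec on a toy corpus with gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "lazy", "dog", "slept", "all", "day"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

print(model.wv["fox"])                # the learnt vector for a word
print(model.wv.most_similar("dog"))   # nearest words in embedding space
```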
t-SNE
- Common problem in ML: Understanding relationships
between high-dimensional vectors
- Difficult to plot :-)
- t-SNE: Commonly used algorithm to visualize
high-dimensional data in 2D or 3D
- Attempts to optimize a mapping so that nearby points stay close in the projection, and far-apart points stay distant in the projection
t-SNE
Example implementation: https://github.com/lvdmaaten/bhtsne/
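For quick experiments, scikit-learn also ships an implementation; a minimal sketch (random data as a stand-in for real vectors):

```python
# Project random 50-dimensional vectors down to 2D for plotting.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(500, 50))
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.show()
```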
Deep Neural Networks
- Big hype since Hinton’s 2006 breakthrough results
- Didn’t work for decades, started working in 2006
- Reasons why they started working are still poorly
understood
Deep Neural Networks
- Last layer is just logistic regression
- Lower layers can be viewed as feature extractors for the last-layer logistic regression
Deep Neural Networks
- Mathematically, essentially iterated matrix multiplication with an interleaved nonlinear function
- Each layer is of the form $x_{l+1} = \sigma(W_l x_l + b_l)$
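A minimal numpy sketch of that layer form (shapes chosen arbitrarily):

```python
import numpy as np

def layer(x, W, b, nonlin=np.tanh):
    # One layer: matrix multiplication, then a pointwise non-linearity.
    return nonlin(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
h = layer(x, rng.normal(size=(8, 4)), np.zeros(8))     # first layer
out = layer(h, rng.normal(size=(3, 8)), np.zeros(3))   # second layer; iterate for depth
```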
Deep Neural Networks
- Structure of the DNN is encoded in restrictions on the
shape of the matrices
- Convolutional NN’s also force many weights in the
lower layers to be the same (translation invariance, locality)
- Modern DNNs often use ReLU etc. instead of the sigmoid; many other non-linear options exist
Deep Neural Networks
- Huge success in areas where feature engineering was traditionally very hard:
○ Image processing tasks
○ Speech recognition tasks
○ ...
- Data-hungry: many parameters to estimate, so clearly one needs a fair amount of data to estimate them well
- Good way to think about non-recurrent DNNs:
Sophisticated feature extractors for logistic regression.
Deep Neural Networks
Lots of competing implementations now; simply google “deep learning framework”: Tensorflow, Keras, Torch, Caffe, etc.
Transfer learning
- Lower layers of DNN extract structure from input
- Image processing example: Edge detection, shapes
etc.
- Low-level features for task A may be useful features
for task B, too
- Transfer learning: Take DNN trained on task A, then
try to re-train it to perform task B
- Examples: Google's Inception NN, the Hotdog / Not Hotdog app; another example later
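A hedged Keras sketch of the idea: take a network pre-trained on task A (ImageNet classification), freeze it, and re-train only a new head for task B. NUM_CLASSES and the training data are placeholders:

```python
import tensorflow as tf

# Pre-trained feature extractor (task A: ImageNet), classification head removed.
base = tf.keras.applications.InceptionV3(weights="imagenet",
                                         include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained lower layers

NUM_CLASSES = 2  # e.g. hotdog / not hotdog (placeholder)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new head for task B
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(task_b_images, task_b_labels, epochs=5)  # re-train on task B data
```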
HMMs for sequence tagging
- Consider the problem of assigning a sequence of
syllables to an audio sample
- Space to classify over grows exponentially with sequence length
- Think of a person’s voice as a state machine
HMMs for sequence tagging
- Depending on what syllable is currently pronounced,
audio spectrum changes
- Voice probabilistically transitions between states
- Training an HMM:
○ Specify the structure of the state machine
○ Provide labeled data to infer …
■ Transition probabilities between states
■ Distribution of data emitted at each state
- Inference in HMMs:
○ Provide a data sequence to infer …
■ The most likely path through the state machine that would have produced the data sequence
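That “most likely path” inference step is the Viterbi algorithm; a minimal log-space numpy sketch (the probability matrices and observation encoding are assumptions):

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely state path for observation indices `obs`.
    log_start: (K,), log_trans: (K, K), log_emit: (K, V) - all log-probabilities."""
    T, K = len(obs), len(log_start)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(K):
            cand = score[t - 1] + log_trans[:, s]       # best way to reach state s
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                       # walk the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```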
HMMs for sequence tagging
- Limitation: independence assumptions:
○ Only the current state determines the data distribution
○ Only the current state determines transition probabilities to the next state
- Generative model:
○ Easy to “sample” from the distribution the model learnt
○ Everybody has seen Markov Twitter bots?
HMMs for sequence tagging
Example implementation: http://ghmm.sourceforge.net/ghmm-python-tutorial.html
Rabiner’s very accessible HMM tutorial: https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
CRFs for sequence tagging
- The HMM independence assumption for state transitions is often not true in practice
- Example: part-of-speech tagging
○ The probability of a word being of a particular type depends on the type assigned to the previous word
- HMMs model joint distribution, but we normally want
conditional distribution
- CRFs are the sequence form of logistic regression: $p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big)$
- Linear-chain CRFs computationally tractable
- More complex dependencies can make them
intractable
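Why linear-chain CRFs stay tractable: the partition function $Z(x)$ can be computed by a forward recursion over positions, in time linear in sequence length. A minimal numpy sketch (the unary/pairwise score matrices are assumptions standing in for learnt feature weights):

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(unary, pairwise):
    """unary: (T, K) per-position label scores; pairwise: (K, K) transition scores."""
    alpha = unary[0]
    for t in range(1, len(unary)):
        # alpha'[j] = logsumexp_i(alpha[i] + pairwise[i, j]) + unary[t, j]
        alpha = logsumexp(alpha[:, None] + pairwise, axis=0) + unary[t]
    return logsumexp(alpha)  # log Z(x)

def log_prob(labels, unary, pairwise):
    # log p(y | x) for one label sequence under the same scores.
    score = unary[0, labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1], labels[t]] + unary[t, labels[t]]
    return score - log_partition(unary, pairwise)
```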
CRFs for sequence tagging
Pretty high-performance example implementation: https://wapiti.limsi.fr/
Corresponding paper: “Practical Very Large Scale CRFs”, http://www.aclweb.org/anthology/P10-1052
Approximate Nearest Neighbor Search
Consider a family of hash functions (from the domain you wish to search to some range). The family is locality-sensitive if, for some distance thresholds $r < cr$ and probabilities $p_1 > p_2$, any two points $x, y$ satisfy: $d(x, y) \le r \Rightarrow \Pr_{h}[h(x) = h(y)] \ge p_1$, and $d(x, y) \ge cr \Rightarrow \Pr_{h}[h(x) = h(y)] \le p_2$.
What does this mean?
“For similar objects, the odds of a randomly drawn hash function evaluating to the same value should be higher than for dissimilar objects.”
LSH for similarity search
- Often a matter of designing a good hash
function family for your domain
- Rest of the implementation is mostly
“pluggable”
- For Euclidean and angular distance, several
good, public, FOSS libraries exist that can be used off-the-shelf
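A minimal sketch of such a family for angular distance: random-hyperplane hashing, where each bit records which side of a random hyperplane a vector falls on (dimension and bit count are arbitrary choices):

```python
import numpy as np

def make_hash(dim, n_bits, rng):
    planes = rng.normal(size=(n_bits, dim))   # one random hyperplane per bit
    return lambda v: tuple((planes @ v) > 0)  # which side of each hyperplane

rng = np.random.default_rng(0)
h = make_hash(dim=128, n_bits=16, rng=rng)    # one randomly drawn hash function

v = rng.normal(size=128)
w = v + 0.05 * rng.normal(size=128)           # a small perturbation of v
print(h(v) == h(w))                           # usually True: similar vectors collide
```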
ANNoy and FALCONN
ANNoy
- Partition space into halves by random sampling & centroids
- Build a tree structure out of these halves
- Build N such trees
FALCONN
- Uses a particular polytope hash
Both work pretty well -- FOSS C++ libraries, easy-to-use Python bindings.
Geometric intuition behind ANNoy
- Pick two random points to start
- Pick a new random point
- Measure its distance to the two initial points
- Pick the closer element
- Calculate the average
- Repeat with new points
- Result: two “centroids”
- Split the space in the middle between the two centroids
- Repeat on both sides until the buckets are small
- Result: a tree tiling of our space (in the diagram, each color is a tree leaf / hash bucket)
ANNoy intuition
- Each tree is a “hash function” (maps a point
to a bucket)
- Easy to generate a new tree (sample random points, compute two centroids, etc.)
- Nearby points have higher probability to end
up in same bucket than far-away points
- ⇒ A family of locality-sensitive hashes
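Using the library directly is short; a minimal sketch with random vectors (pip install annoy):

```python
import random
from annoy import AnnoyIndex

dim = 40
index = AnnoyIndex(dim, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                        # build 10 trees (the N above)
print(index.get_nns_by_item(0, 5))     # 5 approximate nearest neighbours of item 0
```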
Example: Image similarity search...
… in < 100 lines of Python.
- How to best turn pictures into vectors of
reals?
- Image-classification Deep Neural Networks
do this - if you just cut off the last layer
- Step 1: Convert image files to real vectors by
using a pre-trained image classification CNN and “cut off” the last layer
Example: Image similarity search...
Different classes of images, pre-trained by Google on massive data and compute; the layer feeding the classifier outputs a vector of 2048 real numbers.
Example: Image similarity search...
- Example of “transfer learning” - repurposing pre-trained neural networks
- Input to the classification layer is a real vector of 2048 numbers
- Use ANNoy to build an index
- Change-resilient image similarity search in one afternoon
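A hedged sketch of the whole pipeline (the original code is linked below; file names here are placeholders):

```python
import tensorflow as tf
from annoy import AnnoyIndex

# Pre-trained CNN with the classification layer "cut off".
extractor = tf.keras.applications.InceptionV3(weights="imagenet",
                                              include_top=False, pooling="avg")

def featurize(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[None]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return extractor.predict(x)[0]       # 2048-dim feature vector

paths = ["img0.jpg", "img1.jpg"]         # placeholder image files
index = AnnoyIndex(2048, "angular")
for i, p in enumerate(paths):
    index.add_item(i, featurize(p))
index.build(20)

print(index.get_nns_by_vector(featurize("query.jpg"), 3))  # closest images
```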
[Figure: a query image next to its best and 2nd-best matches]
Example: Image similarity search...
ANNoy Library: https://github.com/spotify/annoy
Example code: https://gist.github.com/thomasdullien/79d38da49cb4f4a511d74d780e53743a (short URL: http://goo.gl/TCG34i)
Tree Ensembles
- Decision trees: Classifiers where …
○ Leaves are classes (or values, or linear functions)
○ Inner nodes test for particular properties
[Example tree from the slide: tests “Ear length > 10cm” and “Height > 30cm” with True/False branches, and leaves Donkey, Hare, Tortoise]
Tree Ensembles
- Popular algorithms to build decision trees:
○ C4.5 and CART
- Discussion only of C4.5 here
- Recursively partition training set via tests
- Choose partitions that maximize the information gain $IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$
- Favor partitions that reduce entropy in the
parts
- Balance against complexity of partition
Tree Ensembles: Random forests
- Subsample training data randomly
- Build decision tree
- Repeat until N trees are constructed
- Classify by “voting” of N trees
- Provides probabilities (fraction of trees that
voted for a given class)
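A minimal scikit-learn sketch; note that scikit-learn averages per-tree probabilities, which for fully grown trees coincides with the voting fractions described above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset
forest = RandomForestClassifier(n_estimators=100).fit(X, y)
print(forest.predict_proba(X[:3]))  # per-class probabilities from the ensemble
```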
Tree Ensembles: Tree Boosting
- Construct tree of given complexity, this time
for numerical output
- Calculate “residual”:
○ Difference between predicted value and actual value, for each element
- Construct next tree to approximate the
residual
- Add the trees together
(Boosting: Constructing a sequence of classifiers that are trained on the “mistakes” of the previous classifiers)
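The residual-fitting loop in miniature, sketched with two scikit-learn regression trees on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residual = y - tree1.predict(X)                    # what the first tree got wrong
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residual)

prediction = tree1.predict(X) + tree2.predict(X)   # add the trees together
print(np.mean((y - prediction) ** 2),              # ensemble error ...
      np.mean((y - tree1.predict(X)) ** 2))        # ... vs. single-tree error
```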
Tree Ensembles: Tree Boosting
- Kaggle competitions are routinely won by
ensembles of xgboost (a tree boosting implementation) + X (other classifiers)
- Random Forests & tree boosting are widely regarded as the most “fire and forget” classifiers available
xgboost: https://github.com/dmlc/xgboost
Paper: https://arxiv.org/abs/1603.02754
Topics to cover
- 1. Logistic regression
- 2. Word embeddings
- 3. t-SNE
- 4. Deep Networks (and some transfer learning)
- 5. Hidden Markov Models for sequence tagging
- 6. Conditional Random Fields for sequence tagging
- 7. Reinforcement learning
- 8. Approximate NN and k-NN methods
- 9. Tree ensemble methods