Scalable Machine Learning
10. Distributed Inference and Applications

Alex Smola, Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 (Stat 260 SP 12)

Outline: Latent Dirichlet Allocation; basic model; sampling and ...


  1. Fully asynchronous sampler (see the sketch below)
  • For 1000 iterations do (independently per computer)
    • For each thread/core do
      • For each document do
        • For each word in the document do
          • Resample topic for the word
          • Update local (document, topic) table
          • Generate computer-local (word, topic) message
    • In parallel update local (word, topic) table
    • In parallel update global (word, topic) table
  (Diagram: barrier-free design with a minimal view of the state, continuous lock-free synchronization, and concurrent use of CPU, disk and network, contrasted with a blocking, out-of-sync, network- and memory-bound alternative.)
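A minimal sketch of the per-computer loop on this slide, not the original implementation: `resample_topic` is a hypothetical helper for the per-word conditional, `local_doc_topic` is assumed to be a numpy array of counts, and `out_msgs` is any queue-like object consumed by the threads that update the local and global (word, topic) tables in parallel.

```python
def async_sampler(docs, local_doc_topic, resample_topic, out_msgs, n_iters=1000):
    # docs: list of {"words": [...], "topics": [...]} (assumed layout)
    for _ in range(n_iters):                           # 1000 iterations, no barriers
        for d, doc in enumerate(docs):                 # each document
            for i, word in enumerate(doc["words"]):    # each word in the document
                old_t = doc["topics"][i]
                new_t = resample_topic(d, word, old_t)   # resample the topic
                doc["topics"][i] = new_t
                local_doc_topic[d, old_t] -= 1           # update local (doc, topic) table
                local_doc_topic[d, new_t] += 1
                out_msgs.put((word, old_t, new_t))       # computer-local (word, topic) message
```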

  2. Multicore Architecture (Intel Threading Building Blocks)
  (Diagram: a token file feeds several samplers; a combiner/updater maintains the joint state table of topics; counts and topics are written to an output file; diagnostics and optimization run alongside.)
  • Decouple multithreaded sampling and updating: (almost) avoids stalling for locks in the sampler (see the sketch below)
  • Joint state table
    • much less memory required
    • samplers synchronized (10 docs vs. millions delay)
  • Hyperparameter update via stochastic gradient descent
  • No need to keep documents in memory (streaming)
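A sketch of the decoupling idea, not the TBB pipeline itself: sampler threads never touch the joint state table directly; they push (word, old topic, new topic) deltas onto a queue and a single updater thread applies them, so samplers (almost) never stall on locks. The dict-of-arrays state and the `resample_topic` helper are assumptions.

```python
import queue

updates = queue.Queue()

def updater_thread(word_topic):
    while True:
        msg = updates.get()
        if msg is None:                        # shutdown signal
            break
        word, old_t, new_t = msg
        word_topic[word][old_t] -= 1           # apply the delayed update
        word_topic[word][new_t] += 1

def sampler_thread(docs, resample_topic):
    for doc in docs:
        for i, word in enumerate(doc["words"]):
            old_t = doc["topics"][i]
            new_t = resample_topic(doc, word, old_t)   # hypothetical helper
            doc["topics"][i] = new_t
            updates.put((word, old_t, new_t))          # no lock held while sampling
```

The updater would run in its own thread (e.g. `threading.Thread(target=updater_thread, args=(word_topic,))`), mirroring the sampler/combiner/updater split in the diagram.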

  3. Cluster Architecture
  (Diagram: one sampler per machine, each attached to a distributed storage node.)
  • Distributed (key, value) storage via memcached
  • Background asynchronous synchronization
    • single word at a time to avoid deadlocks
    • no need to have a joint dictionary
    • uses disk, network and CPU simultaneously

  4. Cluster Architecture
  (Diagram: one sampler per machine, each attached to an ICE storage node.)
  • Distributed (key, value) storage via ICE
  • Background asynchronous synchronization
    • single word at a time to avoid deadlocks
    • no need to have a joint dictionary
    • uses disk, network and CPU simultaneously

  5. Making it work
  • Startup
    • Randomly initialize topics on each node (read from disk if already assigned: hotstart)
    • Sequential Monte Carlo makes startup much faster
    • Aggregate changes on the fly
  • Failover (see the sketch below)
    • State is constantly being written to disk (worst case we lose 1 iteration out of 1000)
    • Restart via the standard startup routine
    • Achilles heel: need to restart from a checkpoint if even a single machine dies
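A small sketch of the failover bullets above: write the state to disk every iteration so at most one iteration is lost, and reuse the normal startup path ("hotstart") when restarting. The pickle format, the file name `topics.ckpt`, and the atomic-rename trick are illustrative assumptions, not details from the slides.

```python
import os
import pickle
import tempfile

def checkpoint(state, path="topics.ckpt"):
    """Dump the sampler state; called once per iteration."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)            # atomic rename: never a half-written file

def hotstart(path="topics.ckpt"):
    """Resume from the last snapshot, or signal a cold (random) start."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None                      # cold start: random topic initialization
```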

  6. Easily extensible
  • Better language model (topical n-grams): can process millions of users (vs. 1000s)
  • Conditioning on side information (upstream): estimate topics based on authorship, source, a joint user model, ...
  • Conditioning on dictionaries (downstream): integrate topics between different languages
  • Time-dependent sampler for the user model: approximate inference per episode

  7. Alternatives

  8. V1 - Brute force maximization
  • Integrate out the latent parameters θ and ψ, then maximize p(X, Y | α, β)
  • Discrete maximization problem in Y
  • Hard to implement
  • Overfits a lot (the mode is not a typical sample)
  • Parallelization infeasible
  Hal Daume; Joey Gonzalez

  9. V2 - Brute force maximization
  • Integrate out the latent assignments y, then maximize p(X, ψ, θ | α, β)
  • Continuous nonconvex optimization problem in θ and ψ
  • Solve by stochastic gradient descent over documents
  • Easy to implement
  • Does not overfit much
  • Great for small datasets
  • Parallelization difficult/impossible
  • Memory storage/access is O(T·W) (this breaks for large models)
    • 1M words, 1000 topics = 4 GB
    • Per document: 1 MFlop per iteration
  Hoffmann, Blei, Bach (in VW)

  10. V3 - Variational approximation
  • Approximate the intractable joint distribution by tractable factors:
      log p(x) ≥ log p(x) − D(q(y) ‖ p(y|x))
               = ∫ dq(y) [log p(x) + log p(y|x) − log q(y)]
               = ∫ dq(y) log p(x, y) + H[q]
  • Alternating convex optimization problem
  • Dominant cost is a matrix-matrix multiply
  • Easy to implement
  • Great for small topics/vocabulary
  • Parallelization easy (aggregate statistics)
  • Memory storage is O(T·W) (this breaks for large models)
  • Model not quite as good as sampling
  Blei, Ng, Jordan

  11. V4 - Uncollapsed sampling
  • Sample y_ij | rest: can be done in parallel
  • Sample θ | rest and ψ | rest: can be done in parallel
  • Compatible with MapReduce (only aggregate statistics)
  • Easy to implement
  • Children can be conditionally independent*
  • Memory storage is O(T·W) (this breaks for large models)
  • Mixes slowly
  *for the right model

  12. V5 - Collapsed sampling
  • Integrate out the latent parameters θ and ψ: p(X, Y | α, β)
  • Sample one topic assignment y_ij | X, Y_{−ij} at a time from
      p(y_ij = t | X, Y_{−ij}) ∝ (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t) · (n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t)
  • Fast mixing
  • Easy to implement (see the sketch below)
  • Memory efficient
  • Parallelization infeasible (variables lock each other)
  Griffiths & Steyvers 2005
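A minimal single-machine sketch of the collapsed Gibbs sampler defined by this conditional, with symmetric α and β for brevity; the per-document denominator n^{−i}(d) + Σ_t α_t is constant in t and cancels when normalizing. The input format (`docs` as lists of word ids) is an assumption, and this is not the distributed implementation discussed elsewhere in the deck.

```python
import numpy as np

def collapsed_gibbs(docs, T, W, alpha, beta, n_iters, rng=None):
    rng = rng or np.random.default_rng(0)
    n_dt = np.zeros((len(docs), T))                 # n(t, d): doc-topic counts
    n_tw = np.zeros((T, W))                         # n(t, w): topic-word counts
    n_t = np.zeros(T)                               # n(t): topic totals
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                          # remove current assignment (the −ij counts)
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + W * beta)
                t = rng.choice(T, p=p / p.sum())     # draw from the collapsed conditional
                z[d][i] = t
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    return z, n_dt, n_tw
```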

  14. V6 - Approximating the distribution
  • Collapsed sampler per machine:
      p(y_ij = t | X, Y_{−ij}) ∝ (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t) · (n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t)
  • Defer synchronization between machines
    • no problem for n(t)
    • big problem for n(t, w)
  • Easy to implement
  • Can be memory efficient
  • Easy parallelization
  • Mixes slowly / worse likelihood
  Asuncion, Smyth, Welling, ... (UCI); Mimno, McCallum, ... (UMass)

  15. V7 - Better approximations of the distribution
  • Collapsed sampler:
      p(y_ij = t | X, Y_{−ij}) ∝ (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t) · (n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t)
  • Make local copies of the state
    • Implicit for multicore (delayed updates from the samplers)
    • Explicit copies for multi-machine
  • Not a hierarchical model (Welling, Asuncion, et al. 2008)
  • Memory efficient (each sampler only needs to view its own sufficient statistics)
  • Multicore / multi-machine
  • Convergence speed depends on synchronizer quality
  Smola and Narayanamurthy, 2009; Ahmed, Gonzalez, et al., 2012

  16. V8 - Sequential Monte Carlo
  • Integrate out the latent θ and ψ: p(X, Y | α, β)
  • Chain conditional probabilities:
      p(X, Y | α, β) = ∏_{i=1}^{m} p(x_i, y_i | x_1, y_1, ..., x_{i−1}, y_{i−1}, α, β)
  • For each particle sample
      y_i ∼ p(y_i | x_i, x_1, y_1, ..., x_{i−1}, y_{i−1}, α, β)
  • Reweight each particle by the next step's data likelihood
      p(x_{i+1} | x_1, y_1, ..., x_i, y_i, α, β)
  • Resample particles if the weight distribution is too uneven
  Canini, Shi, Griffiths, 2009; Ahmed et al., 2011

  17. V8 - Sequential Monte Carlo (continued)
  • Same chained conditionals, sampling, and reweighting as on the previous slide (see the sketch below)
  • One pass through the data
  • Data sequential: parallelization is an open problem
  • Nontrivial to implement
    • the sampler itself is easy
    • the inheritance tree through particles is messy
  • Need to estimate the data likelihood (integration over y), e.g. as part of the sampler
  • This is a multiplicative-update algorithm with log loss ...
  Canini, Shi, Griffiths, 2009; Ahmed et al., 2011
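A generic sequential Monte Carlo skeleton in the spirit of these two slides: each particle keeps its own topic assignments, is reweighted by the next item's predictive likelihood, and the particle set is resampled when the weights become too uneven. `propose_topic` and `predictive_likelihood` are hypothetical helpers, and the particle inheritance tree the slide warns about is sidestepped here by copying particles outright.

```python
import numpy as np

def smc(words, n_particles, propose_topic, predictive_likelihood, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    particles = [{"topics": []} for _ in range(n_particles)]
    weights = np.full(n_particles, 1.0 / n_particles)
    for i, w in enumerate(words):
        for k, p in enumerate(particles):
            p["topics"].append(propose_topic(p, w))           # sample y_i
            if i + 1 < len(words):                             # reweight by next item
                weights[k] *= predictive_likelihood(p, words[i + 1])
        weights /= weights.sum()
        ess = 1.0 / np.sum(weights ** 2)                       # effective sample size
        if ess < 0.5 * n_particles:                            # too uneven: resample
            idx = rng.choice(n_particles, size=n_particles, p=weights)
            particles = [{"topics": list(particles[j]["topics"])} for j in idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return particles, weights
```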

  18. Comparison of the alternatives
  • Optimization
    • Collapsed topic assignments: overfits, too costly, parallelization difficult
    • Variational approximation: easy to optimize, big memory footprint, easy parallelization
    • Uncollapsed natural parameters: easy, big memory footprint, too costly, parallelization difficult
  • Sampling
    • Collapsed topic assignments: fast mixing, approximate inference by delayed updates, parallelization difficult
    • Variational approximation: n.a.
    • Uncollapsed natural parameters: conditionally independent, slow mixing, sequential particle filtering

  19. Parallel Inference

  20. 3 Problems
  (Diagram: a clustering model; each data point carries a cluster ID; the clusters have a mean, variance, and cluster weight.)

  21. 3 Problems
  (Diagram: the same model annotated with local state, data, and global state.)

  22. 3 Problems
  (Diagram: the local state is only needed locally, the data is huge, and the global state is too big for a single machine.)

  23. 3 Problems
  (Diagram comparing vanilla LDA and user profiling: each has data, local state, and global state.)

  25. 3 Problems
  • Local state is too large: does not fit into memory
  • Network load & barriers
  • Global state is too large: does not fit into memory

  26. 3 Problems
  • Local state is too large (does not fit into memory): stream local data from disk
  • Network load & barriers
  • Global state is too large: does not fit into memory

  27. 3 Problems
  • Local state is too large (does not fit into memory): stream local data from disk
  • Network load & barriers: asynchronous synchronization
  • Global state is too large: does not fit into memory

  28. 3 Problems
  • Local state is too large (does not fit into memory): stream local data from disk
  • Network load & barriers: asynchronous synchronization
  • Global state is too large (does not fit into memory): partial view

  29. Distribution
  (Diagram: global state and replicas arranged over the cluster and its racks.)

  31. Synchronization
  • Child updates local state
    • Start with common state
    • Child stores old and new state
    • Parent keeps global state
  • Transmit differences asynchronously
    • Inverse element for differences
    • Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)
  • Local to global:  δ ← x − x_old;  x_old ← x;  x_global ← x_global + δ
  • Global to local:  x ← x + (x_global − x_old);  x_old ← x_global
  (A code sketch of these two rules follows below.)
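A minimal sketch of the two update rules above, assuming the state lives in an abelian group (here numpy count vectors under addition); `GlobalStore` is a hypothetical stand-in for the distributed (key, value) storage, of which only `get` and `add` are used.

```python
import numpy as np

class Replica:
    def __init__(self, x):
        self.x = x.copy()              # current local copy
        self.x_old = x.copy()          # state at the last synchronization

    def local_to_global(self, store, key):
        delta = self.x - self.x_old            # δ ← x − x_old
        self.x_old = self.x.copy()             # x_old ← x
        store.add(key, delta)                  # x_global ← x_global + δ

    def global_to_local(self, store, key):
        x_global = store.get(key)
        self.x = self.x + (x_global - self.x_old)   # x ← x + (x_global − x_old)
        self.x_old = x_global.copy()                # x_old ← x_global

class GlobalStore:
    """Toy stand-in for memcached/ICE holding the global copies."""
    def __init__(self):
        self.table = {}
    def add(self, key, delta):
        self.table[key] = self.table.get(key, np.zeros_like(delta)) + delta
    def get(self, key):
        return self.table[key]
```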

  32. Synchronization
  • Naive approach (dumb master)
    • Global state is only (key, value) storage
    • Local node needs to lock/read/write/unlock the master
    • Needs 4 TCP/IP round trips: latency bound
  • Better solution (smart master)
    • Client sends a message to the master; it is queued; the master incorporates it
    • Master sends a message to the client; it is queued; the client incorporates it
    • Bandwidth bound (>10x speedup in practice)
  • Local to global:  δ ← x − x_old;  x_old ← x;  x_global ← x_global + δ
  • Global to local:  x ← x + (x_global − x_old);  x_old ← x_global

  33. Distribution
  • Dedicated server for variables
    • Insufficient bandwidth (hotspots)
    • Insufficient memory
  • Select the server e.g. via consistent hashing (see the sketch below):
      m(x) = argmin_{m ∈ M} h(x, m)
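A short sketch of the selection rule m(x) = argmin_{m ∈ M} h(x, m), i.e. rendezvous-style consistent hashing. The choice of MD5 over the concatenated key and machine name, and the node names in the example, are assumptions; the slides only specify the argmin rule.

```python
import hashlib

def h(key, machine):
    """Hash of the (key, machine) pair; any well-mixed hash works."""
    return int(hashlib.md5(f"{key}:{machine}".encode()).hexdigest(), 16)

def assign(key, machines):
    """m(x) = argmin over machines of h(x, m)."""
    return min(machines, key=lambda m: h(key, m))

machines = ["node-0", "node-1", "node-2", "node-3"]      # hypothetical names
print(assign("word_topic_counts/some_word", machines))
# Removing one machine only remaps the keys that lived on it.
```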

  34. Distribution & fault tolerance
  • Storage is O(1/k) per machine
  • Communication is O(1) per machine
  • Fast snapshots: O(1/k) per machine (stop sync and dump the state per vertex)
  • O(k) open connections per machine
  • O(1/k) throughput per machine
  • m(x) = argmin_{m ∈ M} h(x, m)

  37. Synchronization
  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
    • Schedule message pairs
    • Communicate with r random machines simultaneously
  (Diagram: local and global copies exchanging messages, r = 1.)

  40. Synchronization
  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
    • Schedule message pairs
    • Communicate with r random machines simultaneously
  (Diagram: with r = 2 the efficiency lies between 0.78 and 0.89.)

  41. Synchronization
  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
    • Schedule message pairs
    • Communicate with r random machines simultaneously
    • Use a Luby-Rackoff PRPG for load balancing
  • Efficiency guarantee: 4 simultaneous connections are sufficient

  42. Scalability

  43. Summary: Variable Replication
  • Global shared variable
  (Diagram: each computer holds a local copy of the shared variables x, y, z and synchronizes it with the distributed (key, value) table.)
  • Make a local copy
  • Distributed (key, value) storage table for the global copy
  • Do all bookkeeping locally (store old versions)
  • Sync local copies asynchronously using message passing (no global locks are needed)
  • This is an approximation!

  44. Summary: Asymmetric Message Passing
  • Large global shared state space (essentially as large as the memory in the computer)
  • Distribute the global copy over several machines (distributed (key, value) storage)
  (Diagram: global state partitioned across machines; each machine keeps a current copy and an old copy.)

  45. Summary: Out-of-core storage
  • Very large state space
  • Gibbs sampling requires us to traverse the data sequentially many times (think 1000x)
  • Stream local data from disk and update the coupling variable each time local data is accessed (see the sketch below)
  • This is exact
  (Diagram: as in the multicore architecture, a token file feeds the samplers, a combiner/updater maintains the state, and counts and topics are written back to file.)
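A sketch of one out-of-core sweep: documents and their current topic assignments are streamed from disk, resampled against the in-memory count tables (the coupling variables), and written back. The JSON-lines layout and the `resample_document` helper are assumptions for illustration.

```python
import json

def sweep(in_path, out_path, doc_topic, word_topic, resample_document):
    """One pass over the on-disk corpus; documents never all sit in memory."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:                               # one document per line
            doc = json.loads(line)
            resample_document(doc, doc_topic, word_topic)   # updates counts in place
            fout.write(json.dumps(doc) + "\n")

# A full run is simply many sweeps over the same corpus (think 1000x),
# alternating input and output paths between iterations.
```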

  46. Advanced Modeling

  47. Advances in Representation

  48. Extensions to topic models
  • Prior over the document topic vector
    • Usually a Dirichlet distribution
    • Use correlation between topics (CTM)
    • Hierarchical structure over topics
  • Document structure
    • Bag of words
    • n-grams (Li & McCallum)
    • Simplicial Mixture (Girolami & Kaban)
  • Side information
    • Upstream conditioning (Mimno & McCallum)
    • Downstream conditioning (Petterson et al.)
    • Supervised LDA (Blei and McAuliffe 2007; Lacoste, Sha and Jordan 2008; Zhu, Ahmed and Xing 2009)

  49. Correlated topic models
  • Dirichlet distribution
    • Can only model which topics are hot
    • Does not model relationships between topics
  • Key idea
    • We expect to see documents about sports and health, but not about sports and politics
    • Use a logistic normal distribution as the prior
  • Conjugacy is no longer maintained
  • Inference is harder than in LDA
  Blei & Lafferty 2005; Ahmed & Xing 2007

  51. Dirichlet prior on topics

  52. Log-normal prior on topics
  θ = e^(η − g(η))  with  η ∼ N(µ, Σ)  (sampling sketch below)
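A small sketch of drawing a topic vector from this prior, taking g(η) to be the log-normalizer log Σ_k exp(η_k) as in the correlated topic model, so θ is the softmax of a Gaussian draw and lies on the simplex; the covariance Σ is what encodes correlations between topics. The example numbers are illustrative only.

```python
import numpy as np

def sample_topic_vector(mu, Sigma, rng=None):
    rng = rng or np.random.default_rng(0)
    eta = rng.multivariate_normal(mu, Sigma)          # η ∼ N(µ, Σ)
    theta = np.exp(eta - np.logaddexp.reduce(eta))    # θ = exp(η − g(η)), sums to 1
    return theta

# Example: two positively correlated topics and one independent topic.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
print(sample_topic_vector(mu, Sigma))
```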

  53. Correlated topics Blei and Lafferty 2005

  54. Correlated topics

  55. Pachinko Allocation • Model the prior as a Directed Acyclic Graph • Each document is modeled as multiple paths • To sample a word, first select a path and then sample a word from the final topic • The topics reside on the leaves of the tree Li and McCallum 2006

  56. Pachinko Allocation Li and McCallum 2006

  57. Topic Hierarchies • Topics can appear anywhere in the tree • Each document is modeled as • Single path over the tree (Blei et al., 2004) • Multiple paths over the tree (Mimno et al., 2007)

  58. Topic Hierarchies Blei et al. 2004

  59. Topical n-grams
  • Documents as bag of words
  • Exploit sequential structure
  • N-gram models
    • Capture longer phrases
    • Switch variables to determine segments
    • Dynamic programming needed
  Girolami & Kaban, 2003; Wallach, 2006; Wang & McCallum, 2007

  60. Topic n-grams

  61. Side information
  • Upstream conditioning (Mimno et al., 2008)
    • Document features are informative for topics
    • Estimate the topic distribution e.g. based on authors, links, timestamp
  • Downstream conditioning (Petterson et al., 2010)
    • Word features are informative for topics
    • Estimate the topic distribution for words e.g. based on dictionary, lexical similarity, distributional similarity
  • Class labels (Blei and McAuliffe 2007; Lacoste, Sha and Jordan 2008; Zhu, Ahmed and Xing 2009)
    • Joint model of unlabeled data and labels
    • Joint likelihood: semisupervised learning done right!

  62. Downstream conditioning Europarl corpus without alignment

  63. Recommender Systems Agarwal & Chen, 2010

  64. Chinese Restaurant Process
  (Diagram: restaurant tables with parameters φ1, φ2, φ3 and customers seated around them.)

  65. Problem
  • How many clusters should we pick?
  • How about a prior for infinitely many clusters?
  • Finite model
      p(y | Y, α) = (n(y) + α_y) / (n + Σ_{y'} α_{y'})
  • Infinite model (assume that the total smoother weight is constant)
      p(y | Y, α) = n(y) / (n + α)   and   p(new | Y, α) = α / (n + α)

  66. Chinese Restaurant Metaphor
  (Diagram: tables with parameters φ1, φ2, φ3; the rich get richer.)
  Generative process (see the sketch below): for each data point x_i
  • Choose an existing table j with probability ∝ m_j and sample x_i ~ f(φ_j)
  • Or choose a new table K+1 with probability ∝ α, sample φ_{K+1} ~ G_0, and sample x_i ~ f(φ_{K+1})
  Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.
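A short sketch of the generative process above. The base measure G_0 and likelihood f are left abstract on the slide; here, purely as an assumption for illustration, G_0 draws a Gaussian mean and f is a unit-variance Gaussian.

```python
import numpy as np

def crp_generate(n, alpha, rng=None):
    rng = rng or np.random.default_rng(0)
    table_of = []              # table index for each data point
    counts = []                # m_j: number of customers at table j
    params = []                # φ_j for each table
    data = []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                   # P(table j) ∝ m_j, P(new) ∝ α
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):                   # open a new table
            counts.append(0)
            params.append(rng.normal(0.0, 3.0))     # φ_{K+1} ~ G_0 (assumed Gaussian)
        counts[j] += 1                         # the rich get richer
        table_of.append(j)
        data.append(rng.normal(params[j], 1.0))     # x_i ~ f(φ_j) (assumed Gaussian)
    return np.array(data), table_of, params
```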
