A Blueprint of Standardized and Composable ML – Eric Xing and Zhiting Hu – PowerPoint PPT Presentation


SLIDE 1

A Blueprint of Standardized and Composable ML

Eric Xing and Zhiting Hu
Petuum & Carnegie Mellon

SLIDE 2

The universe of problems ML/AI is trying to solve

SLIDE 3

Data and experiences of all kinds

• Data examples
• Rewards
• Auxiliary agents
• Constraints, e.g., "Type-2 diabetes is 90% more common than type-1"
• Adversaries
• And all combinations of the above

SLIDE 4

How do human beings solve them ALL?

SLIDE 5

The Zoo of ML/AI Models

• Neural networks
  ◯ Convolutional networks: AlexNet, GoogleNet, ResNet
  ◯ Recurrent networks, LSTM
  ◯ Transformers: BERT, GPT-2
• Graphical models
  ◯ Bayesian networks
  ◯ Markov random fields
  ◯ Topic models, LDA
  ◯ HMM, CRF
• Kernel machines
  ◯ Radial basis function networks
  ◯ Gaussian processes
  ◯ Deep kernel learning
  ◯ Maximum margin, SVMs
• Decision trees
• PCA, Probabilistic PCA, Kernel PCA, ICA
• Boosting
SLIDE 6

The Zoo of algorithms and heuristics

• actor-critic
• imitation learning
• softmax policy gradient
• policy optimization
• posterior regularization
• constraint-driven learning
• regularized Bayes
• GANs
• active learning
• intrinsic reward
• inverse RL
• knowledge distillation
• energy-based GANs
• maximum likelihood estimation
• prediction minimization
• generalized expectation
• learning from measurements
• adversarial domain adaptation
• reinforcement learning as inference
• data augmentation
• data re-weighting
• label smoothing
• weak/distant supervision
• reward-augmented maximum likelihood

SLIDE 7

Really hard to navigate, and to realize

• Depending on individual expertise and creativity
• Bespoke, delicate pieces of art
• Like an airport with a different runway for every different type of aircraft

SLIDE 8

Physics in the 1800s

• Electricity & magnetism: Coulomb's law, Ampère, Faraday, ...
• Theory of light beams:
  ◯ Particle theory: Isaac Newton, Laplace, Planck
  ◯ Wave theory: Grimaldi, Christiaan Huygens, Thomas Young, Maxwell
• Law of gravity: Aristotle, Galileo, Newton, …

SLIDE 9

Maxwell's equations

Diverse electromagnetic theories, unified:

  ∂_ν F^{μν} = (4π/c) J^μ
  ε^{μνκλ} ∂_ν F_{κλ} = 0

Maxwell's Eqns:
• Original form
• Simplified w/ rotational symmetry
• Further simplified w/ the symmetry of special relativity

SLIDE 10

How about a blueprint of ML?

• Loss
• Optimization solver
• Model architecture

min_θ ℒ(θ)

SLIDE 11

How about a blueprint of ML?

• Loss
• Optimization solver
• Model architecture

min_{q, θ}:  Experience (𝔼) − Divergence (𝔻) − Uncertainty (ℍ)
(the three-term loss, made precise as the standard equation on Slide 23)

SLIDE 12

MLE at a close look:

• The most classical learning algorithm
• Supervised:
  ◯ Observe data D = { (x*, y*) }
  ◯ min_θ  − E_{(x*,y*)∼D}[ log p_θ(y*|x*) ]
  ◯ Solve with SGD (see the sketch below)
• Unsupervised:
  ◯ Observe D = { x* }; y is a latent variable
  ◯ Posterior p_θ(y|x)
  ◯ min_θ  − E_{x*∼D}[ log Σ_y p_θ(x*, y) ]
  ◯ Solve with EM:
    § E-step imputes the latent variable y through an expectation over the complete likelihood
    § M-step: supervised MLE
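A minimal numpy sketch of the supervised case, solving min_θ − E[ log p_θ(y*|x*) ] by SGD on a toy logistic model (the data, model, and hyperparameters are illustrative stand-ins, not from the slides):

import numpy as np

# Supervised MLE via SGD for logistic regression:
# min_theta -E_{(x*,y*)~D}[ log p_theta(y*|x*) ]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # observed inputs x*
true_w = np.array([1.0, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)              # observed labels y*

w = np.zeros(3)                                 # theta
lr = 0.1
for step in range(2000):
    i = rng.integers(len(X))                    # sample one (x*, y*)
    p = 1.0 / (1.0 + np.exp(-X[i] @ w))         # p_theta(y=1 | x*)
    w -= lr * (p - y[i]) * X[i]                 # SGD step on the neg. log-likelihood

print("learned w:", np.round(w, 2))             # aligned with true_w's direction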

SLIDE 13

MLE as Entropy Maximization

• Duality between supervised MLE and maximum entropy, when p_θ is an exponential family

Data as constraints, with features T(x, y):

  min_q  − ℍ(q)                                         (Shannon entropy ℍ)
  s.t.  E_q[ T(x, y) ] = E_{(x*,y*)∼D}[ T(x, y) ]

Solve w/ the Lagrangian method (a short derivation sketch follows below):

  q(x, y) = exp{ θ ⋅ T(x, y) } / Z(θ)                   (Lagrange multipliers θ)

  min_θ  − E_{(x*,y*)∼D}[ θ ⋅ T(x, y) ] + log Z(θ)      (the negative log-likelihood)
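A short derivation sketch in LaTeX (the standard MaxEnt-duality argument, using the features T and multipliers θ above; the multiplier of the normalization constraint on q is absorbed into Z):

\mathcal{L}(q, \theta)
  = -\mathbb{H}(q)
    - \theta \cdot \big( \mathbb{E}_q[T(x,y)] - \mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[T(x,y)] \big)
% Stationarity of L in q gives the exponential-family form:
q_\theta(x, y) = \exp\{ \theta \cdot T(x, y) \} \,/\, Z(\theta)
% Substituting q_theta back yields the dual problem, the negative log-likelihood:
\min_\theta \; -\mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[\, \theta \cdot T(x, y) \,] + \log Z(\theta)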

SLIDE 14

MLE as Entropy Maximization

• Unsupervised MLE can be achieved by maximizing the negative free energy:
• Introduce an auxiliary distribution q(y|x) (and then play with its entropy, cross entropy, etc.)

log Σ_y p_θ(x*, y) = E_{q(y|x*)}[ log ( p_θ(x*, y) / q(y|x*) ) ] + KL( q(y|x*) ‖ p_θ(y|x*) )
                   ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]  =: ℒ(q, θ)

SLIDE 15

Algorithms for Unsupervised MLE

min_θ  − E_{x*∼D}[ log Σ_y p_θ(x*, y) ]

1) Solve with EM (see the sketch below)

• E-step: Maximize ℒ(q, θ) w.r.t. q; equivalent to minimizing the KL term, by setting
    q(y|x*) = p_{θ_old}(y|x*)
• M-step: Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q(y|x*)}[ log p_θ(x*, y) ]

log Σ_y p_θ(x*, y) = E_{q(y|x*)}[ log ( p_θ(x*, y) / q(y|x*) ) ] + KL( q(y|x*) ‖ p_θ(y|x*) )
                   ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]
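A minimal numpy sketch of the two EM steps, assuming a toy two-component 1-D Gaussian mixture with unit variances and equal fixed weights (all stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])  # data x*

mu = np.array([-1.0, 1.0])                       # theta: the component means
for _ in range(50):
    # E-step: set q(y|x*) to the exact posterior p_theta(y|x*)
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2
    q = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: max_theta E_q[log p_theta(x*, y)] -> responsibility-weighted means
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print("estimated means:", np.round(mu, 2))       # close to (-2, 3)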

SLIDE 16

Algorithms for Unsupervised MLE (cont'd)

2) When the model p_θ is complex, directly working with the true posterior p_θ(y|x*) is intractable ⇒ Variational EM

§ Consider a sufficiently restricted family Q of q(y|x) so that minimizing the KL is tractable
  ◯ E.g., parametric distributions, factorized distributions
§ E-step: Maximize ℒ(q, θ) w.r.t. q ∈ Q; equivalent to minimizing the KL
§ M-step: Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q(y|x*)}[ log p_θ(x*, y) ]

(same lower-bound decomposition of log Σ_y p_θ(x*, y) as above)

SLIDE 17

Algorithms for Unsupervised MLE (cont'd)

3) When q is complex, e.g., deep NNs, optimizing q in the E-step is difficult (e.g., high variance) ⇒ the Wake-Sleep algorithm [Hinton et al., 1995]

• Wake-phase (M-step): Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q_φ(y|x*)}[ log p_θ(x*, y) ]
• Sleep-phase (E-step): minimize the reverse KL:
    min_φ  KL( p_θ(y|x*) ‖ q_φ(y|x*) )

Other tricks: the reparameterization trick in VAEs ('2014), control variates in NVIL ('2014) (see the sketch below)

(same lower-bound decomposition of log Σ_y p_θ(x*, y) as above)
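A minimal numpy sketch of the reparameterization trick mentioned above, assuming a toy Gaussian q_φ(y|x) and a conjugate Gaussian model p_θ(x, y) = N(y; 0, 1) N(x; y, 1) so the exact posterior is known (all stand-ins, not from the slides):

import numpy as np

# With q_phi(y|x) = N(mu, sigma^2), write y = mu + sigma * eps, eps ~ N(0, 1),
# so the E-step gradient flows through the sample with low variance.
rng = np.random.default_rng(0)
x_star = 1.5                                     # one observation
mu, log_sigma = 0.0, 0.0                         # phi
lr = 0.02

for _ in range(5000):
    eps = rng.normal()
    sigma = np.exp(log_sigma)
    y = mu + sigma * eps                         # reparameterized sample
    dlogp_dy = -y + (x_star - y)                 # d log p_theta(x*, y) / dy
    # Ascend the ELBO = E_q[log p_theta(x*, y)] + H(q); dH/dlog_sigma = 1
    mu += lr * dlogp_dy
    log_sigma += lr * (dlogp_dy * sigma * eps + 1.0)

# The exact posterior here is N(0.75, 0.5), i.e., sigma ~ 0.71
print("q(y|x*) ~ N(%.2f, %.2f^2)" % (mu, np.exp(log_sigma)))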

SLIDE 18

Quick summary of MLE

• Supervised:
  ◯ Duality with MaxEnt
  ◯ Solve with SGD, IPF, …
• Unsupervised:
  ◯ Lower bounded by the negative free energy
  ◯ Solve with EM, VEM, Wake-Sleep, …
• Close connections to MaxEnt
• With MaxEnt, algorithms (e.g., EM) arise naturally

SLIDE 19

Posterior Regularization (PR) [Ganchev et al., 2010]

• Make use of constraints in Bayesian learning
  ◯ An auxiliary posterior distribution q(y)
  ◯ A slack variable ξ and constant weights α = β > 0
  ◯ E.g., max-margin constraints for linear regression [Jaakkola et al., 1999] and for general models (e.g., LDA, NNs) [Zhu et al., 2014] -- more later

min_{q, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
s.t.  − E_q[ f(x, y) ] ≤ ξ

• Solution for q:

  q*(y) = exp{ ( β log p_θ(x, y) + f(x, y) ) / α } / Z

SLIDE 20

More general learning leveraging PR

• No need to limit to Bayesian learning
• E.g., complex rule constraints on general models [Hu et al., 2016], where
  ◯ q can be over arbitrary variables, e.g., q(x, y)
  ◯ p_θ(x, y) is an NN of arbitrary architecture with parameters θ

min_{q, θ, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
s.t.  E_q[ 1 − r(x, y) ] ≤ ξ

E.g., r(x, y) is a 1st-order logic rule: if sentence x contains the word "but", then its sentiment y is the same as the sentiment after "but".

SLIDE 21

EM for the general PR

• Rewrite without the slack variable:

  min_{q, θ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] − E_q[ f(x, y) ]

• Solve with EM (see the sketch below):
  § E-step:  q(x, y) = exp{ ( β log p_θ(x, y) + f(x, y) ) / α } / Z
  § M-step:  min_θ  − E_q[ log p_θ(x, y) ]
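A minimal numpy sketch of the closed-form E-step teacher on a small discrete space (p_θ and the rule score f below are toy stand-ins, not from the slides):

import numpy as np

# Teacher: q(x, y) ∝ exp{ (beta * log p_theta(x, y) + f(x, y)) / alpha }
alpha, beta = 1.0, 1.0
p_theta = np.array([[0.3, 0.2],                  # toy p_theta(x, y) on a 2x2 space
                    [0.1, 0.4]])
f = np.array([[0.0, 1.0],                        # toy rule score f(x, y)
              [1.0, 0.0]])

q = np.exp((beta * np.log(p_theta) + f) / alpha)
q /= q.sum()                                     # the normalized teacher

# The M-step then fits theta to this teacher: min_theta -E_q[log p_theta]
print(np.round(q, 3))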

SLIDE 22

Reformulating unsupervised MLE with PR

• Introduce an arbitrary q(y|x):

  log Σ_y p_θ(x*, y) ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]

• Rewrite in the constrained PR form, with α = β = 1:

  min_{q, θ, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
  s.t.  − E_q[ f(x; D) ] ≤ ξ

  f(x; D) := log E_{x*∼D}[ 𝟙_{x*}(x) ]

• Data as constraint: given x ∼ D, this constraint doesn't influence the solution of q and θ
  § A constraint saying x must equal one of the true data points
  § Or, alternatively, the (log) expected similarity of x to the dataset D, with 𝟙(⋅) as the similarity measure (we'll come back to this later)

SLIDE 23

The standard equation

min_{q, θ, ξ≥0}  β D( q(x, y), p_θ(x, y) ) − α ℍ(q) + U(ξ)
s.t.  − E_{q(x,y)}[ f(x, y) ] ≤ ξ

Equivalently:

min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

3 terms:
• Experiences (exogenous regularizations), f(x, y): e.g., data examples, rules (the "textbook")
• Divergence (fitness), D( q(x, y), p_θ(x, y) ): e.g., cross entropy, between the teacher q(x, y) and the student p_θ(x, y)
• Uncertainty (self-regularization), ℍ(q): e.g., Shannon entropy

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 24

Re-visit unsupervised MLE under SE

f := f(x; D) = log E_{x*∼D}[ 𝟙_{x*}(x) ]
α = β = 1,  q = q(y|x)

min_{q, θ}  − α ℍ(q) − E_q[ f(x; D) ] − β E_q[ log p_θ(x, y) ]

SLIDE 25

Re-visit supervised MLE under SE

f := f(x, y; D) = log E_{(x*,y*)∼D}[ 𝟙_{(x*,y*)}(x, y) ]
α = 1,  β = ε

min_{q, θ}  − α ℍ(q) − E_{q(x,y)}[ f(x, y; D) ] − β E_q[ log p_θ(x, y) ]

SLIDE 26

Active learning under SE

f := f(x, y; D_oracle) + u(x)
α = ρ (> 0),  β = ε

f(x, y; D_oracle) = log E_{x*∼D, y*∼oracle(x*)}[ 𝟙_{(x*,y*)}(x, y) ]

u(x): the prediction uncertainty on x, e.g., the Shannon entropy H( p_θ(y|x) )

Equivalent to (see the sketch below):
• Draw a data point x* according to exp{ u(x) / ρ }
• Get the label y* for x* from the oracle
• Maximize the log-likelihood on (x*, y*)

min_{q, θ}  − α ℍ(q) − E_{q(x,y)}[ f(x, y) ] − β E_q[ log p_θ(x, y) ]
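A minimal numpy sketch of the sampling view above: draw the query with probability proportional to exp{ u(x)/ρ }, with predictive Shannon entropy as u (the pool probabilities are toy stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
pool_probs = np.array([[0.95, 0.05],             # toy p_theta(y|x) per pool item
                       [0.55, 0.45],
                       [0.50, 0.50],
                       [0.80, 0.20]])
rho = 0.1

u = -(pool_probs * np.log(pool_probs)).sum(axis=1)   # predictive entropy u(x)
w = np.exp(u / rho)
i = rng.choice(len(pool_probs), p=w / w.sum())       # draw the query x*

# Next: get y* from the oracle and run supervised MLE on (x*, y*)
print("query index:", i)                             # favors the 0.50/0.50 item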

SLIDE 27

Reinforcement learning (RL) under SE -- I

• RL-as-inference [Dayan '97; Levine '18, …]

α = β = ρ (> 0)
f(s, a) := Q^{π_θ}(s, a)

min_{q, θ}  − α ℍ(q) − E_{q(s,a)}[ f(s, a) ] − β E_q[ log π_θ(s, a) ]

• Map to RL language:
  ◯ x: state s;  y: action a
  ◯ μ(s): state distribution
  ◯ Q^{π_θ}(s, a): the expected future reward of taking action a in state s and continuing the current policy π_θ:
      Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ]

SLIDE 28

Reinforcement learning (RL) under SE -- II

• Policy gradient

α = β = 1
f(s, a) := log Q^{π_θ}(s, a)

min_{q, θ}  − α ℍ(q) − E_{q(s,a)}[ f(s, a) ] − β E_q[ log π_θ(s, a) ]

E-step:
  q(s, a) = μ(s) π_{θ_old}(a|s) Q^{π_{θ_old}}(s, a) / Z

M-step:
  E_{q(s,a)}[ ∇_θ log π_θ(a|s) ]
    = 1/Z ⋅ E_{μ(s) π_θ(a|s)}[ Q^{π_θ}(s, a) ∇_θ log π_θ(a|s) ]   (importance sampling est.)
    = 1/Z ⋅ ∇_θ E_{μ(s) π_θ(a|s)}[ Q^{π_θ}(s, a) ]                (log-derivative trick)
  i.e., the conventional policy gradient objective (see the sketch below)

• Map to RL language:
  ◯ x: state s;  y: action a
  ◯ μ(s): state distribution
  ◯ Q^{π_θ}(s, a): the expected future reward of taking action a in state s and continuing the current policy:
      Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ]
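A minimal numpy sketch of the resulting policy-gradient (log-derivative trick) update on a one-step bandit, with a running-average baseline added for stability (all toy stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 0.2, 0.5])                    # toy reward per action
theta = np.zeros(3)                              # logits of a softmax policy
lr, baseline = 0.2, 0.0

for _ in range(3000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                               # pi_theta(a)
    a = rng.choice(3, p=pi)                      # sample an action
    grad_logp = -pi
    grad_logp[a] += 1.0                          # grad_theta log pi_theta(a)
    theta += lr * (r[a] - baseline) * grad_logp  # policy-gradient ascent step
    baseline += 0.1 * (r[a] - baseline)          # running-average baseline

pi = np.exp(theta - theta.max())
print("final policy:", np.round(pi / pi.sum(), 3))   # mass concentrates on action 0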

SLIDE 29

Adversarial learning under SE

• Same as supervised MLE: f := f(x; D), α = 1, β = ε
  ◯ For notational simplicity, we use x to replace (x, y)

  min_{q, θ}  − α ℍ(q) − E_q[ f(x) ] + β D( q(x), p_θ(x) )

• The M-step is to  min_θ  D( q(x), p_θ(x) )
• Solve with probability functional descent (PFD) [Chu et al., 2019]:
  ◯ p_θ(x) can be optimized by minimizing E_{p_θ}[ Ψ(x) ], where Ψ(x) is the influence function of D at p_θ
  ◯ Ψ is obtained with convex duality (D*: the convex conjugate of D):
      Ψ(x) = argmax_φ  E_{p_θ}[ φ(x) ] − D*(φ)
  ◯ So the whole optimization is:
      min_θ max_φ  E_{p_θ}[ φ(x) ] − D*(φ)

SLIDE 30

Adversarial learning under SE (cont'd)

• Same setup as Slide 29: f := f(x; D), α = 1, β = ε; the M-step min_θ D( q(x), p_θ(x) ) is solved with PFD [Chu et al., 2019]:
    min_θ max_φ  E_{p_θ}[ φ(x) ] − D*(φ)
• Parameterize φ with an NN C_ϕ. E.g., when D is the JSD, set
    φ_ϕ(x) := 0.5 log( 1 − C_ϕ(x) ) − 0.5 log 2
  Plugging this into the equation recovers vanilla GAN training.
slide-31
SLIDE 31

31

Adversarial learning under SE – alternative interpretation

  • Recall in MLE, ! is a fixed function
  • Intuitively, see ! as a similarity metric that measures similarity of sample

" against real data #

  • Instead of the above manually fixed metric, can we learn

rn a metric !

$?

31

! ≔ !(" ; #) = log -"∗∼#

0"∗ "

min

4, 6 − 8ℍ : + <=

1 − -4 " 1 ! " : " , ?6 "

SLIDE 32

Adversarial learning under SE – alternative interpretation

• Augment the standard objective to account for f_φ:

  min_θ max_φ min_q  − α ℍ(q) + β D( q(x), p_θ(x) ) − E_q[ f_φ(x) ] + E_{p_data}[ f_φ(x) ]

• Set α = 0, β = 1. Under mild conditions, the objective recovers:
  ◯ Vanilla GAN [Goodfellow et al., 2014], when D is the JS-divergence and f_φ is a binary classifier (see the sketch below)
  ◯ f-GAN [Nowozin et al., 2016], when D is an f-divergence
  ◯ W-GAN [Arjovsky et al., 2017], when D is the Wasserstein distance and f_φ is a 1-Lipschitz function

* Proofs adapted from Farnia & Tse 2018: "A Convex Duality Framework for GANs"
slide-33
SLIDE 33

33

More algorithms recovered by SE

  • Data augmentation / re-weighting / RAML
  • Unified EM (UEM) / Constraint-driven learning (CoDL)
  • Curiosity-driven RL
  • Knowledge distillation

33

!"

RL as inference MLE

augment RAML(’16) re-weighting active learning curiosity-driven RL(’91)

posterior regularization CoDL(’07) UEM(’12) GANs knowledge distillation

SLIDE 34

A table of ALL models/paradigms

Paradigms not (yet) covered by SE:
• Meta-learning
• Lifelong learning
• Interesting future work to study the connections

SLIDE 35

Learning with ALL experiences

• Distinct experiences are used in learning in the same way:

  f = λ₁ ⋅ f₁(⋅) + λ₂ ⋅ f₂(⋅) + λ₃ ⋅ f₃(⋅) + λ₄ ⋅ f₄(⋅) + ⋯

• Focus on what experiences to use, instead of worrying about how to use them
• Plug arbitrary available experiences into the learning procedure!

SLIDE 36

The zoo of optimization solvers

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

Optimization of the loss, subject to q ∈ the probability simplex. Convex in q when α, β > 0 and D is convex.

• Just as the standard equation is a master loss for many paradigms, is there a master solver for optimizing the loss?
• No such general algorithm (yet)
• Alternating gradient descent:
  ◯ Most widely used
  ◯ EM, Variational EM (variational inference), Wake-Sleep, …

SLIDE 37

The extended EM as a primal solver

• A generalization of the classic Variational EM, when α, β > 0 and D = CE (cross entropy):

  (1) Generalized E-step (Teacher): the reference distribution in closed form, supporting all types of experiences:

      q(t) = exp{ ( β log p_θ(t) + f(t; D) ) / α } / Z

  (2) M-step (Student): matching the model to the reference (see the sketch below):

      min_θ  − E_{q(t)}[ log p_θ(t) ]

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t; D) ]

• Limitations: e.g., not applicable when D is another divergence measure
• The EM template has been further enhanced/adapted in different ways in various paradigms:
  ◯ in RL: TRPO, PPO, MaxEnt inverse RL, …
  ◯ in GANs: many extensions to stabilize training
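A minimal numpy sketch of this teacher-student alternation with D = cross entropy on a small discrete space (the softmax model and experience scores are toy stand-ins, not from the slides):

import numpy as np

alpha, beta = 1.0, 1.0
f = np.array([0.0, 2.0, 0.0, 1.0])               # toy experience score per outcome t
logits = np.zeros(4)                             # theta of a softmax p_theta(t)
lr = 0.5

for _ in range(200):
    log_p = logits - np.log(np.exp(logits).sum())          # log p_theta(t)
    # Generalized E-step (teacher): q(t) ∝ exp{(beta*log_p + f)/alpha}
    q = np.exp((beta * log_p + f) / alpha)
    q /= q.sum()
    # M-step (student): min_theta -E_q[log p_theta(t)]; softmax gradient = p - q
    logits -= lr * (np.exp(log_p) - q)

# Repeated alternation concentrates p_theta on high-f outcomes
print("p_theta:", np.round(np.exp(logits) / np.exp(logits).sum(), 3))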
SLIDE 38

Some "advanced" (specialized) techniques

• Alternating GD:
  ◯ EM, Variational EM (variational inference), Wake-Sleep, …
  ◯ SGD, back-propagation (BP)
• Convex duality, Lagrangian methods -- kernel tricks
• Integer linear programming (ILP)
• Probability functional descent (PFD) [Chu et al., 2019] -- influence functions; gives a neat formulation of GAN-like optimization and a few others

Optimization of the loss, subject to q ∈ the probability simplex. Convex in q when α, β > 0 and D is convex:

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

SLIDE 39

I: Duality

• Structured MaxEnt Discrimination (SMED) [Zhu and Xing, 2013]
• Solve the (primal) Lagrangian:

  min_{q, ξ≥0}  − α ℍ(q) − β E_q[ log p(w) ] + U(ξ)
  s.t.  − E_q[ ΔF_i(y; w) − Δℓ_i(y) ] ≤ ξ_i   ∀ i

• Solve for the Lagrange multipliers λ from the dual problem (when p(w) = N(w; 0, I) and U(ξ) = Σ_i ξ_i):

  max_λ  Σ_{i, y≠y_i*} λ_i(y) Δℓ_i(y) − (1/2) ‖ Σ_{i, y≠y_i*} λ_i(y) Δf_i(y) ‖²

  ◯ Allows the kernel trick for nonlinear interactions b/w experiences

• Solution:

  q(w) = (1/Z(λ)) exp{ β log p(w) + Σ_{i, y≠y_i*} λ_i(y) ( ΔF_i(y; w) − Δℓ_i(y) ) }

SLIDE 40

II: Influence Function and Probability Functional Descent  [Chu et al., 2019]

• Gradient descent in the space of probability measures 𝒫(X):

  min_{ρ ∈ 𝒫(X)} ℐ(ρ),   where ℐ: 𝒫(X) → ℝ is a probability functional

• Influence function Ψ_ρ(x): the Gateaux differential of ℐ at ρ in the direction χ = ν − ρ is

  dℐ_ρ(χ) = ∫_X Ψ_ρ(x) dχ(x) = E_ν[ Ψ_ρ(x) ] − E_ρ[ Ψ_ρ(x) ]

• With a linear approximation ℐ̂(ρ) to ℐ(ρ) around ρ₀:

  ℐ̂(ρ) = ℐ(ρ₀) + dℐ_{ρ₀}(ρ − ρ₀) = E_{x∼ρ}[ Ψ_{ρ₀}(x) ] + const.

• Thus, once we obtain the influence function, we can optimize ρ by decreasing E_{x∼ρ}[ Ψ_{ρ₀}(x) ]

SLIDE 41

Adversarial learning using PFD  [Chu et al., 2019]

ℐ(p_θ) = D( q(x), p_θ(x) )

◯ Often no closed-form influence function, e.g., when D is the JSD or the W-distance
◯ Approximate with convex duality:
  § Convex conjugate:  ℐ*(φ) = sup_ρ ∫ φ(x) dρ(x) − ℐ(ρ)
  § The influence function is obtained via:  Ψ_{p_θ} = argmax_φ  E_{x∼p_θ}[ φ(x) ] − ℐ*(φ)
  § Parameterize φ as below to recover the optimization of generator and discriminator:
      φ_ϕ(x) := 0.5 log( 1 − C_ϕ(x) ) − 0.5 log 2
      Ψ ≈ argmax_ϕ  E_{p_data}[ log C_ϕ ] + E_{p_θ}[ log( 1 − C_ϕ ) ]
◯ The whole optimization of ℐ(p_θ) is thus:
      min_θ max_ϕ  E_{p_data}[ log C_ϕ ] + E_{p_θ}[ log( 1 − C_ϕ ) ]

SLIDE 42

RL using PFD  [Chu et al., 2019]

• E.g., policy iteration in RL
  ◯ (Conventional) loss:  ℐ(π_θ) = − E_{μ(s)} E_{π_θ(a|s)}[ Q(s, a) ]
  ◯ Influence function:  Ψ_{π_θ}(a) = − E_{μ(s)}[ Q(s, a) ]
  ◯ Thus, optimize π_θ by minimizing  E_{π_θ}[ Ψ_{π_θ}(a) ] = − E_{μ(s)} E_{π_θ(a|s)}[ Q(s, a) ]

μ(s): state distribution;  π_θ(a|s): policy

SLIDE 43

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

SLIDE 44

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

AlexNet: 8 layers; VGG: 19 layers; GoogleNet: 22 layers; ResNet: 152 layers

• Activation functions
  ◯ Linear and ReLU ◯ Sigmoid and tanh ◯ Etc.
• Layers
  ◯ Fully connected ◯ Convolutional & pooling ◯ Recurrent ◯ ResNets ◯ Etc.

SLIDE 45

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

Neural network components:
• RNN: cells (LSTM, GRU, …); recurrent attention (Bahdanau, Luong, …)
• FFNetwork: layers (Conv, Dense, …)
• Classifier
• Encoder, Decoder, Encoder-Decoder
• Transformer: multi-head attention
• Embedder: WordEmbedder, PositionEmbedder

SLIDE 46

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Figure: graphical model design; courtesy: Sutton & McCallum, 2010]

SLIDE 47

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Diagram (a)-(g): compositional architectures assembled from modules, where E refers to encoder, D to decoder, C to classifier, A to attention, Prior to prior distribution, and M to memory]

SLIDE 48

Summary: a blueprint of ML

• Loss
  ◯ The standard equation:

    min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

• Algorithm
  ◯ The extended EM algorithm gives a general primal solution in many cases
  ◯ PFD gives a neat formulation for some cases (e.g., GANs)
• Model architecture: a vast library of building blocks → compositionality

Next: practical implications of the ML blueprint

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 49

Why is this useful?

• Learning with ALL experiences
• Complex interactions between experiences
• Multi-agent, game-theoretic learning using all experiences
SLIDE 50

Learning with ALL experiences: Empowering algorithms

• Unifying perspective on diverse paradigms (each tailored to a specific type of experience) under SE
• Combining or integrating different experiences
• Re-use or repurpose originally specialized algorithms
  ◯ Systematic idea transfer and solution exchange
  ◯ Solving challenges in one paradigm by applying well-known solutions from another
  ◯ Accelerating innovation across research areas

SLIDE 51

Learning with ALL experiences: Empowering algorithms – Ex.1

• Rules in PR ⇔ Reward in RL
• Empower reward-learning algorithms to learn rules [Hu et al., 2018]

SLIDE 52

Learning with ALL experiences: Empowering algorithms – Ex.2

• Data in supervised MLE ⇔ Reward in RL
• Empower reward-learning algorithms to learn data augmentation [Hu et al., 2019]

SLIDE 53

Learning with ALL experiences: Empowering algorithms – Ex.3

• GANs ⇔ RL ⇔ VI
• Empower RL/VI algorithms (e.g., PPO) to stabilize GAN training [Wu et al., 2020]

SLIDE 54

Learning with ALL experiences: Empowering algorithms – Ex.3 (cont'd)

• GANs ⇔ RL ⇔ VI
• Empower RL/VI algorithms (e.g., PPO) to stabilize GAN training [Wu et al., 2020]
  ◯ (a) Re-use the PPO objective for GAN training: discourage excessively large updates by "trapping" the update size around 1
  ◯ (b) Re-use importance weighting in a VI perspective: greatly reduced variance in both the generator and discriminator losses
• Improved performance on a range of problems, including image generation, text generation, and text style transfer

SLIDE 55

Learning with ALL experiences: Experience compositionality – Ex.1

• Distinct experiences are all modeled with f(x, y)
• Combine and plug different f functions into SE to drive learning (see the sketch below):

  min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

  f = λ₁ ⋅ f^data + λ₂ ⋅ f^rule + λ₃ ⋅ f^reward + ⋯

• Enables applications in controllable content generation

Controllable text generation [Hu et al., 2017; Yang et al., 2018]:
  f = sentiment classifier + linguistic rules + language model

Controlling sentiment:
  Pos: "The film is full of imagination!"
  Neg: "The film is strictly routine!"
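A minimal Python sketch of this experience compositionality; the three component scorers and weights are hypothetical stand-ins for the sentiment classifier, linguistic rules, and language model named above:

# Combine per-experience scorers into a single f = sum_i lambda_i * f_i
def f_classifier(x, y):
    return 0.7        # e.g., log-prob that text x carries target sentiment y (stand-in)

def f_rule(x, y):
    return 1.0        # e.g., 1.0 if the "but"-style linguistic rule holds (stand-in)

def f_lm(x, y):
    return -2.3       # e.g., language-model log-likelihood of x (stand-in)

lambdas = (1.0, 0.5, 0.1)

def f(x, y):
    # The combined f is what gets plugged into the standard equation.
    parts = (f_classifier(x, y), f_rule(x, y), f_lm(x, y))
    return sum(l * p for l, p in zip(lambdas, parts))

print(f("the film is full of imagination", "positive"))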

SLIDE 56

Learning with ALL experiences: Experience compositionality – Ex.2

• Distinct experiences are all modeled with f(x, y)
• Combine and plug different f functions into SE to drive learning:

  min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

  f = λ₁ ⋅ f^data + λ₂ ⋅ f^rule + λ₃ ⋅ f^reward + ⋯

• Enables applications in controllable content generation

Fashion image generation [Hu et al., 2018]:
  f = (small) data + human gesture (pose) constraints

[Figure: a source image and generated images under different poses]

SLIDE 57

Learning with ALL experiences: Experience compositionality – Ex.2 (cont'd)

[Figure columns: source; target pose; true target; base model; + learned knowledge (ours); + fixed knowledge]

SLIDE 58

Operational compositionality

• Build ML applications like composing music
• Open-source toolkit for composable ML

[Diagram (a)-(g): compositional architectures of E/D/C/A/Prior/M modules, as on Slide 47]
[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 59

Texar Stack – Operationalized "View" of Composable ML

SLIDE 60

Composable ML with Texar

• Highly modularized programming
  ◯ Data, structure, loss, learning, … ◯ Intuitive conceptual-level APIs
• Easy switch between learning algorithms
  ◯ Plug modules in & out ◯ No changes to irrelevant parts

[Figure: the same decoder trained interchangeably with (a) a cross-entropy loss on data examples, (b) BLEU rewards through a policy-gradient agent, or (c) real/fake feedback from a discriminator]

SLIDE 61

Food for thought: How far would this take us?

• Physics

"It is only slightly overstating the case to say that physics is the study of symmetry."
  - Phil Anderson (1923-2020), physicist, Nobel laureate

SLIDE 62

Food for thought: How far would this take us?

• Physics: Maxwell's equations (1861) → general relativity (1910s) → the standard model (1970s) → a theory of everything
• Machine Learning: a unified way of thinking
  ✦ Systematic understanding ✦ Automated solution creation ✦ Improved ML accessibility

SLIDE 63

Toward unified theoretical analysis

• How do we characterize learning with different experiences?
  ◯ E.g., data examples, rules, rewards, auxiliary models (discriminators), …
  ◯ Combinations of the above experiences
• What's the appropriate statistical tool to characterize learning with logical rules? Can we guarantee performance improvement when using more experiences? What if experiences are noisy?
• A possible direction:
  ◯ Existing theoretical analyses deal with learning from data examples, online learning, reinforcement learning, … in silos
  ◯ With the standard equation, can we re-purpose those analyses to other paradigms, e.g., learning with logical rules?

SLIDE 64

References

[1] Jun Zhu, Ning Chen, and Eric P. Xing. 2014. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR (2014).
[2] Jun Zhu and Eric P. Xing. 2009. Maximum Entropy Discrimination Markov Networks. JMLR (2009).
[3] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In ACL.
[4] Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, Xuezhe Ma, et al. 2019. Texar: A modularized, versatile, and extensible toolkit for text generation. ACL (2019).
[5] Zhiting Hu, Bowen Tan, Russ R. Salakhutdinov, Tom M. Mitchell, and Eric P. Xing. 2019. Learning data manipulation for augmentation and weighting. In NeurIPS.
[6] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.
[7] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. On Unifying Deep Generative Models. In ICLR.
[8] Zhiting Hu, Zichao Yang, Russ R. Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P. Xing. 2018. Deep generative models with learnable knowledge constraints. In NeurIPS.

SLIDE 65

Thanks!