A Blueprint of Standardized and Composable ML
Eric Xing and Zhiting Hu, Petuum & Carnegie Mellon


  1. A Blueprint of Standardized and Composable ML. Eric Xing and Zhiting Hu, Petuum & Carnegie Mellon

  2. The universe of problems ML/AI is trying to solve

  3. Data and experiences of all kinds (e.g., "Type-2 diabetes is 90% more common than type-1")
     ● Data examples
     ● Rewards
     ● Constraints
     ● Auxiliary agents
     ● Adversaries
     ● ... and all combinations of these

  4. How do human beings solve them ALL?

  5. The Zoo of ML/AI Models
     ● Neural networks
       ◯ Convolutional networks: AlexNet, GoogLeNet, ResNet
       ◯ Recurrent networks, LSTM
       ◯ Transformers: BERT, GPT-2
     ● Kernel machines
       ◯ Radial basis function networks
       ◯ Gaussian processes
       ◯ Deep kernel learning
     ● Maximum margin
       ◯ SVMs
     ● Graphical models
       ◯ Bayesian networks
       ◯ Markov random fields
       ◯ Topic models, LDA
       ◯ HMM, CRF
     ● Decision trees
     ● PCA, probabilistic PCA, kernel PCA, ICA
     ● Boosting

  6. The Zoo of algorithms and heuristics: maximum likelihood estimation, reinforcement learning as inference, data re-weighting, inverse RL, active learning, policy optimization, reward-augmented maximum likelihood, data augmentation, actor-critic, softmax policy gradient, label smoothing, imitation learning, adversarial domain adaptation, posterior regularization, GANs, constraint-driven learning, knowledge distillation, intrinsic reward, generalized expectation, prediction minimization, regularized Bayes, learning from measurements, energy-based GANs, weak/distant supervision, ...

  7. Really hard to navigate, and to realize
     ● Depends on individual expertise and creativity
     ● Bespoke, delicate pieces of art
     ● Like an airport with a different runway for every different type of aircraft

  8. Physics in the 1800's
     ● Electricity & magnetism: Coulomb's law, Ampère, Faraday, ...
     ● Theory of light beams:
       ◯ Particle theory: Isaac Newton, Laplace, Planck
       ◯ Wave theory: Grimaldi, Christiaan Huygens, Thomas Young, Maxwell
     ● Law of gravity: Aristotle, Galileo, Newton, ...

  9. Maxwell's equations
     Diverse electromagnetic theories, in their original form, were simplified with rotational symmetry and further simplified with the symmetry of special relativity into:
     $\epsilon^{\mu\nu\kappa\lambda}\,\partial_\nu F_{\kappa\lambda} = 0, \qquad \partial_\nu F^{\mu\nu} = \frac{4\pi}{c}\, j^\mu$

  10. How about a blueprint of ML
      ● Loss
      ● Optimization solver
      ● Model architecture
      $\min_\theta \; \mathcal{L}(\theta)$: the loss $\mathcal{L}$, the optimization solver, and the model architecture behind $\theta$

  11. How about a blueprint of ML
      ● Loss
      ● Optimization solver
      ● Model architecture
      The blueprint replaces the single loss with three terms, Experience, Divergence, and Uncertainty (made precise on slide 23):
      $\min_{q, \theta} \; -\mathbb{E}_q[f] \;+\; \mathbb{D}(q, p_\theta) \;-\; \mathbb{H}(q)$

  12. MLE at a close look
      ● The most classical learning algorithm
      ● Supervised:
        ◯ Observe data $\mathcal{D} = \{(x^*, y^*)\}$
        ◯ $\min_\theta \; -\mathbb{E}_{(x^*, y^*) \sim \mathcal{D}}\big[\log p_\theta(y^* \mid x^*)\big]$
        ◯ Solve with SGD (see the sketch below)
      ● Unsupervised:
        ◯ Observe $\mathcal{D} = \{x^*\}$; $y$ is a latent variable
        ◯ $\min_\theta \; -\mathbb{E}_{x^* \sim \mathcal{D}}\big[\log \textstyle\sum_y p_\theta(x^*, y)\big]$
        ◯ Posterior $p_\theta(y \mid x^*)$
        ◯ Solve with EM:
          § E-step: imputes the latent variable $y$ through an expectation over the complete likelihood
          § M-step: supervised MLE
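The supervised case is just stochastic gradient descent on the negative log-likelihood. Below is a minimal sketch, not from the slides, using logistic regression on synthetic data as a stand-in for the model $p_\theta(y \mid x)$; all names and numbers are illustrative.

```python
# Minimal sketch: supervised MLE solved with SGD, with logistic regression on
# synthetic data standing in for p_theta(y|x). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                               # observed inputs x*
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)  # labels y*

theta, lr = np.zeros(3), 0.1
for epoch in range(100):
    for i in rng.permutation(len(X)):                       # SGD over the dataset D = {(x*, y*)}
        p = 1.0 / (1.0 + np.exp(-X[i] @ theta))             # p_theta(y = 1 | x*)
        theta -= lr * (p - y[i]) * X[i]                     # gradient step on -log p_theta(y*|x*)
print(theta)                                                # should approach w_true
```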

  13. MLE as Entropy Maximization
      ● Duality between supervised MLE and maximum entropy, when $p_\theta$ is an exponential family
      Maximize the Shannon entropy subject to the data as constraints (features $f(x, y)$):
      $\min_q \; -\mathbb{H}\big(q(x, y)\big) \quad \text{s.t.} \quad \mathbb{E}_q\big[f(x, y)\big] = \mathbb{E}_{(x^*, y^*) \sim \mathcal{D}}\big[f(x^*, y^*)\big]$
      ⇒ Solve with the Lagrangian method (written out below); with Lagrangian multiplier $\theta$:
      $q_\theta(x, y) = \exp\{\theta \cdot f(x, y)\} \,/\, Z(\theta)$
      $\min_\theta \; -\mathbb{E}_{(x^*, y^*) \sim \mathcal{D}}\big[\theta \cdot f(x^*, y^*)\big] + \log Z(\theta)$  (the negative log-likelihood)
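For reference, here is a sketch of the Lagrangian step the slide alludes to, under the slide's assumptions (exponential-family model, features $f$); naming the multipliers $\theta$ follows the slide's identification of multipliers with model parameters.

```latex
% MaxEnt problem with data-expectation constraints and its Lagrangian
\min_q\; -\mathbb{H}(q) \quad \text{s.t.}\quad
  \mathbb{E}_q[f(x,y)] = \mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[f(x^*,y^*)],\qquad
  \textstyle\sum_{x,y} q(x,y) = 1
% Lagrangian with multipliers theta (constraints) and mu (normalization)
\mathcal{L}(q,\theta,\mu) = -\mathbb{H}(q)
  - \theta\cdot\big(\mathbb{E}_q[f(x,y)] - \mathbb{E}_{\mathcal{D}}[f(x^*,y^*)]\big)
  + \mu\big(\textstyle\sum_{x,y} q(x,y) - 1\big)
% Stationarity in q gives the exponential-family form
q_\theta(x,y) = \exp\{\theta\cdot f(x,y)\}\, /\, Z(\theta)
% Substituting back yields the dual, i.e., the negative log-likelihood
\min_\theta\; -\mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[\theta\cdot f(x^*,y^*)] + \log Z(\theta)
```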

  14. MLE as Entropy Maximization
      ● Unsupervised MLE can be achieved by maximizing the negative free energy:
        ◯ Introduce an auxiliary distribution $q(y \mid x)$ (and then play with its entropy and cross entropy, etc.)
      $\log \textstyle\sum_y p_\theta(x^*, y) \;=\; \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big] + \mathbb{H}\big(q(y \mid x^*)\big) + \mathrm{KL}\big(q(y \mid x^*) \,\|\, p_\theta(y \mid x^*)\big)$
      $\;\ge\; \mathbb{H}\big(q(y \mid x^*)\big) + \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big]$  (a numeric check follows below)
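A tiny numeric sanity check of this decomposition on a made-up discrete joint $p_\theta(x, y)$ (the numbers are placeholders of my own choosing): the log-marginal equals the lower bound plus the KL term, so the bound is tight exactly when $q$ matches the true posterior.

```python
# Toy check: log p_theta(x*) = [ H(q) + E_q[log p_theta(x*, y)] ] + KL(q || p_theta(y|x*))
import numpy as np

p_xy = np.array([[0.3, 0.1],            # made-up joint p_theta(x, y); rows = x, cols = y
                 [0.2, 0.4]])
x_star = 0
p_x = p_xy[x_star].sum()                # marginal p_theta(x*)
post = p_xy[x_star] / p_x               # true posterior p_theta(y | x*)
q = np.array([0.6, 0.4])                # an arbitrary q(y | x*)

lower_bound = -(q * np.log(q)).sum() + (q * np.log(p_xy[x_star])).sum()
kl = (q * np.log(q / post)).sum()
print(np.log(p_x), lower_bound + kl)    # identical
print(lower_bound <= np.log(p_x))       # True: the free-energy lower bound
```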

  15. Algorithms for Unsupervised MLE
      $\min_\theta \; -\mathbb{E}_{x^* \sim \mathcal{D}}\big[\log \textstyle\sum_y p_\theta(x^*, y)\big]$
      $\log \textstyle\sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y \mid x^*)\big) + \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big] \;=:\; \mathcal{L}(q, \theta)$
      1) Solve with EM (a toy sketch follows below)
        § E-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $q$, equivalent to minimizing the KL, by setting $q(y \mid x^*) = p_{\theta^{\text{old}}}(y \mid x^*)$
        § M-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta \; \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big]$
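To make the E-step / M-step loop concrete, here is a compact sketch of EM for a two-component, unit-variance 1-D Gaussian mixture on synthetic data; this particular model and its parameters are my choice for illustration, not taken from the slides.

```python
# EM sketch: mixture of two unit-variance 1-D Gaussians, theta = (pi, mu).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])  # observed x*

pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])        # initial theta
for _ in range(50):
    # E-step: set q(y | x*) to the current posterior p_theta_old(y | x*)
    joint = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    q = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximize E_q[log p_theta(x*, y)] w.r.t. theta, in closed form here
    nk = q.sum(axis=0)
    pi, mu = nk / len(x), (q * x[:, None]).sum(axis=0) / nk
print(pi, mu)   # should approach mixing weights (0.3, 0.7) and means (-2, 3)
```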

  16. Algorithms for Unsupervised MLE (cont'd)
      $\log \textstyle\sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y \mid x^*)\big) + \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big] \;=:\; \mathcal{L}(q, \theta)$
      2) When the model $p_\theta$ is complex, directly working with the true posterior $p_\theta(y \mid x^*)$ is intractable ⇒ Variational EM
        § Consider a sufficiently restricted family $\mathcal{Q}$ of $q(y \mid x)$ so that minimizing the KL is tractable
          ◯ E.g., parametric distributions, factorized distributions
        § E-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $q \in \mathcal{Q}$, equivalent to minimizing the KL
        § M-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta \; \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big]$

  17. Algorithms for Unsupervised MLE (cont'd)
      $\log \textstyle\sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y \mid x^*)\big) + \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big] \;=:\; \mathcal{L}(q, \theta)$
      3) When $q$ is complex, e.g., deep NNs, optimizing $q$ in the E-step is difficult (e.g., high variance) ⇒ Wake-Sleep algorithm [Hinton et al., 1995]
      • Sleep-phase (E-step): reverse KL, $\min_q \; \mathrm{KL}\big(p_\theta(y \mid x^*) \,\|\, q(y \mid x^*)\big)$  (illustrated below)
      • Wake-phase (M-step): maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta \; \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big]$
      Other tricks: reparameterization in VAEs ('2014), control variates in NVIL ('2014)
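A small numeric illustration of why the sleep phase is called a "reverse" KL: between the same pair of distributions, the two KL directions generally give different values, so swapping the direction changes the objective being optimized. The two distributions below are arbitrary stand-ins.

```python
# The two KL directions between stand-ins for q(y|x*) and p_theta(y|x*).
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # stand-in for p_theta(y | x*)
q = np.array([0.4, 0.4, 0.2])        # stand-in for q(y | x*)

kl_q_p = (q * np.log(q / p)).sum()   # KL(q || p): variational E-step direction
kl_p_q = (p * np.log(p / q)).sum()   # KL(p || q): wake-sleep sleep-phase ("reverse") direction
print(kl_q_p, kl_p_q)                # the two values generally differ
```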

  18. Quick summary of MLE
      ● Supervised:
        ◯ Duality with MaxEnt
        ◯ Solve with SGD, IPF, ...
      ● Unsupervised:
        ◯ Lower bounded by the negative free energy
        ◯ Solve with EM, VEM, Wake-Sleep, ...
      ● Close connections to MaxEnt
      ● With MaxEnt, algorithms (e.g., EM) arise naturally

  19. Posterior Regularization (PR)
      ● Make use of constraints in Bayesian learning
        ◯ An auxiliary posterior distribution $q(y)$
        ◯ Slack variable $\xi$, constant weights $\alpha = \beta > 0$
      $\min_{q, \xi} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;+\; U\xi \quad \text{s.t.} \; -\mathbb{E}_q\big[f(x, y)\big] \le \xi$   [Ganchev et al., 2010]
        ◯ E.g., max-margin constraints for linear regression [Jaakkola et al., 1999] and general models (e.g., LDA, NNs) [Zhu et al., 2014] (more on this later)
      ● Solution for $q$:
      $q(y) \;\propto\; \exp\Big\{ \big(\beta \log p_\theta(x, y) + \lambda\, f(x, y)\big) \,/\, \alpha \Big\}$

  20. More general learning leveraging PR
      ● No need to limit to Bayesian learning
      ● E.g., complex rule constraints on general models [Hu et al., 2016], where
        ◯ $q$ can be over arbitrary variables, e.g., $q(x, y)$
        ◯ $p_\theta(x, y)$ is a NN of arbitrary architecture with parameters $\theta$
      $\min_{q, \theta, \xi} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;+\; U\xi \quad \text{s.t.} \; \mathbb{E}_q\big[1 - f(x, y)\big] \le \xi$
        ◯ E.g., $f(x, y)$ is a 1st-order logical rule: if sentence $x$ contains the word "but", its sentiment $y$ is the same as the sentiment after "but"

  21. EM for the general PR
      ● Rewrite without the slack variable:
      $\min_{q, \theta} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;-\; \lambda\,\mathbb{E}_q\big[f(x, y)\big]$
        ◯ Solve with EM (a toy sketch follows below)
          § E-step: $q(x, y) = \exp\Big\{ \big(\beta \log p_\theta(x, y) + \lambda f(x, y)\big) \,/\, \alpha \Big\} \,/\, Z$
          § M-step: $\max_\theta \; \mathbb{E}_q\big[\log p_\theta(x, y)\big]$
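A toy sketch of these two steps on a small discrete $(x, y)$ domain; the joint probabilities, the constraint feature $f$, and the weights $\alpha, \beta, \lambda$ are all made up for illustration.

```python
# Toy E-step / M-step for the general PR objective on a discrete (x, y) space.
import numpy as np

alpha, beta, lam = 1.0, 1.0, 2.0
log_p = np.log(np.array([[0.3, 0.1],        # log p_theta(x, y) on a 2x2 domain
                         [0.2, 0.4]]))
f = np.array([[1.0, 0.0],                   # constraint feature f(x, y)
              [0.0, 1.0]])

# E-step: q(x, y) = exp{ (beta * log p_theta + lam * f) / alpha } / Z
unnorm = np.exp((beta * log_p + lam * f) / alpha)
q = unnorm / unnorm.sum()

# M-step target: max_theta E_q[ log p_theta(x, y) ]  (here we only report its value)
print(q)
print((q * log_p).sum())
```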

  22. Reformulating unsupervised MLE with PR
      $\log \textstyle\sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y \mid x^*)\big) + \mathbb{E}_{q(y \mid x^*)}\big[\log p_\theta(x^*, y)\big]$
      ● Introduce an arbitrary $q(x, y)$:
      $\min_{q, \theta, \xi} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;+\; U\xi \quad \text{s.t.} \; -\mathbb{E}_q\big[f(x; \mathcal{D})\big] \le \xi$
      ● Data as constraint: given $x \sim \mathcal{D}$, this constraint doesn't influence the solution of $q$ and $\theta$
        ◯ $f(x; \mathcal{D}) := \log \mathbb{E}_{x^* \sim \mathcal{D}}\big[\mathbb{I}_{x^*}(x)\big]$
          § A constraint saying $x$ must equal one of the true data points
          § Or alternatively, the (log) expected similarity of $x$ to the dataset $\mathcal{D}$, with $\mathbb{I}(\cdot)$ as the similarity measure (we'll come back to this later)
        ◯ $\alpha = \beta = U = 1$

  23. The standard equation
      From the conventional formulation $\min_\theta \mathcal{L}(\theta)$ (loss, optimization solver, model architecture) to:
      $\min_{q, \theta, \xi} \; -\alpha\,\mathbb{H}(q) \;+\; \beta\,\mathbb{D}\big(q(x, y),\, p_\theta(x, y)\big) \;+\; U\xi \quad \text{s.t.} \; -\mathbb{E}_q\big[f(x, y; \mathcal{D})\big] \le \xi$
      Equivalently:
      $\min_{q, \theta} \; -\mathbb{E}_q\big[f(x, y; \mathcal{D})\big] \;+\; \beta\,\mathbb{D}\big(q(x, y),\, p_\theta(x, y)\big) \;-\; \alpha\,\mathbb{H}(q)$
      3 terms (a small evaluation sketch follows below):
      ● Experiences (exogenous regularizations), e.g., data examples, rules
      ● Divergence (fitness), e.g., cross entropy
      ● Uncertainty (self-regularization), e.g., Shannon entropy
      Analogy: the experiences $f(x, y; \mathcal{D})$ play the role of the textbook, the model $p_\theta(x, y)$ the student, and the auxiliary distribution $q(x, y)$ the teacher.
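A minimal sketch of evaluating the three terms of the equivalent form on a discrete $(x, y)$ space, with cross entropy as the divergence $\mathbb{D}$; the distributions and the experience function `f` below are placeholders of my own choosing, not values from the slides.

```python
# The three terms of the standard equation: -E_q[f] + beta*D(q, p_theta) - alpha*H(q),
# with cross entropy as the divergence, on a small discrete (x, y) space.
import numpy as np

def se_objective(q, log_p, f, alpha=1.0, beta=1.0):
    experience = -(q * f).sum()                       # experiences (data, rules, rewards, ...)
    divergence = beta * (-(q * log_p).sum())          # cross entropy between q and p_theta
    uncertainty = -alpha * (-(q * np.log(q)).sum())   # minus alpha times the Shannon entropy of q
    return experience + divergence + uncertainty

q = np.full((2, 2), 0.25)                             # teacher distribution q(x, y)
log_p = np.log(np.array([[0.3, 0.1],                  # student model p_theta(x, y)
                         [0.2, 0.4]]))
f = np.array([[1.0, 0.0], [0.0, 1.0]])                # experience function f(x, y; D)
print(se_objective(q, log_p, f))
```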

  24. Re-visit unsupervised MLE under SE
      $\min_{q, \theta} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;-\; \mathbb{E}_q\big[f\big]$
      with $\alpha = \beta = 1$, $\quad f := f(x; \mathcal{D}) = \log \mathbb{E}_{x^* \sim \mathcal{D}}\big[\mathbb{I}_{x^*}(x)\big]$, $\quad q = q(y \mid x)$

  25. Re-visit supervised MLE under SE
      $\min_{q, \theta} \; -\alpha\,\mathbb{H}(q) \;-\; \beta\,\mathbb{E}_q\big[\log p_\theta(x, y)\big] \;-\; \mathbb{E}_q\big[f(x, y)\big]$
      with $\alpha = 1$, $\beta = \epsilon$, $\quad f := f(x, y; \mathcal{D}) = \log \mathbb{E}_{(x^*, y^*) \sim \mathcal{D}}\big[\mathbb{I}_{(x^*, y^*)}(x, y)\big]$
