A Blueprint of Standardized and Composable ML – Eric Xing and Zhiting Hu – PowerPoint PPT Presentation


SLIDE 1

A Blueprint of Standardized and Composable ML

Eric Xing and Zhiting Hu
Petuum & Carnegie Mellon

SLIDE 2

The universe of problems ML/AI is trying to solve

SLIDE 3

Data and experiences of all kinds

• Data examples
• Rewards
• Auxiliary agents
• Constraints, e.g., "Type-2 diabetes is 90% more common than type-1"
• Adversaries
• And all combinations of the above

SLIDE 4

How do human beings solve them ALL?

SLIDE 5

The Zoo of ML/AI Models

• Neural networks
  ◯ Convolutional networks: AlexNet, GoogleNet, ResNet
  ◯ Recurrent networks, LSTM
  ◯ Transformers: BERT, GPT-2
• Graphical models
  ◯ Bayesian networks
  ◯ Markov random fields
  ◯ Topic models, LDA
  ◯ HMM, CRF
• Kernel machines
  ◯ Radial basis function networks
  ◯ Gaussian processes
  ◯ Deep kernel learning
  ◯ Maximum margin, SVMs
• Decision trees
• PCA, Probabilistic PCA, Kernel PCA, ICA
• Boosting
SLIDE 6

The Zoo of algorithms and heuristics

• actor-critic
• imitation learning
• softmax policy gradient
• policy optimization
• posterior regularization
• constraint-driven learning
• regularized Bayes
• GANs
• active learning
• intrinsic reward
• inverse RL
• knowledge distillation
• energy-based GANs
• maximum likelihood estimation
• prediction minimization
• generalized expectation
• learning from measurements
• adversarial domain adaptation
• reinforcement learning as inference
• data augmentation
• data re-weighting
• label smoothing
• weak/distant supervision
• reward-augmented maximum likelihood

SLIDE 7

Really hard to navigate, and to realize

• Depending on individual expertise and creativity
• Bespoke, delicate pieces of art
• Like an airport with a different runway for every different type of aircraft

SLIDE 8

Physics in the 1800s

• Electricity & magnetism: Coulomb's law, Ampère, Faraday, ...
• Theory of light beams:
  ◯ Particle theory: Isaac Newton, Laplace, Planck
  ◯ Wave theory: Grimaldi, Christiaan Huygens, Thomas Young, Maxwell
• Law of gravity: Aristotle, Galileo, Newton, …

SLIDE 9

Maxwell's equations

Diverse electromagnetic theories, unified:

  ∂_ν F^{μν} = (4π/c) J^μ
  ε^{μνκλ} ∂_ν F_{κλ} = 0

Maxwell's Eqns:
• Original form
• Simplified w/ rotational symmetry
• Further simplified w/ the symmetry of special relativity

SLIDE 10

How about a blueprint of ML?

• Loss
• Optimization solver
• Model architecture

min_θ ℒ(θ)

SLIDE 11

How about a blueprint of ML?

• Loss
• Optimization solver
• Model architecture

min_{q, θ}:  Experience (𝔼) − Divergence (𝔻) − Uncertainty (ℍ)
(the three-term loss, made precise as the standard equation on Slide 23)

SLIDE 12

MLE at a close look:

• The most classical learning algorithm
• Supervised:
  ◯ Observe data D = { (x*, y*) }
  ◯ min_θ  − E_{(x*,y*)∼D}[ log p_θ(y*|x*) ]
  ◯ Solve with SGD (see the sketch below)
• Unsupervised:
  ◯ Observe D = { x* }; y is a latent variable
  ◯ Posterior p_θ(y|x)
  ◯ min_θ  − E_{x*∼D}[ log Σ_y p_θ(x*, y) ]
  ◯ Solve with EM:
    § E-step imputes the latent variable y through an expectation over the complete likelihood
    § M-step: supervised MLE
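A minimal numpy sketch of the supervised case, solving min_θ − E[ log p_θ(y*|x*) ] by SGD on a toy logistic model (the data, model, and hyperparameters are illustrative stand-ins, not from the slides):

import numpy as np

# Supervised MLE via SGD for logistic regression:
# min_theta -E_{(x*,y*)~D}[ log p_theta(y*|x*) ]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # observed inputs x*
true_w = np.array([1.0, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)              # observed labels y*

w = np.zeros(3)                                 # theta
lr = 0.1
for step in range(2000):
    i = rng.integers(len(X))                    # sample one (x*, y*)
    p = 1.0 / (1.0 + np.exp(-X[i] @ w))         # p_theta(y=1 | x*)
    w -= lr * (p - y[i]) * X[i]                 # SGD step on the neg. log-likelihood

print("learned w:", np.round(w, 2))             # aligned with true_w's direction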

SLIDE 13

MLE as Entropy Maximization

• Duality between supervised MLE and maximum entropy, when p_θ is an exponential family

Data as constraints, with features T(x, y):

  min_q  − ℍ(q)                                         (Shannon entropy ℍ)
  s.t.  E_q[ T(x, y) ] = E_{(x*,y*)∼D}[ T(x, y) ]

Solve w/ the Lagrangian method (a short derivation sketch follows below):

  q(x, y) = exp{ θ ⋅ T(x, y) } / Z(θ)                   (Lagrange multipliers θ)

  min_θ  − E_{(x*,y*)∼D}[ θ ⋅ T(x, y) ] + log Z(θ)      (the negative log-likelihood)
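A short derivation sketch in LaTeX (the standard MaxEnt-duality argument, using the features T and multipliers θ above; the multiplier of the normalization constraint on q is absorbed into Z):

\mathcal{L}(q, \theta)
  = -\mathbb{H}(q)
    - \theta \cdot \big( \mathbb{E}_q[T(x,y)] - \mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[T(x,y)] \big)
% Stationarity of L in q gives the exponential-family form:
q_\theta(x, y) = \exp\{ \theta \cdot T(x, y) \} \,/\, Z(\theta)
% Substituting q_theta back yields the dual problem, the negative log-likelihood:
\min_\theta \; -\mathbb{E}_{(x^*,y^*)\sim\mathcal{D}}[\, \theta \cdot T(x, y) \,] + \log Z(\theta)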

SLIDE 14

MLE as Entropy Maximization

• Unsupervised MLE can be achieved by maximizing the negative free energy:
• Introduce an auxiliary distribution q(y|x) (and then play with its entropy, cross entropy, etc.)

log Σ_y p_θ(x*, y) = E_{q(y|x*)}[ log ( p_θ(x*, y) / q(y|x*) ) ] + KL( q(y|x*) ‖ p_θ(y|x*) )
                   ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]  =: ℒ(q, θ)

SLIDE 15

Algorithms for Unsupervised MLE

min_θ  − E_{x*∼D}[ log Σ_y p_θ(x*, y) ]

1) Solve with EM (see the sketch below)

• E-step: Maximize ℒ(q, θ) w.r.t. q; equivalent to minimizing the KL term, by setting
    q(y|x*) = p_{θ_old}(y|x*)
• M-step: Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q(y|x*)}[ log p_θ(x*, y) ]

log Σ_y p_θ(x*, y) = E_{q(y|x*)}[ log ( p_θ(x*, y) / q(y|x*) ) ] + KL( q(y|x*) ‖ p_θ(y|x*) )
                   ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]
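A minimal numpy sketch of the two EM steps, assuming a toy two-component 1-D Gaussian mixture with unit variances and equal fixed weights (all stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])  # data x*

mu = np.array([-1.0, 1.0])                       # theta: the component means
for _ in range(50):
    # E-step: set q(y|x*) to the exact posterior p_theta(y|x*)
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2
    q = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: max_theta E_q[log p_theta(x*, y)] -> responsibility-weighted means
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print("estimated means:", np.round(mu, 2))       # close to (-2, 3)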

SLIDE 16

Algorithms for Unsupervised MLE (cont'd)

2) When the model p_θ is complex, directly working with the true posterior p_θ(y|x*) is intractable ⇒ Variational EM

§ Consider a sufficiently restricted family Q of q(y|x) so that minimizing the KL is tractable
  ◯ E.g., parametric distributions, factorized distributions
§ E-step: Maximize ℒ(q, θ) w.r.t. q ∈ Q; equivalent to minimizing the KL
§ M-step: Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q(y|x*)}[ log p_θ(x*, y) ]

(same lower-bound decomposition of log Σ_y p_θ(x*, y) as above)

SLIDE 17

Algorithms for Unsupervised MLE (cont'd)

3) When q is complex, e.g., deep NNs, optimizing q in the E-step is difficult (e.g., high variance) ⇒ the Wake-Sleep algorithm [Hinton et al., 1995]

• Wake-phase (M-step): Maximize ℒ(q, θ) w.r.t. θ:
    max_θ  E_{q_φ(y|x*)}[ log p_θ(x*, y) ]
• Sleep-phase (E-step): minimize the reverse KL:
    min_φ  KL( p_θ(y|x*) ‖ q_φ(y|x*) )

Other tricks: the reparameterization trick in VAEs ('2014), control variates in NVIL ('2014) (see the sketch below)

(same lower-bound decomposition of log Σ_y p_θ(x*, y) as above)
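A minimal numpy sketch of the reparameterization trick mentioned above, assuming a toy Gaussian q_φ(y|x) and a conjugate Gaussian model p_θ(x, y) = N(y; 0, 1) N(x; y, 1) so the exact posterior is known (all stand-ins, not from the slides):

import numpy as np

# With q_phi(y|x) = N(mu, sigma^2), write y = mu + sigma * eps, eps ~ N(0, 1),
# so the E-step gradient flows through the sample with low variance.
rng = np.random.default_rng(0)
x_star = 1.5                                     # one observation
mu, log_sigma = 0.0, 0.0                         # phi
lr = 0.02

for _ in range(5000):
    eps = rng.normal()
    sigma = np.exp(log_sigma)
    y = mu + sigma * eps                         # reparameterized sample
    dlogp_dy = -y + (x_star - y)                 # d log p_theta(x*, y) / dy
    # Ascend the ELBO = E_q[log p_theta(x*, y)] + H(q); dH/dlog_sigma = 1
    mu += lr * dlogp_dy
    log_sigma += lr * (dlogp_dy * sigma * eps + 1.0)

# The exact posterior here is N(0.75, 0.5), i.e., sigma ~ 0.71
print("q(y|x*) ~ N(%.2f, %.2f^2)" % (mu, np.exp(log_sigma)))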

SLIDE 18

Quick summary of MLE

• Supervised:
  ◯ Duality with MaxEnt
  ◯ Solve with SGD, IPF, …
• Unsupervised:
  ◯ Lower bounded by the negative free energy
  ◯ Solve with EM, VEM, Wake-Sleep, …
• Close connections to MaxEnt
• With MaxEnt, algorithms (e.g., EM) arise naturally

SLIDE 19

Posterior Regularization (PR) [Ganchev et al., 2010]

• Make use of constraints in Bayesian learning
  ◯ An auxiliary posterior distribution q(y)
  ◯ A slack variable ξ and constant weights α = β > 0
  ◯ E.g., max-margin constraints for linear regression [Jaakkola et al., 1999] and for general models (e.g., LDA, NNs) [Zhu et al., 2014] -- more later

min_{q, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
s.t.  − E_q[ f(x, y) ] ≤ ξ

• Solution for q:

  q*(y) = exp{ ( β log p_θ(x, y) + f(x, y) ) / α } / Z

SLIDE 20

More general learning leveraging PR

• No need to limit to Bayesian learning
• E.g., complex rule constraints on general models [Hu et al., 2016], where
  ◯ q can be over arbitrary variables, e.g., q(x, y)
  ◯ p_θ(x, y) is an NN of arbitrary architecture with parameters θ

min_{q, θ, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
s.t.  E_q[ 1 − r(x, y) ] ≤ ξ

E.g., r(x, y) is a 1st-order logic rule: if sentence x contains the word "but", then its sentiment y is the same as the sentiment after "but".

SLIDE 21

EM for the general PR

• Rewrite without the slack variable:

  min_{q, θ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] − E_q[ f(x, y) ]

• Solve with EM (see the sketch below):
  § E-step:  q(x, y) = exp{ ( β log p_θ(x, y) + f(x, y) ) / α } / Z
  § M-step:  min_θ  − E_q[ log p_θ(x, y) ]
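A minimal numpy sketch of the closed-form E-step teacher on a small discrete space (p_θ and the rule score f below are toy stand-ins, not from the slides):

import numpy as np

# Teacher: q(x, y) ∝ exp{ (beta * log p_theta(x, y) + f(x, y)) / alpha }
alpha, beta = 1.0, 1.0
p_theta = np.array([[0.3, 0.2],                  # toy p_theta(x, y) on a 2x2 space
                    [0.1, 0.4]])
f = np.array([[0.0, 1.0],                        # toy rule score f(x, y)
              [1.0, 0.0]])

q = np.exp((beta * np.log(p_theta) + f) / alpha)
q /= q.sum()                                     # the normalized teacher

# The M-step then fits theta to this teacher: min_theta -E_q[log p_theta]
print(np.round(q, 3))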

SLIDE 22

Reformulating unsupervised MLE with PR

• Introduce an arbitrary q(y|x):

  log Σ_y p_θ(x*, y) ≥ ℍ( q(y|x*) ) + E_{q(y|x*)}[ log p_θ(x*, y) ]

• Rewrite in the constrained PR form, with α = β = 1:

  min_{q, θ, ξ}  − α ℍ(q) − β E_q[ log p_θ(x, y) ] + U(ξ)
  s.t.  − E_q[ f(x; D) ] ≤ ξ

  f(x; D) := log E_{x*∼D}[ 𝟙_{x*}(x) ]

• Data as constraint: given x ∼ D, this constraint doesn't influence the solution of q and θ
  § A constraint saying x must equal one of the true data points
  § Or, alternatively, the (log) expected similarity of x to the dataset D, with 𝟙(⋅) as the similarity measure (we'll come back to this later)

SLIDE 23

The standard equation

min_{q, θ, ξ≥0}  β D( q(x, y), p_θ(x, y) ) − α ℍ(q) + U(ξ)
s.t.  − E_{q(x,y)}[ f(x, y) ] ≤ ξ

Equivalently:

min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

3 terms:
• Experiences (exogenous regularizations), f(x, y): e.g., data examples, rules (the "textbook")
• Divergence (fitness), D( q(x, y), p_θ(x, y) ): e.g., cross entropy, between the teacher q(x, y) and the student p_θ(x, y)
• Uncertainty (self-regularization), ℍ(q): e.g., Shannon entropy

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 24

Re-visit unsupervised MLE under SE

f := f(x; D) = log E_{x*∼D}[ 𝟙_{x*}(x) ]
α = β = 1,  q = q(y|x)

min_{q, θ}  − α ℍ(q) − E_q[ f(x; D) ] − β E_q[ log p_θ(x, y) ]

SLIDE 25

Re-visit supervised MLE under SE

f := f(x, y; D) = log E_{(x*,y*)∼D}[ 𝟙_{(x*,y*)}(x, y) ]
α = 1,  β = ε

min_{q, θ}  − α ℍ(q) − E_{q(x,y)}[ f(x, y; D) ] − β E_q[ log p_θ(x, y) ]

SLIDE 26

Active learning under SE

f := f(x, y; D_oracle) + u(x)
α = ρ (> 0),  β = ε

f(x, y; D_oracle) = log E_{x*∼D, y*∼oracle(x*)}[ 𝟙_{(x*,y*)}(x, y) ]

u(x): the prediction uncertainty on x, e.g., the Shannon entropy H( p_θ(y|x) )

Equivalent to (see the sketch below):
• Draw a data point x* according to exp{ u(x) / ρ }
• Get the label y* for x* from the oracle
• Maximize the log-likelihood on (x*, y*)

min_{q, θ}  − α ℍ(q) − E_{q(x,y)}[ f(x, y) ] − β E_q[ log p_θ(x, y) ]
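A minimal numpy sketch of the sampling view above: draw the query with probability proportional to exp{ u(x)/ρ }, with predictive Shannon entropy as u (the pool probabilities are toy stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
pool_probs = np.array([[0.95, 0.05],             # toy p_theta(y|x) per pool item
                       [0.55, 0.45],
                       [0.50, 0.50],
                       [0.80, 0.20]])
rho = 0.1

u = -(pool_probs * np.log(pool_probs)).sum(axis=1)   # predictive entropy u(x)
w = np.exp(u / rho)
i = rng.choice(len(pool_probs), p=w / w.sum())       # draw the query x*

# Next: get y* from the oracle and run supervised MLE on (x*, y*)
print("query index:", i)                             # favors the 0.50/0.50 item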

SLIDE 27

Reinforcement learning (RL) under SE -- I

• RL-as-inference [Dayan '97; Levine '18, …]

α = β = ρ (> 0)
f(s, a) := Q^{π_θ}(s, a)

min_{q, θ}  − α ℍ(q) − E_{q(s,a)}[ f(s, a) ] − β E_q[ log π_θ(s, a) ]

• Map to RL language:
  ◯ x: state s;  y: action a
  ◯ μ(s): state distribution
  ◯ Q^{π_θ}(s, a): the expected future reward of taking action a in state s and continuing the current policy π_θ:
      Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ]

SLIDE 28

Reinforcement learning (RL) under SE -- II

• Policy gradient

α = β = 1
f(s, a) := log Q^{π_θ}(s, a)

min_{q, θ}  − α ℍ(q) − E_{q(s,a)}[ f(s, a) ] − β E_q[ log π_θ(s, a) ]

E-step:
  q(s, a) = μ(s) π_{θ_old}(a|s) Q^{π_{θ_old}}(s, a) / Z

M-step:
  E_{q(s,a)}[ ∇_θ log π_θ(a|s) ]
    = 1/Z ⋅ E_{μ(s) π_θ(a|s)}[ Q^{π_θ}(s, a) ∇_θ log π_θ(a|s) ]   (importance sampling est.)
    = 1/Z ⋅ ∇_θ E_{μ(s) π_θ(a|s)}[ Q^{π_θ}(s, a) ]                (log-derivative trick)
  i.e., the conventional policy gradient objective (see the sketch below)

• Map to RL language:
  ◯ x: state s;  y: action a
  ◯ μ(s): state distribution
  ◯ Q^{π_θ}(s, a): the expected future reward of taking action a in state s and continuing the current policy:
      Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ]
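A minimal numpy sketch of the resulting policy-gradient (log-derivative trick) update on a one-step bandit, with a running-average baseline added for stability (all toy stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 0.2, 0.5])                    # toy reward per action
theta = np.zeros(3)                              # logits of a softmax policy
lr, baseline = 0.2, 0.0

for _ in range(3000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                               # pi_theta(a)
    a = rng.choice(3, p=pi)                      # sample an action
    grad_logp = -pi
    grad_logp[a] += 1.0                          # grad_theta log pi_theta(a)
    theta += lr * (r[a] - baseline) * grad_logp  # policy-gradient ascent step
    baseline += 0.1 * (r[a] - baseline)          # running-average baseline

pi = np.exp(theta - theta.max())
print("final policy:", np.round(pi / pi.sum(), 3))   # mass concentrates on action 0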

SLIDE 29

Adversarial learning under SE

• Same as supervised MLE: f := f(x; D), α = 1, β = ε
  ◯ For notational simplicity, we use x to replace (x, y)

  min_{q, θ}  − α ℍ(q) − E_q[ f(x) ] + β D( q(x), p_θ(x) )

• The M-step is to  min_θ  D( q(x), p_θ(x) )
• Solve with probability functional descent (PFD) [Chu et al., 2019]:
  ◯ p_θ(x) can be optimized by minimizing E_{p_θ}[ Ψ(x) ], where Ψ(x) is the influence function of D at p_θ
  ◯ Ψ is obtained with convex duality (D*: the convex conjugate of D):
      Ψ(x) = argmax_φ  E_{p_θ}[ φ(x) ] − D*(φ)
  ◯ So the whole optimization is:
      min_θ max_φ  E_{p_θ}[ φ(x) ] − D*(φ)

SLIDE 30

Adversarial learning under SE (cont'd)

• Same setup as Slide 29: f := f(x; D), α = 1, β = ε; the M-step min_θ D( q(x), p_θ(x) ) is solved with PFD [Chu et al., 2019]:
    min_θ max_φ  E_{p_θ}[ φ(x) ] − D*(φ)
• Parameterize φ with an NN C_ϕ. E.g., when D is the JSD, set
    φ_ϕ(x) := 0.5 log( 1 − C_ϕ(x) ) − 0.5 log 2
  Plugging this into the equation recovers vanilla GAN training.
slide-31
SLIDE 31

31

Adversarial learning under SE – alternative interpretation

  • Recall in MLE, ! is a fixed function
  • Intuitively, see ! as a similarity metric that measures similarity of sample

" against real data #

  • Instead of the above manually fixed metric, can we learn

rn a metric !

$?

31

! ≔ !(" ; #) = log -"∗∼#

0"∗ "

min

4, 6 − 8ℍ : + <=

1 − -4 " 1 ! " : " , ?6 "

SLIDE 32

Adversarial learning under SE – alternative interpretation

• Augment the standard objective to account for f_φ:

  min_θ max_φ min_q  − α ℍ(q) + β D( q(x), p_θ(x) ) − E_q[ f_φ(x) ] + E_{p_data}[ f_φ(x) ]

• Set α = 0, β = 1. Under mild conditions, the objective recovers:
  ◯ Vanilla GAN [Goodfellow et al., 2014], when D is the JS-divergence and f_φ is a binary classifier (see the sketch below)
  ◯ f-GAN [Nowozin et al., 2016], when D is an f-divergence
  ◯ W-GAN [Arjovsky et al., 2017], when D is the Wasserstein distance and f_φ is a 1-Lipschitz function

* Proofs adapted from Farnia & Tse 2018: "A Convex Duality Framework for GANs"
slide-33
SLIDE 33

33

More algorithms recovered by SE

  • Data augmentation / re-weighting / RAML
  • Unified EM (UEM) / Constraint-driven learning (CoDL)
  • Curiosity-driven RL
  • Knowledge distillation

33

!"

RL as inference MLE

augment RAML(’16) re-weighting active learning curiosity-driven RL(’91)

posterior regularization CoDL(’07) UEM(’12) GANs knowledge distillation

SLIDE 34

A table of ALL models/paradigms

Paradigms not (yet) covered by SE:
• Meta-learning
• Lifelong learning
• Interesting future work to study the connections

SLIDE 35

Learning with ALL experiences

• Distinct experiences are used in learning in the same way:

  f = λ₁ ⋅ f₁(⋅) + λ₂ ⋅ f₂(⋅) + λ₃ ⋅ f₃(⋅) + λ₄ ⋅ f₄(⋅) + ⋯

• Focus on what experiences to use, instead of worrying about how to use them
• Plug arbitrary available experiences into the learning procedure!

SLIDE 36

The zoo of optimization solvers

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

Optimization of the loss, subject to q ∈ the probability simplex. Convex in q when α, β > 0 and D is convex.

• Just as the standard equation is a master loss for many paradigms, is there a master solver for optimizing the loss?
• No such general algorithm (yet)
• Alternating gradient descent:
  ◯ Most widely used
  ◯ EM, Variational EM (variational inference), Wake-Sleep, …

SLIDE 37

The extended EM as a primal solver

• A generalization of the classic Variational EM, when α, β > 0 and D = CE (cross entropy):

  (1) Generalized E-step (Teacher): the reference distribution in closed form, supporting all types of experiences:

      q(t) = exp{ ( β log p_θ(t) + f(t; D) ) / α } / Z

  (2) M-step (Student): matching the model to the reference (see the sketch below):

      min_θ  − E_{q(t)}[ log p_θ(t) ]

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t; D) ]

• Limitations: e.g., not applicable when D is another divergence measure
• The EM template has been further enhanced/adapted in different ways in various paradigms:
  ◯ in RL: TRPO, PPO, MaxEnt inverse RL, …
  ◯ in GANs: many extensions to stabilize training
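A minimal numpy sketch of this teacher-student alternation with D = cross entropy on a small discrete space (the softmax model and experience scores are toy stand-ins, not from the slides):

import numpy as np

alpha, beta = 1.0, 1.0
f = np.array([0.0, 2.0, 0.0, 1.0])               # toy experience score per outcome t
logits = np.zeros(4)                             # theta of a softmax p_theta(t)
lr = 0.5

for _ in range(200):
    log_p = logits - np.log(np.exp(logits).sum())          # log p_theta(t)
    # Generalized E-step (teacher): q(t) ∝ exp{(beta*log_p + f)/alpha}
    q = np.exp((beta * log_p + f) / alpha)
    q /= q.sum()
    # M-step (student): min_theta -E_q[log p_theta(t)]; softmax gradient = p - q
    logits -= lr * (np.exp(log_p) - q)

# Repeated alternation concentrates p_theta on high-f outcomes
print("p_theta:", np.round(np.exp(logits) / np.exp(logits).sum(), 3))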
SLIDE 38

Some "advanced" (specialized) techniques

• Alternating GD:
  ◯ EM, Variational EM (variational inference), Wake-Sleep, …
  ◯ SGD, back-propagation (BP)
• Convex duality, Lagrangian methods -- kernel tricks
• Integer linear programming (ILP)
• Probability functional descent (PFD) [Chu et al., 2019] -- influence functions; gives a neat formulation of GAN-like optimization and a few others

Optimization of the loss, subject to q ∈ the probability simplex. Convex in q when α, β > 0 and D is convex:

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

SLIDE 39

I: Duality

• Structured MaxEnt Discrimination (SMED) [Zhu and Xing, 2013]
• Solve the (primal) Lagrangian:

  min_{q, ξ≥0}  − α ℍ(q) − β E_q[ log p(w) ] + U(ξ)
  s.t.  − E_q[ ΔF_i(y; w) − Δℓ_i(y) ] ≤ ξ_i   ∀ i

• Solve for the Lagrange multipliers λ from the dual problem (when p(w) = N(w; 0, I) and U(ξ) = Σ_i ξ_i):

  max_λ  Σ_{i, y≠y_i*} λ_i(y) Δℓ_i(y) − (1/2) ‖ Σ_{i, y≠y_i*} λ_i(y) Δf_i(y) ‖²

  ◯ Allows the kernel trick for nonlinear interactions b/w experiences

• Solution:

  q(w) = (1/Z(λ)) exp{ β log p(w) + Σ_{i, y≠y_i*} λ_i(y) ( ΔF_i(y; w) − Δℓ_i(y) ) }

SLIDE 40

II: Influence Function and Probability Functional Descent  [Chu et al., 2019]

• Gradient descent in the space of probability measures 𝒫(X):

  min_{ρ ∈ 𝒫(X)} ℐ(ρ),   where ℐ: 𝒫(X) → ℝ is a probability functional

• Influence function Ψ_ρ(x): the Gateaux differential of ℐ at ρ in the direction χ = ν − ρ is

  dℐ_ρ(χ) = ∫_X Ψ_ρ(x) dχ(x) = E_ν[ Ψ_ρ(x) ] − E_ρ[ Ψ_ρ(x) ]

• With a linear approximation ℐ̂(ρ) to ℐ(ρ) around ρ₀:

  ℐ̂(ρ) = ℐ(ρ₀) + dℐ_{ρ₀}(ρ − ρ₀) = E_{x∼ρ}[ Ψ_{ρ₀}(x) ] + const.

• Thus, once we obtain the influence function, we can optimize ρ by decreasing E_{x∼ρ}[ Ψ_{ρ₀}(x) ]

SLIDE 41

Adversarial learning using PFD  [Chu et al., 2019]

ℐ(p_θ) = D( q(x), p_θ(x) )

◯ Often no closed-form influence function, e.g., when D is the JSD or the W-distance
◯ Approximate with convex duality:
  § Convex conjugate:  ℐ*(φ) = sup_ρ ∫ φ(x) dρ(x) − ℐ(ρ)
  § The influence function is obtained via:  Ψ_{p_θ} = argmax_φ  E_{x∼p_θ}[ φ(x) ] − ℐ*(φ)
  § Parameterize φ as below to recover the optimization of generator and discriminator:
      φ_ϕ(x) := 0.5 log( 1 − C_ϕ(x) ) − 0.5 log 2
      Ψ ≈ argmax_ϕ  E_{p_data}[ log C_ϕ ] + E_{p_θ}[ log( 1 − C_ϕ ) ]
◯ The whole optimization of ℐ(p_θ) is thus:
      min_θ max_ϕ  E_{p_data}[ log C_ϕ ] + E_{p_θ}[ log( 1 − C_ϕ ) ]

SLIDE 42

RL using PFD  [Chu et al., 2019]

• E.g., policy iteration in RL
  ◯ (Conventional) loss:  ℐ(π_θ) = − E_{μ(s)} E_{π_θ(a|s)}[ Q(s, a) ]
  ◯ Influence function:  Ψ_{π_θ}(a) = − E_{μ(s)}[ Q(s, a) ]
  ◯ Thus, optimize π_θ by minimizing  E_{π_θ}[ Ψ_{π_θ}(a) ] = − E_{μ(s)} E_{π_θ(a|s)}[ Q(s, a) ]

μ(s): state distribution;  π_θ(a|s): policy

SLIDE 43

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

min_{q, θ}  − α ℍ(q) + β D( q(t), p_θ(t) ) − E_q[ f(t) ]

SLIDE 44

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

AlexNet: 8 layers; VGG: 19 layers; GoogleNet: 22 layers; ResNet: 152 layers

• Activation functions
  ◯ Linear and ReLU ◯ Sigmoid and tanh ◯ Etc.
• Layers
  ◯ Fully connected ◯ Convolutional & pooling ◯ Recurrent ◯ ResNets ◯ Etc.

SLIDE 45

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

Neural network components:
• RNN: cells (LSTM, GRU, …); recurrent attention (Bahdanau, Luong, …)
• FFNetwork: layers (Conv, Dense, …)
• Classifier
• Encoder, Decoder, Encoder-Decoder
• Transformer: multi-head attention
• Embedder: WordEmbedder, PositionEmbedder

SLIDE 46

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Figure: graphical model design; courtesy: Sutton & McCallum, 2010]

SLIDE 47

Model architecture

• Relatively well explored:
  ◯ Neural network design ◯ Graphical model design ◯ Compositional architectures

[Diagram (a)-(g): compositional architectures assembled from modules, where E refers to encoder, D to decoder, C to classifier, A to attention, Prior to prior distribution, and M to memory]

SLIDE 48

Summary: a blueprint of ML

• Loss
  ◯ The standard equation:

    min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

• Algorithm
  ◯ The extended EM algorithm gives a general primal solution in many cases
  ◯ PFD gives a neat formulation for some cases (e.g., GANs)
• Model architecture: a vast library of building blocks → compositionality

Next: practical implications of the ML blueprint

[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 49

Why is this useful?

• Learning with ALL experiences
• Complex interactions between experiences
• Multi-agent, game-theoretic learning using all experiences
SLIDE 50

Learning with ALL experiences: Empowering algorithms

• Unifying perspective on diverse paradigms (each tailored to a specific type of experience) under SE
• Combining or integrating different experiences
• Re-use or repurpose originally specialized algorithms
  ◯ Systematic idea transfer and solution exchange
  ◯ Solving challenges in one paradigm by applying well-known solutions from another
  ◯ Accelerating innovation across research areas

SLIDE 51

Learning with ALL experiences: Empowering algorithms – Ex.1

• Rules in PR ⇔ Reward in RL
• Empower reward-learning algorithms to learn rules [Hu et al., 2018]

SLIDE 52

Learning with ALL experiences: Empowering algorithms – Ex.2

• Data in supervised MLE ⇔ Reward in RL
• Empower reward-learning algorithms to learn data augmentation [Hu et al., 2019]

SLIDE 53

Learning with ALL experiences: Empowering algorithms – Ex.3

• GANs ⇔ RL ⇔ VI
• Empower RL/VI algorithms (e.g., PPO) to stabilize GAN training [Wu et al., 2020]

SLIDE 54

Learning with ALL experiences: Empowering algorithms – Ex.3 (cont'd)

• GANs ⇔ RL ⇔ VI
• Empower RL/VI algorithms (e.g., PPO) to stabilize GAN training [Wu et al., 2020]
  ◯ (a) Re-use the PPO objective for GAN training: discourage excessively large updates by "trapping" the update size around 1
  ◯ (b) Re-use importance weighting in a VI perspective: greatly reduced variance in both the generator and discriminator losses
• Improved performance on a range of problems, including image generation, text generation, and text style transfer

SLIDE 55

Learning with ALL experiences: Experience compositionality – Ex.1

• Distinct experiences are all modeled with f(x, y)
• Combine and plug different f functions into SE to drive learning (see the sketch below):

  min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

  f = λ₁ ⋅ f^data + λ₂ ⋅ f^rule + λ₃ ⋅ f^reward + ⋯

• Enables applications in controllable content generation

Controllable text generation [Hu et al., 2017; Yang et al., 2018]:
  f = sentiment classifier + linguistic rules + language model

Controlling sentiment:
  Pos: "The film is full of imagination!"
  Neg: "The film is strictly routine!"
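A minimal Python sketch of this experience compositionality; the three component scorers and weights are hypothetical stand-ins for the sentiment classifier, linguistic rules, and language model named above:

# Combine per-experience scorers into a single f = sum_i lambda_i * f_i
def f_classifier(x, y):
    return 0.7        # e.g., log-prob that text x carries target sentiment y (stand-in)

def f_rule(x, y):
    return 1.0        # e.g., 1.0 if the "but"-style linguistic rule holds (stand-in)

def f_lm(x, y):
    return -2.3       # e.g., language-model log-likelihood of x (stand-in)

lambdas = (1.0, 0.5, 0.1)

def f(x, y):
    # The combined f is what gets plugged into the standard equation.
    parts = (f_classifier(x, y), f_rule(x, y), f_lm(x, y))
    return sum(l * p for l, p in zip(lambdas, parts))

print(f("the film is full of imagination", "positive"))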

SLIDE 56

Learning with ALL experiences: Experience compositionality – Ex.2

• Distinct experiences are all modeled with f(x, y)
• Combine and plug different f functions into SE to drive learning:

  min_{q, θ}  − E_{q(x,y)}[ f(x, y) ] + β D( q(x, y), p_θ(x, y) ) − α ℍ(q)

  f = λ₁ ⋅ f^data + λ₂ ⋅ f^rule + λ₃ ⋅ f^reward + ⋯

• Enables applications in controllable content generation

Fashion image generation [Hu et al., 2018]:
  f = (small) data + human gesture (pose) constraints

[Figure: a source image and generated images under different poses]

SLIDE 57

Learning with ALL experiences: Experience compositionality – Ex.2 (cont'd)

[Figure columns: source; target pose; true target; base model; + learned knowledge (ours); + fixed knowledge]

SLIDE 58

Operational compositionality

• Build ML applications like composing music
• Open-source toolkit for composable ML

[Diagram (a)-(g): compositional architectures of E/D/C/A/Prior/M modules, as on Slide 47]
[Blueprint: Loss / Optimization solver / Model architecture;  min_θ ℒ(θ)]

SLIDE 59

Texar Stack – Operationalized "View" of Composable ML

SLIDE 60

Composable ML with Texar

• Highly modularized programming
  ◯ Data, structure, loss, learning, … ◯ Intuitive conceptual-level APIs
• Easy switch between learning algorithms
  ◯ Plug modules in & out ◯ No changes to irrelevant parts

[Figure: the same decoder trained interchangeably with (a) a cross-entropy loss on data examples, (b) BLEU rewards through a policy-gradient agent, or (c) real/fake feedback from a discriminator]

SLIDE 61

Food for thought: How far would this take us?

• Physics

"It is only slightly overstating the case to say that physics is the study of symmetry."
  - Phil Anderson (1923-2020), physicist, Nobel laureate

SLIDE 62

Food for thought: How far would this take us?

• Physics: Maxwell's equations (1861) → general relativity (1910s) → the standard model (1970s) → a theory of everything
• Machine Learning: a unified way of thinking
  ✦ Systematic understanding ✦ Automated solution creation ✦ Improved ML accessibility

SLIDE 63

Toward unified theoretical analysis

• How do we characterize learning with different experiences?
  ◯ E.g., data examples, rules, rewards, auxiliary models (discriminators), …
  ◯ Combinations of the above experiences
• What's the appropriate statistical tool to characterize learning with logical rules? Can we guarantee performance improvement when using more experiences? What if experiences are noisy?
• A possible direction:
  ◯ Existing theoretical analyses deal with learning from data examples, online learning, reinforcement learning, … in silos
  ◯ With the standard equation, can we re-purpose those analyses to other paradigms, e.g., learning with logical rules?

SLIDE 64

References

[1] Jun Zhu, Ning Chen, and Eric P. Xing. 2014. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR (2014).
[2] Jun Zhu and Eric P. Xing. 2009. Maximum Entropy Discrimination Markov Networks. JMLR (2009).
[3] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In ACL.
[4] Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, Xuezhe Ma, et al. 2019. Texar: A modularized, versatile, and extensible toolkit for text generation. ACL (2019).
[5] Zhiting Hu, Bowen Tan, Russ R. Salakhutdinov, Tom M. Mitchell, and Eric P. Xing. 2019. Learning data manipulation for augmentation and weighting. In NeurIPS.
[6] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.
[7] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. On Unifying Deep Generative Models. In ICLR.
[8] Zhiting Hu, Zichao Yang, Russ R. Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P. Xing. 2018. Deep generative models with learnable knowledge constraints. In NeurIPS.

SLIDE 65

Thanks!