On the Balcony, Tom Minka, Microsoft Research - PowerPoint PPT Presentation




slide-1
SLIDE 1

On the Balcony

From automatic differentiation to message passing

Tom Minka
Microsoft Research

slide-2
SLIDE 2

What I do

Algorithms for probabilistic inference

  • Expectation Propagation
  • Non-conjugate variational message passing
  • A* sampling
  • Probabilistic Programming
  • TrueSkill

slide-3
SLIDE 3

On the Balcony

  • A machine learning language should (among other things) simplify implementation of machine learning algorithms

Machine Learning Language

slide-4
SLIDE 4

On the Balcony

  • A general-purpose machine learning language should (among other things) simplify implementation of all machine learning algorithms

Machine Learning Language

slide-5
SLIDE 5

On the Balcony

  • 1. Automatic Differentiation
  • 2. AutoDiff lacks approximation
  • 3. Message passing generalizes AutoDiff
  • 4. Compiling to message passing

Roadmap

slide-6
SLIDE 6

1. Automatic / algorithmic differentiation

slide-7
SLIDE 7
  • β€œEvaluating derivatives” by Griewank

and Walther (2008) Recommended reading

slide-8
SLIDE 8
  • Programs can specify mathematical functions more compactly than formulas
  • Program is not a black box: undergoes analysis and transformation
  • Numbers are assumed to have infinite precision

Programs are the new formulas

slide-9
SLIDE 9
  • As formulas:
  • f = ∏_j x_j
  • df = βˆ‘_j dx_j ∏_{kβ‰ j} x_k

Multiply-all example

slide-10
SLIDE 10

Multiply-all example

Input program:
  c[1] = x[1]
  for i = 2 to n
    c[i] = c[i-1]*x[i]
  f = c[n]

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  df = dc[n]

f = ∏_j x_j        df = βˆ‘_j dx_j ∏_{kβ‰ j} x_k

slide-11
SLIDE 11
  • Execution: replace every operation with a linear one
  • Accumulation: collect linear coefficients

Phases of AD

slide-12
SLIDE 12

Execution phase

Figure: computation graph of f = x*y + y*z. The execution phase replaces each operation by its linearization, producing dx*y + x*dy + dy*z + y*dz; the edge labels (y, x, z, y and 1, 1) are the scale factors.

slide-13
SLIDE 13

Accumulation phase

Figure: linearized graph from the previous slide, with edge scale factors y, x, z, y and 1, 1 feeding dx*y + x*dy + dy*z + y*dz.

coefficient of dx = 1*y
coefficient of dy = 1*x + 1*z
coefficient of dz = 1*y
Gradient vector = (1*y, 1*x + 1*z, 1*y)

(Forward) (Reverse)
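A small sketch (Python, illustrative node names, not from the slides) of the accumulation phase viewed as a sum over paths: backward messages from the output are propagated through the scale factors above, recovering the gradient of f = x*y + y*z.

# Reverse accumulation as dynamic programming over the linearized graph.
# Nodes: dx, dy, dz feed intermediate sums a and b, which feed df.
# Edges carry the scale factors shown on the slide.

def reverse_accumulate(edges, output):
    # edges: {child: [(parent, scale), ...]}; message[node] = sum over
    # paths from node to the output of the product of scale factors.
    message = {output: 1.0}
    def msg(node):
        if node not in message:
            message[node] = sum(s * msg(p) for p, s in edges[node])
        return message[node]
    return msg

x, y, z = 2.0, 3.0, 5.0
edges = {
    "dx": [("a", y)], "dy": [("a", x), ("b", z)], "dz": [("b", y)],
    "a": [("df", 1.0)], "b": [("df", 1.0)],
}
msg = reverse_accumulate(edges, "df")
print([msg(v) for v in ("dx", "dy", "dz")])   # [3.0, 7.0, 3.0]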

slide-14
SLIDE 14

Linear composition

Figure: composing two linear layers with scale factors a, b, c, d and e, f. The expression e*(a*x + b*y) + f*(c*y + d*z) collapses to (e*a)*x + (e*b + f*c)*y + (f*d)*z: scale factors multiply along a path and add across parallel paths.

slide-15
SLIDE 15

Dynamic programming

Figure: shared coefficients e and f combine into (e+f)*a and (e+f)*b before reaching x and y.

  • Reverse accumulation is dynamic programming
  • Backward message is a sum over paths to the output
slide-16
SLIDE 16
  • Tracing approach builds a graph during the execution phase, then accumulates it
  • Source-to-source produces a gradient program matching the structure of the original

Source-to-source translation

slide-17
SLIDE 17

Multiply-all example

Input program:
  c[1] = x[1]
  for i = 2 to n
    c[i] = c[i-1]*x[i]
  return c[n]

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  return dc[n]

Figure: one loop body as a graph; dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i].

slide-18
SLIDE 18

Multiply-all example

Gradient program:
  dcB[n] = 1
  for i = n downto 2
    dcB[i-1] = dcB[i]*x[i]
    dxB[i] = dcB[i]*c[i-1]
  dxB[1] = dcB[1]
  return dxB

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  return dc[n]

Figure: backward messages dcB flow against the loop; dxB[i] = dcB[i]*c[i-1] and dcB[i-1] = dcB[i]*x[i].
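A minimal runnable version of the gradient (reverse) program in Python, assuming the same multiply-all example; it reproduces βˆ‚f/βˆ‚x[j] = ∏_{kβ‰ j} x[k].

import numpy as np

def multiply_all_gradient(x):
    # Forward sweep: c[i] = c[i-1] * x[i]
    n = len(x)
    c = np.empty(n)
    c[0] = x[0]
    for i in range(1, n):
        c[i] = c[i-1] * x[i]
    # Reverse sweep: accumulate backward messages dcB and dxB
    dcB = np.empty(n)
    dxB = np.empty(n)
    dcB[n-1] = 1.0
    for i in range(n-1, 0, -1):
        dcB[i-1] = dcB[i] * x[i]
        dxB[i] = dcB[i] * c[i-1]
    dxB[0] = dcB[0]
    return dxB

x = np.array([2.0, 3.0, 5.0])
print(multiply_all_gradient(x))   # [15. 10.  6.] = gradient of prod(x)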

slide-19
SLIDE 19

General case

c = f(x,y)
dc = df1(x,y)*dx + df2(x,y)*dy
dxB = dcB * df1(x,y)
dyB = dcB * df2(x,y)

Figure: the dc node fans back into dxB and dyB with scale factors df1 and df2.

slide-20
SLIDE 20
  • If a variable is read multiple times, we need to add its backward messages
  • Non-incremental approach: transform the program so that each variable is defined and used at most once on every execution path

Fan-out

slide-21
SLIDE 21

Fan-out example

Input program:
  a = x * y
  b = y * z
  c = a + b

Edge program:
  (y1,y2) = dup(y)
  a = x * y1
  b = y2 * z
  c = a + b

Gradient program:
  aB = cB
  bB = cB
  y2B = bB * z
  y1B = aB * x
  yB = y1B + y2B
  …

Figure: the dup node splits y into y1 and y2; their backward messages are added to form yB.
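A small runnable check (Python, not from the slides) of the fan-out rule for this example: the backward messages arriving at the two copies of y are added.

# Gradient program for c = x*y + y*z, written with an explicit dup of y.
def gradient(x, y, z, cB=1.0):
    y1, y2 = y, y              # (y1, y2) = dup(y)
    aB = cB                    # c = a + b
    bB = cB
    y1B = aB * x               # a = x * y1
    xB = aB * y1
    y2B = bB * z               # b = y2 * z
    zB = bB * y2
    yB = y1B + y2B             # fan-out: backward messages add
    return xB, yB, zB

print(gradient(2.0, 3.0, 5.0))   # (3.0, 7.0, 3.0), matching grad of x*y + y*z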

slide-22
SLIDE 22

Summary of AutoDiff

                               AD     Message passing
Programs not formulas          Yes    Yes
Graph structure / sparsity     Yes    Yes
Source-to-source               Yes    Yes
Only one execution path        Yes    Not always
Single forward-backward sweep  Yes    Not always
Exact                          Yes    Not always

slide-23
SLIDE 23
  • 2. AutoDiff lacks approximation

slide-24
SLIDE 24
  • Mini-batching
  • User changes the input program to be approximate, then computes the exact gradient

Approximate gradients for big models

βˆ‡ βˆ‘_{j=1..n} f_j(ΞΈ)  β‰ˆ  βˆ‡ (n/m) βˆ‘_{i~(1:n)} f_i(ΞΈ)  =  (n/m) βˆ‘_{i~(1:n)} βˆ‡f_i(ΞΈ)    (AutoDiff)
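A minimal numerical illustration (Python, with an assumed quadratic loss) of the estimator above: the input program is rewritten as a scaled sum over a random mini-batch, and the exact gradient of that approximate objective is an unbiased estimate of the full gradient.

import numpy as np

rng = np.random.default_rng(0)
n, theta = 1000, 0.3
data = rng.normal(loc=1.0, size=n)

def grad_term(x, theta):
    # gradient of f_j(theta) = 0.5*(x_j - theta)^2 with respect to theta
    return theta - x

full_grad = sum(grad_term(x, theta) for x in data)

m = 50
batch = rng.choice(n, size=m, replace=False)
minibatch_grad = (n / m) * sum(grad_term(data[i], theta) for i in batch)

print(full_grad, minibatch_grad)   # the second is an unbiased estimate of the first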

slide-25
SLIDE 25

1. Approximate the marginal log-likelihood with a lower bound
2. Approximate the lower bound by importance sampling
3. Compute the exact gradient of the approximation

Black-box variational inference

∫ π‘ž 𝑦, 𝐸 𝑒𝑦 β‰₯ βˆ’πΏπ‘€ π‘Ÿ | π‘ž)

slide-26
SLIDE 26
  • AutoDiff can mechanically derive reverse summation algorithms for tractable models
  • Markov chains, Bayesian networks (Darwiche, 2003)
  • Generative grammars, parse trees (Eisner, 2016)
  • Posterior expectations are derivatives of the marginal log-likelihood, which can be computed exactly
  • User must provide the forward summation algorithm

AutoDiff in Tractable Models

Figure: parse tree for a grammar with rules S β†’ A A and A β†’ A B.
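A small sketch of this idea for a two-state Markov chain (Python; the transition matrix and evidence are made-up numbers, and a finite difference stands in for AutoDiff): the derivative of the marginal log-likelihood with respect to a log-evidence term is the posterior marginal of the corresponding state.

import numpy as np

T0 = np.array([[0.9, 0.1],
               [0.2, 0.8]])              # transition matrix
prior = np.array([0.5, 0.5])

def log_marginal(log_evidence):
    # Forward summation: alpha[t] = (alpha[t-1] @ T0) * evidence[t]
    alpha = prior * np.exp(log_evidence[0])
    for le in log_evidence[1:]:
        alpha = (alpha @ T0) * np.exp(le)
    return np.log(alpha.sum())

rng = np.random.default_rng(1)
log_ev = rng.normal(size=(4, 2))          # 4 time steps, 2 states

# d logZ / d log_evidence[t, s] equals the posterior marginal P(state_t = s);
# the finite difference below stands in for running AutoDiff on log_marginal.
t, s, eps = 2, 1, 1e-6
bumped = log_ev.copy()
bumped[t, s] += eps
posterior = (log_marginal(bumped) - log_marginal(log_ev)) / eps
print("P(state_2 = 1 | evidence) ~", posterior)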

slide-27
SLIDE 27
  • Approximation is useful in tractable models
  • Sparse forward-backward (Pal et al., 2006)
  • Beam parsing (Goodman, 1997)
  • Cannot be obtained through AutoDiff of an approximate model
  • Neither can Viterbi

Approximation in Tractable Models

slide-28
SLIDE 28
  • Expectations
  • Fixed-point iteration
  • Optimization
  • Root finding
  • Should all be natively supported

MLL should facilitate approximations

slide-29
SLIDE 29
  • 3. Message-passing generalizes AutoDiff

slide-30
SLIDE 30
  • Approximate reasoning about the exponential state space of a program, along all execution paths
  • Propagates state summaries in both directions
  • Forward can depend on backward and vice versa
  • Iterate to convergence

Message-passing

slide-31
SLIDE 31
  • What is the largest and smallest value each variable could have?
  • Each operation in the program is interpreted as a constraint between its inputs and output
  • Propagates information forward and backward until convergence

Interval constraint propagation

slide-32
SLIDE 32

Find (x, y) that satisfies x^2 + y^2 = 1 and y = x^2

Circle-parabola example

slide-33
SLIDE 33

Circle-parabola program

Input program:
  y = x^2
  yy = y^2
  z = y + yy
  assert(z == 1)

Figure: computation graph with nodes x, y, yy, z.

slide-34
SLIDE 34

Interval propagation program

Edge program:
  y = x^2
  (y1,y2) = dup(y)
  yy = y1^2
  z = y2 + yy
  assert(z == 1)

Input program:
  y = x^2
  yy = y^2
  z = y + yy
  assert(z == 1)

Figure: graph with a dup node splitting y into y1 and y2.

slide-35
SLIDE 35

Interval propagation program

Message program:
  Until convergence:
    yF = xF^2
    y1F = yF ∩ y2B
    y2F = yF ∩ y1B
    yyF = y1F^2
    y1B = sqrt(y1F, yyB)
    y2B = zB - yyF
    yyB = zB - y2F
    zB = [1,1]

Edge program:
  y = x^2
  (y1,y2) = dup(y)
  yy = y1^2
  z = y2 + yy
  assert(z == 1)

Figure: edge graph annotated with forward (F) and backward (B) messages, e.g. y1F and y1B on the y1 edge.
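Here is a runnable sketch of the message program in Python (helper functions are illustrative; intervals are plain float pairs and empty intersections are not handled). Starting from xF = (0.1, 1) it converges to x β‰ˆ 0.786, as reported on the Results slide.

import math

INF = float("inf")

def square(iv):                      # image of an interval under t -> t^2
    lo, hi = iv
    if lo >= 0:  return (lo*lo, hi*hi)
    if hi <= 0:  return (hi*hi, lo*lo)
    return (0.0, max(lo*lo, hi*hi))

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def sub(a, b):                       # interval subtraction a - b
    return (a[0] - b[1], a[1] - b[0])

def sqrt_backward(f, bwd):           # project[ f ∩ sqrt(bwd) ]
    lo, hi = max(bwd[0], 0.0), max(bwd[1], 0.0)
    branches = [(math.sqrt(lo), math.sqrt(hi)),
                (-math.sqrt(hi), -math.sqrt(lo))]
    pieces = [p for p in (intersect(f, b) for b in branches) if p]
    return (min(p[0] for p in pieces), max(p[1] for p in pieces))

xF = (0.1, 1.0)
y1B = y2B = yyB = (-INF, INF)
zB = (1.0, 1.0)
for _ in range(200):                 # fixed sweep count stands in for "until convergence"
    yF  = square(xF)
    y1F = intersect(yF, y2B)
    y2F = intersect(yF, y1B)
    yyF = square(y1F)
    y1B = sqrt_backward(y1F, yyB)
    y2B = sub(zB, yyF)
    yyB = sub(zB, y2F)
yB = intersect(y1B, y2B)
xB = sqrt_backward(xF, yB)
print(xB)                            # ~ (0.786, 0.786)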

slide-36
SLIDE 36

Running ^2 backwards

y1B = sqrt(y1F, yyB) = project[ y1F ∩ sqrt(yyB) ]

Figure: the ^2 node with forward message y1F and backward message y1B.

yy = y1^2
yyB = [1, 4]
sqrt(yyB) = [-2, -1] βˆͺ [1, 2]
y1F = [0, 10]
y1F ∩ sqrt(yyB) = [] βˆͺ [1, 2]
project[ y1F ∩ sqrt(yyB) ] = [1, 2]
y1F ∩ project[ sqrt(yyB) ] = [0, 2]

slide-37
SLIDE 37
  • If all intervals start at (βˆ’βˆž, ∞), then x β†’ [βˆ’1, 1] (overestimate)
  • Apply subdivision
  • Starting at x = (0.1, 1) gives x β†’ (0.786, 0.786)

Results

slide-38
SLIDE 38

Interval propagation program

yF = xF^2
zB = [1,1]
Until convergence:
  (perform updates)
yB = y1B ∩ y2B
xB = sqrt(xF, yB)

Until convergence:
  yF = xF^2
  xB = sqrt(xF, yB)
  yB = y1B ∩ y2B
  y1F = yF ∩ y2B
  y2F = yF ∩ y1B
  …
  zB = [1,1]

Figure: graph of the circle-parabola edge program (x, y, y1, y2, yy, z).

slide-39
SLIDE 39

1. Pass messages into the loopy core
2. Iterate
3. Pass messages out of the loopy core

Analogous to Stan’s “transformed data” and “generated quantities”

Typical message-passing program

slide-40
SLIDE 40
  • Message dependencies dictate execution
  • If forward messages do not depend on backward ones, it becomes non-iterative
  • If forward messages only include a single state, only one execution path is explored
  • AutoDiff has both properties

Simplifications of message-passing

slide-41
SLIDE 41

Other message-passing algorithms

slide-42
SLIDE 42
  • Probabilistic programs are the new Bayesian networks
  • Using a program to specify a probabilistic model
  • Program is not a black box: undergoes analysis and transformation to help inference

Probabilistic Programming

slide-43
SLIDE 43
  • Loopy belief propagation has the same structure as interval propagation, but uses distributions
  • Gives forward and backward summations for tractable models
  • Expectation propagation adds projection steps
  • Approximate expectations for intractable models
  • Parameter estimation in non-conjugate models

Loopy belief propagation

slide-44
SLIDE 44
  • Parameters send their current value out, receive gradients in, take a step
  • Gradients fall out of the EP equations
  • Part of the same iteration loop

Gradient descent

Figure: parameter node ΞΈ sends its current value and receives the gradient βˆ‡f(ΞΈ) from the factor f(ΞΈ).

slide-45
SLIDE 45
  • Variables send their current value out, receive conditional distributions in
  • Collapsed variables send/receive distributions as in BP
  • No need to collapse in the model

Gibbs sampling

Figure: variables x and y exchanging messages: sampled values xα΅ˆ, yα΅ˆ and distributions p(x), p(y = yα΅ˆ | x), p(y | x = xα΅ˆ).

slide-46
SLIDE 46

On the Balcony

Thanks!

Model-based machine learning book: http://mbmlbook.com/
Infer.NET is open source: http://dotnet.github.io/infer