On the Balcony, Tom Minka, Microsoft Research - PowerPoint PPT Presentation




slide-1
SLIDE 1

On the Balcony

From automatic differentiation to message passing

Tom Minka
Microsoft Research

slide-2
SLIDE 2

What I do

Algorithms for probabilistic inference

  • Expectation Propagation
  • Non-conjugate variational message passing
  • A* sampling
  • Probabilistic Programming
  • TrueSkill

slide-3
SLIDE 3

On the Balcony

  • A machine learning language should (among other things) simplify implementation of machine learning algorithms

Machine Learning Language

slide-4
SLIDE 4

On the Balcony

  • A general-purpose machine learning language should (among other things) simplify implementation of all machine learning algorithms

Machine Learning Language

slide-5
SLIDE 5

On the Balcony

  • 1. Automatic Differentiation
  • 2. AutoDiff lacks approximation
  • 3. Message passing generalizes AutoDiff
  • 4. Compiling to message passing

Roadmap

slide-6
SLIDE 6

1. Automatic / algorithmic differentiation

slide-7
SLIDE 7
  • β€œEvaluating derivatives” by Griewank

and Walther (2008) Recommended reading

slide-8
SLIDE 8
  • Programs can specify mathematical functions more compactly than formulas
  • Program is not a black box: undergoes analysis and transformation
  • Numbers are assumed to have infinite precision

Programs are the new formulas

slide-9
SLIDE 9
  • As formulas:
  • f = ∏_j x_j
  • df = βˆ‘_j dx_j ∏_{kβ‰ j} x_k

Multiply-all example

slide-10
SLIDE 10

Multiply-all example

Input program:
  c[1] = x[1]
  for i = 2 to n
    c[i] = c[i-1]*x[i]
  f = c[n]

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  df = dc[n]

f = ∏_j x_j        df = βˆ‘_j dx_j ∏_{kβ‰ j} x_k

slide-11
SLIDE 11
  • Execution: replace every operation with a linear one
  • Accumulation: collect linear coefficients

Phases of AD

slide-12
SLIDE 12

Execution phase

Figure: computation graph of f = x*y + y*z. The execution phase replaces each operation by its linearization, producing dx*y + x*dy + dy*z + y*dz; the edge labels (y, x, z, y and 1, 1) are the scale factors.

slide-13
SLIDE 13

Accumulation phase

Figure: linearized graph from the previous slide, with edge scale factors y, x, z, y and 1, 1 feeding dx*y + x*dy + dy*z + y*dz.

coefficient of dx = 1*y
coefficient of dy = 1*x + 1*z
coefficient of dz = 1*y
Gradient vector = (1*y, 1*x + 1*z, 1*y)

(Forward) (Reverse)
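A small sketch (Python, illustrative node names, not from the slides) of the accumulation phase viewed as a sum over paths: backward messages from the output are propagated through the scale factors above, recovering the gradient of f = x*y + y*z.

# Reverse accumulation as dynamic programming over the linearized graph.
# Nodes: dx, dy, dz feed intermediate sums a and b, which feed df.
# Edges carry the scale factors shown on the slide.

def reverse_accumulate(edges, output):
    # edges: {child: [(parent, scale), ...]}; message[node] = sum over
    # paths from node to the output of the product of scale factors.
    message = {output: 1.0}
    def msg(node):
        if node not in message:
            message[node] = sum(s * msg(p) for p, s in edges[node])
        return message[node]
    return msg

x, y, z = 2.0, 3.0, 5.0
edges = {
    "dx": [("a", y)], "dy": [("a", x), ("b", z)], "dz": [("b", y)],
    "a": [("df", 1.0)], "b": [("df", 1.0)],
}
msg = reverse_accumulate(edges, "df")
print([msg(v) for v in ("dx", "dy", "dz")])   # [3.0, 7.0, 3.0]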

slide-14
SLIDE 14

Linear composition

Figure: composing two linear layers with scale factors a, b, c, d and e, f. The expression e*(a*x + b*y) + f*(c*y + d*z) collapses to (e*a)*x + (e*b + f*c)*y + (f*d)*z: scale factors multiply along a path and add across parallel paths.

slide-15
SLIDE 15

Dynamic programming

Figure: shared coefficients e and f combine into (e+f)*a and (e+f)*b before reaching x and y.

  • Reverse accumulation is dynamic programming
  • Backward message is a sum over paths to the output
slide-16
SLIDE 16
  • Tracing approach builds a graph during the execution phase, then accumulates it
  • Source-to-source produces a gradient program matching the structure of the original

Source-to-source translation

slide-17
SLIDE 17

Multiply-all example

Input program:
  c[1] = x[1]
  for i = 2 to n
    c[i] = c[i-1]*x[i]
  return c[n]

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  return dc[n]

Figure: one loop body as a graph; dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i].

slide-18
SLIDE 18

Multiply-all example

Gradient program:
  dcB[n] = 1
  for i = n downto 2
    dcB[i-1] = dcB[i]*x[i]
    dxB[i] = dcB[i]*c[i-1]
  dxB[1] = dcB[1]
  return dxB

Derivative program:
  dc[1] = dx[1]
  for i = 2 to n
    dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
  return dc[n]

Figure: backward messages dcB flow against the loop; dxB[i] = dcB[i]*c[i-1] and dcB[i-1] = dcB[i]*x[i].
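A minimal runnable version of the gradient (reverse) program in Python, assuming the same multiply-all example; it reproduces βˆ‚f/βˆ‚x[j] = ∏_{kβ‰ j} x[k].

import numpy as np

def multiply_all_gradient(x):
    # Forward sweep: c[i] = c[i-1] * x[i]
    n = len(x)
    c = np.empty(n)
    c[0] = x[0]
    for i in range(1, n):
        c[i] = c[i-1] * x[i]
    # Reverse sweep: accumulate backward messages dcB and dxB
    dcB = np.empty(n)
    dxB = np.empty(n)
    dcB[n-1] = 1.0
    for i in range(n-1, 0, -1):
        dcB[i-1] = dcB[i] * x[i]
        dxB[i] = dcB[i] * c[i-1]
    dxB[0] = dcB[0]
    return dxB

x = np.array([2.0, 3.0, 5.0])
print(multiply_all_gradient(x))   # [15. 10.  6.] = gradient of prod(x)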

slide-19
SLIDE 19

General case

c = f(x,y)
dc = df1(x,y)*dx + df2(x,y)*dy
dxB = dcB * df1(x,y)
dyB = dcB * df2(x,y)

Figure: the dc node fans back into dxB and dyB with scale factors df1 and df2.

slide-20
SLIDE 20
  • If a variable is read multiple times, we need to add its backward messages
  • Non-incremental approach: transform the program so that each variable is defined and used at most once on every execution path

Fan-out

slide-21
SLIDE 21

Fan-out example

Input program:
  a = x * y
  b = y * z
  c = a + b

Edge program:
  (y1,y2) = dup(y)
  a = x * y1
  b = y2 * z
  c = a + b

Gradient program:
  aB = cB
  bB = cB
  y2B = bB * z
  y1B = aB * x
  yB = y1B + y2B
  …

Figure: the dup node splits y into y1 and y2; their backward messages are added to form yB.
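A small runnable check (Python, not from the slides) of the fan-out rule for this example: the backward messages arriving at the two copies of y are added.

# Gradient program for c = x*y + y*z, written with an explicit dup of y.
def gradient(x, y, z, cB=1.0):
    y1, y2 = y, y              # (y1, y2) = dup(y)
    aB = cB                    # c = a + b
    bB = cB
    y1B = aB * x               # a = x * y1
    xB = aB * y1
    y2B = bB * z               # b = y2 * z
    zB = bB * y2
    yB = y1B + y2B             # fan-out: backward messages add
    return xB, yB, zB

print(gradient(2.0, 3.0, 5.0))   # (3.0, 7.0, 3.0), matching grad of x*y + y*z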

slide-22
SLIDE 22

Summary of AutoDiff

                               AD     Message passing
Programs not formulas          Yes    Yes
Graph structure / sparsity     Yes    Yes
Source-to-source               Yes    Yes
Only one execution path        Yes    Not always
Single forward-backward sweep  Yes    Not always
Exact                          Yes    Not always

slide-23
SLIDE 23
  • 2. AutoDiff lacks approximation

slide-24
SLIDE 24
  • Mini-batching
  • User changes the input program to be approximate, then computes the exact gradient

Approximate gradients for big models

βˆ‡ βˆ‘_{j=1..n} f_j(ΞΈ)  β‰ˆ  βˆ‡ (n/m) βˆ‘_{i~(1:n)} f_i(ΞΈ)  =  (n/m) βˆ‘_{i~(1:n)} βˆ‡f_i(ΞΈ)    (AutoDiff)
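A minimal numerical illustration (Python, with an assumed quadratic loss) of the estimator above: the input program is rewritten as a scaled sum over a random mini-batch, and the exact gradient of that approximate objective is an unbiased estimate of the full gradient.

import numpy as np

rng = np.random.default_rng(0)
n, theta = 1000, 0.3
data = rng.normal(loc=1.0, size=n)

def grad_term(x, theta):
    # gradient of f_j(theta) = 0.5*(x_j - theta)^2 with respect to theta
    return theta - x

full_grad = sum(grad_term(x, theta) for x in data)

m = 50
batch = rng.choice(n, size=m, replace=False)
minibatch_grad = (n / m) * sum(grad_term(data[i], theta) for i in batch)

print(full_grad, minibatch_grad)   # the second is an unbiased estimate of the first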

slide-25
SLIDE 25

1. Approximate the marginal log-likelihood with a lower bound
2. Approximate the lower bound by importance sampling
3. Compute the exact gradient of the approximation

Black-box variational inference

∫ π‘ž 𝑦, 𝐸 𝑒𝑦 β‰₯ βˆ’πΏπ‘€ π‘Ÿ | π‘ž)

slide-26
SLIDE 26
  • AutoDiff can mechanically derive reverse summation algorithms for tractable models
  • Markov chains, Bayesian networks (Darwiche, 2003)
  • Generative grammars, parse trees (Eisner, 2016)
  • Posterior expectations are derivatives of the marginal log-likelihood, which can be computed exactly
  • User must provide the forward summation algorithm

AutoDiff in Tractable Models

Figure: parse tree for a grammar with rules S β†’ A A and A β†’ A B.
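A small sketch of this idea for a two-state Markov chain (Python; the transition matrix and evidence are made-up numbers, and a finite difference stands in for AutoDiff): the derivative of the marginal log-likelihood with respect to a log-evidence term is the posterior marginal of the corresponding state.

import numpy as np

T0 = np.array([[0.9, 0.1],
               [0.2, 0.8]])              # transition matrix
prior = np.array([0.5, 0.5])

def log_marginal(log_evidence):
    # Forward summation: alpha[t] = (alpha[t-1] @ T0) * evidence[t]
    alpha = prior * np.exp(log_evidence[0])
    for le in log_evidence[1:]:
        alpha = (alpha @ T0) * np.exp(le)
    return np.log(alpha.sum())

rng = np.random.default_rng(1)
log_ev = rng.normal(size=(4, 2))          # 4 time steps, 2 states

# d logZ / d log_evidence[t, s] equals the posterior marginal P(state_t = s);
# the finite difference below stands in for running AutoDiff on log_marginal.
t, s, eps = 2, 1, 1e-6
bumped = log_ev.copy()
bumped[t, s] += eps
posterior = (log_marginal(bumped) - log_marginal(log_ev)) / eps
print("P(state_2 = 1 | evidence) ~", posterior)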

slide-27
SLIDE 27
  • Approximation is useful in tractable models
  • Sparse forward-backward (Pal et al., 2006)
  • Beam parsing (Goodman, 1997)
  • Cannot be obtained through AutoDiff of an approximate model
  • Neither can Viterbi

Approximation in Tractable Models

slide-28
SLIDE 28
  • Expectations
  • Fixed-point iteration
  • Optimization
  • Root finding
  • Should all be natively supported

MLL should facilitate approximations

slide-29
SLIDE 29
  • 3. Message-passing generalizes AutoDiff

slide-30
SLIDE 30
  • Approximate reasoning about the exponential state space of a program, along all execution paths
  • Propagates state summaries in both directions
  • Forward can depend on backward and vice versa
  • Iterate to convergence

Message-passing

slide-31
SLIDE 31
  • What is the largest and smallest value each variable could have?
  • Each operation in the program is interpreted as a constraint between its inputs and output
  • Propagates information forward and backward until convergence

Interval constraint propagation

slide-32
SLIDE 32

Find (x, y) that satisfies x^2 + y^2 = 1 and y = x^2

Circle-parabola example

slide-33
SLIDE 33

Circle-parabola program

Input program:
  y = x^2
  yy = y^2
  z = y + yy
  assert(z == 1)

Figure: computation graph with nodes x, y, yy, z.

slide-34
SLIDE 34

Interval propagation program

Edge program:
  y = x^2
  (y1,y2) = dup(y)
  yy = y1^2
  z = y2 + yy
  assert(z == 1)

Input program:
  y = x^2
  yy = y^2
  z = y + yy
  assert(z == 1)

Figure: graph with a dup node splitting y into y1 and y2.

slide-35
SLIDE 35

Interval propagation program

Message program:
  Until convergence:
    yF = xF^2
    y1F = yF ∩ y2B
    y2F = yF ∩ y1B
    yyF = y1F^2
    y1B = sqrt(y1F, yyB)
    y2B = zB - yyF
    yyB = zB - y2F
    zB = [1,1]

Edge program:
  y = x^2
  (y1,y2) = dup(y)
  yy = y1^2
  z = y2 + yy
  assert(z == 1)

Figure: edge graph annotated with forward (F) and backward (B) messages, e.g. y1F and y1B on the y1 edge.
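Here is a runnable sketch of the message program in Python (helper functions are illustrative; intervals are plain float pairs and empty intersections are not handled). Starting from xF = (0.1, 1) it converges to x β‰ˆ 0.786, as reported on the Results slide.

import math

INF = float("inf")

def square(iv):                      # image of an interval under t -> t^2
    lo, hi = iv
    if lo >= 0:  return (lo*lo, hi*hi)
    if hi <= 0:  return (hi*hi, lo*lo)
    return (0.0, max(lo*lo, hi*hi))

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def sub(a, b):                       # interval subtraction a - b
    return (a[0] - b[1], a[1] - b[0])

def sqrt_backward(f, bwd):           # project[ f ∩ sqrt(bwd) ]
    lo, hi = max(bwd[0], 0.0), max(bwd[1], 0.0)
    branches = [(math.sqrt(lo), math.sqrt(hi)),
                (-math.sqrt(hi), -math.sqrt(lo))]
    pieces = [p for p in (intersect(f, b) for b in branches) if p]
    return (min(p[0] for p in pieces), max(p[1] for p in pieces))

xF = (0.1, 1.0)
y1B = y2B = yyB = (-INF, INF)
zB = (1.0, 1.0)
for _ in range(200):                 # fixed sweep count stands in for "until convergence"
    yF  = square(xF)
    y1F = intersect(yF, y2B)
    y2F = intersect(yF, y1B)
    yyF = square(y1F)
    y1B = sqrt_backward(y1F, yyB)
    y2B = sub(zB, yyF)
    yyB = sub(zB, y2F)
yB = intersect(y1B, y2B)
xB = sqrt_backward(xF, yB)
print(xB)                            # ~ (0.786, 0.786)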

slide-36
SLIDE 36

Running ^2 backwards

y1B = sqrt(y1F, yyB) = project[ y1F ∩ sqrt(yyB) ]

Figure: the ^2 node with forward message y1F and backward message y1B.

yy = y1^2
yyB = [1, 4]
sqrt(yyB) = [-2, -1] βˆͺ [1, 2]
y1F = [0, 10]
y1F ∩ sqrt(yyB) = [] βˆͺ [1, 2]
project[ y1F ∩ sqrt(yyB) ] = [1, 2]
y1F ∩ project[ sqrt(yyB) ] = [0, 2]

slide-37
SLIDE 37
  • If all intervals start at (βˆ’βˆž, ∞), then x β†’ [βˆ’1, 1] (overestimate)
  • Apply subdivision
  • Starting at x = (0.1, 1) gives x β†’ (0.786, 0.786)

Results

slide-38
SLIDE 38

Interval propagation program

yF = xF^2
zB = [1,1]
Until convergence:
  (perform updates)
yB = y1B ∩ y2B
xB = sqrt(xF, yB)

Until convergence:
  yF = xF^2
  xB = sqrt(xF, yB)
  yB = y1B ∩ y2B
  y1F = yF ∩ y2B
  y2F = yF ∩ y1B
  …
  zB = [1,1]

Figure: graph of the circle-parabola edge program (x, y, y1, y2, yy, z).

slide-39
SLIDE 39

1. Pass messages into the loopy core
2. Iterate
3. Pass messages out of the loopy core

Analogous to Stan’s “transformed data” and “generated quantities”

Typical message-passing program

slide-40
SLIDE 40
  • Message dependencies dictate execution
  • If forward messages do not depend on backward ones, it becomes non-iterative
  • If forward messages only include a single state, only one execution path is explored
  • AutoDiff has both properties

Simplifications of message-passing

slide-41
SLIDE 41

Other message-passing algorithms

slide-42
SLIDE 42
  • Probabilistic programs are the new Bayesian networks
  • Using a program to specify a probabilistic model
  • Program is not a black box: undergoes analysis and transformation to help inference

Probabilistic Programming

slide-43
SLIDE 43
  • Loopy belief propagation has the same structure as interval propagation, but uses distributions
  • Gives forward and backward summations for tractable models
  • Expectation propagation adds projection steps
  • Approximate expectations for intractable models
  • Parameter estimation in non-conjugate models

Loopy belief propagation

slide-44
SLIDE 44
  • Parameters send their current value out, receive gradients in, take a step
  • Gradients fall out of the EP equations
  • Part of the same iteration loop

Gradient descent

Figure: parameter node ΞΈ sends its current value and receives the gradient βˆ‡f(ΞΈ) from the factor f(ΞΈ).

slide-45
SLIDE 45
  • Variables send their current value out, receive conditional distributions in
  • Collapsed variables send/receive distributions as in BP
  • No need to collapse in the model

Gibbs sampling

Figure: variables x and y exchanging messages: sampled values xα΅ˆ, yα΅ˆ and distributions p(x), p(y = yα΅ˆ | x), p(y | x = xα΅ˆ).

slide-46
SLIDE 46

On the Balcony

Thanks!

Model-based machine learning book: http://mbmlbook.com/
Infer.NET is open source: http://dotnet.github.io/infer