
slide-1
SLIDE 1

The computations of acting agents and the agents acting in computations

Philipp Hennig ICERM 5 June 2017

Research Group for Probabilistic Numerics Max Planck Institute for Intelligent Systems Tübingen, Germany

Some of the presented work was supported by the Emmy Noether Programme of the DFG

slide-2
SLIDE 2

Part I: The computations of acting agents 09:00–09:45

a minimal introduction to machine learning
the computational tasks of learning agents
some special challenges, some house numbers

Part II: The agents acting in computations 10:30–11:15

computation is inference
new challenges require new answers
a computer science view on numerical computations

1

slide-3
SLIDE 3

An Acting Agent

autonomous interaction with a data source (from Hennig, Osborne, Girolami, Proc. Roy. Soc. A, 2015)

[block diagram: machine/environment loop with data D, variables xt, parameters θ, prediction xt+δt, action at; learning / inference / system id.]

inference by quadrature, estimation by optimization, prediction by analysis, action by control

2

slide-4
SLIDE 4

The Very Foundation

probabilistic inference

p(x | D) = p(x) p(D | x) / ∫ p(x) p(D | x) dx

prior: explicit representation of assumptions about latent variables
likelihood: explicit representation of assumptions about generation of data
posterior: structured uncertainty over prediction
evidence: marginal likelihood of model

N(x; µ, Σ) = 1/√(2π|Σ|) · exp(−½ (x − µ)⊺ Σ⁻¹ (x − µ))

3
slide-5
SLIDE 5

Gaussian Inference

the link between probabilistic inference and linear algebra

products of Gaussians are Gaussians
N(x; a, A) N(x; b, B) = N(x; c, C) N(a; b, A + B),  with C := (A⁻¹ + B⁻¹)⁻¹, c := C(A⁻¹a + B⁻¹b)

marginals of Gaussians are Gaussians
∫ N([x; y]; [µx; µy], [Σxx, Σxy; Σyx, Σyy]) dy = N(x; µx, Σxx)

(linear) conditionals of Gaussians are Gaussians
p(x | y) = p(x, y) / p(y) = N(x; µx + Σxy Σyy⁻¹ (y − µy), Σxx − Σxy Σyy⁻¹ Σyx)

linear projections of Gaussians are Gaussians
p(z) = N(z; µ, Σ) ⇒ p(Az) = N(Az; Aµ, AΣA⊺)

Bayesian inference becomes linear algebra:
p(x) = N(x; µ, Σ),  p(y | x) = N(y; A⊺x + b, Λ)
⇒ p(B⊺x + c | y) = N[B⊺x + c; B⊺µ + c + B⊺ΣA(A⊺ΣA + Λ)⁻¹(y − A⊺µ − b), B⊺ΣB − B⊺ΣA(A⊺ΣA + Λ)⁻¹A⊺ΣB]
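
As a concrete reading of the last identity, here is a minimal NumPy sketch (function name and interface are illustrative, not from the slides): conditioning a Gaussian prior on a linear-Gaussian observation is nothing but a few matrix products and one solve with the Gram matrix A⊺ΣA + Λ.

```python
import numpy as np

def linear_gaussian_posterior(mu, Sigma, A, b, Lam, y):
    # prior p(x) = N(mu, Sigma), observation p(y | x) = N(A.T @ x + b, Lam)
    # returns mean and covariance of p(x | y): pure linear algebra, one solve with the Gram matrix
    G = A.T @ Sigma @ A + Lam
    SA = Sigma @ A
    mu_post = mu + SA @ np.linalg.solve(G, y - A.T @ mu - b)
    Sigma_post = Sigma - SA @ np.linalg.solve(G, SA.T)
    return mu_post, Sigma_post
```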

4

slide-6
SLIDE 6

A Minimal Machine Learning Setup

nonlinear regression problem

[plot: training data (x, y)]

p(y | fX) = N(y; fX, σI)

5

slide-7
SLIDE 7

Gaussian Parametric Regression

  • aka. general linear least-squares

[plot: samples f(x) over x]

f(x) = φ(x)⊺w = ∑_i wi φi(x),  p(w) = N(w; µ, Σ) ⇒ p(f) = N(f; φ⊺µ, φ⊺Σφ)

φi(x) = I(x > ai) · ci(x − ai)  (ReLU)

6

slide-9
SLIDE 9

Gaussian Parametric Regression

  • aka. general linear least-squares

p(y | w, φX) = N(y; φX⊺w, σ²I)

p(fx | y, φX) = N(fx; φx⊺µ + φx⊺ΣφX(φX⊺ΣφX + σ²I)⁻¹(y − φX⊺µ),
                      φx⊺Σφx − φx⊺ΣφX(φX⊺ΣφX + σ²I)⁻¹φX⊺Σφx)

6

[plot: posterior over f(x) given data]
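
A compact NumPy sketch of this weight-space posterior, assuming µ = 0, Σ = I and the ReLU features φi(x) = I(x > ai)(x − ai) with ci = 1; feature locations, noise level and data are illustrative choices, not taken from the slides.

```python
import numpy as np

def relu_features(x, a):
    # phi_i(x) = I(x > a_i) * (x - a_i); shape (len(a), len(x))
    return np.maximum(x[None, :] - a[:, None], 0.0)

def parametric_posterior(x_train, y, x_test, a, sigma=0.1):
    # prior p(w) = N(0, I); likelihood p(y | w) = N(Phi_X.T w, sigma^2 I)
    Phi_X, Phi_x = relu_features(x_train, a), relu_features(x_test, a)
    G = Phi_X.T @ Phi_X + sigma ** 2 * np.eye(len(x_train))
    mean = Phi_x.T @ Phi_X @ np.linalg.solve(G, y)
    cov = Phi_x.T @ Phi_x - Phi_x.T @ Phi_X @ np.linalg.solve(G, Phi_X.T @ Phi_x)
    return mean, cov

a = np.linspace(-8, 8, 20)                       # feature "kinks"
x_train, x_test = np.linspace(-6, 6, 15), np.linspace(-8, 8, 100)
y = np.sin(x_train) + 0.1 * np.random.default_rng(0).standard_normal(15)
mean, cov = parametric_posterior(x_train, y, x_test, a)
```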

slide-10
SLIDE 10

The Choice of Prior Matters

Bayesian framework provides flexible yet explicit modelling language

φi(x) = θ exp(−(x − ci)² / (2λ²))

7

[plot: posterior over f(x) under Gaussian features]

slide-12
SLIDE 12

popular extension no. 1: requires large-scale linear algebra

p(fx | y, φX) = N(fx; φx⊺µ + φx⊺ΣφX(φX⊺ΣφX + σ²I)⁻¹(y − φX⊺µ),
                      φx⊺Σφx − φx⊺ΣφX(φX⊺ΣφX + σ²I)⁻¹φX⊺Σφx)

set µ = 0
aim for a closed-form expression of the kernel φa⊺Σφb

8

slide-13
SLIDE 13

Features are cheap, so let’s use a lot

an example [DJC MacKay, 1998]

For simplicity, let’s fix Σ = σ²(cmax − cmin)/F · I, thus

φ(xi)⊺Σφ(xj) = σ²(cmax − cmin)/F · ∑_{ℓ=1}^{F} φℓ(xi) φℓ(xj)

especially, for φℓ(x) = exp(−(x − cℓ)²/(2λ²)):

φ(xi)⊺Σφ(xj) = σ²(cmax − cmin)/F · ∑_{ℓ=1}^{F} exp(−(xi − cℓ)²/(2λ²)) exp(−(xj − cℓ)²/(2λ²))
             = σ²(cmax − cmin)/F · exp(−(xi − xj)²/(4λ²)) ∑_{ℓ=1}^{F} exp(−(cℓ − ½(xi + xj))²/λ²)

9
slide-14
SLIDE 14

Features are cheap, so let’s use a lot

an example [DJC MacKay, 1998]

φ(xi)⊺Σφ(xj) = σ²(cmax − cmin)/F · exp(−(xi − xj)²/(4λ²)) ∑_{ℓ=1}^{F} exp(−(cℓ − ½(xi + xj))²/λ²)

now increase F so that the number of features in δc approaches F·δc/(cmax − cmin):

φ(xi)⊺Σφ(xj) → σ² exp(−(xi − xj)²/(4λ²)) ∫_{cmin}^{cmax} exp(−(c − ½(xi + xj))²/λ²) dc

let cmin → −∞, cmax → ∞:

k(xi, xj) := φ(xi)⊺Σφ(xj) → √(2π) λ σ² exp(−(xi − xj)²/(4λ²))

10
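
A quick numerical sanity check of this limit (grid, lengthscale and test points are illustrative): as the number F of equally spaced Gaussian features grows, the scaled feature inner product converges, and the limiting value depends on xi, xj only through their difference, as in a stationary kernel.

```python
import numpy as np

def feature_kernel(xi, xj, F, cmin=-20.0, cmax=20.0, lam=1.0, sigma=1.0):
    # phi(xi)^T Sigma phi(xj) with Sigma = sigma^2 * (cmax - cmin) / F * I
    c = np.linspace(cmin, cmax, F)
    phi_i = np.exp(-(xi - c) ** 2 / (2 * lam ** 2))
    phi_j = np.exp(-(xj - c) ** 2 / (2 * lam ** 2))
    return sigma ** 2 * (cmax - cmin) / F * phi_i @ phi_j

# as F grows, the value converges; two pairs with the same distance |xi - xj| give the same number
for F in (10, 100, 1000, 10000):
    print(F, feature_kernel(0.3, 1.1, F), feature_kernel(-2.0, -1.2, F))
# both columns approach a value proportional to exp(-(xi - xj)^2 / (4 lam^2))
```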
slide-15
SLIDE 15

Gaussian Process Regression

  • aka. Kriging, kernel-ridge regression,...

p(f) = GP(0, k),  k(a, b) = exp(−(a − b)² / (2λ²))

11

[plot: samples from the GP prior, f(x) over x]

slide-16
SLIDE 16

Gaussian Process Regression

  • aka. Kriging, kernel-ridge regression,...

p(f | y) = GP(fx; kxX(kXX + σ2I)−1y, kxx − kxX(kXX + σ2I)−1kXx)

11

[plot: GP posterior over f(x) given data]
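
The same posterior in the function-space view, as a minimal NumPy sketch with the squared-exponential kernel (helper names, noise level and lengthscale are illustrative):

```python
import numpy as np

def k_se(a, b, lam=1.0):
    # squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 lam^2))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lam ** 2))

def gp_posterior(x_train, y, x_test, sigma=0.1, lam=1.0):
    # p(f | y) = GP(k_xX (k_XX + sigma^2 I)^{-1} y, k_xx - k_xX (k_XX + sigma^2 I)^{-1} k_Xx)
    K_XX = k_se(x_train, x_train, lam) + sigma ** 2 * np.eye(len(x_train))
    K_xX = k_se(x_test, x_train, lam)
    mean = K_xX @ np.linalg.solve(K_XX, y)
    cov = k_se(x_test, x_test, lam) - K_xX @ np.linalg.solve(K_XX, K_xX.T)
    return mean, cov
```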

slide-17
SLIDE 17

The prior still matters

just one other example out of the space of kernels

For φi(x) = I(x > ci)(x − ci), an analogous limit gives

12

[plot: samples from the integrated-Wiener-process prior, f(x) over x]

slide-18
SLIDE 18

The prior still matters

just one other example out of the space of kernels

p(f) = GP(0, k) with k(a, b) = θ² (1/3 min(a, b)³ + 1/2 |a − b| min(a, b)²): the integrated Wiener process, aka cubic splines.

More on GPs in Paris Perdikaris’ tutorial; more on nonparametric models in Neil Lawrence’s and Tamara Broderick’s talks?

12

[plot: cubic-spline posterior over f(x) given data]

slide-19
SLIDE 19

The Computational Challenge

large-scale linear algebra

α := (kXX + σ²I)⁻¹ y,  with (kXX + σ²I) ∈ R^{N×N}, symm. pos. def.
kaX (kXX + σ²I)⁻¹ kXb
log |kXX + σ²I|

13

slide-20
SLIDE 20

The Computational Challenge

large-scale linear algebra

α := (kXX + σ²I)⁻¹ y,  with (kXX + σ²I) ∈ R^{N×N}, symm. pos. def.
kaX (kXX + σ²I)⁻¹ kXb
log |kXX + σ²I|

Methods in wide use:
exact linear algebra (BLAS), for N ≲ 10⁴ (because O(N³))
(rarely:) iterative Krylov solvers (in part. conjugate gradients), for N ≲ 10⁵

For large scale (O(NM²)):
inducing point methods, Nyström, etc.: using iid structure of the data, kab ≈ k̃au Ω⁻¹ k̃ub with Ω⁻¹ ∈ R^{M×M} (Williams & Seeger, 2001; Quiñonero & Rasmussen, 2005; Snelson & Ghahramani, 2007; Titsias, 2009)
spectral expansions: using algebraic properties of the kernel (Rahimi & Recht 2008; 2009)
in the univariate setting: filtering, using Markov structure (Särkkä 2013)

Both are linear time, with finite error. A bridge to iterative methods is beginning to form, via sub-space recycling (de Roos & P.H., arXiv 1706.00241, 2017)

13
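
To make the inducing-point idea above concrete, here is a minimal sketch of a Nyström-type approximation of (kXX + σ²I)⁻¹y via the Woodbury identity, in O(NM²); the random choice of inducing points and the kernel-callback interface are illustrative simplifications, not the cited algorithms.

```python
import numpy as np

def nystroem_solve(kernel, X, y, m, sigma=0.1, rng=np.random.default_rng(0)):
    # Approximate alpha = (K_XX + sigma^2 I)^{-1} y using the Nystroem factorization
    # K_XX ~ K_Xu K_uu^{-1} K_uX built from m randomly chosen inducing points u.
    u = rng.choice(len(X), size=m, replace=False)
    K_Xu = kernel(X, X[u])            # N x m
    K_uu = kernel(X[u], X[u])         # m x m
    # Woodbury identity: the N x N solve collapses to an m x m solve
    A = sigma ** 2 * K_uu + K_Xu.T @ K_Xu
    return (y - K_Xu @ np.linalg.solve(A, K_Xu.T @ y)) / sigma ** 2

# example with a squared-exponential kernel
se = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / 2)
X = np.linspace(-5, 5, 2000)
y = np.sin(X) + 0.1 * np.random.default_rng(1).standard_normal(len(X))
alpha = nystroem_solve(se, X, y, m=50)
```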

slide-21
SLIDE 21

popular extension no. 2: requires large-scale nonlinear optimization

Maximum likelihood estimation: assume φ(x) = φθ(x)

L(y; θ, w) = −log p(y | φ, w) = 1/(2σ²) ∑_{i=1}^{N} (yi − φθ(xi)⊺w)² + const.

[diagram: a feed-forward network mapping xi through features φ1(xi), φ2(xi), …, φM(xi), with weights w and θ, to yi]

14

slide-22
SLIDE 22

Learning Features

a (in general) non-convex, non-linear optimization problem

L(y; θ, w) = −log p(y | φ, w) = 1/(2σ²) ∑_{i=1}^{N} (yi − φθ(xi)⊺w)² + const.

∇θL = 1/σ² ∑_{i=1}^{N} −(yi − φθ(xi)⊺w) · w⊺∇θφ(xi)    (“back-propagation”)

[plot: fitted f(x) over x after feature learning]

15

slide-23
SLIDE 23

Deep Learning

(really just a quick peek)

in practice:

multiple input dimensions (e.g. pixel intensities)
multi-dimensional output (e.g. structured sentences)
multiple feature layers
structured layers (convolutions, pooling, pyramids, etc.)

[diagram: deep network with input units x1 … xM0, feature layers φ1 … φM1 and ξ1 … ξM2, and outputs y1 … yMo]

16

slide-24
SLIDE 24

Deep Learning has become Mainstream

an increasingly professional industry

Krizhevsky, Sutskever & Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, Adv. in Neural Information Processing Systems (NIPS 2012) 25, pp. 1097–1105

17

slide-25
SLIDE 25

...and continues to impress

predicting whole-image semantic labels

Karpathy & Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, Computer Vision and Pattern Recognition (CVPR 2015)
Zhao, Mathieu & LeCun, “Energy-based generative adversarial networks”, Int. Conf. on Learning Representations (ICLR) 2017

18

slide-26
SLIDE 26

The Computational Challenge

high-dimensional, non-convex, stochastic optimization

contemporary problems are extremely high-dimensional, N > 10⁷
typically badly conditioned (Chaudhari et al., arXiv 1611.01838)
optimizer interacts with model (Chaudhari et al., arXiv 1611.01838; Keskar et al., arXiv 1609.04836)

biggest challenge: stochasticity

L(θ) = 1/N ∑_{i=1}^{N} ℓ(yi; θ) ≈ 1/M ∑_{j=1}^{M} ℓ(yj; θ) =: L̂(θ),   M ≪ N

p(L̂ | L) ≈ N(L̂; L, O((N − M)/M))

⇒ classic optimization paradigms break down.

currently dominant optimizers are surprisingly simple:
stochastic gradient descent (Robbins & Monro, 1951)
RMSPROP (Tieleman & Hinton, unpublished)
ADADELTA (Zeiler, arXiv 1212.5701)
ADAM (Kingma & Ba, ICLR 2015)

more in part II ...

19
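
A tiny synthetic illustration of the batching noise described above (loss distribution and batch sizes are made up for the example): minibatch estimates L̂ scatter around the full-data loss L with a spread that shrinks roughly like 1/√M.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
losses = rng.gamma(2.0, 1.0, size=N)        # stand-in per-example losses l(y_i; theta)
L = losses.mean()                           # full-data loss

for M in (10, 100, 1000, 10000):
    batches = rng.choice(losses, size=(500, M), replace=True)
    L_hat = batches.mean(axis=1)            # 500 minibatch estimates of L
    print(M, L, L_hat.std())                # scatter shrinks roughly like 1/sqrt(M)
```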

slide-27
SLIDE 27

popular extension no. 3: requires high-dimensional integration of probability measures

in p(f) = GP(0, k), what should k be? parametrize k = kθ, µ = µθ, Λ = Λθ

p(y | θ) = ∫ p(y | f, θ) p(f | θ) df = ∫ N(y; fX, Λθ) GP(f; µθ, kθ) df = N(y; µθ_X, Λθ + kθ_XX)

p(f | y) = ∫ p(f | y, θ) p(θ | y) dθ

20

slide-28
SLIDE 28

Learning the kernel

hierarchical Bayesian inference

practical cases can be extremely high-dimensional (→ Bayesian deep learning)

standard approaches:
free energy minimization of a parametric approximation
Markov Chain Monte Carlo

elaborate toolboxes available (→ probabilistic programming)

but few (practically relevant) finite-time guarantees

more about hierarchical Bayesian inference in Tamara Broderick’s talk?

21

[plots: hyperparameter posterior over (σ², λ²); predictive f(x) over x]

slide-29
SLIDE 29

The Optimization View on Hierarchical Inference

Bayesian Optimization

[block diagram: machine/environment loop with data D, variables xt, parameters θ, prediction xt+δt, action a; learning / inference / pattern rec. / system id.]

inference by quadrature, estimation by optimization, prediction by analysis, action by control

optimize architecture:
non-convex (multi-modal!) global optimization
expensive evaluations

more about optimization of architectures in Roman Garnett’s talk

22

slide-30
SLIDE 30

Summary: The Computations of Acting Agents

machine intelligence requires computations

integration for marginalization
optimization for fitting
differential equations for control
linear algebra for all of the above

contemporary AI problems pose very challenging numerical problems
uncertainty from data-subsampling plays a crucial, intricate role
classic numerical methods leave room for improvement

after coffee: Learning machines don’t just pose problems—they also promise some answers.

23

slide-31
SLIDE 31

Is there room at the bottom?

ML computations are dominated by numerical tasks

task               ...amounts to...   ...using black box
marginalize        integration        MCMC, Variational, EP, ...
train/fit          optimization       SGD et al., quasi-Newton, ...
predict/control    ord. diff. eq.     Euler, Runge-Kutta, ...
Gauss/kernel/LSq.  linear algebra     Chol., CG, spectral, low-rank, ...

Scientific computing has produced a very efficient toolchain, but we are (usually) only using generic methods!

methods on loan do not address some of ML’s special needs

overly generic algorithms are inefficient
Big Data-specific challenges not addressed by “classic” methods

ML deserves customized numerical methods. And as it turns out, we already have the right concepts!

24

slide-32
SLIDE 32

Computation is Inference

http://probnum.org Poincaré 1896, Kimeldorf & Wahba 1970, Diaconis 1988, O’Hagan 1992, ...

Numerical methods estimate latent quantities given the result of computations.

integration        estimate ∫_a^b f(x) dx             given {f(xi)}
linear algebra     estimate x s.t. Ax = b             given {As = y}
optimization       estimate x s.t. ∇f(x) = 0          given {∇f(xi)}
analysis           estimate x(t) s.t. x′ = f(x, t)    given {f(xi, ti)}

It is thus possible to build probabilistic numerical methods that use probability measures as in- and outputs, and assign a notion of uncertainty to computation.

25

slide-33
SLIDE 33

Integration

as Gaussian regression

[plot: integrand f(x) on [−3, 3]]

f(x) = exp(−sin(3x)² − x²),   F = ∫_{−3}^{3} f(x) dx = ?

26

slide-34
SLIDE 34

A Wiener process prior p(f, F)...

Bayesian Quadrature O’Hagan, 1985/1991

p(f) = GP(f; 0, k),  k(x, x′) = min(x, x′) + c

⇒ p(∫_a^b f(x) dx) = N(∫_a^b f(x) dx; ∫_a^b m(x) dx, ∫_a^b ∫_a^b k(x, x′) dx dx′)
                   = N(F; 0, −1/6 (b³ − a³) + 1/2 [b³ − 2a²b + a³] + (b − a)² c)

27

[plots: integrand f(x) over x; error |F − F̂| over # evaluations]

slide-35
SLIDE 35

...conditioned on actively collected information ...

computation as the collection of information

xt = arg min_x var_{p(F | x1,...,xt−1)}(F)

maximal reduction of variance yields a regular grid

28

[plots: integrand f(x) with evaluation nodes; error |F − F̂| over # evaluations]

slide-46
SLIDE 46

...yields the trapezoid rule!

Kimeldorf & Wahba 1975, Diaconis 1988, O’Hagan 1985/1991

[plots: piecewise-linear posterior mean over the integrand; error |F − F̂| over # evaluations]

E_y[F] = ∫ E_{f|y}[f(x)] dx = ∑_{i=1}^{N−1} (xi+1 − xi) · ½ (f(xi+1) + f(xi))

Trapezoid rule is the MAP estimate under a Wiener process prior on f
regular grid is the optimal expected-information choice
error estimate is under-confident

more about calibration of uncertainty in the talks of Chris Oates and John Cockayne.

29
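
A small NumPy check of this statement (the kernel offset c and node counts are illustrative): conditioning a shifted Wiener-process prior on function values at a grid that includes the endpoints yields, as posterior mean for F, exactly the trapezoid sum.

```python
import numpy as np

def bq_wiener(f, a, b, n, c=1.0):
    # Bayesian quadrature under a shifted Wiener-process prior, k(x, x') = min(x - a, x' - a) + c
    x = np.linspace(a, b, n)
    y = f(x)
    u = x - a
    K = np.minimum(u[:, None], u[None, :]) + c
    z = u * (b - a) - u ** 2 / 2 + c * (b - a)    # z_i = int_a^b k(x, x_i) dx
    return z @ np.linalg.solve(K, y)              # posterior mean of F

f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x ** 2)
for n in (5, 20, 80):
    x = np.linspace(-3, 3, n)
    y = f(x)
    trap = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    print(n, bq_wiener(f, -3, 3, n), trap)        # the two estimates coincide
```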

slide-47
SLIDE 47

Computation as Inference

Bayes’ theorem yields four levers for new functionality

Estimate z from computations c, under model m.

p(z | c, m) = p(z | m) p(c | z, m) / ∫ p(z | m) p(c | z, m) dz

Prior:
Likelihood:
Posterior:
Evidence:

30

slide-48
SLIDE 48

Classic methods as basic probabilistic inference

maximum a-posteriori estimation in Gaussian models

Quadrature                Gaussian Quadrature              GP Regression              [Ajne & Dalenius 1960; Kimeldorf & Wahba 1975; Diaconis 1988; O’Hagan 1985/1991]
Linear Algebra            Conjugate Gradients              Gaussian Regression        [Hennig 2014]
Nonlinear Optimization    BFGS / Quasi-Newton              Autoregressive Filtering   [Hennig & Kiefel 2013]
Differential Equations    Runge-Kutta; Nordsieck Methods   Gauss-Markov Filters       [Schober, Duvenaud & Hennig 2014; Kersting & Hennig 2016; Schober & Hennig 2016]

31

slide-49
SLIDE 49

Probabilistic ODE Solvers

Schober, Duvenaud & P.H., 2014. Schober & P.H., 2016. Kersting & P.H., 2016, ...

x′(t) = f(x(t), t),  x(t0) = x0

[plot: solution estimate x(t) over t]

There is a class of solvers for initial value problems that
has the same complexity as multi-step methods
has high local approximation order q (like classic solvers)
has calibrated posterior uncertainty (order q + 1/2)

this method → Hans Kersting’s talk
calibration → Oksana Chkrebtii’s talk
convergence → Tim Sullivan’s talk

https://github.com/ProbabilisticNumerics/pfos

32
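
A heavily simplified sketch in the spirit of these solvers, not the pfos implementation: a first-order, EK0-style Gauss-Markov filter with a once-integrated Wiener-process prior; the logistic test ODE, step count and diffusion scale are illustrative.

```python
import numpy as np

def prob_ode_solve(f, x0, t0, t1, n_steps, q=1.0):
    # State [x, x']; prior: integrated Wiener process. At each step, "observe"
    # that x'(t_i) equals f evaluated at the predicted mean (EK0-style update).
    h = (t1 - t0) / n_steps
    A = np.array([[1.0, h], [0.0, 1.0]])
    Q = q ** 2 * np.array([[h ** 3 / 3, h ** 2 / 2], [h ** 2 / 2, h]])
    H = np.array([[0.0, 1.0]])
    m, P = np.array([x0, f(x0, t0)]), np.zeros((2, 2))
    ts, means, stds = [t0], [m[0]], [0.0]
    for i in range(n_steps):
        t = t0 + (i + 1) * h
        m_pred, P_pred = A @ m, A @ P @ A.T + Q                 # predict
        z = f(m_pred[0], t) - (H @ m_pred)[0]                   # residual on the derivative
        S = (H @ P_pred @ H.T)[0, 0]
        K = (P_pred @ H.T / S).ravel() if S > 0 else np.zeros(2)
        m, P = m_pred + K * z, P_pred - np.outer(K, H @ P_pred) # update
        ts.append(t); means.append(m[0]); stds.append(np.sqrt(max(P[0, 0], 0.0)))
    return np.array(ts), np.array(means), np.array(stds)

# logistic ODE x' = x (1 - x): posterior mean tracks the solution, std gives an uncertainty band
ts, mean, std = prob_ode_solve(lambda x, t: x * (1 - x), x0=0.1, t0=0.0, t1=6.0, n_steps=60)
```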

slide-56
SLIDE 56

Probabilistic numerics can be as fast and reliable as classic methods.
Computation can be phrased in ML language!
Meaningful (calibrated) uncertainty can be constructed at minimal computational overhead (dominated by the cost of the point estimate).

So what does this mean for Data Science / ML / AI?

33

slide-57
SLIDE 57

New Functionality, and new Challenges

making use of the probabilistic numerics perspective

p(z | c, m) = p(z | m) p(c | z, m) / ∫ p(z | m) p(c | z, m) dz

Prior: structural knowledge reduces complexity.
Likelihood:
Posterior:
Evidence:

34

slide-58
SLIDE 58

An integration prior for probability measures

WArped Sequential Active Bayesian Integration (WSABI) Gunter, Osborne, Garnett, Hennig, Roberts. NIPS 2014

a prior specifically for integration of probability measures

f > 0 (f is a probability measure)
f ∝ exp(−x²) (f is a product of prior and likelihood terms)
f ∈ C∞ (f is smooth)

Explicit prior knowledge reduces complexity.

cf. information-based complexity, e.g. Novak, 1988; Clancy et al. 2013, arXiv 1303.2412v2
more on this connection in Houman Owhadi’s tutorial?

35

[plots: warped integrand model f(x) over x; error |F − F̂| over # evaluations]

slide-59
SLIDE 59

An integration prior for probability measures

WArped Sequential Active Bayesian Integration (WSABI) Gunter, Osborne, Garnett, Hennig, Roberts. NIPS 2014

[plots: warped integrand model f(x) over x; error |F − F̂| over # evaluations]

adaptive node placement
scales to, in principle, arbitrary dimensions
faster (in wall-clock time) than MCMC

Explicit prior knowledge reduces complexity.

cf. information-based complexity, e.g. Novak, 1988; Clancy et al. 2013, arXiv 1303.2412v2
more on this connection in Houman Owhadi’s tutorial?

35

slide-67
SLIDE 67

Computation as Inference

new numerical functionality for machine learning

Estimate z from computations c, under model m.

p(z | c, m) = p(z | m) p(c | z, m) / ∫ p(z | m) p(c | z, m) dz

Prior: structural knowledge reduces complexity
Likelihood: modeling imprecise computation reduces cost
Posterior:
Evidence:

36

slide-69
SLIDE 69

New numerics for Big Data

Uncertainty on Inputs directly affecting numerical decisions

In the Big Data setting, batching introduces (Gaussian) noise:

L(θ) = 1/N ∑_{i=1}^{N} ℓ(yi; θ) ≈ 1/M ∑_{j=1}^{M} ℓ(yj; θ) =: L̂(θ),   M ≪ N

p(L̂ | L) ≈ N(L̂; L, O((N − M)/M))

[diagram: loss L assembled from data y1 … yN]

Classic methods are unstable to noise. E.g.: step size selection θt+1 = θt − αt ∇L̂(θt)

37

slide-70
SLIDE 70

Probabilistic Line Searches

Step-size selection for stochastic optimization. Mahsereci & Hennig, NIPS 2015

[figures: classic line search (unstable) vs. probabilistic line search (stable), f(t) and pWolfe(t) over step size t; test error over epochs for a two-layer feed-forward perceptron on CIFAR 10. Details, additional results in Mahsereci & Hennig, NIPS 2015.]

https://github.com/ProbabilisticNumerics/probabilistic_line_search

batch-size selection: cabs, Balles & Hennig, arXiv 1612.05086
early stopping: Mahsereci, Balles & Hennig, arXiv 1703.09580
search directions: sodas, Balles & Hennig, arXiv 1705.07774

38

slide-71
SLIDE 71

Computation as Inference

new numerical functionality for machine learning

Estimate z from computations c, under model m.

p(z | c, m) = p(z | m) p(c | z, m) / ∫ p(z | m) p(c | z, m) dz

Prior: structural knowledge reduces complexity
Likelihood: modeling imprecise computation reduces cost
Posterior: tracking uncertainty for robustness
Evidence:

cf. Hennig, Osborne, Girolami, Proc. Royal Soc. A, 2015

39

slide-72
SLIDE 72

Uncertainty Across Composite Computations

interacting information requirements Hennig, Osborne, Girolami, Proc. Royal Society A 2015

[block diagram: machine/environment loop with data D, variables xt, parameters θ, prediction xt+δt, action a; learning / inference / pattern rec. / system id.]

inference by quadrature, estimation by optimization, prediction by analysis, action by control

probabilistic numerical methods taking and producing uncertain inputs and outputs allow management of computational resources.

more on uncertainty propagation in Ilias Bilionis’ talk.

40

slide-73
SLIDE 73

Computation as Inference

new numerical functionality for machine learning

Estimate z from computations c, under model m.

p(z | c, m) = p(z | m) p(c | z, m) / ∫ p(z | m) p(c | z, m) dz

Prior: structural knowledge reduces complexity
Likelihood: modeling imprecise computation reduces cost
Posterior: tracking uncertainty for robustness
Evidence: checking models for safety

cf. Hennig, Osborne, Girolami, Proc. Royal Soc. A, 2015

41

slide-74
SLIDE 74

Probabilistic Certification?

proof of concept: Hennig, Osborne, Girolami. Proc. Royal Society A, 2015

[plots: test function f(x); error F − F̂ and statistic r over # samples]

r = E_{f̃}[ log p(f̃(x)) / p(f(x)) ] = (f(x) − µ(x))⊺ K⁻¹ (f(x) − µ(x)) − N

42
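
A small sketch of such a run-time model check (model, data and scales are illustrative): under a well-specified Gaussian model the statistic r fluctuates around zero, while a mis-scaled model drives it far from zero and thus flags mismatch.

```python
import numpy as np

def mismatch_statistic(f_values, mu, K):
    # r = (f - mu)^T K^{-1} (f - mu) - N; approximately zero on average if f ~ N(mu, K)
    d = f_values - mu
    return d @ np.linalg.solve(K, d) - len(d)

rng = np.random.default_rng(0)
N = 50
K = np.exp(-(np.subtract.outer(np.arange(N), np.arange(N)) / 5.0) ** 2) + 1e-6 * np.eye(N)
mu = np.zeros(N)
chol = np.linalg.cholesky(K)
f_ok = chol @ rng.standard_normal(N)           # drawn from the model
f_bad = 3.0 * chol @ rng.standard_normal(N)    # wrong scale -> model mismatch
print(mismatch_statistic(f_ok, mu, K), mismatch_statistic(f_bad, mu, K))
```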

slide-75
SLIDE 75

Summary

Uncertain computation as and for machine learning

computation is inference: probabilistic numerical methods
probability measures for uncertain inputs and outputs
classic methods as special cases

New concepts not just for Machine Learning:
prior: structural knowledge reduces complexity
likelihood: imprecise computation lowers cost
posterior: uncertainty propagated through computations
evidence: model mismatch detectable at run-time

ML & AI pose new computational challenges
computational methods can be phrased in the concepts of ML
but related results of mathematics are currently “under-explored”
more about all of this in this seminar!

http://probnum.org https://pn.is.tue.mpg.de

43
