Model inference
Course of Machine Learning, Master Degree in Computer Science, University of Rome ``Tor Vergata'', Giorgio Gambosi, a.a. 2018-2019


SLIDE 1

Model inference


Course of Machine Learning Master Degree in Computer Science University of Rome ``Tor Vergata'' Giorgio Gambosi a.a. 2018-2019

1

SLIDE 2

Model inference

Purpose
Inferring a probabilistic model from a collection of observed data X = {x1, . . . , xN}. A probabilistic model is a probability distribution over the data domain.

Dataset
A dataset X is a collection of N observations, independent and identically distributed (iid): they can be seen as realizations of a single random variable.

2

SLIDE 3

Model inference

Problems considered
Inference objectives:

  • Model selection: selecting the probabilistic model M best suited for a given data collection
  • Estimation: estimating the values of the set θ = (θ1, . . . , θD) of parameters of a given model type (probability distribution) which best model the observed data X
  • Prediction: computing the probability p(x|X) of a new observation from the set of already observed data

3

SLIDE 4

Bayesian learning

Context
Model space M: a model m ∈ M is a probability distribution p(x|m) over data. Let p(m) be any prior distribution over models, with

∑_{m∈M} p(m) = 1

The corresponding predictive distribution of the data is

p(x) = ∑_{m∈M} p(x|m)p(m)

4

SLIDE 5

Inference

After observing a dataset X, the updated probabilities are

p(m|X) = p(m)p(X|m) / p(X) ∝ p(m)p(X|m) = p(m) ∏_{i=1}^{n} p(xi|m)

and the predictive distribution is

p(x|X) = ∑_{m∈M} p(x|m)p(m|X)

5

SLIDE 6

Parameters

Parametric models
Models are defined as parametric probability distributions, with parameters θ ranging over a parameter space Θ. A prior parameter distribution p(θ|m) is defined for each model. The prior predictive distribution is then

p(x|m) = ∫_Θ p(x|θ, m)p(θ|m)dθ

Posterior parameter distribution
Given a model m ∈ M, Bayes' formula makes it possible to infer the posterior distribution of the parameters, given the dataset X:

p(θ|X, m) = p(θ|m)p(X|θ, m) / p(X|m) ∝ p(θ|m)p(X|θ, m)

The posterior predictive distribution, given the model, is

p(x|X, m) = ∫_Θ p(x|θ, m)p(θ|X, m)dθ

6

SLIDE 7

Bayesian inference

According to the Bayesian approach to inference, parameters are considered as random variables, whose distributions have to be inferred from observed data. The approach relies on Bayes' classic result:

Theorem (Bayes)
Let X, Y be a pair of (sets of) random variables. Then,

p(Y|X) = p(X|Y)p(Y) / p(X) = p(X|Y)p(Y) / ∫_Z p(X, Z)dZ

where

  • p(Y) is the prior probability of Y (with respect to the observation of X)
  • p(Y|X) is the posterior probability of Y
  • p(X|Y) is the likelihood of X w.r.t. Y
  • p(X) is the evidence of X

7

SLIDE 8

Point estimate of parameters

Motivation
Given a model m, the Bayesian approach aims to derive the posterior distribution of the set of parameters θ. This requires computing

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_Θ p(X|θ)p(θ)dθ

and

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ

This usually cannot be done efficiently.

Idea
Only an estimate of the ``best'' value θ̂ in Θ (according to some measure) is computed. The posterior predictive distribution can then be approximated as follows:

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ ≈ ∫_Θ p(x|θ̂)p(θ|X)dθ = p(x|θ̂) ∫_Θ p(θ|X)dθ = p(x|θ̂)

8

SLIDE 9

Maximum likelihood estimate

Approach
Frequentist point of view: parameters are deterministic quantities, whose values are unknown and must be estimated. Determine the parameter value that maximizes the likelihood

L(θ|X) = p(X|θ) = ∏_{i=1}^{N} p(xi|θ)

The log-likelihood

ℓ(θ|X) = ln L(θ|X) = ∑_{i=1}^{N} ln p(xi|θ)

is usually preferable. The maximum occurs at the same point:

argmax_θ ℓ(θ|X) = argmax_θ L(θ|X)

Estimate

θ̂_ML = argmax_θ L(θ|X) = argmax_θ ∑_{i=1}^{N} ln p(xi|θ)

9

SLIDE 10

Maximum likelihood estimate

Solution
Solve the system

∂ℓ(θ|X)/∂θi = 0,  i = 1, . . . , D

or, more concisely, ∇_θ ℓ(θ|X) = 0

Prediction
Probability of a new observation x:

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ ≈ ∫_Θ p(x|θ̂_ML)p(θ|X)dθ = p(x|θ̂_ML) ∫_Θ p(θ|X)dθ = p(x|θ̂_ML)

10

SLIDE 11

Maximum likelihood estimate

Example
Collection X of N binary events, modeled through a Bernoulli distribution with unknown parameter φ:

p(x|φ) = φ^x (1 − φ)^{1−x}

Likelihood

L(φ|X) = ∏_{i=1}^{N} φ^{xi} (1 − φ)^{1−xi}

Log-likelihood

ℓ(φ|X) = ∑_{i=1}^{N} (xi ln φ + (1 − xi) ln(1 − φ)) = N1 ln φ + N0 ln(1 − φ)

where N0 (N1) is the number of events x ∈ X equal to 0 (1). Setting

∂ℓ(φ|X)/∂φ = N1/φ − N0/(1 − φ) = 0  ⟹  φ̂_ML = N1/(N0 + N1) = N1/N

11
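The closed form φ̂_ML = N1/N above can be checked numerically; a minimal sketch, where the function name and toy dataset are illustrative, not from the slides:

```python
# Maximum likelihood estimate for a Bernoulli parameter: phi_ML = N1 / N.
def bernoulli_mle(xs):
    """Return the ML estimate of phi from a list of binary observations."""
    n1 = sum(xs)           # N1: number of observations equal to 1
    return n1 / len(xs)    # phi_ML = N1 / (N0 + N1) = N1 / N

X = [1, 0, 1, 1]           # toy dataset: N1 = 3, N0 = 1
print(bernoulli_mle(X))    # 0.75
```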

SLIDE 12

ML and overfitting

Overfitting
Maximizing the likelihood of the observed dataset tends to result in an estimate too sensitive to the dataset values, and hence in overfitting. The obtained estimates are suitable to model the observed data, but may be too specialized to model different datasets.

Penalty functions
An additional function P(θ) can be introduced with the aim of limiting overfitting and the overall complexity of the model. This results in the following function to maximize:

C(θ|X) = ℓ(θ|X) − P(θ)

As a common case, P(θ) = (γ/2)∥θ∥², with γ a tuning parameter.

12

SLIDE 13

Maximum a posteriori estimate

Idea
Inference through maximum a posteriori (MAP) estimation is similar to ML, but θ is now considered as a random variable, whose distribution has to be derived from observations, also taking into account previous knowledge (the prior distribution). The parameter value maximizing

p(θ|X) = p(X|θ)p(θ) / p(X)

is computed.

Estimate

θ̂_MAP = argmax_θ p(θ|X) = argmax_θ p(X|θ)p(θ) = argmax_θ L(θ|X)p(θ)
      = argmax_θ (ℓ(θ|X) + ln p(θ)) = argmax_θ (∑_{i=1}^{N} ln p(xi|θ) + ln p(θ))

13

SLIDE 14

MAP and gaussian prior

Hypothesis
Assume θ is distributed around the origin as a multivariate Gaussian with uniform variance and null covariance, that is,

p(θ) = N(θ|0, σ²I) = 1/((2π)^{d/2} σ^d) exp(−∥θ∥²/(2σ²)) ∝ exp(−∥θ∥²/(2σ²))

Inference
From the hypothesis,

θ̂_MAP = argmax_θ p(θ|X) = argmax_θ (ℓ(θ|X) + ln p(θ))
      = argmax_θ (ℓ(θ|X) + ln exp(−∥θ∥²/(2σ²))) = argmax_θ (ℓ(θ|X) − ∥θ∥²/(2σ²))

which coincides with the penalized objective introduced before, with γ = 1/σ²

14

SLIDE 15

MAP estimate

Example
Collection X of N binary events, modeled as a Bernoulli distribution with unknown parameter φ. Initial knowledge of φ is modeled as a Beta distribution:

p(φ|α, β) = Beta(φ|α, β) = Γ(α + β)/(Γ(α)Γ(β)) φ^{α−1}(1 − φ)^{β−1}

Log-likelihood

ℓ(φ|X) = ∑_{i=1}^{N} (xi ln φ + (1 − xi) ln(1 − φ)) = N1 ln φ + N0 ln(1 − φ)

Setting

∂/∂φ (ℓ(φ|X) + ln Beta(φ|α, β)) = N1/φ − N0/(1 − φ) + (α − 1)/φ − (β − 1)/(1 − φ) = 0
⟹  φ̂_MAP = (N1 + α − 1)/(N0 + N1 + α + β − 2) = (N1 + α − 1)/(N + α + β − 2)

15
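The MAP formula above can also be checked numerically; a minimal sketch with illustrative names and toy data (note that for α = β = 1, i.e. a uniform prior, the estimate reduces to the ML one):

```python
# MAP estimate for a Bernoulli parameter under a Beta(alpha, beta) prior:
# phi_MAP = (N1 + alpha - 1) / (N + alpha + beta - 2)
def bernoulli_map(xs, alpha, beta):
    n1 = sum(xs)                                   # N1
    n = len(xs)                                    # N = N0 + N1
    return (n1 + alpha - 1) / (n + alpha + beta - 2)

# toy data: N1 = 3, N0 = 1; Beta(2, 2) prior pulls the estimate toward 1/2
print(bernoulli_map([1, 1, 1, 0], 2, 2))
```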

SLIDE 16

Note

Gamma function
The function Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt is an extension of the factorial to the real numbers: hence, for any positive integer x, Γ(x) = (x − 1)!

16

SLIDE 17

Applying bayesian inference

Mode and mean
Once the posterior distribution

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_Θ p(X|θ)p(θ)dθ

is available, the MAP estimate computes the most probable value (mode) θ_MAP of the distribution. This may lead to inaccurate estimates, as in the figure originally shown here.

[Figure omitted: a density p(x) for which the mode is not representative of the distribution]

17

SLIDE 18

Applying bayesian inference

Mode and mean
A better estimate can be obtained by applying a fully Bayesian approach and referring to the whole posterior distribution, for example by deriving the expectation of θ w.r.t. p(θ|X):

θ* = E_{p(θ|X)}[θ] = ∫_Θ θ p(θ|X)dθ

18

SLIDE 19

Bayesian estimate

Example
Collection X of N binary events, modeled as a Bernoulli distribution with unknown parameter φ. Initial knowledge of φ is modeled as a Beta distribution:

p(φ|α, β) = Beta(φ|α, β) = Γ(α + β)/(Γ(α)Γ(β)) φ^{α−1}(1 − φ)^{β−1}

Posterior distribution

p(φ|X, α, β) = (∏_{i=1}^{N} φ^{xi}(1 − φ)^{1−xi}) p(φ|α, β) / p(X)
             = φ^{N1}(1 − φ)^{N0} φ^{α−1}(1 − φ)^{β−1} Γ(α + β) / (Γ(α)Γ(β) p(X))
             = φ^{N1+α−1}(1 − φ)^{N0+β−1} / Z

Since ∫_0^1 p(φ|X, α, β)dφ = 1, Z must be equal to the normalizing coefficient of the distribution Beta(φ|α + N1, β + N0). Hence,

p(φ|X, α, β) = Beta(φ|α + N1, β + N0)

19
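The conjugate update above amounts to adding counts to the Beta hyperparameters; a minimal sketch (illustrative names, toy data), also contrasting the posterior mean with the mode:

```python
# Conjugate Beta-Bernoulli update: Beta(a, b) prior + data -> Beta(a + N1, b + N0).
def beta_posterior(xs, alpha, beta):
    n1 = sum(xs)
    n0 = len(xs) - n1
    return alpha + n1, beta + n0

def beta_mean(a, b):
    """Posterior mean (fully Bayesian point estimate)."""
    return a / (a + b)

def beta_mode(a, b):
    """Posterior mode (MAP estimate); valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

a, b = beta_posterior([1, 1, 1, 0], 1, 1)   # uniform prior, N1 = 3, N0 = 1
print(a, b, beta_mean(a, b), beta_mode(a, b))
```

With a uniform prior the mode coincides with the ML estimate N1/N, while the mean is smoothed toward 1/2.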

SLIDE 20

Model comparison

Comparing different models
Let M1, . . . , Mm be a set of model types, each with its own set of parameters. Given a dataset X, we wish to select the model type which best represents X. In a Bayesian framework, we may consider the posterior probability of each model type:

p(Mi|X) = p(X|Mi)p(Mi) / p(X) ∝ p(X|Mi)p(Mi)

If we assume that no specific knowledge on model types is initially available, then the prior distribution is uniform; as a consequence, p(Mi|X) ∝ p(X|Mi).

Evidence
The distribution p(X|Mi) is the evidence of the dataset w.r.t. a model type. It can be obtained by marginalizing over the model parameters:

p(X|Mi) = ∫_Θ p(X|θ, Mi)p(θ|Mi)dθ

20

SLIDE 21

Model selection in practice

Validation

Test set
The dataset is split into a training set (used for learning parameters) and a test set (used for measuring effectiveness). Good for large datasets; otherwise, the resulting training and test sets are small (few data for fitting and validation).

Cross validation
The dataset is partitioned into K equal-sized sets. Iteratively, in K phases, use one set as test set and the union of the other K − 1 as training set (K-fold cross validation); average the validation measures. As a particular case, iteratively leave one element out and use all other points as training set (leave-one-out cross validation). Time consuming for large datasets and for models which are costly to fit.

21
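The K-fold scheme can be sketched as pure index bookkeeping; the helper below is illustrative (contiguous folds, no shuffling), with K = n giving leave-one-out:

```python
# K-fold cross validation: partition indices 0..n-1 into K folds; each fold
# serves once as test set, the remaining K-1 folds as training set.
def kfold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for K-fold cross validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in kfold_splits(6, 3):
    print(train, test)
```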

SLIDE 22

Model selection in practice

Information measures
Faster methods to compare model effectiveness, based on computing measures which take into account both data fit and model complexity.

Akaike Information Criterion (AIC)
Let θ be the set of parameters of the model and let θ̂_ML be their maximum likelihood estimate on the dataset X. Then,

AIC = 2|θ| − 2 ln p(X|θ̂_ML) = 2|θ| − 2 max_θ ln p(X|θ)

Lower values correspond to models to be preferred.

Bayesian Information Criterion (BIC)
A variant of the above, defined as

BIC = |θ| ln |X| − 2 ln p(X|θ̂_ML) = |θ| ln |X| − 2 max_θ ln p(X|θ)

22
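Both criteria are simple formulas once the maximized log-likelihood is available; a minimal sketch using the Bernoulli example from the earlier slides (toy counts are illustrative):

```python
import math

def aic(num_params, max_loglik):
    """AIC = 2|theta| - 2 ln p(X|theta_ML); lower is better."""
    return 2 * num_params - 2 * max_loglik

def bic(num_params, n_obs, max_loglik):
    """BIC = |theta| ln|X| - 2 ln p(X|theta_ML); lower is better."""
    return num_params * math.log(n_obs) - 2 * max_loglik

# Bernoulli model: one parameter, N1 = 3 ones and N0 = 1 zero.
n1, n0 = 3, 1
phi = n1 / (n1 + n0)
ll = n1 * math.log(phi) + n0 * math.log(1 - phi)   # maximized log-likelihood
print(aic(1, ll), bic(1, n1 + n0, ll))
```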

SLIDE 23

Model averaging

Marginalization to reduce overfitting

  • To avoid overfitting, we may marginalize over model parameters: this corresponds to averaging among all possible models
  • Bayesian approach: use of probabilities to represent uncertainty in the choice of the model
  • Set of L models Mi, i = 1, . . . , L, each a probability distribution p(X|Mi) over the dataset X
  • Prior uncertainty about the model represented through the distribution p(Mi)
  • Observing the training set modifies the uncertainty to the posterior p(Mi|X) ∝ p(X|Mi)p(Mi)
  • p(X|Mi) is called the marginal likelihood or model evidence
  • p(X|Mi)/p(X|Mj) is the Bayes factor for models Mi, Mj

23

SLIDE 24

Model evidence

As an average

  • The evidence of a model can be expressed as an average over all possible parameter values:

    p(X|Mi) = ∫ p(X|w, Mi)p(w|Mi)dw

  • this is the normalization term in the definition of the posterior distribution of parameters:

    p(w|X, Mi) = p(X|w, Mi)p(w|Mi) / p(X|Mi)

24

SLIDE 25

Averaged model prediction

Prediction

  • Given the posterior over models, the predictive distribution can be obtained as

    p(x|X) = ∑_{i=1}^{L} p(x|Mi, X)p(Mi|X)

  • this corresponds to a weighted average of the predictions of the single models, with weights given by their posterior probabilities

25
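For a finite set of models the posterior weights and the averaged prediction can be computed directly; a sketch with three hypothetical Bernoulli models (each fixing p(x=1)) under a uniform prior:

```python
# Bayesian model averaging over a discrete set of Bernoulli models.
def model_posterior(models, prior, xs):
    """p(M_i|X) ∝ p(X|M_i) p(M_i), for models given as values of p(x=1|M_i)."""
    def lik(phi):
        p = 1.0
        for x in xs:
            p *= phi if x == 1 else (1 - phi)
        return p
    unnorm = [lik(phi) * pm for phi, pm in zip(models, prior)]
    z = sum(unnorm)                       # evidence p(X)
    return [u / z for u in unnorm]

def averaged_prediction(models, posterior):
    """p(x=1|X) = sum_i p(x=1|M_i) p(M_i|X): weighted average of predictions."""
    return sum(phi * w for phi, w in zip(models, posterior))

models = [0.2, 0.5, 0.8]                  # three candidate models
prior = [1/3, 1/3, 1/3]                   # uniform prior over models
post = model_posterior(models, prior, [1, 1, 1, 0])
print(averaged_prediction(models, post))  # ≈ 0.668
```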

SLIDE 26

Example: learning in the Dirichlet-multinomial model

26

SLIDE 27

Language modeling

A language model is a (categorical) probability distribution over a vocabulary of terms (possibly, all words which occur in a large collection of documents).

Use
A language model can be applied to predict the next term occurring in a text. The probability of occurrence of a term is related to its information content and is at the basis of a number of information retrieval techniques.

Hypothesis
It is assumed that the probability of occurrence of a term is independent of the preceding terms in the text (bag-of-words model).

Generative model
Given a language model, it is possible to sample from the distribution to generate random documents statistically equivalent to the documents in the collection used to derive the model.

27

SLIDE 28

Language model

  • Let T = {t1, . . . , tn} be the set of terms occurring in a given collection C of documents, after stop word (common, non-informative terms) removal and stemming (reduction of words to their basic form).
  • For each i = 1, . . . , n let mi be the multiplicity (number of occurrences) of term ti in C.
  • A language model can be derived as a categorical distribution associated to a vector φ̂ = (φ̂1, . . . , φ̂n)^T of probabilities, that is,

    0 ≤ φ̂i ≤ 1,  i = 1, . . . , n,    ∑_{i=1}^{n} φ̂i = 1

    where φ̂j = p(tj|C)

28

SLIDE 29

Learning a language model by ML

Applying maximum likelihood to derive the term probabilities in the language model results in setting

φ̂j = p(tj|C) = mj / ∑_{k=1}^{n} mk = mj / N

where N = ∑_{i=1}^{n} mi is the overall number of occurrences in C after stopword removal.

Smoothing
According to this estimate, a term t which never occurred in C has zero probability of being observed (black swan paradox). This is due to overfitting the model to the observed data, typical of ML estimation. Solution: assign a small, non-zero probability to events (terms) not observed so far. This is called smoothing.

29
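The ML language model is just the relative frequency of each term; a minimal sketch (toy tokens are illustrative) that also exhibits the zero-probability problem for unseen terms:

```python
from collections import Counter

# ML language model: phi_j = m_j / N (relative frequency of term t_j in C).
def ml_language_model(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return {t: m / n for t, m in counts.items()}

lm = ml_language_model(["the", "cat", "sat", "the", "mat"])
print(lm["the"])             # 0.4
print(lm.get("dog", 0.0))    # 0.0: an unseen term gets zero probability (black swan)
```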

SLIDE 30

Bayesian learning of a language model

We may apply the Dirichlet-multinomial model:

  • this implies defining a Dirichlet prior Dir(φ|α), with α = (α1, α2, . . . , αn), that is,

    p(φ1, . . . , φn|α) = 1/∆(α1, . . . , αn) ∏_{i=1}^{n} φi^{αi−1}

  • the posterior distribution of φ after C has been observed is then Dir(φ|α′), where α′ = (α1 + m1, α2 + m2, . . . , αn + mn), that is,

    p(φ1, . . . , φn|α′) = 1/∆(α1 + m1, . . . , αn + mn) ∏_{i=1}^{n} φi^{αi+mi−1}

30

SLIDE 31

Bayesian learning of a language model

The language model φ̂ corresponds to the posterior predictive distribution

φ̂j = p(tj|C, α) = ∫ p(tj|φ)p(φ|C, α)dφ = ∫ φj Dir(φ|α′)dφ = E[φj]

where the expectation E[φj] is taken w.r.t. the distribution Dir(φ|α′). Then,

φ̂j = α′j / ∑_{k=1}^{n} α′k = (αj + mj) / ∑_{k=1}^{n} (αk + mk) = (αj + mj)/(α0 + N)

where α0 = ∑_{k=1}^{n} αk. The αj term makes it impossible to obtain zero probabilities (Dirichlet smoothing).
Non-informative prior: αi = α for all i, which results in

p(tj|C, α) = (mj + α)/(αV + N)

where V is the vocabulary size.

31
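The smoothed estimate p(tj|C, α) = (mj + α)/(αV + N) can be sketched as follows; the toy vocabulary and the choice α = 1 (Laplace smoothing) are illustrative:

```python
from collections import Counter

# Dirichlet-smoothed language model under a symmetric Dirichlet(alpha) prior:
# p(t|C, alpha) = (m_t + alpha) / (alpha * V + N)
def dirichlet_smoothed_lm(tokens, vocabulary, alpha=1.0):
    counts = Counter(tokens)             # missing terms count as 0
    n = len(tokens)                      # N: total occurrences
    v = len(vocabulary)                  # V: vocabulary size
    return {t: (counts[t] + alpha) / (alpha * v + n) for t in vocabulary}

vocab = ["the", "cat", "sat", "mat", "dog"]
lm = dirichlet_smoothed_lm(["the", "cat", "sat", "the", "mat"], vocab, alpha=1.0)
print(lm["dog"])   # 0.1: the unseen term now has non-zero probability
```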

SLIDE 32

Naive bayes classifiers

A language model can be applied to derive classifiers of documents into two or more classes.

  • Given two classes C1, C2, assume that, for any document d, the probabilities p(C1|d) and p(C2|d) are known: then d can be assigned to the class with higher probability.
  • How can p(Ck|d) be derived for any document, given a collection C1 of documents known to belong to C1 and a similar collection C2 for C2? Apply Bayes' rule:

    p(Ck|d) ∝ p(d|Ck)p(Ck)

    The evidence p(d) is the same for both classes, and can be ignored.
  • We are still left with the problem of computing p(Ck) and p(d|Ck) from C1 and C2.

32

SLIDE 33

Naive bayes classifiers

Computing p(Ck)
The prior probabilities p(Ck) (k = 1, 2) can be easily estimated from C1, C2: for example, by applying ML we obtain

p(Ck) = |Ck| / (|C1| + |C2|)

Computing p(d|Ck)
As for the likelihoods p(d|Ck) (k = 1, 2), we observe that d can be seen, according to the bag-of-words assumption, as a multiset of nd terms d = {t1, t2, . . . , tnd}. By applying the product rule,

p(d|Ck) = p(t1, . . . , tnd|Ck) = p(t1|Ck)p(t2|t1, Ck) · · · p(tnd|t1, . . . , tnd−1, Ck)
33

SLIDE 34

Naive bayes classifiers

The naive Bayes assumption
Computing p(d|Ck) is much easier if we assume that terms are pairwise conditionally independent given the class Ck, that is, for i, j = 1, . . . , nd and k = 1, 2,

p(ti, tj|Ck) = p(ti|Ck)p(tj|Ck)

As a consequence,

p(d|Ck) = ∏_{j=1}^{nd} p(tj|Ck)

Language models and NB classifiers
The probabilities p(tj|Ck) are available for all terms if language models have been derived for C1 and C2, respectively from the documents in C1 and C2.

34
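Putting the pieces together (ML class priors, Dirichlet-smoothed per-class language models, naive Bayes assumption) gives a complete classifier; a minimal sketch on hypothetical toy collections, using log probabilities to avoid underflow:

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocabulary, alpha=1.0):
    """Fit class priors by ML and one smoothed language model per class."""
    total = sum(len(docs) for docs in docs_by_class.values())
    priors = {c: len(docs) / total for c, docs in docs_by_class.items()}
    lms = {}
    for c, docs in docs_by_class.items():
        tokens = [t for doc in docs for t in doc]
        counts = Counter(tokens)
        denom = alpha * len(vocabulary) + len(tokens)
        lms[c] = {t: (counts[t] + alpha) / denom for t in vocabulary}
    return priors, lms

def classify(doc, priors, lms):
    """Pick argmax_k of log p(C_k) + sum_j log p(t_j|C_k) (naive Bayes)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(lms[c][t]) for t in doc if t in lms[c])
    return max(priors, key=score)

# Hypothetical toy collections, for illustration only.
docs_by_class = {
    "spam": [["win", "money"], ["free", "money"]],
    "ham":  [["meeting", "today"], ["lunch", "today"]],
}
vocab = sorted({t for docs in docs_by_class.values() for d in docs for t in d})
priors, lms = train_nb(docs_by_class, vocab)
print(classify(["free", "money"], priors, lms))   # spam
```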

SLIDE 35

Feature selection by mutual information

Feature selection
The set of probabilities in a language model can be exploited to identify the most relevant terms for classification, that is, the terms whose presence or absence in a document best characterizes the class of the document.

Mutual information
To measure relevance, we can apply the set of mutual informations {I1, . . . , In}, where

Ij = ∑_{k=1,2} p(tj, Ck) log [p(tj, Ck) / (p(tj)p(Ck))]
   = ∑_{k=1,2} p(Ck|tj)p(tj) log [p(Ck|tj) / p(Ck)]
   = p(tj) KL(p(Ck|tj) || p(Ck))

Here, KL is a measure of the amount of information on the class distribution provided by the presence of tj; this amount is weighted by the probability of occurrence of tj.

35

SLIDE 36

Feature selection by mutual information

Mutual information
Since p(tj, Ck) = p(Ck|tj)p(tj) = p(tj|Ck)p(Ck), Ij can be estimated as

Ij = p(tj|C1)p(C1) log [p(tj|C1)/p(tj)] + p(tj|C2)p(C2) log [p(tj|C2)/p(tj)]
   = φj1 π1 log [φj1/(φj1π1 + φj2π2)] + φj2 π2 log [φj2/(φj1π1 + φj2π2)]

where φjk is the estimated probability of tj in documents of class Ck and πk is the estimated probability of a document of class Ck in the collection. The most significant terms can then be selected as those with highest mutual information Ij.

36
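The estimated Ij can be computed directly from the smoothed probabilities φjk and the priors πk; a minimal sketch with illustrative names (note that a term equally probable in both classes gets Ij = 0, i.e. it carries no class information):

```python
import math

def term_mutual_information(phi1, phi2, pi1, pi2):
    """I_j for one term: phi_k = p(t_j|C_k) (smoothed), pi_k = p(C_k).
    Follows I_j = sum_k phi_jk * pi_k * log(phi_jk / p(t_j))."""
    p_t = phi1 * pi1 + phi2 * pi2   # p(t_j) by marginalization over classes
    return (phi1 * pi1 * math.log(phi1 / p_t)
            + phi2 * pi2 * math.log(phi2 / p_t))

# a term more frequent in class 1 than in class 2, balanced classes
print(term_mutual_information(0.3, 0.1, 0.5, 0.5))
```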