Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions


SLIDE 1

Machine Learning Lecture 2 - Bayesian Learning: Binomial and Dirichlet Distributions

Devdatt Dubhashi <dubhashi@chalmers.se>

Department of Computer Science and Engineering, Chalmers University

January 21, 2016

SLIDE 2

Coin Tossing

◮ Estimate the probability that a coin shows heads, based on observed coin tosses.

◮ Simple but fundamental!

◮ Historically important: originally used by Bayes (1763) and generalized by Pierre–Simon de Laplace (1774), creating Bayes' Rule.

SLIDE 3

Likelihood

Suppose Xi ∼ Ber(θ), i.e.

P(Xi = 1) = θ ("heads"), P(Xi = 0) = 1 − θ ("tails"),

and θ ∈ [0, 1] is the parameter to be estimated. Given a series of N observed coin tosses, the probability that we observe k heads is given by the Binomial distribution:

Bin(k | N, θ) = (N choose k) θ^k (1 − θ)^(N−k).
SLIDE 4

Likelihood

Thus, the likelihood has the form

P(D | θ) ∝ θ^N1 (1 − θ)^N0,

where N1 and N0 are the numbers of heads and tails seen, respectively. These are called sufficient statistics, since they are all we need to know about the data to estimate θ. Formally, s(D) is a set of sufficient statistics for D if P(θ | D) = P(θ | s(D)).

SLIDE 5

Binomial Distribution

[Figure: Binomial distributions Bin(k | N = 10, θ) for θ = 0.250 (left) and θ = 0.900 (right).]

SLIDE 6

Bayes Rule for Posterior

P(θ | D) = P(D | θ) P(θ) / ∫₀¹ P(D | θ) P(θ) dθ

A bit daunting to compute the integral in the denominator! We can avoid it via the clever trick of choosing a suitable prior!

SLIDE 7

Prior

Need a prior with support on [0, 1] and, to make the math easy, of the same form as the likelihood.

Beta distribution

Beta(θ | a, b) = (1 / B(a, b)) θ^(a−1) (1 − θ)^(b−1),

with hyperparameters a, b, and where

B(a, b) = Γ(a) Γ(b) / Γ(a + b)

is a normalizing factor.

Mean: a / (a + b)

Mode: (a − 1) / (a + b − 2)

Prior Knowledge: If we believe that the mean is 0.7 and the standard deviation is 0.2, we can set a = 2.975 and b = 1.275 (exercise!).

Uninformative Prior: Uniform prior a = 1 = b.
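A minimal sketch of that exercise in Python, matching the Beta mean and variance to the desired moments (the helper name beta_from_moments is ours, not from the lecture):

```python
from scipy import stats

def beta_from_moments(m, s):
    """Beta(a, b) hyperparameters with mean m and standard deviation s.

    From mean = a/(a+b) and var = m(1-m)/(a+b+1):
    set nu := a + b = m(1-m)/s^2 - 1, then a = m*nu, b = (1-m)*nu.
    """
    nu = m * (1 - m) / s**2 - 1
    return m * nu, (1 - m) * nu

a, b = beta_from_moments(0.7, 0.2)
print(a, b)                       # -> 2.975 1.275
print(stats.beta(a, b).mean())    # ~0.7
print(stats.beta(a, b).std())     # ~0.2
```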

SLIDE 8

Beta Distribution

[Figure: Beta densities on [0, 1] for (a, b) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0) and (8.0, 4.0).]

SLIDE 9

Posterior, Conjugate Prior

Multiplying prior and likelihood gives the posterior:

P(θ | D) ∝ Bin(N1 | N0 + N1, θ) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b).

The posterior has the same distribution as the prior (with different parameters): the Beta distribution is said to be a conjugate prior for the Binomial distribution. The posterior is obtained by simply adding the prior parameters to the empirical counts; hence the hyperparameters are often called pseudo–counts.
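A sketch of this update with scipy, using the numbers from the left panel of the figure on the next slide (prior Be(2, 2), N1 = 3 heads, N0 = 17 tails):

```python
from scipy import stats
import numpy as np

a, b = 2.0, 2.0                # prior pseudo-counts
N1, N0 = 3, 17                 # observed heads and tails

posterior = stats.beta(a + N1, b + N0)   # conjugacy: still a Beta

theta = np.linspace(0.05, 0.95, 5)
print(posterior.pdf(theta))    # posterior density at a few points
print(posterior.mean())        # (a + N1) / (a + b + N1 + N0)
```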

SLIDE 10

Updating Beta Prior with Binomial Likelihood

[Figure: Updating a Beta prior with a Binomial likelihood. Left: prior Be(2.0, 2.0), likelihood Be(4.0, 18.0), posterior Be(5.0, 19.0). Right: prior Be(5.0, 2.0), likelihood Be(12.0, 14.0), posterior Be(16.0, 15.0).]

SLIDE 11

Sequential update versus Batch

Suppose we have two data sets D1 and D2 with sufficient statistics N1^(1), N0^(1) and N1^(2), N0^(2) respectively. Let N1 := N1^(1) + N1^(2) and N0 := N0^(1) + N0^(2) be the combined sufficient statistics.

In batch mode,

P(θ | D1, D2) ∝ Bin(N1 | N1 + N0, θ) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b).

In sequential mode,

P(θ | D1, D2) ∝ P(D2 | θ) P(θ | D1)
             ∝ Bin(N1^(2) | N1^(2) + N0^(2), θ) Beta(θ | N1^(1) + a, N0^(1) + b)
             ∝ Beta(θ | N1^(1) + N1^(2) + a, N0^(1) + N0^(2) + b)
             = Beta(θ | N1 + a, N0 + b).

Very suitable for online learning!
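A small numerical check of this equivalence (the split of the data into two batches is made up):

```python
from scipy import stats

a, b = 2.0, 2.0
N1_1, N0_1 = 3, 7              # sufficient statistics of D1
N1_2, N0_2 = 5, 2              # sufficient statistics of D2

# batch: update once with the combined counts
batch = stats.beta(a + N1_1 + N1_2, b + N0_1 + N0_2)

# sequential: the posterior after D1 becomes the prior for D2
seq = stats.beta((a + N1_1) + N1_2, (b + N0_1) + N0_2)

print(batch.args == seq.args)  # True: identical posteriors
```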

SLIDE 12

Posterior Mean and Mode

◮ The MAP estimate is given by

θ̂_MAP = (a + N1 − 1) / (a + b + N − 2).

◮ With uniform prior a = 1 = b, this becomes

θ̂_MLE = N1 / N,

which is just the MLE.

◮ The posterior mean is

θ̄ = (a + N1) / (a + b + N),

which is a convex combination of the prior mean and the MLE:

θ̄ = λ · a / (a + b) + (1 − λ) · θ̂_MLE, with λ := (a + b) / (a + b + N).

Note that as N → ∞, θ̄ → θ̂_MLE.
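These three estimates in code (prior and counts invented), including a check of the convex-combination identity:

```python
a, b = 2.0, 2.0
N1, N0 = 3, 17
N = N1 + N0

theta_map  = (a + N1 - 1) / (a + b + N - 2)   # posterior mode
theta_mle  = N1 / N                           # maximum likelihood
theta_mean = (a + N1) / (a + b + N)           # posterior mean

lam = (a + b) / (a + b + N)
# posterior mean = convex combination of prior mean and MLE
assert abs(theta_mean - (lam * a / (a + b) + (1 - lam) * theta_mle)) < 1e-12
print(theta_map, theta_mle, theta_mean)
```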

SLIDE 13

Posterior Predictive Distribution

The probability of heads in a single new coin toss is:

P(x̃ = 1 | D) = ∫₀¹ P(x = 1 | θ) P(θ | D) dθ
             = ∫₀¹ θ Beta(θ | N1 + a, N0 + b) dθ
             = E_Beta[θ]
             = (N1 + a) / (N1 + N0 + a + b)

SLIDE 14

Predicting Multiple Future Trials

The probability of predicting x heads in M future trials:

P(x | D) = ∫₀¹ Bin(x | M, θ) P(θ | D) dθ
         = ∫₀¹ Bin(x | M, θ) Beta(θ | N1 + a, N0 + b) dθ
         = (M choose x) (1 / B(N1 + a, N0 + b)) ∫₀¹ θ^(x+N1+a−1) (1 − θ)^(M−x+N0+b−1) dθ
         = (M choose x) B(x + N1 + a, (M − x) + N0 + b) / B(N1 + a, N0 + b),

the compound Beta–Binomial distribution, with mean and variance:

E[x] = M (N1 + a) / (N + a + b),

var[x] = M (N1 + a)(N0 + b) / (N1 + a + N0 + b)² · (N + a + b + M) / (N + a + b + 1).
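scipy ships this compound distribution as stats.betabinom; a sketch checking the mean formula above (the numbers are invented):

```python
from scipy import stats

a, b = 2.0, 2.0
N1, N0 = 3, 17
M = 10                                       # future trials

pred = stats.betabinom(M, N1 + a, N0 + b)    # Beta-Binomial(M, N1+a, N0+b)
print(pred.pmf(range(M + 1)))                # P(x | D) for x = 0..M
print(pred.mean())                           # M (N1 + a) / (N + a + b)
```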

SLIDE 15

Posterior Predictive and Plugin

[Figure: Posterior predictive (left) versus plugin predictive (right) distributions over the number of heads.]

SLIDE 16

Tossing a Dice

◮ From coins and two outcomes to dice and many outcomes.

◮ Given observations from a dice with K faces, predict the next roll.

◮ Suppose we observe N dice rolls D = {x1, x2, · · · , xN}, where each xi ∈ {1, · · · , K}.

SLIDE 17

Likelihood

◮ Suppose we observe N dice rolls D = {x1, x2, · · · , xN}, where each xi ∈ {1, · · · , K}. Then

P(D | θ) = ∏_{k=1}^{K} θk^Nk,

where θk is the (unknown) probability of showing face k and Nk is the observed number of outcomes showing face k.

SLIDE 18

Multinomial Distribution

The probability of observing face k appear xk times, for each k, in n rolls of a dice with face probabilities θ := (θk, k ∈ {1, · · · , K}) is the

Multinomial Distribution

Mu(x | n, θ) = (n choose x1, x2, · · · , xK) ∏_{k=1}^{K} θk^xk.
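scipy exposes this pmf directly; a small sketch (the face counts are invented):

```python
from scipy import stats

n = 6
theta = [1 / 6] * 6            # a fair six-faced dice
x = [1, 0, 2, 1, 1, 1]         # face counts, summing to n

print(stats.multinomial(n, theta).pmf(x))
```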

SLIDE 19

Priors

Since the parameter vector θ lives in the K–dimensional simplex

SK := {(x1, · · · , xK) | x1 + · · · + xK = 1},

we need a prior that

◮ has support on this simplex, and

◮ is ideally also conjugate to the likelihood distribution, i.e. the multinomial.

SLIDE 20

Dirichlet Distribution

Dirichlet Distribution

Dir(x | α) := (1 / B(α)) ∏_{k=1}^{K} xk^(αk−1) · 1[x ∈ SK],

where B(α) is the normalization factor

B(α) := ∏_{k=1}^{K} Γ(αk) / Γ(α0), with α0 := Σ_{k=1}^{K} αk.
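numpy can sample from this distribution and scipy can evaluate its density; a brief sketch:

```python
import numpy as np
from scipy import stats

alpha = [2.0, 2.0, 2.0]

# one draw from the simplex S_3: nonnegative entries summing to 1
x = np.random.default_rng(0).dirichlet(alpha)
print(x, x.sum())

# density of that point under Dir(alpha)
print(stats.dirichlet(alpha).pdf(x))
```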

SLIDE 21

Dirichlet Distribution

[Figure: Dirichlet density over the probability simplex for α = 0.10.]

SLIDE 22

Dirichlet Distribution examples

α0 controls the strength, i.e. how peaked the distribution is, and the αk's control where the peak occurs:

(1,1,1): Uniform distribution.
(2,2,2): Broad distribution centered at (1/3, 1/3, 1/3).
(20,20,20): Narrow distribution centered at (1/3, 1/3, 1/3).

When αk < 1 for all k, we get "spikes" at the corners of the simplex.

SLIDE 23

Samples from Dirichlet Distribution

[Figure: Bar plots of samples from Dir(α = 0.1) and Dir(α = 1) over K = 5 outcomes, five samples each.]

SLIDE 24

Prior and Posterior

Prior

Dirichlet prior: Dir(θ | α) = (1 / B(α)) ∏_{k=1}^{K} θk^(αk−1)

Posterior

P(θ | D) ∝ P(D | θ) P(θ)
        ∝ ∏_{k=1}^{K} θk^Nk θk^(αk−1)
        = ∏_{k=1}^{K} θk^(αk+Nk−1)
        ∝ Dir(θ | α1 + N1, · · · , αK + NK)
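The conjugate update is again just adding counts to the hyperparameters; a sketch in the dice setting (the counts are invented):

```python
import numpy as np

alpha = np.ones(6)                    # uniform Dirichlet prior over 6 faces
N = np.array([3, 1, 4, 1, 5, 9])      # observed face counts

alpha_post = alpha + N                # posterior is Dir(alpha + N)
print(alpha_post)
print(alpha_post / alpha_post.sum())  # posterior mean E[theta_k | D]
```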

SLIDE 25

MAP estimate using Lagrange Multipliers

max over θ of

Dir(θ | α1 + N1, · · · , αK + NK) ∝ ∏_{k=1}^{K} θk^(αk+Nk−1)

subject to θ1 + θ2 + · · · + θK = 1. Use Lagrange multipliers! Solution:

θ̂k = (αk + Nk − 1) / (α0 + N − K).

With uniform prior this becomes θ̂k = Nk / N.
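Continuing the sketch above, the closed-form MAP estimate:

```python
import numpy as np

alpha = np.ones(6)
N = np.array([3, 1, 4, 1, 5, 9])

alpha0, Ntot, K = alpha.sum(), N.sum(), len(N)
theta_map = (alpha + N - 1) / (alpha0 + Ntot - K)
print(theta_map)
# with a uniform prior the MAP estimate reduces to the MLE N_k / N
assert np.allclose(theta_map, N / Ntot)
```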

SLIDE 26

Posterior Predictive

P(X = k | D) = ∫ P(X = k | θ) P(θ | D) dθ
             = ∫ P(X = k | θk) [∫ P(θ−k, θk | D) dθ−k] dθk
             = ∫ θk P(θk | D) dθk
             = E[θk | D] = (αk + Nk) / (α0 + N)

SLIDE 27

Application to Language Modelling

Suppose we observe the following sentences:

Sentences

Mary had a little lamb, little lamb, little lamb.
Mary had a little lamb, its fleece as white as snow.

Can we predict which word comes next?

SLIDE 28

Pre–Processing

◮ Vocabulary: mary, lamb, little, big, fleece, white, black, snow, rain, unk, represented by the numerals 1, · · · , 10 (where unk stands for "unknown").

◮ Strip away all punctuation and stop words like "a", "as", "the".

◮ Stemming, i.e. reduce words to their base form by removing plurals etc. For example, "running" becomes "run".

◮ This reduces the given sentences to the sequences:

1 10 3 2 3 2 3 2, 1 10 3 2 10 5 10 6 8.

SLIDE 29

Bag of Words Model

◮ Count word occurrences, giving, for the size-10 vocabulary, the counts: 2, 4, 4, 0, 1, 1, 0, 1, 0, 4 (so N = 17 in total).

◮ Using a Dir(1, 1, · · · , 1) prior gives:

P(X = k | D) = E[θk | D] = (αk + Nk) / (α0 + N) = (1 + Nk) / (10 + 17).

◮ So the predictive distribution for the next word is:

(3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27),

◮ whose modes are X = 2 ("lamb"), X = 3 ("little") and X = 10 (unk).

◮ The words "big", "black" and "rain" have non-zero probabilities, even though they have not been seen!
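A sketch that reproduces this computation end to end (token IDs copied from the slides):

```python
import numpy as np

K = 10                                     # vocabulary size
tokens = [1, 10, 3, 2, 3, 2, 3, 2,         # "mary had a little lamb, ..."
          1, 10, 3, 2, 10, 5, 10, 6, 8]    # "... its fleece as white as snow"

N = np.bincount(tokens, minlength=K + 1)[1:]  # counts N_k for k = 1..K
alpha = np.ones(K)                            # uniform Dir(1, ..., 1) prior

pred = (alpha + N) / (alpha.sum() + N.sum())  # (1 + N_k) / (10 + 17)
print(N)          # [2 4 4 0 1 1 0 1 0 4]
print(pred * 27)  # [3 5 5 1 2 2 1 2 1 5], i.e. the probabilities times 27
```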