Bayesian learning (with a recap of Bayesian Networks)
Applied artificial intelligence (EDA132), Lecture 05, 2016-02-02
Elin A. Topp
Material based on the course book, chapters 14.1-3 and 20, and on Tom M. Mitchell, “Machine Learning”, McGraw-Hill, 1997

Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
- a set of nodes, one per random variable
- a directed, acyclic graph (link ≈ “directly influences”)
- a conditional distribution for each node given its parents: $P(X_i \mid \mathrm{Parents}(X_i))$
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over $X_i$ for each combination of parent values.

Example
Topology of network encodes conditional independence assertions.
[Figure: network with nodes Weather, Cavity, Toothache, Catch; Cavity is the parent of Toothache and Catch]
Cavity:    P(Cav) = 0.2    P(¬Cav) = 0.8
Weather:   P(W=sunny) = 0.72    P(W=rainy) = 0.1    P(W=cloudy) = 0.08    P(W=snow) = 0.1
Toothache:
  Cav   P(T|Cav)   P(¬T|Cav)
  T     0.6        0.4
  F     0.1        0.9
Catch:
  Cav   P(C|Cav)   P(¬C|Cav)
  T     0.9        0.1
  F     0.2        0.8
Weather is (unconditionally, absolutely) independent of the other variables.
Toothache and Catch are conditionally independent given Cavity.
The dependent columns can be omitted from the tables to reduce complexity, since they follow from the others, e.g. P(¬T|Cav) = 1 − P(T|Cav).

Example 2
I am at work. My neighbour John calls to say my alarm is ringing, but neighbour Mary does not call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause John to call
- The alarm can cause Mary to call

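Example 2 (2)
[Figure: network with Burglary and Earthquake as parents of Alarm; Alarm as parent of JohnCalls and MaryCalls]
Burglary:    P(B) = 0.001
Earthquake:  P(E) = 0.002
Alarm:
  B   E   P(A|B,E)
  T   T   0.95
  T   F   0.94
  F   T   0.29
  F   F   0.001
JohnCalls:
  A   P(J|A)
  T   0.9
  F   0.05
MaryCalls:
  A   P(M|A)
  T   0.7
  F   0.01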

Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(X_i))$
E.g., in the burglary network (nodes B, E, A, J, M):
$P(j \land m \land a \land \lnot b \land \lnot e) = P(j \mid a)\, P(m \mid a)\, P(a \mid \lnot b, \lnot e)\, P(\lnot b)\, P(\lnot e) = 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.000628$
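The product above can be checked numerically. The following is a minimal sketch in Python, using the CPT values from the burglary example; the helper names `prob` and `joint` are illustrative and not part of the lecture material.

```python
import itertools

# CPT entries from the burglary example: probability that each variable is True.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
P_J = {True: 0.9, False: 0.05}                       # keyed by A
P_M = {True: 0.7, False: 0.01}                       # keyed by A

def prob(p_true, value):
    """Return P(X = value) given P(X = True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Full joint entry as the product of the local conditional probabilities."""
    return (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
            * prob(P_J[a], j) * prob(P_M[a], m))

# P(j, m, a, not b, not e), the entry worked out on the slide.
p = joint(b=False, e=False, a=True, j=True, m=True)
print(f"{p:.6f}")  # 0.000628

# Sanity check: the 32 entries of the joint distribution sum to 1.
total = sum(joint(*values) for values in itertools.product([True, False], repeat=5))
```

Summing `joint` over all 32 assignments gives 1 (up to floating-point error), which is a quick way to confirm that the five small tables really do specify a full joint distribution.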

Constructing Bayesian networks
We need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
1. Choose an ordering of variables $X_1, \ldots, X_n$
2. For $i = 1$ to $n$: add $X_i$ to the network and select parents from $X_1, \ldots, X_{i-1}$ such that $P(X_i \mid \mathrm{Parents}(X_i)) = P(X_i \mid X_1, \ldots, X_{i-1})$
This choice of parents guarantees the global semantics:
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$ (chain rule)
$= \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$ (by construction)

Construction example
Ordering chosen: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
- Deciding conditional independence is hard in noncausal directions (causal models and conditional independence seem hardwired for humans!)
- Assessing conditional probabilities is hard in noncausal directions
- The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers, compared with 1 + 1 + 4 + 2 + 2 = 10 for the causal ordering of Example 2
Hence: preferably choose an order corresponding to the cause → effect “chain”.
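The parameter counts can be reproduced by summing one CPT row per combination of parent values. The sketch below assumes Boolean variables throughout; the parent sets for the noncausal ordering are not listed on the slide and are an assumption chosen to match its count of 1 + 2 + 4 + 2 + 4. The function name `cpt_size` is hypothetical.

```python
def cpt_size(parents):
    """Independent numbers needed for a network of Boolean variables:
    each node needs one probability per combination of its parents' values."""
    return sum(2 ** len(p) for p in parents.values())

# Causal ordering (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls).
causal = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

# Noncausal ordering from the slide; parent sets are assumed, see lead-in.
noncausal = {
    "MaryCalls": [],
    "JohnCalls": ["MaryCalls"],
    "Alarm": ["MaryCalls", "JohnCalls"],
    "Burglary": ["Alarm"],
    "Earthquake": ["Alarm", "Burglary"],
}

print(cpt_size(causal))     # 10
print(cpt_size(noncausal))  # 13
print(2 ** 5 - 1)           # 31 numbers for the unstructured full joint table
```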

Locally structured (sparse) network
Initial evidence: The *** car won’t start!
Testable variables (green), “broken, so fix it” variables (yellow)
Hidden variables (blue) ensure sparse structure / reduce parameters
[Figure: car diagnosis network with nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, battery meter, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, dipstick, car won’t start]

And now - learning.
How do we get the numbers into the network?
How do we determine the network structure?
More generally: how can we predict and explain based on (limited) experience?

A robot’s view of the world...
[Figure: scan data and robot position; both axes: distance in mm relative to robot position]

Predicting the next pattern type
[Figure: scan data and robot position, with two patterns marked “?”; both axes: distance in mm relative to robot position]
Images are preprocessed into categories / collections according to the type of situation and the possible number of “leg-like” patterns, based on the knowledge of how many persons were in the room at a given time.
Labels for the image categories are lost; only numbers and pattern labels remain…
Hypotheses for the types of pattern collection (i.e., images from a certain situation) are also available, with their priors:
  h_1: only furniture                          P(h_1) = 0.1
  h_2: mostly furniture (75%), few persons     P(h_2) = 0.2
  h_3: half furniture (50%), half persons      P(h_3) = 0.4
  h_4: little furniture (25%), mostly persons  P(h_4) = 0.2
  h_5: only persons                            P(h_5) = 0.1

Maximum Likelihood
We can predict (probabilities) by maximizing the likelihood of having observed some particular data, with the help of the Maximum Likelihood hypothesis:
$h_{ML} = \arg\max_h P(D \mid h)$
… which is a strong simplification disregarding the priors…

“Maximum A Posteriori” - MAP
Finding the slightly more sophisticated Maximum A Posteriori hypothesis:
$h_{MAP} = \arg\max_h P(h \mid D)$
Then predict by assuming the MAP hypothesis (quite bold):
$\mathbf{P}(X \mid D) \approx P(X \mid h_{MAP})$

Optimal Bayes learner
Prediction for X, given some observations $D = \langle d_0, d_1, \ldots, d_n \rangle$:
$\mathbf{P}(X \mid D) = \sum_i \mathbf{P}(X \mid h_i)\, P(h_i \mid D)$
In the first step, before any observation has been made, $P(h_i \mid D) = P(h_i)$...

Learning from experience
Prediction for the first pattern picked, assuming e.g. $h_3$, when no observations have been made:
$P(d_0 = \mathit{Furniture} \mid h_3) = P(d_0 = \mathit{Person} \mid h_3) = 0.5$
The first pattern is of type Person, so now we know: $P(h_1 \mid d_0) = 0$ (as $P(d_0 \mid h_1) = 0$), etc.
After 10 patterns that all turn out to be Person, assuming that outcomes for $d_i$ are i.i.d. (independent and identically distributed):
$P(D \mid h_k) = \prod_i P(d_i \mid h_k)$
$\mathbf{P}(h_k \mid D) = \mathbf{P}(D \mid h_k)\, P(h_k) / \mathbf{P}(D) = \alpha\, \mathbf{P}(D \mid h_k)\, P(h_k)$
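The update above can be traced numerically for the five hypotheses. This is a minimal sketch, assuming the priors from the hypothesis list and reading the furniture percentages as P(Person | h_k) = 0, 0.25, 0.5, 0.75, 1; the names `PRIORS`, `P_PERSON` and `posterior` are illustrative only.

```python
# Priors P(h_1) ... P(h_5) and the per-hypothesis probability of a Person pattern.
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]
P_PERSON = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(observations):
    """Return [P(h_k | D)] for a list of 'Person' / 'Furniture' observations,
    assuming the observations are i.i.d. given the hypothesis."""
    unnormalized = []
    for prior, p in zip(PRIORS, P_PERSON):
        likelihood = 1.0                      # P(D | h_k) = prod_i P(d_i | h_k)
        for d in observations:
            likelihood *= p if d == "Person" else 1.0 - p
        unnormalized.append(likelihood * prior)
    evidence = sum(unnormalized)              # P(D), i.e. 1 / alpha
    return [u / evidence for u in unnormalized]

for n in (0, 1, 3, 10):
    print(n, [round(x, 4) for x in posterior(["Person"] * n)])
```

With no observations the posterior equals the prior; after the first Person observation h_1 drops to zero and stays there, since its likelihood is zero.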

Posterior probabilities
[Figure: posterior probability $P(h_k \mid d)$ for each hypothesis $h_1, \ldots, h_5$ after i observations; x-axis: number of observations in d (0 to 10); y-axis: posterior probability (0 to 1)]

Prediction after sampling
[Figure: probability for the next pattern being caused by a person; x-axis: number of observations in d (0 to 10); y-axis: probability (0.4 to 1)]

Optimal learning vs MAP-estimating
Predict by assuming the MAP hypothesis:
$\mathbf{P}(X \mid D) \approx P(X \mid h_{MAP})$ with $h_{MAP} = \arg\max_h P(h \mid D)$
i.e., after three Person observations, $h_{MAP} = h_5$ and
$P_{h_{MAP}}(d_4 = \mathit{Person} \mid d_1 = d_2 = d_3 = \mathit{Person}) = P(X \mid h_5) = 1$
while the optimal classifier / learner predicts
$P(d_4 = \mathit{Person} \mid d_1 = d_2 = d_3 = \mathit{Person}) = \ldots = 0.7961$
However, the two predictions grow closer as more data is observed. Consequently, the MAP learner should not be relied on for small sets of training data.
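The two numbers on this slide can be reproduced with a few lines of Python. The sketch below assumes the priors 0.1, 0.2, 0.4, 0.2, 0.1 and P(Person | h_k) = 0, 0.25, 0.5, 0.75, 1, and only handles the case where all n observations are Person; the function names are hypothetical.

```python
PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]
P_PERSON = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior_after_persons(n):
    """P(h_k | D) when D consists of n Person observations (i.i.d. assumption)."""
    unnormalized = [prior * p ** n for prior, p in zip(PRIORS, P_PERSON)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def predict_optimal(n):
    """Bayes-optimal prediction: average P(Person | h_k) over the posterior."""
    post = posterior_after_persons(n)
    return sum(p * w for p, w in zip(P_PERSON, post))

def predict_map(n):
    """MAP prediction: use only the single most probable hypothesis."""
    post = posterior_after_persons(n)
    k = max(range(len(post)), key=post.__getitem__)
    return P_PERSON[k]

print(round(predict_optimal(3), 4))  # 0.7961
print(predict_map(3))                # 1.0

for n in range(11):
    print(n, round(predict_optimal(n), 4), predict_map(n))
```

The loop shows the MAP prediction jumping between hypotheses (0.5, then 0.75, then 1) while the optimal prediction rises smoothly towards 1, which illustrates why the two agree only once enough data has been seen.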

The Gibbs Algorithm
The optimal Bayes learner is costly, and the MAP learner might be as well.
The Gibbs algorithm works surprisingly well under certain conditions regarding the a posteriori distribution over H:
1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H (i.e., rule out “impossible” hypotheses)
2. Use h to predict the classification of the next instance x.
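The two steps can be sketched for the robot example as follows. This is an illustrative sketch only: it assumes the posterior after three Person observations, computed from the priors and P(Person | h_k) values used earlier, and it classifies with the more probable class under the sampled hypothesis (ties broken arbitrarily in favour of Person). The name `gibbs_predict` is hypothetical.

```python
import random

PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]
P_PERSON = [0.0, 0.25, 0.5, 0.75, 1.0]

# Posterior over h_1 ... h_5 after three Person observations.
unnormalized = [prior * p ** 3 for prior, p in zip(PRIORS, P_PERSON)]
posterior = [u / sum(unnormalized) for u in unnormalized]

def gibbs_predict(posterior, rng):
    """Step 1: sample one hypothesis index according to the posterior.
    Step 2: classify the next pattern using only that hypothesis."""
    k = rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
    label = "Person" if P_PERSON[k] >= 0.5 else "Furniture"
    return k, label

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [gibbs_predict(posterior, rng) for _ in range(1000)]
print(sum(label == "Person" for _, label in draws) / len(draws))
```

Because h_1 has posterior probability zero, it is never sampled; this is the sense in which sampling from the posterior rules out “impossible” hypotheses.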
