Probability: Machine Learning and Pattern Recognition, Chris Williams (PowerPoint presentation)



SLIDE 1

Probability

Machine Learning and Pattern Recognition Chris Williams

School of Informatics, University of Edinburgh

August 2014

(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)

SLIDE 2

Outline

◮ What is probability?
◮ Random Variables (discrete and continuous)
◮ Expectation
◮ Joint Distributions
◮ Marginal Probability
◮ Conditional Probability
◮ Chain Rule
◮ Bayes’ Rule
◮ Independence
◮ Conditional Independence
◮ Some Probability Distributions (for reference)
◮ Reading: Murphy secs 2.1-2.4

SLIDE 3

What is probability?

◮ Quantification of uncertainty
◮ Frequentist interpretation: long-run frequencies of events
◮ Example: the probability of a particular coin landing heads up is 0.43
◮ Bayesian interpretation: quantify our degrees of belief about something
◮ Example: the probability of it raining tomorrow is 0.3
◮ Not possible to repeat “tomorrow” many times
◮ Basic rules of probability are the same, no matter which interpretation is adopted

SLIDE 4

Random Variables

◮ A random variable (RV) X denotes a quantity that is subject to variations due to chance
◮ May denote the result of an experiment (e.g. flipping a coin) or the measurement of a real-world fluctuating quantity (e.g. temperature)
◮ Use capital letters to denote random variables and lower case letters to denote values that they take, e.g. p(X = x)
◮ An RV may be discrete or continuous
◮ A discrete variable takes on values from a finite or countably infinite set
◮ Probability mass function p(X = x) for discrete random variables

SLIDE 5

◮ Examples:
  ◮ Colour of a car: blue, green, red
  ◮ Number of children in a family: 0, 1, 2, 3, 4, 5, 6, > 6
  ◮ Toss two coins, let X = (number of heads)². X can take on the values 0, 1 and 4.
◮ Example: p(Colour = red) = 0.3
◮ ∑_x p(x) = 1
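The two-coin example can be checked by enumerating the four equally likely outcomes (a minimal Python sketch, assuming fair coins):

```python
from itertools import product
from collections import Counter

# Enumerate the four equally likely outcomes of tossing two fair coins
# and tabulate X = (number of heads)^2.
pmf = Counter()
for coins in product([0, 1], repeat=2):  # 0 = tails, 1 = heads
    x = sum(coins) ** 2
    pmf[x] += 0.25  # each outcome has probability 1/4

print(dict(pmf))  # {0: 0.25, 1: 0.5, 4: 0.25}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # sum_x p(x) = 1
```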

SLIDE 6

Continuous RVs

◮ Continuous RVs take on values that vary continuously within one or more real intervals
◮ Probability density function (pdf) p(x) for a continuous random variable X:
  p(a ≤ X ≤ b) = ∫_a^b p(x) dx, therefore p(x ≤ X ≤ x + δx) ≃ p(x)δx
◮ ∫ p(x) dx = 1 (but values of p(x) can be greater than 1)
◮ Examples (coming soon): Gaussian, Gamma, Exponential, Beta
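Both points, that the density integrates to 1 yet p(x) itself may exceed 1, can be illustrated numerically. A sketch using a Gaussian with the illustrative choice σ = 0.1 and a simple trapezoidal rule:

```python
import math

def gauss_pdf(x, mu=0.0, sigma=0.1):
    """Gaussian density; with sigma = 0.1 the peak is ~3.99 > 1."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Trapezoidal approximation of the integral of p(x) over [-1, 1]
n, a, b = 10_000, -1.0, 1.0
h = (b - a) / n
integral = sum(gauss_pdf(a + i * h) for i in range(n + 1)) * h \
           - 0.5 * h * (gauss_pdf(a) + gauss_pdf(b))

print(round(gauss_pdf(0.0), 2))  # 3.99: density at the mean exceeds 1
print(round(integral, 4))        # 1.0: the density still integrates to 1
```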

SLIDE 7

Expectation

◮ Consider a function f(x) mapping from x onto numerical values. Then
  E[f(x)] = ∑_x f(x)p(x)  or  ∫ f(x)p(x) dx
  for discrete and continuous variables respectively
◮ With f(x) = x, we obtain the mean, µ_x
◮ With f(x) = (x − µ_x)², we obtain the variance

SLIDE 8

Joint distributions

◮ Properties of several random variables are important for modelling complex problems
◮ p(X1 = x1, X2 = x2, . . . , XD = xD)
◮ “,” is read as “and”
◮ Example about Grade and Intelligence (from Koller and Friedman, 2009):

              Intelligence = low   Intelligence = high
  Grade = A         0.07                 0.18
  Grade = B         0.28                 0.09
  Grade = C         0.35                 0.03

SLIDE 9

Marginal Probability

◮ The sum rule:
  p(x) = ∑_y p(x, y)
◮ p(Grade = A) ??
◮ Replace the sum by an integral for continuous RVs
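Applying the sum rule to the Grade/Intelligence table answers the question on this slide. A sketch, with the joint stored as a dict keyed by (grade, intelligence):

```python
# Joint distribution p(Grade, Intelligence) from Koller and Friedman (2009)
joint = {
    ('A', 'low'): 0.07, ('A', 'high'): 0.18,
    ('B', 'low'): 0.28, ('B', 'high'): 0.09,
    ('C', 'low'): 0.35, ('C', 'high'): 0.03,
}

# Sum rule: p(Grade = g) = sum over y of p(Grade = g, Intelligence = y)
def p_grade(g):
    return sum(p for (grade, _), p in joint.items() if grade == g)

print(round(p_grade('A'), 2))  # 0.25  (= 0.07 + 0.18)
```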

SLIDE 10

Conditional Probability

◮ Let X and Y be two disjoint groups of variables, such that p(Y = y) > 0. Then the conditional probability distribution (CPD) of X given Y = y is given by
  p(X = x|Y = y) = p(x|y) = p(x, y) / p(y)
◮ Product rule:
  p(X, Y) = p(X)p(Y|X) = p(Y)p(X|Y)
◮ Example: In the grades example, what is p(Intelligence = high|Grade = A)?
◮ ∑_x p(X = x|Y = y) = 1 for all y
◮ Can we say anything about ∑_y p(X = x|Y = y)?
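The question on this slide can be answered directly from the table. A sketch combining the definition of conditional probability with the sum rule:

```python
# Joint p(Grade, Intelligence) from the grades example
joint = {
    ('A', 'low'): 0.07, ('A', 'high'): 0.18,
    ('B', 'low'): 0.28, ('B', 'high'): 0.09,
    ('C', 'low'): 0.35, ('C', 'high'): 0.03,
}

# p(Intelligence = i | Grade = g) = p(g, i) / p(g), with p(g) from the sum rule
def p_intel_given_grade(i, g):
    p_g = sum(p for (grade, _), p in joint.items() if grade == g)
    return joint[(g, i)] / p_g

print(round(p_intel_given_grade('high', 'A'), 2))  # 0.72  (= 0.18 / 0.25)
```

Note that the conditional distribution normalises over its first argument: p(high|A) + p(low|A) = 1.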

SLIDE 11

Chain Rule

The chain rule is derived by repeated application of the product rule

p(X1, . . . , XD) = p(X1, . . . , XD−1) p(XD|X1, . . . , XD−1)
                 = p(X1, . . . , XD−2) p(XD−1|X1, . . . , XD−2) p(XD|X1, . . . , XD−1)
                 = . . .
                 = p(X1) ∏_{i=2}^D p(Xi|X1, . . . , Xi−1)

◮ Exercise: give six decompositions of p(x, y, z) using the chain rule

SLIDE 12

Bayes’ Rule

◮ From the product rule,
  p(X|Y) = p(Y|X)p(X) / p(Y)
◮ From the sum rule, the denominator is
  p(Y) = ∑_X p(Y|X)p(X)

SLIDE 13

Probabilistic Inference using Bayes’ Rule

◮ Tuberculosis (TB) and a skin test (Test)
◮ p(TB = yes) = 0.001 (for subjects who get tested)
◮ p(Test = yes|TB = yes) = 0.95
◮ p(Test = no|TB = no) = 0.95
◮ Person gets a positive test result. What is p(TB = yes|Test = yes)?

  p(TB = yes|Test = yes) = p(Test = yes|TB = yes) p(TB = yes) / p(Test = yes)
                         = (0.95 × 0.001) / (0.95 × 0.001 + 0.05 × 0.999)
                         ≃ 0.0187

NB: These are fictitious numbers
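The slide’s calculation as code (a sketch; the numbers are the fictitious ones above):

```python
# Prior and test characteristics from the slide (fictitious numbers)
p_tb = 0.001            # p(TB = yes)
p_pos_given_tb = 0.95   # p(Test = yes | TB = yes)
p_neg_given_no = 0.95   # p(Test = no  | TB = no)

# Sum rule for the evidence: p(Test = yes)
p_pos = p_pos_given_tb * p_tb + (1 - p_neg_given_no) * (1 - p_tb)

# Bayes' rule
p_tb_given_pos = p_pos_given_tb * p_tb / p_pos
print(round(p_tb_given_pos, 4))  # 0.0187
```

Despite the positive result, the posterior is under 2%, because the disease is rare and false positives dominate.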

SLIDE 14

Independence

◮ Let X and Y be two disjoint groups of variables. Then X is said to be independent of Y if and only if p(X|Y) = p(X) for all possible values x and y of X and Y; otherwise X is said to be dependent on Y
◮ Using the definition of conditional probability, we get an equivalent expression for the independence condition:
  p(X, Y) = p(X)p(Y)
◮ X independent of Y ⇔ Y independent of X
◮ Independence of a set of variables: X1, . . . , XD are independent iff
  p(X1, . . . , XD) = ∏_{i=1}^D p(Xi)
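The condition p(X, Y) = p(X)p(Y) is easy to test on the grades table, which shows that Grade and Intelligence are dependent. A sketch (the tolerance handles floating-point rounding):

```python
from itertools import product

joint = {
    ('A', 'low'): 0.07, ('A', 'high'): 0.18,
    ('B', 'low'): 0.28, ('B', 'high'): 0.09,
    ('C', 'low'): 0.35, ('C', 'high'): 0.03,
}

def independent(joint, tol=1e-9):
    """True iff p(x, y) == p(x) p(y) for every cell of the joint."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    px = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginals via sum rule
    py = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x, y in product(xs, ys))

print(independent(joint))  # False: p(A, high) = 0.18 but p(A) p(high) = 0.25 * 0.3
```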

SLIDE 15

Conditional Independence

◮ Let X, Y and Z be three disjoint groups of variables. X is said to be conditionally independent of Y given Z iff p(x|y, z) = p(x|z) for all possible values of x, y and z
◮ Equivalently, p(x, y|z) = p(x|z)p(y|z) [show this]
◮ Notation: I(X, Y|Z)

SLIDE 16

Bernoulli Distribution

◮ X is a random variable that either takes the value 0 or the value 1
◮ Let p(X = 1|p) = p and so p(X = 0|p) = 1 − p
◮ Then X has a Bernoulli distribution

[Figure: Bernoulli pmf]

SLIDE 17

Categorical Distribution

◮ X is a random variable that takes one of the values 1, 2, . . . , D
◮ Let p(X = i|p) = p_i, with ∑_{i=1}^D p_i = 1
◮ Then X has a categorical (aka multinoulli) distribution (see Murphy 2012, p. 35)

[Figure: categorical pmf]

SLIDE 18

Binomial Distribution

◮ The binomial distribution is obtained from the total number of 1’s in n independent Bernoulli trials
◮ X is a random variable that takes one of the values 0, 1, 2, . . . , n
◮ Let p(X = r|p) = (n choose r) pʳ (1 − p)ⁿ⁻ʳ
◮ Then X is binomially distributed

[Figure: binomial pmf]
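The pmf can be written directly with the binomial coefficient. A sketch that checks it sums to 1 and matches brute-force enumeration of Bernoulli sequences (n = 4 and p = 0.3 are illustrative choices):

```python
from itertools import product
from math import comb

def binom_pmf(r, n, p):
    """p(X = r) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 4, 0.3
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12  # the pmf sums to 1

# Cross-check against explicit enumeration of all 2^n Bernoulli sequences
brute = [0.0] * (n + 1)
for seq in product([0, 1], repeat=n):
    prob = 1.0
    for b in seq:
        prob *= p if b == 1 else 1 - p
    brute[sum(seq)] += prob
assert all(abs(a - b) < 1e-12 for a, b in zip(pmf, brute))
print("binomial pmf matches Bernoulli-trial enumeration")
```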

SLIDE 19

Multinomial Distribution

◮ The multinomial distribution is obtained from the total count for each outcome in n independent multivariate trials with D possible outcomes
◮ X is a random vector of length D taking values x with x_i ∈ ℤ⁺ (non-negative integers) and ∑_{i=1}^D x_i = n
◮ Let
  p(X = x|p) = n!/(x_1! · · · x_D!) p_1^{x_1} · · · p_D^{x_D}
◮ Then X is multinomially distributed

SLIDE 20

Poisson Distribution

◮ The Poisson distribution is obtained from the binomial distribution in the limit n → ∞ with np = λ held fixed
◮ X is a random variable taking non-negative integer values 0, 1, 2, . . .
◮ Let
  p(X = x|λ) = λˣ exp(−λ) / x!
◮ Then X is Poisson distributed

[Figure: Poisson pmf]
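The limiting relationship can be checked numerically: for large n, Binomial(n, λ/n) probabilities approach Poisson(λ) probabilities. A sketch with the illustrative choices λ = 3 and n = 10,000:

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """p(X = x) = lam^x exp(-lam) / x!"""
    return lam ** x * exp(-lam) / factorial(x)

def binom_pmf(r, n, p):
    """p(X = r) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

lam, n = 3.0, 10_000
for x in range(10):
    # Binomial(n, lam/n) approaches Poisson(lam) as n grows
    assert abs(binom_pmf(x, n, lam / n) - poisson_pmf(x, lam)) < 1e-3
print("Binomial(10000, 3/10000) matches Poisson(3) to within 1e-3")
```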

SLIDE 21

Uniform Distribution

◮ X is a random variable taking values x ∈ [a, b]
◮ Let p(X = x) = 1/(b − a)
◮ Then X is uniformly distributed

Note: cannot have a uniform distribution on an unbounded region

[Figure: uniform pdf]

SLIDE 22

Gaussian Distribution

◮ X is a random variable taking values x ∈ ℝ (real values)
◮ Let
  p(X = x|µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
◮ Then X is Gaussian distributed with mean µ and variance σ²

[Figure: Gaussian pdf]

SLIDE 23

Gamma Distribution

◮ The Gamma distribution has a rate parameter β > 0 (or a scale parameter 1/β) and a shape parameter α > 0
◮ X is a random variable taking values x ∈ ℝ⁺ (non-negative real values)
◮ Let
  p(X = x|α, β) = (1/Γ(α)) x^{α−1} β^α exp(−βx)
◮ Then X is Gamma distributed
◮ Note the Gamma function Γ(·)

[Figure: Gamma pdf]

SLIDE 24

Exponential Distribution

◮ The exponential distribution is a Gamma distribution with α = 1
◮ The exponential distribution is often used for arrival times
◮ X is a random variable taking values x ∈ ℝ⁺
◮ Let p(X = x|λ) = λ exp(−λx)
◮ Then X is exponentially distributed

[Figure: exponential pdf]

SLIDE 25

Laplace Distribution

◮ The Laplace distribution is obtained from the difference between two independent identically exponentially distributed variables
◮ X is a random variable taking values x ∈ ℝ
◮ Let p(X = x|λ) = (λ/2) exp(−λ|x|)
◮ Then X is Laplace distributed

[Figure: Laplace pdf]

SLIDE 26

Beta Distribution

◮ X is a random variable taking values x ∈ [0, 1]
◮ Let
  p(X = x|a, b) = Γ(a + b)/(Γ(a)Γ(b)) x^{a−1} (1 − x)^{b−1}
◮ Then X is Beta(a, b) distributed

[Figure: Beta pdfs for a = b = 0.5 and for a = 2, b = 3]
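The normalisation can be verified numerically for the a = 2, b = 3 case from the figure (chosen because the density is then bounded, so a simple midpoint rule works; a sketch):

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta(a, b) density on [0, 1]."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

# Midpoint-rule check that the density integrates to 1 for a = 2, b = 3,
# where beta_pdf(x, 2, 3) = 12 x (1 - x)^2 is bounded on [0, 1]
n = 100_000
h = 1.0 / n
integral = sum(beta_pdf((i + 0.5) * h, 2, 3) for i in range(n)) * h
print(round(integral, 6))  # 1.0
```

(For a = b = 0.5 the density diverges at both endpoints, so this naive quadrature would need more care there.)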

SLIDE 27

The Kronecker Delta

◮ Think of a discrete distribution with all its probability mass on one value j, so that p(X = i) = 1 iff (if and only if) i = j
◮ We can write this using the Kronecker delta:
  p(X = i) = δ_ij
◮ δ_ij = 1 iff i = j, and is zero otherwise

SLIDE 28

The Dirac Delta

◮ Think of a real-valued distribution with all its probability density on one value
◮ There is an infinite density peak at one point (let’s call this point a)
◮ We can write this using the Dirac delta:
  p(X = x) = δ(x − a)
  which has the properties δ(x − a) = 0 if x ≠ a, δ(x − a) = ∞ if x = a,
  ∫_{−∞}^{∞} δ(x − a) dx = 1 and ∫_{−∞}^{∞} f(x) δ(x − a) dx = f(a)
◮ You could think of it as a Gaussian distribution in the limit of zero variance

SLIDE 29

Other Distributions

◮ The chi-squared distribution with k degrees of freedom is a Gamma distribution with α = k/2 and β = 1/2
◮ Dirichlet distribution: will be used on this course
◮ Weibull distribution (a generalisation of the exponential)
◮ Geometric distribution
◮ Negative binomial distribution
◮ Wishart distribution (a distribution over matrices)
◮ Use Wikipedia and MathWorld: good summaries for distributions

SLIDE 30

Things you must never (ever) forget

◮ Probabilities must be between 0 and 1 (though probability densities can be greater than 1).

◮ Distributions must sum (or integrate) to 1.

SLIDE 31

Summary

◮ Joint distributions
◮ Conditional Probability
◮ Sum and Product Rules
◮ Standard Probability distributions
◮ Reading: Murphy secs 2.1-2.4
