Probability, Decision Theory, and Loss Functions CMSC 678 UMBC - - PowerPoint PPT Presentation



SLIDE 1

Probability, Decision Theory, and Loss Functions

CMSC 678 UMBC

Some slides adapted from Hamed Pirsiavash

SLIDE 2

Logistics Recap

Piazza (ask & answer questions):

https://piazza.com/umbc/spring2019/cmsc678

Course site:

https://www.csee.umbc.edu/courses/graduate/678/spring19

Evaluation submission site:

https://www.csee.umbc.edu/courses/graduate/678/spring19/submit

SLIDE 3

Course Announcement: Assignment 1

Due Friday, 2/8 (~9 days). Math & programming review. Discuss with others, but write, implement, and complete it on your own

SLIDE 4

A Terminology Buffet

Classification Regression Clustering

the task: what kind of problem are you solving?

SLIDE 5

A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

the task: what kind of problem are you solving?

the data: amount of human input/number of labeled examples
SLIDE 6

A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

SLIDE 7

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 8

What does it mean to learn?

Generalization

SLIDE 9

Machine Learning Framework: Learning

[Framework diagram: instances 1–4 → Machine Learning Predictor (plus extra knowledge) → Evaluator with gold/correct labels → score, which gives feedback to the predictor; instances are typically examined independently]

scoring model scoreθ(X) → objective F(θ)

SLIDE 10

Model, parameters and hyperparameters

Model: mathematical formulation of system (e.g., classifier) Parameters: primary “knobs” of the model that are set by a learning algorithm Hyperparameter: secondary “knobs”

http://www.uiparade.com/wp-content/uploads/2012/01/ui-design-pure-css.jpg

SLIDE 11

Gradient Ascent

SLIDE 12

What do we know before we see the data, and how does that influence our modeling decisions?

General ML Consideration: Inductive Bias

Courtesy Hamed Pirsiavash

SLIDE 13

General ML Consideration: Inductive Bias

A C B D Partition these into two groups…

Courtesy Hamed Pirsiavash

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 14

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 15

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 16

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

Tip: Remember how your own biases/interpretation are influencing your approach

SLIDE 17

Today’s Goals:

1. Remember Probability/Statistics
2. Understand Optimizing Empirical Risk

SLIDE 18

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 19

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 20

(Most) Probability Axioms

p(everything) = 1; p(∅) = 0; p(A) ≤ p(B) when A ⊆ B; p(A ∪ B) = p(A) + p(B) when A ∩ B = ∅

everything A B

In general, p(A ∪ B) = p(A) + p(B) − p(A ∩ B), so p(A ∪ B) ≠ p(A) + p(B) when A and B overlap
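These axioms can be checked numerically on a tiny sample space; a minimal sketch (the outcomes, weights, and events below are made up for illustration):

```python
# Toy sample space: outcomes with probabilities summing to 1 (illustrative values).
p = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}

def prob(event):
    """p(E) = sum of the probabilities of the outcomes in E."""
    return sum(p[o] for o in event)

A = {"a", "b"}
B = {"b", "c"}

# General case: inclusion-exclusion.
assert abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-9

# Disjoint case: p(A ∪ B) = p(A) + p(B) when A ∩ B = ∅.
C = {"d"}
assert abs(prob(A | C) - (prob(A) + prob(C))) < 1e-9
```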

SLIDE 21

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

SLIDE 22

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process” Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes: X=HEADS or X=TAILS

SLIDE 23

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount snow that fell, in inches Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

SLIDE 24

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount snow that fell, in inches Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

DISCRETE random variable CONTINUOUS random variable

SLIDE 25

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)

SLIDE 26

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)

SLIDE 27

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0

SLIDE 28

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0
We “add” with | sums (∑) | integrals (∫)

SLIDE 29

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0
We “add” with | sums (∑) | integrals (∫)
Our PMF/PDF satisfies p(everything)=1 by | ∑_k p(X = k) = 1 | ∫ p(x) dx = 1
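The two normalization conditions can be checked numerically; a minimal sketch (the PMF values and the Riemann-sum bounds are chosen for illustration):

```python
import math

# Discrete: a PMF sums to 1 over all values k (illustrative PMF).
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Continuous: a PDF integrates to 1; approximate the integral of a
# standard normal density with a Riemann sum over [-10, 10].
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

dx = 0.001
total = sum(normal_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
assert abs(total - 1.0) < 1e-3
```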

SLIDE 30

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 31

Joint Probability

Probability that multiple things “happen together”

everything A B Joint probability

SLIDE 32

Joint Probability

Probability that multiple things “happen together” p(x,y), p(x,y,z), p(x,y,w,z) Symmetric: p(x,y) = p(y,x)

everything A B Joint probability

SLIDE 33

Joint Probability

Probability that multiple things “happen together” p(x,y), p(x,y,z), p(x,y,w,z) Symmetric: p(x,y) = p(y,x) Form a table based on outcomes: sum across cells = 1

everything A B Joint probability

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

SLIDE 34

Joint Probabilities

1

p(A)

what happens as we add conjuncts?

SLIDE 35

Joint Probabilities

1

p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 36

Joint Probabilities

1

p(A, B, C) p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 37

Joint Probabilities

p(A, B, C, D)

1

p(A, B, C) p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 38

Joint Probabilities

p(A, B, C, D)

1

p(A, B, C) p(A, B) p(A) p(A, B, C, D, E)

what happens as we add conjuncts?

SLIDE 39

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 40

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent?

SLIDE 41

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent? A: Yes (assuming no weird effects)

SLIDE 42

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

everything A B

Q: Are A and B independent?

SLIDE 43

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

everything A B

Q: Are A and B independent? A: No (work it out from p(A,B)) and the axioms

SLIDE 44

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

SLIDE 45

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

A: No (find the marginal probabilities of p(x) and p(y))
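The answer can be checked directly from the slide's joint table; a minimal sketch that computes both marginals and tests p(x,y) = p(x)·p(y) for every cell:

```python
# Joint distribution from the slide's table: p(x, y).
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # cells sum to 1

# Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Independence would require p(x, y) = p(x) * p(y) for every cell.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9 for (x, y) in joint)
print(independent)  # False: X and Y are not independent
```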

SLIDE 46

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 47

Marginal(ized) Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y Consider the mutually exclusive ways that different values of x could occur with y

Q: How do we write this in terms of joint probabilities?

SLIDE 48

Marginal(ized) Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y

p(y) = ∑_x p(x, y)

Consider the mutually exclusive ways that different values of x could occur with y

SLIDE 49

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 50

Conditional Probability

p(X | Y) = p(X, Y) / p(Y)

Conditional Probabilities are Probabilities

SLIDE 51

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) is the marginal probability of Y

SLIDE 52

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) = ∫ p(X, Y) dX

SLIDE 53

Conditional Probabilities: Changing the Right

1

p(A)

what happens as we add conjuncts to the right?

SLIDE 54

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 55

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 56

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 57

Conditional Probabilities

Bias vs. Variance. Lower bias: more specific to what we care about. Higher variance: for fixed observations, estimates become less reliable

SLIDE 58

Revisiting Marginal Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y

p(y) = ∑_x p(x, y) = ∑_x p(x) p(y | x)

SLIDE 59

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 60

Deriving Bayes Rule

Start with conditional p(X | Y)

SLIDE 61

Deriving Bayes Rule

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y)

SLIDE 62

Deriving Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y): p(X, Y) = p(X | Y) p(Y)

p(x,y) = p(y,x)

SLIDE 63

Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

p(X | Y): posterior probability; p(Y | X): likelihood; p(X): prior probability; p(Y): marginal likelihood (probability)
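Bayes rule can be exercised numerically; a minimal sketch with made-up numbers (the prior, likelihood, and false-positive rate below are illustrative, not from the slides):

```python
# Hypothetical numbers: a test for a rare condition.
prior = 0.01            # p(X = condition)
likelihood = 0.95       # p(Y = positive | X = condition)
false_pos = 0.05        # p(Y = positive | X = no condition)

# Marginal likelihood via the sum rule: p(Y) = sum_x p(Y | x) p(x).
marginal = likelihood * prior + false_pos * (1 - prior)

# Bayes rule: p(X | Y) = p(Y | X) p(X) / p(Y).
posterior = likelihood * prior / marginal
print(round(posterior, 3))  # the posterior stays small despite the accurate test
```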

SLIDE 64

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 65

Probability Chain Rule

p(x_1, x_2, …, x_S) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_S | x_1, …, x_{S−1}) = ∏_{i=1}^S p(x_i | x_1, …, x_{i−1})

extension of Bayes rule
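The chain rule also runs in reverse: multiplying a valid prior by valid conditionals always yields a valid joint. A minimal sketch over three binary variables (the conditional tables are made up for illustration):

```python
import itertools

# Build a joint via the chain rule:
# p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2).  (Illustrative numbers.)
p1 = {0: 0.6, 1: 0.4}
p2_given = {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}
p3_given = {(a, b): {0: 0.5 + 0.1 * a - 0.2 * b, 1: 0.5 - 0.1 * a + 0.2 * b}
            for a in (0, 1) for b in (0, 1)}

joint = {(a, b, c): p1[a] * p2_given[(a,)][b] * p3_given[(a, b)][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

# A product of a valid prior and valid conditionals sums to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```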

SLIDE 66

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 67

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G

SLIDE 68

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters π = (π_1, π_2, …, π_M) that govern its “shape”
  • Formally written as X ∼ G(π)
SLIDE 69

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters π = (π_1, π_2, …, π_M) that govern its “shape”
  • Formally written as X ∼ G(π)

i.i.d.: If X_1, X_2, …, X_N are all independently sampled from G(π), they are independently and identically distributed

SLIDE 70

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(π)
  • p(X = 1) = π, p(X = 0) = 1 − π
  • Generally, p(X = k) = π^k (1 − π)^(1−k)
SLIDE 71

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(π)
  • p(X = 1) = π, p(X = 0) = 1 − π
  • Generally, p(X = k) = π^k (1 − π)^(1−k)

Binomial: Sum of N iid Bernoulli draws

  • Values X can take: 0, 1, …, N
  • Represents number of successes
  • X ∼ Binomial(N, π)
  • p(X = k) = (N choose k) π^k (1 − π)^(N−k)
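The Binomial PMF above can be written directly in a few lines; a minimal sketch (N = 10 and π = 0.3 are arbitrary illustration values):

```python
from math import comb

def binomial_pmf(k, n, pi):
    """p(X = k) = C(n, k) * pi^k * (1 - pi)^(n - k)."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 10, 0.3
# A PMF sums to 1 over its support 0..n.
assert abs(sum(binomial_pmf(k, n, pi) for k in range(n + 1)) - 1.0) < 1e-9

# n = 1 recovers the Bernoulli distribution.
assert abs(binomial_pmf(1, 1, pi) - pi) < 1e-12
assert abs(binomial_pmf(0, 1, pi) - (1 - pi)) < 1e-12
```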

SLIDE 72

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Categorical: A single draw

  • Finite R.V. taking one of K values: 1, 2, …, K
  • X ∼ Cat(π), π ∈ ℝ^K
  • p(X = 1) = π_1, p(X = 2) = π_2, …, p(X = K) = π_K
  • Generally, p(X = k) = ∏_j π_j^{1[k = j]}
  • 1[a] = 1 if a is true, 0 if a is false

Multinomial: Sum of N iid Categorical draws

  • Vector of size K representing how often value k was drawn
  • X ∼ Multinomial(N, π), π ∈ ℝ^K
SLIDE 73

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Poisson

  • Discrete R.V. taking any integer ≥ 0
  • X ∼ Poisson(λ), λ ∈ ℝ is the “rate”
  • p(X = k) = λ^k exp(−λ) / k!

SLIDE 74

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Normal

  • Real R.V. taking any real number
  • X ∼ Normal(μ, σ); μ is the mean, σ is the standard deviation
  • p(X = x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π))

https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png (figure: Normal distribution PDFs; vertical axis is p(X = x))
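Both densities are one-liners in pure Python; a minimal sketch (λ = 4 and the truncation point for the Poisson tail are illustration choices):

```python
import math

def poisson_pmf(k, lam):
    """p(X = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def normal_pdf(x, mu, sigma):
    """p(X = x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The Poisson PMF sums to 1 over k = 0, 1, 2, ...; for lam = 4 the mass
# beyond k = 60 is negligible, so truncating there suffices numerically.
lam = 4.0
assert abs(sum(poisson_pmf(k, lam) for k in range(60)) - 1.0) < 1e-9

# The Normal PDF peaks at the mean and is symmetric around it.
assert normal_pdf(0.0, 0.0, 1.0) > normal_pdf(1.0, 0.0, 1.0)
assert abs(normal_pdf(-1.0, 0.0, 1.0) - normal_pdf(1.0, 0.0, 1.0)) < 1e-12
```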

SLIDE 75

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 76

Expected Value of a Random Variable

X ∼ p(⋅)

random variable

SLIDE 77

Expected Value of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)

random variable; expected value (distribution p is implicit)

SLIDE 78

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

SLIDE 79

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

Q: What common distribution is this?

SLIDE 80

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

Q: What common distribution is this? A: Categorical

SLIDE 81

Expected Value: Example 2

1 2 3 4 5 6

non-uniform distribution of number of cats a normal cat person has

𝔼[X] = ∑_k k p(k) = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5

SLIDE 82

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)    𝔼[g(X)] = ???

SLIDE 83

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)    𝔼[g(X)] = ∑_k g(k) p(k)

SLIDE 84

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_k g(k) p(k)

SLIDE 85

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_k g(k) p(k) = ∑_k 2^k p(k) = 1/2·2¹ + 1/10·2² + 1/10·2³ + 1/10·2⁴ + 1/10·2⁵ + 1/10·2⁶ = 13.4
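The cat examples above can be sketched in a few lines; the distributions and the answers (3.5, 2.5, 13.4) are taken from the slides:

```python
# Expected value of a discrete R.V., and of a function g of it:
# E[g(X)] = sum_k g(k) * p(k); g defaults to the identity, giving E[X].
def expectation(pmf, g=lambda k: k):
    return sum(g(k) * p for k, p in pmf.items())

uniform = {k: 1 / 6 for k in range(1, 7)}                       # fair die of cats
cat_person = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}

assert abs(expectation(uniform) - 3.5) < 1e-9
assert abs(expectation(cat_person) - 2.5) < 1e-9
assert abs(expectation(cat_person, g=lambda k: 2 ** k) - 13.4) < 1e-9
```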

SLIDE 86

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 87

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 88

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ

SLIDE 89

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ

SLIDE 90

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ Requirement 2: a function ℓ(y, ỹ) telling us how wrong we are

SLIDE 91

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ Requirement 2: a loss function ℓ(y, ỹ) telling us how wrong we are Goal: minimize our expected loss across any possible input

SLIDE 92

Requirement 1: Decision Function

[Framework diagram: instances 1–4 → Machine Learning Predictor (plus extra knowledge) → h(x) → Evaluator with gold/correct labels → score]

h(x) is our predictor (classifier, regression model, clustering model, etc.)

SLIDE 93

Requirement 2: Loss Function

ℓ(y, ŷ) ≥ 0

y: “correct” label/result; ŷ: predicted label/result; ℓ: “ell” (fancy l character)

loss: A function that tells you how much to penalize a prediction ŷ from the correct answer y

Optimize ℓ? Minimize or maximize?
SLIDE 94

Requirement 2: Loss Function

ℓ(y, ŷ) ≥ 0

y: “correct” label/result; ŷ: predicted label/result; ℓ: “ell” (fancy l character)

loss: A function that tells you how much to penalize a prediction ŷ from the correct answer y

Negative ℓ (−ℓ) is called a utility or reward function

SLIDE 95

Decision Theory

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)]

SLIDE 96

Risk Minimization

minimize expected loss across any possible input

a particular, unspecified input pair (x,y)… but we want any possible pair

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))]

SLIDE 97

Decision Theory

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))]

Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y

SLIDE 98

Risk Minimization

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

SLIDE 99

Risk Minimization

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

we don’t know this distribution*!

*we could try to approximate it analytically

SLIDE 100

Empirical Risk Minimization

minimize expected loss across our observed input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] ≈ argmin_h (1/N) ∑_{i=1}^N ℓ(y_i, h(x_i))

SLIDE 101

Empirical Risk Minimization

minimize expected loss across our observed input

argmin_h ∑_{i=1}^N ℓ(y_i, h(x_i))

our classifier/predictor is controlled by our parameters θ

change θ → change the behavior of the classifier

SLIDE 102

Best Case: Optimize Empirical Risk with Gradients

argmin_h ∑_{i=1}^N ℓ(y_i, h_θ(x_i))   ⇒   argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

SLIDE 103

Best Case: Optimize Empirical Risk with Gradients

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i)) = argmin_θ F(θ)

change θ → change the behavior of the classifier

How? Use Gradient Descent on F(θ)!

SLIDE 104

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

SLIDE 105

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value

SLIDE 106

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value. Step 2: compute the gradient of the predicted value wrt θ.
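The two steps can be sketched concretely for squared loss with a hypothetical 1-D linear predictor h_θ(x) = θ·x (the data points and learning rate below are made up for illustration):

```python
# Empirical risk with squared loss l(y, yhat) = (y - yhat)^2 and h_theta(x) = theta * x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]    # (x, y) pairs, roughly y = 2x

def grad_risk(theta):
    """Chain rule: dF/dtheta = sum_i (dl/dyhat) * (dyhat/dtheta).
    Step 1: dl/dyhat = 2 * (yhat - y).
    Step 2: dyhat/dtheta = x."""
    return sum(2 * (theta * x - y) * x for x, y in data)

theta = 0.0
for _ in range(200):                            # plain gradient descent on F(theta)
    theta -= 0.01 * grad_risk(theta)

print(round(theta, 2))  # converges near the least-squares solution, close to 2
```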

SLIDE 107

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 108

Loss Functions Serve a Task

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

SLIDE 109

Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

an instance d a fixed set of classes C = {c1, c2,…, cJ} A training set of m hand-labeled instances (d1,c1),....,(dm,cm)

Output:

a learned classifier γ that maps instances to classes

γ learns to associate certain features of instances with their labels

SLIDE 110

Classification Example: Face Recognition

Courtesy Hamed Pirsiavash

SLIDE 111

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

SLIDE 112

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ)

SLIDE 113

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work

SLIDE 114

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work Solution 2: is h(x) a conditional distribution p(y | x)? Maximize that probability (a couple classes)
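The 0-1 loss is a one-liner, and its empirical risk is just the error rate; a minimal sketch (the gold/predicted labels are made up for illustration):

```python
def zero_one_loss(y, y_hat):
    """l(y, yhat) = 0 if y == yhat, else 1."""
    return 0 if y == y_hat else 1

# Empirical risk under 0-1 loss = error rate = 1 - accuracy.
gold = ["cat", "dog", "dog", "bird"]
pred = ["cat", "dog", "bird", "bird"]
risk = sum(zero_one_loss(y, p) for y, p in zip(gold, pred)) / len(gold)
print(risk)  # 0.25
```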

SLIDE 115

Structured Classification: Sequence & Structured Prediction

Courtesy Hamed Pirsiavash

SLIDE 116

Structured Classification Loss Function Example: 0-1 Loss?

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work Solution 2: is h(x) a conditional distribution p(y | x)? Use MAP

Problem 2: too strict. Structured prediction involves many individual decisions Solution 1: specialize 0-1 loss to the structured problem at hand

SLIDE 117

Regression

Like classification, but real-valued

SLIDE 118

Regression Example: Stock Market Prediction

Courtesy Hamed Pirsiavash

SLIDE 119

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)

ŷ is a real value → nicely differentiable (generally) ☺

SLIDE 120

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|   absolute loss

ŷ is a real value → nicely differentiable (generally) ☺ Absolute value is mostly differentiable

SLIDE 121

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|   absolute loss

ŷ is a real value → nicely differentiable (generally) ☺ Absolute value is mostly differentiable

These loss functions prefer different behavior in the predictions (hint: look at the gradient of each)… we’ll get back to this
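Looking at the gradients directly makes the difference concrete; a minimal sketch (the error sizes 10.0 and 0.1 are arbitrary illustration values, and the absolute-loss gradient at 0 is taken as a subgradient):

```python
def squared_grad(y, y_hat):
    """d/dyhat of (y - yhat)^2: grows linearly with the error."""
    return 2 * (y_hat - y)

def absolute_grad(y, y_hat):
    """Subgradient of |y - yhat|: constant magnitude regardless of error size."""
    return 0.0 if y_hat == y else (1.0 if y_hat > y else -1.0)

# Squared loss pushes back much harder on large errors (sensitive to outliers);
# absolute loss pushes back with the same force no matter how big the error is.
y = 0.0
print(squared_grad(y, 10.0), absolute_grad(y, 10.0))  # 20.0 vs 1.0
print(squared_grad(y, 0.1), absolute_grad(y, 0.1))    # 0.2 vs 1.0
```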

SLIDE 122

Unsupervised learning: Clustering

Courtesy Hamed Pirsiavash

We’ll return to clustering loss functions later

SLIDE 123

Outline

Review+Extension Probability Decision Theory Loss Functions