Probability, Decision Theory, and Loss Functions CMSC 678 UMBC - - PowerPoint PPT Presentation



SLIDE 1

Probability, Decision Theory, and Loss Functions

CMSC 678 UMBC

Some slides adapted from Hamed Pirsiavash

SLIDE 2

Logistics Recap

Piazza (ask & answer questions):

https://piazza.com/umbc/spring2019/cmsc678

Course site:

https://www.csee.umbc.edu/courses/graduate/678/spring19

Evaluation submission site:

https://www.csee.umbc.edu/courses/graduate/678/spring19/submit

SLIDE 3

Course Announcement: Assignment 1

Due Friday, 2/8 (~9 days). Math & programming review. Discuss with others, but write, implement, and complete it on your own

SLIDE 4

A Terminology Buffet

Classification Regression Clustering

the task: what kind of problem are you solving?

SLIDE 5

A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

the task: what kind of problem are you solving?

the data: amount of human input/number of labeled examples
SLIDE 6

A Terminology Buffet

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

SLIDE 7

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 8

What does it mean to learn?

Generalization

SLIDE 9

Machine Learning Framework: Learning

[Framework diagram: instances 1–4 → Machine Learning Predictor (plus extra knowledge) → Evaluator with gold/correct labels → score, which gives feedback to the predictor; instances are typically examined independently]

scoring model scoreθ(X) → objective F(θ)

SLIDE 10

Model, parameters and hyperparameters

Model: mathematical formulation of system (e.g., classifier) Parameters: primary “knobs” of the model that are set by a learning algorithm Hyperparameter: secondary “knobs”

http://www.uiparade.com/wp-content/uploads/2012/01/ui-design-pure-css.jpg

SLIDE 11

Gradient Ascent

SLIDE 12

What do we know before we see the data, and how does that influence our modeling decisions?

General ML Consideration: Inductive Bias

Courtesy Hamed Pirsiavash

SLIDE 13

General ML Consideration: Inductive Bias

A C B D Partition these into two groups…

Courtesy Hamed Pirsiavash

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 14

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 15

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 16

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

Tip: Remember how your own biases/interpretation are influencing your approach

SLIDE 17

Today’s Goals:

1. Remember Probability/Statistics
2. Understand Optimizing Empirical Risk

SLIDE 18

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 19

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 20

(Most) Probability Axioms

p(everything) = 1; p(∅) = 0; p(A) ≤ p(B) when A ⊆ B; p(A ∪ B) = p(A) + p(B) when A ∩ B = ∅

everything A B

In general, p(A ∪ B) = p(A) + p(B) − p(A ∩ B), so p(A ∪ B) ≠ p(A) + p(B) when A and B overlap
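These axioms can be checked numerically on a tiny sample space; a minimal sketch (the outcomes, weights, and events below are made up for illustration):

```python
# Toy sample space: outcomes with probabilities summing to 1 (illustrative values).
p = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}

def prob(event):
    """p(E) = sum of the probabilities of the outcomes in E."""
    return sum(p[o] for o in event)

A = {"a", "b"}
B = {"b", "c"}

# General case: inclusion-exclusion.
assert abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-9

# Disjoint case: p(A ∪ B) = p(A) + p(B) when A ∩ B = ∅.
C = {"d"}
assert abs(prob(A | C) - (prob(A) + prob(C))) < 1e-9
```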

SLIDE 21

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

SLIDE 22

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process” Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes: X=HEADS or X=TAILS

SLIDE 23

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount snow that fell, in inches Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

SLIDE 24

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount snow that fell, in inches Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

DISCRETE random variable CONTINUOUS random variable

SLIDE 25

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)

SLIDE 26

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)

SLIDE 27

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0

SLIDE 28

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0
We “add” with | sums (∑) | integrals (∫)

SLIDE 29

Random Variables

If X is a… | Discrete random variable | Continuous random variable
The values k that X can take are | finite or countably infinite (e.g., integers) | uncountably infinite (e.g., real values)
The function that gives the relative likelihood of a value p(X=k) is a | probability mass function (PMF) | probability density function (PDF)
The values that the PMF/PDF can take are | 0 ≤ p(X=k) ≤ 1 | p(X=k) ≥ 0
We “add” with | sums (∑) | integrals (∫)
Our PMF/PDF satisfies p(everything)=1 by | ∑_k p(X = k) = 1 | ∫ p(x) dx = 1
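The two normalization conditions can be checked numerically; a minimal sketch (the PMF values and the Riemann-sum bounds are chosen for illustration):

```python
import math

# Discrete: a PMF sums to 1 over all values k (illustrative PMF).
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Continuous: a PDF integrates to 1; approximate the integral of a
# standard normal density with a Riemann sum over [-10, 10].
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

dx = 0.001
total = sum(normal_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
assert abs(total - 1.0) < 1e-3
```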

SLIDE 30

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 31

Joint Probability

Probability that multiple things “happen together”

everything A B Joint probability

SLIDE 32

Joint Probability

Probability that multiple things “happen together” p(x,y), p(x,y,z), p(x,y,w,z) Symmetric: p(x,y) = p(y,x)

everything A B Joint probability

SLIDE 33

Joint Probability

Probability that multiple things “happen together” p(x,y), p(x,y,z), p(x,y,w,z) Symmetric: p(x,y) = p(y,x) Form a table based on outcomes: sum across cells = 1

everything A B Joint probability

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

SLIDE 34

Joint Probabilities

1

p(A)

what happens as we add conjuncts?

SLIDE 35

Joint Probabilities

1

p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 36

Joint Probabilities

1

p(A, B, C) p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 37

Joint Probabilities

p(A, B, C, D)

1

p(A, B, C) p(A, B) p(A)

what happens as we add conjuncts?

SLIDE 38

Joint Probabilities

p(A, B, C, D)

1

p(A, B, C) p(A, B) p(A) p(A, B, C, D, E)

what happens as we add conjuncts?

SLIDE 39

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 40

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent?

SLIDE 41

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent? A: Yes (assuming no weird effects)

SLIDE 42

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

everything A B

Q: Are A and B independent?

SLIDE 43

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

everything A B

Q: Are A and B independent? A: No (work it out from p(A,B)) and the axioms

SLIDE 44

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

SLIDE 45

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)*p(y) Generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y) | Y=0 | Y=1
X=“cat” | .04 | .32
X=“dog” | .2 | .04
X=“bird” | .1 | .1
X=“human” | .1 | .1

A: No (find the marginal probabilities of p(x) and p(y))
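The answer can be checked directly from the slide's joint table; a minimal sketch that computes both marginals and tests p(x,y) = p(x)·p(y) for every cell:

```python
# Joint distribution from the slide's table: p(x, y).
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # cells sum to 1

# Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Independence would require p(x, y) = p(x) * p(y) for every cell.
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9 for (x, y) in joint)
print(independent)  # False: X and Y are not independent
```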

SLIDE 46

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 47

Marginal(ized) Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y Consider the mutually exclusive ways that different values of x could occur with y

Q: How do we write this in terms of joint probabilities?

SLIDE 48

Marginal(ized) Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y

p(y) = ∑_x p(x, y)

Consider the mutually exclusive ways that different values of x could occur with y

SLIDE 49

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 50

Conditional Probability

p(X | Y) = p(X, Y) / p(Y)

Conditional Probabilities are Probabilities

SLIDE 51

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) is the marginal probability of Y

SLIDE 52

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) = ∫ p(X, Y) dX

SLIDE 53

Conditional Probabilities: Changing the Right

1

p(A)

what happens as we add conjuncts to the right?

SLIDE 54

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 55

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 56

Conditional Probabilities: Changing the Right

1

p(A | B) p(A)

what happens as we add conjuncts to the right?

SLIDE 57

Conditional Probabilities

Bias vs. Variance. Lower bias: more specific to what we care about. Higher variance: for fixed observations, estimates become less reliable

SLIDE 58

Revisiting Marginal Probability: The Discrete Case

y x1 & y x2 & y x3 & y x4 & y

p(y) = ∑_x p(x, y) = ∑_x p(x) p(y | x)

SLIDE 59

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 60

Deriving Bayes Rule

Start with conditional p(X | Y)

SLIDE 61

Deriving Bayes Rule

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y)

SLIDE 62

Deriving Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y): p(X, Y) = p(X | Y) p(Y)

p(x,y) = p(y,x)

SLIDE 63

Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

p(X | Y): posterior probability; p(Y | X): likelihood; p(X): prior probability; p(Y): marginal likelihood (probability)
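Bayes rule can be exercised numerically; a minimal sketch with made-up numbers (the prior, likelihood, and false-positive rate below are illustrative, not from the slides):

```python
# Hypothetical numbers: a test for a rare condition.
prior = 0.01            # p(X = condition)
likelihood = 0.95       # p(Y = positive | X = condition)
false_pos = 0.05        # p(Y = positive | X = no condition)

# Marginal likelihood via the sum rule: p(Y) = sum_x p(Y | x) p(x).
marginal = likelihood * prior + false_pos * (1 - prior)

# Bayes rule: p(X | Y) = p(Y | X) p(X) / p(Y).
posterior = likelihood * prior / marginal
print(round(posterior, 3))  # the posterior stays small despite the accurate test
```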

SLIDE 64

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 65

Probability Chain Rule

p(x_1, x_2, …, x_S) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_S | x_1, …, x_{S−1}) = ∏_{i=1}^S p(x_i | x_1, …, x_{i−1})

extension of Bayes rule
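The chain rule also runs in reverse: multiplying a valid prior by valid conditionals always yields a valid joint. A minimal sketch over three binary variables (the conditional tables are made up for illustration):

```python
import itertools

# Build a joint via the chain rule:
# p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2).  (Illustrative numbers.)
p1 = {0: 0.6, 1: 0.4}
p2_given = {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}
p3_given = {(a, b): {0: 0.5 + 0.1 * a - 0.2 * b, 1: 0.5 - 0.1 * a + 0.2 * b}
            for a in (0, 1) for b in (0, 1)}

joint = {(a, b, c): p1[a] * p2_given[(a,)][b] * p3_given[(a, b)][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

# A product of a valid prior and valid conditionals sums to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```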

SLIDE 66

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 67

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G

SLIDE 68

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters π = (π_1, π_2, …, π_M) that govern its “shape”
  • Formally written as X ∼ G(π)
SLIDE 69

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters π = (π_1, π_2, …, π_M) that govern its “shape”
  • Formally written as X ∼ G(π)

i.i.d.: If X_1, X_2, …, X_N are all independently sampled from G(π), they are independently and identically distributed

SLIDE 70

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(π)
  • p(X = 1) = π, p(X = 0) = 1 − π
  • Generally, p(X = k) = π^k (1 − π)^(1−k)
SLIDE 71

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(π)
  • p(X = 1) = π, p(X = 0) = 1 − π
  • Generally, p(X = k) = π^k (1 − π)^(1−k)

Binomial: Sum of N iid Bernoulli draws

  • Values X can take: 0, 1, …, N
  • Represents number of successes
  • X ∼ Binomial(N, π)
  • p(X = k) = (N choose k) π^k (1 − π)^(N−k)
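The Binomial PMF above can be written directly in a few lines; a minimal sketch (N = 10 and π = 0.3 are arbitrary illustration values):

```python
from math import comb

def binomial_pmf(k, n, pi):
    """p(X = k) = C(n, k) * pi^k * (1 - pi)^(n - k)."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 10, 0.3
# A PMF sums to 1 over its support 0..n.
assert abs(sum(binomial_pmf(k, n, pi) for k in range(n + 1)) - 1.0) < 1e-9

# n = 1 recovers the Bernoulli distribution.
assert abs(binomial_pmf(1, 1, pi) - pi) < 1e-12
assert abs(binomial_pmf(0, 1, pi) - (1 - pi)) < 1e-12
```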

SLIDE 72

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Categorical: A single draw

  • Finite R.V. taking one of K values: 1, 2, …, K
  • X ∼ Cat(π), π ∈ ℝ^K
  • p(X = 1) = π_1, p(X = 2) = π_2, …, p(X = K) = π_K
  • Generally, p(X = k) = ∏_j π_j^{1[k = j]}
  • 1[a] = 1 if a is true, 0 if a is false

Multinomial: Sum of N iid Categorical draws

  • Vector of size K representing how often value k was drawn
  • X ∼ Multinomial(N, π), π ∈ ℝ^K
SLIDE 73

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Poisson

  • Discrete R.V. taking any integer ≥ 0
  • X ∼ Poisson(λ), λ ∈ ℝ is the “rate”
  • p(X = k) = λ^k exp(−λ) / k!

SLIDE 74

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal (Gamma)

Normal

  • Real R.V. taking any real number
  • X ∼ Normal(μ, σ); μ is the mean, σ is the standard deviation
  • p(X = x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π))

https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png (figure: Normal distribution PDFs; vertical axis is p(X = x))
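Both densities are one-liners in pure Python; a minimal sketch (λ = 4 and the truncation point for the Poisson tail are illustration choices):

```python
import math

def poisson_pmf(k, lam):
    """p(X = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def normal_pdf(x, mu, sigma):
    """p(X = x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The Poisson PMF sums to 1 over k = 0, 1, 2, ...; for lam = 4 the mass
# beyond k = 60 is negligible, so truncating there suffices numerically.
lam = 4.0
assert abs(sum(poisson_pmf(k, lam) for k in range(60)) - 1.0) < 1e-9

# The Normal PDF peaks at the mean and is symmetric around it.
assert normal_pdf(0.0, 0.0, 1.0) > normal_pdf(1.0, 0.0, 1.0)
assert abs(normal_pdf(-1.0, 0.0, 1.0) - normal_pdf(1.0, 0.0, 1.0)) < 1e-12
```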

SLIDE 75

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 76

Expected Value of a Random Variable

X ∼ p(⋅)

random variable

SLIDE 77

Expected Value of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)

random variable; expected value (distribution p is implicit)

SLIDE 78

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

SLIDE 79

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

Q: What common distribution is this?

SLIDE 80

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

𝔼[X] = ∑_k k p(k) = 1/6·1 + 1/6·2 + 1/6·3 + 1/6·4 + 1/6·5 + 1/6·6 = 3.5

Q: What common distribution is this? A: Categorical

SLIDE 81

Expected Value: Example 2

1 2 3 4 5 6

non-uniform distribution of number of cats a normal cat person has

𝔼[X] = ∑_k k p(k) = 1/2·1 + 1/10·2 + 1/10·3 + 1/10·4 + 1/10·5 + 1/10·6 = 2.5

SLIDE 82

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)    𝔼[g(X)] = ???

SLIDE 83

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_k k p(k)    𝔼[g(X)] = ∑_k g(k) p(k)

SLIDE 84

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_k g(k) p(k)

SLIDE 85

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_k g(k) p(k) = ∑_k 2^k p(k) = 1/2·2¹ + 1/10·2² + 1/10·2³ + 1/10·2⁴ + 1/10·2⁵ + 1/10·2⁶ = 13.4
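The cat examples above can be sketched in a few lines; the distributions and the answers (3.5, 2.5, 13.4) are taken from the slides:

```python
# Expected value of a discrete R.V., and of a function g of it:
# E[g(X)] = sum_k g(k) * p(k); g defaults to the identity, giving E[X].
def expectation(pmf, g=lambda k: k):
    return sum(g(k) * p for k, p in pmf.items())

uniform = {k: 1 / 6 for k in range(1, 7)}                       # fair die of cats
cat_person = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}

assert abs(expectation(uniform) - 3.5) < 1e-9
assert abs(expectation(cat_person) - 2.5) < 1e-9
assert abs(expectation(cat_person, g=lambda k: 2 ** k) - 13.4) < 1e-9
```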

SLIDE 86

Probability Prerequisites

Basic probability axioms and definitions Joint probability Probabilistic Independence Marginal probability Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 87

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 88

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ

SLIDE 89

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ

SLIDE 90

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ Requirement 2: a function ℓ(y, ỹ) telling us how wrong we are

SLIDE 91

Decision Theory

“Decision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (“state of the world”) Output: a decision ỹ Requirement 1: a decision (hypothesis) function h(x) to produce ỹ Requirement 2: a loss function ℓ(y, ỹ) telling us how wrong we are Goal: minimize our expected loss across any possible input

SLIDE 92

Requirement 1: Decision Function

[Framework diagram: instances 1–4 → Machine Learning Predictor (plus extra knowledge) → h(x) → Evaluator with gold/correct labels → score]

h(x) is our predictor (classifier, regression model, clustering model, etc.)

SLIDE 93

Requirement 2: Loss Function

ℓ(y, ŷ) ≥ 0

y: “correct” label/result; ŷ: predicted label/result; ℓ: “ell” (fancy l character)

loss: A function that tells you how much to penalize a prediction ŷ from the correct answer y

Optimize ℓ? Minimize or maximize?
SLIDE 94

Requirement 2: Loss Function

ℓ(y, ŷ) ≥ 0

y: “correct” label/result; ŷ: predicted label/result; ℓ: “ell” (fancy l character)

loss: A function that tells you how much to penalize a prediction ŷ from the correct answer y

Negative ℓ (−ℓ) is called a utility or reward function

SLIDE 95

Decision Theory

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)]

SLIDE 96

Risk Minimization

minimize expected loss across any possible input

a particular, unspecified input pair (x,y)… but we want any possible pair

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))]

SLIDE 97

Decision Theory

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))]

Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y

SLIDE 98

Risk Minimization

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

SLIDE 99

Risk Minimization

minimize expected loss across any possible input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] = argmin_h ∫ ℓ(y, h(x)) P(x, y) d(x, y)

we don’t know this distribution*!

*we could try to approximate it analytically

SLIDE 100

Empirical Risk Minimization

minimize expected loss across our observed input

argmin_ŷ 𝔼[ℓ(y, ŷ)] = argmin_h 𝔼[ℓ(y, h(x))] = argmin_h 𝔼_{(x,y)∼P}[ℓ(y, h(x))] ≈ argmin_h (1/N) ∑_{i=1}^N ℓ(y_i, h(x_i))

SLIDE 101

Empirical Risk Minimization

minimize expected loss across our observed input

argmin_h ∑_{i=1}^N ℓ(y_i, h(x_i))

our classifier/predictor is controlled by our parameters θ

change θ → change the behavior of the classifier

SLIDE 102

Best Case: Optimize Empirical Risk with Gradients

argmin_h ∑_{i=1}^N ℓ(y_i, h_θ(x_i))   ⇒   argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

SLIDE 103

Best Case: Optimize Empirical Risk with Gradients

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i)) = argmin_θ F(θ)

change θ → change the behavior of the classifier

How? Use Gradient Descent on F(θ)!

SLIDE 104

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

SLIDE 105

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value

SLIDE 106

Best Case: Optimize Empirical Risk with Gradients

∇_θ F = ∑_i [∂ℓ(y_i, ŷ = h_θ(x_i)) / ∂ŷ] ∇_θ h_θ(x_i)

differentiating might not always work: “… apart from the computational details”

argmin_θ ∑_{i=1}^N ℓ(y_i, h_θ(x_i))

change θ → change the behavior of the classifier

Step 1: compute the gradient of the loss wrt the predicted value. Step 2: compute the gradient of the predicted value wrt θ.
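The two steps can be sketched concretely for squared loss with a hypothetical 1-D linear predictor h_θ(x) = θ·x (the data points and learning rate below are made up for illustration):

```python
# Empirical risk with squared loss l(y, yhat) = (y - yhat)^2 and h_theta(x) = theta * x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]    # (x, y) pairs, roughly y = 2x

def grad_risk(theta):
    """Chain rule: dF/dtheta = sum_i (dl/dyhat) * (dyhat/dtheta).
    Step 1: dl/dyhat = 2 * (yhat - y).
    Step 2: dyhat/dtheta = x."""
    return sum(2 * (theta * x - y) * x for x, y in data)

theta = 0.0
for _ in range(200):                            # plain gradient descent on F(theta)
    theta -= 0.01 * grad_risk(theta)

print(round(theta, 2))  # converges near the least-squares solution, close to 2
```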

SLIDE 107

Outline

Review+Extension Probability Decision Theory Loss Functions

SLIDE 108

Loss Functions Serve a Task

Classification Regression Clustering Fully-supervised Semi-supervised Un-supervised

Probabilistic Generative Conditional Spectral Neural Memory- based Exemplar …

the data: amount of human input/number of labeled examples

the approach: how any data are being used

the task: what kind of problem are you solving?

SLIDE 109

Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

an instance d a fixed set of classes C = {c1, c2,…, cJ} A training set of m hand-labeled instances (d1,c1),....,(dm,cm)

Output:

a learned classifier γ that maps instances to classes

γ learns to associate certain features of instances with their labels

SLIDE 110

Classification Example: Face Recognition

Courtesy Hamed Pirsiavash

SLIDE 111

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

SLIDE 112

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ)

SLIDE 113

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work

SLIDE 114

Classification Loss Function Example: 0-1 Loss

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work Solution 2: is h(x) a conditional distribution p(y | x)? Maximize that probability (a couple classes)
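The 0-1 loss is a one-liner, and its empirical risk is just the error rate; a minimal sketch (the gold/predicted labels are made up for illustration):

```python
def zero_one_loss(y, y_hat):
    """l(y, yhat) = 0 if y == yhat, else 1."""
    return 0 if y == y_hat else 1

# Empirical risk under 0-1 loss = error rate = 1 - accuracy.
gold = ["cat", "dog", "dog", "bird"]
pred = ["cat", "dog", "bird", "bird"]
risk = sum(zero_one_loss(y, p) for y, p in zip(gold, pred)) / len(gold)
print(risk)  # 0.25
```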

SLIDE 115

Structured Classification: Sequence & Structured Prediction

Courtesy Hamed Pirsiavash

SLIDE 116

Structured Classification Loss Function Example: 0-1 Loss?

ℓ(y, ŷ) = 0 if y = ŷ; 1 if y ≠ ŷ

Problem 1: not differentiable wrt ŷ (or θ) Solution 1: is the data linearly separable? Perceptron (next class) can work Solution 2: is h(x) a conditional distribution p(y | x)? Use MAP

Problem 2: too strict. Structured prediction involves many individual decisions Solution 1: specialize 0-1 loss to the structured problem at hand

SLIDE 117

Regression

Like classification, but real-valued

SLIDE 118

Regression Example: Stock Market Prediction

Courtesy Hamed Pirsiavash

SLIDE 119

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)

ŷ is a real value → nicely differentiable (generally) ☺

SLIDE 120

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|   absolute loss

ŷ is a real value → nicely differentiable (generally) ☺ Absolute value is mostly differentiable

SLIDE 121

Regression Loss Function Examples

ℓ(y, ŷ) = (y − ŷ)²   squared loss/MSE (mean squared error)
ℓ(y, ŷ) = |y − ŷ|   absolute loss

ŷ is a real value → nicely differentiable (generally) ☺ Absolute value is mostly differentiable

These loss functions prefer different behavior in the predictions (hint: look at the gradient of each)… we’ll get back to this
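Looking at the gradients directly makes the difference concrete; a minimal sketch (the error sizes 10.0 and 0.1 are arbitrary illustration values, and the absolute-loss gradient at 0 is taken as a subgradient):

```python
def squared_grad(y, y_hat):
    """d/dyhat of (y - yhat)^2: grows linearly with the error."""
    return 2 * (y_hat - y)

def absolute_grad(y, y_hat):
    """Subgradient of |y - yhat|: constant magnitude regardless of error size."""
    return 0.0 if y_hat == y else (1.0 if y_hat > y else -1.0)

# Squared loss pushes back much harder on large errors (sensitive to outliers);
# absolute loss pushes back with the same force no matter how big the error is.
y = 0.0
print(squared_grad(y, 10.0), absolute_grad(y, 10.0))  # 20.0 vs 1.0
print(squared_grad(y, 0.1), absolute_grad(y, 0.1))    # 0.2 vs 1.0
```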

SLIDE 122

Unsupervised learning: Clustering

Courtesy Hamed Pirsiavash

We’ll return to clustering loss functions later

SLIDE 123

Outline

Review+Extension Probability Decision Theory Loss Functions