SLIDE 1

15-780: Graduate AI Lecture 19. Learning

Geoff Gordon (this lecture), Tuomas Sandholm. TAs: Sam Ganzfried, Byron Boots

SLIDE 2

Review

SLIDE 3

Stationary distribution

SLIDE 4

Stationary distribution

Q(xt+1) = ∫ P(xt+1 | xt) Q(xt) dxt
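A discrete-state version of this fixed-point condition can be checked numerically. The sketch below uses a made-up 3-state transition matrix (not one from the lecture) and replaces the integral with a sum: a stationary Q satisfies Q = QP.

```python
import numpy as np

# Hypothetical 3-state chain: P[i, j] = P(x_{t+1} = j | x_t = i)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# A stationary Q is a left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
q = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
q /= q.sum()

print(q)                      # stationary distribution
print(np.allclose(q @ P, q))  # True: one more transition leaves Q unchanged
```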
SLIDE 5

MH algorithm

Proof that the MH algorithm’s stationary distribution is the desired P(x)
Based on detailed balance: transitions between x and x′ happen equally often in each direction
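A minimal random-walk MH sketch; the unnormalized target density below is an invented example, and with a symmetric proposal the acceptance ratio reduces to p̃(x′)/p̃(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # Unnormalized target: two-component Gaussian mixture (illustrative choice)
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

x, samples = 0.0, []
for _ in range(10000):
    x_prop = x + rng.normal(0.0, 1.0)                    # symmetric proposal
    if rng.random() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
        x = x_prop                                       # accept; otherwise keep x
    samples.append(x)
# The accept/reject rule satisfies detailed balance, so the normalized
# version of p_tilde is the chain's stationary distribution.
```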

SLIDE 6

Gibbs

Special case of MH
Proposal distribution: conditional probability of block i of x, given rest of x
Acceptance probability is always 1
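A small illustration of block-wise sampling from conditionals; the standard bivariate normal with correlation ρ is an assumed toy model, chosen only because its conditionals are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []
for _ in range(5000):
    # Resample each block from its conditional given the rest; every draw is kept.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # x2 | x1
    samples.append((x1, x2))

print(np.corrcoef(np.array(samples).T)[0, 1])  # close to rho
```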

SLIDE 7

Sequential sampling

Often we want to keep a sample of belief at current time
This is the sequential sampling problem
Common algorithm: particle filter
Parallel importance sampling for P(xt+1 | xt)
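A sketch of one particle-filter update (propagate, weight, resample); the random-walk transition and Gaussian observation model below are placeholder assumptions, not the lecture’s example.

```python
import numpy as np

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=1000)   # sample of belief at time t

def pf_step(particles, y_obs):
    # 1. Propagate each particle through P(x_{t+1} | x_t)  (here: random walk)
    prop = particles + rng.normal(0.0, 0.5, size=len(particles))
    # 2. Weight by the observation likelihood P(y | x_{t+1})  (here: Gaussian noise)
    w = np.exp(-0.5 * (y_obs - prop) ** 2)
    w /= w.sum()
    # 3. Resample in proportion to the weights (parallel importance sampling)
    return prop[rng.choice(len(prop), size=len(prop), p=w)]

particles = pf_step(particles, y_obs=1.2)     # sample of belief at time t+1
```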

SLIDE 8

Particle filter example

SLIDE 9

Learning

Improve our model, using sampled data
Model = factor graph, SAT formula, …
Hypothesis space = { all models we’ll consider }
Conditional models

SLIDE 10

Version space algorithm

Predict w/ majority of still-consistent hypotheses
Mistake bound analysis

SLIDE 11

Bayesian Learning

SLIDE 12

Recall iris example

H = factor graphs of given structure
Need to specify entries of the ϕs
[Factor graph figure with factors ϕ0–ϕ4]

SLIDE 13

Factors

ϕ0 (species):
  setosa       p
  versicolor   q
  virginica    1–p–q

ϕ1–ϕ4 (feature i given species):
          lo    m     hi
  set.    pi    qi    1–pi–qi
  vers.   ri    si    1–ri–si
  vir.    ui    vi    1–ui–vi

SLIDE 14

Continuous factors

Discretized petal length (ϕ1):
          lo    m     hi
  set.    p1    q1    1–p1–q1
  vers.   r1    s1    1–r1–s1
  vir.    u1    v1    1–u1–v1

Continuous petal length:
  ϕ1(ℓ, s) = exp(−(ℓ − ℓs)² / 2σ²)
  parameters ℓset, ℓvers, ℓvir; constant σ²
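A quick evaluation of this continuous factor; the per-species means and σ² below are made-up values for illustration, not parameters fit in the lecture.

```python
import numpy as np

# Hypothetical per-species petal-length means (cm) and a fixed variance
l_mean = {"setosa": 1.5, "versicolor": 4.3, "virginica": 5.6}
sigma2 = 0.3

def phi1(l, species):
    # Unnormalized Gaussian-shaped factor tying petal length to species
    return np.exp(-(l - l_mean[species]) ** 2 / (2 * sigma2))

print({s: round(phi1(4.0, s), 3) for s in l_mean})   # factor values at ℓ = 4.0
```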

SLIDE 15

Simpler example

Coin toss:
  H   p
  T   1–p

SLIDE 16

Parametric model class

H is a parametric model class: each H in H corresponds to a vector of parameters θ = (p) or θ = (p, q, p1, q1, r1, s1, …)
Hθ: X ~ P(X | θ) (or, Y ~ P(Y | X, θ))
Contrast to discrete H, as in version space
Could also have mixed H: discrete choice among parametric (sub)classes

SLIDE 17

Prior

Write D = (X1, X2, …, XN)
Hθ gives P(D | θ)
Bayesian learning also requires a prior distribution over H; for parametric classes, P(θ)
Together, P(D | θ) P(θ) = P(D, θ)

SLIDE 18

Prior

E.g., for coin toss, p ~ Beta(a, b):
P(p | a, b) = p^(a−1) (1 − p)^(b−1) / B(a, b)
Specifying, e.g., a = 2, b = 2: P(p) = 6p(1 − p)
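A quick numerical check (using scipy) that the Beta(2, 2) density really is 6p(1 − p):

```python
import numpy as np
from scipy.stats import beta

p = np.linspace(0.01, 0.99, 5)
print(beta.pdf(p, a=2, b=2))   # Beta(2, 2) density
print(6 * p * (1 - p))         # identical values
```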

SLIDE 19

Prior for p

[Plot: prior density P(p) = 6p(1 − p) over p ∈ [0, 1]]

SLIDE 20

Coin toss, cont’d

Joint dist’n of parameter p and data xi:
P(p, x) = P(p) ∏i P(xi | p) = 6p(1 − p) ∏i p^xi (1 − p)^(1−xi)

SLIDE 21

Posterior

P(θ | D) is the posterior
Prior says what we know about θ before seeing D; posterior says what we know after seeing D
Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)
P(D | θ) is the (data or sample) likelihood

SLIDE 22

Coin flip posterior

P(p | x) = P(p) ∏i P(xi | p) / P(x)
         = (1/Z) p(1 − p) ∏i p^xi (1 − p)^(1−xi)
         = (1/Z) p^(1 + Σi xi) (1 − p)^(1 + Σi (1−xi))
         = Beta(2 + Σi xi, 2 + Σi (1 − xi))
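The same conjugate update in code; the flip sequence below is hypothetical data with 4 heads and 7 tails, matching the plot two slides ahead.

```python
import numpy as np

a0, b0 = 2, 2                                     # Beta(2, 2) prior
x = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1])  # hypothetical flips: 4 H, 7 T
a_post = a0 + x.sum()                             # 2 + number of heads
b_post = b0 + (1 - x).sum()                       # 2 + number of tails
print(a_post, b_post)                             # posterior is Beta(6, 9)
```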

SLIDE 23

Prior for p

[Plot: prior density P(p) = 6p(1 − p), repeated for comparison]

SLIDE 24

Posterior after 4 H, 7 T

[Plot: Beta(6, 9) posterior density after 4 H, 7 T]

SLIDE 25

Posterior after 10 H, 19 T

[Plot: Beta(12, 21) posterior density after 10 H, 19 T]

SLIDE 26

Where does prior come from?

Sometimes, we know something about θ ahead of time
  in this case, encode knowledge in prior
  e.g., ||θ|| small, or θ sparse
Often, we want prior to be noninformative (i.e., not commit to anything about θ)
  in this case, make prior “flat”
  then P(D | θ) typically overwhelms P(θ)

SLIDE 27

Predictive distribution

Posterior is nice, but doesn’t tell us directly what we need to know
We care more about P(xN+1 | x1, …, xN)
By law of total probability and conditional independence:
P(xN+1 | D) = ∫ P(xN+1, θ | D) dθ = ∫ P(xN+1 | θ) P(θ | D) dθ
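For the coin example this integral can be approximated by averaging P(xN+1 | θ) over posterior samples; a minimal Monte Carlo sketch using the Beta(12, 21) posterior that appears on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
p_samples = rng.beta(12, 21, size=100_000)  # draws from P(p | D) after 10 H, 19 T
p_heads = p_samples.mean()                  # estimate of ∫ P(x = H | p) P(p | D) dp
print(p_heads)                              # ≈ 12/33 ≈ 0.364
```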
SLIDE 28

Coin flip example

After 10 H, 19 T: p ~ Beta(12, 21)
E(xN+1 | p) = p
E(xN+1 | D) = E(p | D) = a/(a+b) = 12/33
So, predict 36.4% chance of H on next flip

SLIDE 29

Approximate Bayes

SLIDE 30

Approximate Bayes

Coin flip example was easy
In general, computing posterior (or predictive distribution) may be hard
Solution: use the approximate integration techniques we’ve studied!

SLIDE 31

Bayes as numerical integration

Parameters θ, data D
P(θ | D) = P(D | θ) P(θ) / P(D)
Usually, P(θ) is simple; so is P(D | θ)
So, P(θ | D) ∝ P(D | θ) P(θ), known up to a normalizing constant
Perfect for MH

SLIDE 32

P(y | x) = σ(ax + b), where σ(z) = 1/(1 + exp(−z))

[Plot: P(I. virginica) as a function of petal length]

SLIDE 33

Posterior

P(a, b | {xi, yi}) = (1/Z) P(a, b) ∏i σ(axi + b)^yi σ(−axi − b)^(1−yi)
Prior: P(a, b) = N(0, I)
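A sketch of sampling this posterior with random-walk MH on the log scale; the petal-length data below is invented for illustration, and the step size and iteration counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: petal lengths, and y = 1 if I. virginica else 0
x = np.array([4.5, 5.1, 5.8, 6.1, 4.0, 4.7, 5.5, 6.3])
y = np.array([0,   0,   1,   1,   0,   0,   1,   1  ])

def log_post(a, b):
    z = a * x + b
    log_prior = -0.5 * (a ** 2 + b ** 2)            # N(0, I) prior on (a, b)
    log_lik = np.sum(y * z - np.log1p(np.exp(z)))   # Bernoulli likelihood, sigmoid link
    return log_prior + log_lik

theta, samples = np.zeros(2), []
for _ in range(20000):
    prop = theta + rng.normal(0.0, 0.2, size=2)     # random-walk proposal
    if np.log(rng.random()) < log_post(*prop) - log_post(*theta):
        theta = prop
    samples.append(theta)
samples = np.array(samples[5000:])                  # drop burn-in
print(samples.mean(axis=0))                         # posterior mean of (a, b)
```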

SLIDE 34

Sample from posterior

[Scatter plot: posterior samples over the parameters (a, b)]

SLIDE 35

Bayes discussion

SLIDE 36

Expanded factor graph

Original factor graph: [figure]
SLIDE 37

Inference vs. learning

Inference on expanded factor graph = learning on original factor graph
Aside: why the distinction between inference and learning?
Mostly a matter of algorithms: parameters are usually continuous, often high-dimensional
SLIDE 38

Why Bayes?

Recall: we wanted to ensure our agent doesn’t choose too many mistaken actions
Each action can be thought of as a bet: e.g., eating X = bet X is not poisonous
We choose bets (actions) based on our inferred probabilities
E.g., R = 1 for eating non-poisonous, –99 for poisonous: expected reward is 1·(1 − p) − 99p > 0 exactly when p < 0.01, so eat iff P(poison) < 0.01

SLIDE 39

Choosing bets

Don’t know which bets we’ll need to make
So, Bayesian reasoning tries to set probabilities that result in reasonable betting decisions no matter what bets we are choosing among
I.e., works if betting against an adversary (with rules defined as follows)

SLIDE 40

Bayesian bookie

Bookie (our agent) accepts bets on any event (defined over our joint distribution)
A: next I. versicolor has petal length ≥ 4.2
B: next three coins in a row come up H
C: A ^ B

SLIDE 41

Odds

Bookie can’t refuse bets, but can set odds:
A: 1:1 odds (stake of $1 wins $1 if A)
¬B: 1:7 odds (stake of $7 wins $1 if ¬B)
Must accept same bet in either direction (no “house cut”)
e.g., 7:1 odds on B ⇔ 1:7 odds on ¬B

SLIDE 42

Odds vs. probabilities

Bookie should choose odds based on probabilities
E.g., if coin is fair, P(B) = 1/8
So, should give 7:1 odds on B (1:7 on ¬B)
  bet on B: (1/8)(7) + (7/8)(–1) = 0
  bet on ¬B: (7/8)(1) + (1/8)(–7) = 0
In general: odds x:y ⇔ p = y/(x+y)
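The odds-to-probability conversion and the two zero-expected-value checks above, as a worked snippet:

```python
def odds_to_prob(x, y):
    # x:y odds (stake $y wins $x) are fair exactly when P(event) = y / (x + y)
    return y / (x + y)

p_B = odds_to_prob(7, 1)                        # 7:1 odds on B  ->  P(B) = 1/8
ev_bet_on_B = p_B * 7 + (1 - p_B) * (-1)        # (1/8)(7) + (7/8)(-1)
ev_bet_on_not_B = (1 - p_B) * 1 + p_B * (-7)    # (7/8)(1) + (1/8)(-7)
print(p_B, ev_bet_on_B, ev_bet_on_not_B)        # 0.125 0.0 0.0
```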

SLIDE 43

Conditional bets

We’ll also allow conditional bets: “I bet that, if we go to the restaurant, Ted will order the fries”
If we go and Ted orders fries, I win
If we go and Ted doesn’t order fries, I lose
If we don’t go, bet is called off

SLIDE 44

How can adversary fleece us?

Method 1: by knowing the probabilities better than we do
  if this is true, we’re sunk
  so, assume no informational advantage for adversary
Method 2: by taking advantage of bookie’s non-Bayesian reasoning

SLIDE 45

Example of Method 2

Suppose I give probabilities: P(A) = 0.5, P(A ^ B) = 0.333, P(B | A) = 0.5
Adversary will bet on A at 1:1, on ¬(A^B) at 1:2, and on B | A at 1:1

SLIDE 46

Result of bet

A  B   $1   $2   $3    $ttl
T  T   +1   –2   +1      0
T  F   +1   +1   –1     +1
F  T   –1   +1   off     0
F  F   –1   +1   off     0

Bets: $1 on A at 1:1; $2 on ¬(A^B) at 1:2; $3 on B|A at 1:1 (called off when A is false)
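A small sketch that recomputes the adversary’s payoff table directly from the three stated bets; the totals are never negative and are positive in one outcome.

```python
from itertools import product

def payoffs(A, B):
    # Adversary's winnings on each bet, given the bookie's stated odds
    bet1 = 1 if A else -1                      # on A at 1:1
    bet2 = -2 if (A and B) else 1              # on not-(A and B) at 1:2
    bet3 = 0 if not A else (1 if B else -1)    # on B given A at 1:1 (off if not A)
    return bet1, bet2, bet3, bet1 + bet2 + bet3

for A, B in product([True, False], repeat=2):
    print(A, B, payoffs(A, B))
# Totals over the four outcomes: 0, +1, 0, 0
```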

SLIDE 47

Dutch book

Called a “Dutch book”
Adversary can print money, with no risk
This is bad for us… we shouldn’t have stated incoherent probabilities
  i.e., probabilities inconsistent with Bayes rule

SLIDE 48

Theorem

If we do all of our reasoning according to Bayesian axioms of probability, we will never be subject to a Dutch book
So, if we don’t know what decisions we’re going to need to make based on learned hypothesis H, we should use Bayesian learning to compute posterior P(H)

SLIDE 49

Cheaper approximations

SLIDE 50

Getting cheaper

Maximum a posteriori (MAP)
Maximum likelihood (MLE)
Conditional MLE / MAP
Instead of true posterior, just use single most probable hypothesis

SLIDE 51

MAP

Summarize entire posterior density using the maximum:
arg max_θ P(D | θ) P(θ)

SLIDE 52

MLE

Like MAP, but ignore prior term:
arg max_θ P(D | θ)
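For the coin-flip model both estimates have closed forms; a minimal comparison assuming the Beta(2, 2) prior used earlier and a hypothetical flip sequence.

```python
import numpy as np

x = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1])   # hypothetical flips: 4 H, 7 T
n, h = len(x), x.sum()

p_mle = h / n                # arg max_p P(D | p)
p_map = (h + 1) / (n + 2)    # arg max_p P(D | p) P(p): mode of Beta(2 + h, 2 + n - h)
print(p_mle, p_map)          # ≈ 0.364 vs ≈ 0.385
```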

SLIDE 53

Conditional MLE, MAP

Split D = (x, y)
Condition on x, try to explain only y
Conditional MLE: arg max_θ P(y | x, θ)
Conditional MAP: arg max_θ P(y | x, θ) P(θ)

SLIDE 54

Iris example: MAP vs. posterior

[Plot over (a, b): MAP estimate vs. posterior samples]

SLIDE 55

Irises: MAP vs. posterior


SLIDE 56

Too certain

This behavior of MAP (or MLE) is typical: we are too sure of ourselves
But, often gets better with more data
Theorem: MAP and MLE are consistent estimates of true θ, if “data per parameter” → ∞
