Stochastic Processes, Kernel Regression, Infinite Mixture Models
SLIDE 1

Stochastic Processes, Kernel Regression, Infinite Mixture Models

IFT 6269 : Probabilistic Graphical Models - Fall 2018

Gabriel Huang (TA for Simon Lacoste-Julien)

SLIDE 2

Stochastic Process = Random Function

SLIDE 3

Today

  • Motivate the Gaussian and Dirichlet distributions in the Bayesian framework.
  • Kolmogorov’s extension theorem.
  • Define the Gaussian process and Dirichlet process from finite-dimensional marginals.
  • Gaussian process:
    • Motivating applications: kriging, hyperparameter optimization.
    • Properties: conditioning/posterior distribution.
    • Demo.
  • Dirichlet process:
    • Motivating application: clustering with an unknown number of clusters.
    • Constructions: stick-breaking, Polya urn, Chinese Restaurant Process.
    • De Finetti’s theorem.
    • How to use it.
    • Demo.


SLIDE 4

Disclaimer: I will be skipping the more theoretical building blocks of stochastic processes (e.g. measure theory) in order to cover more material.

SLIDE 5

Recall some distributions

Gaussian distribution: samples $x \in \mathbb{R}^K$. Dirichlet distribution: samples $\pi$ in the simplex $\Delta^{K-1}$, which verifies $\pi_1 + \cdots + \pi_K = 1$.

SLIDE 6

Why Gaussian and Dirichlet?

They are often used as priors

SLIDE 7

Bayesians like to use those distributions as priors over model parameters: $p(\theta)$.

Why?

SLIDE 8

Because they are very convenient to represent and update: conjugate priors.

SLIDE 9

$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$

Posterior $\propto$ Likelihood model $\times$ Prior. Conjugate prior means: the posterior is in the same family as the prior.

SLIDE 10

$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$

Prior: $\theta \sim \mathcal{N}(\mu, \Sigma)$. Likelihood: $D \mid \theta \sim \mathcal{N}(\theta, \Sigma_0)$. Posterior: $\theta \mid D \sim \mathcal{N}(\mu', \Sigma')$. The Gaussian is the conjugate prior for the Gaussian likelihood model.
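For instance (a minimal numpy sketch, not from the slides; the numbers are made up), the scalar case with known noise variance shows the posterior staying Gaussian, with closed-form parameter updates:

import numpy as np

# Hypothetical numbers: prior N(mu0, tau2) on the mean theta,
# likelihood N(theta, sigma2) with known noise variance sigma2.
mu0, tau2 = 0.0, 1.0
sigma2 = 0.5
x = np.array([1.2, 0.8, 1.5])          # observed data D
n = len(x)

# Conjugacy: the posterior is Gaussian with updated parameters (mu', tau2').
tau2_post = 1.0 / (1.0 / tau2 + n / sigma2)
mu_post = tau2_post * (mu0 / tau2 + x.sum() / sigma2)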

SLIDE 11

$p(\pi \mid D) \propto p(D \mid \pi)\, p(\pi)$

Prior: $\pi \sim \mathrm{Dirichlet}(\alpha)$. Likelihood: $D \mid \pi \sim \mathrm{Multinomial}(\pi)$. Posterior: $\pi \mid D \sim \mathrm{Dirichlet}(\alpha')$. The Dirichlet is the conjugate prior for the Multinomial/Categorical likelihood model.
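The Dirichlet update is even simpler (a hedged sketch; the counts are made up): the posterior concentrations are just the prior concentrations plus the observed category counts.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over K=3 category probabilities
counts = np.array([5, 2, 3])        # category counts observed in the data D

alpha_post = alpha + counts         # posterior is Dirichlet(alpha + counts)
pi_draws = np.random.dirichlet(alpha_post, size=4)  # samples from the posterior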

SLIDE 12

So taking the posterior is simply a matter of updating the parameters of the prior.

SLIDE 13

Back to Gaussian and Dirichlet

Gaussian distribution: samples $x \in \mathbb{R}^K$. Dirichlet distribution: samples $\pi$ in the simplex $\Delta^{K-1}$, which verifies $\pi_1 + \cdots + \pi_K = 1$.

SLIDE 14

The Gaussian and the Dirichlet are indexed by a finite set of integers $\{1, \ldots, K\}$. They are random vectors:

$(x_1, x_2, \ldots, x_K) \qquad (\pi_1, \pi_2, \ldots, \pi_K)$

SLIDE 15

Can we index random variables with infinite sets as well? In other words, define random functions.

SLIDE 16

Defining stochastic processes from their marginals.

SLIDE 17

Suppose we want to define a random function (stochastic process)

$X : t \in T \mapsto X_t,$

where $T$ is an infinite set of indices. Imagine a joint distribution over all the $(X_t)$.

SLIDE 18

Kolmogorov Extension Theorem

informal statement

Assume that for any $n \ge 1$ and every finite subset of indices $(t_1, t_2, \ldots, t_n)$, we can define a marginal probability (finite-dimensional distribution) $p_{t_1, t_2, \ldots, t_n}(x_{t_1}, x_{t_2}, \ldots, x_{t_n})$. Then, if all marginal probabilities agree (are consistent), there exists a unique stochastic process $X : t \in T \mapsto X_t$ which satisfies the given marginals.

SLIDE 19

So Kolmogorov’s extension theorem gives us a way to implicitly define stochastic processes. (However it does not tell us how to construct them.)

SLIDE 20

Defining Gaussian Process from finite-dimensional marginals.

SLIDE 21

Characterizing Gaussian Process

Samples $f \sim \mathcal{GP}(\mu, \Sigma)$ of a Gaussian process are random functions $f : X \to \mathbb{R}$ defined on the domain $X$ (such as time, $X = \mathbb{R}$, or vectors, $X = \mathbb{R}^d$). We can also see them as an infinite collection $(f_x)_{x \in X}$ indexed by $X$. The parameters are the mean function $\mu(x)$ and the covariance function $\Sigma(x, x')$.

SLIDE 22

For any $x_1, x_2, \ldots, x_n \in X$ we define the following finite-dimensional distributions $p(f_{x_1}, f_{x_2}, \ldots, f_{x_n})$:

$(f_{x_1}, f_{x_2}, \ldots, f_{x_n}) \sim \mathcal{N}\big( (\mu(x_i))_i,\ (\Sigma(x_i, x_j))_{i,j} \big)$

Since they are consistent with each other, Kolmogorov’s extension theorem states that they define a unique stochastic process, which we will call a Gaussian process:

$f \sim \mathcal{GP}(\mu, \Sigma)$

SLIDE 23

Characterizing Gaussian Process

Some properties are immediate consequences of the definition:

  • $\mathbb{E}[f_x] = \mu(x)$
  • $\mathrm{Cov}(f_x, f_{x'}) = \mathbb{E}\big[(f_x - \mu(x))(f_{x'} - \mu(x'))\big] = \Sigma(x, x')$
  • Any linear combination of distinct dimensions is still a Gaussian: $\sum_{i=1}^{n} a_i f_{x_i} \sim \mathcal{N}(\cdot, \cdot)$

SLIDE 24

Characterizing Gaussian Process

Some properties depend on the choice of covariance function:

  • Stationarity: $\Sigma(x, x') = \Sigma(x - x')$ does not depend on the absolute positions.
  • Continuity: $\lim_{x' \to x} \Sigma(x, x') = \Sigma(x, x)$.
  • Any linear combination is still a Gaussian: $\sum_{i=1}^{n} a_i f_{x_i} \sim \mathcal{N}(\cdot, \cdot)$

SLIDE 25

Example Samples

SLIDE 26

Posteriors of Gaussian Process.

How to use them for regression?

SLIDE 27

http://chifeng.scripts.mit.edu/stuff/gp-demo/

Interactive Demo

need a volunteer

SLIDE 28

Gaussian processes are very useful for doing regression on an unknown function $f$: $y = f(x)$. Say we don’t know anything about that function, except the fact that it is smooth.

SLIDE 29

Before observing any data, we represent our belief about the unknown function $f$ with the following prior: $f \sim \mathcal{GP}(\mu(x), \Sigma(x, x'))$. For instance, $\mu(x) = 0$ and

$\Sigma(x, x') = \sigma \cdot \exp\!\left(-\frac{(x - x')^2}{\ell^2}\right)$

where $\ell$ controls smoothness (bandwidth/length-scale) and $\sigma$ controls uncertainty.

WARNING: Change of notation! $x$ is now the index and $f(x)$ is the random function.
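A minimal numpy sketch of sampling from this prior (our own illustration; the function and variable names are ours): evaluate the kernel on a finite grid and draw from the corresponding finite-dimensional Gaussian marginal.

import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, ell=0.3):
    # Sigma(x, x') = sigma * exp(-(x - x')^2 / ell^2), as on the slide
    d = xa[:, None] - xb[None, :]
    return sigma * np.exp(-d**2 / ell**2)

xs = np.linspace(0, 1, 100)                      # finite grid of indices x
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))  # jitter for numerical stability
# The finite-dimensional marginal on the grid is an ordinary Gaussian:
f_prior = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)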

SLIDE 30

Now, assume we observe a training set $X_n = (x_1, x_2, \ldots, x_n)$, $y_n = (y_1, y_2, \ldots, y_n)$, and we want to predict the value $y_* = f(x_*)$ associated with a new test point $x_*$. One way to do that is to compute the posterior $f \mid X_n, y_n$ after observing the evidence (training set).

SLIDE 31

Bayes’ rule: $p(f \mid X_n, y_n) \propto p(y_n \mid f, X_n)\, p(f)$

  • Gaussian process prior: $p(f) = \mathcal{GP}(\mu(x), \Sigma(x, x'))$
  • Gaussian likelihood: $p(y_n \mid f, X_n) = \mathcal{N}(f(X_n), \sigma^2 I_n)$
  • $\Rightarrow$ Gaussian process posterior: $p(f \mid X_n, y_n) = \mathcal{GP}(\mu'(x), \Sigma'(x, x'))$ for some $\mu'(x), \Sigma'(x, x')$.

Remember: the Gaussian process is the conjugate prior for the Gaussian likelihood model.

SLIDE 32

Bayes’ rule: $p(f \mid X_n, y_n) \propto p(y_n \mid f, X_n)\, p(f)$

  • Gaussian process prior: $p(f) = \mathcal{GP}(\mu(x), \Sigma(x, x'))$
  • Dirac likelihood ($\sigma \to 0$): $p(y_n \mid f, X_n) = \delta(y_n - f(X_n))$, that is, $y_n$ is now deterministic after observing $f, X_n$: $y_n = f(X_n)$.
  • $\Rightarrow$ Gaussian process posterior: $p(f \mid X_n, y_n) = \mathcal{GP}(\mu'(x), \Sigma'(x, x'))$ for some $\mu'(x), \Sigma'(x, x')$.

SLIDE 33

The problem is that there is no easy way to represent the parameters of the posterior, $\mu'(x), \Sigma'(x, x')$, efficiently. Instead of computing the full posterior $f$, we will just evaluate the posterior at one point $y_* = f(x_*)$. We want: $p(y_* \mid X_n, y_n, x_*)$.

SLIDE 34

We want: $p(y_* \mid X_n, y_n, x_*)$. The finite-dimensional marginals of the Gaussian process give:

$\begin{pmatrix} y_n \\ y_* \end{pmatrix} \,\Big|\, X_n, x_* \;\sim\; \mathcal{N}\!\left( \begin{pmatrix} \mu(X_n) \\ \mu(x_*) \end{pmatrix},\; \begin{pmatrix} \Sigma(X_n, X_n) & \Sigma(X_n, x_*) \\ \Sigma(x_*, X_n) & \Sigma(x_*, x_*) \end{pmatrix} \right)$

SLIDE 35

Theorem: For a Gaussian vector with distribution

$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix} \right)$

the conditional distribution $x_2 \mid x_1$ is given by

$x_2 \mid x_1 \;\sim\; \mathcal{N}\!\left( \mu_2 + \Sigma_{2,1}\Sigma_{1,1}^{-1}(x_1 - \mu_1),\; \Sigma_{2,2} - \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2} \right)$

[Schur complement] This theorem will be useful for the Kalman filter, later on.

SLIDE 36

Applying the previous theorem gives us the posterior of $y_*$:

$y_* \mid X_n, y_n, x_* \;\sim\; \mathcal{N}(\mu', \Sigma')$

$\mu' = \mu(x_*) + \Sigma(x_*, X_n)\,\Sigma(X_n, X_n)^{-1}\,(y_n - \mu(X_n))$
$\Sigma' = \Sigma(x_*, x_*) - \Sigma(x_*, X_n)\,\Sigma(X_n, X_n)^{-1}\,\Sigma(X_n, x_*)$
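These two formulas translate directly into code. A sketch for the zero-mean, noise-free (Dirac likelihood) case, reusing the rbf_kernel from the earlier sketch; gp_posterior is our own name:

import numpy as np

def gp_posterior(X_train, y_train, X_test, kernel, jitter=1e-8):
    # mu'    = Sigma(x*, Xn) Sigma(Xn, Xn)^{-1} yn                    (mu = 0)
    # Sigma' = Sigma(x*, x*) - Sigma(x*, Xn) Sigma(Xn, Xn)^{-1} Sigma(Xn, x*)
    K_nn = kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    K_sn = kernel(X_test, X_train)
    K_ss = kernel(X_test, X_test)
    mu_post = K_sn @ np.linalg.solve(K_nn, y_train)
    cov_post = K_ss - K_sn @ np.linalg.solve(K_nn, K_sn.T)
    return mu_post, cov_post

X_train = np.array([0.1, 0.4, 0.9])
y_train = np.sin(2 * np.pi * X_train)
mu, cov = gp_posterior(X_train, y_train, np.linspace(0, 1, 50), rbf_kernel)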

SLIDE 37

Active Learning with Gaussian Process.

SLIDE 38

Active Learning

Active Learning is an iterative process:

  • Generate a question $x_*$.
  • Query the world with the question (by acting; this can be costly).
  • Obtain an answer $y_* = f(x_*)$.
  • Improve the model by learning from the answer.
  • Repeat.

SLIDE 39

Active Learning

Gaussian processes are good for cases where it is expensive to evaluate $y_* = f(x_*)$; a sketch of the resulting loop follows below.

  • Kriging: $y_*$ is the amount of natural resource, $x_*$ is a new 2D/3D location to dig. Every evaluation is mining and can cost millions.
  • Hyperparameter optimization (Bayesian optimization): $y_*$ is the validation loss, $x_*$ is a set of hyperparameters to test. Every evaluation is running an experiment and can take hours.
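To make the loop concrete, here is a hedged sketch of Bayesian optimization with an upper-confidence-bound utility (the utility and the toy objective are our choices; it reuses rbf_kernel and gp_posterior from the earlier sketches):

import numpy as np

def expensive_experiment(x):
    # Stand-in for the costly evaluation y* = f(x*), e.g. a full training run
    return np.sin(3 * x) + 0.1 * np.random.randn()

grid = np.linspace(0, 2, 200)        # candidate questions x*
X_seen = [0.5]
y_seen = [expensive_experiment(0.5)]

for _ in range(10):
    mu, cov = gp_posterior(np.array(X_seen), np.array(y_seen), grid, rbf_kernel)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    ucb = mu + 2.0 * std             # utility: explore where the GP is uncertain
    x_next = grid[np.argmax(ucb)]    # most promising question
    X_seen.append(x_next)            # query the world, learn from the answer
    y_seen.append(expensive_experiment(x_next))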

SLIDE 40

http://chifeng.scripts.mit.edu/stuff/gp-demo/

Back to the demo

(Talk about utility function)

SLIDE 41

Formal equivalence with Kernelized Linear Regression. [blackboard if time]

Rasmussen & Williams (2006) http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

SLIDE 42

Dirichlet Processes.

Stick Breaking Construction

SLIDE 43

$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}$

  • $\theta_1, \theta_2, \ldots \sim_{\mathrm{iid}} G_0$: parameters, sampled from the base distribution.
  • $\pi = (\pi_1, \pi_2, \ldots) \sim \mathrm{GEM}(\alpha)$: scalar weights, sum up to 1.
  • Diracs concentrate probability mass $\pi_k$ at $\theta_k$.

G is a random probability measure:

  • random: both $\pi$ and $\theta$ are random;
  • probability measure: it is a convex combination of Diracs, which are probability measures.

SLIDE 44

Courtesy of Khalid El-Arini

SLIDE 45

Two independent samples $G$ from $\mathrm{DP}(\alpha, G_0)$:

$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}$

Each sample $G$ is a probability distribution (e.g. over parameters) and can be written as a mixture of Diracs.

[Figure: two draws $G^{(1)}$ and $G^{(2)}$ over $\Omega$, each a different collection of weighted atoms $\pi_k \delta_{\theta_k}$]

SLIDE 46

Measuring is counting: $G(A) = \sum_{k=1}^{\infty} \pi_k \cdot 1\{\theta_k \in A\}$

[Figure: the two draws from the previous slide with a fixed subset $A \subseteq \Omega$; summing the weights of the atoms that fall in $A$ gives, e.g., $G(A) = 0.05 + 0.1 + 0.3 = 0.45$ for one draw and $G(A) = 0.05 + 0.05 + 0.2 = 0.3$ for the other]

For a fixed subset $A$, notice how $G(A)$ is random. In fact, even the $\pi_k$ change value for each sample.

SLIDE 47

How do we generate an infinite sequence of (mixture) weights $\pi = (\pi_1, \pi_2, \ldots)$ which sum up to 1? We can use stick-breaking: $\pi \sim \mathrm{GEM}(\alpha)$. To generate a finite sequence of (mixture) weights $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ that sum up to 1, we can use the Dirichlet distribution: $\pi \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$.

SLIDE 48

Beta Distribution

$\pi_1 \sim \mathrm{Beta}(a, b), \quad \pi_2 = 1 - \pi_1$
$p(\pi_1 \mid a, b) \propto \pi_1^{a-1} (1 - \pi_1)^{b-1}$

Equivalent to:

$(\pi_1, \pi_2) \sim \mathrm{Dirichlet}(a, b)$
$p(\pi_1, \pi_2 \mid a, b) \propto \pi_1^{a-1} \pi_2^{b-1}$

$a, b \to +\infty$ gives a peaked distribution around $a/(a+b)$.

SLIDE 49

Stick Breaking: $\pi \sim \mathrm{GEM}(\alpha)$

Break a stick of length 1 into pieces $\pi_1, \pi_2, \pi_3, \ldots$:

$\beta_1 \sim \mathrm{Beta}(1, \alpha), \quad \pi_1 = \beta_1$
$\beta_2 \sim \mathrm{Beta}(1, \alpha), \quad \pi_2 = \beta_2 (1 - \pi_1)$
$\beta_3 \sim \mathrm{Beta}(1, \alpha), \quad \pi_3 = \beta_3 (1 - \pi_1 - \pi_2)$
$\ldots$

GEM: Griffiths, Engen, McCloskey.
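A short numpy sketch of this construction (truncated at a finite K for practicality; names are ours):

import numpy as np

def stick_breaking(alpha, K_trunc=1000):
    # beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j)
    betas = np.random.beta(1.0, alpha, size=K_trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining          # truncated GEM(alpha) weights, sum ~ 1

pi = stick_breaking(alpha=2.0)
theta = np.random.normal(0.0, 1.0, size=len(pi))  # atoms theta_k ~ G0 = N(0, 1)
# G = sum_k pi_k * delta(theta_k) approximates a draw from DP(alpha, G0)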

SLIDE 50

Defining Dirichlet Process from finite-dimensional marginals.

SLIDE 51

Dirichlet Process

Samples $G \sim \mathrm{DP}(\alpha, G_0)$ of a Dirichlet process are themselves probability measures (i.e. distributions) over a measurable space $(\Omega, \mathcal{F})$:

$G : \mathcal{F} \to [0, 1]$

which associate a probability to every measurable subset $A \in \mathcal{F}$. Note: $\mathcal{F}$ is the set of all measurable subsets $A \subseteq \Omega$. The parameters are the base probability distribution $G_0$ (over $\Omega$) and the concentration parameter $\alpha > 0$.

[Figure: a draw $G$ as weighted atoms over $\Omega$, with a measurable subset $A$]

SLIDE 52

Kolmogorov Consistency Construction

For any $n \ge 1$, consider any partition $A_1, A_2, \ldots, A_n$ of the space $\Omega$. We define the following finite-dimensional distributions:

$(G(A_1), \ldots, G(A_n)) \sim \mathrm{Dirichlet}(\alpha \cdot G_0(A_1), \ldots, \alpha \cdot G_0(A_n))$

Since they can be proved* to be consistent with each other, Kolmogorov’s extension theorem states that they define a unique stochastic process, which we will call a Dirichlet process:

$G \sim \mathrm{DP}(\alpha, G_0)$

SLIDE 53

Here $A_1, A_2, A_3$ is a partition of the parameter space $\Omega$. Assume $\alpha = 10$ and $G_0 = \mathcal{N}(0, \sigma^2)$. Draw two distributions $G^{(1)}, G^{(2)} \sim_{\mathrm{iid}} \mathrm{DP}(\alpha, G_0)$. For each draw, $G(A_j)$ is the sum of the weights $\pi_k$ whose atoms $\theta_k$ fall in $A_j$; the atoms and weights differ between the two draws.

Probability masses under the base distribution (deterministic):

$G_0(A_1) = 0.6, \quad G_0(A_2) = 0.2, \quad G_0(A_3) = 0.2$

Then we have that $(G(A_1), G(A_2), G(A_3)) \sim \mathrm{Dirichlet}(6, 2, 2)$.

[Figure: the two draws $G^{(1)}, G^{(2)}$ over $\Omega$, with the partition $A_1, A_3, A_2$ and weighted atoms]

SLIDE 54

All constructions match.

It can be shown that the stick-breaking and Kolmogorov-consistency definitions match.

https://www.stat.ubc.ca/~bouchard/courses/stat547-sp2011/notes-part2.pdf

SLIDE 55

Defining the Dirichlet Process from the Chinese Restaurant Process / Blackwell-MacQueen urn.

SLIDE 56

Chinese Restaurant Process (CRP)

Infinity of Tables. Dish distribution: $G_0 = \mathrm{Uniform}(\{\mathrm{Fish}, \mathrm{Pork}, \mathrm{Tofu}\})$

SLIDE 57

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 1 arrives.
  • Takes any free table.
  • Samples a dish $\theta_1 \sim G_0$ → Tofu.
  • state = {{1}}, n = 1 customer.

[Tables: {1} → Tofu]

SLIDE 58

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 2 arrives.
  • P(new table) ∝ α
  • P(table {1}) ∝ |{1}| = 1
  • Decides to sit at {1}.
  • Shares the dish: $\theta_2 = \theta_1$ = Tofu.
  • state = {{1,2}}, n = 2 customers.

[Tables: {1,2} → Tofu]

SLIDE 59

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 3 arrives.
  • P(new table) ∝ α
  • P(table {1,2}) ∝ |{1,2}| = 2
  • Decides to sit at a new table.
  • Samples a dish $\theta_3 \sim G_0$ → Pork.
  • state = {{1,2},{3}}, n = 3 customers.

[Tables: {1,2} → Tofu, {3} → Pork]

SLIDE 60

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 4 arrives.
  • P(new table) ∝ α
  • P(table {1,2}) ∝ |{1,2}| = 2
  • P(table {3}) ∝ |{3}| = 1
  • Sits at {1,2} and shares the dish: $\theta_4 = \theta_1$ = Tofu.
  • state = {{1,2,4},{3}}, n = 4 customers.

[Tables: {1,2,4} → Tofu, {3} → Pork]

SLIDE 61

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 5 arrives.
  • P(new table) ∝ α
  • P(table {1,2,4}) ∝ |{1,2,4}| = 3
  • P(table {3}) ∝ |{3}| = 1
  • Picks a new table.
  • Samples a new dish: $\theta_5$ = Fish.
  • state = {{1,2,4},{3},{5}}, n = 5 customers.

[Tables: {1,2,4} → Tofu, {3} → Pork, {5} → Fish]

SLIDE 62

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 6 arrives.
  • P(new table) ∝ α
  • P(table {1,2,4}) ∝ |{1,2,4}| = 3
  • P(table {3}) ∝ |{3}| = 1
  • P(table {5}) ∝ |{5}| = 1
  • Picks table {1,2,4} and shares the dish: $\theta_6 = \theta_1$ = Tofu.
  • state = {{1,2,4,6},{3},{5}}, n = 6 customers.

[Tables: {1,2,4,6} → Tofu, {3} → Pork, {5} → Fish]
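The seating dynamics just walked through fit in a few lines (a minimal simulation sketch; names are ours):

import numpy as np

def crp(n_customers, alpha, seed=0):
    # Seat customers one by one: P(existing table k) prop. to its size,
    # P(new table) prop. to alpha.
    rng = np.random.default_rng(seed)
    sizes = []                      # current table sizes
    assignments = []                # table index of each customer
    for _ in range(n_customers):
        probs = np.array(sizes + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(sizes):
            sizes.append(1)         # open a new table
        else:
            sizes[k] += 1           # join table k
        assignments.append(k)
    return assignments

print(crp(6, alpha=1.0))            # e.g. [0, 0, 1, 0, 2, 0], as in the story above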

SLIDE 63

Chinese Restaurant Process (CRP)

We can look at the sequence of dishes: $\theta_1 = \mathrm{Tofu}$, $\theta_2 = \mathrm{Tofu}$, $\theta_3 = \mathrm{Pork}$, $\theta_4 = \mathrm{Tofu}$, $\theta_5 = \mathrm{Fish}$, $\theta_6 = \mathrm{Tofu}$. It can be shown that the distribution of $(\theta_i)_i$ is exchangeable. That is:

$p(\theta_1 = d_1, \theta_2 = d_2, \ldots) = p(\theta_1 = d_{\sigma(1)}, \theta_2 = d_{\sigma(2)}, \ldots)$ for any permutation $\sigma$.

The order in which the customers arrive is actually not important.

SLIDE 64

De Finetti’s Theorem

  • M. I. Jordan, NIPS tutorial: http://faculty.dbmi.pitt.edu/day/Bioinf2132-advanced-Bayes-and-R/Bioinf2132-documents-2017/2017-11-30/nips-tutorial05.pdf

SLIDE 65

De Finetti’s Theorem

Applied to the CRP, it means there exists a unique* random variable $G$ such that all the $\theta_i$ become independent conditionally on $G$. We can show that $G \sim \mathrm{DP}(\alpha, G_0)$! Here: $G = (\cdot)\,\delta_{\mathrm{Tofu}} + (\cdot)\,\delta_{\mathrm{Pork}} + (\cdot)\,\delta_{\mathrm{Fish}} + \cdots$, and $\alpha$ plays the same role ($\propto$ new table).

Let $\pi = (\pi_1, \pi_2, \ldots) \sim \mathrm{GEM}(\alpha)$ (stick-breaking). Sample $\bar\theta_{k=1}, \bar\theta_{k=2}, \ldots \sim_{\mathrm{iid}} G_0$. Now we can form our random measure

$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\bar\theta_k}$

and we sample $\theta_{i=1}, \theta_{i=2}, \ldots \sim_{\mathrm{iid}} G$.

*Unique in distribution.

$\theta_i$ is the parameter for data point $i$ (customer $i$); $\bar\theta_k$ is the parameter for component $k$ (table $k$).

SLIDE 66

Blackwell-MacQueen Urn / Polya Urn

SLIDE 67

Same process, different story: each dish is a unique ball color, and each customer is a successive draw from the urn.

SLIDE 68

Using Dirichlet Process for infinite mixture models.

SLIDE 69

Chinese Restaurant Process (CRP)

Infinity of Components: $\theta_k = \mathcal{N}(\mu_k, \Sigma_k)$

SLIDE 70

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Sample the parameter for data point 1.
  • Takes any free table.
  • Samples a parameter $\theta_1 \sim G_0$.
  • state = {{1}}, n = 1 customer.

[Tables: {1} with sampled mean $\mu_1 = (\cdot, \cdot)$]

SLIDE 71

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Sample the parameter for data point 2.
  • P(new table) ∝ α
  • P(table {1}) ∝ |{1}| = 1
  • Decides to sit at {1}.
  • Shares the parameter: $\theta_2 = \theta_1$.
  • state = {{1,2}}, n = 2 customers.

[Tables: {1,2} with mean $\mu_1 = (\cdot, \cdot)$]

SLIDE 72

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Sample the parameter for data point 3.
  • P(new table) ∝ α
  • P(table {1,2}) ∝ |{1,2}| = 2
  • Decides to sit at a new table.
  • Samples a new parameter $\theta_3 \sim G_0$.
  • state = {{1,2},{3}}, n = 3 customers.

[Tables: {1,2} with mean $\mu_1 = (\cdot, \cdot)$, {3} with mean $\mu_2 = (\cdot, \cdot)$]

SLIDE 73

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 4 arrives.
  • P(new table) ∝ α
  • P(table {1,2}) ∝ |{1,2}| = 2
  • P(table {3}) ∝ |{3}| = 1
  • Sits at {1,2} and shares the parameter: $\theta_4 = \theta_1$.
  • state = {{1,2,4},{3}}, n = 4 customers.

[Tables: {1,2,4} with mean $\mu_1 = (\cdot, \cdot)$, {3} with mean $\mu_2 = (\cdot, \cdot)$]

SLIDE 74

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 5 arrives.
  • P(new table) ∝ α
  • P(table {1,2,4}) ∝ |{1,2,4}| = 3
  • P(table {3}) ∝ |{3}| = 1
  • Picks a new table.
  • Samples a new parameter $\theta_5 \sim G_0$.
  • state = {{1,2,4},{3},{5}}, n = 5 customers.

[Tables: {1,2,4}, {3}, {5}, each with its sampled mean $\mu_k = (\cdot, \cdot)$]

SLIDE 75

Chinese Restaurant Process (CRP)

Infinity of Tables

  • Customer 6 arrives.
  • P(new table) ∝ α
  • P(table {1,2,4}) ∝ |{1,2,4}| = 3
  • P(table {3}) ∝ |{3}| = 1
  • P(table {5}) ∝ |{5}| = 1
  • Picks table {1,2,4} and shares the parameter: $\theta_6 = \theta_1$.
  • state = {{1,2,4,6},{3},{5}}, n = 6 customers.

[Tables: {1,2,4,6}, {3}, {5}, each with its sampled mean $\mu_k = (\cdot, \cdot)$]

SLIDE 76

[Figure: the resulting random measure over $\Omega$: atoms at the sampled means $\mu_1, \mu_2, \mu_3$, weighted by table sizes]

What does G look like?

SLIDE 77

We can describe a generative process of data points.

(but first let’s recall the generative process for GMM)
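Putting the pieces together, a hedged sketch of the infinite-mixture generative process the previous slides walked through: CRP seating assigns each data point to a component, each new table draws its Gaussian parameters from $G_0$, and the data point is drawn from its table's Gaussian (the specific $G_0$ and unit covariance below are our choices):

import numpy as np

def dp_gmm_generate(n, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    sizes, means, X, z = [], [], [], []
    for _ in range(n):
        probs = np.array(sizes + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(sizes):                       # new table: draw component
            sizes.append(1)                       # parameters mu_k ~ G0 = N(0, 3^2 I)
            means.append(rng.normal(0.0, 3.0, size=2))
        else:
            sizes[k] += 1
        z.append(k)
        X.append(rng.normal(means[k], 1.0))       # x_i ~ N(mu_{z_i}, I)
    return np.array(X), np.array(z)

X, z = dp_gmm_generate(200)   # the number of clusters is unbounded a priori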

SLIDE 78

SLIDE 79