

SLIDE 1

What uncertainty do we get?

Zhenwen Dai 11 October 2019

Zhenwen Dai What uncertainty do we get? 11 October 2019 1 / 27

SLIDE 2

Probabilistic Models

Many probabilistic models have been discussed. We are interested in probabilistic models because they tell us how uncertain a model is about its prediction. Uncertainty has been categorized under various names, such as epistemic uncertainty, aleatoric uncertainty, model uncertainty, and noise. What do people mean by these types of uncertainty?

SLIDE 3

Uncertainty in Discriminative Model

Regression as an example: y = f(x) + ε. A simple example, Bayesian linear regression (BLR): y_i = w⊤Φ(x_i) + ε_i, with two random variables: w ∼ N(0, I), ε_i ∼ N(0, σ²).
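As a minimal sketch of the BLR generative process (the polynomial feature map and the numbers below are my own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical feature map: polynomial basis [1, x, x^2].
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

# The two random variables of the model:
# w ~ N(0, I) is shared by all data points; eps_i ~ N(0, sigma^2) is per point.
sigma = 0.1
x = rng.uniform(-1.0, 1.0, size=20)
w = rng.standard_normal(3)
y = phi(x) @ w + sigma * rng.standard_normal(20)   # y_i = w^T Phi(x_i) + eps_i
```

Note that w is drawn once for the whole data set while a fresh ε_i is drawn per point; this distinction is exactly what the later slides exploit.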

SLIDE 4

Uncertainty in Discriminative Model

By uncertainty, we usually mean how wide the probability distribution of the predicted variable is. For BLR, it refers to var(y∗) = E_{p(y∗|x∗)}[(y∗ − ȳ∗)²]. If we obtain the maximum likelihood estimate (MLE) of w, denoted ŵ, the predictive distribution is p(y∗|x∗, ŵ) = N(y∗ | ŵ⊤Φ(x∗), σ²). If we do Bayesian inference over w, the predictive distribution is p(y∗|x∗) = ∫ p(y∗|x∗, w) p(w|X, y) dw.
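For the prior w ∼ N(0, I) and Gaussian noise, the Bayesian predictive distribution has a standard closed form. A sketch (function and variable names are my own):

```python
import numpy as np

def blr_predict(Phi, y, Phi_star, sigma):
    """Posterior predictive for BLR with prior w ~ N(0, I), noise N(0, sigma^2)."""
    D = Phi.shape[1]
    S = np.linalg.inv(Phi.T @ Phi / sigma**2 + np.eye(D))    # posterior cov of w
    mu = S @ Phi.T @ y / sigma**2                            # posterior mean of w
    mean = Phi_star @ mu                                     # predictive mean
    var_f = np.einsum('nd,de,ne->n', Phi_star, S, Phi_star)  # variance from w
    return mean, var_f + sigma**2                            # total predictive variance
```

An MLE plug-in would instead return the constant variance σ², dropping the var_f term that grows away from the data.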

SLIDE 5

Epistemic and Aleatoric Uncertainty

  • Aleatoric uncertainty

Aleatoric uncertainty is also known as statistical uncertainty, and is representative of unknowns that differ each time we run the same experiment.

  • Epistemic uncertainty

Epistemic uncertainty is also known as systematic uncertainty, and is due to things one could in principle know but doesn’t in practice. This may be because a measurement is not accurate, because the model neglects certain effects, or because particular data has been deliberately hidden.

SLIDE 6

Epistemic and Aleatoric Uncertainty in BLR

Use BLR as an example: y_i = w⊤Φ(x_i) + ε_i, w ∼ N(0, I), ε_i ∼ N(0, σ²). In the usual modeling scenario, ε corresponds to aleatoric uncertainty, measured as var(y∗) = E_{p(y∗|x∗, ŵ)}[(y∗ − ȳ∗)²] = σ². w corresponds to epistemic uncertainty, measured as var(f∗) = E_{p(f∗|x∗)}[(f∗ − f̄∗)²], where f∗ = w⊤Φ(x∗).
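The two terms can be separated numerically at a test input. A sketch with a hypothetical posterior over w (the 0.2·I covariance and Φ(x∗) values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample w from a hypothetical posterior, then split the variance of
# y* = w^T Phi(x*) + eps into its two parts.
sigma = 0.1
phi_star = np.array([1.0, 0.5, 0.25])            # Phi(x*), assumed values
w_samples = rng.multivariate_normal(np.zeros(3), 0.2 * np.eye(3), size=10000)

f_star = w_samples @ phi_star                    # samples of f* = w^T Phi(x*)
epistemic = f_star.var()                         # var(f*), from uncertainty in w
aleatoric = sigma**2                             # var(eps), the noise floor
total = epistemic + aleatoric                    # var(y*) for Gaussian noise
```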

SLIDE 7

Separation of Uncertainty

With a probabilistic model, what we care about is the predictive distribution p(y∗|x∗). The separation of epistemic and aleatoric uncertainty seems a bit artificial. Do we really need to separate them?

SLIDE 8

Probability Calibration

It is a common question in practice whether we should trust the predictive probability. What does it mean when a weather forecasting method predicts a 70% probability of rain? This is a well-understood question in frequentist statistics.

[Figure: reliability diagram showing outputs vs. confidence, with the gap between them; Error = 30.6]
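A common scalar summary of this kind of diagram is the expected calibration error (ECE). A minimal sketch for binary outputs (the binning scheme and names below are my own, not from the slides):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|.

    probs: predicted probability of the positive class, shape (N,)
    labels: observed 0/1 outcomes, shape (N,)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()     # average confidence in the bin
            acc = labels[mask].mean()     # empirical frequency in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

A forecaster saying "70%" is calibrated if, among all days it says 70%, it rains about 70% of the time; ECE measures the aggregate deviation from that.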

SLIDE 9

Probability Calibration for Aleatoric and Epistemic Uncertainty

Calibration makes sense for aleatoric uncertainty: the noise is i.i.d., ε_1, ..., ε_N ∼ p(ε). What about probability calibration for epistemic uncertainty? Does the uncertainty from the exact Bayesian posterior warrant calibrated probability on the output? What if the measurement only happened once? How shall we choose a prior distribution? Would the uncertainty be calibrated in this case?

SLIDE 10

Uncertainty in Decision Making

Alternatively, we may assess the quality of uncertainty by the performance of downstream tasks. Which uncertainty shall we use in Bayesian optimization or experimental design?

SLIDE 11

Preferential Bayesian Optimization

Many functions that we are interested in optimizing are hard to measure:

◮ user experience, e.g., UI design ◮ movie/music rating

Humans are much better at comparing two things, e.g., is this coffee better than the previous one? The goal is to search for the most preferred option via only pair-wise comparisons.

SLIDE 12

Preference Function

Preference function: p(y = 1|x, x′) = π(x, x′) = σ(g(x′) − g(x)). Copeland function: S(x) = (1 / Vol(X)) ∫_X I[π(x, x′) ≥ 0.5] dx′. The minimum of the Copeland function corresponds to the most preferred choice.
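The integral over X can be approximated by Monte Carlo. A sketch on X = [0, 1] with a hypothetical latent utility g (both g and the sample size are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def copeland(x, g, n_samples=5000):
    """Monte Carlo estimate of S(x) on X = [0, 1]: the fraction of
    alternatives x' that are preferred over x."""
    x_prime = rng.uniform(0.0, 1.0, size=n_samples)
    pi = sigmoid(g(x_prime) - g(x))     # pi(x, x') = sigma(g(x') - g(x))
    return np.mean(pi >= 0.5)

# Hypothetical utility peaking at x = 0.3, so S(0.3) should be smallest.
g = lambda x: -(x - 0.3) ** 2
```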

[Figure: the objective function f(x) with its global minimum; the Copeland and soft-Copeland functions; and the preference function π(x, x′)]

SLIDE 13

Exploration

p(y|x, x′) = π(x, x′)^y (1 − π(x, x′))^{1−y}, π(x, x′) = σ(f(x, x′)). E[y] = π(x, x′), var(y) = π(x, x′)(1 − π(x, x′)).
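By the law of total variance, the variance of the binary outcome splits into an aleatoric term E[π(1 − π)] and an epistemic term var(σ(f∗)). A sketch with a hypothetical Gaussian posterior over f∗ (its mean and scale are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_samples = rng.normal(loc=0.5, scale=0.8, size=20000)  # hypothetical p(f*)
pi_samples = sigmoid(f_samples)                         # samples of sigma(f*)

epistemic = pi_samples.var()                        # var(sigma(f*))
aleatoric = np.mean(pi_samples * (1 - pi_samples))  # E[pi (1 - pi)]
total = epistemic + aleatoric                       # var(y*) by total variance
```

The decomposition is exact: the two terms always sum to π̄(1 − π̄) with π̄ = E[π], which is why exploring on var(y∗) alone cannot distinguish the two sources.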

[Figure: heat maps of the expectation of y∗, the variance of y∗, and the variance of σ(f∗)]

SLIDE 14

Epistemic and aleatoric uncertainty are different. Exploration should be done only with epistemic uncertainty.

SLIDE 15

What about composite models?

[Figure: a composite graphical model with multiple layers of latent variables, linking genotype (G), environment (E), and epigenotype (EG) through a latent representation for disease stratification, with data modalities such as survival analysis, gene expression, clinical measurements and treatment, clinical notes, social network and music data, X-ray, and biopsy]

SLIDE 16

Disclaimer

I don’t know how to categorize the uncertainty from a probabilistic generative model for unsupervised learning, such as a VAE or GPLVM.

SLIDE 17

Separation of Uncertainty in Complex model

We need a systematic approach to separate epistemic and aleatoric uncertainty. Let’s still focus on discriminative models: y_i = f(x_i) + ε_i.

SLIDE 18

Look back at BLR

y_i = w⊤Φ(x_i) + ε_i, w ∼ N(0, I), ε_i ∼ N(0, σ²). Aleatoric uncertainty: unknowns that differ each time we run the same experiment. Epistemic uncertainty: things one could in principle know but doesn’t in practice.

SLIDE 19

One way to classify

Aleatoric uncertainty: unknowns that differ each time we run the same experiment; the latent variables are independent among data points, y_i = f(x_i, h_i). Epistemic uncertainty: things one could in principle know but doesn’t in practice; a global latent variable, y_i = f(x_i, h).

SLIDE 20

Variables Shared by a Subset of Data Points

Aleatoric uncertainty: y_i = f(x_i, h_i). Epistemic uncertainty: y_i = f(x_i, h). What about something in between? y_i = f(x_i, h_{z(i)}), with a group assignment z : {1, ..., N} → {1, ..., C}.
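A sketch of the in-between case: N data points share C latent variables through the assignment z (the mapping f below is a hypothetical choice, since the slide leaves f unspecified):

```python
import numpy as np

rng = np.random.default_rng(4)

N, C = 12, 3
z = rng.integers(0, C, size=N)        # z(i), 0-based here
h = rng.standard_normal(C)            # one latent variable per group

def f(x, h):
    # Hypothetical mapping; any function of (x, h) would do for the sketch.
    return np.sin(x) + h

x = rng.uniform(0.0, 1.0, size=N)
y = f(x, h[z])                        # y_i = f(x_i, h_{z(i)})
```

Setting C = N recovers the aleatoric case (one h per point); C = 1 recovers the epistemic case (one global h).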

SLIDE 21

An example: Multi-output GP

Also known as the intrinsic coregionalization model. Each input location corresponds to C different output dimensions. f = (f_{11}, ..., f_{1N}, ..., f_{C1}, ..., f_{CN})⊤, f|X ∼ N(0, B ⊗ K), B ∈ R^{C×C}, K ∈ R^{N×N}.
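A sketch of the B ⊗ K construction and one joint sample of f (the RBF kernel, lengthscale, and random B below are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(X, lengthscale=0.3):
    # Squared-exponential kernel on a 1-D input grid.
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

N, C = 10, 3
X = np.linspace(0.0, 1.0, N)
K = rbf(X)                                  # N x N input covariance
A = rng.standard_normal((C, C))
B = A @ A.T + 1e-6 * np.eye(C)              # C x C coregionalization matrix
cov = np.kron(B, K) + 1e-8 * np.eye(C * N)  # covariance of the stacked f
f = rng.multivariate_normal(np.zeros(C * N), cov)  # one sample of all outputs
```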

SLIDE 22

Latent variable multi-output GP

Assume B is a covariance matrix computed according to a kernel function k(·, ·) over a set of latent variables h_1, ..., h_C, with h_i ∼ N(0, I).
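A sketch of computing B from a kernel over latent task variables (the RBF choice and latent dimensionality are assumptions; the slide only requires some kernel k(·, ·)):

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf(H, lengthscale=1.0):
    # Squared-exponential kernel between rows of H.
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

C, Q = 9, 2                       # 9 tasks, 2-dimensional latent space
H = rng.standard_normal((C, Q))   # h_i ~ N(0, I), one latent per task
B = rbf(H)                        # C x C task covariance
```

Tasks whose latent variables sit close together in the latent space get strongly correlated outputs, which is what the latent-space plot on this slide visualizes.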

[Figure: sampled functions f(x) for Tasks 0–8, and the corresponding latent variables plotted in the two-dimensional latent space (latent dim h0 vs. latent dim h1)]

SLIDE 23

Latent variable multi-output GP

[Figure: train/test predictions compared across GP-ind, LMC, and LVMOGP]

SLIDE 24

Epistemic or aleatoric?

For multi-task learning, one output corresponds to a task. The uncertainty associated with h_i is the epistemic uncertainty of the task. What if only one observation can be collected for each task? It becomes aleatoric! A better way to see it may be: epistemic within the group and aleatoric for the other groups.

SLIDE 25

Soft group assignment

Let’s see a more confusing case by softening the group assignment. The covariance of data points within a group is a bias kernel, B_{11} = b_{11} 1 1⊤ (an all-ones matrix scaled by b_{11}). Augment the model with one h_i for each data point x_i, y_i = f(x_i, h_i); the covariance matrix is B ⊙ K. The joint distribution p(h_1, ..., h_N) is correlated. A trivial case would be the degenerate distribution h_1 = ... = h_N = ε, ε ∼ p(ε).
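A sketch of the B ⊙ K construction with a block-constant bias kernel (the two-group layout, the b values, and zero cross-group covariance are my own illustrative assumptions):

```python
import numpy as np

def rbf(X, lengthscale=0.3):
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

N = 6
groups = np.array([0, 0, 0, 1, 1, 1])          # group of each data point
b = np.array([1.5, 0.8])                       # b_11, b_22 within-group scales
B = np.where(groups[:, None] == groups[None, :],
             b[groups][:, None], 0.0)          # block-constant bias kernel
X = np.linspace(0.0, 1.0, N)
cov = B * rbf(X)                               # Hadamard product B ⊙ K
```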

SLIDE 26

Continual Learning

An example of the previous model is a model for continual learning. Data points arrive at different times, x_1, ..., x_T and y_1, ..., y_T. The underlying function may change over time: f_1(·), ..., f_T(·). We can construct such a model in the above form with a state-space prior, p(h_1, ..., h_T) = p(h_1) ∏_{t=2}^{T} p(h_t | h_{t−1}). Are h_1, ..., h_T epistemic or aleatoric?
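A sketch of sampling from such a state-space prior, taking the simplest Gaussian random-walk form (p(h_1) = N(0, 1) and p(h_t|h_{t−1}) = N(h_{t−1}, q) are my own choices; the slide leaves the transition unspecified):

```python
import numpy as np

rng = np.random.default_rng(7)

T, q = 50, 0.1                       # horizon and transition noise variance
h = np.empty(T)
h[0] = rng.standard_normal()         # h_1 ~ N(0, 1)
for t in range(1, T):
    # h_t | h_{t-1} ~ N(h_{t-1}, q): the latent drifts slowly over time.
    h[t] = h[t - 1] + np.sqrt(q) * rng.standard_normal()
```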

SLIDE 27

Summary

Epistemic and aleatoric uncertainty, and their role in decision making. There are “outlier” models that are hard to classify. Thoughts: looking at the uncertainty of the output variable may be the best way.
