

SLIDE 1

Scaling of scoring rules

Jonas Wallin, joint work with David Bolin (KAUST). CIRM virtual conference, 2020-06-02.

SLIDE 2

Forecast and observation classes

[Figure: three panels, (a) Forecast, (b) Observation, (c) Comparison]

2 / 36

SLIDE 3

Scoring functions apply to deterministic forecasts

The forecast x is evaluated against the observation y using scoring functions such as

negative Squared Error (SE): S(x, y) = −(x − y)²
negative Absolute Error (AE): S(x, y) = −|x − y|

3 / 36

SLIDE 4

Bayes predictors should be used for probabilistic forecasts

For a probabilistic forecast P, decision theory tells us that if the scoring function S is given, we should issue the Bayes predictor x̂ = arg max_x E_P[S(x, Y)] as the point forecast, where the expectation is with respect to P. (Since higher scores are better here, the Bayes predictor maximizes the expected score.)

Squared Error (SE): S(x, y) = −(x − y)², x̂ = mean(P)
Absolute Error (AE): S(x, y) = −|x − y|, x̂ = median(P)

4 / 36
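As a quick numerical illustration (my own, not from the slides): a grid search for the point forecast maximizing the expected score recovers the mean under SE and the median under AE. The lognormal predictive distribution is an arbitrary choice, picked only because its mean and median differ clearly.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed predictive distribution P, represented by a Monte Carlo sample
# (a lognormal is chosen purely for illustration: its mean and median differ).
sample = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)

def bayes_predictor(score, sample, grid):
    """Grid-search the point forecast x maximizing the expected score E_P[S(x, Y)]."""
    expected = [score(x, sample).mean() for x in grid]
    return grid[int(np.argmax(expected))]

neg_se = lambda x, y: -(x - y) ** 2   # negative Squared Error
neg_ae = lambda x, y: -np.abs(x - y)  # negative Absolute Error

grid = np.linspace(0.0, 5.0, 501)
x_se = bayes_predictor(neg_se, sample, grid)
x_ae = bayes_predictor(neg_ae, sample, grid)

# Decision theory: SE should recover mean(P), AE should recover median(P).
print(x_se, sample.mean())
print(x_ae, np.median(sample))
```

The grid resolution (0.01) bounds how closely the grid-search optimum can match the exact sample mean and median.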

SLIDE 5

The basic idea

Assume we have a prediction p ∈ P and an observation o ∈ O, where we wish to measure the skill of the prediction by applying a function s : P × O → R, with a higher function value indicating better skill. What are good theoretical properties for s?

5 / 36


SLIDE 7

General framework without any formulas...

Assume Q is Nature’s distribution of some event y, and denote our forecast for y by P.

For forecast evaluation, we should use performance metrics that follow the principle: in the long run, we obtain the optimal performance when P = Q.

6 / 36

SLIDE 8

Probabilistic forecasts should generally be evaluated using proper scoring rules

A consistent scoring function is a special case of a proper scoring rule for probabilistic forecasts

Definition (Murphy and Winkler, 1968)

If F denotes a class of probabilistic forecasts on R, a proper scoring rule is any function S : F × R → R such that S(Q, Q) := E_Q[S(Q, Y)] ≥ E_Q[S(P, Y)] =: S(P, Q) for all P, Q ∈ F.

7 / 36
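The defining inequality can be checked in closed form for the log score on Gaussians (this check is my addition; it is Gibbs' inequality, i.e. non-negativity of the KL divergence):

```python
import numpy as np

def expected_log_score(mu_q, sig_q, mu_p, sig_p):
    """E_Q[log p(Y)] in closed form, for Y ~ Q = N(mu_q, sig_q^2)
    and the density p of a predictive P = N(mu_p, sig_p^2)."""
    return (-0.5 * np.log(2 * np.pi * sig_p**2)
            - (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2))

mu_q, sig_q = 0.3, 1.2
s_qq = expected_log_score(mu_q, sig_q, mu_q, sig_q)   # S(Q, Q)

rng = np.random.default_rng(1)
for _ in range(100):
    mu_p, sig_p = rng.normal(), rng.uniform(0.2, 3.0)
    s_pq = expected_log_score(mu_q, sig_q, mu_p, sig_p)  # S(P, Q)
    assert s_qq >= s_pq  # the truth never scores worse: propriety
print("propriety holds on all 100 draws")
```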

SLIDE 9

The class of proper scoring rules is large

S(P, y) = −(mean(P) − y)²
S(P, y) = −|median(P) − y|

Gneiting, T. and Raftery, A.E. (2007): Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

8 / 36

SLIDE 10

Optimally, forecasts should be probabilistic

“All those whose duty it is to issue regular daily forecasts know that there are times when they feel very confident and other times when they are doubtful as to coming weather. It seems to me that the condition of confidence or otherwise forms a very important part of the prediction.”
Cooke (Monthly Weather Review, 1906)

[Figure: three panels, (d) Forecast, (e) Observation, (f) Comparison]

9 / 36

SLIDE 11

The class of proper scoring rules is large

Perhaps the two most common proper scoring rules are the continuous ranked probability score (CRPS),
S(P, y) = −E_P|X − y| + (1/2) E_P E_P|X − X′|,
and the log score,
S(P, y) = log(f(y)),
where f is the density of P.

Gneiting, T. and Raftery, A.E. (2007): Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.

10 / 36
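A numerical sanity check (my addition): for a Gaussian predictive distribution the CRPS has a well-known closed form, which a Monte Carlo estimate built directly from the kernel representation above should reproduce. The sign convention follows the deck (higher is better).

```python
import numpy as np
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for P = N(mu, sigma^2), in the deck's orientation
    S(P, y) = -E_P|X - y| + (1/2) E_P E_P|X - X'| (higher is better)."""
    z = (y - mu) / sigma
    return -sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def crps_mc(sample, y):
    """Monte Carlo CRPS from a predictive sample; the sample is split into
    two halves to estimate E|X - X'| with independent copies."""
    x, x2 = sample[::2], sample[1::2]
    return -np.abs(x - y).mean() + 0.5 * np.abs(x - x2).mean()

rng = np.random.default_rng(2)
sample = rng.normal(1.0, 2.0, size=400_000)
print(crps_normal(1.0, 2.0, 0.7))
print(crps_mc(sample, 0.7))  # should agree with the closed form to roughly 2 decimals
```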

SLIDE 12

The different scores behave somewhat differently

[Figure: score S as a function of the observation y for SE, AE, CRPS, and IGN (the log score)]

11 / 36

SLIDE 13

Average scores facilitate comparison across methods

Assume we have two forecasting methods, m = 1, 2. They issue probabilistic forecasts P_mi with observed values y_i at a finite set of times, locations, or instances i = 1, …, n. The methods are assessed and ranked by the mean score (our contribution starts here):
S̄_m = (1/n) Σ_{i=1}^n S(P_mi, y_i), for m = 1, 2.

12 / 36
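A minimal sketch of such a comparison (my construction, not the deck's experiment): nature draws from N(0, 1); method 1 forecasts the truth and method 2 an over-dispersed N(0, 4), and the mean CRPS ranks them.

```python
import numpy as np
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    # CRPS for N(mu, sigma^2), higher-is-better orientation (vectorized in y).
    z = (y - mu) / sigma
    return -sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=5_000)       # nature: Q = N(0, 1)

s_bar_1 = crps_normal(0.0, 1.0, y).mean()  # method 1 forecasts the truth
s_bar_2 = crps_normal(0.0, 2.0, y).mean()  # method 2 is over-dispersed
print(s_bar_1, s_bar_2)  # method 1 should obtain the larger (better) mean score
```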

SLIDE 14

Average scores facilitate comparison across methods


13 / 36

SLIDE 15

Two observations, two models

14 / 36


SLIDE 18

Two observations, two models, result

        Model 1   Model 2
        CRPS      CRPS
Y1      0.0023    0.02346
Y2      4.0486    3.920
mean    2.0255    1.9719

15 / 36

SLIDE 19

Other example

Consider a situation with two observations Y_i ∼ Q_{θ_i} = N(0, σ_i²), i = 1, 2, with σ_1 = 0.1 and σ_2 = 1.

Assume that we want to evaluate a model which has predictive distributions P_i = N(0, σ̂_i²) for Y_i, using the average of a proper scoring rule for the two observations.

16 / 36
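This two-scale setup can be explored in closed form (my addition; it uses E|X − Y| = √(2(σ̂² + σ²)/π) for independent zero-mean normals, and the deck's higher-is-better orientation). Varying the predictive scale for the small-scale observation shows that the average CRPS is nearly flat in σ̂_1, while the average log score reacts strongly:

```python
import numpy as np

def exp_crps(sig_hat, sig):
    """E_Q[S_CRPS(P, Y)] for P = N(0, sig_hat^2) and Y ~ Q = N(0, sig^2)."""
    return -(np.sqrt(2 * (sig_hat**2 + sig**2) / np.pi) - sig_hat / np.sqrt(np.pi))

def exp_log_score(sig_hat, sig):
    """E_Q[log f(Y)] for the same predictive/true pair."""
    return -0.5 * np.log(2 * np.pi * sig_hat**2) - sig**2 / (2 * sig_hat**2)

sig1, sig2 = 0.1, 1.0
grid = np.linspace(0.02, 0.5, 481)  # candidate sig_hat_1; sig_hat_2 held at the truth

avg_crps = 0.5 * (exp_crps(grid, sig1) + exp_crps(sig2, sig2))
avg_ls = 0.5 * (exp_log_score(grid, sig1) + exp_log_score(sig2, sig2))

# Both averages peak at the true sig_hat_1 = 0.1 (the scores are proper) ...
print(grid[np.argmax(avg_crps)], grid[np.argmax(avg_ls)])
# ... but the CRPS average is nearly flat: badly mis-specifying the small-scale
# component barely moves the mean score, while the log score reacts strongly.
print(avg_crps.max() - avg_crps.min(), avg_ls.max() - avg_ls.min())
```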


SLIDE 21

Other example

[Figure: average CRPS and log score for the example]

18 / 36

SLIDE 22

Varying scale in practice?

[Figure: map of the spatial observations by longitude and latitude]

Kuusela, M. and Stein, M.L. (2018): Locally stationary spatio-temporal interpolation of Argo profiling float data. Proceedings of the Royal Society A, 474.
Bolin, D. and Wallin, J. (2020): Multivariate type-G Matérn-SPDE random fields. JRSSB.

19 / 36


SLIDE 26

Example spatial statistics

We will now go through how model evaluation using a scoring rule is typically done in spatial statistics. We start with the basic setup:
Let s_i, i = 1, …, n, be a set of, typically irregular, locations.
We have a set of observations {y_i}_{i=1}^n at the locations {s_i}_{i=1}^n.
The score of the model, P, is given by s̄ = (1/n) Σ_{i=1}^n S(P_i, y_i).

20 / 36

SLIDE 27

Realization

[Figure: a realization Y of the spatial field on the unit square]

21 / 36

SLIDE 28

Variation of the standard deviation

Figure: true kriging standard deviation, by location.

Figure: empirical density of the true kriging standard deviation, σ_i.

22 / 36


SLIDE 30

Mathematical framework

Definition

Let S be a proper scoring rule and let Q_σ, P̂_σ be probability measures with scale σ. Then
S̃(P̂_σ, Q_σ, π) = ∫ S(P̂_σ, Q_σ) π(dσ)
is a proper scoring rule.

The difference between this scoring rule and a regular scoring rule is that there is no S̃(P̂_σ, y) function; it is a theoretical construction. However, if σ_i ∼ π and Y_i ∼ Q_{σ_i}, then
(1/n) Σ_{i=1}^n S(P̂_{σ_i}, Y_i) → S̃(P̂_σ, Q_σ, π).

23 / 36


SLIDE 33

Defining s̄ mathematically

What affects the shape of π?
If Y is a Gaussian process, σ_i (and hence π) is basically determined by the distances between the locations s.
If we assume that the observations come from some point process, we can derive the true leave-one-out standard deviations.

24 / 36

SLIDE 34

Defining ¯ s mathematically

s ∼ PPois(λ)

Figure: standard deviation by location.

Figure: estimate of π.

25 / 36

SLIDE 35

Point distribution and π

[Figure: three point patterns on the unit square with the corresponding kriging standard deviations by location, and the three resulting density estimates of π]

26 / 36

SLIDE 36

Recall the issue

[Figure: average CRPS and log score for the example]

27 / 36

SLIDE 37

Local scale

Definition

Let S be a proper scoring rule and let Q_θ = Q_[μ,σ] be a probability measure with location μ and scale σ. Assume that there exists a constant p ∈ R and a function s(Q_θ, r) : F × R² → R+, such that for each r ∈ R × R,
S(Q_θ, Q_θ) − S(Q_{θ+tσr}, Q_θ) = s(Q_θ, r) t^p + o(t^p).
Then s is the scale function of S, which is locally scale invariant if s(Q_θ, r) ≡ s(Q, r).

28 / 36
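The definition can be probed numerically (my construction): using closed-form expected scores for Gaussians, perturb only the scale, σ̂ = σ(1 + t), and look at how the score gap S(Q_θ, Q_θ) − S(Q_{θ+tσr}, Q_θ) depends on σ.

```python
import numpy as np

def exp_crps(sig_hat, sig):
    # Expected CRPS, E_Q[S(P, Y)], for P = N(0, sig_hat^2), Y ~ N(0, sig^2).
    return -(np.sqrt(2 * (sig_hat**2 + sig**2) / np.pi) - sig_hat / np.sqrt(np.pi))

def exp_log_score(sig_hat, sig):
    # Expected log score for the same pair.
    return -0.5 * np.log(2 * np.pi * sig_hat**2) - sig**2 / (2 * sig_hat**2)

t = 0.01  # small relative perturbation of the scale
gaps_crps, gaps_ls = [], []
for sig in (0.1, 1.0, 10.0):
    gaps_crps.append(exp_crps(sig, sig) - exp_crps(sig * (1 + t), sig))
    gaps_ls.append(exp_log_score(sig, sig) - exp_log_score(sig * (1 + t), sig))

print(gaps_crps)  # grows linearly with sigma: CRPS is not locally scale invariant
print(gaps_ls)    # identical for every sigma: the log score is
```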


SLIDE 39

Local scale function

The scale function of the log score, S(P, y) = log(f(y)), is locally scale invariant.
The scale function of the CRPS is s(Q_σ, r) = σ s(Q_1, r), i.e. the scale function is not locally scale invariant.

29 / 36


SLIDE 42

Known issue

The issue of unbalanced predictive distributions is not unknown. A lot of work has been put into standardizing observations, e.g.
S(P, y) = −|med(P) − y| / √(V_P[Y]),
which is not a proper score.
A previous solution is to use a reference prediction via a so-called skill score,
S_skill(P, y) = S(P, y) / S(P_ref, y),
where P_ref is the reference predictor. However, the results will be determined by the reference measure.
Another alternative is to use a weighted CRPS,
S(P, y) = −∫ (P(X ≤ x) − I(y ≤ x))² ω(x) dx;
see Gneiting and Ranjan, 2011.

30 / 36


SLIDE 44

Our idea

Recall that the continuous ranked probability score (CRPS) is given by
S(P, y) = −E_P|X − y| + (1/2) E_P E_P|X − X′|.
We introduce a different scoring rule, which we denote the standardized continuous ranked probability score (SCRPS):
S(P, y) = −E_P|X − y| / (E_P E_P|X − X′|) − (1/2) log(E_P E_P|X − X′|).

31 / 36
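A sketch implementation of the SCRPS above (my addition), using the Gaussian closed forms E|X − y| = σ(z(2Φ(z) − 1) + 2φ(z)) and E|X − X′| = 2σ/√π, checked against a sample-based estimate:

```python
import numpy as np
from scipy.stats import norm

def scrps_normal(mu, sigma, y):
    """Closed-form SCRPS for P = N(mu, sigma^2):
    S(P, y) = -E|X - y| / E|X - X'| - (1/2) log E|X - X'|."""
    z = (y - mu) / sigma
    e_abs = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z))  # E|X - y|
    e_spread = 2 * sigma / np.sqrt(np.pi)                          # E|X - X'|
    return -e_abs / e_spread - 0.5 * np.log(e_spread)

def scrps_mc(sample, y):
    """Sample-based SCRPS; the sample is split into two halves for E|X - X'|."""
    x, x2 = sample[::2], sample[1::2]
    e_spread = np.abs(x - x2).mean()
    return -np.abs(x - y).mean() / e_spread - 0.5 * np.log(e_spread)

rng = np.random.default_rng(4)
sample = rng.normal(1.0, 2.0, size=400_000)
print(scrps_normal(1.0, 2.0, 0.7))
print(scrps_mc(sample, 0.7))  # should agree closely with the closed form
```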
SLIDE 45

Kernel scores

Theorem (Gneiting and Raftery (2007))

Let P be a Borel probability measure on a Hausdorff space Ω. Assume that g is a non-negative, continuous negative definite kernel on Ω × Ω and let P denote the class of Borel probability measures on Ω such that E_{P,P}[g(X, Y)] < ∞. Then the scoring rule
S_g(P, y) := (1/2) E_P E_P[g(X, X′)] − E_P[g(X, y)]
is proper on P.

The CRPS is obtained by noting that g(x, y) = |x − y| is a negative definite kernel. In fact, g(x, y) = |x − y|^α, α ∈ (0, 2], is a negative definite kernel.

32 / 36
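A sample-based kernel score for the family g(x, y) = |x − y|^α is a one-liner (my sketch). Note the algebraic special cases: α = 1 recovers the CRPS, and α = 2 collapses to −(mean(P) − y)², the proper score built on the mean from the earlier slide, since (1/2)E(X − X′)² − E(X − y)² = −(E[X] − y)².

```python
import numpy as np

def kernel_score(sample, y, alpha=1.0):
    """S_g(P, y) = (1/2) E_P E_P[g(X, X')] - E_P[g(X, y)] with g = |x - y|^alpha,
    estimated from a predictive sample split into two halves."""
    x, x2 = sample[::2], sample[1::2]
    return 0.5 * (np.abs(x - x2) ** alpha).mean() - (np.abs(x - y) ** alpha).mean()

rng = np.random.default_rng(5)
sample = rng.normal(1.0, 2.0, size=1_000_000)
y = 0.7

# alpha = 1 gives the CRPS; alpha = 2 collapses to the negative squared error
# of the predictive mean, -(mean(P) - y)^2.
print(kernel_score(sample, y, 1.0))
print(kernel_score(sample, y, 2.0), -(sample.mean() - y) ** 2)
```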

SLIDE 46

h-function kernel scores

Theorem

Let g be a non-negative, continuous negative definite kernel on Ω × Ω, and let P be a Borel probability measure on Ω. Let h be a monotonically increasing, concave, differentiable function on R+. Further, let P denote the class of Borel probability measures on Ω such that E_P E_P[g(X, X′)] < ∞. Then the scoring rule
S_g^h(P, y) := −h(E_P E_P[g(X, X′)]) − 2 h′(E_P E_P[g(X, X′)]) (E_P[g(X, y)] − E_P E_P[g(X, X′)])
is proper on P.

The SCRPS is obtained by noting that g(x, y) = |x − y| is a negative definite kernel, and h(x) = (1/2) log(x) is a monotonically increasing, concave, differentiable function.

33 / 36
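A small check of that last claim (my addition): writing M = E_P E_P|X − X′| and plugging h(x) = (1/2) log(x), h′(x) = 1/(2x), into S_g^h gives −(1/2) log M − E_P|X − y|/M + 1, i.e. the SCRPS plus the constant 1, which does not affect any ranking.

```python
import numpy as np

def h_kernel_score(sample, y):
    """S_g^h with g(x, y) = |x - y| and h(x) = (1/2) log(x), from a sample:
    -h(M) - 2 h'(M) (E_P|X - y| - M), where M = E_P E_P|X - X'|."""
    x, x2 = sample[::2], sample[1::2]
    m = np.abs(x - x2).mean()
    e = np.abs(x - y).mean()
    return -0.5 * np.log(m) - (e - m) / m

def scrps(sample, y):
    # The SCRPS from the earlier slide, estimated from the same sample split.
    x, x2 = sample[::2], sample[1::2]
    m = np.abs(x - x2).mean()
    return -np.abs(x - y).mean() / m - 0.5 * np.log(m)

rng = np.random.default_rng(6)
sample = rng.normal(0.0, 1.5, size=10_000)
d = h_kernel_score(sample, 0.3) - scrps(sample, 0.3)
print(d)  # algebraically the two differ by exactly the constant 1
```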

SLIDE 47

Scale function

The scale function of the log score, S(P, y) = log(f(y)), is locally scale invariant. The scale function of the CRPS is s(Q_σ, r) = σ s(Q_1, r), i.e. it is not locally scale invariant. The scale function of the SCRPS is locally scale invariant!

34 / 36

SLIDE 48

Two observations, two models, result

        Model 1                         Model 2
        CRPS     log-score  SCRPS      CRPS     log-score  SCRPS
Y1      0.0023   −3.6862    −1.5351    0.0234   −1.3836    −0.3838
Y2      4.0486   16.516     4.9338     3.9204   14.154     4.5666
mean    2.0255   6.4149     1.6994     1.9719   6.3853     2.0914

35 / 36

SLIDE 49

Other example

[Figure: average CRPS, log score, and SCRPS for the example]

36 / 36