Scaling of scoring rules
Jonas Wallin, joint work with David Bolin (KAUST)
CIRM virtual conference, 2020-06-02
Forecast and observation classes
[Figure: three panels, (a) Forecast, (b) Observation, (c) Comparison]
Scoring functions apply to deterministic forecasts
The forecast x is evaluated against the observation y using scoring functions such as
- negative squared error (SE): $S(x, y) = -(x - y)^2$
- negative absolute error (AE): $S(x, y) = -|x - y|$
Bayes predictors should be used for probabilistic forecasts
For a probabilistic forecast P, decision theory tells us that if the scoring function S is given, we should issue the Bayes predictor, $\hat{x} = \arg\max_x E_P[S(x, Y)]$, as the point forecast, where the expectation is with respect to P.
- Squared error (SE): $S(x, y) = -(x - y)^2$, with $\hat{x} = \mathrm{mean}(P)$
- Absolute error (AE): $S(x, y) = -|x - y|$, with $\hat{x} = \mathrm{median}(P)$
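As a small numerical check of the two cases above, the sketch below treats a finite sample as the forecast distribution P and searches a grid for the point forecast with the highest expected score (both scores are negated errors, so higher is better). The sample values, the grid and the helper name `bayes_predictor` are illustrative assumptions, not part of the talk.

```python
import statistics

def bayes_predictor(sample, score, grid):
    """Return the grid point x with the highest average score over the sample."""
    def expected_score(x):
        return sum(score(x, y) for y in sample) / len(sample)
    return max(grid, key=expected_score)

# Hypothetical skewed forecast distribution: five equally likely outcomes.
sample = [0.0, 1.0, 2.0, 3.0, 14.0]
grid = [i / 100 for i in range(0, 1401)]  # candidate point forecasts 0.00 .. 14.00

def neg_se(x, y):
    return -(x - y) ** 2   # negative squared error

def neg_ae(x, y):
    return -abs(x - y)     # negative absolute error

x_se = bayes_predictor(sample, neg_se, grid)  # expected to match the mean
x_ae = bayes_predictor(sample, neg_ae, grid)  # expected to match the median
```

With a skewed sample like this one, the two point forecasts differ clearly, which is why the choice of scoring function matters when a distribution is reduced to a single number.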
The basic idea
Assume we have a prediction $p \in \mathcal{P}$ and an observation $o \in \mathcal{O}$, and that we wish to measure the skill of the prediction by applying a function $s : \mathcal{P} \times \mathcal{O} \to \mathbb{R}$, with a higher function value indicating better skill. What are good theoretical properties for s?
General framework without any formulas...
Assume Q is Nature’s distribution of some event y, and denote our forecast for y by P.
For forecast evaluation, we should use performance metrics that follow this principle: in the long run, we obtain the optimal performance for P = Q.
Probabilistic forecasts should generally be evaluated using proper scoring rules
A consistent scoring function is a special case of a proper scoring rule for probabilistic forecasts
Definition (Murphy and Winkler, 1968)
If $\mathcal{F}$ denotes a class of probabilistic forecasts on $\mathbb{R}$, a proper scoring rule is any function $S : \mathcal{F} \times \mathbb{R} \to \mathbb{R}$ such that
$$S(Q, Q) := E_Q[S(Q, Y)] \geq E_Q[S(P, Y)] =: S(P, Q)$$
for all $P, Q \in \mathcal{F}$.
The class of proper scoring rules is large
- $S(P, y) = -(\mathrm{mean}(P) - y)^2$
- $S(P, y) = -|\mathrm{median}(P) - y|$
Gneiting, T. and Raftery, A.E. (2007): Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378.
Optimally, forecasts should be probabilistic
All those whose duty it is to issue regular daily forecasts know that there are times when they feel very confident and other times when they are doubtful as to coming weather. It seems to me that the condition of confidence or otherwise forms a very important part of the prediction. Cooke (Monthly Weather Review, 1906)
[Figure: three panels, (d) Forecast, (e) Observation, (f) Comparison]
The class of proper scoring rules is large
Perhaps the two most common proper scoring rules are the continuous ranked probability score (CRPS),
$$S(P, y) = -E_P|X - y| + \frac{1}{2} E_P E_P|X - X'|,$$
and the log score,
$$S(P, y) = \log(f(y)),$$
where f is the density of P; see Gneiting and Raftery (2007).
The different scores behave somewhat differently
[Figure: score as a function of y for SE, AE, CRPS and IGN; axes: y, Score]
Average scores facilitate comparison across methods
Assume we have two forecasting methods m = 1, 2. They issue forecasts $P_{mi}$ with observed values $y_i$ at a finite set of times, locations or instances i = 1, ..., n. The methods are assessed and ranked by the mean score (our contribution starts here)
$$\bar{S}^m_n = \frac{1}{n} \sum_{i=1}^{n} S(P_{mi}, y_i) \quad \text{for } m = 1, 2.$$
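The ranking by mean score can be sketched as follows for Gaussian forecasts and the CRPS. The observations, the two sets of forecast parameters and the function names are made up for illustration; the closed form for $E_P|X - y|$ is the standard one for a normal distribution, and the score is oriented so that higher is better.

```python
import math

def gaussian_crps(mu, sigma, y):
    """CRPS of the forecast N(mu, sigma^2) at observation y (higher is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    e_abs_xy = sigma * (z * (2 * cdf - 1) + 2 * pdf)  # E_P|X - y|
    e_abs_xx = 2 * sigma / math.sqrt(math.pi)         # E_P E_P|X - X'|
    return -e_abs_xy + 0.5 * e_abs_xx

def mean_score(forecasts, observations, score=gaussian_crps):
    """Average score over all (forecast, observation) pairs."""
    total = sum(score(mu, sigma, y) for (mu, sigma), y in zip(forecasts, observations))
    return total / len(observations)

# Hypothetical data: four observations and two competing methods.
observations = [0.3, -1.2, 0.8, 2.1]
method_1 = [(0.0, 1.0)] * 4  # forecasts N(0, 1)
method_2 = [(0.0, 3.0)] * 4  # forecasts N(0, 9), too dispersed for these data

mean_1 = mean_score(method_1, observations)
mean_2 = mean_score(method_2, observations)
best_method = 1 if mean_1 > mean_2 else 2
```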
Two observations, two models
Two observations, two models, result
        Model 1   Model 2
        CRPS      CRPS
Y1      0.0023    0.02346
Y2      4.0486    3.920
mean    2.0255    1.9719
Other example
Consider a situation with two observations $Y_i \sim Q_{\theta_i} = N(0, \sigma_i^2)$, i = 1, 2, with $\sigma_1 = 0.1$ and $\sigma_2 = 1$.
Assume that we want to evaluate a model which has predictive distributions $P_i = N(0, \hat{\sigma}_i^2)$ for $Y_i$, using the average of a proper scoring rule for the two observations.
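For this Gaussian setup the expected scores are available in closed form, which makes it possible to see how much each observation contributes to the average. The sketch below assumes the same relative error in the predictive standard deviation for both observations (the value of `factor` is hypothetical) and uses the identity $E|Z| = s\sqrt{2/\pi}$ for $Z \sim N(0, s^2)$; both scores are oriented so that higher is better.

```python
import math

def expected_crps(sigma_hat, sigma):
    """E_Q[CRPS(P, Y)] for P = N(0, sigma_hat^2) and Q = N(0, sigma^2)."""
    e_abs_xy = math.sqrt(2 * (sigma_hat ** 2 + sigma ** 2) / math.pi)  # E|X - Y|
    e_abs_xx = 2 * sigma_hat / math.sqrt(math.pi)                      # E|X - X'|
    return -e_abs_xy + 0.5 * e_abs_xx

def expected_log_score(sigma_hat, sigma):
    """E_Q[log f_P(Y)] for the same pair of Gaussians."""
    return -0.5 * math.log(2 * math.pi * sigma_hat ** 2) - sigma ** 2 / (2 * sigma_hat ** 2)

def loss(expected_score, factor, sigma):
    """Drop in expected score when the predictive scale is factor * sigma instead of sigma."""
    return expected_score(sigma, sigma) - expected_score(factor * sigma, sigma)

factor = 1.5          # hypothetical relative error in the predictive scale
sigmas = (0.1, 1.0)   # sigma_1 and sigma_2 from the example

crps_losses = [loss(expected_crps, factor, s) for s in sigmas]
log_losses = [loss(expected_log_score, factor, s) for s in sigmas]
```

Under these assumptions the CRPS loss for the second observation is ten times that of the first, so the average is dominated by the observation with the larger scale, whereas the log score penalizes the same relative error equally for both.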
[Figure: panels CRPS and log(LS)]
Varying scale in practice?
[Figure: map; axes: Longitude, Latitude]
Kuusela, M. and Stein, M.L. (2018): Locally stationary spatio-temporal interpolation of Argo profiling float data. Proceedings of the Royal Society A, 474.
Bolin, D. and Wallin, J. (2020): Multivariate type-G Matérn-SPDE random fields. JRSSB.
Example spatial statistics
We will now go through how model evaluation using a scoring rule is typically done in spatial statistics. We start with the basic setup:
- Let $s_i$, i = 1, ..., n, be a set of, typically irregular, locations.
- We have a set of observations $\{y_i\}_{i=1}^n$ at the locations $\{s_i\}_{i=1}^n$.
- The score of the model, P, is given by $\bar{s} = \frac{1}{n} \sum_{i=1}^{n} S(P_i, y_i)$.
Realization
[Figure: realization of Y on the unit square; axes: x, y]
Variation of the standard deviation
Figure: true kriging standard deviation, by location
Figure: empirical density of the true kriging standard deviation, $\sigma_i$
Mathematical framework
Definition
Let S be a proper scoring rule, and let $Q_\sigma$, $P_\sigma$ be probability measures with scale σ. Then
$$\tilde{S}(P_{\hat\sigma}, Q_\sigma, \pi) = \int S(P_{\hat\sigma(\sigma)}, Q_\sigma)\, \pi(d\sigma)$$
is a proper scoring rule.
The difference between this scoring rule and a regular scoring rule is that there is no $\bar{S}(P_{\hat\sigma}, y)$ function. It is a theoretical construction. However, if $\sigma_i \sim \pi$ and $Y_i \sim Q_{\sigma_i}$, then
$$\frac{1}{n} \sum_{i=1}^{n} S(P_{\hat\sigma_i}, Y_i) \to \tilde{S}(P_{\hat\sigma}, Q_\sigma, \pi).$$
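The convergence of the average score can be illustrated by simulation. The sketch below makes several assumptions that are not in the slides: π is taken to be uniform on [0.1, 1], $Q_\sigma = N(0, \sigma^2)$, S is the CRPS, and the forecast uses the true scale, $\hat\sigma(\sigma) = \sigma$. In that case the expected CRPS given σ is $-\sigma/\sqrt{\pi}$, so the limit is $-E_\pi[\sigma]/\sqrt{\pi}$.

```python
import math
import random

def gaussian_crps(sigma, y):
    """CRPS of the forecast N(0, sigma^2) at observation y (higher is better)."""
    z = y / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    e_abs_xy = sigma * (z * (2 * cdf - 1) + 2 * pdf)  # E_P|X - y|
    e_abs_xx = 2 * sigma / math.sqrt(math.pi)         # E_P E_P|X - X'|
    return -e_abs_xy + 0.5 * e_abs_xx

random.seed(1)
n = 20000
total = 0.0
for _ in range(n):
    sigma_i = random.uniform(0.1, 1.0)    # sigma_i ~ pi (assumed uniform)
    y_i = random.gauss(0.0, sigma_i)      # Y_i ~ Q_{sigma_i}
    total += gaussian_crps(sigma_i, y_i)  # forecast with the true scale
mc_average = total / n

# Integral of the expected CRPS, -sigma / sqrt(pi), against the uniform pi;
# the mean of sigma under Uniform(0.1, 1) is 0.55.
limit = -0.55 / math.sqrt(math.pi)
```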
Defining $\bar{s}$ mathematically
What affects the shape of π?
- If Y is a Gaussian process, $\sigma_i$ (and hence π) is basically determined by the distances between the locations s.
- If we assume that the observation locations come from some point process, we can derive the true leave-one-out standard deviations.
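To make the dependence of $\sigma_i$ on the location configuration concrete, here is a minimal sketch for a zero-mean Gaussian process in one dimension. The exponential covariance, its range, and the four locations are illustrative assumptions. It uses the standard identity that the leave-one-out predictive variance at location i equals the reciprocal of the i-th diagonal element of the inverse covariance matrix.

```python
import math

def exp_cov(locs, corr_range=0.2, var=1.0):
    """Exponential covariance matrix for one-dimensional locations."""
    return [[var * math.exp(-abs(a - b) / corr_range) for b in locs] for a in locs]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Three clustered locations and one isolated location (hypothetical).
locs = [0.00, 0.02, 0.04, 0.50]
precision = invert(exp_cov(locs))

# Leave-one-out predictive standard deviation at each location.
loo_sd = [math.sqrt(1.0 / precision[i][i]) for i in range(len(locs))]
```

In this configuration the isolated location gets a much larger leave-one-out standard deviation than the clustered ones, so a different distribution of locations gives a different π.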
$s \sim \mathrm{PPois}(\lambda)$
Figure: standard deviation by location
Figure: estimate of π
Point distribution and π
Recall the issue
[Figure: panels CRPS and log(LS)]
Local scale
Definition
Let S be a proper scoring rule and let $Q_\theta = Q_{[\mu,\sigma]}$ be a probability measure with location μ and scale σ. Assume that there exist a constant $p \in \mathbb{R}$ and a function $s(Q_\theta, r) : \mathcal{F} \times \mathbb{R}^2 \to \mathbb{R}^+$ such that, for each $r \in \mathbb{R} \times \mathbb{R}$,
$$S(Q_\theta, Q_\theta) - S(Q_{\theta + t\sigma r}, Q_\theta) = s(Q_\theta, r)\, t^p + o(t^p).$$
Then s is the scale function of S, which is locally scale invariant if $s(Q_\theta, r) \equiv s(Q, r)$.
Local scale function
The scale function of the log score, $S(P, y) = \log(f(y))$, is locally scale invariant.
The scale function of the CRPS is $s(Q_\sigma, r) = \sigma\, s(Q_1, r)$, i.e. the scale function is not locally scale invariant.
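As a worked check of these two statements, assume $Q_\theta = N(\mu, \sigma^2)$ and perturb only the scale, $r = (0, r_2)$, so that the perturbed forecast is $N(\mu, \sigma^2 (1 + t r_2)^2)$. The Gaussian family and the scale-only direction are assumptions made here for illustration. Writing $a = t r_2$ and expanding around $a = 0$ gives, under these assumptions:

```latex
\begin{align*}
\text{Log score:}\quad
S(Q_\theta, Q_\theta) - S(Q_{\theta + t\sigma r}, Q_\theta)
  &= \log(1 + a) + \frac{1}{2(1 + a)^2} - \frac{1}{2}
   = r_2^2\, t^2 + o(t^2), \\
\text{CRPS:}\quad
S(Q_\theta, Q_\theta) - S(Q_{\theta + t\sigma r}, Q_\theta)
  &= \frac{\sigma}{\sqrt{\pi}}
     \left( \sqrt{2\left((1 + a)^2 + 1\right)} - 2 - a \right)
   = \sigma\, \frac{r_2^2}{4\sqrt{\pi}}\, t^2 + o(t^2).
\end{align*}
```

In both cases p = 2. The leading coefficient does not involve σ for the log score, while for the CRPS it is proportional to σ.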
Known issue
The issue of unbalanced predictive distributions is not unknown. A lot of work has been put into standardizing observations, for example
$$S(P, y) = \frac{|\mathrm{med}(P) - y|}{\sqrt{V_P[Y]}},$$
which is not a proper score.
A previous solution is to use a reference prediction through a so-called skill score,
$$S_{\mathrm{skill}}(P, y) = \frac{S(P, y)}{S(P_{\mathrm{ref}}, y)},$$
where $P_{\mathrm{ref}}$ is the reference predictor. However, the results will then be determined by the reference measure.
Another alternative is to use a weighted CRPS,
$$S(P, y) = \int \left(P(X \leq x) - I(y \leq x)\right)^2 \omega(x)\, dx;$$
see Gneiting and Ranjan (2011).
Our idea
Recall that the continuous ranked probability score (CRPS) is given by
$$S(P, y) = -E_P|X - y| + \frac{1}{2} E_P E_P|X - X'|.$$
We introduce a different scoring rule, which we call the standardized continuous ranked probability score (SCRPS):
$$S(P, y) = -\frac{E_P|X - y|}{E_P E_P|X - X'|} - \frac{1}{2} \log\left(E_P E_P|X - X'|\right).$$
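For a Gaussian forecast both expectations have closed forms, so the two scores can be compared directly. The sketch below assumes $P = N(\mu, \sigma^2)$; the function names and the numbers `y` and `c` are illustrative. It also shows what happens when forecast and observation are rescaled by a common factor c.

```python
import math

def gaussian_abs_moments(mu, sigma, y):
    """Return (E_P|X - y|, E_P E_P|X - X'|) for P = N(mu, sigma^2)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    e_abs_xy = sigma * (z * (2 * cdf - 1) + 2 * pdf)
    e_abs_xx = 2 * sigma / math.sqrt(math.pi)
    return e_abs_xy, e_abs_xx

def crps(mu, sigma, y):
    """CRPS, oriented so that higher is better."""
    e_abs_xy, e_abs_xx = gaussian_abs_moments(mu, sigma, y)
    return -e_abs_xy + 0.5 * e_abs_xx

def scrps(mu, sigma, y):
    """Standardized CRPS, oriented so that higher is better."""
    e_abs_xy, e_abs_xx = gaussian_abs_moments(mu, sigma, y)
    return -e_abs_xy / e_abs_xx - 0.5 * math.log(e_abs_xx)

y, c = 0.7, 10.0  # hypothetical observation and rescaling factor
crps_unit, crps_scaled = crps(0.0, 1.0, y), crps(0.0, c, c * y)
scrps_unit, scrps_scaled = scrps(0.0, 1.0, y), scrps(0.0, c, c * y)
```

Rescaling multiplies the CRPS by c, whereas the SCRPS only shifts by the constant $-\frac{1}{2}\log c$, which does not depend on the observation or on how well the forecast fits.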
Kernel scores
Theorem (Gneiting and Raftery (2007))
Let P be a Borel probability measure on a Hausdorff space Ω. Assume that g is a non-negative, continuous negative definite kernel on Ω × Ω, and let $\mathcal{P}$ denote the class of Borel probability measures on Ω such that $E_P E_P[g(X, X')] < \infty$. Then the scoring rule
$$S_g(P, y) := \frac{1}{2} E_P E_P[g(X, X')] - E_P[g(X, y)]$$
is proper on $\mathcal{P}$.
The CRPS is obtained by noting that $g(x, y) = |x - y|$ is a negative definite kernel. In fact, $g(x, y) = |x - y|^\alpha$, $\alpha \in (0, 2]$, is a negative definite kernel.
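When the forecast P is an ensemble, that is, the empirical distribution of a finite set of values, both expectations in the kernel score are plain averages. The sketch below assumes that setting and the kernel $g(x, y) = |x - y|^\alpha$; the function name and the two-member ensemble are illustrative.

```python
def kernel_score(ensemble, y, alpha=1.0):
    """Kernel score S_g(P, y) with g(x, y) = |x - y|**alpha, where P is the
    empirical distribution of `ensemble`; oriented so that higher is better."""
    if not 0 < alpha <= 2:
        raise ValueError("alpha must lie in (0, 2]")
    m = len(ensemble)
    # E_P[g(X, y)]: average kernel value between ensemble members and y.
    e_gxy = sum(abs(x - y) ** alpha for x in ensemble) / m
    # E_P E_P[g(X, X')]: average over all ordered pairs of ensemble members.
    e_gxx = sum(abs(a - b) ** alpha for a in ensemble for b in ensemble) / (m * m)
    return 0.5 * e_gxx - e_gxy

ensemble = [0.0, 2.0]  # hypothetical two-member ensemble
crps_value = kernel_score(ensemble, 1.0)             # alpha = 1 gives the CRPS
alpha2_value = kernel_score(ensemble, 1.0, alpha=2)  # alpha = 2
```

For α = 2 the score reduces to $-(\mathrm{mean}(P) - y)^2$, which is zero here because the ensemble mean equals the observation.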
h-function Kernel scores
Theorem
Let g be a non-negative, continuous negative definite kernel on Ω × Ω, and let P be a Borel probability measure on Ω. Let h be a monotonically increasing, concave, differentiable function on $\mathbb{R}^+$. Further, let $\mathcal{P}$ denote the class of Borel probability measures on Ω such that $E_P E_P[g(X, X')] < \infty$. Then the scoring rule
$$S_g^h(P, y) := -h\left(E_P E_P[g(X, X')]\right) - 2h'\left(E_P E_P[g(X, X')]\right) \left(E_P[g(X, y)] - E_P E_P[g(X, X')]\right)$$
is proper on $\mathcal{P}$.
The SCRPS is obtained by noting that $g(x, y) = |x - y|$ is a negative definite kernel, and that $h(x) = \frac{1}{2}\log(x)$ is a monotonically increasing, concave, differentiable function.
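The substitution can be written out explicitly. The derivation below reads the scoring rule in the theorem as $S_g^h(P, y) = -h(E) - 2h'(E)\,(E_P[g(X, y)] - E)$ with the shorthand $E := E_P E_P[g(X, X')]$, which is introduced here only for brevity, and takes $g(x, y) = |x - y|$ and $h(x) = \frac{1}{2}\log(x)$, so that $h'(x) = 1/(2x)$:

```latex
\begin{align*}
S_g^h(P, y)
  &= -h(E) - 2h'(E)\,\bigl(E_P|X - y| - E\bigr),
     \qquad E := E_P E_P|X - X'|, \\
  &= -\frac{1}{2}\log(E) - \frac{1}{E}\,\bigl(E_P|X - y| - E\bigr) \\
  &= -\frac{E_P|X - y|}{E} - \frac{1}{2}\log(E) + 1 .
\end{align*}
```

This is the SCRPS up to the additive constant 1, which does not affect any comparison between forecasts. The same substitution with $h(x) = x/2$ returns the kernel score $S_g$.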
Scale function
- The scale function of the log score, $S(P, y) = \log(f(y))$, is locally scale invariant.
- The scale function of the CRPS is $s(Q_\sigma, r) = \sigma\, s(Q_1, r)$, i.e. the scale function is not locally scale invariant.
- The scale function of the SCRPS is locally scale invariant!
Two observations, two models, result
        Model 1                          Model 2
        CRPS     log-score   sCRPS       CRPS     log-score   sCRPS
Y1      0.0023   −3.6862     −1.5351     0.0234   −1.3836     −0.3838
Y2      4.0486   16.516      4.9338      3.9204   14.154      4.5666
mean    2.0255   6.4149      1.6994      1.9719   6.3853      2.0914
Other example
[Figure: panels CRPS, log(LS) and SCRPS]