ROBUST ONLINE AGGREGATION OF FORECASTS: APPLICATION TO ELECTRICITY LOAD FORECASTING
Pierre Gaillard
October 21, 2015, University of Copenhagen
The framework of this talk

Sequential prediction of arbitrary time series based on expert forecasts:
- a time series y_1, ..., y_n ∈ R^d is to be predicted
- expert forecasts are available: e.g., given by some stochastic or machine-learning models (for us: black boxes)

At each forecasting instance t = 1, ..., n:
- each forecasting black box k ∈ {1, ..., K} provides a forecast x_{k,t} of y_t
- we are asked to form a prediction \hat{y}_t of y_t with knowledge of
  - the past observations y_1, ..., y_{t-1}
  - the current and past expert forecasts (x_{k,s})_{s ≤ t, 1 ≤ k ≤ K}
- typical solution: assign a weight p_{k,t} to each expert and predict
    \hat{y}_t = \sum_{k=1}^{K} p_{k,t} x_{k,t}
- we then observe y_t

1 / 16
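To make this protocol concrete, here is a minimal sketch in Python (not from the slides; the array layout, the weight_rule callback, and the square loss are my own illustrative choices) of the sequential aggregation loop:

```python
import numpy as np

def aggregate_online(y, X, weight_rule):
    """Sequential aggregation. y: observations, shape (n,); X: expert
    forecasts, shape (n, K) with X[t, k] the forecast of expert k for y[t].
    weight_rule maps the (t, K) array of past expert losses to a weight vector."""
    n, K = X.shape
    y_hat = np.empty(n)
    losses = np.zeros((n, K))
    for t in range(n):
        p_t = weight_rule(losses[:t])        # weights may only use the past
        y_hat[t] = p_t @ X[t]                # convex combination of the expert forecasts
        losses[t] = (X[t] - y[t]) ** 2       # square loss of each expert, revealed after predicting
    return y_hat

def uniform_weights(past_losses):
    """The 'uniform mean' benchmark: constant weight 1/K on every expert."""
    K = past_losses.shape[1]
    return np.full(K, 1.0 / K)
```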
Evaluation criterion

We consider a convex loss function ℓ : R^d × R^d → R, e.g., the square loss ℓ(x, y) = ‖x − y‖^2.

Goal: minimize our average loss

  \bar{L}_n = (1/n) \sum_{t=1}^{n} ℓ(\hat{y}_t, y_t).

Difficulty: no stochastic assumption on the time series
- neither on the observations (y_t)
- nor on the expert forecasts (x_{k,t})

They are arbitrary and can be chosen by an adversary. If all experts are bad, good performance is hopeless ➥ relative criterion.

2 / 16
The regret: a relative criterion

We evaluate our performance relative to that of the experts:

  (1/n) \sum_{t=1}^{n} \hat{ℓ}_t   =: \bar{L}_n   (our performance)

  = \min_{k=1,...,K} (1/n) \sum_{t=1}^{n} ℓ_{k,t}   =: L*_n   (reference performance: approximation error)

  + [ (1/n) \sum_{t=1}^{n} \hat{ℓ}_t − \min_{k=1,...,K} (1/n) \sum_{t=1}^{n} ℓ_{k,t} ]   =: Reg_n   (average regret: estimation error)

where \hat{ℓ}_t = ℓ(\hat{y}_t, y_t) and ℓ_{k,t} = ℓ(x_{k,t}, y_t).

Goal: perform almost as well as the best of the experts when n → ∞:

  \limsup_{n→∞} \sup_{(y_t), (x_{k,t})} Reg_n ≤ 0.

3 / 16
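As a small illustration (function and variable names are mine; square loss assumed), the average regret of a sequence of aggregated forecasts can be computed as follows:

```python
import numpy as np

def average_regret(y, X, y_hat):
    """Average regret Reg_n against the best single expert, for the square loss.
    y, y_hat: shape (n,); X: shape (n, K)."""
    our_loss = np.mean((y_hat - y) ** 2)                    # \bar{L}_n
    expert_losses = np.mean((X - y[:, None]) ** 2, axis=0)  # average loss of each expert
    best_expert_loss = expert_losses.min()                  # L*_n, the approximation error
    return our_loss - best_expert_loss                      # Reg_n, the estimation error
```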
Best convex combination

A more ambitious approximation error:

  \min_{q ∈ Δ_K} (1/n) \sum_{t=1}^{n} ℓ( \sum_{k=1}^{K} q_k x_{k,t}, y_t ),   where Δ_K = { q ∈ R_+^K : \sum_{k=1}^{K} q_k = 1 }.

If an expert provides inaccurate forecasts that compensate the other experts' forecasts, we should increase its weight.
➥ The gradient trick formalizes this idea.
Example for the square loss: replace the expert loss (x_{k,t} − y_t)^2 with the gradient loss (\hat{y}_t − y_t)(x_{k,t} − y_t).

4 / 16
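A minimal sketch of the gradient trick for the square loss (function names and example values are mine): the weight update is charged the linearized losses from the slide instead of the experts' own square losses, which is what lets the aggregation compete with the best convex combination.

```python
import numpy as np

def gradient_losses(x_t, y_t, y_hat_t):
    """Losses fed to the weight update under the gradient trick (square loss):
    expert k is charged (y_hat_t - y_t) * (x_{k,t} - y_t) instead of its own
    square loss (x_{k,t} - y_t) ** 2, as written on the slide."""
    return (y_hat_t - y_t) * (x_t - y_t)

# Example: the aggregated forecast over-predicts (y_hat_t > y_t); experts that
# forecast above y_t then receive a positive loss and lose weight, while
# experts below y_t receive a negative loss and gain weight.
x_t, y_t, y_hat_t = np.array([50.0, 40.0, 45.0]), 44.0, 46.0
print(gradient_losses(x_t, y_t, y_hat_t))   # [12., -8., 2.]
```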
Brief summary

A meta-statistical interpretation:
- expert forecasts are given by some statistical forecasting methods, each possibly tuned with a different set of parameters; they may rely on some stochastic model
- these ensemble forecasts are then combined in a robust and deterministic manner

A trade-off: our final performance expresses these two parts

  \bar{L}_n = L*_n + Reg_n

5 / 16
Application: electricity load forecasting
Goal: day-ahead forecasting of the French electricity load.

Data characteristics:
- January 1, 2008 – August 31, 2011 as training set
- September 1, 2011 – June 15, 2012 (excluding some special days) as testing set
- electricity demand of EDF clients, at a half-hour step
- typical values: median = 43 496 MW, maximum = 78 922 MW
- three expert forecasters: GAM, CLR, KWF
6 / 16
Data looks like...
7 / 16
Application: electricity load forecasting
Convex loss functions considered:
- square loss: ℓ(x, y) = (x − y)^2 ➝ RMSE
- absolute percentage error: ℓ(x, y) = |x − y| / y ➝ MAPE

Operational constraint: one-day-ahead prediction at a half-hour step, i.e., 48 aggregated forecasts.

Expert forecasters:
- GAM: generalized additive models (see Wood 2006; Wood, Goude, Shaw 2014)
- CLR: curve linear regression (see Cho, Goude, Brossat, Yao 2013, 2014)
- KWF: functional wavelet-kernel approach (see Antoniadis, Paparoditis, Sapatinas 2006; Antoniadis, Brossat, Cugliari, Poggi 2012, 2013)
8 / 16
How good are our experts?

Loss: RMSE and MAPE on the testing set (with no warm-up period):

  RMSE = \sqrt{ (1/n) \sum_{t=1}^{n} (\hat{y}_t − y_t)^2 }      MAPE = (1/n) \sum_{t=1}^{n} |\hat{y}_t − y_t| / y_t

We look at the performance of the oracles:

             Uniform mean   Best forecaster   Best convex p   Best linear u
  RMSE (MW)  725            744               629             629
  MAPE (%)   1.18           1.29              1.06            1.06
9 / 16
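A minimal sketch of these two criteria in Python (the ×100 convention for the MAPE is my assumption, chosen so that values land near the 1.18 % shown above):

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean squared error, in the unit of the data (here MW)."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mape(y_hat, y):
    """Mean absolute percentage error, reported in percent."""
    return 100.0 * np.mean(np.abs(y_hat - y) / y)
```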
A strategy to pick the convex weights

The exponentially weighted average forecaster (EWA)

Parameter: η > 0. Initialization: p_1 = (1/K, ..., 1/K).

At each time step t, we assign to expert k the weight

  p_{k,t} = exp( −η \sum_{s=1}^{t−1} ℓ_{k,s} ) / \sum_{j=1}^{K} exp( −η \sum_{s=1}^{t−1} ℓ_{j,s} ),

or, equivalently, in recursive form,

  p_{k,t} = p_{k,t−1} e^{−η ℓ_{k,t−1}} / \sum_{j=1}^{K} p_{j,t−1} e^{−η ℓ_{j,t−1}}.

Performance: if the loss is convex and bounded by B,

  Reg_n := (1/n) \sum_{t=1}^{n} \hat{ℓ}_t − \min_k (1/n) \sum_{t=1}^{n} ℓ_{k,t}
         ≤ log K / (ηn) + η B^2 / 8
         ≤ B \sqrt{ log K / (2n) }      for η = B^{−1} \sqrt{ 8 log K / n }.

10 / 16
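A minimal sketch of EWA as stated on the slide (square loss, recursive update, and the theoretical learning rate; the assumption that the loss range B is known is mine):

```python
import numpy as np

def ewa(y, X, B):
    """Exponentially weighted average forecaster for the square loss, with the
    theoretical learning rate eta = sqrt(8 log K / n) / B, assuming the
    per-step losses lie in [0, B]."""
    n, K = X.shape
    eta = np.sqrt(8.0 * np.log(K) / n) / B
    p = np.full(K, 1.0 / K)                  # p_1 = (1/K, ..., 1/K)
    y_hat = np.empty(n)
    for t in range(n):
        y_hat[t] = p @ X[t]                  # prediction at time t
        ell = (X[t] - y[t]) ** 2             # experts' losses, observed after predicting
        w = p * np.exp(-eta * ell)           # recursive multiplicative update
        p = w / w.sum()                      # renormalize to get the weights of time t + 1
    return y_hat
```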
Proof: let’s do some maths...

Lemma (Hoeffding). Let X be a random variable taking values in [0, B]. Then for any s ∈ R,

  log E[e^{sX}] ≤ s E[X] + s^2 B^2 / 8.

1. Upper bound the instantaneous loss \hat{ℓ}_t:

  \hat{ℓ}_t = ℓ(p_t · x_t, y_t)
           ≤ p_t · ℓ(x_t, y_t)                                                (by convexity)
           ≤ −(1/η) log( \sum_{k=1}^{K} p_{k,t} e^{−η ℓ_{k,t}} ) + η B^2 / 8   (by Hoeffding)
           = −(1/η) log( (p_{k,t} / p_{k,t+1}) e^{−η ℓ_{k,t}} ) + η B^2 / 8    (by definition of p_{k,t+1})
           = ℓ_{k,t} + (1/η) log( p_{k,t+1} / p_{k,t} ) + η B^2 / 8.

2. Sum over all t; the sum telescopes (and log p_{k,n+1} ≤ 0):

  \sum_{t=1}^{n} ( \hat{ℓ}_t − ℓ_{k,t} ) ≤ (1/η) log( 1 / p_{k,1} ) + η n B^2 / 8 = (log K) / η + η n B^2 / 8.

Since this holds for every expert k, dividing by n yields Reg_n ≤ log K / (ηn) + η B^2 / 8.
11 / 16
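A quick numerical sanity check of Hoeffding's lemma above (a Monte Carlo sketch; the uniform distribution and the value of s are arbitrary choices of mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B, s = 1.0, -2.0                          # losses in [0, B]; s plays the role of -eta
X = rng.uniform(0.0, B, size=1_000_000)
lhs = np.log(np.mean(np.exp(s * X)))      # Monte Carlo estimate of log E[e^{sX}]
rhs = s * X.mean() + s ** 2 * B ** 2 / 8  # Hoeffding's upper bound
print(lhs <= rhs)                         # expected to print True
```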
Calibration of η

Best theoretical value: η* = B^{−1} \sqrt{ 8 log K / n }.

Issue: n and B are not known in advance!

Solutions:
- “doubling trick”
- adaptive learning rate η_t picked according to some theoretical value
- use simultaneously several learning rates...
- calibrate on a grid (see the sketch after this slide) by choosing
    η_t ∈ arg min_η { loss of Exp. weights run with η until time t − 1 }
- 12 / 16
Application to electricity load forecasting (continued)
Benchmark and oracles (RMSE, in MW):

             Uniform mean   Best forecaster   Best convex p   Best linear u
  RMSE (MW)  725            744               629             629

vs. aggregated forecasts with convex weights (RMSE, in MW):

  Exp. weights (best η for theory)        644
  Exp. weights (best η on data)           644
  Exp. weights (η tuned online on data)   625
  ML-Poly (tuned according to theory)     626
13 / 16
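ML-Poly is only named on this slide; below is a hedged sketch of one standard form of its update (following Gaillard, Stoltz, and van Erven, 2014), with weights proportional to per-expert learning rates times the positive part of the cumulative regrets. The exact variant behind the 626 MW figure may differ.

```python
import numpy as np

def ml_poly(y, X):
    """One standard form of ML-Poly (square loss): weights proportional to
    eta_k * max(R_k, 0), where R_k is the cumulative regret against expert k
    and eta_k adapts to the sum of squared instantaneous regrets."""
    n, K = X.shape
    R = np.zeros(K)                              # cumulative regrets
    S = np.zeros(K)                              # cumulative squared instantaneous regrets
    y_hat = np.empty(n)
    for t in range(n):
        w = np.maximum(R, 0.0) / (1.0 + S)       # eta_k * (R_k)_+ with eta_k = 1 / (1 + S_k)
        p = w / w.sum() if w.sum() > 0 else np.full(K, 1.0 / K)
        y_hat[t] = p @ X[t]
        ell_hat = (y_hat[t] - y[t]) ** 2         # our loss
        ell = (X[t] - y[t]) ** 2                 # experts' losses
        R += ell_hat - ell
        S += (ell_hat - ell) ** 2
    return y_hat
```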
Evolution of the weights
No focus on a single member! The weights change significantly over time and do not converge, illustrating that the performance of the forecasters varies over time.

14 / 16
Are all forecasters useful?
Definitely yes! Dropping one of the 3 forecasters and keeping only the best 2 degrades the results:

  Exp. weights   625 ➝ 644
  ML-Poly        626 ➝ 646

Forecasters that are no longer useful can come back later if needed.
15 / 16
Conclusion
This was only a small glimpse into the work performed during my PhD at EDF R&D. I applied the method to many other data sets with good results ➝ universality of the method. Here, with Olivier, we aim to work on:
- huge number of experts ➝ sparse and efficient methods
- better calibration of the learning parameter to get faster rates
- lower bounds
- probabilistic forecasts by using the pinball loss
Thanks
16 / 16