ROBUST ONLINE AGGREGATION OF FORECASTS: APPLICATION TO ELECTRICITY LOAD FORECASTING - PowerPoint Presentation



SLIDE 1

ROBUST ONLINE AGGREGATION OF FORECASTS

APPLICATION TO ELECTRICITY LOAD FORECASTING

Pierre Gaillard October 21, 2015 University of Copenhagen

SLIDE 2

The framework of this talk

Sequential prediction of an arbitrary time series based on expert forecasts:

  • a time series y1, . . . , yn ∈ Rd is to be predicted
  • expert forecasts are available, e.g., given by some stochastic or machine-learning models (for us: black boxes)

At each forecasting instance t = 1, . . . , n:

  • each forecasting black box k ∈ {1, . . . , K} provides a forecast xk,t of yt
  • we are asked to form a prediction ŷt of yt with knowledge of
      • the past observations y1, . . . , yt−1
      • the current and past expert forecasts (xk,s), s ≤ t, 1 ≤ k ≤ K
  • we observe yt

SLIDE 3

The framework of this talk

Sequential prediction of an arbitrary time series based on expert forecasts:

  • a time series y1, . . . , yn ∈ Rd is to be predicted
  • expert forecasts are available, e.g., given by some stochastic or machine-learning models (for us: black boxes)

At each forecasting instance t = 1, . . . , n:

  • each forecasting black box k ∈ {1, . . . , K} provides a forecast xk,t of yt
  • typical solution: assign a weight pk,t to each expert and predict

        ŷt = ∑_{k=1}^{K} pk,t xk,t

  • we observe yt

SLIDE 4

Evaluation criterion

We consider a convex loss function ℓ : Rd × Rd → R, e.g., the square loss ℓ(x, y) = ‖x − y‖².

Goal: minimize our average loss

    L̂n = (1/n) ∑_{t=1}^{n} ℓ(ŷt, yt)

Difficulty: no stochastic assumption on the time series

  • neither on the observations (yt)
  • nor on the expert forecasts (xk,t)

They are arbitrary and can be chosen by an adversary. If all experts are bad, good performance is hopeless ➥ relative criterion

SLIDE 5

The regret: a relative criterion

We evaluate our performance relative to that of the experts:

    (1/n) ∑_{t=1}^{n} ℓ̂t                      =:  L̂n      our performance
    min_{k=1,…,K} (1/n) ∑_{t=1}^{n} ℓk,t       =:  L⋆n     reference performance (approximation error)

and decompose

    L̂n = L⋆n + [ (1/n) ∑_{t=1}^{n} ℓ̂t − min_{k=1,…,K} (1/n) ∑_{t=1}^{n} ℓk,t ]  =:  L⋆n + Regn

where Regn is the average regret (estimation error), ℓ̂t = ℓ(ŷt, yt), and ℓk,t = ℓ(xk,t, yt).

Goal

Perform almost as well as the best of the experts when n → ∞:

    lim sup_{n→∞}  sup_{(yt),(xk,t)}  Regn ≤ 0
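In code, the decomposition above is just a difference of averages. A minimal numpy sketch (the function name `average_regret` is ours, not from the talk):

```python
import numpy as np

def average_regret(losses_pred, losses_experts):
    """Average regret: our mean loss minus the mean loss of the best expert.

    losses_pred: shape (n,), our losses l_hat_t.
    losses_experts: shape (n, K), expert losses l_{k,t}.
    """
    return losses_pred.mean() - losses_experts.mean(axis=0).min()

# Toy check: if we always pay the pointwise-best expert loss, the regret
# is non-positive (pointwise best is at least as good as the single best).
experts = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 1.0]])  # n=3, K=2
ours = experts.min(axis=1)
print(average_regret(ours, experts))  # negative here
```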

SLIDE 6

Best convex combination

A more ambitious approximation error:

    min_{q ∈ ∆K} (1/n) ∑_{t=1}^{n} ℓ( ∑_{k=1}^{K} qk xk,t , yt )    where ∆K = { q ∈ R₊^K : ∑_{k=1}^{K} qk = 1 }

If an expert provides inaccurate forecasts which compensate the other experts' forecasts, we should increase its weight.
➥ The gradient trick formalizes this idea.

Example for the square loss, where ŷt is our prediction:

    ℓk,t = (xk,t − yt)²  ➝  ℓ̃k,t = 2 (ŷt − yt)(xk,t − yt)
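A sketch of the pseudo-losses the gradient trick feeds to the weighting rule, assuming the square loss (the function name is ours; the factor 2 comes from differentiating the square loss, and the constant shift by −yt leaves the resulting weights unchanged):

```python
import numpy as np

def gradient_trick_losses(y_hat, y, x_experts):
    """Linearized ("pseudo") losses for the gradient trick, square loss case.

    Each expert k is charged 2 * (y_hat - y) * (x_k - y) instead of
    (x_k - y)**2: an expert whose error points opposite to ours gets a
    negative pseudo-loss, so its weight grows even though it is inaccurate.
    """
    return 2.0 * (y_hat - y) * (x_experts - y)

# Our prediction overshoots (y_hat > y): the expert forecasting above y is
# penalized, the one forecasting below y is rewarded.
x = np.array([10.0, 6.0])
y, y_hat = 8.0, 9.0
print(gradient_trick_losses(y_hat, y, x))  # pseudo-losses 4 and -4
```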

SLIDE 7

Brief summary

A meta-statistical interpretation:

  • expert forecasts are given by some statistical forecasting methods, each possibly tuned with a different set of parameters; they may rely on some stochastic model
  • these ensemble forecasts are then combined in a robust and deterministic manner

A trade-off: our final performance expresses these two parts:

    L̂n = L⋆n + Regn

SLIDE 8

Application: electricity load forecasting

Goal: day-ahead forecasting of the French electricity load

Data characteristics:

  • January 1, 2008 – August 31, 2011 as a training set
  • September 1, 2011 – June 15, 2012 (excluding some special days) as a testing set
  • electricity demand of EDF clients, at a half-hour step
  • typical values: median = 43 496 MW, maximum = 78 922 MW
  • three expert forecasters: GAM, CLR, KWF

SLIDE 9

Data looks like...


SLIDE 10

Application: electricity load forecasting

Convex loss functions considered:

  • square loss: ℓ(x, y) = (x − y)² ➝ RMSE
  • absolute percentage of error: ℓ(x, y) = |x − y| / y ➝ MAPE

Operational constraint: one-day-ahead prediction at a half-hour step, i.e., 48 aggregated forecasts

Expert forecasters:

  • GAM / generalized additive models (see Wood 2006; Wood, Goude, Shaw 2014)
  • CLR / curve linear regression (see Cho, Goude, Brossat, Yao 2013, 2014)
  • KWF / functional wavelet-kernel approach (see Antoniadis, Paparoditis, Sapatinas 2006; Antoniadis, Brossat, Cugliari, Poggi 2012, 2013)

SLIDE 11

How good are our experts?

Loss: RMSE and MAPE on the testing set (with no warm-up period):

    RMSE = √( (1/n) ∑_{t=1}^{n} (ŷt − yt)² )        MAPE = (1/n) ∑_{t=1}^{n} |ŷt − yt| / yt

We look at the performance of the oracles:

                 Uniform mean   Best forecaster   Best convex p   Best linear u
    RMSE (MW)        725             744              629             629
    MAPE (%)        1.18            1.29             1.06            1.06

SLIDE 12

A strategy to pick the convex weights

The exponentially weighted average forecaster (EWA)

Parameter: η > 0
Initialization: p1 = (1/K, . . . , 1/K)

At each time step t, we assign to expert k the weight

    pk,t = exp( −η ∑_{s=1}^{t−1} ℓk,s ) / ∑_{j=1}^{K} exp( −η ∑_{s=1}^{t−1} ℓj,s )

Performance: if the loss is convex and bounded by B,

    Regn := (1/n) ∑_{t=1}^{n} ℓ̂t − min_k (1/n) ∑_{t=1}^{n} ℓk,t
          ≤ (log K)/(ηn) + ηB²/8
          ≤ B √( (log K)/(2n) )    for η = B⁻¹ √( (8 log K)/n )
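The weight formula above translates directly into a few lines of numpy. A sketch under the square loss (function names are ours, not EDF code; subtracting the max score before exponentiating is a standard numerical safeguard that leaves the normalized weights unchanged):

```python
import numpy as np

def ewa_weights(past_losses, eta):
    """EWA weights p_{k,t} from the past losses of each expert.

    past_losses: shape (t-1, K), losses l_{k,s} for s < t.
    Returns uniform weights when no losses have been observed yet.
    """
    scores = -eta * past_losses.sum(axis=0)
    scores -= scores.max()          # numerical stabilization only
    w = np.exp(scores)
    return w / w.sum()

def ewa_forecast(expert_forecasts, y, eta):
    """Run EWA over a whole sequence with the square loss.

    expert_forecasts: shape (n, K); y: shape (n,). Returns our predictions.
    """
    n, K = expert_forecasts.shape
    preds = np.empty(n)
    losses = np.zeros((0, K))
    for t in range(n):
        p = ewa_weights(losses, eta)            # uniform at t = 0
        preds[t] = p @ expert_forecasts[t]      # convex combination
        lt = (expert_forecasts[t] - y[t]) ** 2  # losses revealed after predicting
        losses = np.vstack([losses, lt])
    return preds
```

On a toy sequence where one expert is exact and another is off by a constant, the weights concentrate on the good expert after a few steps and the average loss approaches that of the best expert, as the regret bound promises.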

SLIDE 13

A strategy to pick the convex weights

The exponentially weighted average forecaster (EWA)

Parameter: η > 0
Initialization: p1 = (1/K, . . . , 1/K)

At each time step t, we assign to expert k the weight (incremental form of the same rule):

    pk,t = pk,t−1 e^{−η ℓk,t−1} / ∑_{j=1}^{K} pj,t−1 e^{−η ℓj,t−1}

Performance: if the loss is convex and bounded by B,

    Regn := (1/n) ∑_{t=1}^{n} ℓ̂t − min_k (1/n) ∑_{t=1}^{n} ℓk,t
          ≤ (log K)/(ηn) + ηB²/8
          ≤ B √( (log K)/(2n) )    for η = B⁻¹ √( (8 log K)/n )
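The multiplicative update is algebraically identical to the batch weights pk,t ∝ exp(−η ∑_{s<t} ℓk,s): starting from uniform weights, the products of exponentials telescope into the exponential of the cumulative loss. A quick numeric check (names of our choosing):

```python
import numpy as np

def ewa_update(p_prev, last_losses, eta):
    """One multiplicative EWA step: p_{k,t} ∝ p_{k,t-1} * exp(-eta * l_{k,t-1})."""
    w = p_prev * np.exp(-eta * last_losses)
    return w / w.sum()

eta = 0.3
losses = np.array([[1.0, 2.0], [0.5, 0.1], [2.0, 0.4]])  # three rounds, K = 2

# Incremental: apply the update round by round, starting from uniform weights.
p = np.full(2, 0.5)
for lt in losses:
    p = ewa_update(p, lt, eta)

# Batch: exponentiate the cumulative losses directly.
batch = np.exp(-eta * losses.sum(axis=0))
batch /= batch.sum()

print(np.allclose(p, batch))  # True
```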

SLIDE 14

Proof: let’s do some maths...

Lemma (Hoeffding). Let X be a random variable taking values in [0, B]. Then for any s ∈ R,

    log E[e^{sX}] ≤ s E[X] + s²B²/8

1. Upper bound the instantaneous loss ℓ̂t:

    ℓ̂t = ℓ(pt · xt, yt)
        ≤ pt · ℓ(xt, yt)                                           (by convexity)
        ≤ −(1/η) log( ∑_{k=1}^{K} pk,t e^{−η ℓk,t} ) + ηB²/8       (by Hoeffding, with s = −η)
        = −(1/η) log( (pk,t / pk,t+1) e^{−η ℓk,t} ) + ηB²/8        (by definition of pk,t+1)
        = ℓk,t + (1/η) log( pk,t+1 / pk,t ) + ηB²/8

2. Sum over all t; the sum telescopes:

    ∑_{t=1}^{n} ( ℓ̂t − ℓk,t ) ≤ (1/η) log( pk,n+1 / pk,1 ) + ηnB²/8 ≤ (log K)/η + ηnB²/8

using pk,n+1 ≤ 1 and pk,1 = 1/K; dividing by n gives Regn ≤ (log K)/(ηn) + ηB²/8.

SLIDE 15

Calibration of η

Best theoretical value: η⋆ = B⁻¹ √( (8 log K)/n )

Issue: n and B are not known in advance!

Solutions:

  • “doubling trick”
  • adaptive learning rate ηt picked according to some theoretical value
  • use several learning rates simultaneously. . .
  • calibrate on a grid by choosing

        ηt ∈ arg min_η { loss of exponential weights run with η until time t − 1 }
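A sketch of the last idea (names ours): replay exponential weights for each η on the grid over the past losses and keep the value whose run paid the least.

```python
import numpy as np

def calibrated_eta(expert_losses, grid):
    """Grid calibration: return the eta whose EWA replay paid the least so far.

    expert_losses: shape (t-1, K), the past expert losses.
    grid: iterable of candidate learning rates.
    """
    best_eta, best_loss = None, np.inf
    K = expert_losses.shape[1]
    for eta in grid:
        cum = np.zeros(K)   # cumulative losses seen by this replay
        paid = 0.0          # loss this eta would have paid (linearized: p . l)
        for lt in expert_losses:
            s = -eta * cum
            w = np.exp(s - s.max())
            w /= w.sum()
            paid += w @ lt
            cum = cum + lt
        if paid < best_loss:
            best_loss, best_eta = paid, eta
    return best_eta
```

With one consistently good expert, larger η concentrates faster and pays less, so the grid search selects it; with noisy, alternating experts a smaller η can win instead, which is exactly why the choice is left to the data.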
SLIDE 16

Application to electricity load forecasting (continued)

Benchmark and oracles (RMSE, MW):

    Uniform mean   Best forecaster   Best convex p   Best linear u
        725             744               629             629

vs. aggregated forecasts with convex weights (RMSE, MW):

    Exp. weights (best η for theory)       644
    Exp. weights (best η on data)          644
    Exp. weights (best η tuned on data)    625
    ML-Poly (tuned according to theory)    626

SLIDE 17

Evolution of the weights

No focus on a single member! The weights change significantly over time and do not converge.

(Figure: weight trajectories, illustrating that the performance of the forecasters varies over time.)

SLIDE 18

Are all forecasters useful?

Definitely yes! Dropping the worst forecaster (3 forecasters ➝ only the best 2) degrades the RMSE:

    Exp. weights   625 ➝ 644
    ML-Poly        626 ➝ 646

Forecasters not considered anymore can come back if needed.

SLIDE 19

Conclusion

This was only a small glimpse into the work performed during my PhD at EDF R&D. I applied the method to many other data sets with good results ➝ universality of the method.

Here, with Olivier, we aim at working on:

  • a huge number of experts ➝ sparse and efficient methods
  • better calibration of the learning parameter to get faster rates
  • lower bounds
  • probabilistic forecasts by using the pinball loss

Thanks