"RobExtremes" Robust Extreme Value Statistics Outline a - - PowerPoint PPT Presentation

robextremes robust extreme value statistics outline a new
SMART_READER_LITE
LIVE PREVIEW

"RobExtremes" Robust Extreme Value Statistics Outline a - - PowerPoint PPT Presentation

"RobExtremes" Robust Extreme Value Statistics Outline a New Member in the RobASt-Family of R Packages Introduction Project Robust Risk Estimation UseR! 2013 Infrastructure of R packages Packages for distributions: distr


slide-1
SLIDE 1

"RobExtremes" – Robust Extreme Value Statistics – a New Member in the RobASt-Family of R Packages UseR! 2013 Nataliya Horbenko1, Matthias Kohl2, Peter Ruckdeschel3,4

1 KPMG AG Wirtschaftsprüfungsgesellschaft,

The SQUAIRE / Am Flughafen, 60549 Frankfurt/Main, Germany

2 Furtwangen University, Dept. of Medical and Life Sciences,

Jakob-Kienzle-Straße 17, 78054 Villingen-Schwenningen , Germany

3 Fraunhofer ITWM, Dept. of Financial Mathematics,

Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany

4 TU Kaiserslautern, Dept. of Mathematics,

Erwin-Schrödinger-Straße, Geb 48, 67663 Kaiserslautern, Germany

Albacete, Spain, July 11, 2013

1

Outline Introduction

Project “Robust Risk Estimation”

Infrastructure of R packages

Packages for distributions: distr family Package for models: distrMod

Our Robustification Approach

Contribution of Robust Statistics: Outliers Packages for robust asymptotic statistics: RobASt family Optimally robust estimation in R Storing interpolation grids

New package RobExtremes

Concept GPD as parametric model

Application to OpRisk quantification

Setup Challenges Optimally robust estimation in RobExtremes Diagnostic plots 2

Introduction Project “Robust Risk Estimation”

3

Who we are: Project funded by

. . . aims at theoretical foundation, development and application of robust procedures for risk management of complex systems in the presence of extreme events common base: <theory> robust statistics <implementation> Rpkg’s of distr & RobASt families

4

slide-2
SLIDE 2

Infrastructure of R packages Packages for distributions: distr family Package for models: distrMod

5

Available Infrastructure for Distributions: distr family Package name Short description distr S4 classes for distributions distrEx Functionals for distributions distrMod S4 classes for probability models distrEllipse S4 classes for elliptically contoured distrs distrRmetrics S4 classes for distrs from fBasics & fGarch distrSim S4 classes for simulations distrTEst S4 classes for estimation and testing distrTeach Extensions for teaching distrDoc Documentation for distr packages startupmsg Utilities for start-up messages SweaveListingUtils Utilities for Sweave

6

Distributions as Objects and Arithmetics

## Create a normal and a Poisson Distribution R> N <- Norm(mean = 0, sd = 3) R> P <- Pois(lambda = 2) ## identical calls for r (RNG), d (density), ## p (cdf), q (quantile fct) R> c(p(N)(.5), p(P)(2)) [1] 0.5661838 0.6766764 R> c(q(N)(.5), q(P)(.5)) [1] 0 2 ## Arithmetics R> X <- sin(N+P) R> c(p(X)(.5), q(X)(.5)) [1] 0.6642434 0.008785884 ## plotting density, cdf and quantile fct R> plot(X)

−1.0 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.8 x d(x)

Density of AbscontDistribution

−1.0 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.8 1.0 q p(q)

CDF of AbscontDistribution

0.0 0.4 0.8 −1.0 −0.5 0.0 0.5 1.0 p q(p)

Quantile function of AbscontDistribution

7

Functionals for Distributions – the E Operator (distrEx)

## Initialize GPD-object R> GPD <- GPareto(loc = 10, scale = 2, shape = 0.5) ## Returns analytical value R> E(GPD) [1] 14 ## Classical integration of density R> E(as(GPD, "AbscontDistribution")) i.e., numerically compute E GPD = ∞

−∞ x dGPD(x) λ(dx)

[1] 13.40216 ## Integration with probability integral transform i.e., numerically compute E fun(GPD) = 1

0 fun(qGPD(x)) λ(dx)

R> E(GPD, fun=function(x){x}) [1] 13.99747 ## Identical code for all distribution objects R> E(Pois(lambda = 10)) [1] 10 8

slide-3
SLIDE 3

Maximum Likelihood Estimation (distrMod)

  • 5

10 15 20 5 10 15 20 25 30

Copper in in wholemeal flour

Observation Index Parts per million [ppm] mean 95%−CI of mean

9

Maximum Likelihood Estimation (distrMod)

Operational risk data Amount of loss Density 10000 30000 50000 0.0000 0.0005 0.0010 0.0015 ML fit Operational risk data zoomed Amount of loss Density 500 1000 1500 2000 0.000 0.002 0.004 0.006 0.008

10

Our Robustification Approach Contribution of Robust Statistics: Outliers Packages for robust asymptotic statistics: RobASt family Optimally robust estimation in R Storing interpolation grids

11

Outliers

  • What makes an observation an outlier?

– happens rarely (5%–10%) – uncontrollable, of unknown distribution, unpredictable – often: no error-free distinction from ideal obs.’ – outlier situation may change from obs. to obs.

  • Despite outliers:

realistic sample should be close to one from ideal setting

  • neighborhoods: U = {Q | Q = (1 − ε)Pθ + εH }

note: no standard mixing model H may vary from obs. to obs.!

  • accuracy as maxMSE(Sn) := maxU EQ |Sn − θ|2

12

slide-4
SLIDE 4

Available Infrastructure for Robust Asymptotic Statistics: RobASt family Package name Short description RandVar Implementation of random variables RobAStBase Robust Asymptotic Statistics ROptEst Optimally robust estimation RobAStRDA sysdata.rda for pkg’s of RobASt - Family RobExtremes Opt-rob. est’ors for extreme value distr’s. RobLox Opt-rob. ICs and est’ors for location and scale RobLoxBioC Opt-rob. est’ors for preprocessing omics data ROptEstOld Optimally robust estimation - old version ROptRegTS Opt-rob. est’ors for regression-type models RobRex Opt-rob. est’ors for regression and scale

13

Optimally robust estimation in R

## data: wholemeal flour R> library(ROptEst) R> s0 <- c(median(chem), mad(chem)) R> ROest1 <- roptest(chem, NormLocationScaleFamily()) # speed-up by interpolation: R> library(RobLox) R> ROest2 <- roblox(chem, returnIC = TRUE) R> rbind(estimate(ROest1),estimate(ROest2)) mean sd [1,] 3.163591 0.6613414 [2,] 3.338290 0.6184967

  • roptest computes estimator with

min max (as)MSE

  • w/o specifying outlier rate, roptest

selects least favor. rate RMXE

  • roptest takes 28sec,

roblox ∼ 0.1sec

  • 5

10 15 20 5 10 15 20 25 30

Copper in in wholemeal flour

Observation Index Parts per million [ppm] mean 95%−CI of mean RMXE 95%−CI of RMXE

14

Storing interpolation grids — some R insights just seen: interpolation is useful; technique: not only store grids, but also interpolating fct’s Issues and solutions / lessons learnt

issue return values from approxfun and splinefun generated ≤ R-2.15.2 no longer valid ≥ R-3.0.0 and vice versa solution store two sets of interpolators and switch acc. to R-version at run-time issue with many models/procedures, pkg containing interpolators can get large conflict with CRAN policies solution delegate interpolators to separate (less frequently updated) package RobAStRDA issue conflicts with namespaces when modifying interpolators outside pkg solution functions to generate/manipulate interpolators from within pkg namespace

15

New package RobExtremes Concept GPD as parametric model

16

slide-5
SLIDE 5

New package RobExtremes

  • infrastructure for opt-rob. estimation for extreme value

distributions / scale shape models, i.e.,

– Gamma – Gumbel – Generalized Extreme Value D. – Generalized Pareto D. – Weibull – Pareto

  • particular methods for expectations
  • high breakdown starting estimators

– scale functionals: Sn and Qn (Rousseeuw&Croux[93]), kMAD (asym. variant of mad, R.&Horbenko[12]) – LDEstimators (Marazzi&Ruffieux[99]) in particular medkMAD, medSn, and medQn – Pickands’ estimator (including asy. variance and IC) – Quantile estimator for Weibull (Boudt et al.[11])

  • speed up by interpolation for opt-rob. estimators
  • enhanced diagnostic plots (from RobAStBase)

17

Extreme Value Setup: GPD as parametric model

  • Fisher-Tippett-Gnedenko Theorem:

possible limit distributions of max(Xi) have cdf Hθ(x) = exp(−(1 + ξ(x − µ)/β)−1/ξ) (GEVD)

  • Pickands-Balkema-de Haan Theorem:

linked to tails ∼ Generalized Pareto distribution (GPD) cdf Fθ(x) = 1−(1 + ξ(x − µ)/β)−1/ξ Parameter θ = (ξ, β, µ)τ:

  • shape ξ (≥ 0) (tail behavior)

[goal]

  • scale β

[goal]

  • location/threshold µ (≤ x)

[fixed]

1e−04 1e−01 1e+02 1e+05 0.0 0.2 0.4 0.6 0.8 1.0

GPD: Fθ(x) ξ = 0.7 β = 1 µ = 0 18

Application to OpRisk quantification Setup Challenges Optimally robust estimation in RobExtremes Diagnostic plots

19

Application to OpRisk quantification Setup

  • OpRisk :: risk of loss resulting from inadequate or failed internal processes, people

and systems or from external events

  • Basel II: standards for regulatory capital required to cover losses from OpRisk
  • assessed by Loss Distribution Approach (LDA):

model severity and frequency of losses separately and cell-wise in a matrix built by business lines and event types, OpRisk quantified as 99%-OpVaR, i.e., 99% quantile of resp. compound distr.

  • involves parameter estimation in GPD

Data

[source Algorithmics, Inc. (IBM)]

  • data: losses in business line Asset Management (AM in the sequel)
  • collected from 2431 institutes in last 20yrs
  • 600 observed damages > 1Mio USD
  • frequency: λ = 0.012/yr

20

slide-6
SLIDE 6

OpRisk: Challenges

  • Modeling the severity distribution of operational losses
  • combining internal and external data (scaling problem)
  • defining the threshold between “body” (frequent, moderate)

and “tail” (rare, severe) losses

  • estimating the parameters for “tail” of severity distribution

Robustness issues:

  • few outliers can render estimated opRisk useless
  • relevance problem due to pooling

21

Optimally robust estimation in RobExtremes

## now the same at OpRisk data ## robust starting estimator R> GP <- GParetoFamily() R> est0 <- medkMAD(AM1-1.6, GP, k=10) .... scale shape 12.082985 1.625142 .... ## opt-robust estimator R> roptest(x = AM1 - 1.6, GP) .... scale shape 13.153116 1.391451 ( 1.336223) ( 0.116243) ....

  • last calculation w/o speed-up takes

240sec, with interpolation speed-up ∼ 3sec Histogram and fitted model densities

Amount of loss in AM [Mio. USD] Density of log10(loss) 1 10 100 1000 10000 1e+05 0.0 0.1 0.2 0.3 0.4 0.5 0.6 MLE−fitted GPD RMXE−fitted GPD medkMAD−fitted GPD

mind log-scale for x-axis 22

Diagnostic plots

  • major contribution in RobAStBase, ROptEst, RobExtremes:

enhanced diagnostic plots to assess estimation quality, i.e.

– influence curve plots – information plots – (enhanced) qqplots –

  • utlyingness plots,

– (not here:) cniper plots

  • all are very flexible and powerful but have many arguments
  • Dasha Pupashenko and Misha Pupashenko have provided

wrapper functions to ease use for beginners

  • each wrapper returns the call to the original powerful

function user may easily enhance his plot

23

Application to OpRisk quantification: Diagnostics Influence Curve Plot

  • shows the local influence of data
  • n both estimated parameters
  • radii according to total influence
  • n parameter estimation
  • top panel: MLE
  • bottom panel: RMXE
  • command:

plot(IC,data, ...)

24

slide-7
SLIDE 7

Application to OpRisk quantification: Diagnostics Information Plot

[with synthetic data here]

  • checks for which parameter

coordinate (scale or shape) information is used

  • in rel. Info plots: percentage of

info used for this coord.

  • shows: large obs. mainly used

for shape, scale determined by medium range obs.

  • circles of observation radii

according to their relative total influence on both parameters

  • command:

infoPlot(IC,data, ...)

  • ● ●

20 40 60 80 100

  • abs. info

IC MLE IC OMSE

0.0 0.2 0.4 0.6 0.8 1.0

  • rel. info scale
  • ● ●
  • ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ●● ●● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  • ●●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
  • ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  • 0.5

1.0 2.0 5.0 10.0 20.0 50.0 200.0 0.0 0.2 0.4 0.6 0.8 1.0

  • rel. info shape
  • ● ● ●●●
  • ● ● ●●●
  • 25

Application to OpRisk quantification: Diagnostics QQ Plot

  • checks goodness of fit
  • we add outlier-adjusted

confidence bands (with least

  • fav. rate)
  • bservation are plotted as circles
  • radii according to influence on

parameter estimation

  • command:

qqplot(data,model, ...)

  • −6

−4 −2 2 4 6 8 −5 5

True log−quantiles Estimated log−quantiles

  • utlier−adjusted symm pointw. asympt. α = 95 %− conf. interval
  • utlier−adjusted simult. asympt. α = 95 %− conf. interval

26

Application to OpRisk quantification: Diagnostics Outlyingness Plot

  • identifies outlying points

according to both size and influence

  • influence measured at classical

(non-robust) scale

  • cutoffs chosen according to

robust procedures (cov, model quantile)

  • four labeled observations are

caused by B. L. Madoff fraud,

  • ne from the Ponzi scheme
  • command:
  • utlyingPlotIC(data,

IC.class, IC.rob, ...)

  • log−quantiles

Mahalanobis distance 99.00%−cutoff = 8.31 99.00%−cutoff = 4.74 246 318 320 321 322 444 27

Application to OpRisk quantification: Threshold plot OpVaR Calculation

  • plot threshold µ of GPD against

estimated OpVaR

  • common practice: select µ

where curve stabilizes, i.e., ∼ X430

  • higher degree of stability of

OMSE already on real data

  • in lower panel: added 5 large
  • utliers
  • OMSE-choice not affected;

sharp rise in OpVaR for MLE

210 220 230 240 250

real data

Threshold OpVaR, Mio. USD x1 x300 x400 x470 x500 OMSE MLE 240 260 280 300

contaminated data

Threshold OpVaR, Mio. USD x1 x300 x400 x470 x500 OMSE MLE

28

slide-8
SLIDE 8

Conclusions

  • shown some class infrastructure for conceptually dealing with distributions in R
  • emphasized the importance of combining robustness ideas to extreme value

modeling to stabilize inference and predictions

  • introduced you to package RobExtremes to achieve this
  • demonstrated some easy-to-use functionality
  • there is interesting additional diagnostics valuable beyond robustness

STILL: our contamination mechanism to capture outliers is an idealization! “. . . as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.” (Donald Rumsfeld (Feb.2002)) . . . so protecting against things we do not know remains difficult

29

Bibliography

  • K. Boudt, D. Caliskan, C. Croux, (2011). Robust explicit estimators of Weibull parameters. Metrika, 73 (2),

187–209.

  • N. Horbenko (2012). Robuste Ansätze für Operationelle Risiken von Banken. PhD Thesis, TU Kaiserslautern,

Verlag Dr. Hut.

  • N. Horbenko, P

. Ruckdeschel, and T. Bae (2011). Robust Estimation of Operational Risk. The Journal of Operational Risk 6(2), 3–30.

  • M. Kohl, H. Rieder, and P

. Ruckdeschel, (2010). Infinitesimally Robust Estimation in General Smoothly Parametrized Models. Stat. Methods Appl., 19, 333–354.

  • M. Kohl and P

. Ruckdeschel (2010). R package distrMod: Object-Oriented Implementation of Probability

  • Models. J. Statist. Softw. 35(10), 1–27.
  • A. Marazzi and C. Ruffieux (1999). The truncated mean of an asymmetric distribution. Comput. Statist. Data

Anal., 32, 79–100.

  • H. Rieder (1994). Robust Asymptotic Statistics. Springer-Verlag.
  • P

.J. Rousseeuw, and C. Croux, (1993). Alternatives to the Median Absolute Deviation, Journal of the American Statistical Association 88, 1273–1283.

  • P

. Ruckdeschel and N. Horbenko (2012). Yet another breakdown point notion: EFSBP — illustrated at scale-shape models. Metrika, 75 (8), 1025–1047.

  • P

. Ruckdeschel, M. Kohl, T. Stabla, and F. Camphausen (2006). S4 Classes for Distributions. R News, 6(2), 2–6 Software

  • R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for

Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org.

  • M. Ribatet (2009). R package POT: Generalized Pareto Distribution and Peaks Over Threshold. R Package version

1.1-2 on CRAN.

  • P

. Ruckdeschel, M. Kohl, N. Horbenko (2013). R package RobExtremes: Optimally robust estimation for extreme value distributions. R Package version 0.9 on R-Forge. 30