 
The difficulty of verifying small improvements in forecast quality
Alan Geer
Satellite microwave assimilation team, Research Department, ECMWF (Day job: all-sky assimilation)
Thanks to: Mike Fisher, Michael Rennie, Martin Janousek, Elias Holm, Stephen English, Erland Kallen, Tomas Wilhelmsson and Deborah Salmond
Slide 1, WMO 7th verification workshop, May 8-11, 2017
The viewpoint from an NWP research department
● Not:
- What is the skill of a forecast?
- Is one NWP centre’s forecast better than another?
● But this:
- Is one experiment better than another?
- Is the new cycle (upgrade) better than current operations?
● Philosophy:
- Lots of small improvements add up to generate better forecasts.
Research to operations
[Diagram: an individual scientist runs experiments (Expt. A to Expt. D) against a control system; the team merge adds contributions from other scientists on the team; cycle merge testing adds contributions from other teams (Team X, then Team X + Y, then Team X + Y + Z), each tested against a control; the result is the upgrade to the operational system.]
“Iver”: an R&D-focused verification tool
[Figure: normalised change in RMS error in 500hPa geopotential]
Figure annotations:
- Confidence interval inflation to account for time-correlations of paired differences in forecast errors
- Correction for multiplicity:
• 4 separate experiments
• 2 hemispheres
• roughly 2 independent scores across days 3-10
- 95% confidence intervals, based on the paired-difference t-test
- Experiments progressively adding different components for the cycle upgrade
- Control: current operational system (“Cycle 43R1”)
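The plotted quantity can be sketched as follows. This is a minimal illustration that assumes the change is normalised by the control's RMS error, which the slide does not state explicitly; the function names and the error values are hypothetical, not taken from Iver.

```python
import math

def rmse(errors):
    """Root-mean-square of a sequence of forecast errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def normalised_rmse_change(experiment_errors, control_errors):
    """(RMSE_exp - RMSE_ctl) / RMSE_ctl; negative means the experiment is better.

    Normalising by the control RMSE is an assumption made for this sketch.
    """
    rmse_ctl = rmse(control_errors)
    return (rmse(experiment_errors) - rmse_ctl) / rmse_ctl

# Made-up forecast errors for illustration only (not ECMWF data)
control_errors = [2.0, -2.0, 2.0, -2.0]
experiment_errors = [1.0, -1.0, 1.0, -1.0]

change = normalised_rmse_change(experiment_errors, control_errors)
print(f"{change:+.2f}")  # prints -0.50
```

A value of -0.50 here would mean the experiment halves the RMS error of the control; real changes between NWP experiments are far smaller.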
Latitude-pressure verification
[Figure: normalised change in std. dev. of error in Z (experiment - control)]
A typical dilemma in NWP development:
• Should we accept a degradation in stratospheric scores to improve tropospheric midlatitude scores?
• Do we even believe the scores are meaningful?
Cross-hatching: significant at 95% using t-test with Šidák correction, assuming one panel contains 20 independent tests
Blue = reduction in error = experiment better than control
Latitude-longitude verification
Because many improvements (and degradations) are local
[Map: normalised change in RMS T error at 850hPa, colour scale from -0.2 to +0.2]
Are these patterns statistically significant?
- Requires multiplicity correction: work in progress
But are these patterns useful despite the lack of significance testing?
- Yes, this turned out to be a problem associated with a new aerosol climatology that put too much optical depth over the Gulf of Guinea
- Too much optical depth = too much IR radiative heating at low levels = local temperatures too warm
Statistical problems in NWP research & development
● The issues:
- Every cycle upgrade generates hundreds of experiments
- NWP systems are already VERY good: experiments usually test only minor modifications, with small expected benefits to forecast scores
- Much of what we do is (in the software sense) regression testing:
• We are checking for unexpected changes or interactions (bugs) anywhere in the atmosphere, at any scale
• Verification tools will generate 10,000+ plots, and each of those plots may itself contain multiple statistical tests
● Accurate hypothesis testing (significance testing) is critical:
- Type I error = rejection of null hypothesis when it is true = false positive. Can be more frequent than expected due to:
• (1) Multiple testing (multiplicity)
• (2) Temporal correlation of forecast error
- Type II error = failure to reject null hypothesis when it is false
• (3) Changes in forecast error are small; many samples required to gain significance
● (4) Are our chosen scores meaningful and useful?
1. Multiple comparisons (multiplicity)
● 95% confidence = 0.95 probability of NOT making a type I error
● What if we make 4 statistical tests at 95% confidence?
- Probability of not making a type I error in any of the four tests (assuming the tests are independent) is: 0.95 × 0.95 × 0.95 × 0.95 = 0.81
- We have gone from 95% confidence to 81% confidence.
- There is now roughly a 1 in 5 chance of at least one test falsely rejecting the null hypothesis (i.e. falsely showing “significant” results)
● Šidák correction:
- $P_{\mathrm{TEST}} = (P_{\mathrm{FAMILY}})^{1/n}$
- If we want a family-wide confidence level of 0.95, then each of the four tests should be performed at 0.987
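The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the n tests are independent; the function names are illustrative.

```python
def family_confidence(per_test_confidence, n_tests):
    """Probability of no type I error in any of n independent tests."""
    return per_test_confidence ** n_tests

def sidak_per_test_confidence(family_conf, n_tests):
    """Sidak correction: P_TEST = P_FAMILY ** (1 / n)."""
    return family_conf ** (1.0 / n_tests)

uncorrected = family_confidence(0.95, 4)
per_test = sidak_per_test_confidence(0.95, 4)

print(round(uncorrected, 2))  # prints 0.81
print(round(per_test, 3))     # prints 0.987
```

Running four tests at the corrected level of 0.987 brings the family-wide confidence back to 0.95.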
Shouldn’t n be very large?
● If we generate 10,000+ plots, why isn’t n > 10,000?
● Because many of the forecast scores we examine are NOT independent
Testing the statistical significance testing
Geer (2016, Tellus): Significance of changes in forecast scores
● Three experiments with the full ECMWF NWP system, each run over 2.5 years:
- Control
- AMSU-A denial: Remove one AMSU-A (an important source of temperature information) from the observing system
- Chaos: Change a technical aspect of the system (number of processing elements) that causes initially tiny numerical differences in the results, which quickly grow.
▪ A representation of the null hypothesis: no scientific change
Correlation of paired differences in other scores with paired differences in day-5 Z RMSE scores
● All the dynamical scores are fairly correlated over the troposphere, and with one another
→ Z500 RMSE is sufficient to verify tropospheric synoptic forecasts in the medium range
● But the stratospheric scores, and relative humidity, appear more independent
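One way to judge whether two scores count as independent tests is to correlate their paired differences (experiment minus control) forecast by forecast. The sketch below uses a plain Pearson correlation, which is an assumption about how the correlations on this slide were computed; the score values and variable names are made up for illustration.

```python
import math

def paired_differences(experiment, control):
    """Experiment minus control, forecast by forecast."""
    return [e - c for e, c in zip(experiment, control)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up RMSE scores for five forecasts (illustration only, not ECMWF data)
z500_exp = [50.1, 48.7, 52.3, 49.9, 51.2]
z500_ctl = [50.5, 48.9, 52.0, 50.6, 51.0]
t850_exp = [2.10, 1.95, 2.30, 2.05, 2.20]
t850_ctl = [2.13, 1.97, 2.27, 2.11, 2.19]

r = pearson(paired_differences(z500_exp, z500_ctl),
            paired_differences(t850_exp, t850_ctl))
print(f"correlation of paired differences: {r:.2f}")
```

A correlation close to 1 suggests the two scores carry largely the same information and should not be counted as two independent tests.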
Correlation of paired differences in scores at other time ranges with paired differences in day-5 Z RMSE scores
● Scores are correlated over a few days through the time range
→ Day 5 RMSE Z is sufficient to verify the quality of (roughly) the day 4 to day 6 forecasts
What is a reasonable n?
● For the regional scores, n is the product of:
• Number of experiments
• Two forecast ranges (medium-range and long-range)
• Two hemispheres
● But why not also count the stratosphere, tropics, lat-lon verification?
● For the moment, n is computed independently for each style of plot
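As a rough sketch of what this choice of n implies, the snippet below multiplies the factors quoted for the regional scores (4 experiments, 2 forecast ranges, 2 hemispheres) and shows how the Šidák correction widens the critical value of each test. It uses a normal quantile in place of the t quantile, which is a simplifying assumption; the function names are illustrative.

```python
from statistics import NormalDist

def independent_test_count(n_experiments, n_forecast_ranges, n_hemispheres):
    """n for the regional scores: the product of the independent factors."""
    return n_experiments * n_forecast_ranges * n_hemispheres

def sidak_critical_value(family_confidence, n_tests):
    """Two-sided critical value per test after a Sidak correction.

    Uses the standard normal distribution as a large-sample stand-in
    for the t distribution (an assumption of this sketch).
    """
    per_test_confidence = family_confidence ** (1.0 / n_tests)
    return NormalDist().inv_cdf(0.5 + per_test_confidence / 2.0)

n = independent_test_count(n_experiments=4, n_forecast_ranges=2, n_hemispheres=2)
z_single = sidak_critical_value(0.95, 1)
z_family = sidak_critical_value(0.95, n)

print("n =", n)
print(f"critical value: {z_single:.2f} for one test, {z_family:.2f} for n tests")
```

The confidence intervals on each plot widen by the ratio of the two critical values, so an overestimate of n makes it harder to detect real changes.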
2. Type I error (false rejection of the null hypothesis) due to time-correlation of forecast errors
The chaos experiment should generate false positives at the rate set by the chosen confidence level (e.g. 5% of tests at 95% confidence). Instead, naive testing generates false positives far more frequently.
[Figure: chaos - control, computed on 8 chunks of 230 forecasts, comparing the 95% t-test with k=1.22 (inflation for time-correlation) against the 95% t-test with k=1 (no inflation)]
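The effect of the inflation factor k can be sketched as below. This assumes k multiplies the standard error of the mean paired difference, and it substitutes a normal quantile for the t quantile for simplicity; with samples as small as this toy one, a t quantile would be the better choice. The scores and names are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_difference_interval(experiment, control, k=1.0, confidence=0.95):
    """Confidence interval for the mean paired difference (experiment - control).

    k inflates the standard error to allow for time-correlated differences;
    k=1 is the naive test that treats every forecast as independent.
    """
    if len(experiment) != len(control):
        raise ValueError("scores must be paired forecast by forecast")
    diffs = [e - c for e, c in zip(experiment, control)]
    std_err = k * stdev(diffs) / math.sqrt(len(diffs))
    # Normal approximation to the t quantile (assumption of this sketch)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    centre = mean(diffs)
    return centre - z * std_err, centre + z * std_err

# Made-up RMSE scores for illustration only (not ECMWF data)
experiment = [50.1, 48.7, 52.3, 49.9, 51.2, 50.4, 49.0, 51.8]
control = [50.5, 48.9, 52.0, 50.6, 51.0, 50.9, 49.1, 51.6]

naive = paired_difference_interval(experiment, control, k=1.0)
inflated = paired_difference_interval(experiment, control, k=1.22)
print("k=1.00:", naive)
print("k=1.22:", inflated)
```

A change is flagged as significant only when the interval excludes zero, so the wider interval with k=1.22 rejects the null hypothesis less often than the naive test.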
3. Type II error: failure to reject the null hypothesis
The AMSU-A denial experiment should degrade forecast scores. AMSU-A is a very important source of data, known to provide benefit to forecasts.
[Figure: AMSU-A denial - control, computed on 8 chunks of 230 forecasts]
Figure annotations:
- Based on 2.5 years of testing, we know the AMSU-A denial impact is this
- But on 230 forecasts (about 4 months) we might get this: Type II error
Fighting type II error: How many forecasts are required to get significance?
[Figure: 1 independent test (e.g. we have one experiment and all we care about is NH day 5 RMSE)]
Figure annotations:
- Once in a while (e.g. moving from 3D-Var to 4D-Var)
- A typical cycle upgrade?
- A typical individual change, e.g. one AMSU-A
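The question on this slide can be approximated with the textbook sample-size formula for a paired test. This is not necessarily the method behind the slide's figure: it assumes independent, normally distributed paired differences, ignores time-correlation and multiplicity, and the effect size, standard deviation and 80% power used below are hypothetical.

```python
import math
from statistics import NormalDist

def required_forecasts(effect, diff_std, confidence=0.95, power=0.80):
    """Forecasts needed to detect a mean paired difference of size `effect`.

    Standard normal-approximation formula:
        n = ((z_conf + z_power) * diff_std / effect) ** 2
    where diff_std is the standard deviation of the paired differences.
    """
    z_conf = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_conf + z_power) * diff_std / abs(effect)) ** 2)

# Hypothetical numbers: paired differences scatter 10 times more than the mean change
n_small_change = required_forecasts(effect=1.0, diff_std=10.0)
n_large_change = required_forecasts(effect=4.0, diff_std=10.0)
print(n_small_change, n_large_change)
```

Because n scales with the inverse square of the effect size, halving the expected improvement roughly quadruples the number of forecasts needed.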
4. Are our scores meaningful? Changing the reference changes the results
Problem areas: Tropics, stratosphere, any short-range verification, any verification of humidity
[Figure: columns for SH, Tropics and NH; rows for temperature, geopotential, vector wind and relative humidity; verified against own analysis compared with verified against operational analyses]
Observational verification “obstats”
Example: verification against aircraft temperature measurements (AIREP)
[Figure: change in std. dev. of error of the T+12 forecast, relative to control]
Summary: four issues in operational R&D verification
1. Type I error due to multiple comparisons:
• Try to determine how many independent tests n are being made (e.g. compute correlation between scores)
• Paired differences in medium range dynamical tropospheric scores are all quite correlated
• Paired differences are correlated at different forecast ranges
• Once n is estimated, use a Šidák correction
2. Type I error due to time-correlated forecast error:
• Chaos experiment used to validate an AR(2) model for correcting time-correlations
• Note that at forecast day 10, this may not work: long-range time-correlations?