Fundamentals of Statistical Monitoring: The Good, Bad, & Ugly



slide-1
SLIDE 1

Fundamentals of Statistical Monitoring: The Good, Bad, & Ugly in Biosurveillance

Galit Shmueli
Dept of Decision & Info Technologies
Robert H Smith School of Business
University of Maryland, College Park

slide-2
SLIDE 2

Overview

The main idea behind statistical monitoring

Traditional monitoring tools

  • Control charts
  • Regression models

Moving to pre-diagnostic data

slide-3
SLIDE 3

The main idea

Monitor a stream of incoming data, and signal an alarm if there is an indication of abnormality

Defining “abnormality” requires first defining what is normal

slide-4
SLIDE 4

Any P&I outbreak(s) in Newark, NJ in this period (2004-2006)?

[Figure: weekly % P&I deaths (relative to overall deaths), by week, 1/3/2004 to 12/31/2005]

1. Yes
2. No

Audience poll results: 43%, 57%

slide-5
SLIDE 5

Any outbreak(s) of Gonorrhea in Mass. in this period?

[Figure: weekly Gonorrhea counts in Mass., ‘04-‘06, by week, 1/3/2004 to 12/31/2005]

1. Yes
2. No

Audience poll results: 77%, 23%

slide-6
SLIDE 6

Control charts: Shewhart charts

Originally used to monitor a process mean in an industrial setting.

Assumption: there is an “in-control” mean, and we want to detect when it goes “out-of-control”.

Natural variability vs. “special cause”

Method: draw a small random sample at repeated time intervals, and compare the sample mean to lower/upper thresholds.

If the sample mean exceeds a threshold, then trigger an alarm and stop the process.

slide-7
SLIDE 7

What is “normal”? The mean ($\bar{X}$) should be Normal!

$$P\left(\mu - \frac{3\sigma}{\sqrt{n}} \le \bar{X} \le \mu + \frac{3\sigma}{\sqrt{n}}\right) = 0.9973$$
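As a quick numerical check of the 0.9973 figure, the sketch below evaluates the probability that a Normal sample mean falls within 3 standard errors of the process mean. The helper name `normal_cdf` is an illustrative choice, not something from the slides.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(mu - 3*sigma/sqrt(n) <= X-bar <= mu + 3*sigma/sqrt(n)) when X-bar is Normal
coverage = normal_cdf(3.0) - normal_cdf(-3.0)
print(round(coverage, 4))  # 0.9973
```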

slide-8
SLIDE 8

The X-bar chart (A Shewhart 3-sigma chart)

$$CL = \mu, \qquad LCL,\ UCL = CL \pm 3\sigma/\sqrt{n}$$

The thresholds take into account the variability of the sample mean around the process mean

slide-9
SLIDE 9

Shewhart chart assumptions

The statistic measured at time t is normally distributed

  • If a single measurement is taken every time unit, we assume the measurements are normally distributed. This is called an “i-chart”
  • If the statistic is a rate, you have a “p-chart”

Samples taken at different time points are independent of each other

slide-10
SLIDE 10

The X-bar chart: Example

sample   X1    X2    X3    X4    X5    x-bar
  1      240   243   250   253   248   246.8
  2      238   242   245   251   247   244.6
  3      239   242   246   250   248   245
  4      235   237   246   249   246   242.6
  5      240   241   246   247   249   244.6
  6      240   243   244   248   245   244
  7      240   243   244   249   246   244.4
  8      245   250   250   247   248   248
  9      238   240   245   248   246   243.4
 10      240   242   246   249   248   245
 11      240   243   246   250   248   245.4
 12      241   245   243   247   245   244.2
 13      247   245   255   250   249   249.2
 14      237   239   243   247   246   242.4
 15      242   244   245   248   245   244.8
 16      237   239   242   247   245   242
 17      242   244   246   251   248   246.2
 18      243   245   247   252   249   247.2
 19      243   245   248   251   250   247.4
 20      244   246   246   250   246   246.4
 21      241   239   244   250   246   244
 22      242   245   248   251   249   247
 23      242   245   248   243   246   244.8
 24      241   244   245   249   247   245.2
 25      236   239   241   246   242   240.8
 26      243   246   247   252   247   247
 27      241   243   245   248   246   244.6
 28      239   240   242   243   244   241.6
 29      239   240   250   252   250   246.2
 30      241   243   249   255   253   248.2

Data from Philips Semiconductors. 30 samples, each of n = 5 silicon wafers, were taken, one sample per time unit. The thickness of each wafer was recorded, and the sample mean calculated.

Target thickness = 244

Standard deviation σ = 3.1

slide-11
SLIDE 11

The X-bar chart: Example (cont.)

$$CL = 244, \qquad LCL,\ UCL = 244 \pm 3 \times 3.1/\sqrt{5}$$

$$LCL = 239.84, \qquad UCL = 248.16$$

[Figure: X-bar chart of the sample means (x-bar) over time, samples 1 to 30]
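The limit calculation for the wafer example can be sketched in a few lines. The x-bar values, target (244), σ (3.1) and n (5) are taken from the slides; the function and variable names are illustrative assumptions.

```python
import math

# Sample means (x-bar column) of the 30 wafer samples from the example
xbars = [246.8, 244.6, 245, 242.6, 244.6, 244, 244.4, 248, 243.4, 245,
         245.4, 244.2, 249.2, 242.4, 244.8, 242, 246.2, 247.2, 247.4, 246.4,
         244, 247, 244.8, 245.2, 240.8, 247, 244.6, 241.6, 246.2, 248.2]

def xbar_limits(mu, sigma, n):
    """3-sigma limits for the mean of n observations: mu +/- 3*sigma/sqrt(n)."""
    half_width = 3.0 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

lcl, ucl = xbar_limits(mu=244.0, sigma=3.1, n=5)

# 1-based sample numbers whose mean falls outside the limits
alarms = [i + 1 for i, x in enumerate(xbars) if x < lcl or x > ucl]

print(round(lcl, 2), round(ucl, 2))  # 239.84 248.16
print(alarms)                        # [13, 30]
```

With these limits, only samples 13 and 30 fall outside, both above the UCL.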

slide-12
SLIDE 12

Shewhart chart for weekly data

Use a “stable” period to estimate the mean and std for the thresholds (used 2004)

[Figure: Shewhart charts by week, 1/3/2004 to 12/31/2005. Panels: % P&I Deaths in Newark, NJ; Gonorrhea in Mass.]

slide-13
SLIDE 13

When will a Shewhart signal an alarm?

Probability that a point exceeds the limits, when the process mean shifts by k std:

  k     P(Alarm)
  0     .0027
  1     .0228
 -1     .0228
  2     .1587
  3     .5000
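The alarm probabilities in the table can be reproduced under the chart's Normality assumption. This is a minimal sketch that assumes the shift k is measured in standard deviations of the plotted statistic; `normal_cdf` and `alarm_probability` are illustrative names.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def alarm_probability(k):
    """P(point falls outside the 3-sigma limits) after the mean shifts by k
    standard deviations of the plotted statistic (assumed Normal)."""
    above_ucl = 1.0 - normal_cdf(3.0 - k)
    below_lcl = normal_cdf(-3.0 - k)
    return above_ucl + below_lcl

for k in (0, 1, -1, 2, 3):
    print(k, round(alarm_probability(k), 4))
```

Even a shift of 2 standard deviations is caught at a given time point only about 16% of the time.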

slide-14
SLIDE 14

How often should we expect a false alarm with a Shewhart chart? (with weekly data)

1. Every other week
2. Once a month
3. Once a year
4. Once in 15.5 years
5. Once in 7 years

Audience poll results: 36%, 43%, 0%, 4%, 18%

1/0.0027 = 370 weeks ≈ 7 years
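The 370-week figure is the expected waiting time between false alarms. The sketch below assumes the weekly points are independent and Normal, so the waiting time is geometric with mean 1/p; the names are illustrative.

```python
import math

def false_alarm_probability(limit_width=3.0):
    """P(an in-control point falls outside +/- limit_width standard errors)."""
    return 1.0 - math.erf(limit_width / math.sqrt(2.0))

p = false_alarm_probability()

# Mean of a geometric waiting time with success probability p
weeks_between_false_alarms = 1.0 / p
years_between_false_alarms = weeks_between_false_alarms / 52.0

print(round(weeks_between_false_alarms), round(years_between_false_alarms, 1))  # 370 7.1
```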

slide-15
SLIDE 15

Catch #1: How to set LCL, UCL?

Best: underlying domain knowledge

  • “Rate of Gonorrhea in population above X considered outbreak”
  • “Number of weekly cases above X…”

In the absence of domain knowledge, use historical data

  • To estimate the population parameters
  • Make sure the historic period has no outbreaks! How to determine?

The bad: lack of gold standards

slide-16
SLIDE 16

Catch #2: are the data normal?

[Figures: histograms of Gonorrhea counts and of binned % P&I Deaths]

If not, two tricks:

  • Transform the data (right skew -> take log)
  • Use a more suitable Shewhart chart

[Figure: bar chart of binned ln(% P&I Deaths)]
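The effect of the log transform on right skew can be illustrated with a small sketch. It uses synthetic lognormal draws, not the P&I series, and the `skewness` helper is an illustrative moment-based estimate.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic right-skewed series (NOT the slide data): lognormal draws
rng = random.Random(0)
raw = [rng.lognormvariate(1.5, 0.5) for _ in range(500)]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(raw_skew, log_skew)
```

A skewness near zero after the transform suggests the Normality assumption is more plausible on the log scale. A histogram, as on the slide, is still the more direct check.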

slide-17
SLIDE 17

Shewhart chart for transformed data

[Figure: Shewhart charts by week, 1/3/2004 to 12/31/2005]

slide-18
SLIDE 18

Catch #3: are the counts correlated?

[Figures: ACF plot for Gonorrhea and ACF plot for %P&I Deaths, lags 1 to 10, with upper and lower confidence limits (UCI, LCI)]

Compute autocorrelation at lag 1, 2, …

If autocorrelated at a low lag, need a time-series model

If autocorrelated at constant multiples, then there is seasonality
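A sample autocorrelation at a given lag can be computed as sketched below. The series is synthetic, with a period of 4 time units, and is used only to show how seasonality appears at multiples of the period; `autocorrelation` is an illustrative helper.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

# Synthetic series with a period of 4 time units (illustration only)
series = [1, 0, -1, 0] * 10
acf = {lag: autocorrelation(series, lag) for lag in range(1, 9)}
print(acf)
```

Here the autocorrelation is large and positive at lags 4 and 8 (multiples of the period), which is the pattern the slide associates with seasonality.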

slide-19
SLIDE 19

Shewhart Charts – useful for biosurveillance?

The good:

  • When assumptions are satisfied, these charts are good at quickly detecting large spikes/dips
  • Very simple

The bad:

  • An outbreak that manifests as smaller, consistent increases will go undetected
  • Hard in some cases to determine the “normal period”

The ugly: Assumptions are often violated. Even more so with pre-diagnostic data.

slide-20
SLIDE 20

Detecting small or other types of changes

Method 1: make the Shewhart more sensitive

Method 2: use a different chart altogether

slide-21
SLIDE 21

Shewhart chart with extra alarming rules

Western Electric Rules (1956): signal if (in addition to exceeding LCL, UCL):

  • 8 consecutive points are on one side of the CL
  • 2 of 3 consecutive points are in zone A
  • 6 points in a row are steadily increasing/decreasing

[Figure: control chart over time t with lines at 3, 2, 1, -1, -2, -3 standard deviations and zones A1, A2, B1, B2, C1, C2]

  • Increases false alarms
  • Choose only relevant rules
  • Don’t run all rules together
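The three extra rules can be sketched as simple window checks on standardized points, z = (statistic - CL) / (std of the statistic). This is a minimal sketch: the function names are illustrative, and "zone A" is assumed to mean more than 2 standard deviations from the CL on one side.

```python
def run_on_one_side(z_scores, run_length=8):
    """Indices where `run_length` consecutive points lie on one side of the CL."""
    signals = []
    for i in range(run_length - 1, len(z_scores)):
        window = z_scores[i - run_length + 1:i + 1]
        if all(z > 0 for z in window) or all(z < 0 for z in window):
            signals.append(i)
    return signals

def two_of_three_in_zone_a(z_scores):
    """Indices where 2 of the last 3 points are beyond 2 sigma on the same side
    (assumed definition of zone A)."""
    signals = []
    for i in range(2, len(z_scores)):
        window = z_scores[i - 2:i + 1]
        if sum(z > 2 for z in window) >= 2 or sum(z < -2 for z in window) >= 2:
            signals.append(i)
    return signals

def steady_trend(z_scores, run_length=6):
    """Indices where `run_length` points in a row steadily increase or decrease."""
    signals = []
    for i in range(run_length - 1, len(z_scores)):
        window = z_scores[i - run_length + 1:i + 1]
        diffs = [b - a for a, b in zip(window, window[1:])]
        if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
            signals.append(i)
    return signals

# Made-up standardized values for illustration
z = [0.4, 0.9, 0.2, 1.1, 0.6, 0.3, 0.8, 0.5, -0.2, 2.3, 0.1, 2.6]
print(run_on_one_side(z))         # [7]
print(two_of_three_in_zone_a(z))  # [11]
print(steady_trend(z))            # []
```

Keeping each rule as a separate function makes it easy to enable only the relevant ones, in line with the caution about false alarms.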
slide-22
SLIDE 22

Detecting a shift with a known pattern

Shewhart charts:

Moving Average charts (with window of 4):

[Figures: for each chart type, the pattern of the mean µ over time t, shifting from µ0 to µ0 + δ]

slide-23
SLIDE 23

Detecting a shift with a known pattern – cont.

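CuSum charts:

EWMA charts:

[Figures: for each chart type, the pattern of the mean µ over time t, starting from µ0; one panel marks the level µ0 + δ]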

slide-24
SLIDE 24

Chart assumptions

Target mean is constant

The statistic measured at time t is normally distributed

Samples taken at different times are independent of each other

slide-25
SLIDE 25

The Moving-Average (MA) chart for single daily counts

Points on the plot are averages of a sliding window:

$$MA_t = (X_t + X_{t-1} + \cdots + X_{t-b+1}) / b$$

Control limits:

$$CL = \mu, \qquad LCL,\ UCL = CL \pm 3\sigma/\sqrt{b}$$
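A minimal sketch of the moving-average chart follows, with a window of b observations and limits µ ± 3σ/√b. The counts, µ and σ below are made up for illustration, and the function names are assumptions.

```python
import math

def moving_average(xs, b):
    """MA_t = (X_t + X_{t-1} + ... + X_{t-b+1}) / b for every t with a full window."""
    return [sum(xs[t - b + 1:t + 1]) / b for t in range(b - 1, len(xs))]

def ma_limits(mu, sigma, b):
    """Control limits mu +/- 3*sigma/sqrt(b) for an average of b observations."""
    half_width = 3.0 * sigma / math.sqrt(b)
    return mu - half_width, mu + half_width

counts = [50, 54, 46, 58, 90, 94, 88, 96]   # made-up weekly counts
ma = moving_average(counts, b=4)
lcl, ucl = ma_limits(mu=52.0, sigma=8.0, b=4)

# Positions in `ma` (0 = first full window) that fall outside the limits
alarms = [i for i, m in enumerate(ma) if m < lcl or m > ucl]
print(ma)        # [52.0, 62.0, 72.0, 82.5, 92.0]
print(lcl, ucl)  # 40.0 64.0
print(alarms)    # [2, 3, 4]
```

Consecutive windows overlap, so the plotted points are correlated even when the raw observations are independent.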

slide-26
SLIDE 26

Moving Average chart (b=4 weeks)

[Figure: moving average charts by week, 1/3/2004 to 12/31/2005. Panels: Gonorrhea; % P&I Deaths; LOG(% P&I Deaths)]

Good way to SEE patterns and trends in the data!

slide-27
SLIDE 27

The Cumulative Sum (CuSum) chart

On day t:

  • Compute the deviation of the count from the target: $X_t - (\mu + \delta/2)$
  • Accumulate the deviations until time t
  • Restart the counter if it goes below zero

$$S_t^+ = \max\left\{0,\; S_{t-1}^+ + X_t - \left(\mu + \frac{\delta}{2}\right)\right\}$$

Signal if $S_t^+ > h\sigma$

Can construct a CuSum for detecting a decrease
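The upper CuSum steps on this slide can be sketched as a short loop. This sketch assumes δ is expressed in the same units as the data, starts the counter at 0, and does not reset after an alarm. The slides specify none of these, and the input values are made up.

```python
def upper_cusum(xs, mu, sigma, delta, h):
    """Return the S_t^+ path and the times t at which S_t^+ > h * sigma.

    S_t^+ = max(0, S_{t-1}^+ + X_t - (mu + delta / 2)), with S_0^+ = 0 (assumed).
    """
    s = 0.0
    path, alarms = [], []
    for t, x in enumerate(xs):
        # Accumulate the deviation from the reference value; restart at zero
        s = max(0.0, s + x - (mu + delta / 2.0))
        path.append(s)
        if s > h * sigma:
            alarms.append(t)
    return path, alarms

path, alarms = upper_cusum([0, 0, 3, 3, 3], mu=0.0, sigma=1.0, delta=1.0, h=4.0)
print(path)    # [0.0, 0.0, 2.5, 5.0, 7.5]
print(alarms)  # [3, 4]
```

A CuSum for decreases can be built the same way by accumulating (µ - δ/2) - X_t instead.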

slide-28
SLIDE 28

CuSum with (h=4, δ=1)

[Figure: upper and lower CUSUM charts. Panels: Gonorrhea; % P&I Deaths; LOG(% P&I Deaths). Limits shown at ±85.7896, ±2.69250 and ±12.3851]

Missing values? Zero them?

slide-29
SLIDE 29

Exponentially Weighted Moving-Average (EWMA) chart

Points on the plot:

$$\tilde{X}_t = (1-\theta)\left(X_t + \theta X_{t-1} + \theta^2 X_{t-2} + \cdots\right) = (1-\theta) X_t + \theta \tilde{X}_{t-1}$$

Control limits:

$$CL = \mu, \qquad LCL,\ UCL = CL \pm 3\sigma\sqrt{\frac{1-\theta}{1+\theta}}$$

$$0 < \theta < 1$$
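A minimal sketch of the EWMA recursion and its limits follows, reading the slide's formula as giving weight (1 - θ) to the newest observation. The starting value `start` and the function names are assumptions for illustration, since the slides do not say how the recursion is initialized.

```python
import math

def ewma(xs, theta, start):
    """EWMA path: X~_t = (1 - theta) * X_t + theta * X~_{t-1}, with X~_0 = start."""
    smoothed, prev = [], start
    for x in xs:
        prev = (1.0 - theta) * x + theta * prev
        smoothed.append(prev)
    return smoothed

def ewma_limits(mu, sigma, theta):
    """Limits mu +/- 3*sigma*sqrt((1 - theta) / (1 + theta))."""
    half_width = 3.0 * sigma * math.sqrt((1.0 - theta) / (1.0 + theta))
    return mu - half_width, mu + half_width

values = ewma([10, 20, 20], theta=0.5, start=10.0)
lcl, ucl = ewma_limits(mu=0.0, sigma=1.0, theta=0.8)
print(values)    # [10.0, 15.0, 17.5]
print(lcl, ucl)  # approximately -1.0 and 1.0
```

With θ = 0.8 the limits sit about one σ from the center line, much tighter than the ±3σ of a Shewhart chart for single observations.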

slide-30
SLIDE 30

EWMA charts for weekly data

[Figure: EWMA charts for weekly data. Panels: Gonorrhea; % P&I Deaths; LOG(% P&I Deaths)]

Gonorrhea: Mean=53.98, UCL=75.43, LCL=32.53

% P&I Deaths: Mean=5.443, UCL=8.539, LCL=2.347

LOG(% P&I Deaths): Mean=1.467, UCL=2.140, LCL=0.7935

slide-31
SLIDE 31

Regression models for removing seasonality and trend

Control charts assume no trend, no seasonality

Regression models

  • Exp trend + multiplicative quarterly seasonality
  • Sinusoidal (CDC model for %P&I deaths, annual cycle)
  • Can stratify by adding predictors

Use RESIDUALS in control chart

The ugly:

  • What if pattern changes?
  • Autocorrelation

$$\log(y_t) = \alpha + \beta_0 t + \beta_1 Q_1 + \beta_2 Q_2 + \beta_3 Q_3 + \varepsilon_t$$

$$y_t = \alpha + \beta_0 t + \beta_1 \cos(2\pi t/365.25) + \beta_2 \sin(2\pi t/365.25) + \varepsilon_t$$
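The "use residuals" idea can be sketched with a small least-squares fit. This is a simplified illustration, not the CDC model: it assumes a design of intercept, linear trend and one annual cosine term with t in days, uses noise-free synthetic data, and all function names are illustrative.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_trend_and_cycle(ys):
    """Least-squares fit of y_t = a + b*t + c*cos(2*pi*t/365.25) + e_t.

    Returns the coefficients [a, b, c] and the residuals e_t.
    """
    rows = [[1.0, float(t), math.cos(2 * math.pi * t / 365.25)]
            for t in range(len(ys))]
    k = len(rows[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    residuals = [y - f for y, f in zip(ys, fitted)]
    return coef, residuals

# Synthetic daily series with trend and annual cycle (illustration only)
ys = [5 + 0.01 * t + 2 * math.cos(2 * math.pi * t / 365.25) for t in range(730)]
coef, residuals = fit_trend_and_cycle(ys)
print([round(c, 4) for c in coef])
```

The residuals, rather than the raw series, would then be passed to a control chart. For real data, the residuals should still be checked for autocorrelation.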

slide-32
SLIDE 32

Pre-diagnostic data: A whole new ball game

Daily data

Day-of-week effect

Some series seasonal

Non-stationary, local

Vastly different across/within sources

Correlate with other irrelevant variables

Missing data (school absences on holidays)

Infected by provider issues

Low vs. high counts

Lack of domain knowledge

[Figure: OTC Grocery Sales, daily sales, 8/8/99 to 12/20/00. Series: Analg-Ex Analg-In,Asth Rem Cap/Aller Cough & Cold Fr End Aller Nasal Room Dec Tabs & Caps Tab/Cap Tim Rel Thrt Loz]

slide-33
SLIDE 33

What’s the moral?

1. Preprocess series before applying control charts
2. Pre-diagnostic data require different tools/treatment than traditional data
3. I need a refresher statistics course
4. 1&2
5. All of the above

Audience poll results: 53%, 33%, 3%, 8%, 3%