CS626 Data Analysis and Simulation Instructor: Peter Kemper R 104A, - - PowerPoint PPT Presentation


CS626 Data Analysis and Simulation Instructor: Peter Kemper R 104A, phone 221-3462, email:kemper@cs.wm.edu Office hours: Monday,Wednesday 2-4 pm Today: Stochastic Input Modeling Reference: Law/Kelton, Simulation Modeling and Analysis, Ch 6.


slide-1
SLIDE 1

1

CS626 Data Analysis and Simulation

Today: Stochastic Input Modeling

Reference: Law/Kelton, Simulation Modeling and Analysis, Ch 6. NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

Instructor: Peter Kemper

R 104A, phone 221-3462, email:kemper@cs.wm.edu Office hours: Monday,Wednesday 2-4 pm

slide-2
SLIDE 2

What is input modeling?

Input modeling
  • Deriving a representation of the uncertainty or randomness in a stochastic simulation.
  • Common representations
      • Measurement data
      • Distributions derived from measurement data <-- focus of “Input modeling”
          • usually requires that samples are i.i.d. and corresponding random variables in the simulation model are i.i.d.
          • i.i.d. = independent and identically distributed
          • theoretical distributions
          • empirical distribution
      • Time-dependent stochastic process
      • Other stochastic processes

Examples include
  • time to failure for a machining process;
  • demand per unit time for inventory of a product;
  • number of defective items in a shipment of goods;
  • times between arrivals of calls to a call center.

2

slide-3
SLIDE 3

Overview of fitting with data
  • Check if key assumptions hold (i.i.d.)
  • Select one or more candidate distributions
      • based on physical characteristics of the process and
      • graphical examination of the data.
  • Fit the distribution to the data
      • determine values for its unknown parameters.
  • Check the fit to the data
      • via statistical tests and
      • via graphical analysis.
  • If the distribution does not fit,
      • select another candidate and repeat the process, or
      • use an empirical distribution.

3 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
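The "fit the distribution" step can be sketched as follows. This is a minimal illustration, assuming maximum-likelihood estimation (the slides do not prescribe an estimator) and a small hypothetical sample; the function names `fit_exponential` and `fit_normal` are invented for this sketch.

```python
import statistics

def fit_exponential(data):
    """MLE of the exponential mean (1/rate): the sample mean."""
    return statistics.fmean(data)

def fit_normal(data):
    """MLE of (mu, sigma) for a normal: sample mean and population std dev."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma

# Hypothetical measurements, e.g. times between arrivals (illustration only).
sample = [2.1, 5.7, 3.4, 8.1]
exp_mean = fit_exponential(sample)
mu, sigma = fit_normal(sample)
print(f"exponential mean = {exp_mean:.3f}")
print(f"normal mu = {mu:.3f}, sigma = {sigma:.3f}")
```

Estimating parameters is only the second step of the loop; the fitted candidates still have to be checked against the data graphically and with tests.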

slide-4
SLIDE 4

Check the fit to the data

Graphical analysis
  • Plot fitted distribution and data in a way that differences can be recognized
      • beyond obvious cases, there is a grey area of subjective acceptance/rejection
  • Challenges
      • How much difference is significant enough to trash a fitted distribution?
      • Which graphical representation is easy to judge?
  • Options:
      • Histogram-based plots
      • Probability plots: P-P plot, Q-Q plot

Statistical tests
  • define a measure X for the difference between fitted distribution & data
  • X is an RV, so if we can argue what distribution X has, we get a statistical test to see if in a concrete case a value of X is significant
  • Goodness-of-fit tests:
      • Chi-square test (χ2), Kolmogorov-Smirnov test (K-S), Anderson-Darling test (AD)

4
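To make the idea of "a measure X for the difference between fitted distribution and data" concrete, here is a minimal sketch that computes the K-S statistic. The sample values and the exponential candidate are hypothetical. Only the statistic is computed: whether a value is significant depends on the sample size and on whether the parameters were estimated from the same data, which this sketch does not address.

```python
import math

def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical cdf and a fitted cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical cdf jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def exponential_cdf(mean):
    """Return the cdf of an exponential distribution with the given mean."""
    return lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0

sample = [0.3, 1.2, 0.7, 2.5, 0.1, 1.9, 0.9, 3.8]  # hypothetical data
mean = sum(sample) / len(sample)                    # fitted parameter (MLE)
d = ks_statistic(sample, exponential_cdf(mean))
print(f"K-S statistic D = {d:.4f}")
```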

slide-5
SLIDE 5

Sample test characteristic for Chi-Square test (all parameters known)

5

One-sided test

Right side:
  • critical region
  • region of rejection

Left side:
  • region of acceptance, where we fail to reject the hypothesis

P-value of x: 1-F(x)
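The p-value 1-F(x) of a chi-square test statistic can be computed as sketched below. This is an illustration under stated assumptions: the observed counts are hypothetical, the cells are equiprobable, all parameters are known (so the degrees of freedom are the number of cells minus one), and the chi-square cdf is evaluated with a plain series for the incomplete gamma function, which is adequate for moderate values of x but loses accuracy far out in the tail.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf(x, dof):
    """P-value 1 - F(x) for a chi-square distribution with `dof` degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = dof/2 and z = x/2.
    """
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + k)
        total += term
        k += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

observed = [12, 9, 11, 8]      # hypothetical counts in 4 equiprobable cells
expected = [10.0] * 4          # n * p_j with n = 40 and p_j = 1/4
x = chi_square_statistic(observed, expected)
p_value = chi_square_sf(x, dof=len(observed) - 1)  # k - 1, all parameters known
print(f"X = {x:.3f}, p-value = {p_value:.4f}")
```

A large p-value means x lies in the region of acceptance; a p-value below the chosen significance level means x lies in the critical region on the right.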

slide-6
SLIDE 6

Graphic Analysis vs Goodness-of-fit tests

Graphic analysis includes:
  • Histogram with fitted distribution
  • Probability plots: P-P plot, Q-Q plot.

Goodness-of-fit tests
  • represent lack of fit by a summary statistic, while plots show where the lack of fit occurs and whether it is important.
  • may accept the fit, but the plots may suggest the opposite, especially when the number of observations is small.

6

A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from chi-square test and K-S test:
  • Chi-square test: 0.166
  • K-S test: >0.15

What is your conclusion?

slide-7
SLIDE 7

Density Histogram compares sample histogram (mind the bin sizes) with fitted distribution

7

slide-8
SLIDE 8

Frequency Histogram compares histogram from data with histogram according to fitted distribution

8

slide-9
SLIDE 9

Differences in distributions are easier to see along a straight line:

9

slide-10
SLIDE 10

Graphical comparisons

10

Frequency Comparisons

Features:
  • Graphical comparison of a histogram of the data with the density function of the fitted distribution.
  • Sensitive to how we group the data.

Probability Plots

Features:
  • Graphical comparison of an estimate of the true distribution function of the data with the distribution function of the fit.
  • Q-Q (P-P) plot amplifies differences between the tails (middle) of the model and sample distribution functions.

  • Use every graphical tool in the software to examine the fit.
  • If histogram-based tool, then play with the widths of the cells.
  • Q-Q plot is very highly recommended!

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-11
SLIDE 11

P-P plots and Q-Q plots

11

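Q-Q plot: quantiles of the fitted distribution vs quantiles of the sample, for q1,...,qn
P-P plot: probabilities under the fitted distribution vs probabilities under the sample distribution, for p1,...,pn
This intuitive definition needs an adjustment to handle ties (multiple samples of the same value).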

slide-12
SLIDE 12

Q-Q Plot

Recall that one way to generate data from cdf F is via
  Y = F^{-1}(R), where R is uniform on (0,1).

The Q-Q plot displays the sorted data
  Y_1 ≤ Y_2 ≤ ... ≤ Y_n
vs
  F^{-1}((j - 1/2)/n), j = 1, 2, ..., n

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
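The points of a Q-Q plot, and of the corresponding P-P plot, can be computed as in the following sketch. The sample and the exponential fit are hypothetical, the helper names are invented for this illustration, and no adjustment for ties is made.

```python
import math

def qq_points(data, inv_cdf):
    """Pairs (fitted quantile, sorted observation) for j = 1..n."""
    ys = sorted(data)
    n = len(ys)
    return [(inv_cdf((j - 0.5) / n), y) for j, y in enumerate(ys, start=1)]

def pp_points(data, cdf):
    """Pairs (fitted probability, empirical probability) for j = 1..n."""
    ys = sorted(data)
    n = len(ys)
    return [(cdf(y), (j - 0.5) / n) for j, y in enumerate(ys, start=1)]

# Hypothetical sample and an exponential fit (mean estimated by the sample mean).
sample = [0.3, 1.2, 0.7, 2.5, 0.1, 1.9, 0.9, 3.8]
mean = sum(sample) / len(sample)
exp_inv_cdf = lambda p: -mean * math.log(1.0 - p)
exp_cdf = lambda x: 1.0 - math.exp(-x / mean)

for fitted_q, observed in qq_points(sample, exp_inv_cdf):
    print(f"{fitted_q:8.3f} {observed:8.3f}")
```

Plotting the pairs against the 45-degree line gives the Q-Q plot; `pp_points(sample, exp_cdf)` gives the P-P counterpart on the unit square.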

slide-13
SLIDE 13

Q-Q Plot Intuition

13

  • We have a sample Y_1, Y_2, ..., Y_n and we fit a distribution F that we hope is good.
  • If we now generate a random sample of size n from F, it should look about like Y_1, Y_2, ..., Y_n.
  • The Q-Q plot generates a perfect random sample for comparison.

slide-14
SLIDE 14

Features of the Q-Q plot
  • It does not depend on how the data are grouped.
  • It is much better than a density-histogram when the number of data points is small.
  • Deviations from a straight line show where the distribution does not match.
  • A straight line implies that the family of distributions is correct.
  • A 45° line implies that parameters fit as well.

14 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-15
SLIDE 15

[Figure: two Q-Q plots, fitted quantile vs. input quantile, both axes 20 to 120]
  • LogLogistic(-113.32, 156.71, 16.107): Pretty good fit, but misses a bit on the right tail.
  • Exponential(44.468) Shift=-0.58: Poor fit, misses badly in both tails.

Features of the Q-Q Plot A straight line implies the family of distributions is correct; a 45-degree line implies correct parameters.

15 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-16
SLIDE 16

Examples of Q-Q plot

16

[Figure: Q-Q plot, both axes 10 to 30] The fitting is pretty good.

[Figure: Q-Q plot, both axes 5 to 50] The distribution family seems OK, but the parameters are not.

slide-17
SLIDE 17

Example of Q-Q plot

17

[Figure: Q-Q plot, both axes 10 to 35] Poor fit, misses badly in the left tail.

A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from chi-square test and K-S test:
  • Chi-square test: 0.166
  • K-S test: >0.15

slide-18
SLIDE 18

P-P plot vs Q-Q plot: Sensitive to different kinds of deviations

18

slide-19
SLIDE 19

Should we just use the best fit?

Software tools
  • exercise a set of distributions
  • optimize parameter settings for data and distribution
  • evaluate statistical tests
  • suggest a “best fit”

Some concerns about the fully automated solution:
  • Tests represent lack of fit by a single summary statistic, while plots show where the lack of fit occurs and whether it is important.
  • Be sure to try different numbers of histogram cells; it affects the p-value of the χ2 test, and your perception of the fit.
  • Be cautious with ranking fits by Chi-Sq, K-S and A-D statistics and always check the Q-Q plot.
  • If there is a strong physical basis for a particular distribution choice, then use it even if it is not the best fit.
  • Don’t be afraid to use your brain in addition to software!

19 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-20
SLIDE 20

Overview of fitting with data
  • Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.
  • Fit the distribution to the data (determine values for its unknown parameters).
  • Check the fit to the data via tests and graphical analysis.
  • If the distribution does not fit, then select another candidate and repeat the process.

What if no distribution provides a good fit?

20 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-21
SLIDE 21

What if no distribution provides a good fit?

Use the data itself when…
  • No standard distribution fits well.
  • We have no justification for a standard distribution.
  • There is too little data to distinguish between standard distributions.

Reuse the data via empirical distribution

An example:

Objective: Fit an input model to data 2.1, 5.7, 3.4, 8.1 via empirical distribution function.

[Figure: input data 2.1, 3.4, 5.7, 8.1 on an axis up to 10, with the probability mass function; each point is equally likely to be re-sampled]

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-22
SLIDE 22

Empirical distribution

Each data point is equally likely to be resampled. If you are concerned that only the values you saw can appear again, then fill in gaps by linearly interpolating between the sorted data points:

[Figure: Interpolated Empirical cdf; cumulative probability (0.33, 0.67, 1 marked) vs. X from 2 to 10]

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-23
SLIDE 23

Empirical distribution

Formal definition (Law/Kelton, p 326)

Let X_1 ≤ X_2 ≤ ... ≤ X_n be the sorted sequence of observations.
  F(x) = 0   if x < X_1
  F(x) = (i-1)/(n-1) + (x - X_i) / [(n-1)(X_{i+1} - X_i)]   if X_i ≤ x < X_{i+1}, for i = 1, 2, ..., n-1
  F(x) = 1   if X_n ≤ x

Ok, but this cannot yield values less than X_1 or more than X_n.
Also, the mean of F does not match the sample mean.
If data is grouped, a different approach is necessary. Law/Kelton describes such an extension with interpolation.
The real challenge is skewed distributions (mostly right-skewed), which likely have too few samples from the tail because tail probabilities are small. Consider appending an artificial tail with the help of an exponential distribution.
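The definition above can be turned into code along these lines. This is a sketch that assumes at least two observations and ungrouped data; the function name `interpolated_ecdf` is invented here, and the four data values are taken from the empirical-distribution example.

```python
import bisect
import random

def interpolated_ecdf(data):
    """Return (cdf, inverse cdf) of the piecewise-linear empirical distribution."""
    xs = sorted(data)
    n = len(xs)

    def cdf(x):
        if x < xs[0]:
            return 0.0
        if x >= xs[-1]:
            return 1.0
        i = bisect.bisect_right(xs, x)   # xs[i-1] <= x < xs[i]; i is the 1-based index of X_i
        return (i - 1) / (n - 1) + (x - xs[i - 1]) / ((n - 1) * (xs[i] - xs[i - 1]))

    def inv_cdf(u):
        pos = u * (n - 1)                # position along the n-1 linear segments
        i = min(int(pos), n - 2)         # 0-based segment index
        return xs[i] + (pos - i) * (xs[i + 1] - xs[i])

    return cdf, inv_cdf

cdf, inv_cdf = interpolated_ecdf([2.1, 5.7, 3.4, 8.1])
# Inverse-transform sampling: every draw lies between the smallest and largest observation.
samples = [inv_cdf(random.random()) for _ in range(5)]
print(samples)
```

Sampling through `inv_cdf` fills the gaps between observed values, but it still cannot produce anything outside the observed range, which is the limitation noted above.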

23

slide-24
SLIDE 24

What if we have no data at all?

We have to use anything we can find...
  • Engineering data, standards and ratings can provide central values.
  • Expert opinion.
  • Physical or conventional limitations can provide bounds.
  • Physical basis of the process can suggest appropriate distribution families.

We model the expert opinion using either
  • breakpoints method, or
  • mean and variability method.

24 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-25
SLIDE 25

Breakpoints method

Useful for modeling quantities with a large number of possible outcomes such as quarterly sales volume.

Example: Sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and are most likely to be 3500 units.

[Figure: Triangular Distribution; density function vs. sales (axis 500 to 5500); X <= 1707 at 5%, X <= 4452 at 95%]

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
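The 5% and 95% values shown for the triangular distribution can be reproduced from the three expert estimates. The sketch below uses the closed-form inverse cdf of the triangular distribution; the function name `triangular_inv_cdf` is invented for this illustration.

```python
import math
import random

def triangular_inv_cdf(u, low, mode, high):
    """Quantile of the triangular distribution with the given low, mode, high."""
    split = (mode - low) / (high - low)   # cdf value at the mode
    if u <= split:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))

low, mode, high = 1000, 3500, 5000
q05 = triangular_inv_cdf(0.05, low, mode, high)
q95 = triangular_inv_cdf(0.95, low, mode, high)
print(round(q05), round(q95))  # 1707 4452

# Sampling: the standard library takes the arguments in (low, high, mode) order.
draws = [random.triangular(low, high, mode) for _ in range(5)]
```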

slide-26
SLIDE 26

Breakpoints method

Example
  • Sales of XYZ-123 will be between 1000 and 5000 with
      • 25% chance of being at most $2000,
      • 75% chance of being at most $3500,
      • 99% chance of being at most $4500.

Use only as many breakpoints as you can confidently get.
Try to get breakpoints near the extremes if possible.
Might be easier for experts to give the chance of exceeding a value.

[Figure: Cumulative Distribution and density function vs. sales (axis 500 to 5500); X <= 1200 at 5.0%, X <= 4333 at 95.0%]

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
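One way to turn the breakpoints into an input model is a cdf that is linear between consecutive breakpoints. That interpolation rule is an assumption of this sketch, although it does reproduce the 5% and 95% values shown in the figure; the function name `breakpoint_inv_cdf` is invented here.

```python
import bisect
import random

# Breakpoints from the example: (value, cumulative probability).
breakpoints = [(1000, 0.0), (2000, 0.25), (3500, 0.75), (4500, 0.99), (5000, 1.0)]

def breakpoint_inv_cdf(u, points):
    """Quantile of the cdf that is linear between consecutive breakpoints."""
    values = [v for v, _ in points]
    probs = [p for _, p in points]       # assumed strictly increasing
    i = bisect.bisect_left(probs, u)     # first breakpoint with probability >= u
    if i == 0:
        return float(values[0])
    p0, p1 = probs[i - 1], probs[i]
    return values[i - 1] + (u - p0) / (p1 - p0) * (values[i] - values[i - 1])

q05 = breakpoint_inv_cdf(0.05, breakpoints)
q95 = breakpoint_inv_cdf(0.95, breakpoints)
print(round(q05), round(q95))  # 1200 4333

# Inverse-transform sampling from the expert's breakpoints.
draws = [breakpoint_inv_cdf(random.random(), breakpoints) for _ in range(5)]
```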

slide-27
SLIDE 27

Mean and variability method

Example
  • The sales of XYZ-123 were 10,000. This year we expect a 15% increase, with a typical swing of 5% above or below that value. However, we won’t sell fewer than 7000 units, or more than 16,000 under any conditions.

[Figure: Normal Distribution; density function vs. sales (axis 7000 to 16000); X <= 10554 at 5.0%, X <= 12446 at 95.0%]

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
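The example can be translated into a normal input model as sketched below. Reading the "typical swing of 5%" as one standard deviation equal to 5% of the mean is an assumption of this sketch, although it does reproduce the 5% and 95% values shown in the figure. The hard limits are enforced by rejecting out-of-range draws, which is one of several possible ways to truncate.

```python
import random
from statistics import NormalDist

last_year = 10_000
mean = last_year * 1.15        # expected 15% increase
std_dev = 0.05 * mean          # ASSUMPTION: "typical swing of 5%" = one standard deviation
low, high = 7_000, 16_000      # hard limits stated by the expert

sales = NormalDist(mean, std_dev)
q05, q95 = sales.inv_cdf(0.05), sales.inv_cdf(0.95)
print(round(q05), round(q95))  # 10554 12446

def sample_sales(rng=random):
    """Draw from the normal, rejecting values outside the stated limits."""
    while True:
        x = rng.gauss(mean, std_dev)
        if low <= x <= high:
            return x

draws = [sample_sales() for _ in range(5)]
```

With these numbers the limits lie more than seven standard deviations from the mean, so the rejection step almost never triggers; it matters when the expert's bounds are tight.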

slide-28
SLIDE 28

Checking the input model
  • Sensitivity analysis (varying the parameters of the input model) is especially important when the model is not based on data.
  • While looking for marked changes in the output results, pay special attention to the standard deviation, bounds, or limits.
  • Concentrate sensitivity analysis on those inputs to which the outputs are most sensitive.

28 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-29
SLIDE 29

What if the process is dependent?

Usually we assume that all generated random observations across a simulation are independent.

Sometimes this is not true:
  • A difficult part requires long processing in adjacent operations of a production system.
  • This is positive correlation.

Ignoring such relations can invalidate the model.

29 from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

slide-30
SLIDE 30

Does Dependence Really Matter?

30

[Figure: two single-server queues. Independent arrivals -> Server A -> Exit, and dependent arrivals (correlation 0.9) -> Server B -> Exit; charts compare average waiting time and average number waiting for the inventory in front of Server A and in front of Server B]

YES, IT DOES!

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission
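The effect can be explored with a small experiment. The slide does not say how the dependent arrivals were generated, so this sketch makes its own assumptions: interarrival times are exponential with mean 1, obtained by transforming a Gaussian AR(1) process (their lag-1 correlation is therefore close to, but not exactly, the parameter `rho`), service times are i.i.d. exponential with mean 0.8, and waiting times follow the Lindley recursion for a single FIFO server. All names are invented for this illustration.

```python
import math
import random

def correlated_exponentials(n, mean, rho, rng):
    """Exponential variates driven by a Gaussian AR(1) process (rho = 0 gives i.i.d.)."""
    z = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        z = rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        survival = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z), uniform on (0,1)
        out.append(-mean * math.log(survival))          # large z gives a long gap
    return out

def average_wait(interarrivals, services):
    """Average waiting time in a single-server FIFO queue (Lindley recursion)."""
    wait, total = 0.0, 0.0                              # the first customer does not wait
    for gap, service in zip(interarrivals[1:], services):
        wait = max(0.0, wait + service - gap)
        total += wait
    return total / len(interarrivals)

rng = random.Random(1)
n = 50_000
services = [rng.expovariate(1 / 0.8) for _ in range(n)]
independent = correlated_exponentials(n, 1.0, 0.0, rng)
dependent = correlated_exponentials(n, 1.0, 0.9, rng)
wait_ind = average_wait(independent, services)
wait_dep = average_wait(dependent, services)
print(f"independent arrivals: average wait {wait_ind:.2f}")
print(f"dependent arrivals:   average wait {wait_dep:.2f}")
```

Both arrival streams have the same marginal distribution and the same service times; only the dependence differs, so any gap between the two averages is due to correlation alone.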

slide-31
SLIDE 31

Conclusion
  • Use input models to represent uncertainty in simulation
  • The particular input model chosen matters!
  • Selection of an input model is not an exact science
      • no right answer, but the issues to consider are
          • theoretical vs. empirical data
          • physical basis of the distribution
          • assessment of the goodness of a fit
          • independence of samples
  • Assess the sensitivity of simulation output results to input models chosen
  • Use expert opinion whenever you can
  • Do not automatically trust a completely automated derivation of an input model.

31