Fitting parametric distributions using R: the fitdistrplus package



SLIDE 1

Introduction Choice of distributions to fit Fit of distributions Simulation of uncertainty Conclusion

Fitting parametric distributions using R: the fitdistrplus package

  • M. L. Delignette-Muller - CNRS UMR 5558
  • R. Pouillot
  • J.-B. Denis - INRA MIAJ

useR! 2009, 10/07/2009

SLIDE 2

Background

Specifying the probability distribution that best fits a data sample, among a predefined family of distributions:

  • a frequent need, especially in Quantitative Risk Assessment
  • a general-purpose maximum-likelihood fitting routine exists for the parameter estimation step: fitdistr(MASS) (Venables and Ripley, 2002)
  • the other steps can be implemented using R (Ricci, 2005)
  • but no specific package is dedicated to the whole process
  • censored data are difficult to work with
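
For reference, the existing routine fitdistr(MASS) can be called as in the minimal sketch below. The simulated sample and its parameters are assumptions chosen for illustration (they follow the example given in the MASS documentation); they are not data from this talk.

```r
library(MASS)  # provides fitdistr()

# Simulated sample (assumption, for illustration only)
set.seed(123)
x <- rgamma(100, shape = 5, rate = 0.1)

# Maximum-likelihood fit of a gamma distribution
fit <- fitdistr(x, "gamma")

fit$estimate  # parameter estimates (shape, rate)
fit$sd        # estimated standard errors
fit$loglik    # maximised log-likelihood
```

This covers the parameter estimation step only; choosing candidate distributions, assessing goodness-of-fit and handling censored data are left to the user.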

SLIDE 3

Objective

Build a package that

  • provides functions to help the whole process of specifying a distribution from data:
      • choose, among a family of distributions, the best candidates to fit a sample
      • estimate the distribution parameters and their uncertainty
      • assess and compare the goodness-of-fit of several distributions
  • specifically handles different kinds of data:
      • discrete
      • continuous, with possible censored values (right-, left- and interval-censored, with several upper and lower bounds)

SLIDE 4

Technical choices

Skewness-kurtosis graph for the choice of distributions (Cullen and Frey, 1999)

Two fitting methods

  • matching moments: for a limited number of distributions and non-censored data
  • maximum likelihood (mle) using optim(stats): for any distribution, predefined or defined by the user, and for non-censored or censored data

Uncertainty on parameter estimations

  • standard errors from the Hessian matrix (only for mle)
  • parametric or non-parametric bootstrap

Assessment of goodness-of-fit

  • chi-squared, Kolmogorov-Smirnov, Anderson-Darling statistics
  • density, cdf, P-P and Q-Q plots
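
The two fitting methods and the Hessian-based standard errors can be illustrated with a base R sketch for a gamma distribution. This is not the package's internal code: the simulated sample, the log-parameterisation and the names negloglik, mom, mle and se are assumptions made for this example.

```r
# Simulated non-censored sample (assumption, for illustration only)
set.seed(1)
x <- rgamma(500, shape = 4, rate = 0.05)

# 1. Matching moments: for a gamma distribution,
#    mean = shape / rate and variance = shape / rate^2
m <- mean(x)
v <- var(x)
mom <- c(shape = m^2 / v, rate = m / v)

# 2. Maximum likelihood with optim(), working on log-parameters so that
#    shape and rate stay positive during the search
negloglik <- function(logpar, data) {
  -sum(dgamma(data, shape = exp(logpar[1]), rate = exp(logpar[2]), log = TRUE))
}
opt <- optim(log(mom), negloglik, data = x, method = "BFGS", hessian = TRUE)
mle <- setNames(exp(opt$par), names(mom))

# 3. Standard errors from the Hessian of the negative log-likelihood.
#    The Hessian is on the log scale, so the delta method multiplies the
#    log-scale standard errors by the estimates themselves.
se <- mle * sqrt(diag(solve(opt$hessian)))

rbind(mom = mom, mle = mle, se = se)
```

Starting the optimisation from the moment estimates is a common choice because they are cheap to compute and usually close to the maximum.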

SLIDE 8

Main functions of fitdistrplus

  • descdist: provides a skewness-kurtosis graph to help choose the best candidate(s) to fit a given dataset
  • fitdist and plot.fitdist: for a given distribution, estimate parameters and provide goodness-of-fit graphs and statistics
  • bootdist: for a fitted distribution, simulates the uncertainty in the estimated parameters by bootstrap resampling
  • fitdistcens, plot.fitdistcens and bootdistcens: the same functions dedicated to continuous data with censored values
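
Chained together, these functions give a workflow along the following lines. This is a sketch under assumptions: the sample is simulated rather than taken from the talk, the package is assumed to be installed, and niter is kept small so that the example runs quickly.

```r
library(fitdistrplus)  # assumed to be installed

# Simulated positive, right-skewed sample (assumption, for illustration only)
set.seed(42)
x <- rgamma(200, shape = 4, rate = 0.05)

# 1. Choice of candidate distributions: skewness-kurtosis graph
descdist(x)

# 2. Fit of a candidate distribution (maximum likelihood by default)
fg <- fitdist(x, "gamma")
summary(fg)
plot(fg)  # goodness-of-fit graphs

# 3. Uncertainty in the estimated parameters by bootstrap resampling
bg <- bootdist(fg, niter = 101)
summary(bg)
```

Argument names may differ between releases of the package. In recent versions, as far as we know, the moment-matching method is requested with method="mme" rather than the "mom" spelling used on these slides.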

SLIDE 9

Skewness-kurtosis plot for continuous data

  • Ex. on consumption data: food serving sizes (g)

> descdist(serving.size)

[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. Legend: observation; theoretical distributions: normal, uniform, exponential, logistic, beta, lognormal, gamma (Weibull is close to gamma and lognormal).]
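
The coordinates used in such a graph can be computed by hand, as in the base R sketch below. It uses the simple moment estimators of skewness and kurtosis, which may differ slightly from the estimators used by descdist; the simulated sample and the names skew_kurt, ref and gamma_line are assumptions made for this example.

```r
# Simulated sample (assumption, for illustration only)
set.seed(7)
x <- rgamma(300, shape = 4, rate = 0.05)

# Moment-based square of skewness and (non-excess) kurtosis
skew_kurt <- function(x) {
  z <- x - mean(x)
  m2 <- mean(z^2)
  c(sq.skewness = (mean(z^3) / m2^1.5)^2,
    kurtosis    = mean(z^4) / m2^2)
}
obs <- skew_kurt(x)

# Theoretical reference points (square of skewness, kurtosis)
ref <- rbind(normal      = c(0, 3),
             uniform     = c(0, 1.8),
             exponential = c(4, 9),
             logistic    = c(0, 4.2))
colnames(ref) <- names(obs)

# Gamma distributions lie on the line kurtosis = 3 + 1.5 * (square of skewness)
gamma_line <- function(sq_skew) 3 + 1.5 * sq_skew

obs
ref
gamma_line(obs[["sq.skewness"]])  # kurtosis of a gamma with the same skewness
```

An observation lying close to the gamma line, as expected here, suggests the gamma distribution as a candidate; it does not replace a goodness-of-fit check after fitting.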

SLIDE 10

Skewness-kurtosis plot for continuous data with bootstrap option

> descdist(serving.size,boot=1001)

[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. Legend: observation; bootstrapped values; theoretical distributions: normal, uniform, exponential, logistic, beta, lognormal, gamma (Weibull is close to gamma and lognormal).]

SLIDE 11

Skewness-kurtosis plot for discrete data

  • Ex. on microbial data: counts of colonies on small food samples

> descdist(colonies.count,discrete=TRUE)

[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. Legend: observation; theoretical distributions: normal, negative binomial, Poisson.]

SLIDE 12

Fit of a given distribution by maximum likelihood or matching moments

  • Ex. on consumption data: food serving sizes (g)

Maximum likelihood estimation

> fg.mle<-fitdist(serving.size,"gamma",method="mle")
> summary(fg.mle)
      estimate Std. Error
shape   4.0083    0.34134
rate    0.0544    0.00494
Loglikelihood: -1254

Matching moments estimation

> fg.mom<-fitdist(serving.size,"gamma",method="mom")
> summary(fg.mom)
      estimate
shape   4.2285
rate    0.0574

SLIDE 14

Comparison of goodness-of-fit statistics

  • Ex. on consumption data: food serving sizes (g)

Comparison of the fits of three distributions using the Anderson-Darling statistic

Gamma

> fitdist(serving.size,"gamma")$ad
[1] 3.566019

Lognormal

> fitdist(serving.size,"lnorm")$ad
[1] 4.543654

Weibull

> fitdist(serving.size,"weibull")$ad
[1] 3.573646
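
For readers who want to see what is being compared, the Anderson-Darling statistic can be computed directly from its usual formula, as in the base R sketch below. The simulated sample, the helper ad_stat and the hand-written likelihood fits are assumptions made for this example; they do not reproduce the package's code or the values shown above.

```r
# Anderson-Darling statistic of a sample x against a fitted CDF:
# A2 = -n - (1/n) * sum_i (2i - 1) * [log F(x_(i)) + log(1 - F(x_(n+1-i)))]
ad_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
}

# Simulated sample (assumption, for illustration only)
set.seed(3)
x <- rgamma(300, shape = 4, rate = 0.05)

# Maximum-likelihood fits; log-parameters keep shape, rate and scale positive
nll_gamma   <- function(lp) -sum(dgamma(x, exp(lp[1]), rate = exp(lp[2]), log = TRUE))
nll_weibull <- function(lp) -sum(dweibull(x, exp(lp[1]), scale = exp(lp[2]), log = TRUE))
pg <- exp(optim(log(c(mean(x)^2 / var(x), mean(x) / var(x))), nll_gamma)$par)
pw <- exp(optim(log(c(1, mean(x))), nll_weibull)$par)
# The lognormal maximum-likelihood estimates have a closed form
pl <- c(mean(log(x)), sqrt(mean((log(x) - mean(log(x)))^2)))

ad <- c(gamma   = ad_stat(x, pgamma, shape = pg[1], rate = pg[2]),
        lnorm   = ad_stat(x, plnorm, meanlog = pl[1], sdlog = pl[2]),
        weibull = ad_stat(x, pweibull, shape = pw[1], scale = pw[2]))
ad  # the smaller the statistic, the closer the fit
```

In recent releases of fitdistrplus these statistics are, to our knowledge, obtained with gofstat() on the fitted object rather than through the $ad component shown on the slide.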

SLIDE 15

Goodness-of-fit graphs for continuous data

  • Ex. on consumption data: food serving sizes (g)

> plot(fg.mle)

[Figure: four panels. Empirical and theoretical distr. (Density vs data); QQ-plot (sample quantiles vs theoretical quantiles); Empirical and theoretical CDFs (CDF vs data); PP-plot (sample probabilities vs theoretical probabilities).]

SLIDE 16

Goodness-of-fit graphs for discrete data

  • Ex. on microbial data: counts of colonies on small food samples

> fnbinom<-fitdist(colonies.count,"nbinom")
> plot(fnbinom)

[Figure: two panels. Empirical (black) and theoretical (red) distr. (Density vs data); Empirical (black) and theoretical (red) CDFs (CDF vs data).]

SLIDE 17

Fit of a given distribution by maximum likelihood to censored data

  • Ex. on microbial censored data: concentrations in food, with left-censored values (not detected) and interval-censored values (detected but not counted)

> log10.conc
   left right
1  1.73  1.73
2  1.51  1.51
3  0.77  0.77
4  1.96  1.96
5  1.96  1.96
6 -1.40  0.00
7 -1.40 -0.70
8    NA -1.40
9 -0.11 -0.11
...
> fnorm<-fitdistcens(log10.conc, "norm")
> summary(fnorm)
     estimate Std. Error
mean    0.118      0.332
sd      1.426      0.261
Loglikelihood: -32.1
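
The kind of likelihood involved in such a fit can be sketched in base R: exact observations contribute a density, left-censored ones a cumulative probability, and interval-censored ones a difference of cumulative probabilities. The simulated data, the censoring limits (-1 and 0) and the function negloglik_cens are assumptions made for this example, not the package's implementation.

```r
# Simulated censored sample (assumption, for illustration only):
# values below -1 are not detected (left-censored, left = NA) and values
# in [-1, 0) are detected but not counted (interval-censored)
set.seed(11)
true <- rnorm(200, mean = 0.5, sd = 1.5)
cens <- data.frame(left = true, right = true)
nd <- true < -1
ic <- true >= -1 & true < 0
cens$left[nd] <- NA
cens$right[nd] <- -1
cens$left[ic] <- -1
cens$right[ic] <- 0

# Negative log-likelihood of a normal distribution for censored data;
# par = c(mean, log(sd))
negloglik_cens <- function(par, d) {
  mu <- par[1]
  sigma <- exp(par[2])
  leftc  <- is.na(d$left)
  rightc <- is.na(d$right)
  exact  <- !leftc & !rightc & d$left == d$right
  interv <- !leftc & !rightc & !exact
  -(sum(dnorm(d$left[exact], mu, sigma, log = TRUE)) +
    sum(pnorm(d$right[leftc], mu, sigma, log.p = TRUE)) +
    sum(pnorm(d$left[rightc], mu, sigma, lower.tail = FALSE, log.p = TRUE)) +
    sum(log(pnorm(d$right[interv], mu, sigma) - pnorm(d$left[interv], mu, sigma))))
}

opt <- optim(c(0, 0), negloglik_cens, d = cens, hessian = TRUE)
est <- c(mean = opt$par[1], sd = exp(opt$par[2]))
# Standard errors from the Hessian (delta method for sd, fitted on the log scale)
se <- sqrt(diag(solve(opt$hessian))) * c(1, est[["sd"]])
rbind(estimate = est, se = se)
```

The data frame uses the same left/right layout as log10.conc above: equal bounds for an exact value, NA on the left for a non-detect, and two different bounds for an interval.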

SLIDE 18

Goodness-of-fit graphs for censored data

  • Ex. on microbial censored data: concentrations in food

> plot(fnorm)

[Figure: Cumulative distribution plot. x-axis: censored data; y-axis: CDF.]

SLIDE 19

Bootstrap resampling

  • Ex. on microbial censored data

> bnorm<-bootdistcens(fnorm)
> summary(bnorm)
Nonparametric bootstrap medians and 95% CI
     Median   2.5% 97.5%
mean  0.233 -0.455 0.875
sd    1.294  0.908 1.776
> plot(bnorm)

[Figure: Scatterplot of the bootstrapped values of the two parameters. Axes: mean, sd.]

SLIDE 20

Use of the bootstrap in risk assessment

The bootstrap sample may be used to take uncertainty into account in risk assessment, in two-dimensional Monte Carlo simulations, as proposed in the package mc2d.

[Figure: two-dimensional Monte Carlo scheme. Labels: Variability; Uncertainty; Uncertain and Variable parameter; Uncertain hyperparameter 1; Uncertain hyperparameter 2.]
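
The idea can be sketched in base R without the mc2d package: bootstrapped hyperparameters form the uncertainty dimension, and for each of them a sample of the variable parameter forms the variability dimension. The simulated data, the sizes n_unc and n_var and the choice of the 95th percentile as summary are assumptions made for this example; the code does not use or reproduce the mc2d API.

```r
# Simulated observed sample (assumption, for illustration only)
set.seed(5)
obs <- rnorm(50, mean = 0.2, sd = 1.4)

n_unc <- 200   # size of the uncertainty dimension (bootstrap replicates)
n_var <- 1000  # size of the variability dimension

# Uncertainty: non-parametric bootstrap of the two hyperparameters (mean, sd)
boot <- t(replicate(n_unc, {
  b <- sample(obs, replace = TRUE)
  c(mean = mean(b), sd = sd(b))
}))

# Variability: for each bootstrapped pair of hyperparameters, simulate the
# variable parameter; rows = variability, columns = uncertainty
sim <- apply(boot, 1, function(p) rnorm(n_var, mean = p["mean"], sd = p["sd"]))

# Example summary: 95th percentile of variability in each uncertainty
# replicate, then its median and 95% uncertainty interval
p95 <- apply(sim, 2, quantile, probs = 0.95)
quantile(p95, c(0.025, 0.5, 0.975))
```

Keeping the two dimensions in separate margins of the matrix makes it possible to report a variability percentile together with an uncertainty interval around it.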

SLIDE 21

Conclusion

  • fitdistrplus could help risk assessment. It is part of a collaborative project with two other packages under development, mc2d and ReBaStaBa: the R-Forge project "Risk Assessment with R"

    http://riskassessment.r-forge.r-project.org/

  • fitdistrplus could also be used more widely to help fit univariate distributions to data

SLIDE 23

Still many things to do

fitdistrplus is still under development. Many improvements are planned:

  • other goodness-of-fit statistics
  • other graphs for goodness-of-fit for censored data (Turnbull, ...)
  • optimized choice of the algorithm used in optim for the likelihood maximization
  • graphs of likelihood contours (detection of identifiability problems)
  • ...

Do not hesitate to send us other ideas for improvement!