SLIDE 1 Data Analysis and Approximate Models
Laurie Davies, Fakultät Mathematik, Universität Duisburg-Essen
CRiSM Workshop: Non-likelihood Based Statistical Modelling, University of Warwick, 7-9 September 2015
SLIDE 2 Is statistics too difficult?
Cambridge 1963: First course on statistics given by John Kingman based on notes by Dennis Lindley. LSE 1966-1967: Courses by David Brillinger, Jim Durbin and Alan Stuart.
Müller, Heidelberg (Kiefer-Müller process). Frank Hampel [Hampel, 1998], title as above.
SLIDE 3
Two phases of analysis
Phase 1: EDA; scatter plots, q-q-plots, residual analysis, ... Provides possible models for formal treatment in Phase 2.
Phase 2: formal statistical inference; hypothesis testing, confidence intervals, prior distributions, posterior distributions, ...
SLIDE 4
Two phases of analysis
The two phases are often treated separately. It is possible to write books on Phase 1 without reference to Phase 2 [Tukey, 1977]. It is possible to write books on Phase 2 without reference to Phase 1 [Cox, 2006].
SLIDE 5
Two phases of analysis
In going from Phase 1 to Phase 2 there is a break in the modus operandi. Phase 1: probing, experimental, provisional. Phase 2: Behaving as if true.
SLIDE 6
Truth in statistics
Phase 2: parametric family PΘ = {Pθ : θ ∈ Θ}.
Frequentist: there exists a true θ ∈ Θ. Optimal estimators, or at least asymptotically optimal: maximum likelihood. An α-confidence region for θ is a region which, in the long run, contains the true parameter value with relative frequency α.
SLIDE 7
Truth in statistics
Bayesian: the Bayesian paradigm is completely wedded to truth. There exists a true θ ∈ Θ. Two different parameter values θ1, θ2 with Pθ1 ≠ Pθ2 cannot both be true. A Dutch book argument now leads to the additivity of a Bayesian prior, the requirement of coherence.
SLIDE 8 An example: copper data
27 measurements of amount of copper (milligrammes per litre) in a sample of drinking water. cu=(2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06 2.02 2.06 1.92 2.08 2.05 1.88 1.99 2.01 1.86 1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13)
[Figure: plot of the 27 copper measurements]
SLIDE 9 An example: copper data
Outliers? Hampel 5.2 mad criterion: max |cu − median(cu)|/mad(cu) = 3.3 < 5.2. Three models: (i) the Gaussian (red), (ii) the Laplace (blue), (iii) the comb (green). q-q-plots:
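The outlier screen above can be checked directly. A minimal sketch in Python, using the unscaled median absolute deviation as on the slide:

```python
from statistics import median

# the 27 copper measurements from the slides (milligrammes per litre)
cu = [2.16, 2.21, 2.15, 2.05, 2.06, 2.04, 1.90, 2.03, 2.06, 2.02, 2.06,
      1.92, 2.08, 2.05, 1.88, 1.99, 2.01, 1.86, 1.70, 1.88, 1.99, 1.93,
      2.20, 2.02, 1.92, 2.13, 2.13]

med = median(cu)                              # 2.03
mad = median(abs(x - med) for x in cu)        # unscaled mad = 0.10
ratio = max(abs(x - med) for x in cu) / mad
print(round(ratio, 1))                        # 3.3 < 5.2: no observation flagged
```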
SLIDE 10 An example: copper data
Distribution functions:
End of phase 1.
SLIDE 11
An example: copper data
Phase 2: for each location-scale model F((· − µ)/σ) behave as if it were true. Estimate the parameters µ and σ as efficiently as possible: maximum likelihood (at least asymptotically efficient).
Copper data

Model    Kuiper  p-value  log-lik.  95%-conf. int.    length
Normal   0.204   0.441    20.31     [1.970, 2.062]    0.092
Laplace  0.200   0.304    20.09     [1.989, 2.071]    0.082
Comb     0.248   0.321    31.37     [2.0248, 2.0256]  0.0008
SLIDE 12 An example: copper data
Bayesian: comb model. Prior for µ uniform over [1.7835, 2.24832], for σ independent of µ and uniform over [0.042747, 0.315859].
Posterior for µ is essentially concentrated on the interval [2.02122, 2.02922] agreeing more or less with the 0.95-confidence interval for µ.
SLIDE 13 An example: copper data
18 data sets in [Stigler, 1977]:

             Normal               Comb
Data         p-Kuiper  log-lik    p-Kuiper  log-lik
Short 1      0.535                0.234
Short 2      0.049                0.003
Short 3      0.314                0.132
Short 4      0.327                0.242
Short 5      0.102                0.022
Short 6      0.392                0.238
Short 7      0.532     12.41      0.495     22.80
Short 8      0.296                0.242     10.19
Newcomb 1    0.004                0.000
Newcomb 2    0.802                0.737
Newcomb 3    0.483                0.330
Michelson 1  0.247                0.093
Michelson 2  0.667                0.520
Michelson 3  0.001                0.000
Michelson 4  0.923                0.997
Michelson 5  0.338                0.338
Michelson 6  0.425                0.077
Cavendish    0.991     3.14       0.187     10.22
SLIDE 14
An example: copper data
Now use AIC or BIC ([Akaike, 1973], [Akaike, 1974], [Akaike, 1981], [Schwarz, 1978]) to choose the model. The winner is the comb model. Conclusion 1: this shows the power of likelihood methods, demonstrated by their ability to give such a precise estimate of the quantity of copper using data of such quality. Conclusion 2: this is nonsense, something has gone badly wrong.
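Given the log-likelihoods in the copper-data table above, the AIC comparison can be sketched as follows (treating each model as a two-parameter location-scale family, with the comb's shape parameters held fixed, is an assumption of this sketch):

```python
# log-likelihoods from the copper data table; each model has the
# 2 free parameters (mu, sigma); AIC = 2k - 2 * log-lik, smaller is better
logliks = {"Normal": 20.31, "Laplace": 20.09, "Comb": 31.37}
aic = {name: 2 * 2 - 2 * ll for name, ll in logliks.items()}
winner = min(aic, key=aic.get)
print(winner, aic[winner])  # the comb model wins on AIC
```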
SLIDE 15 Two topologies
Generating random variables. Two distribution functions F and G and a uniform random variable U:

X = F⁻¹(U) ⇒ X ∼ F,   Y = G⁻¹(U) ⇒ Y ∼ G.

Suppose F and G are close in the Kolmogorov or Kuiper metrics

dko(F, G) = max_x |F(x) − G(x)|,
dku(F, G) = max_{x<y} |F(y) − F(x) − (G(y) − G(x))|.

Then X and Y will in general be close. Taking finite precision into account can result in X = Y.
SLIDE 16 Two topologies
An example: F = N(0, 1) and G = Ccomb,(k,ds,p) given by

Ccomb,(k,ds,p)(x) = (p/k) Σ_{j=1}^{k} F((x − ιk(j))/ds) + (1 − p)F(x)

where ιk(j) = F⁻¹(j/(k + 1)), j = 1, . . . , k, and (k, ds, p) = (75, 0.005, 0.85). Ccomb,(k,ds,p) is a mixture of normal distributions with (k, ds, p) = (75, 0.005, 0.85) fixed. The Kuiper distance is dku(N(0, 1), Ccomb) = 0.02.
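The quoted Kuiper distance can be checked numerically. A sketch in pure Python; the grid evaluation and the bisection quantile routine are conveniences of this sketch:

```python
import math

def phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_phi(q, lo=-10.0, hi=10.0):
    # standard normal quantile by bisection (sufficient for this sketch)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi(mid) < q: lo = mid
        else: hi = mid
    return 0.5 * (lo + hi)

k, ds, p = 75, 0.005, 0.85
iota = [inv_phi(j / (k + 1)) for j in range(1, k + 1)]

def comb_cdf(x):
    return p * sum(phi((x - t) / ds) for t in iota) / k + (1 - p) * phi(x)

# Kuiper distance on a fine grid: max(F - G) + max(G - F)
grid = [i / 1000.0 for i in range(-4000, 4001)]
diffs = [comb_cdf(x) - phi(x) for x in grid]
dku = max(diffs) + max(-d for d in diffs)
print(round(dku, 3))  # the slides report dku(N(0,1), Ccomb) = 0.02
```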
SLIDE 17 Two topologies
Standard normal (black) and comb (red) random variables.
SLIDE 18
Two topologies
Phase 1 is based on distribution functions. This is the level at which data distributed according to the model are generated. The topology of Phase 1 is typified by the Kolmogorov metric dko or, equivalently, by the Kuiper metric dku.
SLIDE 19
Two topologies
Move to Phase 2: analyse the copper data using the normal and comb models. For both models behave as if true; this leads to likelihood. Likelihood is density based: ℓ(θ, xn) = f(xn, θ).
SLIDE 20 Two topologies
Phase 1 based on F(x, θ), Phase 2 on f(x, θ), where

F(x, θ) = ∫_{−∞}^{x} f(u, θ) du,   f(x, θ) = D(F(x, θ)).

Phase 1 and Phase 2 are connected by the linear differential operator D.
When are two densities f and g close? Use the L1 metric d1(f, g) = ∫ |f(x) − g(x)| dx.
SLIDE 21
Two topologies
F = {F : absolutely continuous, monotone, F(−∞) = 0, F(∞) = 1}. D : (F, dko) → (F, d1), D(F) = f. D is an unbounded linear operator and is consequently pathologically discontinuous. The topology Odko induced by dko is weak: few open sets. The topology Od1 induced by d1 is strong: many open sets. Odko ⊂ Od1.
SLIDE 22 Two topologies
Standard normal and comb density functions.
d1(N(0, 1), Ccomb) = 0.966.
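The same comparison in the density topology can be checked numerically. A Riemann-sum sketch of d1 on [−4, 4] under the comb parameters (k, ds, p) = (75, 0.005, 0.85); the grid and the bisection quantile routine are conveniences of this sketch:

```python
import math

def dnorm(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def inv_phi(q, lo=-10.0, hi=10.0):
    # standard normal quantile by bisection
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < q: lo = mid
        else: hi = mid
    return 0.5 * (lo + hi)

k, ds, p = 75, 0.005, 0.85
iota = [inv_phi(j / (k + 1)) for j in range(1, k + 1)]

def comb_pdf(x):
    # density of the comb: mixture of k narrow normals plus a normal background
    return p * sum(dnorm(x, t, ds) for t in iota) / k + (1 - p) * dnorm(x)

h = 0.001
d1 = h * sum(abs(comb_pdf(i * h) - dnorm(i * h)) for i in range(-4000, 4001))
print(round(d1, 3))  # the slides report d1(N(0,1), Ccomb) = 0.966
```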
SLIDE 23
Regularization
The location-scale problem F((· − µ)/σ) with choice of F is ill-posed and requires regularization. The results for the copper data show that 'efficiency = small confidence interval' can be imported through the model. Tukey ([Tukey, 1993]) calls this a free lunch and states that there is no such thing as a free lunch (TINSTAAFL). He calls models which do not introduce efficiency 'bland' or 'hornless'.
SLIDE 24 Regularization
A measure of blandness is the Fisher information. Minimum Fisher information models: normal and Huber, Section 4.4 of [Huber and Ronchetti, 2009]; see also [Uhrmann-Klingen, 1995].
Copper data

Model    Kuiper  p-value  log-lik.  95%-conf. int.    length  Fisher Inf.
Normal   0.204   0.441    20.31     [1.970, 2.062]    0.092   2.08·10^3
Laplace  0.200   0.304    20.09     [1.989, 2.071]    0.082   1.41·10^4
Comb     0.248   0.321    31.37     [2.0248, 2.0256]  0.0008  3.73·10^7
SLIDE 25
Regularization
Seems to imply: use minimum Fisher information models. But location and scale are linked in the model; combined with Bayes or maximum likelihood this may be sensitive to outliers (normal and Huber distributions, Section 15.6 of [Huber and Ronchetti, 2009]). Cauchy and t-distributions are not sensitive: Fréchet differentiable, Kent-Tyler functionals.
SLIDE 26 Regularization
Regularize through procedure rather than model. Smooth M-functionals, locally uniformly differentiable: (TL(P), TS(P)) the solution of

∫ ψ((x − TL(P))/TS(P)) dP(x) = 0,   (1)
∫ χ((x − TL(P))/TS(P)) dP(x) = 0.   (2)
SLIDE 27
Regularization
Possible choice of ψ and χ:

ψ(x) = ψ(x, c) = (exp(cx) − 1)/(exp(cx) + 1),   χ(x) = (x⁴ − 1)/(x⁴ + 1).

Solve with c = 5, retain TS(P) and then solve (1) for TL(P) with c = 1 to give a location functional T̃L. 0.95-approximation interval for the copper data: [1.973, 2.065]; Gaussian model: [1.970, 2.062].
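A pure-Python sketch of the two-stage procedure on the copper data. The alternating bisection solver and the starting values are conveniences of this sketch, not the method used for the slides:

```python
import math

cu = [2.16, 2.21, 2.15, 2.05, 2.06, 2.04, 1.90, 2.03, 2.06, 2.02, 2.06,
      1.92, 2.08, 2.05, 1.88, 1.99, 2.01, 1.86, 1.70, 1.88, 1.99, 1.93,
      2.20, 2.02, 1.92, 2.13, 2.13]

def psi(x, c):
    # psi(x, c) = (exp(cx) - 1)/(exp(cx) + 1) = tanh(cx/2)
    return math.tanh(c * x / 2.0)

def chi(x):
    x4 = x ** 4
    return (x4 - 1.0) / (x4 + 1.0)

def bisect(f, lo, hi, tol=1e-10):
    # bisection for a decreasing function with f(lo) > 0 > f(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0: lo = mid
        else: hi = mid
        if hi - lo < tol: break
    return 0.5 * (lo + hi)

def m_functional(data, c, n_iter=50):
    # alternate the location equation (1) and the scale equation (2)
    mu = sorted(data)[len(data) // 2]
    sigma = sum(abs(x - mu) for x in data) / len(data) or 1e-6
    for _ in range(n_iter):
        mu = bisect(lambda m: sum(psi((x - m) / sigma, c) for x in data),
                    min(data) - 1.0, max(data) + 1.0)
        sigma = bisect(lambda s: sum(chi((x - mu) / s) for x in data),
                       1e-6, 10.0 * (max(data) - min(data)))
    return mu, sigma

# solve jointly with c = 5 and retain the scale TS ...
_, ts = m_functional(cu, c=5)
# ... then solve the location equation (1) alone with c = 1
tl = bisect(lambda m: sum(psi((x - m) / ts, 1) for x in cu),
            min(cu) - 1.0, max(cu) + 1.0)
print(round(tl, 3), round(ts, 3))
```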
SLIDE 28 A well-posed example
The location-scale problem is ill-posed but likelihood can fail in well-posed problems. The following example is due to [Gelman, 2003]. Data are distributed as

MN(θ) = 0.5 N(µ1, σ1²) + 0.5 N(µ2, σ2²)

with θ = (µ1, σ1², µ2, σ2²). Maximum likelihood and Bayes fail (the likelihood is unbounded as one component variance tends to zero at a data point).

θ̂ = argminθ dko(Pn, MN(θ))

with the added bonus that you may decide that the data are not distributed as MN(θ) for any θ.
SLIDE 29 A well-posed example
θ = (0, 1, 1.5, 0.01), θ̂ = (−0.029, 1.053, 1.494, 0.0912)
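A sketch of the minimum Kolmogorov distance fit; the simulated sample and the crude random-search minimizer are conveniences of this sketch, not the optimizer behind the numbers above:

```python
import math, random

def norm_cdf(x, mu, var):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def mix_cdf(x, theta):
    mu1, v1, mu2, v2 = theta
    return 0.5 * norm_cdf(x, mu1, v1) + 0.5 * norm_cdf(x, mu2, v2)

def d_ko(data, theta):
    # Kolmogorov distance between the empirical cdf Pn and MN(theta)
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = mix_cdf(x, theta)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

random.seed(1)
theta_true = (0.0, 1.0, 1.5, 0.01)
data = [random.gauss(theta_true[0], math.sqrt(theta_true[1]))
        if random.random() < 0.5 else
        random.gauss(theta_true[2], math.sqrt(theta_true[3]))
        for _ in range(500)]

# crude random search for theta minimizing d_ko(Pn, MN(theta))
best = (1.0, 1.0, 0.0, 1.0)
best_d = d_ko(data, best)
for _ in range(1000):
    cand = tuple(b + random.gauss(0, 0.1) for b in best)
    if cand[1] > 0 and cand[3] > 0:
        d = d_ko(data, cand)
        if d < best_d:
            best, best_d = cand, d
print(best_d)
```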
SLIDE 30
Likelihood
(a) Likelihood reduces the measure of fit between a data set xn and a statistical model Pθ to a single number irrespective of the complexity of both. (b) Likelihood is dimensionless and imparts no information about closeness. (c) Likelihood is blind. Given the data and the model or models, it is not possible to deduce from the values of the likelihood whether the models are close to the data or hopelessly wrong. (d) Likelihood does not order models with respect to their fit to the data.
SLIDE 31 Likelihood
(e) Likelihood based procedures for model choice (AIC, BIC, MDL, Bayes) give no reason for being satisfied or dissatisfied with the models on offer. (f) Likelihood does not contain all the relevant information in the data xn about the values of the parameter θ. (g) Given the model, the sample cannot be reduced to the sufficient statistics without loss of information. (h) Likelihood is based on the differential operator and is consequently pathologically discontinuous. (i) Likelihood is evanescent: a slight perturbation of the model Pθ to a model P*θ can cause it to vanish.
SLIDE 32
Likelihood
On the positive side: (j) Likelihood delimits the possible. The likelihood principle: pointless and a waste of intellectual effort. Birnbaum [Birnbaum, 1962]: the likelihood principle holds when the model is 'adequate'. 'Adequate' is never spelt out; this constitutes an intellectual failure. There is a chasm between 'adequate' and 'true'. There are many adequate likelihoods: which one and why?
SLIDE 33
Approximate models
Project: Give an account of data analysis which consistently treats models as approximations. A model P is an adequate approximation to a data set xn if ‘typical’ data sets Xn(P) generated under P ‘look like’ xn. [Neyman et al., 1953], [Neyman et al., 1954], [Donoho, 1988], [Davies, 1995], [Davies, 2008], [Buja et al., 2009], [Xia and Tong, 2011], [Berk et al., 2013], [Huber, 2011], [Davies, 2014].
SLIDE 34
Approximate models
‘Approximation’ is a measure of closeness and this requires a topology. The topology is a weak topology characterized by the Kolmogorov metric. Approximate the data set as given. Non-frequentist. No true parameter so no confidence intervals in the frequentist sense - no true value to be covered.
SLIDE 35 Approximate models
Bayesian approximation? Parametric family PΘ and prior Π over Θ. No two different Pθ can both be true, but two different Pθ can both be adequate approximations. No exclusion, no Dutch book, no coherence.
Within the standard Bayesian set-up there can be no concept of approximation. More generally there can be no likelihood based concept of approximation. In particular, no Kullback-Leibler, no AIC, no BIC.
SLIDE 36
Approximate models
Data xn, family of models N(µ, 1), 'typical' = 0.95 (95% of the data generated under the model are classified as typical), 'looks like' = mean. Under N(µ, 1) typical means lie in (µ − 1.96/√n, µ + 1.96/√n). The mean x̄n of the data looks like a typical mean of an N(µ, 1) sample, that is, N(µ, 1) is an adequate approximation, if

µ − 1.96/√n ≤ x̄n ≤ µ + 1.96/√n.
SLIDE 37 Approximate models
Approximation region:

A(xn, 0.95, R) = {µ : |µ − x̄n| ≤ 1.96/√n}

Note there is no assumption that the xn are a realization of Xn(µ) for some 'true' µ. A more complicated approximation region:
A(xn, α, N) = {(µ, σ) :
  dku(Pn, N(µ, σ²)) ≤ qku(α1, n),
  max_i |xi − µ|/σ ≤ qout(α2, n),
  |Tskew(Pn)| ≤ qskew(α3, n),
  √n |x̄n − µ|/σ ≤ qnorm(α4),
  qchisq((1 − α5)/2, n) ≤ Σ_{i=1}^{n} (xi − µ)²/σ² ≤ qchisq((1 + α5)/2, n) }

Tskew is a measure of skewness. You have to pay for everything.
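A sketch of a membership check for a simplified version of this region: the skewness condition is omitted, and all quantiles (the slide's qku, qout, qchisq) are obtained by simulation under N(0, 1); the function names are conveniences of this sketch:

```python
import math, random
from statistics import mean, pstdev

cu = [2.16, 2.21, 2.15, 2.05, 2.06, 2.04, 1.90, 2.03, 2.06, 2.02, 2.06,
      1.92, 2.08, 2.05, 1.88, 1.99, 2.01, 1.86, 1.70, 1.88, 1.99, 1.93,
      2.20, 2.02, 1.92, 2.13, 2.13]

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def d_ku(sample, cdf):
    # Kuiper distance between the empirical cdf of sample and cdf
    xs = sorted(sample)
    n = len(xs)
    dplus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    dminus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return dplus + dminus

def sim_quantile(stat, n, alpha, nsim=2000, rng=random.Random(0)):
    # alpha-quantile of a statistic of an N(0,1) sample of size n
    vals = sorted(stat([rng.gauss(0, 1) for _ in range(n)])
                  for _ in range(nsim))
    return vals[min(nsim - 1, int(alpha * nsim))]

def in_region(data, mu, sigma, alpha=0.95):
    n = len(data)
    z = [(x - mu) / sigma for x in data]
    lo = sim_quantile(lambda s: sum(v * v for v in s), n, (1 - alpha) / 2)
    hi = sim_quantile(lambda s: sum(v * v for v in s), n, (1 + alpha) / 2)
    return (d_ku(z, norm_cdf) <= sim_quantile(lambda s: d_ku(s, norm_cdf), n, alpha)
            and max(abs(v) for v in z) <= sim_quantile(
                lambda s: max(abs(v) for v in s), n, alpha)
            and abs(sum(z)) / math.sqrt(n) <= 1.96
            and lo <= sum(v * v for v in z) <= hi)

print(in_region(cu, mean(cu), pstdev(cu)))
```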
SLIDE 38 Simulating long range financial data
Daily returns of Standard and Poor’s, 22381 observations
Stylized facts 1: volatility clustering
SLIDE 39 Simulating long range financial data
Stylized facts 2: heavy tails, q-q-plot
SLIDE 40 Simulating long range financial data
Stylized facts 3: slow decay of correlations of absolute values (long term memory (?))
SLIDE 41 Simulating long range financial data
Quantifying stylized facts: Piecewise constant volatility with 76 intervals [Davies et al., 2012]
SLIDE 42 Simulating long range financial data
Also take the unconditional volatility

(1/n) Σ_{t=1}^{n} |r(t)|,   (1/n) Σ_{t=1}^{n} r(t)²

and the long range return

exp( Σ_{t=1}^{n} r(t) ).

In all, 6 features of the data will be taken into account, all quantified.
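A sketch of how the unconditional features above might be computed for a return series r(1), …, r(n); the autocorrelation of the absolute values covers stylized fact 3 (the function and argument names are conveniences of this sketch):

```python
import math

def features(r, max_lag=50):
    # unconditional volatility measures and the long range return
    n = len(r)
    mean_abs = sum(abs(x) for x in r) / n        # (1/n) sum |r(t)|
    mean_sq = sum(x * x for x in r) / n          # (1/n) sum r(t)^2
    long_range = math.exp(sum(r))                # exp(sum r(t))
    # autocorrelation of the absolute returns up to max_lag
    a = [abs(x) for x in r]
    abar = sum(a) / n
    var = sum((x - abar) ** 2 for x in a)
    acf = [sum((a[t] - abar) * (a[t + k] - abar) for t in range(n - k)) / var
           for k in range(1, max_lag + 1)]
    return mean_abs, mean_sq, long_range, acf
```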
SLIDE 43
Simulating long range financial data
Basic model: R(t) = Σ(t)Z(t). The model for Σ(t) is the main problem. Default for Z(t) is i.i.d. N(0, 1) but allow for heavier or lighter tails, correlations and dependency of the sign of R(t) on |R(t)|.
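The basic model R(t) = Σ(t)Z(t) with a piecewise constant volatility Σ(t) and i.i.d. N(0, 1) multipliers Z(t) can be sketched as follows (the interval volatilities and lengths below are illustrative placeholders, not fitted values):

```python
import random

def simulate_returns(vols, lengths, seed=0):
    # R(t) = Sigma(t) * Z(t): Sigma piecewise constant, Z i.i.d. N(0, 1)
    rng = random.Random(seed)
    r = []
    for v, length in zip(vols, lengths):
        r.extend(v * rng.gauss(0, 1) for _ in range(length))
    return r

# e.g. a calm stretch followed by a volatile one (volatility clustering)
r = simulate_returns([0.005, 0.03], [150, 50])
```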
SLIDE 44 Simulating long range financial data
Piecewise constant log-volatility with 283 intervals (1st screw)
SLIDE 45 Simulating long range financial data
Low frequency trigonometric approximation (2nd screw) and randomized version
SLIDE 46
Simulating long range financial data
Add high frequency component (3rd screw) and noise (4th screw) to the log-volatility. Multiply the volatility by Z(t) with screws for (i) heaviness of tails (5th screw), (ii) short term correlations (6th screw), (iii) dependence of sign(R(t)) on |R(t)| (7th screw). Adjust the screws if possible so that all six features have high p-values, at least 0.1. A form of feature matching as in [Xia and Tong, 2011].
SLIDE 47 Simulating long range financial data
A simulated data set
SLIDE 48
Simulating long range financial data
Statistics for the simulated data set:
Intervals: 84 as against 76 for S+P
Mean absolute deviation of quantiles: 0.00067
Mean absolute deviation of acf: 0.020
Mean absolute volatility: 0.00773 as against 0.00766
Mean squared volatility: 0.000116 as against 0.000137
Returns: 37.93 as against 27.06

p-values based on 1000 simulations:

returns  intervals  mean abs. vol.  mean squ. vol.  quantiles  acf
0.934    0.531      0.292           0.305           0.977      0.532
SLIDE 49
Simulating long range financial data
How can one simulate a non-repeatable data set?
SLIDE 50
What can actually be estimated?
Title of Section 8.2d of [Hampel et al., 1986]. Copper data: what do we want to estimate? The amount of copper in the sample of water, say qcu. To do this the statistician often formulates a parametric model Pθ, estimates θ on the basis of the data and then identifies qcu with some function of θ, say h(θ).
SLIDE 51
What can actually be estimated?
Models: Gaussian, Laplace and comb. All are symmetric: identify the quantity of copper with the point of symmetry, namely µ. This gives a consistent interpretation over the different models. [Tukey, 1993]: data in analytical chemistry are often not symmetric.
SLIDE 52
What can actually be estimated?
Log-normal model LN(µ, σ²): identify the amount of copper with h(µ, σ), but which h? Consistency of interpretation across all four models? Model P: identify the quantity of copper with T(P), where T is the mean, the median, an M-functional, ... No explicit parametric model.
SLIDE 53
Choice of regression functional
Dependent variable y, covariates x = (x1, . . . , xk). Linear regression model Y = xtβ + ε. Which covariates to include is a question of model choice: Y = x(S)tβ(S) + ε, S ⊂ {1, . . . , k}. Assumptions about the distribution of ε. Methods: AIC, BIC, FIC ([Claeskens and Hjort, 2003]), full Bayesian etc.
SLIDE 54 Choice of regression functional
Distribution P:

Tℓ1,S(P) = argmin_{β(S)} ∫ |y − x(S)tβ(S)| dP(y, x)
Tℓ2,S(P) = argmin_{β(S)} ∫ (y − x(S)tβ(S))² dP(y, x)

Discrete y:

TDkl,S(P) = argmin_{β(S)} − ∫ q(y) log(p(x(S)tβ(S))/q(y)) dP(y, x)

with for example p(u) = exp(u)/(1 + exp(u)).
SLIDE 55 Choice of regression functional
Quantile regression. Stack loss data of [Brownlee, 1960], data set provided by [R Core Team, 2013], example in [Koenker, 2010]. R output for 95% confidence intervals based on rank inversion:

                    coefficients   lower bd      upper bd
(Intercept)        -39.68985507   -53.7946377   -24.49145429
stack.xAir.Flow      0.83188406     0.5090902     1.16750874
stack.xWater.Temp    0.57391304     0.2715066     3.03725908
stack.xAcid.Conc     ...            ...           0.01533628

Assume a linear regression model with i.i.d. error term ε: Y = xtβ + ε.
SLIDE 56
Choice of regression functional
The sum of the absolute residuals without Acid.Conc is 43.694; the sum with Acid.Conc is 42.081, a reduction of 1.613. Highest daily temperatures in Berlin from 01/01/2015 to 21/01/2015: 6, 8, 6, 5, 4, 3, 6, 7, 9, 13, 5, 8, 12, 8, 10, 10, 5, 4, 1, 2, 2. Replace Acid.Conc by Cos.Temp.Berlin. Inclusion of Cos.Temp.Berlin reduces the sum of absolute residuals by 1.162, not much worse than Acid.Conc.
SLIDE 57
Choice of regression functional
Replace Acid.Conc by 21 i.i.d. N(0, 1) random variables and repeat, say, 1000 times. In 21.2% of the cases there is a greater decrease in the sum of the absolute residuals than that due to the covariate Acid.Conc. The 21.2% will be referred to as a p-value, p = 0.212. Replacing all three covariates by i.i.d. N(0, 1) gives p = 1.93e−7.
SLIDE 58 Choice of regression functional
p-values for the 2³ = 8 possibilities:

functional j  0        1        2        3        4        5        6        7
p-value       1.93e-7  1.41e-2  4.90e-4  2.32e-1  5.02e-9  7.43e-3  2.57e-4  1.00

where j = j(S) = Σ_{i∈S} 2^{i−1}.
A small p-value indicates that the omitted covariates have some influence on the value of the dependent variable, at least for the data set being analysed.
Choose functionals with high p-values such that all included covariates are significant. The choice is j = 3, corresponding to S = {1, 2}.
SLIDE 59 Choice of regression functional
For ℓ2 regression simple asymptotic approximations for p-values
p(S) ≈ 1 − pchisq( n(||yn − xn(S)βlsq(S)||₂² − ||yn − xnβlsq||₂²) / ||yn − xn(S)βlsq(S)||₂², k − k(S) )

where βlsq(S) = Tℓ2,S(Pn) and βlsq = Tℓ2,Sf(Pn) with Sf = {1, . . . , k}.
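A sketch of this ℓ2 approximation in pure Python; the least squares solver (normal equations) and the series expansion for pchisq are simple stand-ins for library routines, and the helper names are conveniences of this sketch:

```python
import math

def lstsq(X, y):
    # least squares via the normal equations, Gaussian elimination with pivoting
    k, n = len(X[0]), len(y)
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv], b[p], b[piv] = A[piv], A[p], b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in reversed(range(k)):
        beta[p] = (b[p] - sum(A[p][q] * beta[q] for q in range(p + 1, k))) / A[p][p]
    return beta

def rss(X, y):
    beta = lstsq(X, y)
    return sum((yi - sum(x * bj for x, bj in zip(row, beta))) ** 2
               for row, yi in zip(X, y))

def chi2_cdf(x, df):
    # pchisq: regularized lower incomplete gamma P(df/2, x/2), series expansion
    a, t = df / 2.0, x / 2.0
    if t <= 0.0:
        return 0.0
    term = s = 1.0 / a
    n = 0
    while term > 1e-15 * s:
        n += 1
        term *= t / (a + n)
        s += term
    return s * math.exp(-t + a * math.log(t) - math.lgamma(a))

def p_value(X_full, X_sub, y):
    # p(S) ~ 1 - pchisq(n * (RSS(S) - RSS) / RSS(S), k - k(S))
    n, k, kS = len(y), len(X_full[0]), len(X_sub[0])
    r_sub, r_full = rss(X_sub, y), rss(X_full, y)
    return 1.0 - chi2_cdf(n * (r_sub - r_full) / r_sub, k - kS)
```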
SLIDE 60 Non-significance regions
The ‘Stack-Loss’ data are
42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14, 13, 11, 12, 8, 7, 8, 8, 9, 15, 15
with median 15. The sum of the absolute deviations from the median is 145. The non-significance region is defined as those m such that the difference between Σ_{i=1}^{21} |stack.lossi − m| and 145 is of the same order as that which can be obtained by regressing the dependent variable on random noise, that is, the difference is not significant.
SLIDE 61 Non-significance regions
Let ql1(α, m) denote the α-quantile of the random variable

Σ_{i=1}^{21} |stack.lossi − m| − inf_b Σ_{i=1}^{21} |stack.lossi − m − bZi|.

The non-significance region is defined as

NS(stack.loss, median, α) = { m : Σ_{i=1}^{21} |stack.lossi − m| − 145 ≤ ql1(α, m) }
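A simulation sketch of this membership check. It uses the fact that minimizing Σ|res_i − b·z_i| over b is a weighted median problem; the function names and the number of simulations are conveniences of this sketch:

```python
import random

stack_loss = [42, 37, 37, 28, 18, 18, 19, 20, 15, 14, 14,
              13, 11, 12, 8, 7, 8, 8, 9, 15, 15]

def l1_line_fit(res, z):
    # min_b sum |res_i - b*z_i|: the optimal b is a weighted median of res_i/z_i
    pairs = sorted((r / zi, abs(zi)) for r, zi in zip(res, z) if zi != 0)
    half = sum(w for _, w in pairs) / 2.0
    acc, b = 0.0, pairs[-1][0]
    for v, w in pairs:
        acc += w
        if acc >= half:
            b = v
            break
    return sum(abs(r - b * zi) for r, zi in zip(res, z))

def ql1(m, alpha, data, nsim=500, rng=random.Random(0)):
    # alpha-quantile of the reduction achievable with a pure-noise covariate
    res = [x - m for x in data]
    s0 = sum(abs(r) for r in res)
    gains = sorted(s0 - l1_line_fit(res, [rng.gauss(0, 1) for _ in data])
                   for _ in range(nsim))
    return gains[int(alpha * nsim)]

def in_ns_region(m, data, alpha=0.95):
    med = sorted(data)[len(data) // 2]
    s_med = sum(abs(x - med) for x in data)      # 145 for the stack loss data
    s_m = sum(abs(x - m) for x in data)
    return s_m - s_med <= ql1(m, alpha, data)

print(in_ns_region(15, stack_loss), in_ns_region(40, stack_loss))
```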
SLIDE 62 Non-significance regions
This can be calculated using simulations and gives NS(stack.loss, median, 0.95) = (11.94, 18.47) (2) which may be compared with the 0.95-confidence interval [11, 18] based on the order statistics. Covering properties? α = 0.95
n                10            20            50            100
N(0, 1)  in.reg. 0.940 1.512   0.954 1.040   0.948 0.648   0.942 0.464
         rank    0.968 2.046   0.968 1.198   0.970 0.767   0.964 0.530
C(0, 1)  in.reg. 0.960 3.318   0.956 1.670   0.960 0.958   0.952 0.629
         rank    0.978 5.791   0.950 1.850   0.968 1.069   0.964 0.700
χ²(1)    in.reg. 0.944 1.368   0.936 0.877   0.932 0.550   0.942 0.396
         rank    0.982 2.064   0.958 1.086   0.970 0.675   0.968 0.452
Pois(4)  in.reg. 0.934 1.918   0.925 0.993   0.926 0.288   0.938 0.071
         rank    0.996 3.948   0.964 2.342   0.997 1.573   1.000 1.085

(each cell: covering frequency, average interval length)
SLIDE 63 Non-significance regions
Asymptotics: Yi i.i.d. with median m and density f,

( med(Yn) − qnorm((1 + α)/2)/√(4f(m)²n),  med(Yn) + qnorm((1 + α)/2)/√(4f(m)²n) ).

The method does not require an estimate of f(m).
SLIDE 64
Non-significance regions
Requires linear regression model with true parameter values. Covering frequencies and average interval lengths for data generated according to Y = β1+β2·Air.Flow+β3·Water.Temp+β4·Acid.Conc+ε with βi, i = 1, . . . , 4 the ℓ1 estimates and different distributions for the error term: α = 0.95.
                   β2            β3            β4
residuals  in.reg. 0.944 0.265   0.982 0.682   0.998 0.248
           rank    0.976 0.390   0.970 1.205   0.970 0.273
Normal     in.reg. 0.954 0.381   0.946 1.042   0.964 0.442
           rank    0.974 0.435   0.956 1.208   0.962 0.542
Laplace    in.reg. 0.953 0.501   0.959 1.375   0.952 0.580
           rank    0.966 0.594   0.959 1.697   0.960 0.761
Cauchy     in.reg. 0.928 1.467   0.942 4.052   0.936 1.731
           rank    0.936 1.948   0.946 5.676   0.942 2.984

(each cell: covering frequency, average interval length)
SLIDE 65 An attitude of mind
Müller, Heidelberg: ... distanced rationality. By this we mean an attitude to the given, which is not governed by any possible or imputed immanent laws but which confronts it with draft constructs of the mind in the form of models, hypotheses, working hypotheses, definitions, conclusions, alternatives, analogies, so to speak from a distance, in the manner of partial, provisional, approximate knowledge. (Thesen zur Didaktik der Mathematik)
SLIDE 66 References
[Akaike, 1973] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. and Csaki, F., editors, Second international symposium on information theory, pages 267–281, Budapest. Akademiai Kiado. [Akaike, 1974] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723. [Akaike, 1981] Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics, 16:3–14.
[Berk et al., 2013] Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics, 41(2):802–837. [Birnbaum, 1962] Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57:269–326. [Brownlee, 1960] Brownlee, K. A. (1960). Statistical Theory and Methodology in Science and Engineering. Wiley, New York, 2nd edition.
SLIDE 67 [Buja et al., 2009] Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D., and Wickham, H. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A, 367:4361–4383. [Claeskens and Hjort, 2003] Claeskens, G. and Hjort, N. L. (2003). Focused information criterion. Journal of the American Statistical Association, 98:900–916.
[Cox, 2006] Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge. [Davies, 2014] Davies, L. (2014). Data Analysis and Approximate Models. Monographs on Statistics and Applied Probability 133. CRC Press.
[Davies, 1995] Davies, P. L. (1995). Data features. Statistica Neerlandica, 49:185–245. [Davies, 2008] Davies, P. L. (2008). Approximating data (with discussion). Journal of the Korean Statistical Society, 37:191–240. [Davies et al., 2012] Davies, P. L., Höhenrieder, C., and Krämer, W. (2012). Recursive estimation of piecewise constant volatilities. Computational Statistics and Data Analysis, 56(11):3623–3631. [Donoho, 1988] Donoho, D. L. (1988). One-sided inference about functionals of a density. Annals of Statistics, 16(4):1390–1420.
SLIDE 68 [Gelman, 2003] Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review, 71(2):369–382. [Hampel, 1998] Hampel, F. R. (1998). Is statistics too difficult? Canadian Journal of Statistics, 26(3):497–513. [Hampel et al., 1986] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York. [Huber, 2011] Huber, P. J. (2011). Data Analysis. Wiley, New Jersey. [Huber and Ronchetti, 2009] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. Wiley, New Jersey, second edition. [Koenker, 2010] Koenker, R. (2010). quantreg: Quantile regression. http://CRAN.R-project.org/package=quantreg. R package version 4.53. [Neyman et al., 1953] Neyman, J., Scott, E. L., and Shane, C. D. (1953). On the spatial distribution of galaxies: a specific model. Astrophysical Journal, 117:92–133. [Neyman et al., 1954] Neyman, J., Scott, E. L., and Shane, C. D. (1954). The index of clumpiness of the distribution of images of galaxies. Astrophysical Journal Supplement, 8:269–294.
SLIDE 69 [R Core Team, 2013] R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Schwarz, 1978] Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2):461–464. [Stigler, 1977] Stigler, S. M. (1977). Do robust estimators work with real data? (with discussion). Annals of Statistics, 5(6):1055–1098. [Tukey, 1977] Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Massachusetts. [Tukey, 1993] Tukey, J. W. (1993). Issues relevant to an honest account of data-based inference, partially in the light of Laurie Davies's paper. Princeton University, Princeton. [Uhrmann-Klingen, 1995] Uhrmann-Klingen, E. (1995). Minimal Fisher information distributions with compact supports. Sankhya Series A, 57:360–374. [Xia and Tong, 2011] Xia, Y. and Tong, H. (2011). Feature matching in time series modelling. Statistical Science, 26(1):21–46.