
Statistics & Experimental Design with R
Barbara Kitchenham
Keele University

Correlation and Regression

Correlation
• The association between two variables
• Strength of association usually measured by a correlation coefficient ρ in the range [-1, 1]
• Most well known
  – Pearson Product Moment Correlation coefficient
• Arises from the bivariate normal distribution
  – If both variables are standardized and then plotted
  – Ellipse shape indicates an association
    » The narrower the ellipse, the closer ρ is to 1 (+ve) or -1 (-ve)
  – Circular shape indicates no association, with ρ ≈ 0

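Bivariate Normal Distribution
• Bivariate Normal distribution
• Standard Bivariate Normal, with each z ~ N(0,1), has density $f(z_1, z_2) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left(-\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1-\rho^2)}\right)$
• Generalises to n dimensions
• Pearson's ρ is a parameter of the distribution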
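As a minimal sketch of how ρ acts as a parameter of the distribution, the R code below simulates a standard bivariate normal pair from two independent N(0,1) variables and checks that the sample correlation is close to ρ. The sample size, the value of ρ and the variable names are illustrative assumptions, not taken from the slides.

```r
# Sketch: simulate a standard bivariate normal pair with correlation rho.
# n, rho and the construction below are illustrative assumptions.
set.seed(1)
n   <- 5000
rho <- 0.9

z1 <- rnorm(n)                          # independent N(0, 1) draws
z2 <- rnorm(n)
x  <- z1
y  <- rho * z1 + sqrt(1 - rho^2) * z2   # Var(y) = 1 and Cor(x, y) = rho

r_hat <- cor(x, y)                      # sample estimate of rho
print(r_hat)
# plot(x, y, asp = 1)                   # uncomment to see the elliptical point cloud
```

Re-running with a smaller `rho` should give a rounder point cloud, in line with the ellipse description on the Correlation slide.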

Pearson's ρ
• From the bivariate normal distribution
• Estimated from data by the sample coefficient r
• Calculating r does not require normality
  – But statistical tests of significance do
  – Test of H0: ρ = 0 can be based on T having Student's t distribution with n-2 df, where $T = r\sqrt{n-2} / \sqrt{1-r^2}$
  – There is also a normalising transformation $z = \frac{1}{2}\ln\frac{1+r}{1-r}$
    • Which has standard error $1/\sqrt{n-3}$
    • Used when correlations from different sources need to be aggregated (such as during meta-analyses)
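The test statistic and the normalising transformation can be checked against `cor.test` in a few lines. This is a sketch on made-up data (the `x` and `y` vectors are hypothetical, not the ICL data used later in the slides).

```r
# Hypothetical data for illustration only
x <- c(2, 4, 5, 7, 8, 10, 12, 13, 15, 18)
y <- c(3, 7, 4, 9, 12, 8, 15, 11, 19, 16)
n <- length(x)

r      <- cor(x, y)                             # Pearson's r
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)       # T statistic on n - 2 df
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)      # two-sided p-value

z    <- atanh(r)                                # 0.5 * log((1 + r) / (1 - r))
se_z <- 1 / sqrt(n - 3)                         # standard error of z
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z) # 95% CI back on the r scale

ct <- cor.test(x, y)                            # built-in equivalent
print(c(t_manual = t_stat, t_builtin = unname(ct$statistic)))
```

The confidence interval reported by `cor.test` for Pearson's r is built from the same transformation, so `ci` should agree with `ct$conf.int`.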

Small Data Set
• Using cor.test in R: ρ=0.57, T=1.9448 n.s.
• Delete A and ρ=0.57, T=5.887***
• Delete B and ρ=0.28, T=0.760 n.s.
[Figure: "Data from ICL", scatter plot of LoC against Effort with points A and B labelled]

Factors Affecting the Magnitude of Pearson's ρ
• The slope of the line about which points are clustered
  – If slope=0, ρ=0; the larger the slope, the larger is ρ
• The magnitude of the deviations from the line
  – The closer the points are to the notional line, the larger is ρ
• Outliers
• Restricting the range of X values
  – Can increase or decrease ρ
• Curvature
  – ρ assumes a linear relationship
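Two of these factors are easy to illustrate with simulated data. The sketch below shows one case where restricting the range of X lowers r (restriction can also raise it in other settings), and one case where a perfect but curved relationship gives r close to 0. All values and names are illustrative assumptions.

```r
# Range restriction: simulated data with true correlation 0.7
set.seed(2)
n <- 2000
x <- rnorm(n)
y <- 0.7 * x + sqrt(1 - 0.7^2) * rnorm(n)

r_full       <- cor(x, y)              # all observations
keep         <- abs(x) < 0.5           # keep only the middle of the X range
r_restricted <- cor(x[keep], y[keep])  # noticeably smaller than r_full here

# Curvature: y is an exact function of x, yet the linear correlation vanishes
x_grid  <- seq(-3, 3, length.out = 201)
r_curve <- cor(x_grid, x_grid^2)

print(c(full = r_full, restricted = r_restricted, curved = r_curve))
```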

Robust Correlation
• Spearman's ρ
  – Replace data values by ranks
  – Uses same calculation as Pearson
• With previous data set
  – All data, r=0.41, p=0.25
  – With A removed, r=0.67, p=0.059
  – With B removed, r=0.18, p=0.64
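The "ranks, then Pearson" description can be verified directly. The data below are hypothetical (with one deliberately extreme point), not the ICL data set.

```r
# Hypothetical data with one extreme point in the last position
x <- c(2, 4, 5, 7, 8, 10, 12, 13, 15, 40)
y <- c(3, 7, 4, 9, 12, 8, 15, 11, 19, 5)

r_pearson         <- cor(x, y)                      # sensitive to the extreme point
r_spearman_manual <- cor(rank(x), rank(y))          # Pearson applied to ranks
r_spearman        <- cor(x, y, method = "spearman") # built-in Spearman

print(c(pearson = r_pearson, spearman = r_spearman))
# cor.test(x, y, method = "spearman") gives the accompanying p-value
```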

Non-Parametric Correlation
• Kendall's tau (τ)
• Based on the sign of the slope between all pairs of points
  – τ is the proportion of pairs with a positive slope (concordant) minus the proportion with a negative slope (discordant)
  – The median of the pairwise slopes gives the related Theil-Sen slope estimate
• With previous data set
  – All data, τ=0.33, p=0.22
  – With A removed, τ=0.56, p=0.045
  – With B removed, τ=0.17, p=0.61
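A minimal sketch of the pairwise calculation, assuming no tied values (with ties, R's `cor(..., method = "kendall")` applies a tie correction that this simple version omits). The function name `kendall_tau` and the data are hypothetical.

```r
# Kendall's tau from the signs of all pairwise slopes (no-ties version)
kendall_tau <- function(x, y) {
  n <- length(x)
  s <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # +1 for a concordant pair, -1 for a discordant pair
      s <- s + sign(x[j] - x[i]) * sign(y[j] - y[i])
    }
  }
  s / choose(n, 2)   # divide by the number of pairs
}

# Hypothetical data without ties
x <- c(2, 4, 5, 7, 8, 10, 12, 13, 15, 40)
y <- c(3, 7, 4, 9, 12, 8, 15, 11, 19, 5)

tau_manual  <- kendall_tau(x, y)
tau_builtin <- cor(x, y, method = "kendall")
print(c(manual = tau_manual, builtin = tau_builtin))
```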

RelPlot
• The relplot function is a bivariate equivalent of the box plot
• Shows the central ellipsoid part of the bivariate distribution plus outliers
• Calculates a robust estimate of r=0.90
• Does not generalise to more dimensions
• Assuming bivariate normal means negative values are expected
[Figure: relplot of y against x]

MGV Method for Outliers
• The Minimum Generalised Variance method can be used with many variables
[Figure: "MGV method", scatter plot of Y against X with the detected outlier marked "o"]

Robust Correlations
• Winsorized correlation (wincor(x,y))
  – Replace x and y values at the extremes with the 25 (low) and 75 (high) percentile values
  – Correlation 0.407, sig.level=0.276
• Percentage Bend Correlation
  – Not an estimate of Pearson's r
  – New correlation robust to changes in distribution
  – Based on trimming univariate outliers
  – corb(x,y,corfun=pbcor,nboot=599)
  – r_pb=0.441, bootstrap CI=(-0.44, 0.97)
• Skipped correlations (i.e. remove outliers)
  – Outliers removed based on MGV, then use Pearson (r=0.91)
  – Need to adjust test value & critical value
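The idea behind the Winsorized correlation can be sketched in base R as below. This is a simplified illustration using `quantile()` cut-offs, not the `wincor` function named on the slide, whose exact Winsorizing rule and significance test may differ. The helper `winsorize` and the data are hypothetical.

```r
# Pull values beyond the chosen percentiles in to those percentiles
winsorize <- function(v, lower = 0.25, upper = 0.75) {
  q <- quantile(v, c(lower, upper), names = FALSE)
  pmin(pmax(v, q[1]), q[2])
}

# Hypothetical data with one extreme point
x <- c(2, 4, 5, 7, 8, 10, 12, 13, 15, 40)
y <- c(3, 7, 4, 9, 12, 8, 15, 11, 19, 5)

xw <- winsorize(x)
yw <- winsorize(y)
r_w <- cor(xw, yw)      # Pearson correlation of the Winsorized values

print(c(pearson = cor(x, y), winsorized = r_w))
```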

Comparison on Full Data Set
[Figure: two panels for the full data set, "relplot" (y against x) and "MGV method" (Y against X), with outliers marked "o"]

Linear Regression
• Finding the parameters of a model of the form $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$
  – Y is the response/outcome/dependent variable
  – X_i is the i-th of p stimulus/input/independent variables
  – β_i is the i-th parameter of the model
• A linear model is linear w.r.t. the parameters
  – Polynomial models are linear models of the n-th order, where n is the highest power
  – I.e. a second-order regression model has form $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$
  – A non-linear model might have a form such as $Y = \beta_0 e^{\beta_1 X} + \epsilon$

Least Squares Principles
• Basic model for one input variable is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
• Sum of squares of deviations from the true line is $S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
• To estimate by least squares
  – Differentiate w.r.t. each parameter in turn
  – To find the turning point (i.e. minimum) set each differential to 0
• Solve for each parameter in turn

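Parameter Estimation
• Differentials of the sum of squares S are $\partial S / \partial \beta_0 = -2 \sum (y_i - \beta_0 - \beta_1 x_i)$ and $\partial S / \partial \beta_1 = -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i)$
• Solutions after setting each to 0 are $b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$
• For standardized normal variables $b_1 = r$
  – Slope must be less than 1 whenever there is an error term, even if the underlying relationship is Y=X
  – The larger the error term, the smaller r and the lower the value of b_1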
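The closed-form estimates can be compared with `lm`, and the standardized case checked against r. This is a sketch on hypothetical data.

```r
# Hypothetical data for illustration only
x <- c(2, 4, 5, 7, 8, 10, 12, 13, 15, 18)
y <- c(3, 7, 4, 9, 12, 8, 15, 11, 19, 16)

# Least squares estimates from the closed-form solutions
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)                     # same estimates from lm
print(rbind(manual = c(b0, b1), lm = unname(coef(fit))))

# With both variables standardized, the fitted slope equals r
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
b1_std <- coef(lm(zy ~ zx))[["zx"]]
print(c(slope_standardized = b1_std, r = cor(x, y)))
```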

Bivariate Normal Distributions
[Figure: two scatter plots of y against x, panels rho=0.5 and rho=0.9]
• rho=0.5: b_1=0.57441, b_0=-0.07613
• rho=0.9: b_1=0.9018, b_0=-0.0097
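A comparable experiment can be sketched as follows: simulate standard bivariate normal data for each value of rho and fit a straight line. The seed, sample size and the helper name `sim_fit` are assumptions, so the estimates will not reproduce the slide's figures exactly, but the slope should land near rho and the intercept near 0.

```r
# Fit y ~ x on simulated standard bivariate normal data with correlation rho
set.seed(42)
sim_fit <- function(rho, n = 2000) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  coef(lm(y ~ x))          # named vector: (Intercept), x
}

fit_05 <- sim_fit(0.5)
fit_09 <- sim_fit(0.9)
print(rbind(rho_0.5 = fit_05, rho_0.9 = fit_09))
```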

Multivariate Regression
• Formulate in matrix algebra terms, assuming X and Y have means removed, i.e. Y = y - μ_y, as $Y = X\beta + \epsilon$
• Y is an (n×1) vector
• X is an (n×p) matrix of known form
• β is a (p×1) vector of parameters
• ϵ is an (n×1) vector of error terms
• Where E(ϵ)=0, V(ϵ)=Iσ²
• Solution is $b = (X^T X)^{-1} X^T Y$
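The matrix solution can be computed directly and compared with `lm`. In this sketch the data are simulated and the coefficient values are arbitrary assumptions. Means are removed from both the predictors and the response, so the result corresponds to the slope coefficients of a model fitted with an intercept.

```r
# Simulated data with two predictors (values are arbitrary)
set.seed(7)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 0.5)

# Remove means, as assumed on the slide
X <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
Y <- y - mean(y)

# b = (X'X)^{-1} X'Y, solved without forming the inverse explicitly
b <- solve(crossprod(X), crossprod(X, Y))

fit <- lm(y ~ x1 + x2)
print(cbind(matrix_solution = as.numeric(b), lm = unname(coef(fit)[-1])))
```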

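Least Squares Properties
• Fitted values are obtained from $\hat{Y} = Xb$, where b is the least squares estimate
• Vector of residuals $e = Y - \hat{Y}$
• Variance of parameters $V(b) = (X^T X)^{-1} \sigma^2$
• Multiple Correlation Coefficient $R^2 = 1 - e^T e / Y^T Y$ (Y with mean removed)
• Adjusted $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$
• Both R² are vulnerable to outliers
• Many diagnostic tools available based on residuals and the Hat Matrix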
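These quantities can be reproduced from a fitted `lm` object. The sketch below uses simulated data and assumes a model with an intercept and p = 2 predictors, with σ² estimated by the residual mean square on n - p - 1 degrees of freedom.

```r
# Simulated data (values are arbitrary)
set.seed(7)
n  <- 50
p  <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2)
e   <- residuals(fit)                        # vector of residuals

r2     <- 1 - sum(e^2) / sum((y - mean(y))^2)       # multiple R^2
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)      # adjusted R^2

s2  <- sum(e^2) / (n - p - 1)                # estimate of sigma^2
Xd  <- model.matrix(fit)                     # design matrix incl. intercept column
V_b <- s2 * solve(crossprod(Xd))             # estimated variance of parameters

print(c(R2 = r2, R2_adj = r2_adj))
print(sqrt(diag(V_b)))                       # standard errors of the coefficients
```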

The Hat Matrix
• Hat Matrix is defined as $H = X (X^T X)^{-1} X^T$
• Called the Hat matrix because $\hat{Y} = HY$
• It is important because, if h_ii is the i-th diagonal element of H and e_j is the j-th residual
  – Difference between
    • Parameter with and without observation x_j is $b - b_{(j)} = (X^T X)^{-1} x_j e_j / (1 - h_{jj})$
    • Fitted value with and without observation x_j is $\hat{y}_j - \hat{y}_{j(j)} = h_{jj} e_j / (1 - h_{jj})$
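A short check of these relationships, assuming simulated data and a model with an intercept: the hat matrix is built explicitly, its diagonal is compared with `hatvalues`, and the deletion formulas are compared with an actual refit that omits one observation (here j = 1, an arbitrary choice).

```r
# Simulated data (values are arbitrary)
set.seed(7)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 1.5 * d$x1 - 0.8 * d$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = d)
X   <- model.matrix(fit)

XtX_inv <- solve(crossprod(X))
H <- X %*% XtX_inv %*% t(X)      # hat matrix
h <- diag(H)                     # leverages h_ii
e <- residuals(fit)

j <- 1                           # observation to delete
# Change in the parameter estimates when observation j is omitted
delta_b   <- as.numeric(XtX_inv %*% X[j, ]) * e[[j]] / (1 - h[[j]])
# Change in the fitted value at x_j when observation j is omitted
delta_fit <- h[[j]] * e[[j]] / (1 - h[[j]])

fit_j <- lm(y ~ x1 + x2, data = d[-j, ])   # brute-force refit for comparison
print(rbind(formula = delta_b, refit = unname(coef(fit) - coef(fit_j))))
```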

Three Types of Residual
• Residuals
• Standardized Residuals
• Studentized Residuals (based on omitting each data point in turn from the variance)
• Sadly, lm doesn't automatically provide fitted values based on n-1 points (omitting each point in turn)
  – However, lm provides access to the hat matrix values
    • Via the fitted model, i.e. hatvalues(fit)
    • So they can be calculated by writing your own R program
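One way such a program might look is sketched below. It uses R's own definitions of the standardized (`rstandard`) and Studentized (`rstudent`) residuals, which may differ in detail from the formulas on the original slide, and obtains the leave-one-out fitted values from the hat values rather than by refitting. The data are simulated and all names are illustrative.

```r
# Simulated data (values are arbitrary)
set.seed(3)
d <- data.frame(x = 1:12)
d$y <- 5 + 2 * d$x + rnorm(12, sd = 2)

fit <- lm(y ~ x, data = d)
n <- nrow(d)
k <- length(coef(fit))           # number of fitted parameters
e <- residuals(fit)              # raw residuals
h <- hatvalues(fit)              # diagonal of the hat matrix
s <- summary(fit)$sigma          # residual standard deviation

# Standardized residuals: scale by s and the leverage
r_std <- e / (s * sqrt(1 - h))

# Studentized residuals: s re-estimated with each point omitted in turn
s_del <- sqrt(((n - k) * s^2 - e^2 / (1 - h)) / (n - k - 1))
r_stu <- e / (s_del * sqrt(1 - h))

# Fitted value for point i from the model fitted to the other n - 1 points
loo_fitted <- d$y - e / (1 - h)

# Brute-force check: refit without each point and predict it
loo_refit <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  predict(fit_i, newdata = d[i, , drop = FALSE])
})
print(cbind(via_hat = unname(loo_fitted), via_refit = unname(loo_refit)))
```

Using the hat values avoids n separate refits, which is the practical reason the slide points to `hatvalues(fit)`.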
