Revision: Chapter 1-6 Applied Multivariate Statistics, Spring 2012
Overview
- Cov, Cor, Mahalanobis, MV normal distribution
- Visualization: Stars plot, mosaic plot with shading
- Outlier: chisq.plot
- Missing values: md.pattern, mice
- MDS: Metric / non-metric
- Dissimilarities: daisy
- PCA
- LDA
Two variables: Covariance and Correlation
- Covariance: $\mathrm{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])]$
- Correlation: $\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$
- Sample covariance: $\widehat{\mathrm{Cov}}(x,y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
- Sample correlation: $r_{xy} = \widehat{\mathrm{Cor}}(x,y) = \frac{\widehat{\mathrm{Cov}}(x,y)}{\hat{\sigma}_x \hat{\sigma}_y}$
- Correlation is invariant to changes in units, covariance is not (e.g. kilo/gram, meter/kilometer, etc.)
Scatterplot: Correlation is scale invariant
Intuition and pitfalls for correlation
- Correlation captures only the LINEAR relation between two variables
Covariance matrix / correlation matrix: Table of pairwise values
- True covariance matrix: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$
- True correlation matrix: $C_{ij} = \mathrm{Corr}(X_i, X_j)$
- Sample covariance matrix: $S_{ij} = \widehat{\mathrm{Cov}}(x_i, x_j)$ (diagonal: variances)
- Sample correlation matrix: $R_{ij} = \widehat{\mathrm{Cor}}(x_i, x_j)$ (diagonal: 1)
- R: Functions “cov”, “cor” in package “stats”
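A minimal R sketch of these two functions, using the built-in iris data as a stand-in example:

```r
X <- iris[, 1:4]   # four numeric variables
S <- cov(X)        # sample covariance matrix; diag(S) are the variances
R <- cor(X)        # sample correlation matrix; diag(R) is all 1
round(R, 2)
```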
Multivariate Normal Distribution: Most common model choice

$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

- Squared Mahalanobis distance: $MD^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$
- Squared distance from the mean in standard deviations, in the direction of x
Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

- Point (0, 10): MD = 10
- Point (10, 7): MD ≈ 7.3
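These numbers can be checked directly in R; note that stats::mahalanobis() returns the squared distance:

```r
mu <- c(0, 0)
Sigma <- matrix(c(25, 0, 0, 1), nrow = 2)
sqrt(mahalanobis(c(0, 10), center = mu, cov = Sigma))  # 10
sqrt(mahalanobis(c(10, 7), center = mu, cov = Sigma))  # ~7.28
```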
Glyphplots: Stars
- Which cities are special?
- Which cities are like New Orleans?
- Seattle and Miami are quite far apart; how do they compare?
- R: Function “stars” in package “stats”
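A minimal sketch of the stars() call; the slide uses city climate data, so the built-in USJudgeRatings data set here is only a stand-in:

```r
# One star per row; each ray is one (scaled) variable
stars(USJudgeRatings, key.loc = c(13, 1.5),
      main = "Star glyphs, one per judge")
```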
Mosaic plot with shading
- p-value of independence test: highly significant
- Shading marks cells whose observed counts are surprisingly small or surprisingly large
- R: Function “mosaic” in package “vcd”
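A short example, assuming the built-in HairEyeColor contingency table (not the data from the slide):

```r
library(vcd)
# shade = TRUE colors cells whose observed counts deviate
# strongly from what independence would predict
mosaic(~ Hair + Eye, data = HairEyeColor, shade = TRUE)
```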
Outliers: Theory of Mahalanobis Distance
Assume data is multivariate normally distributed (d dimensions)
- Squared Mahalanobis distance of the samples follows a chi-square distribution with d degrees of freedom
- Expected value: d (“by definition”: the sum of d squared standard normal random variables has a chi-square distribution with d degrees of freedom)
Outliers: Check for multivariate outliers
- Are there samples with estimated Mahalanobis distance that don’t fit at all to a chi-square distribution?
- Check with a QQ-plot
- Technical details:
  - The chi-square distribution is still reasonably good for the estimated Mahalanobis distance
  - Use robust estimates for $\mu, \Sigma$
- R: Function “chisq.plot” in package “mvoutlier”
Outliers: chisq.plot
- Outlier easily detected!
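A minimal sketch with simulated data and one planted outlier; chisq.plot() is interactive and lets you mark outlying points on the QQ-plot:

```r
library(mvoutlier)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # 100 bivariate standard normal samples
x[1, ] <- c(6, 6)                  # plant one clear outlier
chisq.plot(x)                      # chi-square QQ-plot of squared distances
```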
Missing values: Problem of Single Imputation
- Too optimistic: the imputation model (e.g. in Y = a + bX) is only estimated, not the true model
- Thus, imputed values carry some uncertainty
- Single imputation ignores this uncertainty
- Coverage probability of confidence intervals is wrong
- Solution: Multiple Imputation, which incorporates both
  - residual error
  - model uncertainty (excluding model mis-specification)
- R: Package “mice” for Multiple Imputation using Chained Equations
Multiple Imputation: MICE
1. Impute several times
2. Do the standard analysis for each imputed data set; get estimate and std. error
3. Aggregate the results
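The three steps map directly onto the mice API; nhanes is a small example data set shipped with the package:

```r
library(mice)
imp <- mice(nhanes, m = 5, printFlag = FALSE)  # step 1: impute 5 times
fit <- with(imp, lm(bmi ~ age + chl))          # step 2: analysis per data set
pool(fit)                                      # step 3: aggregate the results
```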
Idea of MDS
- Represent a high-dimensional point cloud in few (usually 2) dimensions, keeping distances between points similar
- Classical/Metric MDS: use a clever projection
  - guaranteed to find the optimal solution only for Euclidean distance
  - fast
  - R: Function “cmdscale” in the base distribution
- Non-metric MDS:
  - squeeze data onto the table = minimize STRESS
  - only conserve ranks = allow monotonic transformations before reducing dimensions
  - slow(er)
  - R: Function “isoMDS” in package “MASS”
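Both variants in one minimal sketch, using the built-in eurodist road distances between European cities:

```r
library(MASS)
fit_metric <- cmdscale(eurodist, k = 2)  # classical (metric) MDS
fit_nonmet <- isoMDS(eurodist, k = 2)    # non-metric MDS (minimizes STRESS)
plot(fit_metric, type = "n")
text(fit_metric, labels = labels(eurodist))
```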
Distance: To scale or not to scale…
- If variables are not scaled:
  - the variable with the largest range has the most weight
  - distance depends on scale
- Scaling gives every variable equal weight
- A similar alternative is re-weighting: $d(i,j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \ldots + w_p (x_{ip} - x_{jp})^2}$
- Scale if:
  - variables measure different units (kg, meter, sec, …)
  - you explicitly want each variable to have equal weight
- Don’t scale if units are the same for all variables
- Most often: better to scale
Dissimilarity for mixed data: Gower’s Dissimilarity
- Idea: Use a distance measure between 0 and 1 for each variable, then aggregate: $d(i,j) = \frac{1}{p} \sum_{f=1}^{p} d_{ij}^{(f)}$
- Binary (asymmetric/symmetric), nominal: use the methods discussed before
  - asymmetric: one group is much larger than the other
- Interval-scaled: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{R_f}$, where $x_{if}$ is the value of object i in variable f and $R_f$ is the range of variable f over all objects
- Ordinal: use normalized ranks, then treat like interval-scaled based on range
- R: Function “daisy” in package “cluster”
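A minimal sketch; flower is a small data set in “cluster” that mixes binary, nominal, ordinal and interval-scaled variables:

```r
library(cluster)
data(flower)
d <- daisy(flower, metric = "gower")  # Gower's dissimilarity in [0, 1]
summary(d)
```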
PCA: Goals
- Goal 1: Dimension reduction to a few dimensions while explaining most of the variance (use the first few PCs)
- Goal 2: Find a one-dimensional index that separates objects best (use the first PC)
PCA (Version 1): Orthogonal directions
- PC 1 is the direction of largest variance
- PC 2 is
  - perpendicular to PC 1
  - again the direction of largest variance
- PC 3 is
  - perpendicular to PC 1 and PC 2
  - again the direction of largest variance
- etc.
How many PCs: Blood Example
- Rule 1: 5 PCs
- Rule 2: 3 PCs
- Rule 3: Elbow after PC 1 (?)
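The usual diagnostics behind these rules, sketched with prcomp on a stand-in data set (the blood data from the slide is not reproduced here):

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)  # scale. = TRUE standardizes first
summary(pc)                    # proportion of variance per PC
screeplot(pc, type = "lines")  # look for the elbow
```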
Biplot: Show info on samples AND variables

Approximately true:
- Data points: projection onto the first two PCs; distance in the biplot ~ true distance
- Projection of a sample onto an arrow gives the original (scaled) value of that variable
- Arrow length: variance of the variable
- Angle between arrows: correlation

The approximation is often crude, but good for a quick overview.
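Continuing the sketch above:

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(pc)  # points = samples, arrows = variables
```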
Supervised Learning: LDA

Bayes rule: $P(C|X) = \frac{P(C)\,P(X|C)}{P(X)} \propto P(C)\,P(X|C)$

- Choose the class where P(C|X) is maximal (the rule is “optimal” if all types of error are equally costly)
- Special case, two classes (0/1):
  - choose c = 1 if P(C=1|X) > 0.5, or
  - choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1
- Prior / prevalence: fraction of samples in that class
- Assume: $X|C \sim N(\mu_c, \Sigma)$
- In practice: estimate the prior, $\mu_c$ and $\Sigma$ from the training data
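A minimal sketch with MASS::lda; by default the priors are the class proportions in the training data:

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)       # posterior probabilities and classes
table(iris$Species, pred$class)  # confusion table on the training data
```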
LDA
- Classify a new point to which class? Consider:
  - the prior
  - the Mahalanobis distance to the class center
- 1st Principal Component
- 1st Linear Discriminant = 1st Canonical Variable
- Linear decision boundary
- Orthogonal directions of best separation
- Balance prior and Mahalanobis distance
LDA: Quality of classification
- Using the training data also as test data: overfitting; too optimistic for the error on new data
- Separate test data
- Cross validation (CV; e.g. “leave-one-out” cross validation): every row is the test case once, the rest is the training data
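Leave-one-out CV is built into MASS::lda, continuing the sketch above:

```r
library(MASS)
fit_cv <- lda(Species ~ ., data = iris, CV = TRUE)  # leave-one-out CV
mean(fit_cv$class != iris$Species)  # estimated error rate on new data
```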