Statistical Data Analysis
DS-GA 1002: Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall16
Carlos Fernandez-Granda
Descriptive statistics Statistical estimation
Histogram
- Technique to visualize one-dimensional data
- Bin the range of the data, then count the number of instances in each bin
- The width of the bins can be adjusted to yield higher or lower resolution
- If the data are iid, the histogram approximates their pmf or pdf
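A minimal sketch of the binning procedure in Python (the dataset and bin count are hypothetical; numpy and matplotlib are assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical one-dimensional dataset: iid Gaussian draws
data = np.random.default_rng(0).normal(loc=20.0, scale=2.0, size=500)

# Bin the range of the data and count instances per bin;
# density=True rescales counts so the histogram approximates the pdf
counts, edges, _ = plt.hist(data, bins=15, density=True, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Estimated density")
plt.show()
```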
Temperature in Oxford
[Histograms of daily temperatures in January and August; x-axis in degrees Celsius.]
GDP per capita
[Histogram of GDP per capita across countries; x-axis in thousands of dollars.]
Empirical mean
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical mean is defined as
$$\operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i$$
Temperature data: 6.73 °C in January and 21.3 °C in August
GDP per capita: $16,500
Empirical mean
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of $d$-dimensional real-valued data. The empirical mean is defined as
$$\operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i$$
Centering a dataset by subtracting its empirical mean is a common preprocessing step
Empirical variance
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical variance is defined as
$$\operatorname{var}(x_1, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, \ldots, x_n)\right)^2$$
The sample standard deviation is the square root of the empirical variance
Temperature data: 1.99 °C in January and 1.73 °C in August
GDP per capita: $25,300
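Both estimates in a few lines of numpy (the data array is hypothetical; ddof=1 selects the $1/(n-1)$ normalization used above):

```python
import numpy as np

# Hypothetical real-valued dataset
x = np.array([5.2, 6.8, 7.1, 6.3, 8.0, 6.9])

av = x.mean()            # empirical mean: (1/n) * sum of x_i
var = x.var(ddof=1)      # empirical variance: 1/(n-1) normalization
std = np.sqrt(var)       # sample standard deviation

print(f"mean {av:.2f}, variance {var:.2f}, std {std:.2f}")
```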
Temperature dataset
In January the temperature in Oxford is around 6.73 °C, give or take 2 °C
GDP dataset
Countries typically have a GDP per capita of about $16,500, give or take $25,300
Quantiles and percentiles
Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the ordered elements of a dataset $\{x_1, x_2, \ldots, x_n\}$
The $q$ quantile of the data, for $0 < q < 1$, is $x_{(\lceil q(n+1) \rceil)}$
The $100\,p$ quantile is known as the $p$ percentile
Quartiles and median
The 0.25 and 0.75 quantiles are the first and third quartiles
The 0.5 quantile is the empirical median
If $n$ is even, the empirical median is usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
The difference between the third and first quartiles is the interquartile range (IQR)
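A sketch of the order-statistic definition above (the dataset is hypothetical; the index is shifted by one because numpy arrays are 0-based, and clamped at $n$ as a safeguard):

```python
import numpy as np

def quantile(data, q):
    """q quantile of the data, defined as x_(ceil(q * (n + 1)))."""
    assert 0 < q < 1
    x = np.sort(data)                  # x_(1) <= ... <= x_(n)
    n = len(x)
    k = int(np.ceil(q * (n + 1)))      # 1-based order-statistic index
    return x[min(k, n) - 1]            # clamp to n, shift to 0-based

data = np.array([3.1, 7.4, 2.2, 9.8, 5.5, 4.0, 6.7])
q1, med, q3 = quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
print(f"median {med}, IQR {q3 - q1}")
```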
Quartiles and median
Temperature data (January):
- Sample mean: 6.73 °C
- Median: 6.80 °C
- Interquartile range: 2.9 °C

Temperature data (August):
- Sample mean: 21.3 °C
- Median: 21.2 °C
- Interquartile range: 2.1 °C
Quartiles and median
GDP per capita:
- Sample mean: $16,500 (71% of the countries have lower GDP per capita!)
- Median: $6,350
- Interquartile range: $18,200
- Five-number summary: $130, $1,960, $6,350, $20,100, $188,000
Boxplot of temperature data
[Boxplots of temperature in January, April, August and November; y-axis in degrees Celsius.]
Boxplot of GDP data
[Boxplot of GDP per capita; y-axis in thousands of dollars.]
Multidimensional data
Each dimension represents a feature
We can visualize two-dimensional data using scatter plots
Scatter plot
[Scatter plot of August temperatures against April temperatures.]
Scatter plot
[Scatter plot of maximum temperatures against minimum temperatures.]
Empirical covariance
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical covariance is defined as
$$\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, \ldots, x_n)\right)\left(y_i - \operatorname{av}(y_1, \ldots, y_n)\right)$$
Empirical correlation coefficient
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical correlation coefficient is defined as
$$\rho\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right)}{\operatorname{std}(x_1, \ldots, x_n)\,\operatorname{std}(y_1, \ldots, y_n)}$$
Cauchy-Schwarz inequality: for any $\vec{a}, \vec{b}$
$$-1 \le \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \le 1$$
Consequence: $-1 \le \rho\left((x_1, y_1), \ldots, (x_n, y_n)\right) \le 1$
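Both quantities in numpy (the paired data are hypothetical; ddof=1 matches the $1/(n-1)$ convention used above):

```python
import numpy as np

# Hypothetical paired observations (x_i, y_i)
x = np.array([8.0, 10.5, 12.1, 14.3, 16.0, 18.2])
y = np.array([16.5, 18.0, 21.2, 22.4, 24.9, 27.1])

xc, yc = x - x.mean(), y - y.mean()           # center both coordinates
cov = (xc * yc).sum() / (len(x) - 1)          # empirical covariance
rho = cov / (x.std(ddof=1) * y.std(ddof=1))   # correlation coefficient

print(f"cov {cov:.3f}, rho {rho:.3f}")        # rho always lies in [-1, 1]
```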
ρ = 0.269
[Scatter plot of August temperatures against April temperatures.]
ρ = 0.962
[Scatter plot of maximum temperatures against minimum temperatures.]
Empirical covariance matrix
Data: $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ ($d$ features). The empirical covariance matrix is defined as
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T$$
The $(i, j)$ entry, $1 \le i, j \le d$, is given by
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n)_{ij} = \begin{cases} \operatorname{var}\left((\vec{x}_1)_i, \ldots, (\vec{x}_n)_i\right) & \text{if } i = j, \\ \operatorname{cov}\left(\left((\vec{x}_1)_i, (\vec{x}_1)_j\right), \ldots, \left((\vec{x}_n)_i, (\vec{x}_n)_j\right)\right) & \text{if } i \neq j. \end{cases}$$
Empirical variance in a certain direction

Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest
$$\begin{aligned} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right) &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \vec{x}_i - \operatorname{av}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)\right)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\right)^2 \\ &= \vec{v}^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T\right) \vec{v} \\ &= \vec{v}^T\, \Sigma(\vec{x}_1, \ldots, \vec{x}_n)\, \vec{v} \end{aligned}$$
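A numerical check of this identity, as a sketch (the data matrix and direction are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples, d = 3 features
v = np.array([1.0, 2.0, -1.0])
v /= np.linalg.norm(v)                # unit-norm direction

# Left-hand side: empirical variance of the projections v^T x_i
lhs = (X @ v).var(ddof=1)

# Right-hand side: v^T Sigma v with the empirical covariance matrix
Sigma = np.cov(X, rowvar=False)       # uses 1/(n-1) by default
rhs = v @ Sigma @ v

print(np.isclose(lhs, rhs))           # True
```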
Eigendecomposition of the covariance matrix

The empirical covariance matrix is symmetric, so it admits an eigendecomposition
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_d \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix}^T$$
Eigendecomposition of the covariance matrix

For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ and corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \qquad \vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}$$
$$\lambda_k = \max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}, \qquad \vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}$$
Principal component analysis
Compute the eigenvectors of the empirical covariance matrix to determine the directions of maximum variation
Application: dimensionality reduction
Example: seeds from 3 varieties of wheat (Kama, Rosa and Canadian)
7 features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove
Aim: visualize in two dimensions
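A minimal PCA sketch along these lines (a hypothetical random data matrix stands in for the wheat-seeds features):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(210, 7))            # hypothetical: 210 seeds, 7 features

# Center the data and form the empirical covariance matrix
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / (len(X) - 1)

# Eigenvectors of Sigma, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: for symmetric matrices
U = eigvecs[:, np.argsort(eigvals)[::-1]]

# Project onto the first two principal components for 2-D visualization
projections = Xc @ U[:, :2]
print(projections.shape)                  # (210, 2)
```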
PCA dimensionality reduction
[Projections of the data onto the first and second principal components.]

PCA dimensionality reduction
[Projections of the data onto the (d−1)th and dth principal components.]
Descriptive statistics Statistical estimation
Statistical estimation
Data: Realization of an iid sequence
Aim: Estimate a parameter associated with the underlying distribution
Frequentist viewpoint: The parameter is deterministic
Estimator
A deterministic function of the data $x_1, x_2, \ldots, x_n$:
$$y_n := h(x_1, x_2, \ldots, x_n)$$
Estimator

Under the iid assumption
$$Y(n) := h\left(X(1), X(2), \ldots, X(n)\right)$$
- Does $Y(n)$ converge to $\gamma$ as $n \to \infty$?
- For finite $n$, what is the probability that the estimator approximates $\gamma$ up to a certain accuracy?
Sampling from a population
Population of $m$ individuals
We are interested in a feature associated with each person (cholesterol level, salary, who they are voting for, ...)
The feature has $k$ possible values $\{z_1, z_2, \ldots, z_k\}$
$m_j$ = number of people for whom the feature equals $z_j$
Sampling from a population
Data: Values of the feature for a subset of individuals $X$
If individuals are chosen uniformly at random with replacement
$$p_{X(i)}(z_j) = P\left(\text{the feature for the } i\text{th chosen person equals } z_j\right) = \frac{m_j}{m}, \qquad 1 \le j \le k$$
The sequence is iid
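A sketch of this sampling model (the population is hypothetical; sampling with replacement yields the iid sequence):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: feature value of each of m individuals
population = np.array([0] * 600 + [1] * 300 + [2] * 100)  # m = 1000, k = 3
m = len(population)

# Choose n individuals uniformly at random with replacement -> iid sequence
sample = rng.choice(population, size=10_000, replace=True)

# Empirical frequencies approximate the pmf p(z_j) = m_j / m
for z in (0, 1, 2):
    print(z, (sample == z).mean(), (population == z).sum() / m)
```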
Mean square error
The mean square error (MSE) of an estimator $Y$ that approximates a parameter $\gamma$ is
$$\operatorname{MSE}(Y) := \operatorname{E}\left((Y - \gamma)^2\right)$$
Bias-variance decomposition

$$\begin{aligned} \operatorname{MSE}(Y) &= \operatorname{E}\left((Y - \gamma)^2\right) \\ &= \operatorname{E}\left((Y - \operatorname{E}(Y) + \operatorname{E}(Y) - \gamma)^2\right) \\ &= \operatorname{E}\left((Y - \operatorname{E}(Y))^2\right) + \left(\operatorname{E}(Y) - \gamma\right)^2 + 2\left(\operatorname{E}(Y) - \gamma\right) \underbrace{\operatorname{E}\left(Y - \operatorname{E}(Y)\right)}_{=\, \operatorname{E}(Y) - \operatorname{E}(Y)\, =\, 0} \\ &= \underbrace{\operatorname{E}\left((Y - \operatorname{E}(Y))^2\right)}_{\text{variance}} + \underbrace{\left(\operatorname{E}(Y) - \gamma\right)^2}_{\text{squared bias}} \end{aligned}$$
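A Monte Carlo sketch checking the decomposition (hypothetical setup: the parameter is the variance of a Gaussian, estimated with the biased $1/n$ normalization so that both terms are nonzero):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 4.0                              # true parameter: sigma^2
n, trials = 10, 200_000

# Biased estimator: empirical variance with 1/n normalization
samples = rng.normal(scale=2.0, size=(trials, n))
Y = samples.var(axis=1, ddof=0)

mse = ((Y - gamma) ** 2).mean()
variance = Y.var()
bias_sq = (Y.mean() - gamma) ** 2

print(f"MSE {mse:.4f} ~= variance {variance:.4f} + bias^2 {bias_sq:.4f}")
```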
Unbiased estimator
An estimator $Y$ that approximates $\gamma$ is unbiased if and only if $\operatorname{E}(Y) = \gamma$
Empirical mean is unbiased

The empirical mean of an iid sequence $X$ with mean $\mu$
$$Y(n) := \frac{1}{n} \sum_{i=1}^{n} X(i)$$
is unbiased:
$$\operatorname{E}\left(Y(n)\right) = \operatorname{E}\left(\frac{1}{n} \sum_{i=1}^{n} X(i)\right) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{E}\left(X(i)\right) = \mu$$
The empirical variance is also unbiased
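A quick simulation sketch of both claims (hypothetical Gaussian data; averaging over many trials approximates the expectations):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 3.0, 4.0
n, trials = 8, 500_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

# E(empirical mean) ~ mu and E(empirical variance, 1/(n-1)) ~ sigma^2
print(samples.mean(axis=1).mean())         # ~ 3.0
print(samples.var(axis=1, ddof=1).mean())  # ~ 4.0
```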
Consistency
An estimator $Y(n) := h\left(X(1), X(2), \ldots, X(n)\right)$ that approximates $\gamma$ is consistent if it converges to $\gamma$ as $n \to \infty$: in mean square, with probability one, or in probability
Consistency
The empirical mean of an iid sequence $X$ with mean $\mu$
$$Y(n) := \frac{1}{n} \sum_{i=1}^{n} X(i)$$
is consistent by the law of large numbers if the variance is bounded
Estimating the average height
Population of 25,000 people
Goal: Estimate the average height from iid samples $X$
The average of the population is the mean of the iid sequence:
$$\operatorname{E}\left(X(i)\right) := \sum_{j=1}^{m} P(\text{person } j \text{ is chosen}) \cdot h_j = \frac{1}{m} \sum_{j=1}^{m} h_j = \operatorname{av}(h_1, \ldots, h_m)$$
where $h_j$ denotes the height of person $j$
Estimating the average height
[Histogram of the heights in the population; x-axis: height (inches).]

Estimating the average height
[Empirical mean as a function of n (log scale, n from 1 to 1000) compared with the true mean; y-axis: height (inches).]
Empirical median is consistent
The empirical median of an iid sequence X is consistent even if the mean is not well defined or the variance is unbounded
Proof
Aim: Show that for any $\epsilon > 0$
$$\lim_{n \to \infty} P\left(\left|Y(n) - \gamma\right| \ge \epsilon\right) = 0$$
We will prove that
$$\lim_{n \to \infty} P\left(Y(n) \ge \gamma + \epsilon\right) = 0$$
The same argument allows us to prove
$$\lim_{n \to \infty} P\left(Y(n) \le \gamma - \epsilon\right) = 0$$
Proof

Assuming $n$ is odd, $Y(n)$ equals the $(n+1)/2$-th ordered element
The event $Y(n) \ge \gamma + \epsilon$ implies that at least $(n+1)/2$ of the elements are larger than $\gamma + \epsilon$
For each individual $X(i)$, the probability that $X(i) > \gamma + \epsilon$ is
$$p := 1 - F_{X(i)}(\gamma + \epsilon) = \frac{1}{2} - \epsilon'$$
for some $\epsilon' > 0$, since $\gamma$ is the median and hence $F_{X(i)}(\gamma) = 1/2$
Distribution of the number of $X(i)$ above $\gamma + \epsilon$? Binomial $B_n$ with parameters $n$ and $p$
Proof

$$\begin{aligned} P\left(Y(n) \ge \gamma + \epsilon\right) &\le P\left(\tfrac{n+1}{2} \text{ or more samples} \ge \gamma + \epsilon\right) \\ &= P\left(B_n \ge \frac{n+1}{2}\right) \\ &= P\left(B_n - np \ge \frac{n+1}{2} - np\right) \\ &\le P\left(|B_n - np| \ge n\epsilon' + \frac{1}{2}\right) \\ &\le \frac{\operatorname{Var}(B_n)}{\left(n\epsilon' + \frac{1}{2}\right)^2} \qquad \text{by Chebyshev's inequality} \\ &= \frac{np(1-p)}{n^2\left(\epsilon' + \frac{1}{2n}\right)^2} \\ &= \frac{p(1-p)}{n\left(\epsilon' + \frac{1}{2n}\right)^2} \longrightarrow 0 \text{ as } n \to \infty \end{aligned}$$
Cauchy iid sequence: empirical mean
[Moving average of a Cauchy iid sequence plotted against the median of the iid sequence, for i up to 50, 500 and 5,000; the moving average keeps fluctuating and does not settle.]

Cauchy iid sequence: empirical median
[Moving median of the same Cauchy iid sequence plotted against the median of the iid sequence, for i up to 50, 500 and 5,000; the moving median stabilizes.]
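A simulation sketch of this comparison (assumes numpy; the Cauchy distribution has no mean, while its median is 0):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(5000)                  # iid Cauchy sequence

n = np.arange(1, len(x) + 1)
moving_average = np.cumsum(x) / n              # keeps jumping around
moving_median = [np.median(x[:i]) for i in n]  # settles near 0

print(moving_average[-5:])                     # erratic values
print(moving_median[-5:])                      # close to 0
```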
Consistency
The empirical variance is consistent if the fourth moment is bounded
The covariance matrix converges under similar conditions
PCA: n = 5
[Eigenvectors of the true covariance vs. the empirical covariance, for two independent draws of 5 samples.]

PCA: n = 20
[Same comparison with 20 samples.]

PCA: n = 100
[Same comparison with 100 samples.]
Confidence intervals
Aim: quantify the accuracy of an estimator for a fixed amount of data
A $1 - \alpha$ confidence interval $\mathcal{I}$ for a parameter $\gamma$ satisfies
$$P(\gamma \in \mathcal{I}) \ge 1 - \alpha, \qquad 0 < \alpha < 1$$
Confidence interval for the mean of an iid sequence
Let $X$ be an iid sequence with mean $\mu$ and variance $\sigma^2 \le b^2$ for some $b > 0$. For any $0 < \alpha < 1$
$$\mathcal{I}_n := \left[Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}}\right], \qquad Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right),$$
is a $1 - \alpha$ confidence interval for $\mu$
Proof

$$\begin{aligned} P\left(\mu \in \left[Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}}\right]\right) &= 1 - P\left(|Y_n - \mu| > \frac{b}{\sqrt{\alpha n}}\right) \\ &\ge 1 - \frac{\alpha\, n \operatorname{Var}(Y_n)}{b^2} \qquad \text{by Chebyshev's inequality} \\ &= 1 - \frac{\alpha \sigma^2}{b^2} \\ &\ge 1 - \alpha \end{aligned}$$
Bears in Yosemite
Aim: estimate the average weight of bears in Yosemite
A scientist captures 300 bears; their average weight is $Y := 200$ lbs
We need a bound on the variance. The maximum weight is 880 lbs, so for a randomly selected bear $X$
$$\sigma^2 = \operatorname{E}\left(X^2\right) - \operatorname{E}^2(X) \le \operatorname{E}\left(X^2\right) \le 880^2 \qquad \text{because } X \le 880 =: b$$
Bears in Yosemite
$$\left[Y - \frac{b}{\sqrt{\alpha n}},\; Y + \frac{b}{\sqrt{\alpha n}}\right] = [-27.2,\; 427.2]$$
is a 95% confidence interval for the average weight of the whole population
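The same interval computed numerically, as a sketch (values taken from the example):

```python
import numpy as np

Y, b, n, alpha = 200.0, 880.0, 300, 0.05   # values from the example

half_width = b / np.sqrt(alpha * n)        # Chebyshev-based half-width
print(Y - half_width, Y + half_width)      # approximately -27.2, 427.2
```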
Central limit theorem with empirical standard deviation
Let $X$ be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\operatorname{E}\left(\left(X(i)\right)^4\right)$ are bounded. The sequence
$$\frac{\sqrt{n}\left(\operatorname{av}\left(X(1), \ldots, X(n)\right) - \mu\right)}{\operatorname{std}\left(X(1), \ldots, X(n)\right)}$$
converges in distribution to a standard Gaussian random variable
Q function
For $x > 0$
$$Q(x) := \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \mathrm{d}u$$
If $U$ is a standard Gaussian random variable and $y < 0$
$$P(U < y) = Q(-y)$$
Approximate confidence interval for the mean
Let $X$ be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\operatorname{E}\left(\left(X(i)\right)^4\right)$ are bounded. For any $0 < \alpha < 1$
$$\mathcal{I}_n := \left[Y_n - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right]$$
$$Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right), \qquad S_n := \operatorname{std}\left(X(1), X(2), \ldots, X(n)\right)$$
is an approximate $1 - \alpha$ confidence interval for $\mu$, i.e.
$$P(\mu \in \mathcal{I}_n) \approx 1 - \alpha$$
Approximate confidence interval for the mean

$$\begin{aligned} P(\mu \in \mathcal{I}_n) &= 1 - P\left(Y_n > \mu + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right) - P\left(Y_n < \mu - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right) \\ &= 1 - P\left(\frac{\sqrt{n}\,(Y_n - \mu)}{S_n} > Q^{-1}\left(\frac{\alpha}{2}\right)\right) - P\left(\frac{\sqrt{n}\,(Y_n - \mu)}{S_n} < -Q^{-1}\left(\frac{\alpha}{2}\right)\right) \\ &\approx 1 - 2\, Q\left(Q^{-1}\left(\frac{\alpha}{2}\right)\right) \qquad \text{by the central limit theorem} \\ &= 1 - \alpha \end{aligned}$$
Bears in Yosemite
The empirical standard deviation is $S_n = 100$ lbs. Given that $Q(1.95) \approx 0.025$,
$$\left[Y - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right] \approx [188.8,\; 211.3]$$
is an approximate 95% confidence interval
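A sketch of this computation (scipy.stats.norm.isf plays the role of $Q^{-1}$; values taken from the example):

```python
import numpy as np
from scipy.stats import norm

Y, S, n, alpha = 200.0, 100.0, 300, 0.05   # values from the example

q_inv = norm.isf(alpha / 2)                # Q^{-1}(alpha/2), roughly 1.96
half_width = S / np.sqrt(n) * q_inv
print(Y - half_width, Y + half_width)      # approximately 188.7, 211.3
```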
Interpreting confidence intervals
"The average weight is between 188.8 and 211.3 lbs with probability 0.95." This reading is tempting but not quite right: under the frequentist viewpoint the average weight is deterministic, so it either lies in the computed interval or it does not
Interpreting confidence intervals
If we repeat the process of sampling the population and computing the confidence interval, then the true parameter will lie in the interval 95% of the time
Estimating the average height
We compute 40 confidence intervals of the form
$$\mathcal{I}_n := \left[Y_n - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right]$$
$$Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right), \qquad S_n := \operatorname{std}\left(X(1), X(2), \ldots, X(n)\right)$$
for $1 - \alpha = 0.95$ and different values of $n$
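A simulation sketch of this experiment (hypothetical exponential population; roughly 95% of the 40 intervals should contain the true mean):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, n, alpha = 1.0, 200, 0.05              # exponential(1) has mean 1
q_inv = norm.isf(alpha / 2)

covered = 0
for _ in range(40):                        # 40 independent intervals
    x = rng.exponential(mu, size=n)
    half = x.std(ddof=1) / np.sqrt(n) * q_inv
    covered += (x.mean() - half <= mu <= x.mean() + half)

print(f"{covered}/40 intervals contain the true mean")  # typically ~38/40
```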