Descriptive Statistics
DS-GA 1002: Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17
Carlos Fernandez-Granda
Descriptive statistics
◮ Techniques to visualize and summarize data
◮ Can often be interpreted within a probabilistic framework
◮ Often the probabilistic assumptions do not hold, but the techniques are still useful
◮ We describe them from a deterministic point of view
Outline
◮ Histogram
◮ Empirical mean and variance
◮ Order statistics
◮ Empirical covariance
◮ Empirical covariance matrix
Histogram
◮ Technique to visualize one-dimensional data
◮ Bin the range of the data, then count the number of instances in each bin
◮ The width of the bins can be adjusted to yield higher or lower resolution
◮ If the data are i.i.d., the histogram is an approximation to their pmf or pdf
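A minimal sketch of building such a histogram with NumPy/Matplotlib; the data and the bin count are illustrative, not the course's dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative one-dimensional data: 1,000 i.i.d. normal samples
rng = np.random.default_rng(0)
data = rng.normal(loc=20.0, scale=2.0, size=1000)

# Bin the range of the data and count instances per bin;
# density=True rescales the counts so the histogram approximates a pdf
plt.hist(data, bins=30, density=True, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Estimated density")
plt.show()
```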
Temperature in Oxford
[Figure: histograms of temperatures in Oxford in January and August; horizontal axis in degrees (Celsius).]
GDP per capita of different countries
[Figure: histogram of GDP per capita across countries; horizontal axis in thousands of dollars.]
Empirical mean and variance
Empirical mean
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical mean is defined as

$$\operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i$$

◮ Temperature data: 6.73 °C in January and 21.3 °C in August
◮ GDP per capita: $16 500
Empirical mean
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data. The empirical mean is defined as

$$\operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i$$
Centering
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data. To center the data set we:

1. Compute the empirical mean
2. Subtract it from each vector:

$$\vec{y}_i := \vec{x}_i - \operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n), \qquad 1 \leq i \leq n$$

$\vec{y}_1, \ldots, \vec{y}_n$ are centered at the origin.
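A minimal centering sketch in NumPy, on synthetic data (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(100, 3))  # n = 100 points, d = 3 features

mean = X.mean(axis=0)   # empirical mean, one entry per feature
Y = X - mean            # subtract it from each vector

# The centered data have (numerically) zero empirical mean
print(Y.mean(axis=0))   # ~ [0, 0, 0]
```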
Centering
[Figure: the same data set, uncentered (left) and centered (right).]
Empirical variance
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical variance is defined as

$$\operatorname{var}(x_1, x_2, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, x_2, \ldots, x_n)\right)^2$$

The empirical standard deviation is the square root of the empirical variance.

◮ Temperature data: 1.99 °C in January and 1.73 °C in August
◮ GDP per capita: $25 300
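In NumPy these quantities are one-liners; note ddof=1 to match the 1/(n − 1) normalization above. The data below are made up for illustration:

```python
import numpy as np

x = np.array([4.2, 6.8, 7.1, 5.9, 8.3, 6.5])  # illustrative temperatures

av = x.mean()          # empirical mean
var = x.var(ddof=1)    # empirical variance, 1/(n-1) normalization
std = x.std(ddof=1)    # empirical standard deviation

# Matches the definition above
assert np.isclose(var, ((x - av) ** 2).sum() / (len(x) - 1))
```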
Order statistics
Temperature dataset
In January the temperature in Oxford is around 6.73 °C, give or take 2 °C.
GDP dataset
Countries typically have a GDP per capita of about $16 500, give or take $25 300.
Quantiles and percentiles
Let $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$ denote the ordered elements of a data set $\{x_1, x_2, \ldots, x_n\}$.

The q quantile of the data, for $0 < q < 1$, is $x_{([q(n+1)])}$, where $[q(n+1)]$ is the closest integer to $q(n+1)$.

The q quantile is also known as the 100q-th percentile.
Quartiles and median
◮ The 0.25 and 0.75 quantiles are the first and third quartiles
◮ The 0.5 quantile is the empirical median
◮ If n is even, the empirical median is usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
◮ The difference between the third and first quartiles is the interquartile range (IQR)
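A sketch of this quantile rule in code. NumPy's np.quantile interpolates by default, so the order-statistic rule from the previous slide is implemented directly here; tie-breaking of .5 to the nearest even integer follows Python's round, which is an assumption on my part:

```python
import numpy as np

def quantile(data, q):
    """The q quantile: the [q(n+1)]-th order statistic, where [.]
    rounds to the closest integer (clipped to the valid range 1..n)."""
    x = np.sort(data)
    n = len(x)
    k = int(round(q * (n + 1)))
    k = min(max(k, 1), n)   # guard against q(n+1) falling outside 1..n
    return x[k - 1]         # order statistics are 1-indexed

x = np.array([2.1, 4.5, 3.3, 6.7, 5.0, 7.8, 4.9])
median = quantile(x, 0.5)                       # -> 4.9
iqr = quantile(x, 0.75) - quantile(x, 0.25)     # third minus first quartile
```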
Quartiles and median
◮ Temperature data (January): sample mean 6.73 °C, median 6.80 °C, interquartile range 2.9 °C
◮ Temperature data (August): sample mean 21.3 °C, median 21.2 °C, interquartile range 2.1 °C
Quartiles and median
◮ GDP per capita:
  ◮ Sample mean: $16 500 (71% of the countries have lower GDP per capita!)
  ◮ Median: $6 350
  ◮ Interquartile range: $18 200
  ◮ Five-number summary (minimum, first quartile, median, third quartile, maximum): $130, $1 960, $6 350, $20 100, $188 000
Boxplot of temperature data
[Figure: boxplots of Oxford temperatures for January, April, August and November; vertical axis in degrees (Celsius).]
Boxplot of GDP data
[Figure: boxplot of GDP per capita; vertical axis in thousands of dollars.]
Empirical covariance
Multidimensional data
◮ Each dimension represents a feature
◮ We can visualize two-dimensional data using scatter plots
Scatter plot
[Figure: scatter plot of April temperatures (horizontal axis) against August temperatures (vertical axis).]
Scatter plot
[Figure: scatter plot of minimum temperatures (horizontal axis) against maximum temperatures (vertical axis).]
Empirical covariance
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical covariance is defined as

$$\operatorname{cov}((x_1, y_1), \ldots, (x_n, y_n)) := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \operatorname{av}(x_1, \ldots, x_n))(y_i - \operatorname{av}(y_1, \ldots, y_n))$$
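A sketch checking the definition against np.cov on synthetic data (np.cov uses the same 1/(n − 1) normalization by default):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)   # correlated with x

# Empirical covariance, straight from the definition
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# np.cov returns the 2x2 empirical covariance matrix;
# the (0, 1) entry is the empirical covariance of x and y
assert np.isclose(cov, np.cov(x, y)[0, 1])
```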
Empirical correlation coefficient
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical correlation coefficient is defined as

$$\rho((x_1, y_1), \ldots, (x_n, y_n)) := \frac{\operatorname{cov}((x_1, y_1), \ldots, (x_n, y_n))}{\operatorname{std}(x_1, \ldots, x_n)\,\operatorname{std}(y_1, \ldots, y_n)}$$

Cauchy-Schwarz inequality: for any vectors $\vec{a}$, $\vec{b}$,

$$-1 \leq \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \leq 1$$

Consequence (applying the inequality to the vectors of centered data values):

$$-1 \leq \rho((x_1, y_1), \ldots, (x_n, y_n)) \leq 1$$
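A sketch of the correlation coefficient on the same kind of synthetic data, checked against np.corrcoef and the bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)

# Correlation coefficient: covariance normalized by the standard deviations
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
rho = cov / (x.std(ddof=1) * y.std(ddof=1))

# Matches NumPy's correlation coefficient and lies in [-1, 1]
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
assert -1.0 <= rho <= 1.0
```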
[Figure: scatter plot of April against August temperatures; ρ = 0.269.]
[Figure: scatter plot of minimum against maximum temperatures; ρ = 0.962.]
Empirical covariance matrix
Data: $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ (d features). The empirical covariance matrix is defined as

$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))^T$$

The $(i, j)$ entry, $1 \leq i, j \leq d$, is given by

$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n)_{ij} = \begin{cases} \operatorname{var}((\vec{x}_1)_i, \ldots, (\vec{x}_n)_i) & \text{if } i = j, \\ \operatorname{cov}\left(((\vec{x}_1)_i, (\vec{x}_1)_j), \ldots, ((\vec{x}_n)_i, (\vec{x}_n)_j)\right) & \text{if } i \neq j. \end{cases}$$
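A sketch computing the empirical covariance matrix on synthetic data and checking it against np.cov and the entry-wise characterization above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # n = 500 observations, d = 3 features

Xc = X - X.mean(axis=0)             # center the data
Sigma = Xc.T @ Xc / (len(X) - 1)    # d x d empirical covariance matrix

# np.cov expects features in rows, hence rowvar=False for this layout
assert np.allclose(Sigma, np.cov(X, rowvar=False))
# Diagonal entries are the per-feature empirical variances
assert np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1))
```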
Empirical variance in a certain direction

Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest. Then

$$\begin{aligned}
\operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right) &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \vec{x}_i - \operatorname{av}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)\right)^2 \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\right)^2 \\
&= \vec{v}^T \left(\frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))^T\right) \vec{v} \\
&= \vec{v}^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, \vec{v}
\end{aligned}$$
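A numerical sanity check of this identity on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Sigma = np.cov(X, rowvar=False)     # empirical covariance matrix

v = np.array([1.0, 1.0])
v /= np.linalg.norm(v)              # unit-norm direction of interest

# Empirical variance of the projections equals the quadratic form v^T Sigma v
proj = X @ v
assert np.isclose(proj.var(ddof=1), v @ Sigma @ v)
```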
Eigendecomposition of the covariance matrix
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}^T$$
Eigendecomposition of the covariance matrix
For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ and corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$:

$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \qquad \vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}$$

$$\lambda_k = \max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}, \qquad \vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}$$
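A sketch using np.linalg.eigh (intended for symmetric matrices) that checks, by random search, that no unit-norm direction achieves a larger quadratic form than $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Sigma = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; flip so lambda_1 >= lambda_2
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

u1 = eigvecs[:, 0]                  # direction of maximum variation
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)          # random unit-norm direction
    assert v @ Sigma @ v <= eigvals[0] + 1e-9
```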
Principal component analysis
Compute the eigenvectors of the empirical covariance matrix to determine the directions of maximum variation.
Example: 2D data

[Figures: three scatter plots of 2D data with the principal directions $\vec{u}_1$, $\vec{u}_2$ overlaid:
$\sigma_1/\sqrt{n} = 0.705$, $\sigma_2/\sqrt{n} = 0.690$;
$\sigma_1/\sqrt{n} = 0.9832$, $\sigma_2/\sqrt{n} = 0.3559$;
$\sigma_1/\sqrt{n} = 1.3490$, $\sigma_2/\sqrt{n} = 0.1438$.]
Centering is important!

[Figures: PCA on uncentered data, $\sigma_1/\sqrt{n} = 5.077$, $\sigma_2/\sqrt{n} = 0.889$, where $\vec{u}_1$ points toward the empirical mean rather than along the direction of maximum variation; PCA on the same data after centering, $\sigma_1/\sqrt{n} = 1.261$, $\sigma_2/\sqrt{n} = 0.139$.]
Dimensionality reduction
◮ Projection of the data onto a lower-dimensional space
◮ Applications: visualization / computational efficiency / denoising
◮ Example: seeds from 3 varieties of wheat (Kama, Rosa and Canadian)
◮ 7 features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove
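A sketch of the projection step. A synthetic stand-in is used for the 7-feature seeds data, since loading the actual dataset is outside the scope of these slides:

```python
import numpy as np

# Synthetic stand-in for the seeds data: n x d matrix, d = 7 features
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 7))

# Center, then compute principal directions from the covariance matrix
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing eigenvalue order

# Dimensionality reduction: project onto the first two principal components
Z = Xc @ eigvecs[:, :2]   # n x 2 matrix, ready for a scatter plot
```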
PCA dimensionality reduction

[Figure: the seeds data projected onto the first principal component (horizontal axis) and the second principal component (vertical axis).]
PCA dimensionality reduction

[Figure: the seeds data projected onto the (d−1)th principal component (horizontal axis) and the dth principal component (vertical axis).]
Whitening
◮ Preprocessing procedure
◮ Linear transformation to eliminate skew in the data
◮ Enhances nonlinear structure
◮ After whitening, the data are uncorrelated
Whitening
Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional centered data with a full-rank covariance matrix. To whiten the data we:

1. Compute the eigendecomposition of the empirical covariance matrix:
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T$$
2. For $i = 1, \ldots, n$ set
$$\vec{y}_i := \sqrt{\Lambda}^{-1} U^T \vec{x}_i, \qquad \sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\ & & \ddots & \\ & & & \sqrt{\lambda_n} \end{bmatrix}$$
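A whitening sketch on synthetic data, following the two steps above and checking that the covariance of the whitened data is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Xc = X - X.mean(axis=0)             # whitening assumes centered data

Sigma = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(Sigma)  # Sigma = U diag(eigvals) U^T

# y_i := sqrt(Lambda)^{-1} U^T x_i, applied to every data point at once
Y = Xc @ U / np.sqrt(eigvals)

# After whitening, the empirical covariance matrix is the identity
assert np.allclose(np.cov(Y, rowvar=False), np.eye(2), atol=1e-8)
```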
Whitening

Since the data are centered, so are the $\vec{y}_i$, and the empirical covariance matrix of the whitened data is

$$\begin{aligned}
\Sigma(\vec{y}_1, \ldots, \vec{y}_n) &:= \frac{1}{n-1} \sum_{i=1}^{n} \vec{y}_i \vec{y}_i^T \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right) \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right)^T \\
&= \sqrt{\Lambda}^{-1} U^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \vec{x}_i \vec{x}_i^T\right) U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T U \sqrt{\Lambda} \sqrt{\Lambda} U^T U \sqrt{\Lambda}^{-1} \\
&= I
\end{aligned}$$

where the last two steps use $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = U \sqrt{\Lambda} \sqrt{\Lambda} U^T$ and $U^T U = I$.