Descriptive Statistics
DS-GA 1002: Probability and Statistics for Data Science
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17
Carlos Fernandez-Granda
Descriptive statistics
◮ Techniques to visualize and summarize data
◮ Can often be interpreted within a probabilistic framework
◮ Often the probabilistic assumptions do not hold, but the techniques are still useful
◮ We describe them from a deterministic point of view
Outline
◮ Histogram
◮ Empirical mean and variance
◮ Order statistics
◮ Empirical covariance
◮ Empirical covariance matrix
Histogram
◮ Technique to visualize one-dimensional data
◮ Bin the range of the data, then count the number of instances in each bin
◮ The width of the bins can be adjusted to yield higher or lower resolution
◮ If the data are i.i.d., the histogram is an approximation to their pmf or pdf
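A minimal sketch of building such a histogram with NumPy/Matplotlib; the data and the bin count are illustrative, not the course's dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative one-dimensional data: 1,000 i.i.d. normal samples
rng = np.random.default_rng(0)
data = rng.normal(loc=20.0, scale=2.0, size=1000)

# Bin the range of the data and count instances per bin;
# density=True rescales the counts so the histogram approximates a pdf
plt.hist(data, bins=30, density=True, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Estimated density")
plt.show()
```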
Temperature in Oxford
[Figure: histograms of temperatures in Oxford in January and August; horizontal axis in degrees (Celsius).]
GDP per capita of different countries
[Figure: histogram of GDP per capita across countries; horizontal axis in thousands of dollars.]
Empirical mean and variance
Empirical mean
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical mean is defined as

$$\operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i$$

◮ Temperature data: 6.73 °C in January and 21.3 °C in August
◮ GDP per capita: $16 500
Empirical mean
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data. The empirical mean is defined as

$$\operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i$$
Centering
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data. To center the data set we:

1. Compute the empirical mean
2. Subtract it from each vector:

$$\vec{y}_i := \vec{x}_i - \operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n), \qquad 1 \leq i \leq n$$

$\vec{y}_1, \ldots, \vec{y}_n$ are centered at the origin.
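A minimal centering sketch in NumPy, on synthetic data (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(100, 3))  # n = 100 points, d = 3 features

mean = X.mean(axis=0)   # empirical mean, one entry per feature
Y = X - mean            # subtract it from each vector

# The centered data have (numerically) zero empirical mean
print(Y.mean(axis=0))   # ~ [0, 0, 0]
```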
Centering
[Figure: the same data set, uncentered (left) and centered (right).]
Empirical variance
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical variance is defined as

$$\operatorname{var}(x_1, x_2, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, x_2, \ldots, x_n)\right)^2$$

The empirical standard deviation is the square root of the empirical variance.

◮ Temperature data: 1.99 °C in January and 1.73 °C in August
◮ GDP per capita: $25 300
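In NumPy these quantities are one-liners; note ddof=1 to match the 1/(n − 1) normalization above. The data below are made up for illustration:

```python
import numpy as np

x = np.array([4.2, 6.8, 7.1, 5.9, 8.3, 6.5])  # illustrative temperatures

av = x.mean()          # empirical mean
var = x.var(ddof=1)    # empirical variance, 1/(n-1) normalization
std = x.std(ddof=1)    # empirical standard deviation

# Matches the definition above
assert np.isclose(var, ((x - av) ** 2).sum() / (len(x) - 1))
```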
Order statistics
Temperature dataset
In January the temperature in Oxford is around 6.73 °C, give or take 2 °C.
GDP dataset
Countries typically have a GDP per capita of about $16 500, give or take $25 300.
Quantiles and percentiles
Let $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$ denote the ordered elements of a data set $\{x_1, x_2, \ldots, x_n\}$.

The q quantile of the data, for $0 < q < 1$, is $x_{([q(n+1)])}$, where $[q(n+1)]$ is the closest integer to $q(n+1)$.

The q quantile is also known as the 100q-th percentile.
Quartiles and median
◮ The 0.25 and 0.75 quantiles are the first and third quartiles
◮ The 0.5 quantile is the empirical median
◮ If n is even, the empirical median is usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
◮ The difference between the third and first quartiles is the interquartile range (IQR)
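A sketch of this quantile rule in code. NumPy's np.quantile interpolates by default, so the order-statistic rule from the previous slide is implemented directly here; tie-breaking of .5 to the nearest even integer follows Python's round, which is an assumption on my part:

```python
import numpy as np

def quantile(data, q):
    """The q quantile: the [q(n+1)]-th order statistic, where [.]
    rounds to the closest integer (clipped to the valid range 1..n)."""
    x = np.sort(data)
    n = len(x)
    k = int(round(q * (n + 1)))
    k = min(max(k, 1), n)   # guard against q(n+1) falling outside 1..n
    return x[k - 1]         # order statistics are 1-indexed

x = np.array([2.1, 4.5, 3.3, 6.7, 5.0, 7.8, 4.9])
median = quantile(x, 0.5)                       # -> 4.9
iqr = quantile(x, 0.75) - quantile(x, 0.25)     # third minus first quartile
```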
Quartiles and median
◮ Temperature data (January): sample mean 6.73 °C, median 6.80 °C, interquartile range 2.9 °C
◮ Temperature data (August): sample mean 21.3 °C, median 21.2 °C, interquartile range 2.1 °C
Quartiles and median
◮ GDP per capita:
  ◮ Sample mean: $16 500 (71% of the countries have lower GDP per capita!)
  ◮ Median: $6 350
  ◮ Interquartile range: $18 200
  ◮ Five-number summary (minimum, first quartile, median, third quartile, maximum): $130, $1 960, $6 350, $20 100, $188 000
Boxplot of temperature data
[Figure: boxplots of Oxford temperatures for January, April, August and November; vertical axis in degrees (Celsius).]
Boxplot of GDP data
[Figure: boxplot of GDP per capita; vertical axis in thousands of dollars.]
Empirical covariance
Multidimensional data
◮ Each dimension represents a feature
◮ We can visualize two-dimensional data using scatter plots
Scatter plot
[Figure: scatter plot of April temperatures (horizontal axis) against August temperatures (vertical axis).]
Scatter plot
[Figure: scatter plot of minimum temperatures (horizontal axis) against maximum temperatures (vertical axis).]
Empirical covariance
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical covariance is defined as

$$\operatorname{cov}((x_1, y_1), \ldots, (x_n, y_n)) := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \operatorname{av}(x_1, \ldots, x_n))(y_i - \operatorname{av}(y_1, \ldots, y_n))$$
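A sketch checking the definition against np.cov on synthetic data (np.cov uses the same 1/(n − 1) normalization by default):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)   # correlated with x

# Empirical covariance, straight from the definition
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# np.cov returns the 2x2 empirical covariance matrix;
# the (0, 1) entry is the empirical covariance of x and y
assert np.isclose(cov, np.cov(x, y)[0, 1])
```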
Empirical correlation coefficient
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical correlation coefficient is defined as

$$\rho((x_1, y_1), \ldots, (x_n, y_n)) := \frac{\operatorname{cov}((x_1, y_1), \ldots, (x_n, y_n))}{\operatorname{std}(x_1, \ldots, x_n)\,\operatorname{std}(y_1, \ldots, y_n)}$$

Cauchy-Schwarz inequality: for any vectors $\vec{a}$, $\vec{b}$,

$$-1 \leq \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \leq 1$$

Consequence (applying the inequality to the vectors of centered data values):

$$-1 \leq \rho((x_1, y_1), \ldots, (x_n, y_n)) \leq 1$$
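A sketch of the correlation coefficient on the same kind of synthetic data, checked against np.corrcoef and the bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)

# Correlation coefficient: covariance normalized by the standard deviations
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
rho = cov / (x.std(ddof=1) * y.std(ddof=1))

# Matches NumPy's correlation coefficient and lies in [-1, 1]
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
assert -1.0 <= rho <= 1.0
```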
[Figure: scatter plot of April against August temperatures; ρ = 0.269.]
[Figure: scatter plot of minimum against maximum temperatures; ρ = 0.962.]
Empirical covariance matrix
Data: $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ (d features). The empirical covariance matrix is defined as

$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))^T$$

The $(i, j)$ entry, $1 \leq i, j \leq d$, is given by

$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n)_{ij} = \begin{cases} \operatorname{var}((\vec{x}_1)_i, \ldots, (\vec{x}_n)_i) & \text{if } i = j, \\ \operatorname{cov}\left(((\vec{x}_1)_i, (\vec{x}_1)_j), \ldots, ((\vec{x}_n)_i, (\vec{x}_n)_j)\right) & \text{if } i \neq j. \end{cases}$$
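A sketch computing the empirical covariance matrix on synthetic data and checking it against np.cov and the entry-wise characterization above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # n = 500 observations, d = 3 features

Xc = X - X.mean(axis=0)             # center the data
Sigma = Xc.T @ Xc / (len(X) - 1)    # d x d empirical covariance matrix

# np.cov expects features in rows, hence rowvar=False for this layout
assert np.allclose(Sigma, np.cov(X, rowvar=False))
# Diagonal entries are the per-feature empirical variances
assert np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1))
```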
Empirical variance in a certain direction

Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest. Then

$$\begin{aligned}
\operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right) &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \vec{x}_i - \operatorname{av}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)\right)^2 \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\right)^2 \\
&= \vec{v}^T \left(\frac{1}{n-1} \sum_{i=1}^{n} (\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n))^T\right) \vec{v} \\
&= \vec{v}^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, \vec{v}
\end{aligned}$$
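A numerical sanity check of this identity on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Sigma = np.cov(X, rowvar=False)     # empirical covariance matrix

v = np.array([1.0, 1.0])
v /= np.linalg.norm(v)              # unit-norm direction of interest

# Empirical variance of the projections equals the quadratic form v^T Sigma v
proj = X @ v
assert np.isclose(proj.var(ddof=1), v @ Sigma @ v)
```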
Eigendecomposition of the covariance matrix
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}^T$$
Eigendecomposition of the covariance matrix
For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ and corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$:

$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \qquad \vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}$$

$$\lambda_k = \max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}, \qquad \vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}$$
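A sketch using np.linalg.eigh (intended for symmetric matrices) that checks, by random search, that no unit-norm direction achieves a larger quadratic form than $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Sigma = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; flip so lambda_1 >= lambda_2
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

u1 = eigvecs[:, 0]                  # direction of maximum variation
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)          # random unit-norm direction
    assert v @ Sigma @ v <= eigvals[0] + 1e-9
```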
Principal component analysis
Compute the eigenvectors of the empirical covariance matrix to determine the directions of maximum variation.
Example: 2D data

[Figures: three scatter plots of 2D data with the principal directions $\vec{u}_1$, $\vec{u}_2$ overlaid:
$\sigma_1/\sqrt{n} = 0.705$, $\sigma_2/\sqrt{n} = 0.690$;
$\sigma_1/\sqrt{n} = 0.9832$, $\sigma_2/\sqrt{n} = 0.3559$;
$\sigma_1/\sqrt{n} = 1.3490$, $\sigma_2/\sqrt{n} = 0.1438$.]
Centering is important!

[Figures: PCA on uncentered data, $\sigma_1/\sqrt{n} = 5.077$, $\sigma_2/\sqrt{n} = 0.889$, where $\vec{u}_1$ points toward the empirical mean rather than along the direction of maximum variation; PCA on the same data after centering, $\sigma_1/\sqrt{n} = 1.261$, $\sigma_2/\sqrt{n} = 0.139$.]
Dimensionality reduction
◮ Projection of the data onto a lower-dimensional space
◮ Applications: visualization / computational efficiency / denoising
◮ Example: seeds from 3 varieties of wheat (Kama, Rosa and Canadian)
◮ 7 features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove
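A sketch of the projection step. A synthetic stand-in is used for the 7-feature seeds data, since loading the actual dataset is outside the scope of these slides:

```python
import numpy as np

# Synthetic stand-in for the seeds data: n x d matrix, d = 7 features
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 7))

# Center, then compute principal directions from the covariance matrix
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # decreasing eigenvalue order

# Dimensionality reduction: project onto the first two principal components
Z = Xc @ eigvecs[:, :2]   # n x 2 matrix, ready for a scatter plot
```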
PCA dimensionality reduction

[Figure: the seeds data projected onto the first principal component (horizontal axis) and the second principal component (vertical axis).]
PCA dimensionality reduction

[Figure: the seeds data projected onto the (d−1)th principal component (horizontal axis) and the dth principal component (vertical axis).]
Whitening
◮ Preprocessing procedure
◮ Linear transformation to eliminate skew in the data
◮ Enhances nonlinear structure
◮ After whitening, the data are uncorrelated
Whitening
Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional centered data with a full-rank covariance matrix. To whiten the data we:

1. Compute the eigendecomposition of the empirical covariance matrix:
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T$$
2. For $i = 1, \ldots, n$ set
$$\vec{y}_i := \sqrt{\Lambda}^{-1} U^T \vec{x}_i, \qquad \sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\ & & \ddots & \\ & & & \sqrt{\lambda_n} \end{bmatrix}$$
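A whitening sketch on synthetic data, following the two steps above and checking that the covariance of the whitened data is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
Xc = X - X.mean(axis=0)             # whitening assumes centered data

Sigma = np.cov(Xc, rowvar=False)
eigvals, U = np.linalg.eigh(Sigma)  # Sigma = U diag(eigvals) U^T

# y_i := sqrt(Lambda)^{-1} U^T x_i, applied to every data point at once
Y = Xc @ U / np.sqrt(eigvals)

# After whitening, the empirical covariance matrix is the identity
assert np.allclose(np.cov(Y, rowvar=False), np.eye(2), atol=1e-8)
```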
Whitening

Since the data are centered, so are the $\vec{y}_i$, and the empirical covariance matrix of the whitened data is

$$\begin{aligned}
\Sigma(\vec{y}_1, \ldots, \vec{y}_n) &:= \frac{1}{n-1} \sum_{i=1}^{n} \vec{y}_i \vec{y}_i^T \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right) \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right)^T \\
&= \sqrt{\Lambda}^{-1} U^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \vec{x}_i \vec{x}_i^T\right) U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, U \sqrt{\Lambda}^{-1} \\
&= \sqrt{\Lambda}^{-1} U^T U \sqrt{\Lambda} \sqrt{\Lambda} U^T U \sqrt{\Lambda}^{-1} \\
&= I
\end{aligned}$$

where the last two steps use $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = U \sqrt{\Lambda} \sqrt{\Lambda} U^T$ and $U^T U = I$.