Statistical Data Analysis
DS-GA 1002: Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall16
Carlos Fernandez-Granda
Descriptive statistics Statistical estimation
Histogram
- Technique to visualize one-dimensional data
- Bin the range of the data, then count the number of instances in each bin
- The width of the bins can be adjusted to yield higher or lower resolution
- If the data are iid, the histogram approximates their pmf or pdf
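A minimal sketch of the binning procedure in Python (the dataset and bin count are hypothetical; numpy and matplotlib are assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical one-dimensional dataset: iid Gaussian draws
data = np.random.default_rng(0).normal(loc=20.0, scale=2.0, size=500)

# Bin the range of the data and count instances per bin;
# density=True rescales counts so the histogram approximates the pdf
counts, edges, _ = plt.hist(data, bins=15, density=True, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Estimated density")
plt.show()
```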
Temperature in Oxford
[Histograms of daily temperatures in January and August; x-axis in degrees Celsius.]
GDP per capita
[Histogram of GDP per capita across countries; x-axis in thousands of dollars.]
Empirical mean
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical mean is defined as
$$\operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i$$
Temperature data: 6.73 °C in January and 21.3 °C in August
GDP per capita: $16,500
Empirical mean
Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of $d$-dimensional real-valued data. The empirical mean is defined as
$$\operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i$$
Centering a dataset by subtracting its empirical mean is a common preprocessing step
Empirical variance
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The empirical variance is defined as
$$\operatorname{var}(x_1, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, \ldots, x_n)\right)^2$$
The sample standard deviation is the square root of the empirical variance
Temperature data: 1.99 °C in January and 1.73 °C in August
GDP per capita: $25,300
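Both estimates in a few lines of numpy (the data array is hypothetical; ddof=1 selects the $1/(n-1)$ normalization used above):

```python
import numpy as np

# Hypothetical real-valued dataset
x = np.array([5.2, 6.8, 7.1, 6.3, 8.0, 6.9])

av = x.mean()            # empirical mean: (1/n) * sum of x_i
var = x.var(ddof=1)      # empirical variance: 1/(n-1) normalization
std = np.sqrt(var)       # sample standard deviation

print(f"mean {av:.2f}, variance {var:.2f}, std {std:.2f}")
```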
Temperature dataset
In January the temperature in Oxford is around 6.73 °C, give or take 2 °C
GDP dataset
Countries typically have a GDP per capita of about $16,500, give or take $25,300
Quantiles and percentiles
Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the ordered elements of a dataset $\{x_1, x_2, \ldots, x_n\}$
The $q$ quantile of the data, for $0 < q < 1$, is $x_{(\lceil q(n+1) \rceil)}$
The $100\,p$ quantile is known as the $p$ percentile
Quartiles and median
The 0.25 and 0.75 quantiles are the first and third quartiles
The 0.5 quantile is the empirical median
If $n$ is even, the empirical median is usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
The difference between the third and first quartiles is the interquartile range (IQR)
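A sketch of the order-statistic definition above (the dataset is hypothetical; the index is shifted by one because numpy arrays are 0-based, and clamped at $n$ as a safeguard):

```python
import numpy as np

def quantile(data, q):
    """q quantile of the data, defined as x_(ceil(q * (n + 1)))."""
    assert 0 < q < 1
    x = np.sort(data)                  # x_(1) <= ... <= x_(n)
    n = len(x)
    k = int(np.ceil(q * (n + 1)))      # 1-based order-statistic index
    return x[min(k, n) - 1]            # clamp to n, shift to 0-based

data = np.array([3.1, 7.4, 2.2, 9.8, 5.5, 4.0, 6.7])
q1, med, q3 = quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
print(f"median {med}, IQR {q3 - q1}")
```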
Quartiles and median
Temperature data (January):
- Sample mean: 6.73 °C
- Median: 6.80 °C
- Interquartile range: 2.9 °C

Temperature data (August):
- Sample mean: 21.3 °C
- Median: 21.2 °C
- Interquartile range: 2.1 °C
Quartiles and median
GDP per capita:
- Sample mean: $16,500 (71% of the countries have lower GDP per capita!)
- Median: $6,350
- Interquartile range: $18,200
- Five-number summary: $130, $1,960, $6,350, $20,100, $188,000
Boxplot of temperature data
[Boxplots of temperature in January, April, August and November; y-axis in degrees Celsius.]
Boxplot of GDP data
[Boxplot of GDP per capita; y-axis in thousands of dollars.]
Multidimensional data
Each dimension represents a feature
We can visualize two-dimensional data using scatter plots
Scatter plot
[Scatter plot of August temperatures against April temperatures.]
Scatter plot
[Scatter plot of maximum temperatures against minimum temperatures.]
Empirical covariance
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical covariance is defined as
$$\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, \ldots, x_n)\right)\left(y_i - \operatorname{av}(y_1, \ldots, y_n)\right)$$
Empirical correlation coefficient
Data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The empirical correlation coefficient is defined as
$$\rho\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right)}{\operatorname{std}(x_1, \ldots, x_n)\,\operatorname{std}(y_1, \ldots, y_n)}$$
Cauchy-Schwarz inequality: for any $\vec{a}, \vec{b}$
$$-1 \le \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \le 1$$
Consequence: $-1 \le \rho\left((x_1, y_1), \ldots, (x_n, y_n)\right) \le 1$
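Both quantities in numpy (the paired data are hypothetical; ddof=1 matches the $1/(n-1)$ convention used above):

```python
import numpy as np

# Hypothetical paired observations (x_i, y_i)
x = np.array([8.0, 10.5, 12.1, 14.3, 16.0, 18.2])
y = np.array([16.5, 18.0, 21.2, 22.4, 24.9, 27.1])

xc, yc = x - x.mean(), y - y.mean()           # center both coordinates
cov = (xc * yc).sum() / (len(x) - 1)          # empirical covariance
rho = cov / (x.std(ddof=1) * y.std(ddof=1))   # correlation coefficient

print(f"cov {cov:.3f}, rho {rho:.3f}")        # rho always lies in [-1, 1]
```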
ρ = 0.269
[Scatter plot of August temperatures against April temperatures.]
ρ = 0.962
[Scatter plot of maximum temperatures against minimum temperatures.]
Empirical covariance matrix
Data: $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ ($d$ features). The empirical covariance matrix is defined as
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T$$
The $(i, j)$ entry, $1 \le i, j \le d$, is given by
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n)_{ij} = \begin{cases} \operatorname{var}\left((\vec{x}_1)_i, \ldots, (\vec{x}_n)_i\right) & \text{if } i = j, \\ \operatorname{cov}\left(\left((\vec{x}_1)_i, (\vec{x}_1)_j\right), \ldots, \left((\vec{x}_n)_i, (\vec{x}_n)_j\right)\right) & \text{if } i \neq j. \end{cases}$$
Empirical variance in a certain direction

Let $\vec{v}$ be a unit-norm vector aligned with a direction of interest
$$\begin{aligned} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right) &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \vec{x}_i - \operatorname{av}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)\right)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\right)^2 \\ &= \vec{v}^T \left(\frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T\right) \vec{v} \\ &= \vec{v}^T\, \Sigma(\vec{x}_1, \ldots, \vec{x}_n)\, \vec{v} \end{aligned}$$
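A numerical check of this identity, as a sketch (the data matrix and direction are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples, d = 3 features
v = np.array([1.0, 2.0, -1.0])
v /= np.linalg.norm(v)                # unit-norm direction

# Left-hand side: empirical variance of the projections v^T x_i
lhs = (X @ v).var(ddof=1)

# Right-hand side: v^T Sigma v with the empirical covariance matrix
Sigma = np.cov(X, rowvar=False)       # uses 1/(n-1) by default
rhs = v @ Sigma @ v

print(np.isclose(lhs, rhs))           # True
```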
Eigendecomposition of the covariance matrix

The empirical covariance matrix is symmetric, so it admits an eigendecomposition
$$\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_d \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_d \end{bmatrix}^T$$
Eigendecomposition of the covariance matrix

For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ and corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
$$\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \qquad \vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}$$
$$\lambda_k = \max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}, \qquad \vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \vec{v}^T A \vec{v}$$
Principal component analysis
Compute the eigenvectors of the empirical covariance matrix to determine the directions of maximum variation
Application: dimensionality reduction
Example: seeds from 3 varieties of wheat (Kama, Rosa and Canadian)
7 features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove
Aim: visualize in two dimensions
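A minimal PCA sketch along these lines (a hypothetical random data matrix stands in for the wheat-seeds features):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(210, 7))            # hypothetical: 210 seeds, 7 features

# Center the data and form the empirical covariance matrix
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / (len(X) - 1)

# Eigenvectors of Sigma, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: for symmetric matrices
U = eigvecs[:, np.argsort(eigvals)[::-1]]

# Project onto the first two principal components for 2-D visualization
projections = Xc @ U[:, :2]
print(projections.shape)                  # (210, 2)
```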
PCA dimensionality reduction
[Projections of the data onto the first and second principal components.]

PCA dimensionality reduction
[Projections of the data onto the (d−1)th and dth principal components.]
Descriptive statistics Statistical estimation
Statistical estimation
Data: Realization of an iid sequence
Aim: Estimate a parameter associated with the underlying distribution
Frequentist viewpoint: The parameter is deterministic
Estimator
A deterministic function of the data $x_1, x_2, \ldots, x_n$:
$$y_n := h(x_1, x_2, \ldots, x_n)$$
Estimator

Under the iid assumption
$$Y(n) := h\left(X(1), X(2), \ldots, X(n)\right)$$
- Does $Y(n)$ converge to $\gamma$ as $n \to \infty$?
- For finite $n$, what is the probability that the estimator approximates $\gamma$ up to a certain accuracy?
Sampling from a population
Population of $m$ individuals
We are interested in a feature associated with each person (cholesterol level, salary, who they are voting for, ...)
The feature has $k$ possible values $\{z_1, z_2, \ldots, z_k\}$
$m_j$ = number of people for whom the feature equals $z_j$
Sampling from a population
Data: Values of the feature for a subset of individuals $X$
If individuals are chosen uniformly at random with replacement
$$p_{X(i)}(z_j) = P\left(\text{the feature for the } i\text{th chosen person equals } z_j\right) = \frac{m_j}{m}, \qquad 1 \le j \le k$$
The sequence is iid
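A sketch of this sampling model (the population is hypothetical; sampling with replacement yields the iid sequence):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: feature value of each of m individuals
population = np.array([0] * 600 + [1] * 300 + [2] * 100)  # m = 1000, k = 3
m = len(population)

# Choose n individuals uniformly at random with replacement -> iid sequence
sample = rng.choice(population, size=10_000, replace=True)

# Empirical frequencies approximate the pmf p(z_j) = m_j / m
for z in (0, 1, 2):
    print(z, (sample == z).mean(), (population == z).sum() / m)
```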
Mean square error
The mean square error (MSE) of an estimator $Y$ that approximates a parameter $\gamma$ is
$$\operatorname{MSE}(Y) := \operatorname{E}\left((Y - \gamma)^2\right)$$
Bias-variance decomposition

$$\begin{aligned} \operatorname{MSE}(Y) &= \operatorname{E}\left((Y - \gamma)^2\right) \\ &= \operatorname{E}\left((Y - \operatorname{E}(Y) + \operatorname{E}(Y) - \gamma)^2\right) \\ &= \operatorname{E}\left((Y - \operatorname{E}(Y))^2\right) + \left(\operatorname{E}(Y) - \gamma\right)^2 + 2\left(\operatorname{E}(Y) - \gamma\right) \underbrace{\operatorname{E}\left(Y - \operatorname{E}(Y)\right)}_{=\, \operatorname{E}(Y) - \operatorname{E}(Y)\, =\, 0} \\ &= \underbrace{\operatorname{E}\left((Y - \operatorname{E}(Y))^2\right)}_{\text{variance}} + \underbrace{\left(\operatorname{E}(Y) - \gamma\right)^2}_{\text{squared bias}} \end{aligned}$$
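A Monte Carlo sketch checking the decomposition (hypothetical setup: the parameter is the variance of a Gaussian, estimated with the biased $1/n$ normalization so that both terms are nonzero):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 4.0                              # true parameter: sigma^2
n, trials = 10, 200_000

# Biased estimator: empirical variance with 1/n normalization
samples = rng.normal(scale=2.0, size=(trials, n))
Y = samples.var(axis=1, ddof=0)

mse = ((Y - gamma) ** 2).mean()
variance = Y.var()
bias_sq = (Y.mean() - gamma) ** 2

print(f"MSE {mse:.4f} ~= variance {variance:.4f} + bias^2 {bias_sq:.4f}")
```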
Unbiased estimator
An estimator $Y$ that approximates $\gamma$ is unbiased if and only if $\operatorname{E}(Y) = \gamma$
Empirical mean is unbiased

The empirical mean of an iid sequence $X$ with mean $\mu$
$$Y(n) := \frac{1}{n} \sum_{i=1}^{n} X(i)$$
is unbiased:
$$\operatorname{E}\left(Y(n)\right) = \operatorname{E}\left(\frac{1}{n} \sum_{i=1}^{n} X(i)\right) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{E}\left(X(i)\right) = \mu$$
The empirical variance is also unbiased
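A quick simulation sketch of both claims (hypothetical Gaussian data; averaging over many trials approximates the expectations):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 3.0, 4.0
n, trials = 8, 500_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

# E(empirical mean) ~ mu and E(empirical variance, 1/(n-1)) ~ sigma^2
print(samples.mean(axis=1).mean())         # ~ 3.0
print(samples.var(axis=1, ddof=1).mean())  # ~ 4.0
```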
Consistency
An estimator $Y(n) := h\left(X(1), X(2), \ldots, X(n)\right)$ that approximates $\gamma$ is consistent if it converges to $\gamma$ as $n \to \infty$: in mean square, with probability one, or in probability
Consistency
The empirical mean of an iid sequence $X$ with mean $\mu$
$$Y(n) := \frac{1}{n} \sum_{i=1}^{n} X(i)$$
is consistent by the law of large numbers if the variance is bounded
Estimating the average height
Population of 25,000 people
Goal: Estimate the average height from iid samples $X$
The average of the population is the mean of the iid sequence:
$$\operatorname{E}\left(X(i)\right) := \sum_{j=1}^{m} P(\text{person } j \text{ is chosen}) \cdot h_j = \frac{1}{m} \sum_{j=1}^{m} h_j = \operatorname{av}(h_1, \ldots, h_m)$$
where $h_j$ denotes the height of person $j$
Estimating the average height
[Histogram of the heights in the population; x-axis: height (inches).]

Estimating the average height
[Empirical mean as a function of n (log scale, n from 1 to 1000) compared with the true mean; y-axis: height (inches).]
Empirical median is consistent
The empirical median of an iid sequence X is consistent even if the mean is not well defined or the variance is unbounded
Proof
Aim: Show that for any $\epsilon > 0$
$$\lim_{n \to \infty} P\left(\left|Y(n) - \gamma\right| \ge \epsilon\right) = 0$$
We will prove that
$$\lim_{n \to \infty} P\left(Y(n) \ge \gamma + \epsilon\right) = 0$$
The same argument allows us to prove
$$\lim_{n \to \infty} P\left(Y(n) \le \gamma - \epsilon\right) = 0$$
Proof

Assuming $n$ is odd, $Y(n)$ equals the $(n+1)/2$-th ordered element
The event $Y(n) \ge \gamma + \epsilon$ implies that at least $(n+1)/2$ of the elements are larger than $\gamma + \epsilon$
For each individual $X(i)$, the probability that $X(i) > \gamma + \epsilon$ is
$$p := 1 - F_{X(i)}(\gamma + \epsilon) = \frac{1}{2} - \epsilon'$$
for some $\epsilon' > 0$, since $\gamma$ is the median and hence $F_{X(i)}(\gamma) = 1/2$
Distribution of the number of $X(i)$ above $\gamma + \epsilon$? Binomial $B_n$ with parameters $n$ and $p$
Proof

$$\begin{aligned} P\left(Y(n) \ge \gamma + \epsilon\right) &\le P\left(\tfrac{n+1}{2} \text{ or more samples} \ge \gamma + \epsilon\right) \\ &= P\left(B_n \ge \frac{n+1}{2}\right) \\ &= P\left(B_n - np \ge \frac{n+1}{2} - np\right) \\ &\le P\left(|B_n - np| \ge n\epsilon' + \frac{1}{2}\right) \\ &\le \frac{\operatorname{Var}(B_n)}{\left(n\epsilon' + \frac{1}{2}\right)^2} \qquad \text{by Chebyshev's inequality} \\ &= \frac{np(1-p)}{n^2\left(\epsilon' + \frac{1}{2n}\right)^2} \\ &= \frac{p(1-p)}{n\left(\epsilon' + \frac{1}{2n}\right)^2} \longrightarrow 0 \text{ as } n \to \infty \end{aligned}$$
Cauchy iid sequence: empirical mean
[Moving average of a Cauchy iid sequence plotted against the median of the iid sequence, for i up to 50, 500 and 5,000; the moving average keeps fluctuating and does not settle.]

Cauchy iid sequence: empirical median
[Moving median of the same Cauchy iid sequence plotted against the median of the iid sequence, for i up to 50, 500 and 5,000; the moving median stabilizes.]
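A simulation sketch of this comparison (assumes numpy; the Cauchy distribution has no mean, while its median is 0):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(5000)                  # iid Cauchy sequence

n = np.arange(1, len(x) + 1)
moving_average = np.cumsum(x) / n              # keeps jumping around
moving_median = [np.median(x[:i]) for i in n]  # settles near 0

print(moving_average[-5:])                     # erratic values
print(moving_median[-5:])                      # close to 0
```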
Consistency
The empirical variance is consistent if the fourth moment is bounded
The covariance matrix converges under similar conditions
PCA: n = 5
[Eigenvectors of the true covariance vs. the empirical covariance, for two independent draws of 5 samples.]

PCA: n = 20
[Same comparison with 20 samples.]

PCA: n = 100
[Same comparison with 100 samples.]
Confidence intervals
Aim: quantify the accuracy of an estimator for a fixed amount of data
A $1 - \alpha$ confidence interval $\mathcal{I}$ for a parameter $\gamma$ satisfies
$$P(\gamma \in \mathcal{I}) \ge 1 - \alpha, \qquad 0 < \alpha < 1$$
Confidence interval for the mean of an iid sequence
Let $X$ be an iid sequence with mean $\mu$ and variance $\sigma^2 \le b^2$ for some $b > 0$. For any $0 < \alpha < 1$
$$\mathcal{I}_n := \left[Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}}\right], \qquad Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right),$$
is a $1 - \alpha$ confidence interval for $\mu$
Proof

$$\begin{aligned} P\left(\mu \in \left[Y_n - \frac{b}{\sqrt{\alpha n}},\; Y_n + \frac{b}{\sqrt{\alpha n}}\right]\right) &= 1 - P\left(|Y_n - \mu| > \frac{b}{\sqrt{\alpha n}}\right) \\ &\ge 1 - \frac{\alpha\, n \operatorname{Var}(Y_n)}{b^2} \qquad \text{by Chebyshev's inequality} \\ &= 1 - \frac{\alpha \sigma^2}{b^2} \\ &\ge 1 - \alpha \end{aligned}$$
Bears in Yosemite
Aim: estimate the average weight of bears in Yosemite
A scientist captures 300 bears; their average weight is $Y := 200$ lbs
We need a bound on the variance. The maximum weight is 880 lbs, so for a randomly selected bear $X$
$$\sigma^2 = \operatorname{E}\left(X^2\right) - \operatorname{E}^2(X) \le \operatorname{E}\left(X^2\right) \le 880^2 \qquad \text{because } X \le 880 =: b$$
Bears in Yosemite
$$\left[Y - \frac{b}{\sqrt{\alpha n}},\; Y + \frac{b}{\sqrt{\alpha n}}\right] = [-27.2,\; 427.2]$$
is a 95% confidence interval for the average weight of the whole population
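The same interval computed numerically, as a sketch (values taken from the example):

```python
import numpy as np

Y, b, n, alpha = 200.0, 880.0, 300, 0.05   # values from the example

half_width = b / np.sqrt(alpha * n)        # Chebyshev-based half-width
print(Y - half_width, Y + half_width)      # approximately -27.2, 427.2
```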
Central limit theorem with empirical standard deviation
Let $X$ be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\operatorname{E}\left(\left(X(i)\right)^4\right)$ are bounded. The sequence
$$\frac{\sqrt{n}\left(\operatorname{av}\left(X(1), \ldots, X(n)\right) - \mu\right)}{\operatorname{std}\left(X(1), \ldots, X(n)\right)}$$
converges in distribution to a standard Gaussian random variable
Q function
For $x > 0$
$$Q(x) := \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \mathrm{d}u$$
If $U$ is a standard Gaussian random variable and $y < 0$
$$P(U < y) = Q(-y)$$
Approximate confidence interval for the mean
Let $X$ be an iid discrete sequence with mean $\mu$ such that its variance and fourth moment $\operatorname{E}\left(\left(X(i)\right)^4\right)$ are bounded. For any $0 < \alpha < 1$
$$\mathcal{I}_n := \left[Y_n - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right]$$
$$Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right), \qquad S_n := \operatorname{std}\left(X(1), X(2), \ldots, X(n)\right)$$
is an approximate $1 - \alpha$ confidence interval for $\mu$, i.e.
$$P(\mu \in \mathcal{I}_n) \approx 1 - \alpha$$
Approximate confidence interval for the mean

$$\begin{aligned} P(\mu \in \mathcal{I}_n) &= 1 - P\left(Y_n > \mu + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right) - P\left(Y_n < \mu - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right) \\ &= 1 - P\left(\frac{\sqrt{n}\,(Y_n - \mu)}{S_n} > Q^{-1}\left(\frac{\alpha}{2}\right)\right) - P\left(\frac{\sqrt{n}\,(Y_n - \mu)}{S_n} < -Q^{-1}\left(\frac{\alpha}{2}\right)\right) \\ &\approx 1 - 2\, Q\left(Q^{-1}\left(\frac{\alpha}{2}\right)\right) \qquad \text{by the central limit theorem} \\ &= 1 - \alpha \end{aligned}$$
Bears in Yosemite
The empirical standard deviation is $S_n = 100$ lbs. Given that $Q(1.95) \approx 0.025$,
$$\left[Y - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right] \approx [188.8,\; 211.3]$$
is an approximate 95% confidence interval
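A sketch of this computation (scipy.stats.norm.isf plays the role of $Q^{-1}$; values taken from the example):

```python
import numpy as np
from scipy.stats import norm

Y, S, n, alpha = 200.0, 100.0, 300, 0.05   # values from the example

q_inv = norm.isf(alpha / 2)                # Q^{-1}(alpha/2), roughly 1.96
half_width = S / np.sqrt(n) * q_inv
print(Y - half_width, Y + half_width)      # approximately 188.7, 211.3
```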
Interpreting confidence intervals
"The average weight is between 188.8 and 211.3 lbs with probability 0.95." This reading is tempting but not quite right: under the frequentist viewpoint the average weight is deterministic, so it either lies in the computed interval or it does not
Interpreting confidence intervals
If we repeat the process of sampling the population and computing the confidence interval, then the true parameter will lie in the interval 95% of the time
Estimating the average height
We compute 40 confidence intervals of the form
$$\mathcal{I}_n := \left[Y_n - \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right),\; Y_n + \frac{S_n}{\sqrt{n}}\, Q^{-1}\left(\frac{\alpha}{2}\right)\right]$$
$$Y_n := \operatorname{av}\left(X(1), X(2), \ldots, X(n)\right), \qquad S_n := \operatorname{std}\left(X(1), X(2), \ldots, X(n)\right)$$
for $1 - \alpha = 0.95$ and different values of $n$
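A simulation sketch of this experiment (hypothetical exponential population; roughly 95% of the 40 intervals should contain the true mean):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, n, alpha = 1.0, 200, 0.05              # exponential(1) has mean 1
q_inv = norm.isf(alpha / 2)

covered = 0
for _ in range(40):                        # 40 independent intervals
    x = rng.exponential(mu, size=n)
    half = x.std(ddof=1) / np.sqrt(n) * q_inv
    covered += (x.mean() - half <= mu <= x.mean() + half)

print(f"{covered}/40 intervals contain the true mean")  # typically ~38/40
```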