SLIDE 1

Robust PCA for High-Dimensional Data

Huan Xu, Constantine Caramanis and Shie Mannor

Talk by Shie Mannor, The Technion Department of Electrical Engineering

June 2010

Thank you for staying for the graveyard session

SLIDE 2

PCA - in Words

  • Observe high-dimensional points
  • Find least-square-error subspace approximation
  • Many applications in feature extraction and compression:
  • data analysis
  • communication theory
  • pattern recognition
  • image processing

SLIDE 3

PCA - in Pictures

Observe points: y = Ax + v.

Figure: Signal and Noise.

SLIDE 4

PCA - in Pictures

Observe points: y = Ax + v.

Figure: Signal and Noise.

SLIDE 5

PCA - in Pictures

Observe points: y = Ax + v.

SLIDE 6

PCA - in Pictures

Observe points: y = Ax + v.

SLIDE 7

PCA - in Pictures

Observe points: y = Ax + v. Goal: Find least-square-error subspace approximation.

SLIDE 8

PCA - in Math

  • Least-square-error subspace approximation
  • How: Singular value decomposition (SVD) performs eigenvector decomposition of the sample-covariance matrix

SLIDE 9

PCA - in Math

  • Least-square-error subspace approximation
  • How: Singular value decomposition (SVD) performs eigenvector decomposition of the sample-covariance matrix
  • Magic of SVD: solving a non-convex problem
  • Cannot replace quadratic objective here.

SLIDE 10

PCA - in Math

  • Least-square-error subspace approximation
  • How: Singular value decomposition (SVD) performs eigenvector decomposition of the sample-covariance matrix

  • Magic of SVD: solving a non-convex problem
  • Cannot replace quadratic objective here.
  • Consequence: Sensitive to outliers
  • Even one outlier can make the output arbitrarily skewed;
  • What about a constant fraction of “outliers”?
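To make the SVD bullet and the outlier sensitivity concrete, here is a minimal sketch (my own illustration, not part of the talk) of PCA via SVD of the centered data matrix, plus a toy run where a single large outlier swings the leading direction; all names, sizes, and values are illustrative.

```python
import numpy as np

def pca_svd(Y, d):
    """Top-d principal directions via SVD of the centered data matrix (columns = samples)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)            # center each coordinate
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)  # SVD of centered data = eigen-decomposition of sample covariance
    return U[:, :d]                                   # leading left singular vectors = PCs

rng = np.random.default_rng(0)
m, n = 2, 200
signal = np.array([[1.0], [0.0]])                     # true one-dimensional subspace
Y = signal @ rng.normal(size=(1, n)) + 0.1 * rng.normal(size=(m, n))

w_clean = pca_svd(Y, 1)
Y_corrupt = np.hstack([Y, np.array([[0.0], [50.0]])]) # add a single large outlier
w_corrupt = pca_svd(Y_corrupt, 1)
print(np.abs(w_clean.T @ signal))                     # close to 1: PC aligned with the signal
print(np.abs(w_corrupt.T @ signal))                   # close to 0: one outlier skews the output
```
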
SLIDE 11

This Talk: High Dimensions and Corruption

Two key differences from the pictures shown: (A) High-dimensional regime: # observations ≤ dimensionality. (B) A constant fraction of points arbitrarily corrupted.

SLIDE 12

Outline

  • 1. Motivation: PCA, High dimensions, corruption
  • 2. Where things get tricky: usual tools fail
  • 3. HR-PCA: the algorithm
  • 4. The Proof Ideas (and some details)
  • 5. Conclusion
SLIDE 13

High-Dimensional Data

  • What is high-dimensional data: #dimensionality ≈ #observations.
  • Why high-dimensional data analysis:
  • Many practical examples: DNA microarray, financial data, semantic indexing, images, etc.

Figure: MicroArray: 24,401 dim.

SLIDE 14

High-Dimensional Data

  • What is high-dimensional data: #dimensionality ≈ #observations.
  • Why high-dimensional data analysis:
  • Many practical examples: DNA microarray, financial data, semantic indexing, images, etc.
  • Networks: user-behavior-aware network algorithms (Cognitive Networks)?

Figure: MicroArray: 24,401 dim.

SLIDE 15

High-Dimensional Data

  • What is high-dimensional data: #dimensionality ≈ #observations.
  • Why high-dimensional data analysis:
  • Many practical examples: DNA microarray, financial data, semantic indexing, images, etc.
  • Networks: user-behavior-aware network algorithms (Cognitive Networks)?
  • The kernel trick generates high-dimensional data

Figure: MicroArray: 24,401 dim.

SLIDE 16

High-Dimensional Data

  • What is high-dimensional data: #dimensionality ≈ #observations.
  • Why high-dimensional data analysis:
  • Many practical examples: DNA microarray, financial data, semantic indexing, images, etc.
  • Networks: user-behavior-aware network algorithms (Cognitive Networks)?
  • The kernel trick generates high-dimensional data
  • Traditional statistical tools do not work

Figure: MicroArray: 24,401 dim.

SLIDE 17

Corrupted Data

Figure: No Outliers Figure: With Outliers

SLIDE 18

Corrupted Data

Figure: No Outliers Figure: With Outliers

  • Some observations about the corrupted points:
  • They have a large magnitude.
  • They have a large (Mahalanobis) distance.
  • They increase the volume of the smallest containing ellipsoid.

SLIDE 19

Corrupted Data

Figure: No Outliers Figure: With Outliers

  • Some observations about the corrupted points:
  • They have a large magnitude.
  • They have a large (Mahalanobis) distance.
  • They increase the volume of the smallest containing ellipsoid.

SLIDE 20

Our Goal: Robust PCA

  • Want robustness to arbitrarily corrupted data.
  • One measure: Breakdown point
  • Instead: bounded error measure between true PCs and output PCs.
  • Bound will depend on:
  • Fraction of outliers.
  • Tails of true distribution.
SLIDE 21

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
SLIDE 22

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
  • xi ∈ Rd. xi ∼ µ,
  • ni ∈ Rm. ni ∼ N(0, Im),
  • A ∈ Rm×d and µ unknown. µ mean zero, covariance I.

SLIDE 23

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
  • xi ∈ Rd. xi ∼ µ,
  • ni ∈ Rm. ni ∼ N(0, Im),
  • A ∈ Rm×d and µ unknown. µ mean zero, covariance I.
  • The “Outliers” o1, · · · , on−t ∈ Rm: generated arbitrarily.

SLIDE 24

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
  • xi ∈ Rd. xi ∼ µ,
  • ni ∈ Rm. ni ∼ N(0, Im),
  • A ∈ Rm×d and µ unknown. µ mean zero, covariance I.
  • The “Outliers” o1, · · · , on−t ∈ Rm: generated arbitrarily.
  • Observe: Y = {y1, · · · , yn} = {z1, · · · , zt} ∪ {o1, · · · , on−t}.

SLIDE 25

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
  • xi ∈ Rd. xi ∼ µ,
  • ni ∈ Rm. ni ∼ N(0, Im),
  • A ∈ Rm×d and µ unknown. µ mean zero, covariance I.
  • The “Outliers” o1, · · · , on−t ∈ Rm: generated arbitrarily.
  • Observe: Y = {y1, · · · , yn} = {z1, · · · , zt} ∪ {o1, · · · , on−t}.
  • Regime of interest:
  • n ≈ m >> d
  • σ = ||A⊤A|| >> 1 (scales slowly).
  • Objective: Retrieve A
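For experimentation, a hedged sketch of sampling from this model (my own code, not the authors'); n, m, d, the outlier fraction, the signal scale, and the way outliers are placed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 1000, 1000, 5        # illustrative sizes: n ≈ m >> d
lam = 0.2                      # illustrative fraction of corrupted points
t = int((1 - lam) * n)         # number of authentic samples
sigma = 10.0                   # illustrative signal-strength scale

A = sigma * np.linalg.qr(rng.normal(size=(m, d)))[0]  # A in R^{m x d}, unknown to the algorithm
X = rng.normal(size=(d, t))                           # x_i ~ mu (standard normal here)
Noise = rng.normal(size=(m, t))                       # n_i ~ N(0, I_m)
Z = A @ X + Noise                                     # authentic samples z_i = A x_i + n_i

o_dir = rng.normal(size=(m, 1))
o_dir /= np.linalg.norm(o_dir)
O = o_dir @ np.full((1, n - t), 0.5 * sigma)          # aligned outliers of length O(sigma) << sqrt(m)
Y = np.hstack([Z, O])                                 # observed set: authentic points plus outliers
```
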
SLIDE 26

Outline

  • 1. Motivation
  • 2. Where things get tricky
  • 3. HR-PCA: the algorithm
  • 4. The Proof Ideas (and some details)
  • 5. Conclusion
SLIDE 27

Features of the High Dimensional regime

  • Noise Explosion in High Dimensions: noise magnitude scales faster than the signal magnitude;
  • SNR goes to zero
  • If n ∼ N(0, Im), then E||n||2 ≈ √m, with very sharp concentration.
  • Meanwhile: E||Ax||2 ≤ σ√d.
  • Consequences:
  • Magnitude of true samples may be much bigger than outlier magnitude.
  • The direction of each sample will be approximately orthogonal to the direction of the signal;
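A quick numerical sanity check of this scaling (an illustrative sketch, not from the talk): the noise norm grows like √m, the signal norm stays near the σ√d scale, and the noise is nearly orthogonal to the signal.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 5, 10.0
for m in (100, 1_000, 10_000):
    A = sigma * np.linalg.qr(rng.normal(size=(m, d)))[0]   # illustrative signal map
    x = rng.normal(size=d)
    noise = rng.normal(size=m)                             # n ~ N(0, I_m)
    signal = A @ x
    # noise norm ~ sqrt(m); signal norm stays around sigma*sqrt(d); their cosine ~ 0
    cos = abs(signal @ noise) / (np.linalg.norm(signal) * np.linalg.norm(noise))
    print(m, round(np.linalg.norm(noise), 1), round(np.linalg.norm(signal), 1), round(cos, 3))
```
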
SLIDE 28

Features of the High Dimensional regime: Pictures


Figure: Recall low-dimensional regime

SLIDE 29

Features of the High Dimensional regime: Pictures


Figure: High dimensions are different: Noise >> Signal

SLIDE 30

Features of the High Dimensional regime: Pictures


Figure: High dimensions are different: Noise >> Signal

SLIDE 31

Features of the High Dimensional regime: Pictures

Figure: Every point equidistant from origin and from other points!

SLIDE 32

Features of the High Dimensional regime: Pictures

Figure: And every point perpendicular to signal space

SLIDE 33

Trouble in High Dimensions

  • Some approaches that will not work:
  • Leave-one-out (more generally, subsample, compare):
  • Either sample size very small: problem
  • or have many corrupted points in each subsample: problem

SLIDE 34

Trouble in High Dimensions

  • Some approaches that will not work:
  • Leave-one-out (more generally, subsample, compare):
  • Either sample size very small: problem
  • or have many corrupted points in each subsample: problem
  • Standard Robust PCA: PCA on a robust estimation of the covariance
  • Consistency requires #(observations) ≫ #(dimension)
  • Not enough observations in high-dimensional case

SLIDE 35

Trouble in High Dimensions

  • Some more approaches that will not work:
  • Removing points with large magnitude
SLIDE 36

Trouble in High Dimensions

  • Some more approaches that will not work:
  • Removing points with large magnitude
SLIDE 37

Trouble in High Dimensions

  • Some more approaches that will not work:
  • Removing points with large magnitude
SLIDE 38

Trouble in High Dimensions

  • Some more approaches that will not work:
  • Removing points with large magnitude
  • Remove points with large Mahalanobis distance
  • Same example: All λn corrupted points: aligned, length O(σ) << √m.
  • Very large impact on PCA output.
  • But: Mahalanobis distance of outliers very small.

SLIDE 39

Trouble in High Dimensions

  • Some more approaches that will not work:
  • Removing points with large magnitude
  • Remove points with large Mahalanobis distance
  • Same example: All λn corrupted points: aligned, length O(σ) << √m.
  • Very large impact on PCA output.
  • But: Mahalanobis distance of outliers very small.
  • Remove points with large Stahel-Donoho distance

ui = sup_{||w||=1} |w⊤yi − medj(w⊤yj)| / medk |w⊤yk − medj(w⊤yj)|.

  • Same example: impact large, but Stahel-Donoho outlyingness small.

SLIDE 40

Trouble in High Dimensions

  • For these reasons: Some robust covariance estimators have breakdown point = O(1/m), m = dimensions.
  • M-estimator,
  • Convex peeling, Ellipsoidal peeling,
  • Classical outlier rejection
  • Iterative deletion, iterative trimming,
  • and others...
  • These approaches cannot work in the high-dimensional regime.

SLIDE 41

Trouble in High Dimensions

  • Algorithmic Tractability
SLIDE 42

Trouble in High Dimensions

  • Algorithmic Tractability
  • Minimum volume ellipsoid; Minimum covariance determinant:

SLIDE 43

Trouble in High Dimensions

  • Algorithmic Tractability
  • Minimum volume ellipsoid; Minimum covariance determinant:
  • Ill-posed: many zero-volume ellipsoids containing data
  • Intractable: removing a fraction of points combinatorial.

SLIDE 44

Trouble in High Dimensions

  • Algorithmic Tractability
  • Minimum volume ellipsoid; Minimum covariance determinant:
  • Ill-posed: many zero-volume ellipsoids containing data
  • Intractable: removing a fraction of points combinatorial.
  • Projection pursuit – maximize univariate estimator
  • Problems are non-convex: Intractable.

SLIDE 45

Trouble in High Dimensions

  • Algorithmic Tractability
  • Minimum volume ellipsoid; Minimum covariance determinant:
  • Ill-posed: many zero-volume ellipsoids containing data
  • Intractable: removing a fraction of points combinatorial.
  • Projection pursuit – maximize univariate estimator
  • Problems are non-convex: Intractable.
  • Choosing subset of directions generated by points: authentic points ⊥ to signal space, hence no good in high dimensions.

SLIDE 46

Outline

  • 1. Motivation
  • 2. Where things get tricky
  • 3. HR-PCA: the algorithm
  • 4. The Proof Ideas (and some details)
  • 5. Conclusion
SLIDE 47

High-dimensional Robust PCA: Main Idea

  • Get candidate directions from standard PCA (get w).
  • Project, and use a robust variance estimator: variance of points nearer origin.
  • Outliers can be near origin. But: impact controlled.
  • Random removal of “strange" points.

SLIDE 48

High-dimensional Robust PCA: Main Idea

  • Get candidate directions from standard PCA (get w).
  • Project, and use a robust variance estimator: variance of points nearer origin.
  • Outliers can be near origin. But: impact controlled.
  • Random removal of “strange" points.
  • Desired properties of an algorithm:
  • Tractable (same complexity as standard PCA);
  • Robust to outliers: performance guarantees;
  • Asymptotically optimal: n − t = o(n) gives perfect recovery.
  • Easily kernelizable;

SLIDE 49

Problem Setup

  • “Authentic Samples” z1, · · · , zt ∈ Rm: zi = Axi + ni,
  • xi ∈ Rd. xi ∼ µ,
  • ni ∈ Rm. ni ∼ N(0, Im),
  • A ∈ Rm×d and µ unknown. µ mean zero, covariance I.
  • The “Outliers” o1, · · · , on−t ∈ Rm: generated arbitrarily.
  • Observe: Y = {y1, · · · , yn} = {z1, · · · , zt} ∪ {o1, · · · , on−t}.
  • Assumptions:
  • n, m scale to infinity together;
  • σ = ||A⊤A|| “big” (scales to infinity slowly);
  • µ: spherically symmetric; absolutely continuous; exponential tails.

SLIDE 50

Objective & Performance Measurement

  • For output PCs w1, · · · , wd, the “Expressed Variance” w.r.t. the true PCs w1^true, · · · , wd^true:

EV(w1, · · · , wd) = [ ∑_{i=1}^{d} wi⊤AA⊤wi ] / [ ∑_{i=1}^{d} (wi^true)⊤AA⊤wi^true ] ≤ 1.

  • EV = 1 if the subspace spanned by true PCs is recovered.
  • For d = 1, EV(w1) = cos²(∠(w1, w1^true)).
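Transcribed directly into code (a sketch; it assumes A and the candidate directions are available, which only happens in simulation):

```python
import numpy as np

def expressed_variance(W, W_true, A):
    """Expressed Variance of candidate directions W versus true directions W_true (columns = directions)."""
    M = A @ A.T                            # m x m matrix AA^T
    num = np.trace(W.T @ M @ W)            # sum_i w_i^T A A^T w_i
    den = np.trace(W_true.T @ M @ W_true)  # sum_i (w_i^true)^T A A^T w_i^true
    return num / den
```
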

SLIDE 51

A Robust Variance Estimator

  • Robust Variance Estimator:

V̂_t̂(w) = (1/n) ∑_{i=1}^{t̂} |w⊤y|²_(i).

  • Order statistics: α1, . . . , αn ∈ R, then α(1) ≤ α(2) ≤ · · · ≤ α(n).
  • Idea: If outliers small, their impact is controlled.
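A minimal implementation of this estimator (a sketch; t_hat, the number of smallest squared projections kept, is a parameter left open here):

```python
import numpy as np

def robust_variance(Y, w, t_hat):
    """Robust variance estimate of the projections of Y (columns = points) onto direction w."""
    proj_sq = (w @ Y) ** 2               # squared projections |w^T y_i|^2
    smallest = np.sort(proj_sq)[:t_hat]  # keep the t_hat smallest (order statistics)
    return smallest.sum() / Y.shape[1]   # normalize by n, as on the slide
```
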
SLIDE 52

The HR-PCA Algorithm

(1) Perform PCA on empirical covariance.
(2) If robust variance estimate in PC directions highest yet, record it, and PCs.
(3) Randomly remove a point in proportion to its variance along PCs.
(4) Repeat until “enough" points removed.
(5) Output the last PCs recorded.

SLIDE 53

The HR-PCA Algorithm

(1) Perform PCA on empirical covariance: {w1, . . . , wd}.
(2) Compute b = RVE({w1, . . . , wd}). If b > b∗:
  • Update b∗ = b
  • Update {w∗1, . . . , w∗d} = {w1, . . . , wd}.
(3) Randomly remove a point in proportion to its variance along PCs.
(4) Repeat until all points removed.
(5) Output the last PCs recorded: {w∗1, . . . , w∗d}.
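Putting the pieces together, a hedged end-to-end sketch of this loop (not the authors' reference implementation; it reuses the robust_variance sketch above, and n_remove, d, and t_hat are illustrative parameters):

```python
import numpy as np

def hr_pca(Y, d, t_hat, n_remove, rng=np.random.default_rng()):
    """HR-PCA sketch: alternate PCA, robust-variance scoring, and random removal."""
    Y = Y.copy()
    best_score, best_W = -np.inf, None
    for _ in range(n_remove):
        Yc = Y - Y.mean(axis=1, keepdims=True)
        W = np.linalg.svd(Yc, full_matrices=False)[0][:, :d]     # (1) PCA directions
        score = sum(robust_variance(Y, W[:, j], t_hat) for j in range(d))
        if score > best_score:                                   # (2) record best RVE so far
            best_score, best_W = score, W
        var_along = ((W.T @ Y) ** 2).sum(axis=0)                 # variance of each point along the PCs
        idx = rng.choice(Y.shape[1], p=var_along / var_along.sum())
        Y = np.delete(Y, idx, axis=1)                            # (3) random removal, proportional to variance
    return best_W                                                # (5) PCs with the best robust variance estimate
```

On the synthetic data sketched earlier, one would compare expressed_variance(hr_pca(Y, d, t_hat, n_remove), W_true, A) against the same score for plain PCA, with W_true taken as the top-d left singular vectors of A.
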

SLIDE 54

The HR-PCA Algorithm: Pitfalls

  • Things that can go wrong:
SLIDE 55

The HR-PCA Algorithm: Pitfalls

  • Things that can go wrong:

  ∗ Remove authentic points
  ∗ May not ultimately report “best outcome.”
  ∗ Corrupted points may contribute to ultimately reported PCs.

SLIDE 56

The HR-PCA Algorithm: Pitfalls

  • Things that can go wrong:

  ∗ Remove authentic points
  ∗ May not ultimately report “best outcome.”
  ∗ Corrupted points may contribute to ultimately reported PCs.

  • But: we show the error due to all such factors is controlled.

SLIDE 57

The Guarantees: Finite Sample + Asymptotic

  • Results will depend on:
  • Fraction of outliers: λ.
  • Tails of µ.
  • Define: V : [0, 1] → [0, 1],

V(α) = ∫_{−cα}^{cα} x² µ(dx).
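As a worked instance (my own illustration, not from the talk), take the one-dimensional marginal of µ to be standard normal and assume cα denotes the cutoff with µ([−cα, cα]) = α, a convention the slide does not spell out; then V(α) has a closed form:

```latex
% Assumed convention: c_alpha is the symmetric cutoff with mu([-c_a, c_a]) = alpha.
\[
\mathcal{V}(\alpha) \;=\; \int_{-c_\alpha}^{c_\alpha} x^2\,\varphi(x)\,dx
\;=\; \bigl(2\Phi(c_\alpha)-1\bigr) \;-\; 2\,c_\alpha\,\varphi(c_\alpha)
\;=\; \alpha \;-\; 2\,c_\alpha\,\varphi(c_\alpha),
\]
% by integration by parts (x*phi(x) = -phi'(x)); e.g. alpha = 0.5 gives c_alpha ~ 0.674 and V(0.5) ~ 0.071.
```
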

SLIDE 58

The Guarantees: Finite Sample + Asymptotic

Theorem: The following holds in probability (as n, m, σ scale):

E.V.(output) ≥ max over κ of  [ V( 1 − λ∗(1+κ)/((1−λ∗)κ) ) / (1 + κ) ] × [ V( t̂/t − λ∗/(1−λ∗) ) / V( t̂/t ) ].

SLIDE 59

The Guarantees: Finite Sample + Asymptotic

Theorem: The following holds in probability (as n, m, σ scale):

E.V.(output) ≥ max over κ of  [ V( 1 − λ∗(1+κ)/((1−λ∗)κ) ) / (1 + κ) ] × [ V( t̂/t − λ∗/(1−λ∗) ) / V( t̂/t ) ].

  • The Bound:
  • Term 1: May not remove all outliers, and some authentic points may be removed.
  • Term 2: May have small outliers that alter PC directions.
  • If n − t = o(n), RHS = 1: optimal recovery.
  • Breakdown point: 1/2.

SLIDE 60

Asymptotic Performance Guarantee

E.V. is lower bounded by the expression above. If the proportion of outliers goes to zero, the Expressed Variance equals 1.

SLIDE 61

Proof Idea

(1) “Blessing of dimensionality”: empirical covariance estimates good, even for high-dimensional regime;
(2) Random removal: have a “good” solution, or outlier is removed with large probability;
(3) Therefore: at some early iteration, algorithm finds a “good” solution.
(4) Output of algorithm has higher robust variance estimate than the “good” solution. We show output must then also be (almost as) “good.”

SLIDE 62

Proof Idea - Step 1

With high probability:

(1.a) Largest eigenvalue of the empirical noise covariance matrix is bounded:

sup_{w∈Sm} (1/n) ∑_{i=1}^{t} (w⊤ni)² ≤ c.

(1.b) Largest eigenvalue of the signals in the original space converges to 1:

sup_{w∈Sd} | (1/t) ∑_{i=1}^{t} (w⊤xi)² − 1 | ≤ ε.

SLIDE 63

Proof Idea - Step 1

(1.c) RVE is a valid variance estimator for the d-dimensional signals x:

sup_{w∈Sd} | (1/t) ∑_{i=1}^{t̂} |w⊤x|²_(i) − V(t̂/t) | ≤ ε.

(1.d) RVE is a valid estimator of the variance of the authentic samples, z = Ax + n: uniformly over all w ∈ Sm,

(1 − ε) ||w⊤A||² V(t′/t) − c ||w⊤A|| ≤ (1/t) ∑_{i=1}^{t′} |w⊤z|²_(i) ≤ (1 + ε) ||w⊤A||² V(t′/t) + c ||w⊤A||.

SLIDE 64

Proof - Step 1.a - details

(1.a) Largest eigenvalue of the empirical noise covariance matrix is bounded:

sup_{w∈Sm} (1/n) ∑_{i=1}^{t} (w⊤ni)² ≤ c.

  • Two keys: “blessing of dimensionality” and uniform laws of large numbers.

SLIDE 65

Proof - Step 1.a - details

(1.a) Largest eigenvalue of the empirical noise covariance matrix is bounded:

sup_{w∈Sm} (1/n) ∑_{i=1}^{t} (w⊤ni)² ≤ c.

  • Two keys: “blessing of dimensionality” and uniform laws of large numbers.
  • Step 1 (a): Need basic Lemma:
  • Lemma: For Γ an m × t matrix (m ≤ t), Γij ∼ N(0, 1), i.i.d.:

Pr[ σmax(Γ) > √m + √t + √t·ε ] ≤ exp(−tε²/2).

SLIDE 66

Proof - Step 1.a - details

(1.a) Largest eigenvalue of the empirical noise covariance matrix is bounded:

sup_{w∈Sm} (1/n) ∑_{i=1}^{t} (w⊤ni)² ≤ c.

  • Two keys: “blessing of dimensionality” and uniform laws of large numbers.
  • Step 1 (a): Need basic Lemma:
  • Lemma: For Γ an m × t matrix (m ≤ t), Γij ∼ N(0, 1), i.i.d.:

Pr[ σmax(Γ) > √m + √t + √t·ε ] ≤ exp(−tε²/2).

  • Observation:

sup_{w∈Sm} (1/t) ∑_{i=1}^{t} (w⊤ni)² = λmax(ΓΓ⊤)/t = σmax(Γ)²/t.
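A quick empirical check of this bound's scale (my own sketch, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
m, t = 800, 1000
Gamma = rng.normal(size=(m, t))                   # i.i.d. N(0,1) entries, m <= t
smax = np.linalg.svd(Gamma, compute_uv=False)[0]  # largest singular value
print(smax, np.sqrt(m) + np.sqrt(t))              # smax concentrates just below sqrt(m) + sqrt(t)
print(smax**2 / t)                                # = sup_w (1/t) sum_i (w^T n_i)^2, an O(1) constant
```
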

SLIDE 67

Proof - Step 1.a - An Aside

  • Where do these results come from:
  • Basic idea: dimension-free concentration of measure
  • Theorem: Let F be L-Lipschitz w.r.t. the Euclidean norm, X ∼ N(0, I) the standard Gaussian measure, and MF the mean of F(X). Then P(F(X) ≥ MF + ξ) ≤ e^(−ξ²/(2L²)).
  • Basic observation: σmax(·) : R^(n1×n2) → R is 1-Lipschitz.
  • Two nice references: (a) Davidson and Szarek: Operators, Random Matrices & Banach Spaces; (b) Matousek: Lectures on Discrete Geometry.

SLIDE 68

Proof Idea

(1) “Blessing of dimensionality”: empirical covariance estimates good, even for high-dimensional regime;
(2) Random removal: have a “good” solution, or outlier is removed with large probability;
(3) Therefore: at some early iteration, algorithm finds a “good” solution.
(4) Output of algorithm has higher robust variance estimate than the “good” solution. We show output must then also be (almost as) “good.”

SLIDE 69

Proof Idea - Step 2

  • Let Z(s), O(s) be remaining authentic/outlier points.
  • Fix κ > 0 and call step s a “Good Event”, G(s) if:
slide-70
SLIDE 70

Proof Idea - Step 2

  • Let Z(s), O(s) be remaining authentic/outlier points.
  • Fix κ > 0 and call step s a “Good Event”, G(s) if:

∑_{j=1}^{d} ∑_{zi∈Z(s−1)} (wj(s)⊤zi)² ≥ (1/κ) ∑_{j=1}^{d} ∑_{oi∈O(s−1)} (wj(s)⊤oi)².

SLIDE 71

Proof Idea - Step 2

  • Let Z(s), O(s) be remaining authentic/outlier points.
  • Fix κ > 0 and call step s a “Good Event”, G(s) if:

∑_{j=1}^{d} ∑_{zi∈Z(s−1)} (wj(s)⊤zi)²  [variance of authentic pts]  ≥  (1/κ) ∑_{j=1}^{d} ∑_{oi∈O(s−1)} (wj(s)⊤oi)²  [variance of corrupted pts].

SLIDE 72

Proof Idea - Step 2

  • Let Z(s), O(s) be remaining authentic/outlier points.
  • Fix κ > 0 and call step s a “Good Event”, G(s) if:

∑_{j=1}^{d} ∑_{zi∈Z(s−1)} (wj(s)⊤zi)²  [variance of authentic pts]  ≥  (1/κ) ∑_{j=1}^{d} ∑_{oi∈O(s−1)} (wj(s)⊤oi)²  [variance of corrupted pts].

  • This means: variance in the direction of the found PCs is mostly due to the authentic samples.
  • Hence: {w1, . . . , wd} must be close to true PCs.

SLIDE 73

Proof Idea - Step 2

  • Let Z(s), O(s) be remaining authentic/outlier points.
  • Fix κ > 0 and call step s a “Good Event”, G(s) if:

∑_{j=1}^{d} ∑_{zi∈Z(s−1)} (wj(s)⊤zi)²  [variance of authentic pts]  ≥  (1/κ) ∑_{j=1}^{d} ∑_{oi∈O(s−1)} (wj(s)⊤oi)²  [variance of corrupted pts].

  • This means: variance in the direction of the found PCs is mostly due to the authentic samples.
  • Hence: {w1, . . . , wd} must be close to true PCs.
  • Theorem: If Gc(s) (step s is not good), then the next point removed is an outlier with probability at least κ/(1+κ).
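The one-line reason behind this probability (my own filling-in of the step, using only the definitions above): removal is in proportion to variance along the current PCs, so on the complement of G(s) the outliers carry more than a κ/(1+κ) share of that variance.

```latex
% Write V_Z and V_O for the variance along the current PCs carried by remaining
% authentic and corrupted points. On G^c(s) we have V_Z < V_O / kappa, hence
\[
\Pr(\text{removed point is an outlier}) \;=\; \frac{V_O}{V_O + V_Z}
\;>\; \frac{V_O}{V_O + V_O/\kappa} \;=\; \frac{\kappa}{1+\kappa}.
\]
```
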

SLIDE 74

Proof Idea

(1) “Blessing of dimensionality”: empirical covariance estimates good, even for high-dimensional regime;
(2) Random removal: have a “good” solution, or outlier is removed with large probability;
(3) Therefore: at some early iteration, algorithm finds a “good” solution.
(4) Output of algorithm has higher robust variance estimate than the “good” solution. We show output must then also be (almost as) “good.”

SLIDE 75

Proof Idea - Step 3

  • Theorem: With high probability, we have a “good event” by time at most s0 > λn[(1 + κ)/κ].

SLIDE 76

Proof Idea - Step 3

  • Theorem: With high probability, we have a “good event” by time at most s0 > λn[(1 + κ)/κ].
  • Intuition: Suppose subsequent steps were independent.
  • The “expected number of corrupted points removed each step” is κ/(1 + κ).
  • After M steps, the expected number of corrupted points removed is Mκ/(1 + κ).
  • Therefore: All the outliers are removed after M = λn[(1 + κ)/κ](1 + ε) steps, with exponentially high probability.

SLIDE 77

Proof Idea - Step 3

  • Theorem: With high probability, we have a “good event” by time at most s0 > λn[(1 + κ)/κ].
  • Intuition: Suppose subsequent steps were independent.
  • The “expected number of corrupted points removed each step” is κ/(1 + κ).
  • After M steps, the expected number of corrupted points removed is Mκ/(1 + κ).
  • Therefore: All the outliers are removed after M = λn[(1 + κ)/κ](1 + ε) steps, with exponentially high probability.
  • The Problem: not i.i.d.
  • The Fix: use martingales and Azuma-Hoeffding.

SLIDE 78

Proof Idea - Step 3 - details

  • Let T = min{s|G(s) is true}.
SLIDE 79

Proof Idea - Step 3 - details

  • Let T = min{s|G(s) is true}.
  • Define the random variable (w.r.t. natural filtration Fs):

Xs = |O(T − 1)| + [κ/(1 + κ)]·(T − 1),  if T ≤ s;
Xs = |O(s)| + [κ/(1 + κ)]·s,  if T > s.

Note: X0 = λn.

SLIDE 80

Proof Idea - Step 3 - details

  • Let T = min{s|G(s) is true}.
  • Define the random variable (w.r.t. natural filtration Fs):

Xs = |O(T − 1)| + [κ/(1 + κ)]·(T − 1),  if T ≤ s;
Xs = |O(s)| + [κ/(1 + κ)]·s,  if T > s.

Note: X0 = λn.

  • Lemma: {Xs, Fs} is a supermartingale.

SLIDE 81

Proof Idea - Step 3 - details

  • Let T = min{s|G(s) is true}.
  • Define the random variable (w.r.t. natural filtration Fs):

Xs = |O(T − 1)| + [κ/(1 + κ)]·(T − 1),  if T ≤ s;
Xs = |O(s)| + [κ/(1 + κ)]·s,  if T > s.

Note: X0 = λn.

  • Lemma: {Xs, Fs} is a supermartingale.
  • Now we have: for s0 = λn[(1 + κ)/κ](1 + ε),

P(T > s0) ≤ P( Xs0 ≥ κs0/(1 + κ) ) = P( Xs0 ≥ (1 + ε)λn ).

SLIDE 82

Proof Idea - Step 3 - details

  • Let T = min{s|G(s) is true}.
  • Define the random variable (w.r.t. natural filtration Fs):

Xs = |O(T − 1)| + [κ/(1 + κ)]·(T − 1),  if T ≤ s;
Xs = |O(s)| + [κ/(1 + κ)]·s,  if T > s.

Note: X0 = λn.

  • Lemma: {Xs, Fs} is a supermartingale.
  • Now we have: for s0 = λn[(1 + κ)/κ](1 + ε),

P(T > s0) ≤ P( Xs0 ≥ κs0/(1 + κ) ) = P( Xs0 ≥ (1 + ε)λn ).

  • Azuma-Hoeffding completes the proof.
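For completeness, the standard Azuma-Hoeffding step this points to (my sketch; it assumes the increments |Xs − Xs−1| are bounded by 1, which holds since each iteration removes at most one point and the drift term κ/(1+κ) is less than 1):

```latex
% Azuma-Hoeffding for the supermartingale (X_s) with X_0 = lambda*n and |X_s - X_{s-1}| <= 1:
\[
\Pr\bigl(T > s_0\bigr) \;\le\; \Pr\bigl(X_{s_0} - X_0 \ge \epsilon \lambda n\bigr)
\;\le\; \exp\!\left(-\frac{(\epsilon \lambda n)^2}{2 s_0}\right),
\]
% with s_0 = lambda*n*(1+kappa)(1+epsilon)/kappa, the exponent grows linearly in n,
% so the failure probability vanishes exponentially fast.
```
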
SLIDE 83

Proof Idea

(1) “Blessing of dimensionality”: empirical covariance estimates good, even for high-dimensional regime;
(2) Random removal: have a “good” solution, or outlier is removed with large probability;
(3) Therefore: at some early iteration, algorithm finds a “good” solution.
(4) Output of algorithm has higher robust variance estimate than the “good” solution. We show output must then also be (almost as) “good.”

SLIDE 84

Proof Idea - Step 4

  • Putting it all together:
  • An early iteration produces directions ŵ1, . . . , ŵd that have “most of” the variance.
  • Bound quality on these directions:

EV(ŵ1, · · · , ŵd) = [ ∑_{i=1}^{d} ŵi⊤AA⊤ŵi ] / [ ∑_{i=1}^{d} (wi^true)⊤AA⊤wi^true ].

  • The final algorithm only produces directions w∗1, . . . , w∗d with the biggest robust variance estimator.
  • Bound quality on these directions:

EV(w∗1, · · · , w∗d) = [ ∑_{i=1}^{d} (w∗i)⊤AA⊤w∗i ] / [ ∑_{i=1}^{d} (wi^true)⊤AA⊤wi^true ], where ∑_{i=1}^{d} (w∗i)⊤AA⊤w∗i is compared against ∑_{i=1}^{d} ŵi⊤AA⊤ŵi.

SLIDE 85

Kernelization

  • Using a kernel function k(·, ·) to represent a feature mapping Υ(·)
  • PCA can be kernelized using Kernel PCA, with output of the form vq = ∑_{i=1}^{n−s} αi(q) Υ(ŷi), q = 1, · · · , d.
  • HR-PCA Algorithm requires:
  • Computing PCA;
  • Computing Robust Variance Estimator;
  • Both steps can be done.
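As an illustration of why both steps stay tractable (my own sketch, not the authors' kernel derivation; feature-space centering is omitted for brevity): with vq = ∑i αi(q) Υ(ŷi), every projection the robust variance estimator needs reduces to kernel evaluations.

```python
import numpy as np

def kernel_projections(K, alpha):
    """Projections <v_q, Upsilon(y_j)> for all q, j, given the Gram matrix K[i, j] = k(y_hat_i, y_j).

    alpha holds one column of coefficients per kernel PC v_q, so no explicit feature map is needed.
    """
    return alpha.T @ K          # entry (q, j) = sum_i alpha_i(q) * k(y_hat_i, y_j)

def kernel_robust_variance(K, alpha, t_hat, n):
    """RVE along each kernel PC: keep the t_hat smallest squared projections, divide by n."""
    proj_sq = kernel_projections(K, alpha) ** 2
    return np.sort(proj_sq, axis=1)[:, :t_hat].sum(axis=1) / n
```
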
SLIDE 86

Conclusion

  • Methodology for handling dimensionality reduction when:
  • 1. #(Observation) ∼ #(Dimension)
  • 2. #(Outliers) is “large"
  • The key idea: verify that projection statistics behave in a certain way; if not, probabilistic point removal
  • Works well in simulations

On the todo list:

  • Generalize to other identification problems with outliers, when a probabilistic model is available
  • Extend to stochastic programming with corrupted sampled data
  • Looking for an online algorithm.