SLIDE 1

PCA for Distributed Data Sets

Raymond H. Chan Department of Mathematics The Chinese University of Hong Kong Joint work with Franklin Luk (RPI) and Z.-J. Bai (CUHK)

– p. 1/34

SLIDE 2

Grid

Powerful processors with relatively slow links.

[Figure: four powerful teraflop processors, labelled 1–4, connected by slow gigabit networks.]

SLIDE 3

New York State TeraGrid Initiative

State University of NY at Albany
Rensselaer Polytechnic Institute
Brookhaven National Laboratory
State University of NY at Stony Brook
in partnership with IBM and NYSERNet

SLIDE 4

[Figure slide; no recoverable text.]

SLIDE 5

Grid Topology

Four sites, scalable grid architecture
10 to 20 Gb/sec connections
6 TF processors

Computation / Communication = (6 × 10^12) / (0.3 × 10^9) = 20,000!

Communication is the bottleneck: computation is done locally.

SLIDE 6

Principal Component Analysis

Let X be an n × p data matrix, where n ≫ p. The data covariance matrix S is given by

   nS = X^T (I − (1/n) e_n e_n^T) X,

where e_n^T = (1, 1, ..., 1).

PCA ⇔ Karhunen–Loève transform ⇔ Hotelling transform

SLIDE 7

PCA means:

Get the spectral decomposition of nS: nS = V Σ^2 V^T.
Choose the few largest eigenvalues and corresponding eigenvectors V.
Form the principal component vectors X · V.
This yields a low-dimensional representation of the original data.

SLIDE 8

PCA involves only V and Σ

Since (I − (1/n) e_n e_n^T) is symmetric and idempotent,

   nS = X^T (I − (1/n) e_n e_n^T) X
      = X^T (I − (1/n) e_n e_n^T)(I − (1/n) e_n e_n^T) X.

Σ and V can therefore be obtained from the SVD of the centred data matrix:

   (I − (1/n) e_n e_n^T) X = U Σ V^T.

The low-dimensional representation X · V can still be formed.
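As a concrete illustration (my own sketch, not from the slides), here is this observation in NumPy: Σ and V come from the SVD of the centred data matrix, with no covariance matrix formed explicitly. The sizes n, p and the random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))

# Centring is exactly (I - (1/n) e_n e_n^T) X.
Xc = X - X.mean(axis=0)
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

# nS = Xc^T Xc = V Sigma^2 V^T, so the SVD of Xc delivers Sigma and V.
nS = Xc.T @ Xc
assert np.allclose(nS, V @ np.diag(sigma**2) @ V.T)

# Low-dimensional representation: keep the first few principal components.
scores = X @ V[:, :2]
```

Note that working with the SVD of Xc rather than the eigendecomposition of nS also avoids squaring the condition number.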

SLIDE 9

Distributed PCA

Big data matrix X: n ≈ 10^12, e.g. in visualization and data mining.

Problem: the data are distributed amongst s processors. Can we find Σ and V without moving X across processors?

SLIDE 10

Data among s processors

Denote

   X = [X_0; X_1; ...; X_{s−1}],    n = sum_{i=0}^{s−1} n_i,

where X_i is n_i × p and resides on processor i. Typical sizes: n_i ≈ 10^12 and p ≈ 10^3.

SLIDE 11

Aim

Compute the PCA of X without moving the data matrices X_i.
Move O(p^α) data across processors instead of O(n_i).

SLIDE 12

Distributed PCA by Qu et al.

1. At processor i, calculate the local PCA using the SVD:

   (I − (1/n_i) e_{n_i} e_{n_i}^T) X_i = U_i Σ_i V_i^T.

Say the matrix has numerical rank k_i. Send x̄_i (the column mean of X_i) and the k_i largest principal components Σ_i and V_i to the central processor. Communication cost = O(p k_i).

SLIDE 13

2. At the central processor: assemble the p × p covariance matrix and find its PCA:

   nS = sum_{i=0}^{s−1} V_i Σ_i^2 V_i^T + sum_{i=0}^{s−1} n_i (x̄_i − x̄)(x̄_i − x̄)^T = V Σ^2 V^T.

Broadcast the first k columns of V. Communication cost = O(p k).
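A minimal NumPy sketch of steps 1–2 (my illustration, not Qu et al.'s code): the local SVDs plus the between-block mean term reassemble the global covariance matrix. The reassembly is exact here because each local numerical rank k_i equals p; truncating k_i < p is what introduces the approximation errors discussed later.

```python
import numpy as np

rng = np.random.default_rng(1)
s, p = 4, 3
blocks = [rng.standard_normal((50, p)) for _ in range(s)]
X = np.vstack(blocks)

xbar = X.mean(axis=0)
nS = np.zeros((p, p))
for Xi in blocks:
    ni = Xi.shape[0]
    xi = Xi.mean(axis=0)                                    # local mean, sent to centre
    Ui, si, Vit = np.linalg.svd(Xi - xi, full_matrices=False)  # local PCA
    nS += Vit.T @ np.diag(si**2) @ Vit                      # V_i Sigma_i^2 V_i^T
    d = (xi - xbar).reshape(-1, 1)
    nS += ni * (d @ d.T)                                    # n_i (xbar_i - xbar)(...)^T

# Sanity check: matches the directly computed scatter matrix.
assert np.allclose(nS, (X - xbar).T @ (X - xbar))
```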

SLIDE 14

3. Calculate the principal component vectors at processor i: X_i V.

SLIDE 15

Analysis of Qu's Approach

Advantage: reduces communication costs:

   O(p n) → O(p sum_{i=0}^{s−1} k_i).

Disadvantages:

Local SVDs introduce approximation errors.
The central processor becomes a bottleneck for communications and computation.

SLIDE 16

Luk's Algorithms

Replace the SVD by QR.

SLIDE 17

1a. At processor i: calculate the QR decomposition of the centred local block:

   (I − (1/n_i) e_{n_i} e_{n_i}^T) X_i = Q_i^(0) R_i^(0),

where R_i^(0) is p × p. Send n_i and x̄_i to the central processor. If i ≥ s/2, send R_i^(0) to processor i − s/2. There is no need to send Q_i^(0).

SLIDE 18

1b. At processor i < s/2: calculate the QR decomposition

   [R_i^(0); R_{i+s/2}^(0)] = Q_i^(1) R_i^(1),

where R_i^(1) is p × p. This equals the R factor of the QRD of the stacked centred blocks (I − (1/n_i) e_{n_i} e_{n_i}^T) X_i and (I − (1/n_{i+s/2}) e_{n_{i+s/2}} e_{n_{i+s/2}}^T) X_{i+s/2}. If i ≥ s/4, send R_i^(1) to processor i − s/4. There is no need to send Q_i^(1).

SLIDE 19

1c. At processor i < s/4: calculate the QR decomposition

   [R_i^(1); R_{i+s/4}^(1)] = Q_i^(2) R_i^(2),

where R_i^(2) is p × p. If i ≥ s/8, send R_i^(2) to processor i − s/8. There is no need to send Q_i^(2).

SLIDE 20

1d. Eventually, at processor 0: calculate the QR decomposition

   [R_0^(l−1); R_1^(l−1)] = Q_0^(l) R_0^(l),

where l = ⌈log_2 s⌉. Send R_0^(l) to the central processor. There is no need to send Q_0^(l).
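The tree reduction of steps 1a–1d can be sketched as follows (an illustrative NumPy version, my own, assuming s is a power of two; in the actual algorithm only the p × p R factors would cross the network):

```python
import numpy as np

rng = np.random.default_rng(2)
s, p = 4, 3                       # s a power of two
blocks = [rng.standard_normal((40, p)) for _ in range(s)]

# Step 1a: local QR of each centred block; only R_i^(0) travels.
Rs = [np.linalg.qr(Xi - Xi.mean(axis=0))[1] for Xi in blocks]

# Steps 1b-1d: halve the list each round by QR-factoring stacked pairs
# [R_i; R_{i+s/2}] -> R_i, until one p x p factor remains at processor 0.
while len(Rs) > 1:
    half = len(Rs) // 2
    Rs = [np.linalg.qr(np.vstack((Rs[i], Rs[i + half])))[1] for i in range(half)]
R = Rs[0]

# R^T R equals the sum of the local scatter matrices, since each QR of a
# stacked pair preserves the Gram matrix of its input.
target = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in blocks)
assert np.allclose(R.T @ R, target)
```

This is the same reduction pattern now commonly known as TSQR; each merge is a QR of a 2p × p stack, so the per-round message size is O(p^2) regardless of n_i.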

SLIDE 21

Main Results

Total communication cost = O(⌈log_2 s⌉ p^2). The covariance matrix

   nS = X^T (I − (1/n) e_n e_n^T) X

is given by:

   nS = (R_0^(l))^T R_0^(l) + sum_{i=0}^{s−1} n_i (x̄_i − x̄)(x̄_i − x̄)^T.

SLIDE 22

2. At the central processor: assemble the (s + p) × p data matrix

   Z = [ √n_0 (x̄_0 − x̄)^T;
         √n_1 (x̄_1 − x̄)^T;
         ...;
         √n_{s−1} (x̄_{s−1} − x̄)^T;
         R_0^(l) ].

Notice that nS = Z^T Z.

SLIDE 23

2 (continued). At the central processor: compute the SVD Z = U Σ V^T (after triangularization). Say Z has numerical rank k. Broadcast x̄ and the first k columns of V. Communication cost = O(p k).
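An illustrative sketch of step 2 (my own, not the authors' code): assemble Z from the weighted mean rows and the tree-reduced R factor, and verify that nS = Z^T Z, so the SVD of the small (s + p) × p matrix Z yields the global Σ and V.

```python
import numpy as np

rng = np.random.default_rng(3)
s, p = 4, 3
blocks = [rng.standard_normal((30, p)) for _ in range(s)]
X = np.vstack(blocks)
xbar = X.mean(axis=0)

# Tree-reduced R factor of the locally centred blocks (steps 1a-1d).
Rs = [np.linalg.qr(Xi - Xi.mean(axis=0))[1] for Xi in blocks]
while len(Rs) > 1:
    half = len(Rs) // 2
    Rs = [np.linalg.qr(np.vstack((Rs[i], Rs[i + half])))[1] for i in range(half)]

# Assemble the (s + p) x p matrix Z: one sqrt(n_i)(xbar_i - xbar)^T row per
# processor, followed by the final p x p R factor.
mean_rows = [np.sqrt(Xi.shape[0]) * (Xi.mean(axis=0) - xbar) for Xi in blocks]
Z = np.vstack(mean_rows + [Rs[0]])

# nS = Z^T Z, so the SVD of Z delivers the global Sigma and V.
assert np.allclose(Z.T @ Z, (X - xbar).T @ (X - xbar))
U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
```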

SLIDE 24

3. At processor i: calculate the principal component vectors X_i V.

SLIDE 25

Analysis of Luk's Algorithm

Advantages over Qu's approach:

Communication cost of the PCA drops: O(p sum_{i=0}^{s−1} k_i) → O(p^2 ⌈log_2 s⌉).
No local PCA approximation errors.
Less congestion at the central processor for communications and computation.
Works directly with the data matrices.

SLIDE 26

Data Updating

Assume global synchronization at t_0, t_1, ..., t_k; i.e. during [t_{k−1}, t_k], new data X_i^(k) are added at processor i.

Aim: find the PCA of the new extended matrix without moving the X_i^(k) across processors.

SLIDE 27

[Table: update blocks by processor and time. Processor i (i = 0, ..., s−1) receives block X_i^(0) at t_0, X_i^(1) at t_1, ..., X_i^(m) at t_m.]

SLIDE 28

At time t_k

Let

   X^(k) = [X_0^(k); X_1^(k); ...; X_{s−1}^(k)],    n^(k) = sum_{i=0}^{s−1} n_i^(k),

where X_i^(k) is n_i^(k) × p.

Assume the PCA of the original matrix X^(0) = X is available by Luk's algorithm.

SLIDE 29

Global Data Matrix at t_m

Denote the stacked updates and their total size by

   X^(m) = [X^(0); X^(1); ...; X^(m)],    g(m) = sum_{k=0}^{m} n^(k).

Aim: find the PCA of its covariance matrix:

   g(m) · S_g(m) = (X^(m))^T (I − (1/g(m)) e_{g(m)} e_{g(m)}^T) X^(m).

SLIDE 30

Our Theorem

Let

   n^(k) · S_k = (X^(k))^T (I − (1/n^(k)) e_{n^(k)} e_{n^(k)}^T) X^(k).

Then

   g(m) S_g(m) = sum_{k=0}^{m} n^(k) S_k
               + sum_{k=1}^{m} [g(k−1) n^(k) / g(k)] (x̄_{g(k−1)} − x̄_{n^(k)})(x̄_{g(k−1)} − x̄_{n^(k)})^T.
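The theorem can be checked numerically. The following sketch (my own, with arbitrary batch sizes) accumulates the batch scatter matrices n^(k) S_k together with the cross terms [g(k−1) n^(k) / g(k)] (x̄_{g(k−1)} − x̄_{n^(k)})(...)^T, and compares the result against the scatter of the full stacked data.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 3, 4
batches = [rng.standard_normal((25, p)) for _ in range(m + 1)]   # X^(0), ..., X^(m)

def scatter(A):
    """Return n*S = (centred A)^T (centred A)."""
    Ac = A - A.mean(axis=0)
    return Ac.T @ Ac

gS = sum(scatter(Xk) for Xk in batches)          # sum of the n^(k) S_k terms
g = batches[0].shape[0]                          # g(0) = n^(0)
mean_g = batches[0].mean(axis=0)                 # running mean of first g rows
for Xk in batches[1:]:
    nk = Xk.shape[0]
    d = (mean_g - Xk.mean(axis=0)).reshape(-1, 1)
    gS += (g * nk / (g + nk)) * (d @ d.T)        # cross term of the theorem
    mean_g = (g * mean_g + nk * Xk.mean(axis=0)) / (g + nk)
    g += nk                                      # g(k) = g(k-1) + n^(k)

assert np.allclose(gS, scatter(np.vstack(batches)))
```

Each cross term only needs the running count g(k−1), the running mean, and the new batch's mean, which is why the update never touches the old raw data.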

SLIDE 31

Explanation

The PCA of the update data matrix X^(k) can be obtained by Luk's algorithm, i.e. n^(k) S_k = R_k^T R_k. Then

   g(m) S_g(m) = sum_{k=0}^{m} R_k^T R_k
               + sum_{k=1}^{m} [g(k−1) n^(k) / g(k)] (x̄_{g(k−1)} − x̄_{n^(k)})(x̄_{g(k−1)} − x̄_{n^(k)})^T.

Assemble these to construct the global PCA of X^(m).

SLIDE 32

Analysis of Our Algorithm

The global PCA can be computed without moving the X^(k).
Communication cost is still O(p^2 ⌈log_2 s⌉).
No local PCA approximation errors.
Works directly with the data matrices and update matrices.
Load balancing for communications and computation.

SLIDE 33

Load Balancing

Let s = 2^l. We can allocate all processors to do the QR factorizations such that:

   PCA of X^(k) ← PCA of X^(k−1) + R factor of X^(k).

The PCA of X^(k) is obtained at t_{k+l}. The procedure is periodic with period l and well balanced among the processors.

SLIDE 34

[Figure: processor–time schedule for processors 1–7 over t_0, ..., t_6, showing local QR steps (XR), pairwise R-factor merges (RR), and the communication and computation phases.]