 
              PCA for Distributed Data Sets Raymond H. Chan Department of Mathematics The Chinese University of Hong Kong Joint work with Franklin Luk (RPI) and Z.-J. Bai (CUHK) – p. 1/34
Grid Powerful processors with relatively slow links. ✬✩ ✬✩ Powerful 0 1 Teraflop ✫✪ ✫✪ � ❅ Processors � ❅ ✬✩ � ❅ Slow � ❅ � ❅ Gigabit 4 ✫✪ Networks � � ✬✩ � ✬✩ � � 3 2 ✫✪ ✫✪ – p. 2/34
New York State TeraGrid Initiative � State University of NY at Albany � Rensselaer Polytechnic Institute � Brookhaven National Laboratory � State University of NY at Stony Brook � in partnership with IBM and NYSERNet – p. 3/34
– p. 4/34
Grid Topology � Four sites, scalable grid architecture � 10 to 20 Gb/sec connection � 6TF Processors Communication = 6 × 10 12 Computation 0.3 × 10 9 = 20, 000! Communication Bottleneck: computation done locally. – p. 5/34
Principal Component Analysis Let X be an n × p data matrix, where n ≫ p . Data covariace matrix S is given by nS = X T ( I − 1 n e n e T n ) X , where e T n = ( 1, 1, . . . , 1 ) . PCA ⇐ ⇒ Karhunen-Loève transform ⇐ ⇒ Hotelling transform – p. 6/34
PCA means � get spectral decomposition of n · S : nS = V Σ 2 V T . � choose few largest eigenvalues and eigenvectors � V . � form principal component vectors X · � V . � low-dimension representation of original data. – p. 7/34
PCA involves only V and Σ Since ( I − 1 n e n e T n ) is symmetric and idempotent, nS = X T ( I − 1 n e n e T n ) X = X T ( I − 1 n e n e T n e n e T n )( I − 1 n ) X Σ and V can be obtained from SVD of: n e n e T n ) X = U Σ V T . ( I − 1 Low-dim’l representation X · � V can still be done. – p. 8/34
Distributed PCA Big data matrix X : n ≈ 10 12 . E.g. visualization, data mining. Problem: Data are distributed amongst s processors. Can we find Σ and V without moving X across processors? – p. 9/34
Data among s processors Denote     X 0      s − 1 X 1   ∑ X = n = n i ,   . .    .  i = 0   X s − 1 where X i is n i × p , resides on processor i . Typical: n i ≈ 10 12 and p ≈ 10 3 . – p. 10/34
Aim � Compute PCA of X without moving the data matrix X i . � Move O ( p α ) data across processors instead of O ( n i ) . – p. 11/34
Distributed PCA by Qu et al. 1. At processor i , calculate local PCA using SVD: n i e n i e T n i ) X i = U i Σ i V T ( I − 1 i . Say matrix has numerical rank k i . Send ¯ x i (column sum of X i ), and k i largest principal components � Σ i and � V i to central processor. Communication costs = O ( pk i ) . – p. 12/34
2. At central processor: Assemble p × p covariance matrix and find its PCA s − 1 s − 1 n � V i � � i � Σ 2 V T x ) T ∑ ∑ S = i + n i ( ¯ x i − ¯ x )( ¯ x i − ¯ i = 0 i = 0 = V Σ 2 V T . Broadcast � V , the first k columns of V . Communication costs = O ( pk ) . – p. 13/34
3. Calculate principal component vectors at processor i : X i � V . – p. 14/34
Analysis of Qu’s Approach Advantage: Reduce communication costs: � s − 1 � �� ∑ O ( pn ) − → O p k i . i = 0 Disadvantages: � Local SVD’s introduce approximation errors. � Central processor becomes bottleneck for communications and computation. – p. 15/34
Luk’s Algorithms Replace SVD by QR – p. 16/34
1a. At processor i Calculate QR decomposition of ( I − 1 n i e n i e T n i ) X i : n i ) X i = Q ( 0 ) R ( 0 ) n i e n i e T ( I − 1 i , i where R ( 0 ) is p × p . i Send n i and ¯ x i to central processor. If i ≥ s /2, send R ( 0 ) to processor i − s /2. i No need to send Q ( 0 ) i . – p. 17/34
1b. At processor i < s /2 Calculate QR decomposition � � R ( 0 ) = Q ( 1 ) R ( 1 ) i i , R ( 0 ) i i + s /2 where R ( 1 ) is p × p . Equals to QRD of i ( I − 1 1 n i e n i e T n i + s /2 e n i + s /2 e T n i ) X i ( I − n i + s /2 ) X i + s / and If i ≥ s /4, send R ( 1 ) to processor i − s /4. i No need to send Q ( 1 ) i . – p. 18/34
1c. At processor i < s /4 Calculate QR decomposition � � R ( 1 ) = Q ( 2 ) R ( 2 ) i i , R ( 1 ) i i + s /4 where R ( 2 ) is p × p . i If i ≥ s /8, send R ( 2 ) to processor i − s /8. i No need to send Q ( 2 ) i . – p. 19/34
1d. Eventually, at processor 0 Calculate QR decomposition of � � R ( l − 1 ) = Q ( l ) 0 R ( l ) 0 0 , R ( l − 1 ) 1 where l = ⌈ log 2 s ⌉ . Send R ( l ) 0 to central processor. No need to send Q ( l ) 0 . – p. 20/34
Main Results � Total communication costs = O ( ⌈ log 2 s ⌉ p 2 ) . � The covariance matrix nS = X T ( I − 1 n e n e T n ) X , is given by: s − 1 T nS = R ( ℓ ) R ( ℓ ) x ) T . ∑ 0 + n i ( ¯ x i − ¯ x )( ¯ x i − ¯ 0 i = 0 – p. 21/34
2. At central processor Assemble ( s + p ) × p data matrix:   √ n 0 ( ¯ x ) T x 0 − ¯ √ n 1 ( ¯   x ) T x 1 − ¯    .  . Z = . .   √ n s − 1 ( ¯   x ) T x s − 1 − ¯   R ( l ) 0 Notice that: nS = Z T Z . – p. 22/34
2. At central processor Compute SVD: Z = U Σ V T (after triangulation). Say Z has numerical rank k . x and � V , first k columns of V . Broadcast ¯ Communication costs = O ( pk ) . – p. 23/34
3. At processor i Calculate principal component vectors: X i � V . – p. 24/34
Analysis of Luk’s Algorithm Advantage over Qu’s Approach: � Communication costs on PCA: �� − � s − 1 � → O ( p 2 ⌈ log 2 s ⌉ ) , ∑ O p k i i = 0 � No local PCA approximation errors. � Less congestion in central processor for communications and computation. � Work directly with data matrices. – p. 25/34
Data Updating Assume global synchronization at t 0 , t 1 , . . . , t k , i.e. at [ t k − 1 , t k ] , new data are added to X ( k ) on i processor i . Aim: Find the PCA for the new extended matrix, without moving X ( k ) across processors. i – p. 26/34
X ( m ) X ( 0 ) X ( 1 ) X ( m ) · · · Processor X ( 0 ) X ( 1 ) X ( m ) s − 1 s − 1 s − 1 s − 1 . . . · · · X ( 0 ) X ( 1 ) X ( m ) 1 1 1 1 X ( 0 ) X ( 1 ) X ( m ) 0 0 0 0 t · · · t 0 t 1 t m – p. 27/34
At time t k Let    X ( k )    0     X ( k )   s − 1 X ( k ) = n ( k ) = n ( k )   ∑ 1 i , .   .  .    i = 0    X ( k ) s − 1 where X ( k ) is n ( k ) × p . i i Assume PCA of original matrix X ( 0 ) = X is available by Luk’s algorithm. – p. 28/34
Global Data Matrix at t m Denote    X ( 0 )       X ( 1 ) m   X ( m ) = n ( k ) . ∑ g ( m ) =   . .    .  i = 0   X ( s − 1 ) Aim: Find PCA for its covariance matrix: g ( m ) · S g ( m ) = X ( m ) T ( I − g ( m ) ) X ( m ) . g ( m ) e g ( m ) e T 1 – p. 29/34
Our Theorem Let n ( k ) · S k = X ( k ) T ( I − n ( k ) ) X ( k ) . n ( k ) e n ( k ) e T 1 Then m n ( k ) S k ∑ g ( m ) S g ( m ) = k = 0 m g ( k − 1 ) n ( k ) ∑ x n ( k ) ) T + ( ¯ x g ( k − 1 ) − ¯ x n ( k ) )( ¯ x g ( k − 1 ) − ¯ g ( k ) k = 1 – p. 30/34
Explanation PCA of update data matrix X ( k ) can be obtained by Luk’s algorithm, i.e. n ( k ) S k = R T k R k . Then m R T ∑ g ( m ) S g ( m ) = k R k k = 0 m g ( k − 1 ) n ( k ) ∑ x n ( k ) ) T + ( ¯ x g ( k − 1 ) − ¯ x n ( k ) )( ¯ x g ( k − 1 ) − ¯ g ( k ) k = 1 Assemble them to construct global PCA for X ( m ) . – p. 31/34
Analysis of Our Algorithm � Global PCA can be computed without moving X ( k ) . � Communication costs still O ( p 2 ⌈ log 2 s ⌉ ) , � No local PCA approximation errors. � Work directly with data matrices and update matrices. � Load balancing for communications and computation. – p. 32/34
Load Balancing Let s = 2 ℓ . We can allocate all processors to do the QR factorizations such that: � PCA of X ( k ) ← PCA of X ( k − 1 ) + R factor of X ( k ) . � PCA of X ( k ) obtained in t k + ℓ . � The procedure is periodic with period ℓ . � Well-balanced among the processors. – p. 33/34
Recommend
More recommend