Shared Memory Parallelization of MTTKRP for Dense Tensors
SLIDE 1

Shared Memory Parallelization of MTTKRP for Dense Tensors

Koby Hayashi, Grey Ballard, Yujie Jiang, Michael Tobia — {hayakb13, ballard, jiany14, tobiamj}@wfu.edu

BLIS Retreat 2017, September 18th

SLIDE 2

Neuroimaging Application

Tensor: Time × Subjects × Voxel Correlation Matrix

Test: Rest → Activity → Recovery
Subjects: Control, MDD, SAD, COMO

SLIDE 3

Quick Introduction to Tensors

Multidimensional arrays; an N-dimensional tensor is said to be N-way or order-N.

[Figure: example 1-way through 5-way tensors]

SLIDE 4

CP Decomposition

Canonical Polyadic Decomposition (CP): decomposes a tensor into a sum of rank-1 tensors

$\mathcal{Y} \approx \sum_{d=0}^{D-1} v_d \circ w_d \circ x_d, \qquad \mathcal{Y} \approx \llbracket V, W, X \rrbracket$

where $v_d$, $w_d$, $x_d$ are the $d$th columns of the factor matrices $V$, $W$, $X$.

SLIDE 5

CP via Alternating Least Squares

SLIDE 6

Hadamard Product

Element-wise matrix product, denoted $*$:

$D = B * C, \qquad D_{jk} = B_{jk}\, C_{jk}$

Applied to Gram matrices of the factor matrices (as in the ALS normal equations):

$(V_0^T V_0) * \cdots * (V_{o-1}^T V_{o-1}) * (V_{o+1}^T V_{o+1}) * \cdots * (V_{O-1}^T V_{O-1})$
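Both uses can be sketched in NumPy (illustrative values and shapes, not from the talk):

```python
import numpy as np

# Hadamard product of two equally shaped matrices
B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[10.0, 20.0], [30.0, 40.0]])
D = B * C  # D[j, k] = B[j, k] * C[j, k]

# Hadamard product of Gram matrices V_i^T V_i, leaving out mode o = 1,
# for a 3-way tensor with rank-3 factors (illustrative shapes)
rng = np.random.default_rng(1)
V0 = rng.standard_normal((4, 3))
V2 = rng.standard_normal((5, 3))
G = (V0.T @ V0) * (V2.T @ V2)  # 3 x 3, elementwise product of Grams
```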

SLIDE 7

Khatri Rao Product

Khatri-Rao Product (KRP): $L = B \odot C$

  • Column-wise Kronecker product: $L(:,j) = B(:,j) \otimes C(:,j)$
  • Or Hadamard product of rows: $L(s_C + s_B J_C,\,:) = B(s_B,:) * C(s_C,:)$

If $B$ is $J_B \times D$ and $C$ is $J_C \times D$, then $L$ is $J_B J_C \times D$.

SLIDE 8

Tensor Fibers

β€‹π‘œ=0, 𝒴↓(:π‘˜π‘™) β€‹π‘œ=1, 𝒴↓(𝑗:𝑙) β€‹π‘œ=2, 𝒴↓(π‘—π‘˜:)

SLIDE 9

Unfolding Tensors

  • The mode-$o$ matricization of an $O$-way tensor $\mathcal{Y}$ that is $J_0 \times J_1 \times \cdots \times J_{O-1}$ is denoted $Y_{(o)}$ and is $J_o \times J_{\neq o}$
  • $J_{\neq o} = \prod_{l \in [O],\, l \neq o} J_l$
  • $Y_{(n:o)}$ denotes a matricization where $\{n, n+1, \ldots, o\}$ are the row modes, so it has $\prod_{l=n}^{o} J_l$ rows

SLIDE 10

Matricized Tensor Times Khatri Rao Product

$N = Y_{(o)}\,(V_0 \odot \cdots \odot V_{o-1} \odot V_{o+1} \odot \cdots \odot V_{O-1})$

Naïve algorithm:
1. Permute $\mathcal{Y}$ to $Y_{(o)}$
2. Form $K = (V_0 \odot \cdots \odot V_{o-1} \odot V_{o+1} \odot \cdots \odot V_{O-1})$
3. Call DGEMM

1-Step and 2-Step MTTKRP:
1. Avoid permuting $\mathcal{Y}$
2. Efficiently form the KRP
   § 1-Step: $(V_{O-1} \odot \cdots \odot V_{o+1} \odot V_{o-1} \odot \cdots \odot V_0)$
   § 2-Step: $L_M = (V_0 \odot \cdots \odot V_{o-1})$, $L_S = (V_{o+1} \odot \cdots \odot V_{O-1})$
3. Utilize BLAS
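The naïve algorithm's three steps can be sketched directly in NumPy (helper names and the unfolding/KRP orderings are illustrative choices, kept mutually consistent):

```python
import numpy as np
from functools import reduce

def khatri_rao(B, C):
    # Row s_B * J_C + s_C of the result is B[s_B, :] * C[s_C, :]
    JB, D = B.shape
    return (B[:, None, :] * C[None, :, :]).reshape(JB * C.shape[0], D)

def unfold(Y, o):
    # Mode-o matricization, remaining modes in original order
    return np.moveaxis(Y, o, 0).reshape(Y.shape[o], -1)

def mttkrp_naive(Y, factors, o):
    # 1. permute Y into its mode-o matricization
    # 2. form the KRP of every factor except the o-th
    # 3. one large matrix multiply (DGEMM in BLAS terms)
    K = reduce(khatri_rao, [V for i, V in enumerate(factors) if i != o])
    return unfold(Y, o) @ K
```

The cost is dominated by materializing $K$, which is $J_{\neq o} \times D$; that is exactly what the 1-step and 2-step variants avoid.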

SLIDE 11

Computing the KRP

Consider $L = B \odot C \odot D$

  • $L(k,:) = B(b,:) * C(c,:) * D(d,:)$

Reuse: a partial row product such as $B(0,:) * C(0,:)$ is formed once and then combined with every row of $D$.

SLIDE 12

Timings for KRPs of naΓ―ve and reuse algorithms.

SLIDE 13

1-Step MTTKRP

Avoid permuting tensor entries. Cast the computation as matmul. Key observation: the mode-o matricization of a tensor can be obtained by chunking the tensor into contiguous submatrices of equal size.

[Figure: the mode-o matricization assembled from contiguous equal-size blocks of the tensor.]
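The blocking observation can be illustrated for a 3-way tensor and its middle mode (a hedged sketch; the function name and the restriction to order 3 are mine, not the talk's general algorithm):

```python
import numpy as np

def mttkrp_1step_mode1(Y, V0, V2):
    # Mode-1 MTTKRP of a C-contiguous 3-way tensor without permuting:
    # each Y[j0] is already a contiguous J1 x J2 submatrix, so the work
    # is a series of GEMMs on in-place blocks, scaled by rows of V0 and
    # reduced into N.
    J0, J1, J2 = Y.shape
    N = np.zeros((J1, V0.shape[1]))
    for j0 in range(J0):
        N += (Y[j0] @ V2) * V0[j0]  # block GEMM + scale + reduce
    return N
```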

SLIDE 14

Parallel 1-Step MTTKRP

Form $L_M$ → Form $L_S(k,:)$ → Form $L(k,:)$ → MatMul → Reduce

SLIDE 15

2-Step MTTKRP

  • First compute a partial MTTKRP
  • 1. Compute $L_M$ and $L_S$
  • 2. $\mathcal{L} \leftarrow Y_{(0:o-1)}^T \cdot L_M$
  • $\mathcal{L}$ is $J_o \times \cdots \times J_{O-1} \times D$
  • Second compute a series of ___?___ operations.
  • a. Tensor Times Vector (TTVs)
  • b. Tensor Times Matrix (TTMs)
  • c. Quasi-Tensor Times Matrix (q-TTMs)
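Assuming $0 < o < O-1$ (so both $L_M$ and $L_S$ are nonempty), the two steps can be sketched in NumPy; the helper names are illustrative, and the series of TTVs is collapsed into a single einsum for brevity:

```python
import numpy as np
from functools import reduce

def khatri_rao(B, C):
    # Row s_B * J_C + s_C of the result is B[s_B, :] * C[s_C, :]
    JB, R = B.shape
    return (B[:, None, :] * C[None, :, :]).reshape(JB * C.shape[0], R)

def mttkrp_2step(Y, factors, o):
    # Step 1 (partial MTTKRP): contract the leading modes with L_M,
    # using the matricization Y_(0:o-1) -- a reshape, no data movement.
    # Step 2: contract the trailing modes of the intermediate tensor
    # with L_S, the "series of TTVs" from the slides.
    LM = reduce(khatri_rao, factors[:o])       # (J_0 ... J_{o-1}) x D
    LS = reduce(khatri_rao, factors[o + 1:])   # (J_{o+1} ... J_{O-1}) x D
    left = int(np.prod(Y.shape[:o]))
    right = int(np.prod(Y.shape[o + 1:]))
    Ymat = Y.reshape(left, Y.shape[o] * right)        # Y_(0:o-1)
    L = (Ymat.T @ LM).reshape(Y.shape[o], right, -1)  # intermediate tensor
    return np.einsum('jrd,rd->jd', L, LS)
```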
SLIDE 16

2-Step MTTKRP: β„’

  • First Compute a Partial MTTKRP


SLIDE 17

2-Step MTTKRP: β„’

  • Second Compute a series of TTVs


SLIDE 18

Parallel 2-Step MTTKRP

Call Parallel BLAS WOW!!!

SLIDE 19

60Γ—60Γ—60Γ—60Γ—60

SLIDE 20
SLIDE 21

Per-iteration time of a CP decomposition via ALS. MATLAB used the Tensor Toolbox cp_als function, version 2.6 [1].

SLIDE 22

Findings

Two interesting networks:

  • Positive affect
  • Negative affect

Tobia M., Hayashi K., Ballard G., Gotlib I. Dynamic Functional Connectivity and Individual Differences in Emotions During Social Stress. To appear in Human Brain Mapping.

SLIDE 23

References

1. Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (September 2009), 455–500. https://doi.org/10.1137/07070111X
2. Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2017. Model-Driven Sparse CP Decomposition for Higher-Order Tensors. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 1048–10 https://doi.org/10.1109/IPDPS.2017.80
3. Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS '15). IEEE Computer Society, Washington, DC, USA, 61–70. https://doi.org/10.1109/IPDPS.2015.27
4. D.C. Van Essen, K. Ugurbil, E. Auerbach, D. Barch, T.E.J. Behrens, R. Bucholz, A. Chang, L. Chen, M. Corbetta, S.W. Curtiss, S. Della Penna, D. Feinberg, M.F. Glasser, N. Harel, A.C. Heath, L. Larson-Prior, D. Marcus, G. Michalareas, S. Moeller, R. Oostenveld, S.E. Petersen, F. Prior, B.L. Schlaggar, S.M. Smith, A.Z. Snyder, J. Xu, and E. Yacoub. 2012. The Human Connectome Project: a data acquisition perspective. Neuroimage 62, 4 (2012), 2222–2231. https://doi.org/10.1016/j.neuroimage.2012.02.018
5. Anh-Huy Phan, Petr Tichavsky, and Andrzej Cichocki. 2013. Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC Tensor Factorizations. IEEE Transactions on Signal Processing 61, 19 (Oct 2013), 4834–4846. https://doi.org/10.1109/TSP.2013.2269903
SLIDE 24

End

Thanks for listening