SLIDE 1

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Brett W. Bader

Sandia National Laboratories

NSF Tensor Workshop February 20, 2009

Unusual Tensor Decompositions for Informatics Applications

SLIDE 2

Acknowledgements

Richard Harshman (Univ. Western Ontario)
Peter Chew (Sandia)
Tammy Kolda (Sandia)
Ahmed Abdelali (NMSU)

SLIDE 3

Tensor Decompositions

Tucker
PARAFAC
3-way DEDICOM
PARAFAC2
...and many more!

Each provides a different interpretation of the data

SLIDE 4

Temporal Analysis of Enron email using 3-way DEDICOM

SLIDE 5

Three-way DEDICOM

Introduced by Harshman (1978)
DEcomposition into DIrectional COMponents
Columns of A are not necessarily orthogonal
Central matrix R contains asymmetric information from X
*Unique* solution with enough slices of X with sufficient variation
i.e., no rotation of A possible
greater confidence in interpretation of results
Alternating algorithms: least-squares and approximations
Early applications:
World trade (import/export matrices)
Car switching
Variations: constrained DEDICOM

$X_k = A D_k R D_k A^T, \quad k = 1, \ldots, K$
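The three-way DEDICOM model on this slide can be sketched in NumPy as a forward reconstruction. This is only an illustration under stated assumptions: the function names `dedicom_reconstruct` and `relative_error`, the toy sizes, and the storage of each diagonal $D_k$ as one row of an array `D` are hypothetical choices, not part of the slides, and no fitting algorithm is shown.

```python
import numpy as np

def dedicom_reconstruct(A, R, D):
    """Return the K slices A D_k R D_k A^T.

    A: (n, p) loadings, R: (p, p) asymmetric core,
    D: (K, p) array whose k-th row is the diagonal of D_k.
    """
    # For diagonal D_k, D_k R D_k scales entry (p, q) of R by d_kp * d_kq.
    core = D[:, :, None] * R[None, :, :] * D[:, None, :]   # (K, p, p)
    return np.einsum("ip,kpq,jq->kij", A, core, A)          # (K, n, n)

def relative_error(X, A, R, D):
    """Relative Frobenius-norm misfit between data X and the model."""
    X_hat = dedicom_reconstruct(A, R, D)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Toy example (sizes are illustrative only).
rng = np.random.default_rng(0)
n, p, K = 6, 2, 4
A = rng.random((n, p))
R = np.array([[1.0, 0.8],
              [0.1, 0.5]])          # deliberately asymmetric
D = 0.5 + rng.random((K, p))
X = dedicom_reconstruct(A, R, D)
```

Because R is asymmetric, each reconstructed slice is asymmetric as well, which is how the model carries directional information.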

SLIDE 6

Application: Enron Email Analysis

Links consist of email communications
What can we learn about this network strictly from their communication patterns? (Social network analysis)

[Figure: example communication network with nodes Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, Ingrid]

SLIDE 7

Case Study: Enron

Enron Corp
EnronOnline
Enron Networks
Enron Power Marketing
Enron Gas Marketing
Enron Generation
Enron North America
Enron Energy Services
Enron Broadband
Enron Pipelines
Enron Transportation Services

[Figure: Email communications at Enron (1998-2002); messages per month]

Enron created energy markets
EnronOnline: e-trading business
natural gas
electric power
Investigations
FERC
energy market manipulation
involved energy traders
SEC
accounting fraud
insider trading

SLIDE 8

Temporal Social Network Analysis

[Figure: Email communications at Enron (1998-2002); messages per month]

Emails among 184 employees over 44 months

[Figure: Time series of communication graphs among employees, one per month (January, February, March, April, ...)]

(data released by U.S. Federal Energy Regulatory Commission)

Joint work with R. Harshman (UWO) and T. Kolda

DEDICOM

Adjacency array
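A minimal sketch of how such an adjacency array might be assembled from email records. The record format `(sender, recipient, month)`, the function name `build_adjacency_array`, and the use of raw message counts as entries are assumptions for illustration; the slides do not say how the entries were weighted or scaled.

```python
import numpy as np

def build_adjacency_array(records, n_people, n_months):
    """Count messages into an (n_people, n_people, n_months) array.

    records: iterable of (sender, recipient, month) integer triples.
    Entry [i, j, k] is the number of emails from i to j in month k.
    """
    X = np.zeros((n_people, n_people, n_months))
    for sender, recipient, month in records:
        X[sender, recipient, month] += 1
    return X

# Toy data: 3 people, 2 months (the talk uses 184 employees, 44 months).
records = [(0, 1, 0), (0, 1, 0), (1, 0, 0), (2, 0, 1)]
X = build_adjacency_array(records, n_people=3, n_months=2)
```

Each frontal slice `X[:, :, k]` is the directed communication graph for one month, so the slices are generally asymmetric.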

SLIDE 9

Roles of Employees

Bi-plots of two roles

[Figure: bi-plot, Column 1 vs. Column 2]
J. Dasovich − Employee, Government Relationship Executive
J. Steffes − VP, Government Affairs
R. Shapiro − VP, Regulatory Affairs
S. Kean − VP, Chief of Staff
R. Sanders − VP, Enron Wholesale Services
T. Jones − Financial Trading Group, ENA Legal
S. Shackleton − ENA Legal
M. Taylor − Manager, Financial Trading Group, ENA Legal

[Figure: bi-plot, Column 3 vs. Column 4]
K. Watson − Transwestern Pipeline Company (ETS)
M. Lokay − Admin. Asst., Transwestern Pipeline Company (ETS)
L. Donoho − Employee, Transwestern Pipeline Company (ETS)
M. McConnell − Employee, Transwestern Pipeline Company (ETS)
L. Blair − Employee, Northern Natural Gas Pipeline (ETS)
L. Kitchen − President, Enron Online
J. Lavorato − CEO, Enron America

Legend: Unaffiliated, Executive, Legal (ENA), Pipeline (ETS), Energy Trader

[Diagram labels: roles, time, patterns]

Role loadings for L. Kitchen − President, Enron Online
Legal: 0.11
Executive (gov't affairs): 0.09
Executive (trade): 0.53
Pipeline: 0.00

Identify shared characteristics to label group
Soft clustering

SLIDE 10

Communication Patterns

[Diagram labels: roles, time, patterns]

Mostly communication within roles
Some asymmetric exchanges

[Figure: communication graph among the Legal, Gov't affairs, Executive, and Pipeline roles; edge weights shown: 157.8, 93.5, 13.4, 13.8, 440.2, 211.6, 286.7, 172.4]

SLIDE 11

Temporal Patterns

[Figure: Communication patterns over time; normalized scale by month for Group 1, Group 2, Group 3, Group 4]

[Diagram labels: roles, time, patterns]

Groups: Legal; Government & regulatory affairs; Trade executives; Pipeline employee

Annotations on plot:
Enron crisis breaks; investigation begins
Filed for bankruptcy

SLIDE 12

Multilingual Text Analysis using PARAFAC2

SLIDE 13

PARAFAC2

Introduced by Harshman (1972)
Less constrained than PARAFAC
Related to 3-way DEDICOM
Slices of A are constrained but not necessarily orthogonal
*Unique* solution with enough slices of X with sufficient variation
i.e., no rotation of A possible
greater confidence in interpretation of results
Alternating algorithms: least-squares and approximations
Early applications:
Sets of cross-product matrices
Chromatographic data with retention time shifts

$X_k \approx A_k C_k B^T$

SLIDE 14

Cross-language Information Retrieval (CLIR)

Web documents could be in any language

(e.g., English, French, Arabic, Spanish)

Languages on the web: English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian

Goal: Cluster documents by topic regardless of language

SLIDE 15

Bible as Parallel Corpus

Linguistic differences among translations

Translation                    Terms     Total Words
English (King James)           12,335    789,744
Spanish (Reina Valera 1909)    28,456    704,004
Russian (Synodal 1876)         47,226    560,524
Arabic (Smith Van Dyke)        55,300    440,435
French (Darby)                 20,428    812,947

Languages convey information in different numbers of words
Isolating language: One morpheme per word
e.g., "He travelled by hovercraft on the sea." Largely isolating, but travelled and hovercraft each have two morphemes per word.
Synthetic language: High morpheme-per-word ratio
e.g., Aufsichtsratsmitgliederversammlung => "On-view-council-with-limbs-gathering" meaning "meeting of members of the supervisory board".

SLIDE 16

Term-Doc Matrix

Term-by-verse matrix for all languages

[Figure: term-by-verse matrix, 163,745 terms x 31,230 Bible verses, with term blocks for English, Spanish, Russian, Arabic, French]

Look for co-occurrence of terms in the same verses and across languages to capture latent concepts

SLIDE 17

Latent Semantic Indexing

Term-by-verse matrix for all languages

[Figure: term-by-verse matrix (terms x Bible verses; English, Spanish, Russian, Arabic, French) factored as $U \Sigma V^T$]

Truncated SVD: $A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$

Project new documents of interest into subspace of $U \Sigma^{-1}$ and compute cosine similarities

[Figure: Projection of a document through the term x concept matrix $U \Sigma^{-1}$ to a document feature vector]

Document feature vector (dimensions 1-20): 0.1375, 0.1052, 0.0341, 0.0441, 0.0087, 0.0410, 0.1011, 0.0020, 0.0518, 0.0822, 0.0101, 0.1154, 0.0990, 0.0228, 0.0520, 0.1096, 0.0294, 0.0495, 0.0553, 0.1598
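The truncated SVD and projection step on this slide can be sketched as follows. This is a minimal dense NumPy illustration; the helper names `lsi_fit`, `project`, and `cosine` and the random toy matrix are assumptions, and a corpus of the size described in the talk would need a sparse SVD routine instead.

```python
import numpy as np

def lsi_fit(X, k):
    """Rank-k truncated SVD of a term-by-document matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def project(doc, Uk, sk):
    """Map a term vector into concept space: Sigma_k^{-1} U_k^T d."""
    return (Uk.T @ doc) / sk

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-by-document matrix: 8 terms, 5 documents.
rng = np.random.default_rng(0)
X = rng.random((8, 5))
Uk, sk, Vtk = lsi_fit(X, k=3)

# Feature vectors for two documents and their similarity.
f0 = project(X[:, 0], Uk, sk)
f1 = project(X[:, 1], Uk, sk)
similarity = cosine(f0, f1)
```

Projecting a column of the original matrix recovers the corresponding column of `Vtk`, which is a quick sanity check that new documents land in the same space as the training documents.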

SLIDE 18

Quran as Test Set

Quran is translated into many languages, just like the Bible
114 suras (or chapters)
More variation across translations = harder clustering task

SLIDE 19

Performance Metrics

MP5: Average multilingual precision at 5 (or n) documents
The average percentage of the top 5 documents that are translations of the query document
Calculated as an average for all languages
Essentially, MP5 measures success in multilingual clustering

[Figure: a query document compared against candidate documents in Lang 1 and Lang 2]
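One way the MP5 metric might be computed from document feature vectors is sketched below. Several details are assumptions not stated on the slide: the function name `multilingual_precision_at_n`, ranking by cosine similarity, keeping the query itself in the candidate pool (so it counts as a hit), and averaging over all queries, which matches a per-language average only when every language contributes the same number of documents.

```python
import numpy as np

def multilingual_precision_at_n(features, topic_ids, n=5):
    """Mean fraction of each query's top-n neighbours sharing its topic.

    features: (N, d) document feature vectors, all languages together.
    topic_ids: (N,) integer id of the underlying text (e.g., chapter),
               identical across translations of the same text.
    """
    topic_ids = np.asarray(topic_ids)
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = F @ F.T                              # cosine similarities
    top = np.argsort(-sims, axis=1)[:, :n]      # indices of top-n per query
    hits = topic_ids[top] == topic_ids[:, None]
    return float(hits.mean())

# Toy example: 2 topics x 2 languages, n = 2.
good = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
score_good = multilingual_precision_at_n(good, [0, 0, 1, 1], n=2)

# Same vectors, but now neighbours belong to different topics.
score_bad = multilingual_precision_at_n(good, [0, 1, 0, 1], n=2)
```

In the second call each query's nearest neighbour is a document on the other topic, which mimics the failure mode where documents cluster by language rather than by topic.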

SLIDE 20

LSA Results

Method     Average MP5
SVD/LSA    65.5%

Documents tend to cluster more by language than by topic
5 languages, 240 latent dimensions

SLIDE 21

New Approach: Multi-matrix Array

[Figure: stack of matrices X1 (English), X2 (Spanish), X3 (Russian), X4 (Arabic), X5 (French)]

Term-by-verse matrix for each language

(Chew, Bader, Kolda, Abdelali, 2007)

Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros

SLIDE 22

Tucker1

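[Diagram: Tucker decomposition shown next to Tucker1; in Tucker1 the slices X1, X2, X3 are approximated with per-slice factors U1, U2, U3, a core S, and a shared V^T]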

SLIDE 23

Tucker1 Results

Method     Average MP5
SVD/LSA    65.5%
Tucker1    71.3%

Only minor improvement because each Uk is not orthogonal
5 languages, 240 latent dimensions
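The remark that each Uk is not orthogonal can be illustrated with a small experiment. The sketch assumes that the factors come from a truncated SVD of the vertically stacked language slices, which the slides do not spell out; the toy sizes and variable names are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_slices, n_terms, n_docs, k = 3, 6, 10, 4

# Toy term-by-verse matrices, one per language (equal sizes for simplicity).
slices = [rng.random((n_terms, n_docs)) for _ in range(n_slices)]

# Truncated SVD of the stacked matrix [X1; X2; X3].
stacked = np.vstack(slices)
U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
U = U[:, :k]

# Split U back into one block per language.
blocks = np.split(U, n_slices, axis=0)

gram_total = U.T @ U                    # identity: stacked U is orthonormal
grams = [B.T @ B for B in blocks]       # per-language Gram matrices
```

The stacked U has orthonormal columns, but the per-language Gram matrices only sum to the identity; none of them is the identity on its own, so the individual blocks are not orthonormal.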

SLIDE 24

PARAFAC2

$X_k \approx U_k H S_k V^T$

where each $U_k$ is orthonormal and $S_k$ is diagonal

(Harshman, 1972)
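A small sketch of the PARAFAC2 structure on this slide, building synthetic slices from factors that satisfy the stated constraints. It only constructs the model and does not fit it; the toy sizes, the use of QR to obtain orthonormal columns, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_docs = 3, 7
term_counts = [5, 8, 6]          # each language may have its own term count

H = rng.random((r, r))           # shared across slices
V = rng.random((n_docs, r))      # shared document factor

U, S, X = [], [], []
for m in term_counts:
    # Reduced QR gives an m x r matrix with orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    Sk = np.diag(0.5 + rng.random(r))        # diagonal weights for this slice
    U.append(Q)
    S.append(Sk)
    X.append(Q @ H @ Sk @ V.T)               # X_k = U_k H S_k V^T

# The cross-product (U_k H)^T (U_k H) equals H^T H for every slice.
cross_products = [(Uk @ H).T @ (Uk @ H) for Uk in U]
```

The invariance of the cross-product across slices is what distinguishes PARAFAC2 from fitting an unrelated factor matrix to every slice, while still letting the row dimension differ from slice to slice.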

SLIDE 25

PARAFAC2 Results

Modest improvement over LSA

Method      Average MP5
SVD/LSA     65.5%
Tucker1     71.3%
PARAFAC2    78.5%

5 languages, 240 latent dimensions
Why PARAFAC2?

SLIDE 26

Tensor Methods and Modeling: Why the Proliferation?

N-way interactions in real world applications
Next frontier after matrix linear algebra
Lots of low-hanging fruit
New mathematical and computational challenges
Differences from matrix problems (e.g., rank of 2x2x2)
Original algorithms developed in different research communities

SLIDE 27

Thoughts on Future Directions of Tensor-based Computation and Modeling

Need scalable algorithms
Fast, efficient for large-scale problems
Handle constraints
non-negativity
sparsity
orthogonality
etc.
Parallel algorithms
Match models to applications
Requires creativity by domain experts and tensor researchers
Sometimes not a straightforward extension from matrix approaches
Danger of reinventing what's already in the literature
psychometrics

SLIDE 28

Questions?

http://www.sandia.gov/~bwbader/ bwbader@sandia.gov