Unusual Tensor Decompositions for Informatics Applications, Brett W. Bader

SLIDE 1

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energyʼs National Nuclear Security Administration under contract DE-AC04-94AL85000.

Brett W. Bader

Sandia National Laboratories

NSF Tensor Workshop February 20, 2009

Unusual Tensor Decompositions for Informatics Applications

SLIDE 2

Acknowledgements

  • Richard Harshman (Univ. Western Ontario)
  • Peter Chew (Sandia)
  • Tammy Kolda (Sandia)
  • Ahmed Abdelali (NMSU)
SLIDE 3

Tensor Decompositions

  • Tucker
  • PARAFAC
  • 3-way DEDICOM
  • PARAFAC2
  • ...and many more!

Each provides a different interpretation of the data

SLIDE 4

Temporal Analysis of Enron email using 3-way DEDICOM

SLIDE 5

Three-way DEDICOM

  • Introduced by Harshman (1978)
  • DEcomposition into DIrectional COMponents
  • Columns of A are not necessarily orthogonal
  • Central matrix R contains asymmetric information from X
  • *Unique* solution with enough slices of X with sufficient variation
    • i.e., no rotation of A possible
    • greater confidence in interpretation of results
  • Alternating algorithms: least-squares and approximations
  • Early applications:
    • World trade (import/export matrices)
    • Car switching
  • Variations: constrained DEDICOM

Two-way model: $X = A R A^T$

Three-way model: $X_k = A D_k R D_k A^T$, $k = 1, \dots, K$
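As a rough illustration of the three-way model, the NumPy sketch below builds the K slices from given factors. It assumes each D_k is diagonal (stored here as one row of a K x p array), which the slide does not state explicitly; the function name and the sizes are hypothetical, and this is only the forward model, not a fitting algorithm.

```python
import numpy as np

def dedicom_slices(A, R, D):
    """Return the K slices A @ diag(D[k]) @ R @ diag(D[k]) @ A.T, shape (K, n, n).

    A: (n, p) loadings, R: (p, p) asymmetric core, D: (K, p) diagonals of D_k
    (assumption: every D_k is a diagonal matrix).
    """
    # Scale the columns of A by each diagonal: AD[k] = A @ diag(D[k])
    AD = A[None, :, :] * D[:, None, :]        # shape (K, n, p)
    return AD @ R @ AD.transpose(0, 2, 1)     # shape (K, n, n)

rng = np.random.default_rng(0)
n, p, K = 6, 2, 4      # hypothetical sizes: objects, components, slices
A = rng.random((n, p))
R = rng.random((p, p))  # not symmetric in general
D = rng.random((K, p))
X = dedicom_slices(A, R, D)
```

Because R need not be symmetric, each slice can differ from its transpose, which is how the model carries directional (sender versus receiver) information.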

SLIDE 6

Application: Enron Email Analysis

  • Links consist of email communications
  • What can we learn about this network strictly from their communication patterns? (Social network analysis)

Figure: example communication network among David, Ellen, Bob, Frank, Alice, Carl, Ingrid, Henk, Gary

SLIDE 7

Enron Corp: EnronOnline, Enron Networks, Enron Power Marketing, Enron Gas Marketing, Enron Generation, Enron North America, Enron Energy Services, Enron Broadband, Enron Pipelines, Enron Transportation Services

Case Study: Enron

Figure: Email communications at Enron (1998-2002), messages per month

  • Enron created energy markets
  • EnronOnline: e-trading business
  • natural gas
  • electric power
  • Investigations
    • FERC
      • energy market manipulation
      • involved energy traders
    • SEC
      • accounting fraud
      • insider trading
SLIDE 8

Temporal Social Network Analysis

Figure: Email communications at Enron (1998-2002), messages per month

Emails among 184 employees over 44 months

Time series of communication graphs among employees, one graph per month (January, February, March, April, ...)

(data released by U.S. Federal Energy Regulatory Commission)

Joint work with R. Harshman (UWO) and T. Kolda

Adjacency array → DEDICOM
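One way to picture the adjacency array is as a sender x recipient x month array of message counts. The sketch below builds such an array from toy records; the record format, the indices, and the use of raw counts are assumptions for illustration (the slides do not say how entries were weighted), and the real data would be 184 x 184 x 44.

```python
import numpy as np

# Hypothetical records: (sender index, recipient index, month index)
emails = [(0, 1, 0), (0, 1, 0), (1, 2, 0), (2, 0, 1), (0, 2, 1)]
n_people, n_months = 3, 2

X = np.zeros((n_people, n_people, n_months))
for sender, recipient, month in emails:
    # Count messages from sender to recipient in that month
    X[sender, recipient, month] += 1

# Collapsing the people modes gives the messages-per-month curve
messages_per_month = X.sum(axis=(0, 1))
```

Each frontal slice X[:, :, k] is the (generally asymmetric) adjacency matrix of one month's communication graph.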

SLIDE 9

Roles of Employees

Bi-plots of two roles

Figure: bi-plot of Column 1 against Column 2. Labeled employees:

  • J. Dasovich − Employee, Government Relationship Executive
  • J. Steffes − VP, Government Affairs
  • R. Shapiro − VP, Regulatory Affairs
  • S. Kean − VP, Chief of Staff
  • R. Sanders − VP, Enron Wholesale Services
  • T. Jones − Financial Trading Group, ENA Legal
  • S. Shackleton − ENA Legal
  • M. Taylor − Manager, Financial Trading Group, ENA Legal

Figure: bi-plot of Column 3 against Column 4. Labeled employees:

  • K. Watson − Transwestern Pipeline Company (ETS)
  • M. Lokay − Admin. Asst., Transwestern Pipeline Company (ETS)
  • L. Donoho − Employee, Transwestern Pipeline Company (ETS)
  • M. McConnell − Employee, Transwestern Pipeline Company (ETS)
  • L. Blair − Employee, Northern Natural Gas Pipeline (ETS)
  • L. Kitchen − President, Enron Online
  • J. Lavorato − CEO, Enron America

Legend: Unaffiliated, Executive, Legal (ENA), Pipeline (ETS), Energy Trader

Example role loadings for L. Kitchen (President, Enron Online):

Legal    Executive (gov't affairs)    Executive (trade)    Pipeline
0.11     −0.09                        0.53                 0.00

Identify shared characteristics to label group

Soft clustering

SLIDE 10

Communication Patterns

  • Mostly communication within roles
  • Some asymmetric exchanges

Figure: directed graph among the Legal, Gov't affairs, Executive, and Pipeline roles, with edge weights 157.8, 93.5, 13.4, 13.8, 440.2, 211.6, 286.7, 172.4

SLIDE 11

Temporal Patterns

Figure: Communication patterns over time (x-axis: month; y-axis: normalized scale; one curve per group, Group 1 to Group 4)

Groups: Legal, Government & regulatory affairs, Trade executives, Pipeline employees

Annotated events: Enron crisis breaks and investigation begins; filed for bankruptcy

SLIDE 12

Multilingual Text Analysis using PARAFAC2

SLIDE 13

PARAFAC2

  • Introduced by Harshman (1972)
  • Less constrained than PARAFAC
  • Related to 3-way DEDICOM
  • Slices of A are constrained but not necessarily orthogonal
  • *Unique* solution with enough slices of X with sufficient variation
    • i.e., no rotation of A possible
    • greater confidence in interpretation of results
  • Alternating algorithms: least-squares and approximations
  • Early applications:
    • Sets of cross-product matrices
    • Chromatographic data with retention time shifts

Model: $X_k \approx A_k C_k B^T$

SLIDE 14

Cross-language Information Retrieval (CLIR)

Web documents could be in any language (e.g., English, French, Arabic, Spanish)

Figure: Languages on the web (English, German, Japanese, French, Chinese Simplified, Spanish, Russian, Dutch, Korean, Polish, Portuguese, Chinese Traditional, Swedish, Czech, Norwegian, Italian, Danish, Hungarian, Finnish, Hebrew, Arabic, Turkish, Slovak, Indonesian, Bulgarian, Croatian, Catalan, Slovenian, Greek, Romanian, Serbian, Estonian, Icelandic, Lithuanian, Latvian)

Goal: Cluster documents by topic regardless of language

SLIDE 15

Bible as Parallel Corpus

Linguistic differences among translations

Translation                    Terms     Total Words
English (King James)           12,335    789,744
Spanish (Reina Valera 1909)    28,456    704,004
Russian (Synodal 1876)         47,226    560,524
Arabic (Smith Van Dyke)        55,300    440,435
French (Darby)                 20,428    812,947

  • Languages convey information in different numbers of words
  • Isolating language: One morpheme per word
    • e.g., "He travelled by hovercraft on the sea." Largely isolating, but travelled and hovercraft each have two morphemes per word.
  • Synthetic language: High morpheme-per-word ratio
    • e.g., Aufsichtsratsmitgliederversammlung => "On-view-council-with-limbs-gathering" meaning "meeting of members of the supervisory board".

SLIDE 16

Term-Doc Matrix

Term-by-verse matrix for all languages

Figure: stacked term-by-verse matrix; rows are terms grouped by language (English, Spanish, Russian, Arabic, French), columns are Bible verses; size 163,745 x 31,230

Look for co-occurrence of terms in the same verses and across languages to capture latent concepts
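A toy version of the stacked matrix may make the layout concrete. The sketch below uses a made-up two-verse, two-language corpus, whitespace tokenization, and raw counts; all of these are illustrative assumptions rather than the preprocessing used in the talk.

```python
import numpy as np

# Hypothetical parallel corpus: verse j in each language translates the same text
corpus = {
    "English": ["in the beginning", "the light was good"],
    "Spanish": ["en el principio", "la luz era buena"],
}

def term_by_verse(verses):
    """Return (sorted vocabulary, count matrix) with one row per term, one column per verse."""
    vocab = sorted({w for v in verses for w in v.split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(verses)))
    for j, verse in enumerate(verses):
        for w in verse.split():
            M[index[w], j] += 1
    return vocab, M

blocks = [term_by_verse(v)[1] for v in corpus.values()]
X = np.vstack(blocks)   # rows: terms of all languages stacked; columns: verses
```

Because the verses are aligned, terms from different languages that tend to appear in the same columns end up with similar rows, which is the co-occurrence signal a factorization can pick up.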

SLIDE 17

Latent Semantic Indexing

Term-by-verse matrix for all languages

Figure: the stacked term-by-verse matrix (terms grouped by language: English, Spanish, Russian, Arabic, French; columns: Bible verses) factored as $U \Sigma V^T$

Truncated SVD: $A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$

Project new documents of interest into the subspace of $U \Sigma^{-1}$ (term x concept) and compute cosine similarities

Projection: document feature vector

dimension 1     0.1375
dimension 2     0.1052
dimension 3     0.0341
dimension 4     0.0441
dimension 5     −0.0087
dimension 6     0.0410
dimension 7     0.1011
dimension 8     0.0020
dimension 9     0.0518
dimension 10    0.0822
dimension 11    −0.0101
dimension 12    −0.1154
dimension 13    −0.0990
dimension 14    0.0228
dimension 15    −0.0520
dimension 16    0.1096
dimension 17    0.0294
dimension 18    0.0495
dimension 19    0.0553
dimension 20    0.1598
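The truncated SVD and the projection step can be sketched in a few lines of NumPy. The sketch reads the projection as multiplying a document's term vector by $\Sigma_k^{-1} U_k^T$, which is one common LSI convention and an assumption here; the function names and the random toy matrix are hypothetical.

```python
import numpy as np

def truncated_svd(X, k):
    """Return U_k, the top-k singular values, and V_k for X ≈ U_k diag(s_k) V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T

def project(doc, U_k, s_k):
    """Map a term-count vector to concept space: diag(1/s_k) @ U_k.T @ doc."""
    return (U_k.T @ doc) / s_k

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
X = rng.random((8, 5))   # hypothetical: 8 terms x 5 training documents
U_k, s_k, V_k = truncated_svd(X, k=3)
feature = project(X[:, 0], U_k, s_k)
```

With this convention, projecting a training document reproduces its row of V_k, so new and training documents live in the same k-dimensional space and can be compared by cosine similarity.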

SLIDE 18

Quran as Test Set

  • Quran is translated into many languages, just like the Bible
  • 114 suras (or chapters)
  • More variation across translations = harder clustering task
SLIDE 19

Performance Metrics

  • MP5: Average multilingual precision at 5 (or n) documents
    • The average percentage of the top 5 documents that are translations of the query document
    • Calculated as an average for all languages
  • Essentially, MP5 measures success in multilingual clustering

Figure: a query document and its nearest neighbors in Lang 1 and Lang 2
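A minimal sketch of how such a metric could be computed from document feature vectors follows. It assumes cosine similarity for ranking and that the query itself stays in the ranking and counts as a hit; neither detail is given on the slide, and `mp_at_n` is a hypothetical name.

```python
import numpy as np

def mp_at_n(features, topic_ids, n=5):
    """Average fraction of each document's n nearest neighbors that share its topic id.

    Assumptions: neighbors are ranked by cosine similarity, and the query
    document itself is kept in the ranking and counted as a hit.
    """
    features = np.asarray(features, dtype=float)
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = F @ F.T
    topic_ids = np.asarray(topic_ids)
    scores = []
    for q in range(len(F)):
        top = np.argsort(-sims[q], kind="stable")[:n]
        scores.append(np.mean(topic_ids[top] == topic_ids[q]))
    return float(np.mean(scores))

# Toy data: four documents whose features pair up as (0, 1) and (2, 3)
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
by_topic = mp_at_n(feats, topic_ids=[0, 0, 1, 1], n=2)     # pairs are translations
by_language = mp_at_n(feats, topic_ids=[0, 1, 0, 1], n=2)  # pairs share only a language
```

In the second call the nearest neighbor of every document is on a different topic, so only the query itself counts and the score drops to one half.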

SLIDE 20

LSA Results

Method     Average MP5
SVD/LSA    65.5%

Documents tend to cluster more by language than by topic

5 languages, 240 latent dimensions

SLIDE 21

New Approach: Multi-matrix Array

Term-by-verse matrix for each language: X1 (English), X2 (Spanish), X3 (Russian), X4 (Arabic), X5 (French)

(Chew, Bader, Kolda, Abdelali, 2007)

Array size: 55,300 x 31,230 x 5 with 2,765,719 nonzeros
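The quoted sizes can be checked against the per-language term counts from the Bible table. The arithmetic below suggests (as an inference, not something the slides state) that the first array dimension is the largest vocabulary, Arabic's 55,300 terms, while the single stacked LSA matrix has one row per term across all languages.

```python
# Term counts per translation, copied from the "Bible as Parallel Corpus" table
terms = {"English": 12_335, "Spanish": 28_456, "Russian": 47_226,
         "Arabic": 55_300, "French": 20_428}
n_verses, nnz = 31_230, 2_765_719

stacked_rows = sum(terms.values())   # rows of the single stacked matrix
array_rows = max(terms.values())     # first dimension of the multi-matrix array
density = nnz / (array_rows * n_verses * len(terms))
```

Here `stacked_rows` is 163,745 and `array_rows` is 55,300, matching the sizes on the slides, and the density works out to roughly 0.03%, so sparse storage is the natural choice.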

SLIDE 22

Tucker1

Figure: Tucker and Tucker1 models; in Tucker1 each slice X1, X2, X3 has its own factor U1, U2, U3, with S1 and V^T shown for the first slice

SLIDE 23

Tucker1 Results

Method     Average MP5
SVD/LSA    65.5%
Tucker1    71.3%

Only minor improvement because each $U_k$ is not orthogonal

5 languages, 240 latent dimensions

SLIDE 24

PARAFAC2

$X_k \approx U_k H S_k V^T$, where each $U_k$ is orthonormal and $S_k$ is diagonal

(Harshman, 1972)
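To make the PARAFAC2 model concrete, the sketch below builds slices with different numbers of rows from a shared H and V. It is the forward model only, not a fitting algorithm; the sizes, the random factors, and the function name are hypothetical, and orthonormal columns for each U_k are obtained from a QR factorization purely for convenience.

```python
import numpy as np

def parafac2_slice(U_k, H, s_k, V):
    """Return U_k @ H @ diag(s_k) @ V.T for one slice."""
    return U_k @ H @ np.diag(s_k) @ V.T

rng = np.random.default_rng(0)
R, n_docs = 3, 7            # hypothetical rank and number of documents
row_counts = [5, 9, 6]      # hypothetical vocabulary size per language
H = rng.random((R, R))
V = rng.random((n_docs, R))  # shared across all slices

factors, slices = [], []
for m in row_counts:
    U_k, _ = np.linalg.qr(rng.random((m, R)))  # (m, R) with orthonormal columns
    s_k = rng.random(R)                         # diagonal of S_k
    factors.append(U_k)
    slices.append(parafac2_slice(U_k, H, s_k, V))

# (U_k H)^T (U_k H) = H^T H for every k: the cross-product constraint
cross_products = [(U @ H).T @ (U @ H) for U in factors]
```

The slices may have different row counts (one vocabulary per language) while sharing V, and the constant cross-product of U_k H is what ties the per-language factors together.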

SLIDE 25

PARAFAC2 Results

Modest improvement over LSA

Method      Average MP5
SVD/LSA     65.5%
Tucker1     71.3%
PARAFAC2    78.5%

5 languages, 240 latent dimensions

Why PARAFAC2?

SLIDE 26

Tensor Methods and Modeling: Why the Proliferation?

  • N-way interactions in real world applications
  • Next frontier after matrix linear algebra
  • Lots of low hanging fruit
  • New mathematical and computational challenges
  • Differences from matrix problems (e.g., rank of 2x2x2)
  • Original algorithms developed in different research communities
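The "rank of 2x2x2" remark can be illustrated with a small numerical check. The tensor below is a commonly used example chosen for this sketch, not one taken from the slides. If a real rank-2 decomposition X1 = A diag(c1) B^T, X2 = A diag(c2) B^T existed with X1 invertible, then X2 X1^{-1} = A diag(c2/c1) A^{-1} would have real eigenvalues.

```python
import numpy as np

# Frontal slices of a 2x2x2 tensor (illustrative example, not from the slides)
X1 = np.array([[1.0, 0.0], [0.0, 1.0]])
X2 = np.array([[0.0, 1.0], [-1.0, 0.0]])

# Necessary condition for a real rank-2 decomposition when X1 is invertible:
# X2 @ inv(X1) must have real eigenvalues.
eigs = np.linalg.eigvals(X2 @ np.linalg.inv(X1))
has_real_rank_2 = bool(np.all(np.isreal(eigs)))
```

Here the eigenvalues are purely imaginary, so no real rank-2 decomposition exists even though both slices are 2 x 2; over the complex numbers the same tensor does have rank 2. Nothing like this happens for matrices, whose rank does not depend on the field in this way.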

SLIDE 27

Thoughts on Future Directions of Tensor-based Computation and Modeling

  • Need scalable algorithms
    • Fast, efficient for large-scale problems
    • Handle constraints
      • non-negativity
      • sparsity
      • orthogonality
      • etc.
    • Parallel algorithms
  • Match models to applications
    • Requires creativity by domain experts and tensor researchers
    • Sometimes not a straightforward extension from matrix approaches
  • Danger of reinventing what's already in the literature
    • psychometrics
SLIDE 28

Questions?

http://www.sandia.gov/~bwbader/
bwbader@sandia.gov