
SLIDE 1

Automating the Detection of Anomalies and Trends from Text

NGDM’07 Workshop, Baltimore, MD

Michael W. Berry

Department of Electrical Engineering & Computer Science University of Tennessee

October 11, 2007

SLIDE 2

1 Nonnegative Matrix Factorization (NNMF)

Motivation
Underlying Optimization Problem
MM Method (Lee and Seung)
Smoothing and Sparsity Constraints
Hybrid NNMF Approach

2 Anomaly Detection in ASRS Collection

Document Parsing and Term Weighting
Preliminary Training
SDM07 Contest Performance

3 NNTF Classification of Enron Email

Corpus and Historical Events
Discussion Tracking via PARAFAC/Tensor Factorization
Multidimensional Data Analysis
PARAFAC Model

4 References

SLIDE 3

NNMF Origins

NNMF (Nonnegative Matrix Factorization) can be used to approximate high-dimensional data having nonnegative components. Lee and Seung (1999) demonstrated its use as a sum-by-parts representation of image data in order to both identify and classify image features. Xu et al. (2003) demonstrated how NNMF-based indexing could outperform SVD-based Latent Semantic Indexing (LSI) for some information retrieval tasks.

SLIDE 4

NNMF for Image Processing

[Figure: sparse NNMF basis images (A ≈ WH) versus dense SVD basis images (A ≈ UΣVᵀ); Lee and Seung (1999)]

SLIDE 5

NNMF Analogue for Text Mining (Medlars)

[Figure: highest weighted terms in NNMF basis vectors for the Medlars collection, plotted by term weight]

W*1: ventricular, aortic, septal, left, defect, regurgitation, ventricle, valve, cardiac, pressure
W*2: oxygen, flow, pressure, blood, cerebral, hypothermia, fluid, venous, arterial, perfusion
W*5: children, child, autistic, speech, group, early, visual, anxiety, emotional, autism
W*6: kidney, marrow, dna, cells, nephrectomy, unilateral, lymphocytes, bone, thymidine, rats

Interpretable NNMF feature vectors; Langville et al. (2006)

SLIDE 6

Derivation

Given an m × n term-by-document (sparse) matrix X, compute two reduced-dimension matrices W and H so that X ≃ WH, where W is m × r and H is r × n, with r ≪ n.

Optimization problem:

min_{W,H} ‖X − WH‖²_F, subject to W_ij ≥ 0 and H_ij ≥ 0, ∀ i, j.

General approach: construct initial estimates for W and H and then improve them via alternating iterations.
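As a quick illustration, the objective above can be monitored in MATLAB with one line; the relative residual shown here is a common convergence measure, but it is an assumption of this sketch, not something specified on the slide:

    obj = norm(X - W*H, 'fro')^2;                     % objective value ||X - WH||_F^2
    rel_res = norm(X - W*H, 'fro') / norm(X, 'fro');  % relative residual in [0,1]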

SLIDE 7

Minimization Challenges and Formulations [Berry et al., 2007]

Local Minima: non-convexity of the functional f(W, H) = ½‖X − WH‖²_F in both W and H.

Non-unique Solutions: WDD⁻¹H is nonnegative for any nonnegative (and invertible) D.

Many NNMF Formulations:
Lee and Seung (2001) – information-theoretic formulation based on the Kullback–Leibler divergence of X from WH.
Guillamet, Bressan, and Vitrià (2001) – diagonal weight matrix Q used (XQ ≈ WHQ) to compensate for feature redundancy (columns of W).
Wang, Jia, Hu, and Turk (2004) – constraint-based formulation using Fisher linear discriminant analysis to improve extraction of spatially localized features.
Other cost function formulations – Hamza and Brady (2006), Dhillon and Sra (2005), Cichocki, Zdunek, and Amari (2006).

SLIDE 8

Multiplicative Method (MM)

Multiplicative update rules for W and H (Lee and Seung, 1999):

1. Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2. Iterate for each c, j, and i until convergence or after k iterations:
   1. H_cj ← H_cj (WᵀX)_cj / ((WᵀWH)_cj + ε)
   2. W_ic ← W_ic (XHᵀ)_ic / ((WHHᵀ)_ic + ε)
   3. Scale the columns of W to unit norm.

Setting ε = 10⁻⁹ will suffice to avoid division by zero [Shahnaz et al., 2006].

SLIDE 9

Multiplicative Method (MM) contd.

Multiplicative Update MATLAB® Code for NNMF:

    W = rand(m,k);     % W initially random
    H = rand(k,n);     % H initially random
    epsilon = 1e-9;    % guard against division by zero
    for i = 1:maxiter
        H = H .* (W'*A) ./ (W'*W*H + epsilon);
        W = W .* (A*H') ./ (W*H*H' + epsilon);
    end

SLIDE 10

Lee and Seung MM Convergence

Convergence: when the MM algorithm converges to a limit point in the interior of the feasible region, the point is a stationary point. The stationary point may or may not be a local minimum. If the limit point lies on the boundary of the feasible region, one cannot determine its stationarity [Berry et al., 2007].

Several modifications have been proposed:
Gonzalez and Zhang (2005) – accelerated convergence somewhat, but the stationarity issue remains.
Lin (2005) – modified the algorithm to guarantee convergence to a stationary point.
Dhillon and Sra (2005) – derived update rules that incorporate weights for the importance of certain features of the approximation.

SLIDE 11

Hoyer’s Method

From neural network applications, Hoyer (2002) enforced statistical sparsity for the weight matrix H in order to enhance the parts-based data representations in the matrix W.

Mu et al. (2003) suggested a regularization approach to achieve statistical sparsity in the matrix H: point count regularization; penalize the number of nonzeros in H rather than Σ_ij H_ij.

Goal of increased sparsity (or smoothness) – better representation of parts or features spanned by the corpus (X) [Berry and Browne, 2005].

SLIDE 12

GD-CLS – Hybrid Approach

First use MM to compute an approximation to W for each iteration – a gradient descent (GD) optimization step. Then compute the weight matrix H using a constrained least squares (CLS) model to penalize non-smoothness (i.e., non-sparsity) in H – a common Tikhonov regularization technique used in image processing (Prasad et al., 2003). Convergence to a non-stationary point has been evidenced (a proof is still needed).

SLIDE 13

GD-CLS Algorithm

1. Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2. Iterate until convergence or after k iterations:
   1. W_ic ← W_ic (XHᵀ)_ic / ((WHHᵀ)_ic + ε), for each c and i
   2. Rescale the columns of W to unit norm.
   3. Solve the constrained least squares problem
      min_{H_j} { ‖X_j − WH_j‖²₂ + λ‖H_j‖²₂ },
      where the subscript j denotes the jth column, for j = 1, …, n.

Any negative values in H_j are set to zero. The parameter λ is a regularization value used to balance the reduction of the metric ‖X_j − WH_j‖²₂ with the enforcement of smoothness and sparsity in H.
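A minimal MATLAB sketch of this GD-CLS loop, assuming X (m × n), rank r, a regularization value lambda, and maxiter are given; solving the normal equations (WᵀW + λI)H = WᵀX handles all columns H_j at once:

    % GD-CLS sketch: MM step for W, then a constrained least squares solve for H
    W = rand(m,r);  H = rand(r,n);  epsilon = 1e-9;
    for it = 1:maxiter
        W = W .* (X*H') ./ (W*H*H' + epsilon);     % gradient descent (MM) step for W
        W = W ./ max(sqrt(sum(W.^2,1)), epsilon);  % rescale columns of W to unit norm
        H = (W'*W + lambda*eye(r)) \ (W'*X);       % CLS solve (Tikhonov-regularized)
        H = max(H, 0);                             % set negative values in H to zero
    end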

SLIDE 14

Two Penalty Term Formulation

Introduce smoothing on W_k (feature vectors) in addition to H_k:

min_{W,H} { ‖X − WH‖²_F + α‖W‖²_F + β‖H‖²_F },

where ‖·‖_F is the Frobenius norm.

Constrained NNMF (CNMF) iteration:

H_cj ← H_cj ((WᵀX)_cj − βH_cj) / ((WᵀWH)_cj + ε)
W_ic ← W_ic ((XHᵀ)_ic − αW_ic) / ((WHHᵀ)_ic + ε)
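In matrix form, the CNMF iteration above translates into a short MATLAB sketch; the max(·, 0) guard, which keeps the penalized numerators nonnegative when the penalty terms dominate, is an assumption added here and is not specified on the slide:

    % CNMF sketch: multiplicative updates with Frobenius-norm penalties
    W = rand(m,r);  H = rand(r,n);  epsilon = 1e-9;
    for it = 1:maxiter
        H = H .* max(W'*X - beta*H, 0)  ./ (W'*W*H + epsilon);  % penalized H update
        W = W .* max(X*H' - alpha*W, 0) ./ (W*H*H' + epsilon);  % penalized W update
    end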

SLIDE 15

Improving Feature Interpretability

Gauging Parameters for Constrained Optimization

How sparse (or smooth) should the factors (W, H) be to produce as many interpretable features as possible?
To what extent do different norms (ℓ₁, ℓ₂, ℓ∞) improve or degrade feature quality or span? At what cost?
Can a nonnegative feature space be built from objects in both images and text? Are there opportunities for multimodal document similarity?

SLIDE 16

Anomaly Detection (ASRS)

Classify events described by documents from the Aviation Safety Reporting System (ASRS) into 22 anomaly categories; contest from the SDM07 Text Mining Workshop.

General Text Parsing (GTP) Software Environment in C++ [Giles et al., 2003] used to parse both the ASRS training set and a combined ASRS training and test set:

Dataset         Terms     ASRS Documents
Training        15,722    21,519
Training+Test   17,994    28,596 (7,077 test)

Global and document frequency required to be at least 2; stoplist of 493 common words used; character length of any term ∈ [2, 200].

Download information:
GTP: http://www.cs.utk.edu/~lsi
ASRS: http://www.cs.utk.edu/tmw07
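As an illustration of the parsing/weighting step, a hedged MATLAB sketch of log-entropy term weighting, one common scheme for term-by-document matrices (the specific scheme GTP applied here is not stated on the slide, so this choice is an assumption):

    % Log-entropy term weighting for a term-by-document count matrix F (m x n)
    [m, n] = size(F);
    P = F ./ max(sum(F,2), 1);           % distribution of each term over documents
    logP = zeros(size(P));
    logP(P > 0) = log(P(P > 0));         % avoid log(0) for absent term-doc pairs
    g = 1 + sum(P .* logP, 2) / log(n);  % global entropy weight per term, in [0,1]
    X = log(1 + F) .* g;                 % local log weight scaled by global weight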

SLIDE 17

Initialization Schematic

[Schematic: initialization built from the matrix H (features × documents, columns 1…n) and a filtered H, mapped against Rᵀ (k features); axis labels: Features, Documents.]

SLIDE 18

Anomaly to Feature Mapping and Scoring Schematic

[Schematic: anomaly-to-feature mapping and scoring; anomalies (1…22) are extracted per feature (k features) from R and H and scored against the n documents in R.]
SLIDE 19

Training/Testing Performance (ROC Curves)

Best/worst ROC curves (false positive rate versus true positive rate):

Anomaly  Type (Description)              ROC Area (Training)  ROC Area (Contest)
22       Security Concern/Threat         .9040                .8925
5        Incursion (collision hazard)    .8977                .8716
4        Excursion (loss of control)     .8296                .7159
21       Illness/Injury Event            .8201                .8172
12       Traffic Proximity Event         .7954                .7751
7        Altitude Deviation              .7931                .8085
18       Aircraft Damage/Encounter       .7250                .7261
11       Terrain Proximity Event         .7234                .7575
9        Speed Deviation                 .7060                .6893
10       Uncommanded (loss of control)   .6784                .6504
13       Weather Issue                   .6287                .6018
2        Noncompliance (policy/proc.)    .6009                .5551
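For reference, an ROC area like those in the table can be computed from anomaly scores and binary labels with a few MATLAB lines; the variable names here are hypothetical, ties between scores are ignored, and this is not the contest's evaluation script:

    % ROC area (AUC) from real-valued scores and 0/1 labels
    [~, order] = sort(scores, 'descend');
    y = labels(order) > 0;                  % labels sorted by decreasing score
    tpr = cumsum(y)  / max(sum(y), 1);      % true positive rate per threshold
    fpr = cumsum(~y) / max(sum(~y), 1);     % false positive rate per threshold
    auc = trapz([0; fpr(:)], [0; tpr(:)]);  % trapezoidal area under the curve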

SLIDE 20

Anomaly Summarization Prototype - Sentence Ranking

Sentence rank = f(global term weights) – B. Lamb

SLIDE 21

Improving Summarization and Steering

What versus why:

Extraction of textual concepts still requires human interpretation (in the absence of ontologies or domain-specific classifications).
How can previous knowledge or experience be captured for feature matching (or pruning)?
To what extent can feature vectors be annotated for future use or as the text collection is updated?
What is the cost for updating the NNMF (or similar) model?

SLIDE 22

Unresolved Modeling Issues

Parameters and dimensionality:

Further work is needed in determining the effects of alternative term-weighting schemes (for X) and choices of control parameters (e.g., α, β for CNMF).
How does document (or object) clustering change with different ranks (or features)?
How should feature vectors from competing models (Bayesian, neural nets, etc.) be compared in both interpretability and computational cost?

SLIDE 23

Email Collection

A by-product of the FERC investigation of Enron (the collection originally contained 15 million email messages). This study used the improved corpus known as the Enron Email set, edited by Dr. William Cohen at CMU; this set has over 500,000 email messages, the majority sent in the 1999–2001 timeframe.

SLIDE 24

Enron Historical 1999-2001

Ongoing, problematic development of the Dabhol Power Company (DPC) in the Indian state of Maharashtra.
Deregulation of the California energy industry, which led to rolling electricity blackouts in the summer of 2000 (and subsequent investigations).
Revelation of Enron's deceptive business and accounting practices that led to an abrupt collapse of the energy colossus in October 2001; Enron filed for bankruptcy in December 2001.

SLIDE 25

Multidimensional Data Analysis via PARAFAC

The third dimension offers more explanatory power: it uncovers new latent information and reveals subtle relationships.
Build a 3-way array such that there is a term-author matrix for each month.

[Figure: email graph → term-author matrix → term-author-month array, decomposed as a sum of rank-one factors via PARAFAC multilinear algebra (and nonnegative PARAFAC).]

SLIDE 26

Temporal Assessment via PARAFAC

[Figure: term-author-month array decomposed as a sum of rank-one factors; authors shown include Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, and Ingrid.]

SLIDE 27

Mathematical Notation

Kronecker product:

A ⊗ B = [ A11·B  …  A1n·B
          ⋮      ⋱  ⋮
          Am1·B  …  Amn·B ]

Khatri-Rao product (columnwise Kronecker):

A ⊙ B = [ A1 ⊗ B1  …  An ⊗ Bn ],  where Ai, Bi denote the ith columns.

Outer product:

A1 ∘ B1 = [ A11·B11  …  A11·Bm1
            ⋮        ⋱  ⋮
            Am1·B11  …  Am1·Bm1 ]
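These products are easy to experiment with in MATLAB; kron is built in, while the columnwise Khatri-Rao loop below is an illustrative helper, not a named routine from the talk:

    % Kronecker, Khatri-Rao (columnwise Kronecker), and outer products
    A = [1 2; 3 4];  B = [0 1; 1 0];
    K = kron(A, B);                          % Kronecker product (4 x 4)
    KR = zeros(size(A,1)*size(B,1), size(A,2));
    for c = 1:size(A,2)
        KR(:,c) = kron(A(:,c), B(:,c));      % Khatri-Rao: kron of matching columns
    end
    OP = A(:,1) * B(:,1)';                   % outer product of the first columns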

SLIDE 28

PARAFAC Representations

PARAllel FACtors (Harshman, 1970); also known as CANDECOMP (Carroll & Chang, 1970). Typically solved by alternating least squares (ALS).

Alternative PARAFAC formulations:

Scalar form:        X_ijk ≈ Σ_{ρ=1..r} A_iρ B_jρ C_kρ,  where X is a 3-way array (tensor)
Outer product form: X ≈ Σ_{ρ=1..r} A_ρ ∘ B_ρ ∘ C_ρ
Tensor slice form:  X_k ≈ A diag(C_k:) Bᵀ,  where X_k is a tensor slice
Matrix form:        X_{I×JK} ≈ A (C ⊙ B)ᵀ,  where X is matricized

SLIDE 29

PARAFAC (Visual) Representations

[Figure: visual depictions of the scalar, outer product, tensor slice, and matrix forms of the PARAFAC model.]

SLIDE 30

Nonnegative PARAFAC Algorithm

Adapted from Mørup (2005) and based on NNMF by Lee and Seung (2001).

‖X_{I×JK} − A(C ⊙ B)ᵀ‖_F = ‖X_{J×IK} − B(C ⊙ A)ᵀ‖_F = ‖X_{K×IJ} − C(B ⊙ A)ᵀ‖_F

Minimize over A, B, C using the multiplicative update rule:

A_iρ ← A_iρ (X_{I×JK} Z)_iρ / ((A ZᵀZ)_iρ + ε),   Z = (C ⊙ B)
B_jρ ← B_jρ (X_{J×IK} Z)_jρ / ((B ZᵀZ)_jρ + ε),   Z = (C ⊙ A)
C_kρ ← C_kρ (X_{K×IJ} Z)_kρ / ((C ZᵀZ)_kρ + ε),   Z = (B ⊙ A)
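A compact MATLAB sketch of one sweep of these updates, assuming a dense I × J × K array X and a target rank r; the unfoldings and the inline Khatri-Rao helper (kr, a hypothetical name) follow the matricization conventions used above:

    % Nonnegative PARAFAC sketch: multiplicative updates for A, B, C
    [I,J,K] = size(X);  epsilon = 1e-9;
    A = rand(I,r);  B = rand(J,r);  C = rand(K,r);
    % Khatri-Rao product (columnwise Kronecker) as an inline helper
    kr = @(U,V) cell2mat(arrayfun(@(c) kron(U(:,c),V(:,c)), 1:size(U,2), ...
                         'UniformOutput', false));
    X1 = reshape(X, I, J*K);                    % mode-1 unfolding X_{I x JK}
    X2 = reshape(permute(X,[2 1 3]), J, I*K);   % mode-2 unfolding X_{J x IK}
    X3 = reshape(permute(X,[3 1 2]), K, I*J);   % mode-3 unfolding X_{K x IJ}
    for it = 1:maxiter
        Z = kr(C,B);  A = A .* (X1*Z) ./ (A*(Z'*Z) + epsilon);
        Z = kr(C,A);  B = B .* (X2*Z) ./ (B*(Z'*Z) + epsilon);
        Z = kr(B,A);  C = C .* (X3*Z) ./ (C*(Z'*Z) + epsilon);
    end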

SLIDE 31

Tensor-Generated Group Discussions

NNTF Group Discussions in 2001: 197 authors; 8 distinguishable discussions; the “Kaminski/Education” topic was previously unseen.

[Figure: normalized conversation level by month (Jan–Dec) for Groups 3, 4, 10, 12, 17, 20, 21, and 25; labeled topics include CA/Energy, India, Fastow Companies, Downfall, College Football, Kaminski/Education, and Downfall newsfeeds.]

SLIDE 32

Gantt Charts from PARAFAC Models

[Figure: Gantt charts of the 25 discussion groups by month (Jan–Dec) for the NNTF/PARAFAC and standard PARAFAC models. The number of bars encodes the conversation-level interval: [.01, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1.0) (the PARAFAC key adds [−1.0, .01)); gray denotes an unclassified topic. Labeled topics include Downfall Newsfeeds, California Energy, College Football, Fastow Companies, India, Education (Kaminski), and Downfall.]

SLIDE 33

Day-level Analysis for PARAFAC (Three Groups)

Rank-25 tensor for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25).

Groups 3, 4, 5: [figure: day-level conversation levels for these groups]

SLIDE 34

Day-level Analysis for NN-PARAFAC (Three Groups)

Rank-25 tensor (best minimizer) for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25).

Groups 1, 7, 8: [figure: day-level conversation levels for these groups]

SLIDE 35

Day-level Analysis for NN-PARAFAC (Two Groups)

Groups 20 (California Energy) and 9 (Football), from the C factor of the best minimizer, in a day-level analysis of 2001:

[Figure: conversation level by month (Jan–Dec); the peaks correspond to the weekly pro & college football betting pool at Enron and to discussions about energy projects in California. Inset: the term-author-day array decomposed as a sum of rank-one conversations (#1, #2, …).]
SLIDE 36

Four-way Tensor Results (Sept. 2007)

Apply NN-PARAFAC to a term-author-recipient-day array (39,573 × 197 × 197 × 357); construct a rank-25 tensor (best minimizer among 10 runs). Goal: track more focused discussions between individuals/small groups; for example, the betting pool (football).

[Figure: 3-way versus 4-way results by month (Jan–Dec); the weekly pro & college football betting pool at Enron appears in the 3-way results, while in the 4-way results 3 individuals from the list of 197 recipients appear during one week.]

SLIDE 37

Four-way Tensor Results (Sept. 2007)

Four-way tensor may track subconversation already found by three-way tensor; for example, RTO (Regional Transmission Organization) discussions.

[Figure: the 3-way results show a conversation about FERC and Regional Transmission Organizations (RTOs); the 4-way results isolate a subconversation between J. Steffes and 3 other VPs on the same topic.]

SLIDE 38

NNTF Optimal Rank?

No known algorithm exists for computing the rank of a k-way array for k ≥ 3 [Kruskal, 1989].
The maximum rank is not a closed set for a given random tensor.
The maximum rank of an m × n × k tensor is unknown; one weak inequality is max{m, n, k} ≤ rank ≤ min{mn, mk, nk}.
For our rank-25 NNTF, the size of the relative residual norm suggests we are still far from the maximum rank of the 3-way and 4-way arrays.

SLIDE 39

Further Reading

◮ M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics & Data Analysis 52(1):155–173, 2007.

◮ F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2):373–386, 2006.

◮ M.W. Berry and M. Browne. Email Surveillance Using Nonnegative Matrix Factorization. Computational & Mathematical Organization Theory 11:249–264, 2005.

◮ P. Hoyer. Nonnegative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5:1457–1469, 2004.

SLIDE 40

Further Reading (contd.)

◮ J.T. Giles, L. Wo, and M.W. Berry. GTP (General Text Parser) Software for Text Mining. In Statistical Data Mining and Knowledge Discovery, CRC Press, Boca Raton, FL, 2003, pp. 455–471.

◮ W. Xu, X. Liu, and Y. Gong. Document Clustering Based on Nonnegative Matrix Factorization. In Proceedings of SIGIR’03, Toronto, Canada, 2003, pp. 267–273.

◮ J.B. Kruskal. Rank, Decomposition, and Uniqueness for 3-way and N-way Arrays. In Multiway Data Analysis, R. Coppi and S. Bolasco (Eds.), Elsevier, 1989, pp. 7–18.
