Spectral Methods for Analyzing Large Data using Reweighted Topic - - PowerPoint PPT Presentation

SLIDE 1

Spectral Methods for Analyzing Large Data using Reweighted Topic Modeling

Blake Hunter⋆
Jason Bello, Brian de Silva, Arjuna Flenner†, Jerry Luo
Daniel Bernstein‡, Yang Hu‡, Anna Ma‡, Paul Sharkey‡

⋆ UCLA Applied Math
UCLA Applied Math REU 2013
† China Lake Naval Research Lab and CGU
‡ IPAM RIPS 2013

February 5, 2014

Hunter (UCLA, Applied Math) Reweighted Topic Modeling February 5, 2014 1 / 35

SLIDE 2

Data

SLIDE 3

Data Mining

Extracting knowledge from a dataset. The goal is to transform it into an understandable, usable structure for future use.

Search and Summarization
◮ Topic Modeling
◮ LDA and NMF
◮ Ranking and PCA
◮ Multiple modalities and data fusion

Clustering and Classification
◮ Spectral Clustering
◮ Diffusion Maps

· · ·

SLIDE 4

Applications

Imaging
◮ Naval Research
◮ Hyperspectral
◮ Medical - MRI, fMRI, PET, EEG

Text Mining
◮ Medical Reports, Exams, Analysis
◮ Large Documents
◮ Classified Documents
◮ Twitter
◮ Emerging Topics

Networks and Social Networks
◮ Community Detection
◮ Twitter
◮ Gang Networks

· · ·

SLIDE 5

Sidewinder Documents

Original Sidewinder Document
Converted from Image to Text
Thousands of Sidewinder Documents from the Navy.

SLIDE 6

Classification of Sidewinder Documents

Tens of thousands of Sidewinder documents from the Navy
Certain documents can be declassified
Problem is unsupervised

Content-based search
◮ Searching with an entire document for documents with similar content
◮ More useful than keyword search for an unsupervised problem

Limitations of current search

SLIDE 7

Graphs

Data points $x_i$ are represented by nodes in an undirected graph. Similarity is encoded in edge weights $w_{ij}$.

SLIDE 8

Test Documents

40 test documents

SLIDE 9

Using ’keyword search’ on a document

Table: Search Results for Cincinnati Reds Recap 7/5

Document Description        Date
Cincinnati Reds Recap       7/5
Cincinnati Reds Recap       6/23
Cincinnati Reds Recap       6/23
Toronto Blue Jays Recap     6/31
Toronto Blue Jays Recap     7/3
Minnesota Twins Recap       7/25
Cincinnati Reds Recap       8/13
Minnesota Twins Recap       7/30
Toronto Blue Jays Recap     7/23

SLIDE 10

Converting a Corpus of Documents into a Matrix

Bag-of-Words
◮ Removes most common words, e.g. “the”, “and”, “because”
◮ Produces a histogram vector for each document, where each entry is the count of a specific word

Term Frequency - Inverse Document Frequency (TF-IDF) (more popular)
◮ Diminishes the weight of words that occur frequently throughout the corpus and adds weight to those that occur rarely
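
The two weighting schemes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word list, the whitespace tokenizer, the function names, and the specific TF-IDF variant (raw count times log(N / document frequency)) are all choices made here. Rows are documents, the transpose of the words-by-documents layout on the Histogram Matrix slide.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the slides only name these three examples.
STOP_WORDS = {"the", "and", "because"}

def bag_of_words(docs):
    """Return (vocabulary, counts) with one histogram row per document."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    counts = []
    for toks in tokenized:
        c = Counter(toks)
        counts.append([c[w] for w in vocab])
    return vocab, counts

def tf_idf(counts):
    """Scale each count by log(N / df), where df is the number of documents containing the word."""
    n_docs = len(counts)
    n_words = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_words)] for row in counts]

docs = ["the reds hit and the reds won", "the twins hit", "the jays hit and lost"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

In this toy corpus "hit" occurs in every document, so its TF-IDF weight drops to zero, while "reds" keeps a positive weight in the first document.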

SLIDE 11

Histogram Matrix

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$

Rows are indexed by words, columns by documents.

SLIDE 14

Topic Modeling

Topic modeling attempts to uncover the hidden thematic structure in sets of documents, images and other data.

$\mathrm{Doc}_i = h_{i1} \times \mathrm{Word}_1 + h_{i2} \times \mathrm{Word}_2 + \dots$
$\mathrm{Doc}_i = v_{i1} \times \mathrm{Topic}_1 + v_{i2} \times \mathrm{Topic}_2 + \dots$
$\mathrm{Topic}_i = u_{i1} \times \mathrm{Word}_1 + u_{i2} \times \mathrm{Word}_2 + \dots$
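
Substituting the topic expansion into the document expansion shows how the three sets of coefficients relate. The derivation below is a sketch; the sum limits $K$ (number of topics) and $W$ (number of words) are symbols introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Doc}_i
  &= \sum_{k=1}^{K} v_{ik}\,\mathrm{Topic}_k
   = \sum_{k=1}^{K} v_{ik} \sum_{w=1}^{W} u_{kw}\,\mathrm{Word}_w
   = \sum_{w=1}^{W} \Big( \sum_{k=1}^{K} v_{ik}\,u_{kw} \Big)\,\mathrm{Word}_w, \\
h_{iw} &\approx \sum_{k=1}^{K} v_{ik}\,u_{kw}.
\end{aligned}
```

Comparing with the word expansion of $\mathrm{Doc}_i$, the histogram coefficients are approximated by a product of document-topic and topic-word coefficients, which is the low-rank structure a matrix factorization looks for.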

SLIDE 15

Topic Modeling Methods

Latent Dirichlet Allocation (LDA) [1] (computationally expensive)
Nonnegative Matrix Factorization (NMF) [2]
◮ OMP - Orthogonal matching pursuit (Lozano, Swirszcz, and Abe)
◮ LSAS - Alternating least squares using active sets (Kim and Park)
◮ AM - Alternating multiplicative update (Lee and Seung)
◮ ℓ1 - convex model for NMF (Esser, Moller, Osher and Sapiro)

[1] David Blei, Andrew Ng, and Michael Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
[2] D. Lee and H. S. Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems 13 (2001): 556-562.

SLIDE 16

Poisson Factor Analysis [a]

Assume that each histogram bin $X_{dw}$ satisfies

$$X_{dw} \sim \mathrm{Pois}\Big(\sum_{k=1}^{K} \lambda_{dk}\,\psi_{kw}\Big), \qquad \sum_{w} \psi_{kw} = 1, \quad \lambda_{dk} \ge 0, \quad \psi_{kw} \ge 0.$$

In matrix form,

$$X \sim \mathrm{Pois}(\Lambda\Psi), \qquad \|\psi_k\|_1 = 1, \quad \lambda_{dk} \ge 0, \quad \psi_{kw} \ge 0,$$

where $X \sim \mathrm{Pois}(\Lambda\Psi)$ is interpreted componentwise.

[a] M. Zhou and L. Carin, "Beta-Negative Binomial Process and Poisson Factor Analysis," 2012.

SLIDE 17

Nonnegative Matrix Factorization

$$\min_{U,V} \|X - UV^T\|$$

where $U = [U]_+$, $V = [V]_+$.

X (Words × Documents) ≈ U (Words × Topics) · V^T (Topics × Documents)
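
One of the NMF solvers listed in the deck is the alternating multiplicative update of Lee and Seung. The sketch below applies those updates to the factorization $X \approx UV^T$ in pure Python. It assumes the objective is the squared Frobenius norm (the slide does not say which norm is meant), and the helper names, random initialization, iteration count, and the small `eps` guard are illustrative choices.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, u, v):
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(u, transpose(v))
    return sum((xi - ai) ** 2 for xr, ar in zip(x, approx) for xi, ai in zip(xr, ar))

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor a nonnegative words-by-documents matrix X as U V^T with U, V >= 0."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]  # words x topics
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]  # documents x topics
    for _ in range(iters):
        # V <- V * (X^T U) / (V U^T U), entrywise
        num = matmul(transpose(x), u)
        den = matmul(v, matmul(transpose(u), u))
        v = [[v[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
        # U <- U * (X V) / (U V^T V), entrywise
        num = matmul(x, v)
        den = matmul(u, matmul(transpose(v), v))
        u = [[u[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return u, v

# Toy matrix with two blocks of co-occurring words (4 words x 4 documents).
x = [[2.0, 2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 1.0, 1.0]]
u, v = nmf(x, k=2)
```

Because every update multiplies by a ratio of nonnegative quantities, the factors stay nonnegative without an explicit projection step.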

SLIDE 19

Similarity Measures

Euclidean Based Similarity
◮ Let $u, v \in \mathbb{R}^n$ and let $u^{(i)}$ and $v^{(j)}$ denote different histogram vectors in the corpus

$$1 - \frac{\|u - v\|}{\max_{j} \|u^{(i)} - v^{(j)}\|}$$

Cosine Similarity
◮ Let $u, v \in \mathbb{R}^n$

$$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$$
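
Both measures are short to compute. The sketch below assumes the Euclidean normalizer is the largest pairwise distance over the whole corpus, which is one reading of the maximum in the formula; the function names and the toy vectors are illustrative.

```python
import math

def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two vectors in the corpus."""
    return max(math.dist(a, b) for a in vectors for b in vectors)

def euclidean_similarity(u, v, max_dist):
    """1 - ||u - v|| / max_dist, so identical vectors score 1."""
    return 1.0 - math.dist(u, v) / max_dist

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# The second vector is the first one doubled, e.g. a document twice as long.
corpus = [[3.0, 4.0, 0.0], [6.0, 8.0, 0.0], [0.0, 0.0, 5.0]]
d_max = max_pairwise_distance(corpus)
```

For the first two vectors the cosine similarity is 1 while the Euclidean-based similarity is only about 0.55, which illustrates that cosine similarity ignores document length and the Euclidean measure does not.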

SLIDE 20

Example Topics

Topic 1: blue, jays, toronto, hit, second, time, runs, good, three, single
Topic 2: reds, hit, second, season, cincinnati, pirates, innings, three, games, game
Topic 3: twins, runs, innings, game, minnesota, inning, three, third, start, hit
Topic 4: invasion, allied, june, german, troops, normandy, british, landing, france, beaches
Topic 5: heart, blood, vein, veins, artery, arteries, motion, cavity, small, body
Topic 6: truth, nature, reason, god, will, objects, thought, men, place, opinions

SLIDE 21

Documents as Linear Combinations of Topics

40 test documents
Dark squares indicate a strong presence of a given topic in a document

SLIDE 22

Content-Based Search Results

Histogram Similarity

Document        Similarity
cinc3.txt       1.0000
cinc2.txt       0.7772
cinc4.txt       0.7729
cinc9.txt       0.7569
minn9.txt       0.7470
toronto7.txt    0.7468
cinc7.txt       0.7428
minn7.txt       0.7419
dotm14.txt      0.7406
cinc6.txt       0.7367
WW2 8.txt       0.7361
minn2.txt       0.7358
toronto5.txt    0.7300

Topic Similarity

Document        Similarity
cinc3.txt       1.0000
cinc9.txt       0.9991
cinc1.txt       0.9981
cinc10.txt      0.9980
cinc4.txt       0.9972
cinc2.txt       0.9970
cinc5.txt       0.9959
cinc7.txt       0.9940
cinc8.txt       0.9929
cinc6.txt       0.9905
WW2 8.txt       0.9001
dotm18.txt      0.8996
toronto1.txt    0.8993

SLIDE 23

Disadvantage of Topic Search

Searching based on the similarity of each document's topic vectors finds documents with similar topic compositions, but does not take into account the difference or similarity between distinct topics.

Table: Topic Search

Document          Similarity Index to 'dotm5.txt'
'dotm5.txt'       1
'dotm2.txt'       0.9997
...               ...
'dotm12.txt'      0.9664
'WW2 8.txt'       0.8924
...               ...
'toronto4.txt'    0.7027
'dotm11.txt'      0.7009
...               ...
'dotm14.txt'      0.0212

SLIDE 24

Reweighted Topic Modeling

Topic Affinity Weighting

Define $A$, the affinity topic matrix of $U$, by

$$A_{ij} = e^{-\|U_i - U_j\|^2 / \sigma}$$

$AV^T$ is a reweighting of $V^T$ using similarity between topics

Gram Matrix Weighting

Define $G$, the Gram topic matrix of $U$, by $G_{ij} = \langle U_i, U_j \rangle$
For the dot product, $G = U^T U$
$GV^T$ is a reweighting of $V^T$ using orthogonality between topics
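
A small sketch of both reweightings follows. It assumes $U_i$ denotes the $i$-th topic column of a words-by-topics matrix $U$, that each row of `v` holds one document's topic vector, and that the affinity exponent uses the squared distance divided by $\sigma$; the function names and the toy data are illustrative.

```python
import math

def columns(u):
    """Topic columns U_1..U_k of a words-by-topics matrix stored as rows."""
    return [list(col) for col in zip(*u)]

def affinity_matrix(u, sigma):
    """A_ij = exp(-||U_i - U_j||^2 / sigma)."""
    cols = columns(u)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(ci, cj)) / sigma)
             for cj in cols] for ci in cols]

def gram_matrix(u):
    """G = U^T U, i.e. G_ij = <U_i, U_j>."""
    cols = columns(u)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def reweight(m, v):
    """Apply M to each document's topic vector (the columns of M V^T)."""
    return [[sum(mij * dj for mij, dj in zip(row, doc)) for row in m] for doc in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Topics 0 and 1 share the first word; topic 2 uses a word of its own.
u = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
# Three documents, each made of exactly one topic.
v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

g = gram_matrix(u)
gv = reweight(g, v)
a = affinity_matrix(u, sigma=2.0)
```

Before reweighting, the first two documents have cosine similarity 0 because they use different topics. After Gram reweighting the similarity rises to about 0.8, since those two topics overlap in vocabulary, while the third document stays orthogonal to both.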

SLIDE 25

Search Results Using 8 Topics

Search Results

Topic Vector    Affinity Reweighting    Gram Reweighting
cinc3.txt       cinc3.txt               cinc3.txt
cinc9.txt       cinc9.txt               cinc9.txt
cinc1.txt       cinc5.txt               cinc5.txt
cinc10.txt      cinc10.txt              cinc10.txt
cinc4.txt       cinc1.txt               cinc1.txt
cinc2.txt       cinc7.txt               cinc7.txt
cinc5.txt       cinc8.txt               cinc8.txt
cinc7.txt       cinc2.txt               cinc2.txt
cinc8.txt       cinc6.txt               cinc6.txt
cinc6.txt       cinc4.txt               cinc4.txt
WW2 8.txt       minn6.txt               minn6.txt
WW2 6.txt       minn1.txt               minn1.txt
toronto1.txt    minn7.txt               minn7.txt

SLIDE 26

Topic Matrix Vs Number of Topics

SLIDE 27

Gram Matrix Reweighting: Purity Vs Number of Topics

SLIDE 28

Sidewinder Modified Topic Vectors

SLIDE 29

Search Results on Sidewinder Documents

Search Results: Gram Matrix Modification (using doc13.txt)

Document     Similarity   Description
doc13.txt    1.0000       Design and development of a Fuze Triggering Device test machine
doc22.txt    0.5251       Sidewinder Fuze Triggering Device evaluation
doc08.txt    0.4283       Developmental Program Plan For the Fuze Triggering Device for AIM-9L Missile System
doc23.txt    0.3073       Wing assembly, Studies of the sidewinder 1c aeromechanics, structures and loads
doc21.txt    0.3067       Results of evaluation of contact-delay self-destruct modules for use in the sidewinder missile
doc05.txt    0.3045       Military Specification Test Set, AIM-9H/L Missile Guidance Control Section
doc03.txt    0.3012       Document Control Plan for AIM-9H/AIM-9L Missile Production
doc14.txt    0.2968       Test report of diagnostic and safety testing of the WDU/9B Sidewinder Exercise Warhead
doc04.txt    0.2965       Development and Evaluation of MK 16 Mod 0 Guided Missile Cradle for Sidewinder 1C Missiles
doc19.txt    0.2935       (Letter) Version numbers for rockets and guided missiles; assignment of

SLIDE 30

Semi-Supervised Topics with/without Initialization

Topics Before Initializing U

Topic 1    Topic 2      Topic 3     Topic 4    Topic 5
hit        blue jays    heart       truth      invasion
innings    jays         blood       nature     allied
runs       toronto      vein        reason     june
twins      hit          veins       god        german
game       time         artery      will       troops
second     second       arteries    objects    normandy
three      good         motion      thought    british

Topics After Initializing U

Topic 1    Topic 2      Topic 3     Topic 4       Topic 5
blue       twins        heart       reds          invasion
jays       runs         blood       hit           allied
toronto    innings      vein        season        june
hit        game         artery      second        german
second     inning       veins       cincinnati    troops
time       minnesota    arteries    pirates       normandy
good       three        nature      innings       british

SLIDE 31

Linking Tweets to Crime and Disorder

IPAM RIPS 2013 LAPD Project

Lots of Tweets
Keyword search has limitations
◮ May return thousands of results
◮ Requires that you know what to look for
Topic modeling addresses these issues

Tweets
“Jst bus a .9 n drawn the cops to scene after I walked away”
“I will NEVER get used to is the sound of gun shots.”
Some Tweets manually identified as crime related had no corresponding 911 call

SLIDE 32

Tweet Keywords

Split a collection of documents into topics based on content
Requires no user input
Documents within a topic are ranked by how well they fit
Topics can be summarized by most frequently occurring words

SLIDE 33

New Challenges

Use the Jaro-Winkler string similarity metric to identify misspellings and pluralization
◮ “moose” “mose” “mooses”

Term Frequency Inverse Document Frequency (TF-IDF)
◮ $X(t, w) = \log\big(1 + tc(t, w)\big)\,\log\left(\frac{\text{total num tweets}}{\text{num tweets with } w}\right)$

Time, Space and Content
◮ $K = \alpha A_s + \beta A_t + \chi A_c + \delta A_d$
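
The tweet weighting and the combined kernel can be sketched as follows. The code reads $tc(t, w)$ as the count of word $w$ in tweet $t$ and treats the four $A$ matrices as precomputed, equally sized affinity matrices; the slides do not define them individually, so the toy matrices, the coefficients, the whitespace tokenizer, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

def tweet_tfidf(tweets):
    """Return one {word: weight} dict per tweet using log(1 + tc) * log(N / n_w)."""
    tokenized = [t.lower().split() for t in tweets]
    n = len(tokenized)
    # n_w: number of tweets that contain word w at least once
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tc = Counter(toks)
        weights.append({w: math.log(1 + c) * math.log(n / doc_freq[w]) for w, c in tc.items()})
    return weights

def combine_kernels(mats, coeffs):
    """Entrywise weighted sum of equally sized matrices, as in K = alpha*A_s + beta*A_t + ..."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(c * m[i][j] for c, m in zip(coeffs, mats)) for j in range(cols)]
            for i in range(rows)]

tweets = ["shots fired shots", "cops fired", "heard cops"]
w = tweet_tfidf(tweets)

# Two hypothetical 2x2 affinity matrices and hypothetical coefficients.
k = combine_kernels([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]], [0.5, 0.25])
```

The logarithm on the term count damps repeated words, which matters for short texts where a single repetition would otherwise dominate the vector.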

SLIDE 34

Searching Shots Fired Tweets

Focus on the cluster containing an incidence of crime or a 911 call, e.g. “shots fired”

SLIDE 35

Searching Shots Fired Tweets

1302 “relevant” Tweets reduced to 35
Top Tweets returned:
“i heard a bunch of gunshots now a bunch of cops”
“yourheinous they did supposedly they shot em in the head i was hearing the gunshots”

SLIDE 36

Image Search

Hyperspectral Pixel Search
Image Search

SLIDE 37

Bag of Features

1 Images
2 Features - pixel neighborhoods, wavelets, curvelets, SIFT, HOG, . . .
3 Cluster image features - k-means, spectral clustering, . . .
4 Histogram - bag of words
5 Topic Modeling
6 Graph
7 Mine - cluster images, search, . . .
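
Steps 3 and 4 of the pipeline above can be sketched as follows: once feature clusters exist, each image becomes a histogram of cluster memberships, the image analogue of a bag of words. The centroids here are hypothetical and assumed to come from a prior clustering step such as k-means; the function names and the 2-D toy features are illustrative.

```python
import math

def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature vector."""
    return min(range(len(centroids)), key=lambda k: math.dist(feature, centroids[k]))

def feature_histogram(features, centroids):
    """Count how many of an image's features fall in each cluster."""
    hist = [0] * len(centroids)
    for f in features:
        hist[nearest_centroid(f, centroids)] += 1
    return hist

# Hypothetical centroids from an earlier clustering step.
centroids = [[0.0, 0.0], [10.0, 10.0]]
# Features extracted from one image (2-D here only to keep the example small).
image_features = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
hist = feature_histogram(image_features, centroids)
```

The resulting histograms can be stacked into a matrix and passed to the same topic modeling and graph steps used for text.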

SLIDE 38

Image Search

Graph of 300,000 images with reweighted topic distance. DARPA

SLIDE 39

E-mail: blakehunter@math.UCLA.edu
Web: http://www.math.ucla.edu/~blakehunter

References:

• D. Bernstein, Y. Hu, A. Ma, P. Sharkey, B. Hunter, “Linking Social Media and Disorder, Emerging Topic Detection in Microblogs.” IPAM RIPS Report, IPAM, 2013.
• J. Bello, J. Luo, B. de Silva, B. Hunter, “Content Based Document Search.” REU Report, University of California, Los Angeles, 2013.
• J. Bello, J. Luo, B. de Silva, A. Flenner (advisor), B. Hunter (advisor), “Text mining via content weighted topic modeling.” SIAM Undergraduate Research Online (SIURO), to appear 2013.
• A. Flenner, B. Hunter, “Reweighted topic modeling.” In preparation, 2013.