Spectral Methods for Analyzing Large Data using Reweighted Topic Modeling

Blake Hunter ⋆
Jason Bello, Brian de Silva, Arjuna Flenner †, Jerry Luo
Daniel Bernstein ‡, Yang Hu ‡, Anna Ma ‡, Paul Sharkey ‡

⋆ UCLA Applied Math
UCLA Applied Math REU 2013
† China Lake Naval Research Lab and CGU
‡ IPAM RIPS 2013

February 5, 2014

Hunter (UCLA, Applied Math) Reweighted Topic Modeling February 5, 2014 1 / 35

Data

Data Mining

Extracting knowledge from a dataset. The goal is to transform it into an understandable, usable structure for future use.

Search and Summarization
◮ Topic Modeling
◮ LDA and NMF
◮ Ranking and PCA
◮ Multiple modalities and data fusion

Clustering and Classification
◮ Spectral Clustering
◮ Diffusion Maps
· · ·

Applications

Imaging
◮ Naval Research
◮ Hyperspectral
◮ Medical - MRI, fMRI, PET, EEG

Text Mining
◮ Medical Reports, Exams, Analysis
◮ Large Documents
◮ Classified Documents
◮ Twitter
◮ Emerging Topics

Networks and Social Networks
◮ Community Detection
◮ Twitter
◮ Gang Networks
· · ·

Sidewinder Documents

Original Sidewinder Document
Converted from Image to Text

Thousands of Sidewinder Documents from the Navy.

Classification of Sidewinder Documents

Tens of thousands of Sidewinder Documents from the Navy
Certain documents can be declassified
Problem is unsupervised
Content-based search
◮ Searching with an entire document for documents with similar content
◮ More useful than keyword search for an unsupervised problem
Limitations of current search

Graphs

Data points $x_i$ are represented by nodes in an undirected graph.
Similarity is encoded in edge weights $w_{ij}$.

Test Documents

40 test documents

Using 'keyword search' on a document

Table: Search Results for Cincinnati Reds Recap 7/5

Document Description       Date
Cincinnati Reds Recap      7/5
Cincinnati Reds Recap      6/23
Cincinnati Reds Recap      6/23
Toronto Blue Jays Recap    6/31
Toronto Blue Jays Recap    7/3
Minnesota Twins Recap      7/25
Cincinnati Reds Recap      8/13
Minnesota Twins Recap      7/30
Toronto Blue Jays Recap    7/23

Converting a Corpus of Documents into a Matrix

Bag-of-Words
◮ Removes most common words, e.g. "the", "and", "because"
◮ Produces a histogram vector for each document, where each entry is the count of a specific word

Term Frequency - Inverse Document Frequency (TF-IDF) (more popular)
◮ Diminishes the weight of words that occur frequently throughout the corpus and adds weight to those that occur rarely
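The two conversions above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: the stop-word list, the sample sentences, the helper names, and the particular weighting count × log(N / df) are all assumptions, and other TF-IDF variants are common. Each row of `counts` is one document's histogram; the histogram matrix X in these slides stores documents as columns, so it corresponds to the transpose.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "because", "a", "of", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocab, counts); counts[d][j] is how often vocab[j] occurs in document d."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({w for tokens in token_lists for w in tokens})
    index = {w: j for j, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for d, tokens in enumerate(token_lists):
        for word, c in Counter(tokens).items():
            counts[d][index[word]] = c
    return vocab, counts

def tf_idf(counts):
    """Scale each count by log(N / df), df = number of documents containing the word."""
    n_docs, n_words = len(counts), len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / df[j]) for j in range(n_words)]
    return [[row[j] * idf[j] for j in range(n_words)] for row in counts]

# Made-up sample corpus for illustration only.
docs = [
    "The Reds hit a home run in the ninth inning.",
    "The Twins hit well and the Reds lost.",
    "Blood flows from the heart because of the arteries.",
]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

With this weighting, a word found in every document gets weight zero, while a word found in a single document gets the largest weight, log(N).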

Histogram Matrix

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$

Rows are indexed by Words, columns by Documents.

Topic Modeling

Topic modeling attempts to uncover the hidden thematic structure in sets of documents, images and other data.

$$\text{Doc}_i = h_{i1} \times \text{Word}_1 + h_{i2} \times \text{Word}_2 + \cdots$$
$$\text{Doc}_i = v_{i1} \times \text{Topic}_1 + v_{i2} \times \text{Topic}_2 + \cdots$$
$$\text{Topic}_i = u_{i1} \times \text{Word}_1 + u_{i2} \times \text{Word}_2 + \cdots$$

Topic Modeling Methods

Latent Dirichlet Allocation (LDA) [1] (computationally expensive)

Nonnegative Matrix Factorization (NMF) [2]
◮ OMP - Orthogonal matching pursuit (Lozano, Swirszcz, and Abe)
◮ LSAS - Alternating least squares using active sets (Kim and Park)
◮ AM - Alternating multiplicative update (Lee and Seung)
◮ ℓ1 - Convex model for NMF (Esser, Möller, Osher, and Sapiro)

[1] David Blei, Andrew Ng, and Michael Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
[2] D. D. Lee and H. S. Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems 13 (2001): 556-562.

Poisson Factor Analysis [a]

Assume that each histogram bin $X_{dw}$ satisfies

$$X_{dw} \sim \mathrm{Pois}\left(\sum_{k=1}^{K} \lambda_{dk}\,\psi_{kw}\right), \qquad \sum_{w} \psi_{kw} = 1 \ \text{and} \ \lambda_{dk} \ge 0,\ \psi_{kw} \ge 0.$$

In matrix form,

$$X \sim \mathrm{Pois}(\Lambda\Psi), \qquad \|\psi_k\|_1 = 1 \ \text{and} \ \lambda_{dk} \ge 0,\ \psi_{kw} \ge 0,$$

where $\mathrm{Pois}(\Lambda\Psi)$ is interpreted componentwise.

[a] M. Zhou and L. Carin. "Beta-Negative Binomial Process and Poisson Factor Analysis." 2012.
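The link between this model and matrix factorization shows up when the likelihood is written out. The following is a sketch that uses only the standard Poisson probability mass function and assumes the bins are independent given $\Lambda$ and $\Psi$; the shorthand $\hat{X}$ is introduced here for illustration and does not appear on the slide.

```latex
\begin{aligned}
\hat{X}_{dw} &= \sum_{k=1}^{K} \lambda_{dk}\,\psi_{kw}, \\
-\log p(X \mid \Lambda, \Psi)
  &= \sum_{d,w} \Bigl[ \hat{X}_{dw} - X_{dw}\log \hat{X}_{dw} + \log\bigl(X_{dw}!\bigr) \Bigr].
\end{aligned}
```

The last term does not depend on $\Lambda$ or $\Psi$, so maximizing the likelihood under the nonnegativity constraints amounts to minimizing the generalized Kullback-Leibler divergence between $X$ and $\Lambda\Psi$, which is itself a nonnegative matrix factorization objective.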

Nonnegative Matrix Factorization

$$\min_{U,V} \|X - UV^T\| \quad \text{where } U = [U]_+,\ V = [V]_+.$$

X (Words × Documents) ≈ U (Words × Topics) · V^T (Topics × Documents)
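One way to attack this minimization is the multiplicative update rule of Lee and Seung. The sketch below is illustrative only: it assumes the norm is the Frobenius norm (the slide does not say which norm is meant), uses plain Python lists so it runs without dependencies, and the function names, the small constant `eps`, the iteration count, and the toy matrix are all assumptions.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def frobenius_error(X, U, V):
    """Frobenius norm of X - U V^T."""
    R = matmul(U, transpose(V))
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr)) ** 0.5

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor X (words x documents) as U V^T with U (words x k), V (documents x k)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    Xt = transpose(X)
    for _ in range(n_iter):
        # U <- U * (X V) / (U V^T V), elementwise
        num = matmul(X, V)
        den = matmul(U, matmul(transpose(V), V))
        U = [[u * a / (b + eps) for u, a, b in zip(ur, nr, dr)]
             for ur, nr, dr in zip(U, num, den)]
        # V <- V * (X^T U) / (V U^T U), elementwise
        num = matmul(Xt, U)
        den = matmul(V, matmul(transpose(U), U))
        V = [[v * a / (b + eps) for v, a, b in zip(vr, nr, dr)]
             for vr, nr, dr in zip(V, num, den)]
    return U, V

# Toy 4-word x 4-document matrix with two obvious blocks (made-up data).
X = [
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
U, V = nmf(X, k=2)
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, U and V stay nonnegative without any explicit projection, and the reconstruction error does not increase from one iteration to the next.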

Similarity Measures

Euclidean Based Similarity
◮ Let $u, v \in \mathbb{R}^n$ and let $u^{(i)}$ and $v^{(j)}$ denote different histogram vectors in the corpus

$$1 - \frac{\|u - v\|}{\max_{j} \|u^{(i)} - v^{(j)}\|}$$

Cosine Similarity
◮ Let $u, v \in \mathbb{R}^n$

$$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$$
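Both measures are short to write down in code. The sketch below is an illustration under one stated assumption: the maximum in the Euclidean-based measure is read as the largest distance between any two vectors in the corpus, so that the score lies between 0 and 1. The function names and the three sample vectors are made up.

```python
import math

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def distance(u, v):
    return norm([a - b for a, b in zip(u, v)])

def euclidean_similarity(u, v, corpus):
    """1 - ||u - v|| divided by the largest pairwise distance in the corpus (assumed reading)."""
    max_dist = max(distance(a, b) for a in corpus for b in corpus)
    return 1.0 - distance(u, v) / max_dist

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))

# Made-up histogram vectors.
corpus = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
sim_far = euclidean_similarity(corpus[1], corpus[2], corpus)   # farthest pair
sim_same_direction = cosine_similarity(corpus[0], corpus[2])   # parallel vectors
```

The cosine measure ignores vector length, so a long document and a short document with the same word proportions score as identical, whereas the Euclidean-based measure penalizes the difference in length. Both functions fail with a division by zero on degenerate input (a zero vector, or a corpus whose vectors are all equal).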

Example Topics

Topic 1    Topic 2      Topic 3      Topic 4     Topic 5     Topic 6
blue       reds         twins        invasion    heart       truth
jays       hit          runs         allied      blood       nature
toronto    second       innings      june        vein        reason
hit        season       game         german      veins       god
second     cincinnati   minnesota    troops      artery      will
time       pirates      inning       normandy    arteries    objects
runs       innings      three        british     motion      thought
good       three        third        landing     cavity      men
three      games        start        france      small       place
single     game         hit          beaches     body        opinions

Documents as Linear Combinations of Topics

40 test documents
Dark squares indicate a strong presence of a given topic in a document

Content-Based Search Results

Histogram Similarity            Topic Similarity
Document        Similarity      Document        Similarity
cinc3.txt       1.0000          cinc3.txt       1.0000
cinc2.txt       0.7772          cinc9.txt       0.9991
cinc4.txt       0.7729          cinc1.txt       0.9981
cinc9.txt       0.7569          cinc10.txt      0.9980
minn9.txt       0.7470          cinc4.txt       0.9972
toronto7.txt    0.7468          cinc2.txt       0.9970
cinc7.txt       0.7428          cinc5.txt       0.9959
minn7.txt       0.7419          cinc7.txt       0.9940
dotm14.txt      0.7406          cinc8.txt       0.9929
cinc6.txt       0.7367          cinc6.txt       0.9905
WW2 8.txt       0.7361          WW2 8.txt       0.9001
minn2.txt       0.7358          dotm18.txt      0.8996
toronto5.txt    0.7300          toronto1.txt    0.8993

Disadvantage of Topic Search

Searching based on the similarity of each document's topic vectors finds documents with similar topic compositions, but does not take into account the difference or similarity between distinct topics.

Table: Topic Search

Document          Similarity Index to 'dotm5.txt'
'dotm5.txt'       1
'dotm2.txt'       0.9997
...               ...
'dotm12.txt'      0.9664
'WW2 8.txt'       0.8924
...               ...
'toronto4.txt'    0.7027
'dotm11.txt'      0.7009
...               ...
'dotm14.txt'      0.0212
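A small made-up example makes the limitation concrete. In the sketch below, the document names, the three-topic weights, and the premise that topics A and B are both about baseball while topic C is about anatomy are all hypothetical; none of them come from the test corpus in the talk.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic weights over (topic A, topic B, topic C).
# Assume topics A and B are both about baseball and topic C is about anatomy.
topic_vectors = {
    "query":    [1.0, 0.0, 0.0],
    "baseball": [0.0, 1.0, 0.0],
    "anatomy":  [0.0, 0.0, 1.0],
    "mixed":    [0.8, 0.2, 0.0],
}

def topic_search(query, vectors):
    """Rank every document by cosine similarity of its topic vector to the query's."""
    q = vectors[query]
    ranked = [(name, cosine(q, v)) for name, v in vectors.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

results = topic_search("query", topic_vectors)
scores = dict(results)
```

Here `scores["baseball"]` and `scores["anatomy"]` are both zero: the search treats a document about a closely related topic exactly like one about an unrelated topic, because the comparison never looks at what the topics themselves contain.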

Reweighted Topic Modeling

Topic Affinity Weighting
Define $A$, the affinity topic matrix of $U$, by

$$A_{ij} = e^{-\|U_i - U_j\|^2 / \sigma}$$

$AV^T$ is a reweighting of $V^T$ using similarity between topics

Gram Matrix Weighting
Define $G$, the Gram topic matrix of $U$, by

$$G_{ij} = \langle U_i, U_j \rangle$$

For dot product, $G = U^T U$
$GV^T$ is a reweighting of $V^T$ using orthogonality between topics
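Both reweightings can be sketched directly from these definitions. The code below is a minimal illustration with made-up numbers: the 4-word, 3-topic matrix `U`, the document-topic rows in `V`, the choice `sigma = 1.0`, and all function names are assumptions. $U_i$ is taken to be the i-th column of `U`, that is, the word distribution of topic i.

```python
import math

def topic_columns(U):
    """Columns of U; column i is the word vector of topic i."""
    return [list(col) for col in zip(*U)]

def affinity_matrix(U, sigma):
    """A[i][j] = exp(-||U_i - U_j||^2 / sigma)."""
    cols = topic_columns(U)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(ci, cj)) / sigma)
             for cj in cols] for ci in cols]

def gram_matrix(U):
    """G = U^T U, so G[i][j] is the dot product of topics i and j."""
    cols = topic_columns(U)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def reweight(W, V):
    """Row d of the result is W v_d, i.e. column d of W V^T."""
    return [[sum(w * x for w, x in zip(w_row, v)) for w_row in W] for v in V]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up word-topic matrix: topics 0 and 1 share words, topic 2 does not.
U = [
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
]
# Made-up document-topic rows: each document uses exactly one topic.
V = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
A = affinity_matrix(U, sigma=1.0)
G = gram_matrix(U)
VA = reweight(A, V)
VG = reweight(G, V)
```

Before reweighting, the three documents are mutually orthogonal, so the first document is equally dissimilar to the other two. After either reweighting, the first document is far closer to the second (whose topic shares words with its own) than to the third, which is the behavior plain topic search lacks.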
