SLIDE 1

Discovery of Latent Factors in High-dimensional Data via Spectral Methods

Furong Huang University of Maryland

Workshop on Quantum Machine Learning

1 / 39

SLIDE 3

Machine Learning - Excitements

Success of Supervised Learning
- Image classification
- Speech recognition
- Text processing

Key to Success
- Deep composition of nonlinear units
- Enormous labeled data
- Computation power growth

2 / 39

SLIDE 5

Machine Learning - Modern Challenges

Automated discovery of features and categories?

Real AI requires Unsupervised Learning

Summarize key features in data (filter bank learning, feature extraction, embeddings, topics)
- State-of-the-art: humans are better than machines
- Goal: intelligent machines that summarize key features in data

Interpretable modeling and learning of the data
- Theoretically guaranteed learning
- Extracted features are interpretable

2 / 39

SLIDE 6

Unsupervised Learning with Big Data

Curse of Dimensionality

More information → more unknowns/variables → challenging model learning

3 / 39

SLIDE 8

Unsupervised Learning with Big Data

Information Extraction

High-dimensional observation vs low-dimensional representation: cell types, topics, communities

Finding Needle In the Haystack Is Challenging

3 / 39

SLIDE 9

Unsupervised Learning with Big Data

Information Extraction

High-dimensional observation vs low-dimensional representation: cell types, topics, communities

My Solution: A Unified Tensor Decomposition Framework

3 / 39

SLIDE 10

App 1: Automated Categorization of Documents

Topics: Education, Crime, Sports

Document modeling
- Observed: words in a document corpus (search logs, emails, etc.)
- Hidden: (mixed) topics (personal interests, professional area, etc.)

4 / 39

SLIDE 12

App 2: Community Extraction From Connectivity Graphs (Social Networks)

- Observed: network of social ties (friendships, transactions, etc.)
- Hidden: (mixed) groups/communities of social actors

5 / 39

SLIDE 16

Tensor Methods Compared with Variational Inference

Learning Topics from PubMed on Spark: 8 million docs
[Plot: perplexity vs running time (s), tensor vs variational]

Learning Communities from Graph Connectivity
Facebook: n ∼ 20k, Yelp: n ∼ 40k, DBLPsub: n ∼ 0.1m, DBLP: n ∼ 1m
[Plots: error per group and running times (s) for FB, YP, DBLPsub, DBLP]

Orders of Magnitude Faster & More Accurate

“Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014.
“Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015.

6 / 39

SLIDE 17

App 3: Cataloging Neuronal Cell Types In the Brain

Neuroscience
- Observed: cellular-resolution brain slices
- Hidden: neuronal cell types

7 / 39

SLIDE 18

App 3: Cataloging Neuronal Cell Types In the Brain

Our method (spatial point process) vs average expression level [Grange '14]
[Plot: comparison of the two methods]

Recovered known cell types: 1. Interneurons, 2. S1 Pyramidal, 3. Astrocytes, 4. Ependymal, 5. Microglia, 6. Endothelial, 7. Mural, 8. Oligodendrocytes

“Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model”, F. Huang, A. Anandkumar, C. Borgs, J. Chayes, E. Fraenkel, M. Hawrylycz, E. Lein, A. Ingrosso, S. Turaga, NIPS 2015 BigNeuro workshop.

8 / 39

SLIDE 19

App 4: Word Sequence Embedding Extraction

Word Embedding: football, soccer, tree

Word Sequence Embedding:
- The weather is good.
- Her life spanned years of incredible change for women.
- Mary lived through an era of liberating reform for women.

“Convolutional Dictionary Learning through Tensor Factorization”, by F. Huang, A. Anandkumar, JMLR 2015.

9 / 39

SLIDE 20

App 5: Human Disease Hierarchy Discovery

CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases.
- Observed: co-occurrence of diseases in patients
- Hidden: disease similarity/hierarchy

“Scalable Latent Tree Model and its Application to Health Analytics”, by F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.

10 / 39

SLIDE 21

All of these applications involve discovering the hidden, compact structure that is embedded in high-dimensional, complex observed data.

11 / 39

SLIDE 22

How to model hidden effects?

Basic Approach: mixtures/clusters
- Hidden variable h is categorical.

Advanced: Probabilistic models
- Hidden variable h has more general distributions; can model mixed memberships.
[Diagram: hidden variables h1, h2, h3 over observed x1, ..., x5]

This talk: basic mixture model and some advanced models (topic model)

12 / 39

SLIDE 23

Challenges in Learning

Basic goal in all of the mentioned applications: discover hidden structure in data, i.e., unsupervised learning.

[Diagram: words (life, gene, data, DNA, RNA) generated from topics k1, ..., k5 via a choice variable h]
Pipeline: unlabeled data → latent variable model → learning algorithm → inference

13 / 39

SLIDE 28

Challenges in Learning - find hidden structure in data

[Diagram: words (life, gene, data, DNA, RNA) generated from topics k1, ..., k5 via a choice variable h]
Pipeline: unlabeled data → latent variable model → tensor decomposition → inference

Challenge: Conditions for Identifiability
- Can the model be identified given infinite computation and data?
- Are there tractable algorithms under identifiability?

Challenge: Efficient Learning of Latent Variable Models
- MCMC: random sampling, slow (exponential mixing time)
- Likelihood: non-convex, not scalable (exponentially many critical points)
- Efficient computational and sample complexities?

Guaranteed and efficient learning through spectral methods

13 / 39

SLIDE 30

Unsupervised Learning via Probabilistic Models

[Diagram: words (life, gene, data, DNA, RNA) generated from topics k1, ..., k5 via a choice variable h]
Pipeline: unlabeled data → latent variable model → tensor decomposition → inference

Tensor decomposition → correct model

Contributions
- Guaranteed online algorithm with a global convergence guarantee
- Highly scalable, highly parallel, dimensionality reduction
- Tensor library on CPU/GPU/Spark
- Interdisciplinary applications
- Extension to models with group invariance

14 / 39

SLIDE 31

Outline

1. Introduction
2. Introduction of Method of Moments and Tensor Notations
3. LDA and Community Models: From Data Aggregates to Model Parameters; Guaranteed Online Algorithm
4. Quantum Algorithms for Leading Eigenvector Computation
5. Conclusion

15 / 39

SLIDE 32

Method-of-Moments At A Glance

1. Determine functions of the model parameters θ that are estimable from observable data:
   - moments E_θ[f(X)]
2. Form estimates of the moments using data (iid samples {x_i}, i = 1, ..., n):
   - empirical moments Ê[f(X)]
3. Solve the approximate equations for the parameters θ:
   - moment matching E_θ[f(X)] = Ê[f(X)] as n → ∞

Toy Example

How do we estimate a Gaussian, i.e., (µ, Σ), given iid samples {x_i} from N(µ, Σ²)?

16 / 39
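The toy example above can be sketched numerically. A minimal moment-matching estimate for a 1-D Gaussian (the true parameters and sample size here are illustrative assumptions):

```python
import numpy as np

# Moment matching for a 1-D Gaussian: E[X] = mu and E[X^2] = mu^2 + sigma^2,
# so solving the two empirical moment equations recovers (mu, sigma).
rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=100_000)  # iid samples

m1 = x.mean()            # empirical first moment
m2 = (x ** 2).mean()     # empirical second moment
mu_hat = m1
sigma_hat = np.sqrt(m2 - m1 ** 2)
```

With 100k samples, both estimates land within a few hundredths of the truth.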

SLIDE 33

What is a tensor?

Multi-dimensional Array
- A tensor is a higher-order matrix.
- The number of dimensions is called the tensor order.

17 / 39

SLIDE 34

Slices

Horizontal slices Lateral slices Frontal slices

18 / 39
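In array terms, the order is the number of dimensions, and the three kinds of slices come from fixing one index of an order-3 tensor. A minimal numpy sketch:

```python
import numpy as np

# An order-3 tensor is a 3-dimensional array; fixing one index yields
# the three kinds of slices named above.
T = np.arange(24).reshape(2, 3, 4)   # tensor order = T.ndim = 3

horizontal = T[0, :, :]   # fix the first index:  horizontal slice
lateral    = T[:, 0, :]   # fix the second index: lateral slice
frontal    = T[:, :, 0]   # fix the third index:  frontal slice
```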

SLIDE 35

Tensor Product

[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}   (rank-1 matrix)
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}   (rank-1 tensor)

19 / 39
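Both products are outer products of vectors; in numpy, `einsum` gives them directly (the example vectors are arbitrary):

```python
import numpy as np

# Rank-1 matrix and rank-1 tensor built from outer (tensor) products.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
c = np.array([5.0, 6.0])

ab = np.einsum('i,j->ij', a, b)          # [a ⊗ b]_{i1,i2} = a_{i1} b_{i2}
abc = np.einsum('i,j,k->ijk', a, b, c)   # [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}
```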

SLIDE 36

Tensors in Method of Moments

Matrix: pair-wise relationship
- Signal or data observed: x ∈ R^d
- Rank-1 matrix: [x ⊗ x]_{i,j} = x_i x_j
- Aggregated pair-wise relationship: M2 = E[x ⊗ x]

Tensor: triple-wise relationship or higher
- Signal or data observed: x ∈ R^d
- Rank-1 tensor: [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k
- Aggregated triple-wise relationship: M3 = E[x ⊗ x ⊗ x] = E[x^⊗3]

20 / 39
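Empirically, M2 and M3 are just averages of the rank-1 outer-product terms over samples. A small sketch (the Gaussian data is an assumption):

```python
import numpy as np

# Empirical second and third moments M2 = E[x ⊗ x], M3 = E[x ⊗ x ⊗ x],
# estimated by averaging rank-1 (outer-product) terms over n samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                      # n = 1000 samples in R^5

M2 = np.einsum('ni,nj->ij', X, X) / len(X)          # d x d matrix
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)   # d x d x d tensor
```

By construction both moments are symmetric under permuting their indices.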

SLIDE 37

CP decomposition

X = Σ_{h=1}^{R} a_h ⊗ b_h ⊗ c_h: a summation of rank-1 tensors

21 / 39
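A CP-form tensor can be assembled in one `einsum` over the factor matrices (random factors are an assumption for the sketch):

```python
import numpy as np

# Build a tensor with a known CP decomposition X = sum_h a_h ⊗ b_h ⊗ c_h.
rng = np.random.default_rng(2)
d, R = 4, 3
A, B, C = (rng.normal(size=(d, R)) for _ in range(3))

X = np.einsum('ir,jr,kr->ijk', A, B, C)   # sum of R rank-1 tensors
```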

SLIDE 42

Why are tensors powerful?

Matrix Orthogonal Decomposition
- Not unique without an eigenvalue gap:
  I = e1 e1ᵀ + e2 e2ᵀ = u1 u1ᵀ + u2 u2ᵀ, with u1 = [√2/2, -√2/2], u2 = [√2/2, √2/2]
- Unique with an eigenvalue gap

Tensor Orthogonal Decomposition (Harshman, 1970)
- Unique: eigenvalue gap not needed
- A slice of the tensor has an eigenvalue gap

22 / 39
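The non-uniqueness for matrices, and how the third-order tensor resolves it, can be checked directly with the slide's e/u bases:

```python
import numpy as np

# The identity matrix has many orthogonal rank-1 decompositions (no eigenvalue
# gap), but the corresponding orthogonal *tensor* sums differ:
# e1^{⊗3} + e2^{⊗3} is not equal to u1^{⊗3} + u2^{⊗3}.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u1 = np.array([np.sqrt(2) / 2, -np.sqrt(2) / 2])
u2 = np.array([np.sqrt(2) / 2,  np.sqrt(2) / 2])

M_e = np.outer(e1, e1) + np.outer(e2, e2)
M_u = np.outer(u1, u1) + np.outer(u2, u2)

cube = lambda v: np.einsum('i,j,k->ijk', v, v, v)
T_e = cube(e1) + cube(e2)
T_u = cube(u1) + cube(u2)
```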

SLIDE 43

Outline

1. Introduction
2. Introduction of Method of Moments and Tensor Notations
3. LDA and Community Models: From Data Aggregates to Model Parameters; Guaranteed Online Algorithm
4. Quantum Algorithms for Leading Eigenvector Computation
5. Conclusion

23 / 39

SLIDE 47

Probabilistic Topic Models - LDA

Bag of words
- Infer the topics of documents
- Learn the hidden process that drives the observations

Generative model
- Topic proportion ∼ Dir(α) for a document
- Draw a topic, then a word, for each token
[Diagram: topics (Crime, Sports, Education) and per-document topic proportions over the words campus, police, witness]

Goal
Topic-word matrix P[word = e_i | topic = j]

25 / 39
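A minimal sketch of the generative process above; the 3-word vocabulary, the topic-word matrix, and α below are illustrative assumptions, not the slide's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["campus", "police", "witness"]
# hypothetical topic-word matrix: column j is P[word | topic = j]
A = np.array([[0.1, 0.2, 0.8],
              [0.5, 0.3, 0.1],
              [0.4, 0.5, 0.1]])
alpha = np.ones(3)                       # Dirichlet prior over topics

def sample_doc(n_tokens=50):
    theta = rng.dirichlet(alpha)         # topic proportion ~ Dir(alpha)
    words = []
    for _ in range(n_tokens):
        t = rng.choice(3, p=theta)       # draw a topic for this token
        words.append(vocab[rng.choice(3, p=A[:, t])])  # then a word
    return words

doc = sample_doc()
```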

SLIDE 50

Moments Matching

Goal: linearly independent topic-word table
E[word | topic = j] = Σ_i P[word = e_i | topic = j] e_i = column j (over the words campus, police, witness)

M1: Occurrence Frequency of Words
E[word] = Σ_j E[word | topic = j] P[topic = j]
[Diagram: the word-frequency vector as a weighted sum of the Crime, Sports, and Education topic columns]

No unique decomposition of vectors

26 / 39

SLIDE 52

Moments Matching

Goal: linearly independent topic-word table
E[word | topic = j] = Σ_i P[word = e_i | topic = j] e_i = column j

M2: Modified Co-occurrence Frequency of Word Pairs
E[word1 ⊗ word2] = Σ_{j,j′} E[word1 | topic1 = j] ⊗ E[word2 | topic2 = j′] P[topic1 = j, topic2 = j′]
[Diagram: the word-pair co-occurrence matrix as a weighted sum of rank-1 topic terms]

Matrix decomposition recovers a subspace, not the actual model

26 / 39

SLIDE 53

Moments Matching

Goal: linearly independent topic-word table

Find a W such that applying W to M2 (the modified word-pair co-occurrence matrix) whitens it:
E[word1 ⊗ word2] = Σ_{j,j′} E[word1 | topic1 = j] ⊗ E[word2 | topic2 = j′] P[topic1 = j, topic2 = j′]
[Diagram: W applied to the rank-1 topic terms of M2]

Many such W's; find one, and project the data with W

26 / 39

26 / 39
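A sketch of finding one such W: for a low-rank second moment, the top-k eigendecomposition yields a whitening matrix with Wᵀ M2 W = I. The synthetic rank-k M2 below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
A = rng.normal(size=(d, k))              # hypothetical topic columns
w = np.array([0.6, 0.4])                 # hypothetical topic probabilities
M2 = A @ np.diag(w) @ A.T                # low-rank second moment

vals, vecs = np.linalg.eigh(M2)          # eigenvalues in ascending order
U, S = vecs[:, -k:], vals[-k:]           # top-k eigenpairs
W = U / np.sqrt(S)                       # one valid whitening matrix
```

Any rotation of W whitens M2 equally well, which is exactly why there are "many such W's".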

SLIDE 54

Moments Matching

Goal: linearly independent topic-word table

Know a W such that:

M3: Modified Co-occurrence Frequency of Word Triplets
[Diagram: W applied along each mode of M3, which then has a unique orthogonal tensor decomposition]

Unique orthogonal tensor decomposition; project the result back with W†

26 / 39

26 / 39

SLIDE 55

Moments Matching

Goal: linearly independent topic-word table

Know a W such that:

M3: Modified Co-occurrence Frequency of Word Triplets
[Diagram: W applied along each mode of M3]

Tensor decomposition uniquely discovers the correct model

Learning Topic Models through Matrix/Tensor Decomposition

26 / 39

SLIDE 56

Mixed Membership Community Models

Mixed memberships

27 / 39

SLIDE 57

Mixed Membership Community Models

Mixed memberships. What ensures guaranteed learning?
[Diagram: the network adjacency matrix decomposed as a sum of rank-1 community terms]

27 / 39

SLIDE 59

Outline

1. Introduction
2. Introduction of Method of Moments and Tensor Notations
3. LDA and Community Models: From Data Aggregates to Model Parameters; Guaranteed Online Algorithm
4. Quantum Algorithms for Leading Eigenvector Computation
5. Conclusion

28 / 39

SLIDE 66

Guaranteed Online Tensor Decomposition

Model is uniquely identifiable! How to identify?

Online Tensor Decomposition
Tensor T = Σ_i a_i ⊗ a_i ⊗ a_i ⊗ a_i, where ‖a_i‖ = 1 and a_iᵀ a_j = 0

Objective
min over {u_i : ‖u_i‖₂ = 1} of Σ_{i ≠ j} T(u_i, u_i, u_j, u_j). Non-convex!
Theorem: The proposed objective function has equivalent local optima.

Will SGD work? (Saddle points!)
Theorem: For a smooth, twice-differentiable function with non-degenerate saddle points, noisy SGD converges to a local optimum in polynomially many steps.

Global Convergence Guarantee For Online Tensor Decomposition

29 / 39
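The multilinear objective above can be evaluated concretely for an orthogonal 4th-order tensor; taking the standard basis as the assumed true components, the objective attains its minimum value 0 exactly at those components:

```python
import numpy as np

d = 3
A = np.eye(d)                                    # assumed components a_i = e_i (orthonormal)
T = np.einsum('ri,rj,rk,rl->ijkl', A, A, A, A)   # T = sum_i a_i ⊗ a_i ⊗ a_i ⊗ a_i

def T_form(u, v):
    """Multilinear form T(u, u, v, v)."""
    return np.einsum('ijkl,i,j,k,l->', T, u, u, v, v)

# objective sum_{i != j} T(u_i, u_i, u_j, u_j) evaluated at the true components
obj = sum(T_form(A[i], A[j]) for i in range(d) for j in range(d) if i != j)
```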

SLIDE 70

Why could we escape from saddle points?

Stochastic Gradient Descent with Noise
- A saddle point has zero gradient
- Non-degenerate saddle: the Hessian has both positive and negative eigenvalues
- A negative eigenvalue gives a direction of escape
- Noise could help!

“Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition”, by R. Ge, F. Huang, C. Jin, Y. Yuan, COLT 2015.

30 / 39
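The escape intuition can be seen on the toy saddle f(x, y) = x² − y² (step size, noise scale, and iteration count below are assumptions): plain gradient descent started exactly at the saddle never moves, while noisy gradient steps slide off along the negative-curvature direction.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda p: np.array([2 * p[0], -2 * p[1]])   # gradient of x^2 - y^2

p = np.zeros(2)                      # start exactly at the saddle (zero gradient)
for _ in range(100):
    noise = rng.normal(scale=0.01, size=2)
    p = p - 0.1 * (grad(p) + noise)  # noisy gradient step
```

The y-coordinate (negative curvature, the escape direction) is amplified each step, while the x-coordinate (positive curvature) keeps contracting toward 0.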

SLIDE 71

Outline

1. Introduction
2. Introduction of Method of Moments and Tensor Notations
3. LDA and Community Models: From Data Aggregates to Model Parameters; Guaranteed Online Algorithm
4. Quantum Algorithms for Leading Eigenvector Computation
5. Conclusion

31 / 39

SLIDE 72

First PCA

PCA problem
Sample S = {x_i}, i = 1, ..., m, where x_i ∈ R^d.
Q: identify the direction of the largest variance in the data.

Problem Formulation
Solve max over u ∈ R^d, ‖u‖₂ = 1 of uᵀAu, where the covariance matrix A = (1/m) Σ_{i=1}^m x_i x_iᵀ.

Problem Regime
Assume 0 ⪯ A ⪯ I and A is s-sparse (i.e., nnz of each row or column ≤ s).

32 / 39
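A minimal sketch of this formulation solved by the power method (the synthetic data, with one coordinate's variance inflated, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 0] *= 5.0                   # make coordinate 0 the top-variance direction
A = X.T @ X / len(X)             # covariance A = (1/m) sum_i x_i x_i^T

v = rng.normal(size=10)          # random start
for _ in range(100):
    v = A @ v
    v /= np.linalg.norm(v)       # iterate: v -> A^k v0 / ||A^k v0||
```

The iterate converges to the leading eigenvector, here essentially the first coordinate axis.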

SLIDE 73

Classical Algorithm

Spectral Gap
Spectral gap ∆ = λ1 − λ2
- ordered eigenvalues 1 ≥ λ1 ≥ ... ≥ λd ≥ 0
- corresponding eigenvectors u1, ..., ud

Methods under Warm Start
Warm start: initialization v0 such that |⟨v0, u1⟩| > φ > 0.
Iterative methods achieve ε precision: ⟨vk, u1⟩ ≥ 1 − ε.
- Power method: A^k v0 / ‖A^k v0‖ takes O((sd/∆) log(1/(φε)))
- Lanczos method or accelerated power method takes O((sd/√∆) log(1/(φε)))
  ⋆ replacing the monomial A^k by its Chebyshev polynomial approximation

Question: speedup from O(d) to poly(log d)?

33 / 39

SLIDE 74

Quantum Speedup

Motivation: quantum effects can achieve significant speedups.

Examples
- Shor's algorithm: exponential speedup for factoring integers
- Grover's algorithm: quadratic speedup for searching an unstructured database
- (Harrow, Hassidim, Lloyd '09) and (Childs, Kothari, Somma '17): Ω(d) → poly(log d) for solving d-dimensional linear equation systems, under a weaker output requirement: a quantum state whose vector representation is roughly the solution to the linear equation system.

34 / 39

SLIDE 75

Quantum Leading PCA

Input model: a quantum oracle which generates a quantum state whose vector representation is v0, plus oracle access to A.
Output model: a quantum state whose vector representation is vk.

Main Result
Under warm start |⟨v0, u1⟩| = φ > 0, there is a quantum algorithm which prepares a quantum state with vector representation vk such that ⟨vk, u1⟩ ≥ 1 − ε with probability at least 2/3, using
- O(s log(s/(φε)) / (φ√∆)) queries to the quantum oracles U_{A,s}, U_{A,e},
- O(1/φ) queries to U_{v0},
- O(s (log d · log(s/(φε)) + log^{3.5}(s/(φε))) / (φ√∆)) 2-qubit quantum gates in total.

Joint work with Tongyang Li and Xiaodi Wu.

35 / 39

SLIDE 76

Intuition for Speedup

Chebyshev polynomials can be significantly accelerated in quantum computation; the matrix power A^k b is the key.
- Quantum walk: effectively constructs a degree-m Chebyshev polynomial of A/s.
- Quantum primitive, the linear combination of unitaries (LCU): effectively combines these Chebyshev polynomials linearly to derive the desired approximation polynomial.

Quantum Computation for Linear Algebraic Problems

36 / 39
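Behind both the classical √∆ acceleration and the quantum-walk construction is a standard fact: a Chebyshev polynomial of degree roughly √k approximates the monomial x^k uniformly on [-1, 1]. A small numerical sketch (the degrees and tolerance are assumptions):

```python
import numpy as np

k, deg = 50, 30                          # approximate x^50 by a degree-30 polynomial
xs = np.linspace(-1.0, 1.0, 1001)

# Chebyshev interpolation of x^k at the Chebyshev points of degree `deg`
coef = np.polynomial.chebyshev.chebinterpolate(lambda x: x ** k, deg)
approx = np.polynomial.chebyshev.chebval(xs, coef)

err = np.max(np.abs(approx - xs ** k))   # uniform error on [-1, 1]
```

A degree far below k already drives the uniform error to a small value, which is what lets A^k be replaced by a much shorter Chebyshev recurrence.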

SLIDE 77

Outline

1. Introduction
2. Introduction of Method of Moments and Tensor Notations
3. LDA and Community Models: From Data Aggregates to Model Parameters; Guaranteed Online Algorithm
4. Quantum Algorithms for Leading Eigenvector Computation
5. Conclusion

37 / 39

SLIDE 79

Summary

Spectral methods reveal hidden structure
- Text/image processing
- Social networks
- Neuroscience, healthcare, ...

Versatile for latent variable models
- Flat model → hierarchical model
- Sparse coding → convolutional model
- Efficient, with convergence guarantees
[Diagrams: moment-tensor decompositions for flat, hierarchical, and convolutional models; escaping saddle points]

38 / 39

SLIDE 80

Thank You

furongh@cs.umd.edu

39 / 39