Discovery of Latent Factors in High-Dimensional Data via Spectral Methods — PowerPoint PPT Presentation


  1. Discovery of Latent Factors in High-Dimensional Data via Spectral Methods. Furong Huang, University of Maryland. Workshop on Quantum Machine Learning. [Slide 1 of 39]

  2. Machine Learning – Excitements. Success of supervised learning: image classification, speech recognition, text processing.

  3. Key to success: deep composition of nonlinear units, enormous labeled data, growth in computational power.

  4. Machine Learning – Modern Challenges. Automated discovery of features and categories: filter-bank learning, feature extraction, embeddings, topics.

  5. Real AI requires unsupervised learning. Summarize key features in data (state of the art: humans still outperform machines; goal: intelligent machines that summarize key features in data). Interpretable modeling and learning of the data (theoretically guaranteed learning; extracted features are interpretable).

  6. Unsupervised Learning with Big Data. Curse of dimensionality: more information → more unknowns/variables → more challenging model learning.

  7. Unsupervised Learning with Big Data – Information Extraction. High-dimensional observations vs. low-dimensional representations: topics, cell types, communities.

  8. Finding the needle in the haystack is challenging.

  9. My solution: a unified tensor decomposition framework.
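The unified framework above rests on representing data statistics as a low-rank tensor whose rank-1 components encode the latent factors. As a minimal sketch (dimensions, weights, and factor vectors below are all made up for illustration), a rank-k symmetric third-order tensor can be built from k component vectors:

```python
import numpy as np

# Illustrative sketch: a rank-k symmetric third-order tensor built from
# latent factors. All dimensions and weights are made-up toy values.
rng = np.random.default_rng(0)
d, k = 10, 3
A = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal factor columns
w = np.array([3.0, 2.0, 1.0])                     # component weights

# T = sum_i w_i * a_i (x) a_i (x) a_i, a d x d x d tensor.
T = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

# The tensor is supersymmetric: permuting its modes leaves it unchanged.
print(np.allclose(T, T.transpose(1, 0, 2)))  # True
```

Recovering the columns of `A` and the weights `w` from `T` alone is exactly the decomposition problem that the later slides solve with spectral methods.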

  10. App 1: Automated Categorization of Documents. Topics: education, crime, sports. Document modeling – observed: words in a document corpus (search logs, emails, etc.); hidden: (mixed) topics (personal interests, professional areas, etc.).

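The observed-words/hidden-topics setup can be sketched as a generative process. Below is a toy single-topic sampler (a common simplification in spectral topic-model analyses); the vocabulary size, topic prior, and word distributions are all illustrative assumptions, not values from the talk:

```python
import numpy as np

# Toy single-topic document model: each document draws one hidden topic h,
# then draws its words i.i.d. from that topic's word distribution.
# All parameters below are made up for illustration.
rng = np.random.default_rng(1)
vocab_size, num_topics, doc_len = 6, 2, 5
topic_prior = np.array([0.4, 0.6])                     # P(h)
word_given_topic = rng.dirichlet(np.ones(vocab_size),  # columns: P(word | h)
                                 size=num_topics).T

def sample_document():
    h = rng.choice(num_topics, p=topic_prior)          # hidden topic
    words = rng.choice(vocab_size, size=doc_len,
                       p=word_given_topic[:, h])       # observed words
    return h, words

h, words = sample_document()
print(h, words)
```

Learning here means inverting this process: estimating `topic_prior` and `word_given_topic` from the observed words alone, never seeing `h`.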

  12. App 2: Community Extraction from a Connectivity Graph. Social networks – observed: network of social ties (friendships, transactions, etc.); hidden: (mixed) groups/communities of social actors.

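A standard way to formalize hidden communities behind an observed tie network is a stochastic block model. The sketch below samples a toy graph from one (a single-membership simplification of the mixed-membership setting on the slide; all sizes and probabilities are made up):

```python
import numpy as np

# Toy stochastic block model: hidden community labels generate an
# observed friendship graph. All parameters are made up for illustration.
rng = np.random.default_rng(2)
n, num_communities = 20, 2
labels = rng.choice(num_communities, size=n)     # hidden memberships
p_in, p_out = 0.8, 0.05                          # within/between edge probs

# Edge probability depends only on whether i and j share a community.
probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample upper triangle
adj = (upper | upper.T).astype(int)               # symmetric, no self-loops

print(adj.shape, adj.trace())  # (20, 20) 0
```

Community extraction is the inverse problem: recover `labels` (or mixed-membership weights) from `adj` alone.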

  14. Tensor Methods Compared with Variational Inference. Learning topics from PubMed on Spark (8 million documents). [Figure: running time (s) and perplexity, tensor vs. variational methods.]

  15. Learning communities from graph connectivity: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1M), DBLP (n ≈ 1M). [Figure: running times (s) and error per group across the four datasets.]

  16. Takeaway: orders of magnitude faster and more accurate. "Online Tensor Methods for Learning Latent Variable Models", F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. "Tensor Methods on Apache Spark", F. Huang, A. Anandkumar, Oct. 2015.

  17. App 3: Cataloging Neuronal Cell Types in the Brain. Neuroscience – observed: cellular-resolution brain slices; hidden: neuronal cell types.

  18. Our method (spatial point process) vs. average expression level [Grange '14]. Recovered known cell types: 1 interneurons, 2 S1 pyramidal, 3 astrocytes, 4 ependymal, 5 microglia, 6 endothelial, 7 mural, 8 oligodendrocytes. "Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model", F. Huang, A. Anandkumar, C. Borgs, J. Chayes, E. Fraenkel, M. Hawrylycz, E. Lein, A. Ingrosso, S. Turaga, NIPS 2015 BigNeuro workshop.

  19. App 4: Word Sequence Embedding Extraction. Example sentences: "The weather is good." "Her life spanned years of incredible change for women." "Mary lived through an era of liberating reform for women." [Figure: word embedding vs. word-sequence embedding.] "Convolutional Dictionary Learning through Tensor Factorization", F. Huang, A. Anandkumar, JMLR 2015.

  20. App 5: Human Disease Hierarchy Discovery. CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. Observed: co-occurrence of diseases in patients; hidden: disease similarity/hierarchy. "Scalable Latent Tree Model and its Application to Health Analytics", F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.

  21. All of these applications involve discovering the hidden, compact structure embedded in high-dimensional, complex observed data.

  22. How to model hidden effects? Basic approach: mixtures/clusters – the hidden variable h is categorical. Advanced: probabilistic models – the hidden variable h can have more general distributions and can model mixed memberships. [Diagram: graphical model with hidden variables h1–h3 over observations x1–x5.] This talk: a basic mixture model and some advanced models (topic models).
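What makes these hidden-variable models tractable for spectral methods is the structure of their low-order moments. As a sketch, in the single-topic model the population word-pair co-occurrence matrix factors through the latent parameters; the topic-word matrix and topic weights below are made-up toy values:

```python
import numpy as np

# Moment structure behind spectral learning: for a single-topic model with
# topic-word matrix A (columns = P(word | topic)) and topic weights w,
# the expected word-pair co-occurrence matrix is M2 = A diag(w) A^T.
# A and w are made-up parameters for illustration.
rng = np.random.default_rng(3)
vocab_size, num_topics = 8, 3
A = rng.dirichlet(np.ones(vocab_size), size=num_topics).T  # P(word | topic)
w = np.array([0.5, 0.3, 0.2])                              # P(topic)

M2 = A @ np.diag(w) @ A.T          # exact population second moment
print(np.isclose(M2.sum(), 1.0))   # True: it is a joint distribution
```

Empirical co-occurrence counts estimate `M2`, so factoring the estimate (together with the analogous third-order moment) recovers `A` and `w` without ever inferring the per-document hidden topic.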

  23. Challenges in Learning. Basic goal in all of the applications above: discover hidden structure in data – unsupervised learning. [Diagram: choice variable h over topics k1–k5, each a distribution over words (life, gene, data, DNA, RNA); unlabeled data + latent variable model → learning algorithm → inference.]

  24. Challenge – conditions for identifiability: can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?

  25. Challenge – efficient learning of latent variable models. MCMC: random sampling, slow (exponential mixing time).

  26. Likelihood-based methods: non-convex, not scalable (exponentially many critical points).

  27. Can we achieve efficient computational and sample complexities?

  28. Guaranteed and efficient learning through spectral methods (tensor decomposition).
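The workhorse behind the guaranteed spectral approach is the tensor power method with deflation: for a symmetric tensor with orthonormal components, iterating v ← T(I, v, v)/‖T(I, v, v)‖ converges to one component, which is then subtracted off. Below is a minimal NumPy sketch on a synthetic tensor (all sizes, weights, and iteration counts are made-up toy choices, not the talk's exact algorithm, which also handles noise and whitening):

```python
import numpy as np

# Tensor power method sketch: recover the components of
# T = sum_i w_i * a_i (x) a_i (x) a_i with orthonormal a_i.
# Toy parameters for illustration.
rng = np.random.default_rng(4)
d, k = 8, 3
A = np.linalg.qr(rng.standard_normal((d, k)))[0]   # true components
w = np.array([3.0, 2.0, 1.0])                      # true weights
T = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

def power_method(T, iters=100):
    """Run the power iteration v <- T(I, v, v), normalized each step."""
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('abc,b,c->a', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('abc,a,b,c->', T, v, v, v)     # recovered weight
    return lam, v

recovered = []
for _ in range(k):
    lam, v = power_method(T)
    recovered.append(v)
    T = T - lam * np.einsum('a,b,c->abc', v, v, v)  # deflate found component

# Overlap of each recovered vector with its best-matching true column of A;
# each entry should be close to 1 when recovery succeeds.
match = np.abs(np.array(recovered) @ A).max(axis=1)
print(match)
```

Because the third-order map squares the coefficient along each component per step, convergence to a component is rapid once the iterate is near it, which is one source of the method's efficiency guarantees.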

