

SLIDE 1

Discovery of Latent Factors in High-dimensional Data Using Tensor Methods

Furong Huang

University of California, Irvine
Machine Learning Conference 2016, New York City

1 / 24

SLIDE 2

Machine Learning - Modern Challenges

Big Data; Challenging Tasks

Success of Supervised Learning

Image classification, speech recognition, text processing

Enabled by computation power growth and enormous labeled data

2 / 24

SLIDE 3

Machine Learning - Modern Challenges

Big Data; Challenging Tasks

Real AI requires Unsupervised Learning

Filter bank learning; feature extraction; embeddings; topics

2 / 24

SLIDE 4

Machine Learning - Modern Challenges

Big Data; Challenging Tasks

Real AI requires Unsupervised Learning

Filter bank learning; feature extraction; embeddings; topics

Summarize key features in data: machines vs. humans

Foundation for successful supervised learning

2 / 24

SLIDE 5

Unsupervised Learning with Big Data

Information Extraction

High-dimensional observations vs. low-dimensional representations

Cell types; topics; communities

3 / 24

SLIDE 6

Unsupervised Learning with Big Data

Information Extraction

High-dimensional observations vs. low-dimensional representations

Cell types; topics; communities

Finding a Needle in the Haystack Is Challenging

3 / 24

SLIDE 7

Unsupervised Learning with Big Data

Information Extraction

Solution for Unsupervised Learning

A Unified Tensor Decomposition Framework

3 / 24

SLIDE 8

Automated Categorization of Documents

Mixed topics

Topics: Education, Crime, Sports

4 / 24

SLIDE 9

Community Extraction From Connectivity Graph

Mixed memberships

5 / 24

SLIDE 10

Tensor Methods Compared with Variational Inference

PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s) on log scales, tensor vs. variational inference]

6 / 24

SLIDE 11

Tensor Methods Compared with Variational Inference

PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s), tensor vs. variational inference]

Facebook: n ∼ 20k; Yelp: n ∼ 40k; DBLP: n ∼ 1 million

[Plot: error per group and running times (s) for FB, YP, DBLPsub, DBLP]

6 / 24

SLIDE 12

Tensor Methods Compared with Variational Inference

PubMed on Spark: 8 million docs

[Plot: perplexity vs. running time (s), tensor vs. variational inference]

Facebook: n ∼ 20k; Yelp: n ∼ 40k; DBLP: n ∼ 1 million

[Plot: error per group and running times (s) for FB, YP, DBLPsub, DBLP]

Orders of magnitude faster & more accurate

“Online Tensor Methods for Learning Latent Variable Models”, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014.
“Tensor Methods on Apache Spark”, F. Huang, A. Anandkumar, Oct. 2015.

6 / 24

SLIDE 13

Cataloging Neuronal Cell Types In the Brain

7 / 24

SLIDE 14

Cataloging Neuronal Cell Types In the Brain

Our method vs. average expression level [Grange '14]

[Plot: performance for k = 0.5–2.5, spatial point process (ours) vs. average expression level (previous)]

Recovered known cell types

1. astrocytes  2. interneurons  3. oligodendrocytes

“Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model”, F. Huang, A. Anandkumar, C. Borgs, J. Chayes, E. Fraenkel, M. Hawrylycz, E. Lein, A. Ingrosso, S. Turaga, NIPS 2015 BigNeuro workshop.

8 / 24

SLIDE 15

Word Sequence Embedding Extraction

Word Embedding

[Diagram: embeddings of football, soccer, tree]

The weather is good. Her life spanned years of incredible change for women. Mary lived through an era of liberating reform for women.

Word Sequence Embedding

9 / 24

SLIDE 16

Word Sequence Embedding Extraction

Word Embedding

[Diagram: embeddings of football, soccer, tree]

The weather is good. Her life spanned years of incredible change for women. Mary lived through an era of liberating reform for women.

Word Sequence Embedding

Paraphrase Detection

MSR paraphrase data: 5800 pairs of sentences

Method                           Outside information    F score
Vector Similarity (baseline)     word similarity        75.3%
Convolutional Tensor (proposed)  none                   80.7%
Skip-thought (NIPS’15)           train on large corpus  81.9%

“Convolutional Dictionary Learning through Tensor Factorization”, F. Huang, A. Anandkumar, JMLR Workshop and Conference Proceedings, vol. 44, Dec. 2015.

9 / 24

SLIDE 17

Human Disease Hierarchy Discovery

CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases.

“Scalable Latent Tree Model and its Application to Health Analytics”, F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.

10 / 24

SLIDE 18

Unsupervised Learning via Probabilistic Models

[Diagram: choice variable h selects among topics k1…k5, which generate words (life, gene, data, DNA, RNA) through topic-word matrices A]

Unlabeled data; probabilistic admixture model; learning algorithm; inference

11 / 24

SLIDE 19

Unsupervised Learning via Probabilistic Models


Unlabeled data; probabilistic admixture model; MCMC inference

MCMC: random sampling, slow

◮ Exponential mixing time 11 / 24

SLIDE 20

Unsupervised Learning via Probabilistic Models


Unlabeled data; probabilistic admixture model; likelihood methods; inference

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points 11 / 24

SLIDE 21

Unsupervised Learning via Probabilistic Models


Unlabeled data; probabilistic admixture model; likelihood methods; inference

MCMC: random sampling, slow

◮ Exponential mixing time

Likelihood: non-convex, not scalable

◮ Exponential critical points

Solution

A unified tensor decomposition framework

11 / 24

SLIDE 22

Unsupervised Learning via Probabilistic Models


Unlabeled data; probabilistic admixture model; tensor decomposition; inference

[Diagram: T = sum of rank-1 terms]

tensor decomposition → correct model

12 / 24

SLIDE 23

Unsupervised Learning via Probabilistic Models


Unlabeled data; probabilistic admixture model; tensor decomposition; inference

[Diagram: T = sum of rank-1 terms]

tensor decomposition → correct model

Contributions

Guaranteed online algorithm with global convergence
Highly scalable, highly parallel, random projection
Tensor library on CPU/GPU/Spark
Interdisciplinary applications
Extension to models with group invariance

12 / 24

SLIDE 24

What is a tensor?

Matrix: second-order moments M2 capture pairwise relationships:

[x ⊗ x]_{i1,i2} = x_{i1} x_{i2} → [M2]_{i1,i2}

Tensor: third-order moments M3 capture triple-wise relationships:

[x ⊗ x ⊗ x]_{i1,i2,i3} = x_{i1} x_{i2} x_{i3} → [M3]_{i1,i2,i3}

13 / 24
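The moment definitions above can be sketched in NumPy. This is an illustrative example, not the talk's code; the sample count and dimension are arbitrary.

```python
import numpy as np

# Illustrative sketch: empirical second- and third-order moments of data
# vectors x, formed as averaged outer products over n samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # 1000 samples in dimension 5

# [M2]_{i1,i2} = E[x_{i1} x_{i2}] -- a matrix of pairwise relationships
M2 = np.einsum('ni,nj->ij', X, X) / len(X)

# [M3]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}] -- a third-order tensor of
# triple-wise relationships
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)

print(M2.shape, M3.shape)
```

Both moments are symmetric under permutations of their indices, which is the structure the decomposition methods on the following slides exploit.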

SLIDE 25

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without an eigenvalue gap:

I = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

14 / 24
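The slide's example can be checked numerically; a minimal sketch of the non-uniqueness, assuming only the two bases written above:

```python
import numpy as np

# The 2x2 identity has no eigenvalue gap, so its orthogonal decomposition
# is not unique: both the standard basis (e1, e2) and the 45-degree
# rotated basis (u1, u2) reproduce it exactly.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
u1 = np.array([np.sqrt(2) / 2, -np.sqrt(2) / 2])
u2 = np.array([np.sqrt(2) / 2,  np.sqrt(2) / 2])

A = np.outer(e1, e1) + np.outer(e2, e2)  # e1 e1^T + e2 e2^T
B = np.outer(u1, u1) + np.outer(u2, u2)  # u1 u1^T + u2 u2^T

print(np.allclose(A, np.eye(2)), np.allclose(B, np.eye(2)))  # True True
```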

SLIDE 26

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without an eigenvalue gap:

I = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Tensor Orthogonal Decomposition

Unique: eigenvalue gap not needed


14 / 24

SLIDE 27

Why are tensors powerful?

Matrix Orthogonal Decomposition

Not unique without an eigenvalue gap:

I = e1 e1⊤ + e2 e2⊤ = u1 u1⊤ + u2 u2⊤, with u1 = [√2/2, −√2/2], u2 = [√2/2, √2/2]

Tensor Orthogonal Decomposition

Unique: eigenvalue gap not needed
A slice of the tensor has an eigenvalue gap


14 / 24

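The uniqueness claim can be made concrete with a hedged sketch of the tensor power method (in the style of the guaranteed algorithms cited in this talk); the component matrix, weights, and iteration count here are illustrative:

```python
import numpy as np

# For an orthogonally decomposable T = sum_i lam_i v_i (x) v_i (x) v_i,
# iterating u <- T(I, u, u) with normalization converges to one of the
# components v_i -- the decomposition is unique, unlike the matrix case.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))    # orthonormal components
lam = np.array([3.0, 2.0, 1.5, 1.0, 0.7, 0.5])  # positive weights
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)

u = rng.normal(size=6)
u /= np.linalg.norm(u)
for _ in range(50):
    u = np.einsum('abc,b,c->a', T, u, u)        # u <- T(I, u, u)
    u /= np.linalg.norm(u)

overlap = np.max(np.abs(V.T @ u))               # alignment with some v_i
print(round(float(overlap), 3))
```

In the full algorithm each recovered component is deflated from T and the iteration is repeated to extract the remaining ones.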

SLIDE 29

Outline

1. Introduction

2. LDA and Community Models
   From Data Aggregates to Model Parameters
   Guaranteed Online Algorithm

3. Conclusion

15 / 24

SLIDE 30

Outline

1. Introduction

2. LDA and Community Models
   From Data Aggregates to Model Parameters
   Guaranteed Online Algorithm

3. Conclusion

16 / 24

SLIDE 31

Probabilistic Topic Models - LDA

Bag of words

Topics Topic Proportion

[Diagram: three documents as bags of the words police, witness, campus]

17 / 24

SLIDE 32

Probabilistic Topic Models - LDA

Bag of words

Topics Topic Proportion

[Diagram: bags of words (police, witness, campus) generated from topics Crime, Sports, Education with topic proportions]

17 / 24

SLIDE 33

Probabilistic Topic Models - LDA

Bag of words

Topics Topic Proportion

[Diagram: topics Crime, Sports, Education generate bags of words (police, witness, campus)]

Goal: recover the topic-word matrix P[word = i | topic = j]

17 / 24
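The generative picture on this slide can be sketched in a few lines. The numbers and names here are illustrative, not from the talk: columns of A are topic-word distributions P[word = i | topic = j] for Crime, Sports, Education, and a document draws a topic proportion h and then words from the mixture A @ h.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["campus", "police", "witness"]
A = np.array([[0.1, 0.2, 0.7],   # rows: words; columns: Crime, Sports, Education
              [0.6, 0.3, 0.1],   # each column sums to 1
              [0.3, 0.5, 0.2]])

h = rng.dirichlet([0.3, 0.3, 0.3])  # topic proportion for one document
word_dist = A @ h                   # P[word] for this document
doc = rng.choice(vocab, size=20, p=word_dist)

print(word_dist.round(3), doc[:5])
```

Recovering A from many such documents, without ever observing h, is exactly the goal stated above.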

SLIDE 34

Mixture Form of Moments

Goal: Linearly independent topic-word table

[Diagram: topic-word table over words campus, police, witness]

18 / 24

SLIDE 35

Mixture Form of Moments

Goal: Linearly independent topic-word table

M1: Occurrence of Words

[Diagram: M1 = weighted sum of topic-word vectors for Crime, Sports, Education over words campus, police, witness]

18 / 24

SLIDE 36

Mixture Form of Moments

Goal: Linearly independent topic-word table

M1: Occurrence of Words

[Diagram: M1 = weighted sum of topic-word vectors for Crime, Sports, Education]

No unique decomposition of vectors

18 / 24

SLIDE 37

Mixture Form of Moments

Goal: Linearly independent topic-word table

M2: Modified Co-occurrence of Word Pairs

[Diagram: M2 = weighted sum of rank-1 topic matrices for Crime, Sports, Education]

18 / 24

SLIDE 38

Mixture Form of Moments

Goal: Linearly independent topic-word table

M2: Modified Co-occurrence of Word Pairs

[Diagram: M2 = weighted sum of rank-1 topic matrices for Crime, Sports, Education]

Matrix decomposition recovers subspace, not actual model

18 / 24

SLIDE 39

Mixture Form of Moments

Goal: Linearly independent topic-word table

Find a W such that W⊤ M2 W = I

M2: Modified Co-occurrence of Word Pairs

[Diagram: M2 = weighted sum of rank-1 topic matrices]

Many such W's exist; find one and project the data with W

18 / 24

SLIDE 40

Mixture Form of Moments

Goal: Linearly independent topic-word table

Know a W such that W⊤ M2 W = I

M3: Modified Co-occurrence of Word Triplets

[Diagram: M3(W, W, W) = weighted sum of orthogonal rank-1 tensors]

Unique orthogonal tensor decomposition; project the result back with W†

18 / 24
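The whitening step can be sketched numerically. This is a hedged sketch under illustrative names (A, w, d, k are not from the talk): given M2 = A diag(w) A⊤ with k linearly independent columns in A, any W with W⊤ M2 W = I orthogonalizes the components, which is what makes the projected third moment orthogonally decomposable.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 8, 3
A = rng.normal(size=(d, k))                 # linearly independent columns
w = np.array([0.5, 0.3, 0.2])               # e.g. topic proportions
M2 = A @ np.diag(w) @ A.T                   # rank-k second moment

# Build W from the rank-k eigendecomposition of M2
vals, vecs = np.linalg.eigh(M2)
top = np.argsort(vals)[::-1][:k]            # indices of the k largest eigenvalues
W = vecs[:, top] / np.sqrt(vals[top])       # columns scaled by 1/sqrt(eigenvalue)

print(np.allclose(W.T @ M2 @ W, np.eye(k)))  # True
```

Applying this W to each mode of M3 and decomposing, then projecting back with the pseudo-inverse W†, recovers the components up to scaling, as the slide states.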

SLIDE 41

Mixture Form of Moments

Goal: Linearly independent topic-word table

Know a W such that W⊤ M2 W = I

M3: Modified Co-occurrence of Word Triplets

[Diagram: M3(W, W, W) = weighted sum of orthogonal rank-1 tensors]

Tensor decomposition uniquely discovers the correct model

Learning Topic Models through Matrix/Tensor Decomposition

18 / 24

SLIDE 42

Mixed Membership Community Models

Mixed memberships

19 / 24

SLIDE 43

Mixed Membership Community Models

Mixed memberships

What ensures guaranteed learning?

[Diagram: Alice, Bob, Charlie, David, Ellen, Frank, Grace, Jack, Kathy with mixed memberships across communities Mathematicians, Vegetarians, Musicians]

19 / 24


SLIDE 45

Outline

1. Introduction

2. LDA and Community Models
   From Data Aggregates to Model Parameters
   Guaranteed Online Algorithm

3. Conclusion

20 / 24

SLIDE 46

How to do tensor decomposition?

Model is uniquely identifiable! How to identify?

21 / 24

SLIDE 47

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

21 / 24

SLIDE 48

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

21 / 24

SLIDE 49

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

Saddle point: enemy of SGD

Saddle Point

Saddle point has 0 gradient

21 / 24

SLIDE 50

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

Saddle point: enemy of SGD

Saddle Point

Saddle point has 0 gradient
Non-degenerate saddle: Hessian has both positive and negative eigenvalues

21 / 24

SLIDE 51

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

Saddle point: enemy of SGD


Saddle point has 0 gradient
Non-degenerate saddle: Hessian has both positive and negative eigenvalues
Negative eigenvalue: direction of escape

21 / 24

SLIDE 52

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

Saddle point: enemy of SGD


Saddle point has 0 gradient
Non-degenerate saddle: Hessian has both positive and negative eigenvalues
Negative eigenvalue: direction of escape

Online Tensor Decomposition with Guaranteed Global Convergence

Theorem: For smooth functions with non-degenerate saddle points, noisy SGD converges to a local minimum in a polynomial number of steps.

21 / 24

SLIDE 53

How to do tensor decomposition?

How to find components? Non-convex optimization problem!

Objective Function

Theorem: We propose an objective function with equivalent local optima.

Saddle point: enemy of SGD


Saddle point has 0 gradient
Non-degenerate saddle: Hessian has both positive and negative eigenvalues
Negative eigenvalue: direction of escape

Online Tensor Decomposition with Guaranteed Global Convergence

Theorem: For smooth functions with non-degenerate saddle points, noisy SGD converges to a local minimum in a polynomial number of steps.

Noise could help!

“Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition”, R. Ge, F. Huang, C. Jin, Y. Yuan, COLT 2015.

21 / 24
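The escape phenomenon is easy to see on a toy function. This is an illustration of the idea, not the COLT 2015 algorithm itself: on f(x, y) = x² − y² the origin is a non-degenerate saddle (zero gradient, Hessian eigenvalues +2 and −2), so plain gradient descent started there never moves, while a little gradient noise pushes the iterate onto the negative-curvature direction.

```python
import numpy as np

def grad(p):
    # gradient of f(x, y) = x^2 - y^2
    return np.array([2.0 * p[0], -2.0 * p[1]])

rng = np.random.default_rng(3)
plain = np.zeros(2)   # plain gradient descent, started at the saddle
noisy = np.zeros(2)   # noisy gradient descent, started at the saddle
for _ in range(200):
    plain -= 0.05 * grad(plain)
    noisy -= 0.05 * (grad(noisy) + rng.normal(scale=0.01, size=2))

print(np.linalg.norm(plain), np.linalg.norm(noisy) > 0.1)
```

The plain iterate stays exactly at the origin; once noise gives the noisy iterate any y-component, the update multiplies it by 1.1 each step and it escapes geometrically, which is the slide's "noise could help".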

SLIDE 54

Outline

1. Introduction

2. LDA and Community Models
   From Data Aggregates to Model Parameters
   Guaranteed Online Algorithm

3. Conclusion

22 / 24

SLIDE 55

Contributions

Spectral methods reveal hidden structure

Text/image processing
Social networks
Neuroscience, healthcare, ...

23 / 24

SLIDE 56

Contributions

Spectral methods reveal hidden structure

Text/image processing
Social networks
Neuroscience, healthcare, ...

Versatile for latent variable models

Flat model → hierarchical model
Sparse coding → convolutional model
Efficient, with convergence guarantees


23 / 24

SLIDE 57

Thank You

Collaborators

Anima Anandkumar, UC Irvine
Rong Ge, Duke University
Srini Turaga, Janelia Research
Chi Jin, UC Berkeley
Jennifer Chayes, MSR
Christian Borgs, MSR
Ernest Fraenkel, MIT
Yang Yuan, Cornell University
U. N. Niranjan, UC Irvine

24 / 24