Practical Unsupervised Learning

INFO/CS 4300, Spring 2016, Jack Hessel


SLIDE 1

Practical Unsupervised Learning

INFO/CS 4300, Spring 2016 Jack Hessel

SLIDE 2

Unsupervised Learning is Cool!

SLIDE 3

But how can we use this in our projects?

SLIDE 4

But first, let’s look at our dataset...

SLIDES 5-12

Data Dimensionality!

[Scatter plot, built up over several slides: Cigarettes Consumed Per Day on one axis, Probability of Lung Cancer Developing on the other. The data points fall along a single line. Annotation: "Difference from mean height?"]

1 Dimension!
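The "lives in 1 dimension" point is easy to check numerically: if 2-D points really fall on a line, the centered data matrix has rank 1. A minimal sketch with made-up smoking numbers (all values hypothetical):

```python
import numpy as np

# Toy 2-D points that lie exactly on a line (hypothetical data).
cigarettes = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
prob_cancer = 0.01 + 0.002 * cigarettes          # perfectly linear relationship
X = np.column_stack([cigarettes, prob_cancer])   # shape (6, 2)

# After centering, one direction explains all the variation: rank 1.
centered = X - X.mean(axis=0)
rank = np.linalg.matrix_rank(centered)
```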

SLIDES 13-20

Words and documents are the same way...

[Diagram, built up over several slides: a term-document matrix X_tfidf with |V| rows (one per vocabulary word, e.g. "Pineapple") and |D| columns (one per document, e.g. "Pineapples were recalled…"); an X marks a single entry.]

(but, really -- a low dimensional subspace…)
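The term-document matrix these slides describe can be built in a few lines. A minimal sketch with a hypothetical three-document corpus and one common tf-idf variant (raw count times log |D|/df); real projects usually use a library vectorizer instead:

```python
import math
from collections import Counter

# Toy corpus: |D| = 3 documents (hypothetical example data).
docs = [
    "pineapples were recalled today",
    "pineapple prices rose",
    "the market fell today",
]
tokenized = [d.split() for d in docs]

# |V| vocabulary words, sorted for a stable row order.
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency and idf(w) = log(|D| / df(w)), one common variant.
df = Counter(w for doc in tokenized for w in set(doc))
idf = {w: math.log(len(docs) / df[w]) for w in vocab}

# X_tfidf is |V| x |D|: rows are words, columns are documents.
X = [[doc.count(w) * idf[w] for doc in tokenized] for w in vocab]
```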

SLIDES 21-24

Key questions in unsupervised NLP:

  • 1. How many dimensions does our dataset actually live in?
  • 2. How do we project our data down to those dimensions?
  • 3. Does any of this stuff actually do anything for our projects?

SLIDES 25-28

Key tool in Linear Algebra, NLP, Machine Learning, Data Science, Computer Vision, Algorithms, Matrix Computations, Optimization, Statistics...

SLIDES 29-34

[Diagram, built up over several slides: X_tfidf (a |V| x |D| matrix) written as a sum of rank-1 terms, each a length-|V| column vector times a length-|D| row vector, weighted by a scalar: X_tfidf = k1 u1 v1^T + k2 u2 v2^T + …, i.e. column vectors times diag(k1, k2, …) times row vectors.]
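The rank-1 sum in this diagram is exactly what the singular value decomposition produces. A sketch with synthetic data (toy sizes; `u1`, `v1`, and the weights 3.0 and 2.0 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy |V| x |D| matrix built to be exactly rank 2, as in the diagram.
u1, u2 = rng.standard_normal(6), rng.standard_normal(6)  # length-|V| columns
v1, v2 = rng.standard_normal(4), rng.standard_normal(4)  # length-|D| rows
X = 3.0 * np.outer(u1, v1) + 2.0 * np.outer(u2, v2)      # k1*u1*v1^T + k2*u2*v2^T

# SVD recovers such a decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X)

# Only two singular values are (numerically) nonzero:
# the data "lives in" 2 dimensions.
rank = int(np.sum(s > 1e-10))

# Summing the top-rank rank-1 terms reconstructs X.
X_k = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(rank))
```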

SLIDE 35

SLIDE 36

Key questions in unsupervised NLP:

  • 1. How many dimensions does our dataset actually live in?
  • 2. How do we project our data down to those dimensions?
SLIDE 37

Enough talk, time for some magic.

SLIDES 38-40

[Diagram: the tf-idf matrix X_tfidf (|V| x |D|) plus the low-rank decomposition from the previous slides gives:]

Latent Semantic Indexing (LSI) = Latent Semantic Analysis (LSA)
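In code, LSI amounts to taking the truncated SVD of the tf-idf matrix and keeping the top k dimensions as document representations. A sketch with a hand-made toy matrix (the numbers and k = 2 are illustrative only):

```python
import numpy as np

# Pretend tf-idf matrix: |V| = 5 words, |D| = 4 documents (toy numbers).
# Columns 0/1 and 2/3 use mostly disjoint words.
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
    [0.2, 0.1, 0.1, 0.2],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep the top-k latent "semantic" dimensions
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # each row: a document in k-dim LSI space

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents 0 and 1 end up close together in the latent space, and far from documents 2 and 3, mirroring their word overlap.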

SLIDE 41

As a side note...

Great first NLP paper to read! Highly accessible :-)

Scott Deerwester

“Indexing by latent semantic analysis” - Deerwester et al. 1990

SLIDES 42-45

What are topic models?

[Diagram: X_tfidf (|V| x |D|) factored into a |V| x k matrix times a k x |D| matrix.]

  • Latent semantic indexing (Deerwester et al. 1990)
  • Non-negative matrix factorization (Lee and Seung 1999)
  • Latent Dirichlet allocation (Blei et al. 2003)
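Of the three, non-negative matrix factorization is the easiest to sketch from scratch, using Lee and Seung's multiplicative updates (toy random data; the sizes, iteration count, and `eps` are arbitrary choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_docs, k = 6, 8, 2      # |V| words, |D| docs, k topics (toy sizes)
X = rng.random((n_words, n_docs)) # stand-in for a nonnegative count/tf-idf matrix

W = rng.random((n_words, k))      # word-topic factor
H = rng.random((k, n_docs))       # topic-document factor

eps = 1e-9  # avoids division by zero
for _ in range(200):
    # Lee & Seung (1999) multiplicative updates for Frobenius-norm NMF;
    # both factors stay elementwise nonnegative by construction.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H)   # reconstruction error after fitting
```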

SLIDES 46-49

Why do we care??

[Diagram: the k-dimensional factors of the |V| x |D| matrix.]

Interpretable, small number of features for text classification!

SLIDES 50-54

Document length *matters a lot*

Different regimes of supervised NLP (Jack’s opinions only! Lots of caveats!)

[Spectrum from fewer to more words, with a boundary around 50-100 words:]

  • Fewer words: topic models fail.
  • More words: topic models work.
  • Naive Bayes and n-gram features + a linear classifier are almost always pretty good in practice :-)

SLIDE 55

So can we see if our Kickstarter will be successful?