Deconstructing Data Science David Bamman, UC Berkeley Info 290 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Deconstructing Data Science

David Bamman, UC Berkeley
 
 Info 290
 Lecture 5: Clustering overview Jan 31, 2016

slide-2
SLIDE 2

Clustering

  • Clustering (and unsupervised learning more generally) finds structure in data, using just X

X = a set of skyscrapers

slide-3
SLIDE 3

Unsupervised Learning

Le et al. (2012), “Building High-level Features Using Large Scale Unsupervised Learning” (ICML)

slide-4
SLIDE 4

Netflix Amazon Twitter New York Times

slide-5
SLIDE 5

Unsupervised Learning

  • Matrix completion (e.g., user recommendations on Netflix, Amazon)

Users: Ann, Bob, Chris, David, Erik
Star Wars: 5 5 4 5 3
Bridget Jones: 4 4 1
Rocky: 3 5
Rambo: ? 2 5

slide-6
SLIDE 6

task | 𝓨
learn patterns that define architectural styles | set of skyscrapers
learn patterns that define genre | set of books
learn patterns that suggest “types” of customer behavior | customer data

slide-7
SLIDE 7

  • Topic models
  • Probabilistic graphical models
  • Networks
  • Deep learning
  • K-means clustering
  • Hierarchical clustering

Methods differ in the kind of structure learned

slide-8
SLIDE 8

Hierarchical Clustering

  • Hierarchical order among the elements being clustered

slide-9
SLIDE 9

Shakespeare’s plays, Witmore (2009)
 http://winedarksea.org/?p=519

Dendrogram

slide-10
SLIDE 10

Bottom-up clustering

slide-11
SLIDE 11

Similarity

  • What are you comparing?
  • How do you quantify the similarity/difference of those things?

P(X) × P(X) → ℝ

slide-12
SLIDE 12

Probability

[Figure: bar chart of probabilities (axis 0.0 to 0.4) over the words: the, a, dog, cat, runs, to, store]

slide-13
SLIDE 13

Unigram probability

[Figure: two bar charts of unigram probabilities (axis 0.00 to 0.12) over the words: the, a, of, love, sword, poison, hamlet, romeo, king, capulet, be, woe, him, most]

slide-14
SLIDE 14

Similarity

\text{Euclidean} = \sqrt{\sum_{i}^{\text{vocab}} \left( P_i^{\text{Hamlet}} - P_i^{\text{Romeo}} \right)^2}

Cosine similarity, Jensen-Shannon divergence…
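The three measures named on this slide can be sketched in a few lines. This is only an illustration: the two word distributions below are made-up numbers, not the probabilities from the Hamlet and Romeo charts, and the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two unigram distributions (dicts word -> prob)."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))

def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def _kl(p, m):
    """KL divergence KL(p || m) in bits; terms with p(w) = 0 contribute 0."""
    return sum(pw * math.log2(pw / m[w]) for w, pw in p.items() if pw > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

# Hypothetical unigram probabilities (illustrative only).
hamlet = {"the": 0.5, "king": 0.3, "sword": 0.2}
romeo = {"the": 0.5, "love": 0.3, "sword": 0.2}

print(euclidean(hamlet, romeo))
print(cosine_similarity(hamlet, romeo))
print(jensen_shannon(hamlet, romeo))
```

Euclidean distance and Jensen-Shannon divergence are dissimilarities (0 means identical), while cosine similarity grows as the distributions become more alike.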

slide-15
SLIDE 15

Cluster similarity

slide-16
SLIDE 16

Cluster similarity

  • Single link: two most similar elements
  • Complete link: two least similar elements
  • Group average: average of all members
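A minimal sketch of bottom-up clustering with the three linkage rules above. It works with distances rather than similarities, so "most similar" becomes "smallest distance". Two assumptions are made here: group average is taken over cross-cluster pairs only, and the data are made-up one-dimensional points. All function names are hypothetical.

```python
def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b under the chosen linkage rule."""
    pair_dists = [dist(x, y) for x in a for y in b]
    if linkage == "single":      # two most similar elements
        return min(pair_dists)
    if linkage == "complete":    # two least similar elements
        return max(pair_dists)
    if linkage == "average":     # average over all cross-cluster pairs (assumption)
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, dist, linkage="single", num_clusters=1):
    """Start with one cluster per point and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    merges = []  # (left cluster, right cluster, distance) in merge order
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical 1-D data with absolute difference as the distance.
points = [1.0, 2.0, 6.0, 7.0, 20.0]
clusters, merges = agglomerative(points, lambda x, y: abs(x - y), "single", num_clusters=2)
print(clusters)
```

The recorded merge order and distances are what a dendrogram draws; running to `num_clusters=1` yields the full hierarchy.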
slide-17
SLIDE 17

Flat Clustering

  • Partitions the data into a set of K clusters

A B C

slide-18
SLIDE 18

Flat Clustering

  • Partitions the data into a set of K clusters
slide-19
SLIDE 19

K-means

slide-20
SLIDE 20

K-means
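The slides name K-means without spelling out the procedure. The following is a minimal sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members), assuming Euclidean distance. To keep the example reproducible it uses the first K points as initial centroids, which is an assumption made here; random initialization is more usual. The data are made up.

```python
import math

def kmeans(points, k, iterations=100):
    """Flat clustering of points (tuples of numbers) into k clusters."""
    # Assumption: deterministic initialization from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The result depends on the initial centroids, so in practice the algorithm is often run several times from different starting points.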

slide-21
SLIDE 21

Representation

  • This is a huge decision that impacts what you can learn

[x is a data point characterized by F real numbers, one for each feature]

x ∈ ℝ^F

slide-22
SLIDE 22

Voting behavior

feature | value
Yes on abortion access | 1
Yes on expanding gun rights |
Yes on tax breaks |
Yes on ACA | 1
Yes on abolishing IRS |

x ∈ ℝ^5

slide-23
SLIDE 23

First letter of last name

feature | value
Last name starts with < “A” |
Last name starts with < “B” |
Last name starts with < “C” | 1
Last name starts with < “D” | 1
… | 1
Last name starts with < “Z” | 1

x ∈ ℝ^26
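As a sketch of how such a representation could be computed: the function below maps a last name to 26 features, one per letter, each set to 1 when the name's first letter sorts strictly before that letter. The function name and the example name are illustrative choices made here. Treating the blank cells on the slide as 0 is also an assumption.

```python
import string

def first_letter_features(last_name):
    """26 binary features: is the first letter < "A", < "B", ..., < "Z"?"""
    first = last_name[0].upper()
    return [1 if first < letter else 0 for letter in string.ascii_uppercase]

# Illustrative name; it yields 0 for "A" and "B" and 1 from "C" onward.
x = first_letter_features("Bamman")
print(len(x))
print(x[:4])
```

Under this encoding, names whose first letters are close in the alphabet differ in few features, which is a property a one-hot encoding of the first letter would not have. This illustrates how the choice of representation shapes what a clustering can find.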

slide-24
SLIDE 24

task | 𝓨
learn patterns that define architectural styles | set of skyscrapers
learn patterns that define genre | set of books
learn patterns that suggest “types” of customer behavior | customer data

Representation

slide-25
SLIDE 25

Evaluation

  • Much more complex than supervised learning since there’s often no notion of “truth”

slide-26
SLIDE 26

Internal criteria

  • Elements within clusters should be more similar to each other

  • Elements in different clusters should be less similar to each other
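One simple way to check these two criteria, sketched under assumptions made here (this is not a measure given in the slides), is to compare the mean distance between pairs in the same cluster with the mean distance between pairs in different clusters. The data are made up and the function name is hypothetical.

```python
from itertools import combinations

def within_between(points, labels, dist):
    """Mean pairwise distance within clusters and between clusters.

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical 1-D data and cluster labels.
points = [1.0, 2.0, 6.0, 7.0]
labels = [0, 0, 1, 1]
w, b = within_between(points, labels, lambda x, y: abs(x - y))
print(w, b)  # 1.0 5.0
```

A clustering that satisfies the internal criteria should have a within-cluster mean well below its between-cluster mean.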

slide-27
SLIDE 27

External criteria

  • How closely does your clustering reproduce another (“gold standard”) clustering?

slide-28
SLIDE 28

[Figure: clusters A, B, C; Learned clusters vs. Comparison clusters]

slide-29
SLIDE 29

Evaluation: Purity

  • Learned clusters (as learned by our algorithm)

G = \{g_1 \ldots g_k\}

  • External clusters (from some external source)

C = \{c_1 \ldots c_j\}

\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|

slide-30
SLIDE 30

[Figure: clusters A, B, C; Learned (G) vs. External (C)]

\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|

slide-31
SLIDE 31

[Figure: clusters A, B, C; Learned (G) vs. External (C)]

\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|

slide-32
SLIDE 32

[Figure: clusters A, B, C; Learned (G) vs. External (C)]

\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|

slide-33
SLIDE 33

[Figure: clusters A, B, C; Learned (G) vs. External (C)]

\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|

slide-34
SLIDE 34

[Figure: clusters A, B, C; Learned (G) vs. External (C)]

Purity = (1 + 1 + 2) / 7 = .57
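The purity calculation can be sketched as follows. The seven-item assignment below is hypothetical: the slide's figure is not recoverable, so the labels were chosen here only so that the per-cluster overlaps come out to 1, 1 and 2, matching the arithmetic above.

```python
from collections import Counter

def purity(learned, external):
    """Purity = (1/N) * sum over learned clusters of the largest overlap
    with any single external cluster."""
    members = {}
    for g, c in zip(learned, external):
        members.setdefault(g, []).append(c)
    best_overlaps = [max(Counter(labels).values()) for labels in members.values()]
    return sum(best_overlaps) / len(learned)

# Hypothetical assignment of N = 7 items (not taken from the slide's figure).
learned = [0, 0, 1, 1, 2, 2, 2]                  # learned cluster ids (G)
external = ["A", "B", "B", "C", "A", "A", "C"]   # external labels (C)

print(round(purity(learned, external), 2))  # 0.57
```

Purity rewards small clusters: putting every item in its own cluster gives a purity of 1, so it is usually read alongside the number of clusters.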

slide-35
SLIDE 35

Evaluation: Rand Index

Every pair of data points is either in the same external cluster or it is not, so evaluating a clustering becomes binary classification over pairs.

slide-36
SLIDE 36

pair | same cluster?
Rubio, Paul | 1
Rubio, Cruz | 1
Rubio, Trump |
Rubio, Fiorina |
Rubio, Clinton |
Rubio, Sanders |
Paul, Cruz | 1
Paul, Trump |

Rand Index

slide-37
SLIDE 37

Rand Index

[Confusion matrix: Predicted (ŷ) vs. True (y), each either same cluster or different cluster]

21 decisions: N(N − 1)/2

slide-38
SLIDE 38

Learned, External

[Confusion matrix: Predicted (ŷ) vs. True (y), each either same cluster or different cluster]

slide-39
SLIDE 39

Rand Index

Predicted (ŷ) vs. True (y):

                    same cluster   different cluster
same cluster             1                 4
different cluster        4                12

From the confusion matrix, we can calculate standard measures from binary classification.

The Rand Index = accuracy = (1 + 12) / 21 = .619
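A minimal sketch of the Rand Index as pairwise accuracy. The seven-item assignment is hypothetical (the slide's figure is not recoverable). It was chosen here so that the pair counts come out to 1, 4, 4 and 12, matching the confusion matrix above.

```python
from itertools import combinations

def rand_index(learned, external):
    """Fraction of item pairs on which the two clusterings agree
    (both put the pair together, or both keep it apart)."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(learned)), 2):
        same_learned = learned[i] == learned[j]
        same_external = external[i] == external[j]
        agree += same_learned == same_external
        total += 1
    return agree / total

# Hypothetical assignment of N = 7 items, giving 7 * 6 / 2 = 21 pairs.
learned = [0, 0, 1, 1, 2, 2, 2]                  # predicted clustering
external = ["A", "B", "B", "C", "A", "A", "C"]   # gold-standard clustering

print(round(rand_index(learned, external), 3))  # 0.619
```

Because most pairs usually belong to different clusters, the "different/different" cell tends to dominate the score. Precision and recall on the "same cluster" class can be read off the same matrix.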

slide-40
SLIDE 40

Example

Clustering characters into distinct types

slide-41
SLIDE 41

The Villain

  • Does (agent): kill, hunt, severs, chokes

  • Has done to them (patient): fights, defeats, refuses

  • Is described as (attribute): evil, frustrated, lord

slide-42
SLIDE 42

The Villain

  • Is a character in the movie “Star Wars”
  • Science Fiction, Adventure, Space Opera, Fantasy, Family Film, Action
  • Is played by David Prowse
  • Male
  • 42 years old in 1977

slide-43
SLIDE 43

Task

Data | Source
42,306 movie plot summaries | Wikipedia
15,099 English novels (1700-1899) | HathiTrust

Learning character types from textual descriptions of characters.

slide-44
SLIDE 44

Personas

attribute: dark, major, henchman, warrior, sergeant
agent: shoot, aim, overpower, interrogate, kill

Jason Bourne, Bourne Supremacy

Highest weighted features:

  • Male
  • Action
  • War film

slide-45
SLIDE 45

Personas

patient: capture, corner, transport, imprison, trap
agent: infiltrate, deduce, leap, evade, obtain
agent: flee, escape, swim, hide, manage

Ginormica (Monsters vs. Aliens)

Highest weighted features:

  • Female
  • Action
  • Adventure

slide-46
SLIDE 46

Evaluation I: Names

  • Gold clusters: characters with the same name (sequels, remakes)

  • Noise: “street thug”
  • 970 unique character names used twice in the data; n=2,666

slide-47
SLIDE 47

Evaluation II: TV Tropes

  • Gold clusters: manually clustered characters from www.tvtropes.com

  • “The Surfer Dude”
  • “Arrogant Kung-Fu Guy”
  • “Hardboiled Detective”
  • “The Klutz”
  • “The Valley Girl”
  • 72 character tropes containing 501 characters

slide-48
SLIDE 48

Purity: Names

[Figure: purity (axis 17.5 to 70) for Persona Regression and Dirichlet Persona across the settings 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

slide-49
SLIDE 49

Purity: TV Tropes

[Figure: purity (axis 17.5 to 70) for Persona Regression and Dirichlet Persona across the settings 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

slide-50
SLIDE 50

task | 𝓨
learn patterns that define architectural styles | set of skyscrapers
learn patterns that define genre | set of books
learn patterns that suggest “types” of customer behavior | customer data

Evaluation

slide-51
SLIDE 51

Digital Humanities

  • Marche (2012), Literature Is not Data: Against Digital Humanities

  • Underwood (2015), Seven ways humanists are using computers to understand text.

slide-52
SLIDE 52

Text visualization

slide-53
SLIDE 53

Characteristic vocabulary

Characteristic words by William Wordsworth (in comparison to other contemporary poets) [Underwood 2015]

slide-54
SLIDE 54

Finding and organizing texts

  • e.g., finding all examples of a complex literary form (Haiku).

  • Supplement traditional searches: book catalogues, search engines.

slide-55
SLIDE 55

Modeling literary forms

  • What features of a text are predictive of Haiku?
slide-56
SLIDE 56

Modeling social boundaries

Predicting reviewed texts [Underwood and Sellers (2015)]

slide-57
SLIDE 57

Unsupervised modeling