Deconstructing Data Science
David Bamman, UC Berkeley, Info 290 (PowerPoint presentation)

SLIDE 1

Deconstructing Data Science

David Bamman, UC Berkeley
Info 290
Lecture 5: Clustering overview
Feb 3, 2016

SLIDE 2

Clustering

  • Clustering (and unsupervised learning more generally) finds structure in data, using just X
  • X = a set of skyscrapers

SLIDE 3

Unsupervised Learning

  • Matrix completion (e.g., user recommendations on Netflix, Amazon)

Ratings matrix (users: Ann, Bob, Chris, David, Erik; unlisted entries unobserved):
  Star Wars: 5 5 4 5 3
  Bridget Jones: 4 4 1
  Rocky: 3 5
  Rambo: ? 2 5
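The slide names the task but not a method. One common way to fill in the unobserved cells is a low-rank approximation; the sketch below (mean imputation followed by a truncated SVD) is an illustration, not the approach the slide assumes, and the toy matrix is invented for the example.

```python
import numpy as np

# Toy ratings matrix: rows = movies, columns = users; np.nan = unobserved.
R = np.array([
    [5.0, 5.0, 4.0, 5.0, 3.0],
    [4.0, np.nan, np.nan, 4.0, 1.0],
    [np.nan, 3.0, np.nan, 5.0, np.nan],
    [np.nan, 2.0, np.nan, np.nan, 5.0],
])

# 1. Fill each missing entry with that movie's observed mean rating.
filled = R.copy()
row_means = np.nanmean(R, axis=1)
for i in range(R.shape[0]):
    filled[i, np.isnan(R[i])] = row_means[i]

# 2. Project onto a rank-k approximation via SVD; the low-rank
#    reconstruction supplies predictions for the unobserved cells.
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted rating for an unobserved (movie, user) cell:
pred = float(R_hat[3, 0])
```

Real recommender systems typically fit the low-rank factors directly on the observed entries (e.g., alternating least squares) rather than imputing first; this version just keeps the idea visible in a few lines.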

SLIDE 4

task                                                        data
learn patterns that define architectural styles             set of skyscrapers
learn patterns that define genre                            set of books
learn patterns that suggest “types” of customer behavior    customer data

SLIDE 5

Methods differ in the kind of structure learned:

  • Topic models
  • Probabilistic graphical models
  • Networks
  • Deep learning
  • K-means clustering
  • Hierarchical clustering

SLIDE 6

Hierarchical Clustering

  • Hierarchical order among the elements being clustered

SLIDE 7

Dendrogram

Shakespeare’s plays, Witmore (2009)
http://winedarksea.org/?p=519

SLIDE 8

Bottom-up clustering

SLIDE 9

Similarity

  • What are you comparing?
  • How do you quantify the similarity/difference of those things?

sim: P(X) × P(X) → R

SLIDE 10

Probability

[Bar chart: unigram probabilities for “the”, “a”, “dog”, “cat”, “runs”, “to”, “store”; y-axis 0.0–0.4]

SLIDE 11

Unigram probability

[Two bar charts: unigram probabilities for “the”, “a”, “love”, “sword”, “poison”, “hamlet”, “romeo”, “king”, “capulet”, “be”, “woe”, “him”, “most” in two plays; y-axis 0.00–0.12]

SLIDE 12

Similarity

Euclidean(Hamlet, Romeo) = sqrt( Σ_{i ∈ vocab} (P_i^Hamlet − P_i^Romeo)² )

Other choices: cosine similarity, Jensen–Shannon divergence, …
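The Euclidean distance above (and cosine similarity as an alternative) can be computed directly from two unigram distributions; a minimal sketch, with tiny made-up token lists standing in for the plays:

```python
import math
from collections import Counter

def unigram_probs(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclidean(p, q):
    """sqrt of summed squared probability differences over the union vocabulary."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))

def cosine(p, q):
    """Cosine of the angle between the two distributions as vectors."""
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

hamlet = unigram_probs("to be or not to be".split())
romeo = unigram_probs("o romeo romeo wherefore art thou romeo".split())
d = euclidean(hamlet, romeo)
```

Words missing from one distribution get probability 0 there, which is what `p.get(w, 0.0)` encodes.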

SLIDE 13

Cluster similarity

SLIDE 14

Cluster similarity

  • Single link: similarity of the two most similar elements
  • Complete link: similarity of the two least similar elements
  • Group average: average similarity over all members
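The three linkage criteria above plug into the same bottom-up loop: repeatedly merge the closest pair of clusters. A minimal sketch on 1-D points (the data and the stopping rule, "stop at k clusters", are illustrative choices, not from the slides):

```python
# Bottom-up (agglomerative) clustering with a selectable linkage criterion.

def cluster_distance(a, b, linkage):
    dists = [abs(x - y) for x in a for y in b]
    if linkage == "single":        # two most similar elements
        return min(dists)
    if linkage == "complete":      # two least similar elements
        return max(dists)
    return sum(dists) / len(dists) # group average

def agglomerate(points, k, linkage="single"):
    """Merge the closest pair of clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

clusters = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], k=3, linkage="single")
```

Recording the order of merges (instead of stopping at k) is what yields the dendrogram shown earlier.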
SLIDE 15

Flat Clustering

  • Partitions the data into a set of K clusters

A B C

SLIDE 16

Flat Clustering

  • Partitions the data into a set of K clusters
SLIDE 17

K-means

SLIDE 18

K-means
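The slides give the algorithm's name only; the standard procedure (Lloyd's algorithm) alternates between assigning points to their nearest center and moving each center to the mean of its assigned points. A minimal sketch with an invented toy data set (initialization and stopping rule are implementation choices):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: assignments are stable
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```

K-means is a flat clustering in the sense of the previous slides: it returns a partition into K clusters, with no hierarchy among them.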

SLIDE 19

Representation

  • This is a huge decision that impacts what you can learn.

x ∈ R^F  [x is a data point characterized by F real numbers, one for each feature]

SLIDE 20

Voting behavior

Yes on abortion access       1
Yes on expanding gun rights
Yes on tax breaks
Yes on ACA                   1
Yes on abolishing IRS

x ∈ R^5

SLIDE 21

First letter of last name

Last name starts with “A”
Last name starts with “B”
Last name starts with “C”    1
Last name starts with “D”    1
…                            1
Last name starts with “Z”    1

x ∈ R^26
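The two representations above can be built in a few lines; a sketch where the feature names follow the slides but the encoding details are this example's choices:

```python
import string

# The five roll-call issues from the voting-behavior slide.
VOTES = ["abortion access", "expanding gun rights", "tax breaks",
         "ACA", "abolishing IRS"]

def vote_vector(yes_votes):
    """x in R^5: 1 if the person voted yes on that issue, else 0."""
    return [1 if issue in yes_votes else 0 for issue in VOTES]

def last_name_letter_vector(last_name):
    """x in R^26: one-hot on the first letter of the last name."""
    first = last_name[0].upper()
    return [1 if letter == first else 0 for letter in string.ascii_uppercase]

x1 = vote_vector({"abortion access", "ACA"})
x2 = last_name_letter_vector("Bamman")
```

Under the first representation, legislators cluster by ideology; under the second, by the alphabet. Same people, same algorithm, completely different structure learned, which is the slide's point.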

SLIDE 22

Representation

task                                                        data
learn patterns that define architectural styles             set of skyscrapers
learn patterns that define genre                            set of books
learn patterns that suggest “types” of customer behavior    customer data

SLIDE 23

Evaluation

  • Much more complex than in supervised learning, since there’s often no notion of “truth”

SLIDE 24

Internal criteria

  • Elements within a cluster should be more similar to each other
  • Elements in different clusters should be less similar to each other

SLIDE 25

External criteria

  • How closely does your clustering reproduce another (“gold standard”) clustering?

SLIDE 26

A B C Learned clusters Comparison clusters

SLIDE 27

Evaluation: Purity

  • Learned clusters G = {g1 … gk} (as learned by our algorithm)
  • External clusters C = {c1 … cj} (from some external source)

Purity = (1/N) Σ_k max_j |g_k ∩ c_j|

SLIDE 28

Learned (G)   External (C)
A  B  C

Purity = (1/N) Σ_k max_j |g_k ∩ c_j|


SLIDE 32

Learned (G)   External (C)
A  B  C

Purity = (1 + 1 + 2) / 7 = .57
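Purity is easy to compute from two lists of cluster assignments. A minimal sketch; the example labels are invented to reproduce the slide's (1 + 1 + 2) / 7 = .57, not the slide's actual data points:

```python
from collections import Counter

def purity(learned, external):
    """learned, external: lists giving each point's cluster id."""
    N = len(learned)
    total = 0
    for g in set(learned):
        # External labels of the points in learned cluster g.
        members = [external[i] for i in range(N) if learned[i] == g]
        # Credit the size of the best-matching external cluster.
        total += max(Counter(members).values())
    return total / N

learned  = ["A", "A", "B", "B", "C", "C", "C"]
external = ["x", "y", "x", "y", "z", "z", "x"]
p = purity(learned, external)  # (1 + 1 + 2) / 7
```

Note that purity is easy to game: putting every point in its own cluster scores a perfect 1.0, so it is usually reported alongside other measures.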

SLIDE 33

Evaluation: Rand Index

Every pair of data points is either in the same external cluster or it’s not: a binary classification decision.

SLIDE 34

Rand Index

pair               same cluster?
Rubio, Paul        1
Rubio, Cruz        1
Rubio, Trump
Rubio, Fiorina
Rubio, Clinton
Rubio, Sanders
Paul, Cruz         1
Paul, Trump

SLIDE 35

Rand Index

                     True (y)
Predicted (ŷ)        same cluster    different cluster
same cluster
different cluster

N(N − 1)/2 = 21 decisions (for N = 7)

SLIDE 36

Learned = Predicted (ŷ), External = True (y)

                     True (y)
Predicted (ŷ)        same cluster    different cluster
same cluster
different cluster

SLIDE 37

Rand Index

                     True (y)
Predicted (ŷ)        same cluster    different cluster
same cluster              1                 4
different cluster         4                12

From the confusion matrix, we can calculate standard measures from binary classification. The Rand Index = accuracy = (1 + 12) / 21 = .619
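The Rand Index is pairwise accuracy: over all N(N − 1)/2 pairs, count how often the learned and external clusterings agree (both put the pair together, or both put it apart). A minimal sketch with invented example labels (not the slide's candidate data):

```python
from itertools import combinations

def rand_index(learned, external):
    """learned, external: lists giving each point's cluster id."""
    agree, total = 0, 0
    for i, j in combinations(range(len(learned)), 2):
        same_learned = learned[i] == learned[j]
        same_external = external[i] == external[j]
        if same_learned == same_external:
            agree += 1   # a "true positive" or "true negative" pair
        total += 1       # N(N-1)/2 pairs in all
    return agree / total

learned  = [0, 0, 1, 1]
external = [0, 0, 0, 1]
ri = rand_index(learned, external)
```

Because cluster ids are only compared within each list, the two clusterings can use completely different label sets.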

SLIDE 38

Example

Clustering characters into distinct types

SLIDE 39

The Villain

  • Does (agent): kill, hunt, severs, chokes
  • Has done to them (patient): fights, defeats, refuses
  • Is described as (attribute): evil, frustrated, lord

SLIDE 40

The Villain

  • Is a character in the movie “Star Wars”
  • Science Fiction, Adventure, Space Opera, Fantasy, Family Film, Action
  • Is played by David Prowse
  • Male
  • 42 years old in 1977

SLIDE 41

Task: Learning character types from textual descriptions of characters.

Data                                   Source
42,306 movie plot summaries            Wikipedia
15,099 English novels (1700-1899)      HathiTrust

SLIDE 42

Evaluation I: Names

  • Gold clusters: characters with the same name (sequels, remakes)
  • Noise: “street thug”
  • 970 unique character names used twice in the data; n = 2,666

SLIDE 43

Evaluation II: TV Tropes

  • Gold clusters: manually clustered characters from www.tvtropes.com
      • “The Surfer Dude”
      • “Arrogant Kung-Fu Guy”
      • “Hardboiled Detective”
      • “The Klutz”
      • “The Valley Girl”
  • 72 character tropes containing 501 characters

SLIDE 44

Purity: Names

[Bar chart: purity (y-axis 17.5–70) for Persona Regression vs. Dirichlet Persona across model sizes 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

SLIDE 45

Purity: TV Tropes

[Bar chart: purity (y-axis 17.5–70) for Persona Regression vs. Dirichlet Persona across model sizes 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

SLIDE 46

Evaluation

task                                                        data
learn patterns that define architectural styles             set of skyscrapers
learn patterns that define genre                            set of books
learn patterns that suggest “types” of customer behavior    customer data

SLIDE 47

Digital Humanities

  • Marche (2012), “Literature Is Not Data: Against Digital Humanities”
  • Underwood (2015), “Seven ways humanists are using computers to understand text”

SLIDE 48

Text visualization

SLIDE 49

Characteristic vocabulary

Characteristic words by William Wordsworth (in comparison to other contemporary poets) [Underwood 2015]

SLIDE 50

Finding and organizing texts

  • e.g., finding all examples of a complex literary form (the haiku)
  • Supplement traditional searches: book catalogues, search engines

SLIDE 51

Modeling literary forms

  • What features of a text are predictive of Haiku?
SLIDE 52

Modeling social boundaries

Predicting reviewed texts [Underwood and Sellers (2015)]

SLIDE 53

Unsupervised modeling

SLIDE 54

Homework 1

SLIDE 55

SLIDE 56

Representation

  • Part one (everyone): Design an ideal representation of Oscar nominees to enable good prediction/analysis.

SLIDE 57

Representation

  • Part IIa. Implementation option. Instantiate a subset of those features for all nominees from 1960-2015. Deliverable: 6 feature files we will use to make predictions from.

SLIDE 58

feature name   feature value   nominee canonical id
boxoffice      60700000        /wiki/127_Hours
boxoffice      1000000         /wiki/12_Angry_Men_(1957_film)
boxoffice      168800000       /wiki/12_Monkeys
boxoffice      187700000       /wiki/12_Years_a_Slave_(film)
boxoffice      190000000       /wiki/2001:_A_Space_Odyssey_(film)
boxoffice      60400000        /wiki/21_Grams
boxoffice      2250000         /wiki/42nd_Street_(film)
boxoffice      9300000         /wiki/45_Years
boxoffice      5000000         /wiki/49th_Parallel_(film)

SLIDE 59

Representation

  • Part IIb. Critical option. The prediction process here is conditioned on being a nominee. Lots of public critique of the Academy this year for nominating no minority actors.
  • First, how would you model the Academy’s (human) nomination process? How might this result in the underrepresentation of minorities?
  • Second, consider an algorithmic approach to nominee prediction. What are the ways in which a similar underrepresentation can occur? What are the risks of training a supervised model?
  • How does the representation of data influence these processes?
  • Deliverable: 3-page essay (single-spaced)