Deconstructing Data Science David Bamman, UC Berkeley Info 290 - PowerPoint PPT Presentation

Deconstructing Data Science • David Bamman, UC Berkeley • Info 290 • Lecture 5: Clustering overview • Jan 31, 2016

Clustering • Clustering (and unsupervised learning more generally) finds structure in data, using just X • X = a set of skyscrapers

Unsupervised Learning Le et al. (2012), “Building High-level Features Using Large Scale Unsupervised Learning” (ICML)

Netflix Amazon Twitter New York Times

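Unsupervised Learning • Matrix completion (e.g., user recommendations on Netflix, Amazon) • Ratings matrix with users Ann, Bob, Chris, David, Erik as columns and films as rows (only observed ratings listed; "?" is the rating to predict): Star Wars: 5 5 4 5 3; Bridget Jones: 4 4 1; Rocky: 3 5; Rambo: ? 2 5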

X | task
set of skyscrapers | learn patterns that define architectural styles
set of books | learn patterns that define genre
customer data | learn patterns that suggest “types” of customer behavior

Methods differ in the kind of structure learned • Deep learning • Probabilistic graphical models • Networks • Topic models • K-means clustering • Hierarchical clustering

Hierarchical Clustering • Hierarchical order among the elements being clustered

Dendrogram • Shakespeare’s plays • Witmore (2009) http://winedarksea.org/?p=519

Bottom-up clustering

Similarity • $P(X) \times P(X) \to \mathbb{R}$ • What are you comparing? • How do you quantify the similarity/difference of those things?

[Figure: bar chart of word probabilities over the vocabulary: the, a, dog, cat, runs, to, store]

[Figure: two bar charts of unigram probability over the words: the, a, of, love, sword, poison, hamlet, romeo, king, capulet, be, woe, him, most]

Similarity • $\text{Euclidean} = \sqrt{\sum_{i}^{\text{vocab}} \left( P_i^{\text{Hamlet}} - P_i^{\text{Romeo}} \right)^2}$ • Cosine similarity, Jensen-Shannon divergence…
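The Euclidean distance between two unigram distributions, and the cosine similarity mentioned alongside it, can be sketched as follows. This is a minimal illustration: the function names and the toy probabilities are assumptions made up for the example, not values from the lecture.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two unigram distributions (word -> probability)."""
    vocab = set(p) | set(q)  # words missing from one text get probability 0
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))

def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Hypothetical toy distributions (not the real word frequencies of the plays).
p_hamlet = {"the": 0.5, "hamlet": 0.3, "king": 0.2}
p_romeo = {"the": 0.5, "romeo": 0.3, "love": 0.2}

print(euclidean_distance(p_hamlet, p_romeo))
print(cosine_similarity(p_hamlet, p_romeo))
```

Note that Euclidean distance is a dissimilarity (smaller means more alike), while cosine is a similarity (larger means more alike), so the two cannot be swapped without flipping the comparison.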

Cluster similarity

Cluster similarity • Single link: two most similar elements • Complete link: two least similar elements • Group average: average of all members
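The three linkage criteria, and the bottom-up merging they drive, can be sketched as below. This is a rough illustration working with distances (so "most similar" means smallest distance); the function names, the choice to average over cross-cluster pairs only, and the one-dimensional toy data are all assumptions for the example.

```python
from itertools import product

def pair_distances(cluster_a, cluster_b, dist):
    """All distances between one element of each cluster."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b, dist):
    # Distance between the two most similar (closest) elements.
    return min(pair_distances(cluster_a, cluster_b, dist))

def complete_link(cluster_a, cluster_b, dist):
    # Distance between the two least similar (farthest) elements.
    return max(pair_distances(cluster_a, cluster_b, dist))

def group_average(cluster_a, cluster_b, dist):
    # Average over all cross-cluster pairs (one common convention, assumed here).
    distances = pair_distances(cluster_a, cluster_b, dist)
    return sum(distances) / len(distances)

def agglomerate(points, dist, linkage, num_clusters):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates,
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def abs_diff(a, b):
    return abs(a - b)

# Hypothetical one-dimensional data.
points = [0.0, 1.0, 5.0, 6.0, 20.0]
print(agglomerate(points, abs_diff, single_link, 2))
```

Recording the order of merges, rather than stopping at a fixed number of clusters, is what yields the dendrogram.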

Flat Clustering • Partitions the data into a set of K clusters

K-means
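The slide names K-means without spelling out the procedure. As a rough sketch, the standard iteration (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The initialization from the first k points and the toy data are assumptions chosen to keep the example deterministic; practical implementations usually initialize randomly or with k-means++.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm; returns (assignments, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: naive deterministic init
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return assignments, centroids

# Hypothetical two-dimensional data with two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```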

Representation • $x \in \mathbb{R}^F$ [x is a data point characterized by F real numbers, one for each feature] • This is a huge decision that impacts what you can learn

Voting behavior • Yes on abortion access: 1 • Yes on expanding gun rights: 0 • Yes on tax breaks: 0 • Yes on ACA: 1 • Yes on abolishing IRS: 0 • $x \in \mathbb{R}^5$

First letter of last name • Last name starts with < “A”: 0 • Last name starts with < “B”: 0 • Last name starts with < “C”: 1 • Last name starts with < “D”: 1 • …: 1 • Last name starts with < “Z”: 1 • $x \in \mathbb{R}^{26}$
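The two representations above can be built with a few lines of code. This sketch is one reading of the slides: the function names are hypothetical, and the last-name encoding assumes each feature asks whether the first letter sorts strictly before the given letter, which reproduces the 0, 0, 1, 1, … pattern shown for a name beginning with "B".

```python
from string import ascii_uppercase

BILLS = ["abortion access", "expanding gun rights", "tax breaks", "ACA", "abolishing IRS"]

def voting_vector(yes_votes, bills=BILLS):
    """One binary feature per bill: 1 if the legislator voted yes, else 0."""
    return [1 if bill in yes_votes else 0 for bill in bills]

def last_name_vector(last_name):
    """One binary feature per letter: 1 if the first letter sorts before that letter."""
    first = last_name[0].upper()
    return [1 if first < letter else 0 for letter in ascii_uppercase]

x_votes = voting_vector({"abortion access", "ACA"})
x_name = last_name_vector("Bamman")
print(x_votes)
print(x_name)
```

Both vectors describe the same kind of object (a person), yet clustering on one groups people by politics and clustering on the other groups them alphabetically, which is why the choice of representation determines what can be learned.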

Representation
X | task
set of skyscrapers | learn patterns that define architectural styles
set of books | learn patterns that define genre
customer data | learn patterns that suggest “types” of customer behavior

Evaluation • Much more complex than supervised learning since there’s often no notion of “truth”

Internal criteria • Elements within clusters should be more similar to each other • Elements in different clusters should be less similar to each other
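One simple way to check these two criteria is to compare the average distance between points that share a cluster with the average distance between points that do not. The summary below is an assumed illustration rather than a measure named in the lecture, and the data are hypothetical.

```python
from itertools import combinations

def internal_summary(points, labels, dist):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one different-cluster pair exist.
    """
    within, between = [], []
    for (p, label_p), (q, label_q) in combinations(list(zip(points, labels)), 2):
        if label_p == label_q:
            within.append(dist(p, q))
        else:
            between.append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

def abs_diff(a, b):
    return abs(a - b)

# Hypothetical one-dimensional data with a clean two-cluster structure.
points = [0, 1, 10, 11]
labels = [0, 0, 1, 1]
mean_within, mean_between = internal_summary(points, labels, abs_diff)
print(mean_within, mean_between)
```

A clustering that satisfies the internal criteria should show a within-cluster mean well below the between-cluster mean.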

External criteria • How closely does your clustering reproduce another (“gold standard”) clustering?

[Figure: learned clusters A, B, C shown alongside the comparison clusters]

Evaluation: Purity • Learned clusters (as learned by our algorithm): $G = \{g_1 \ldots g_k\}$ • External clusters (from some external source): $C = \{c_1 \ldots c_j\}$ • $\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|$

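[Figure: learned clusters (G) A, B, C compared with external clusters (C)] $\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |g_k \cap c_j|$

[Figure: learned clusters (G) A, B, C compared with external clusters (C)] Purity = (1 + 1 + 2) / 7 = .57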
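Purity can be computed directly from two parallel lists of cluster labels. In this sketch the seven-point assignment is hypothetical: the actual points in the slide's figure are not recoverable, so the labels were chosen only so that the per-cluster maxima come out to 1, 1, and 2 as in the worked example.

```python
from collections import Counter

def purity(learned, external):
    """Fraction of points that fall in the majority external class of their learned cluster."""
    overlap = {}  # learned cluster -> Counter of external labels inside it
    for g, c in zip(learned, external):
        overlap.setdefault(g, Counter())[c] += 1
    return sum(max(counts.values()) for counts in overlap.values()) / len(learned)

# Hypothetical assignment of 7 points (learned label, external label).
learned = ["A", "A", "B", "B", "C", "C", "C"]
external = ["p", "q", "p", "r", "q", "q", "r"]
print(purity(learned, external))
```

One caveat worth keeping in mind: purity rewards many small clusters, and putting every point in its own cluster gives a purity of 1.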

Evaluation: Rand Index • Every pair of data points is either in the same external cluster or it is not, so evaluating a clustering reduces to binary classification over pairs.

Rand Index
pair | same cluster?
Rubio, Paul | 1
Rubio, Cruz | 1
Rubio, Trump | 0
Rubio, Fiorina | 0
Rubio, Clinton | 0
Rubio, Sanders | 0
Paul, Cruz | 1
Paul, Trump | 0

Rand Index • Confusion matrix over pairs: Predicted (ŷ) same cluster / different cluster vs. True (y) same cluster / different cluster • $N(N-1)/2 = 21$ decisions

[Figure: learned clusters and external clusters, with the pairwise confusion matrix of Predicted (ŷ) same/different cluster vs. True (y) same/different cluster]

Rand Index • From the confusion matrix, we can calculate standard measures from binary classification
 | Predicted (ŷ) same cluster | Predicted (ŷ) different cluster
True (y) same cluster | 1 | 4
True (y) different cluster | 4 | 12
• The Rand Index = accuracy = (1 + 12) / 21 = .619
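The pairwise confusion matrix and the Rand Index can be computed by enumerating every pair of points. As before, the seven-point labeling is a hypothetical assignment, chosen so that it yields the same counts as the confusion matrix above (1, 4, 4, 12); it is not taken from the slide's figure.

```python
from itertools import combinations

def pair_confusion(learned, external):
    """Count pairs as (true_pos, false_neg, false_pos, true_neg).

    "Positive" means the pair shares a cluster; the external clustering is the truth.
    """
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(learned)), 2):
        same_pred = learned[i] == learned[j]
        same_true = external[i] == external[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def rand_index(learned, external):
    """Accuracy over all N(N-1)/2 pairwise decisions."""
    tp, fn, fp, tn = pair_confusion(learned, external)
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical assignment of 7 points (learned label, external label).
learned = ["A", "A", "B", "B", "C", "C", "C"]
external = ["p", "q", "p", "r", "q", "q", "r"]
print(pair_confusion(learned, external))
print(rand_index(learned, external))
```

Because the counts form an ordinary binary confusion matrix, precision, recall, and F-score over pairs follow from the same four numbers.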

Example Clustering characters into distinct types

The Villain • Does (agent): kill, hunt, severs, chokes • Has done to them (patient): fights, defeats, refuses • Is described as (attribute): evil, frustrated, lord

The Villain • Is character in the movie “Star Wars” • Science Fiction, Adventure, Space Opera, Fantasy, Family Film, Action • Is played by David Prowse • Male • 42 years old in 1977

Task • Learning character types from textual descriptions of characters.
Data | Source
42,306 movie plot summaries | Wikipedia
15,099 English novels (1700-1899) | HathiTrust

Personas • attribute: dark, major, henchman, warrior, sergeant • agent: shoot, aim, overpower, interrogate, kill • Highest weighted features: Male • Action • War film • Jason Bourne, Bourne Supremacy

Personas • patient: capture, corner, transport, imprison, trap • agent: infiltrate, deduce, leap, evade, obtain • agent: flee, escape, swim, hide, manage • Highest weighted features: Female • Action • Adventure • Ginormica (Monsters vs. Aliens)

Evaluation I: Names • Gold clusters: characters with the same name (sequels, remakes) • Noise: “street thug” • 970 unique character names used twice in the data; n=2,666

Evaluation II: TV Tropes • Gold clusters: manually clustered characters from www.tvtropes.com • “The Surfer Dude” • “Arrogant Kung-Fu Guy” • “Hardboiled Detective” • “The Klutz” • “The Valley Girl” • 72 character tropes containing 501 characters

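Purity: Names [Figure: purity of the Persona Regression and Dirichlet Persona models across the configurations 25x25, 25x50, 25x100, 50x25, 50x50, 50x100; y-axis from 0 to 70]

Purity: TV Tropes [Figure: purity of the Persona Regression and Dirichlet Persona models across the configurations 25x25, 25x50, 25x100, 50x25, 50x50, 50x100; y-axis from 0 to 70]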

Evaluation
X | task
set of skyscrapers | learn patterns that define architectural styles
set of books | learn patterns that define genre
customer data | learn patterns that suggest “types” of customer behavior

Digital Humanities • Marche (2012), Literature Is not Data: Against Digital Humanities • Underwood (2015), Seven ways humanists are using computers to understand text.

Text visualization

Characteristic vocabulary Characteristic words by William Wordsworth (in comparison to other contemporary poets) [Underwood 2015]

Finding and organizing texts • e.g., finding all examples of a complex literary form (Haiku). • Supplement traditional searches: book catalogues, search engines.

Modeling literary forms • What features of a text are predictive of Haiku?

Modeling social boundaries Predicting reviewed texts [Underwood and Sellers (2015)]

Unsupervised modeling
