Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 11 - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 11

Jan-Willem van de Meent (credit: Yijun Zhao, Dave Blei)

slide-2
SLIDE 2

PROJECT GUIDELINES

(updated)

slide-3
SLIDE 3

Project Goals

  • Select a dataset / prediction problem
  • Perform exploratory analysis and preprocessing
  • Apply one or more algorithms
  • Critically evaluate results
  • Submit a report and present project
slide-4
SLIDE 4

Proposals

  • Due: 28 October
  • Presentation: 10 + 5 mins
  • Proposal: 1-2 pages
  • Describe
    • Dataset
    • Prediction task
    • Proposed methods
slide-5
SLIDE 5

Presentation and Report

  • Due: 2 December
  • Presentation
    • 20 mins + 10 mins discussion
  • Report
    • 8-10 pages, 11 pts
  • Code
slide-7
SLIDE 7

Grading

  • Proposal: 15%
  • Problem and Results: 20%
  • Data and Code: 15%
  • Report: 35%
  • Presentation: 15%
slide-8
SLIDE 8

Grading

  • Problem and Results: 20%
    • Novelty of task
    • Own dataset vs UCI dataset
    • Number of algorithms tested
    • Novelty of algorithms
slide-9
SLIDE 9

Grading

  • Data and Code: 15%
    • Documentation and readability
    • TAs should be able to run code
    • Reproducibility (can figures and tables be generated by running code?)

slide-10
SLIDE 10

Grading

  • Report: 35%
    • Exploratory analysis of data
    • Explain how properties of data relate to choice of algorithm
    • Description of algorithms and methodology
    • Discussion of results
      • Which methods work well, which do not, and why?
      • Comparison to state of the art?
slide-11
SLIDE 11

Example: Minimum Viable Project

  • Get 2-3 datasets from the UCI repository
  • Figure out what pre-processing (if any) is needed
  • Run every applicable algorithm in scikit-learn
  • Explain which algorithms work well on which datasets and why
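
A minimal sketch of what such a comparison could look like. The choice of datasets (the copies of the UCI Wine and Breast Cancer Wisconsin data bundled with scikit-learn), the three classifiers, and the helper name `compare` are assumptions made for illustration, not part of the project requirements.

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two UCI datasets that ship with scikit-learn (illustrative choice).
datasets = {
    "wine": load_wine(return_X_y=True),
    "breast_cancer": load_breast_cancer(return_X_y=True),
}

# A few applicable classifiers; scaling is the only pre-processing here.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

def compare(datasets, models, cv=5):
    """Mean cross-validated accuracy for every (dataset, model) pair."""
    results = {}
    for dname, (X, y) in datasets.items():
        for mname, model in models.items():
            results[(dname, mname)] = cross_val_score(model, X, y, cv=cv).mean()
    return results

results = compare(datasets, models)
for (dname, mname), acc in sorted(results.items()):
    print(f"{dname:15s} {mname:15s} {acc:.3f}")
```

The table of scores is only the starting point; the report still has to explain why a method does well or poorly on a given dataset.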
slide-12
SLIDE 12

Example: More Ambitious Projects

  • Find a new dataset or define a novel task (i.e. not classification or clustering)
  • Attack a problem from a Kaggle competition
  • Implement a recently published method (talk to me for suggestions)

slide-13
SLIDE 13

Homework Updates

  • HW3 now due on 2 November (after midterm and proposals)
  • Removed HW5 to give more time to work on projects


slide-14
SLIDE 14

MIDTERM REVIEW

slide-15
SLIDE 15

List of Topics for Midterm

  • Everything up until last Friday (expect final to emphasize later topics)

  • Open book, focus on understanding

http://www.ccs.neu.edu/course/cs6220f16/sec3/midterm-topics.html

slide-16
SLIDE 16

BINOMIAL MIXTURES

slide-17
SLIDE 17

Mixture of Binomials

Suppose we have two weighted coins, A and B, and we want to estimate their biases, i.e. $p_A(\text{head}) = \mu_A$ and $p_B(\text{head}) = \mu_B$.

  • Pick a coin at random (simplified version: an equal mixture)
  • Flip it 10 times and record 'H' and 'T'
  • Repeat the process until we have a training set of reasonable size
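
The data-generating process described above can be sketched as follows. The function name `sample_coin_data` and the example biases 0.8 and 0.3 are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_coin_data(mu_a, mu_b, n_trials, n_flips=10, seed=0):
    """Simulate the two-coin experiment with an equal mixture."""
    rng = np.random.default_rng(seed)
    # Pick coin A (0) or coin B (1) with equal probability for each trial.
    coins = rng.integers(0, 2, size=n_trials)
    mu = np.where(coins == 0, mu_a, mu_b)
    # Number of heads in n_flips flips of the chosen coin.
    heads = rng.binomial(n_flips, mu)
    return coins, heads

# Assumed biases, for illustration only.
coins, heads = sample_coin_data(mu_a=0.8, mu_b=0.3, n_trials=1000)
```

In the estimation problem only `heads` is observed; `coins` is the latent variable.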

slide-18
SLIDE 18

Mixture of Binomials

slide-19
SLIDE 19

Gaussian Mixture Model

Generative Model

Expectation Maximization

Initialize θ

Repeat until convergence

  • 1. Expectation Step
  • 2. Maximization Step
slide-20
SLIDE 20

Binomial Mixture Model

Generative Model

Expectation Maximization

Initialize θ

Repeat until convergence

  • 1. Expectation Step
  • 2. Maximization Step
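
The update equations on this slide did not survive extraction. As a hedged sketch, the standard EM updates for an equal mixture of two binomials look like this; the function name `em_binomial_mixture`, the initial values, and the simulated biases (0.8 and 0.3) are assumptions for illustration.

```python
import numpy as np

def em_binomial_mixture(heads, n_flips, mu_init=(0.6, 0.4), n_iter=100):
    """EM for a two-component binomial mixture with fixed equal weights."""
    heads = np.asarray(heads, dtype=float)[:, None]    # shape (trials, 1)
    mu = np.array(mu_init, dtype=float)                # theta: the two biases
    for _ in range(n_iter):
        # E-step: responsibility of each coin for each trial. The binomial
        # coefficient and the equal mixing weights cancel after normalizing.
        log_lik = heads * np.log(mu) + (n_flips - heads) * np.log(1 - mu)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        gamma = np.exp(log_lik)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted fraction of heads per coin.
        mu = (gamma * heads).sum(axis=0) / (n_flips * gamma.sum(axis=0))
        mu = np.clip(mu, 1e-6, 1 - 1e-6)               # keep the logs finite
    return mu, gamma

# Simulated data with assumed true biases 0.8 (coin A) and 0.3 (coin B).
rng = np.random.default_rng(0)
true_coin = rng.integers(0, 2, size=500)
observed_heads = rng.binomial(10, np.where(true_coin == 0, 0.8, 0.3))
mu_hat, responsibilities = em_binomial_mixture(observed_heads, n_flips=10)
```

Which component ends up as "coin A" depends on the initialization; swapping `mu_init` swaps the labels.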
slide-22
SLIDE 22

TOPIC MODELS

Borrowing from: David Blei (Columbia)

slide-23
SLIDE 23

Review: Naive Bayes

Labels: Spam or not Spam

Features: Words in E-mail

$x_n = (1, \ldots, 1, \ldots)^\top$, with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zymurgy)

Generative Model

Maximum Likelihood

slide-24
SLIDE 24

Review: Naive Bayes

Labels: Spam or not Spam

Features: Words in E-mail

$x_n = (1, \ldots, 1, \ldots)^\top$, with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zymurgy)

Generative Model (with prior)

Maximum Likelihood

Posterior Mean for Parameters
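
The formulas on this slide were lost in extraction. The sketch below shows one common way to compute a posterior-mean estimate for Bernoulli Naive Bayes, assuming a Beta(a, b) prior on each word probability; the prior, the toy vocabulary, and the function names are assumptions for illustration.

```python
import numpy as np

def fit_bernoulli_nb(X, y, a=1.0, b=1.0):
    """Posterior-mean word probabilities under an assumed Beta(a, b) prior."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    class_prior = np.array([(y == c).mean() for c in classes])
    # (count of e-mails in class c containing the word + a) / (N_c + a + b)
    mu = np.array([(X[y == c].sum(axis=0) + a) / ((y == c).sum() + a + b)
                   for c in classes])
    return classes, class_prior, mu

def predict(X, classes, class_prior, mu):
    """Pick the class with the highest log joint probability."""
    X = np.asarray(X, dtype=float)
    log_joint = (np.log(class_prior)
                 + X @ np.log(mu).T
                 + (1 - X) @ np.log(1 - mu).T)
    return classes[np.argmax(log_joint, axis=1)]

# Toy vocabulary: buy, cheap, meeting, report. Label 1 = spam, 0 = not spam.
X_train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y_train = [1, 1, 0, 0]
classes, class_prior, mu = fit_bernoulli_nb(X_train, y_train)
predictions = predict([[1, 0, 0, 0], [0, 0, 1, 0]], classes, class_prior, mu)
```

With a = b = 0 the estimate reduces to maximum likelihood, which assigns probability zero to any word never seen in a class; the prior keeps every probability strictly between 0 and 1.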

slide-25
SLIDE 25

Mixtures of Documents

Observations: Bag of Words

$x_n = (1, \ldots, 1, \ldots)^\top$, with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zymurgy)

Clusters: Types of Documents

slide-26
SLIDE 26

Mixtures of Documents

Observations: Bag of Words

$x_n = (1, \ldots, 1, \ldots)^\top$, with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zymurgy)

Clusters: Types of Documents

Generative Model (with prior)

Maximum Likelihood

How should we modify the generative model?

slide-27
SLIDE 27

Mixtures of Documents

Clusters: Types of Documents

Observations: Bag of Words

$x_n = (1, \ldots, 1, \ldots)^\top$, with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zymurgy)

Generative Model (with prior)

slide-28
SLIDE 28

Topic Modeling

Topics (example word probabilities):

  • gene 0.04, dna 0.02, genetic 0.01, ...
  • life 0.02, evolve 0.01, organism 0.01, ...
  • brain 0.04, neuron 0.02, nerve 0.01, ...
  • data 0.02, number 0.02, computer 0.01, ...

[Figure panels: Topics; Documents; Topic proportions and assignments]

  • Naive Bayes: Documents belong to a class
  • Topic Models: Words belong to a class
slide-29
SLIDE 29

Latent Dirichlet Allocation

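Plate diagram: $\alpha \to \theta_d \to Z_{d,n} \to W_{d,n} \leftarrow \beta_k \leftarrow \eta$, with plates $N$ (words), $D$ (documents), and $K$ (topics)

  • $\alpha$: Proportions parameter
  • $\theta_d$: Per-document topic proportions
  • $Z_{d,n}$: Per-word topic assignment
  • $W_{d,n}$: Observed word
  • $\beta_k$: Topics
  • $\eta$: Topic parameter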
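
As a sketch, the generative process behind this diagram can be simulated directly. Symmetric Dirichlet priors, a fixed document length `N`, and the function name `sample_lda_corpus` are simplifying assumptions for illustration.

```python
import numpy as np

def sample_lda_corpus(D, N, K, V, alpha=0.5, eta=0.1, seed=0):
    """Sample D documents of N words each from the LDA generative model."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)      # topics, shape (K, V)
    theta = rng.dirichlet(np.full(K, alpha), size=D)   # proportions, (D, K)
    z = np.empty((D, N), dtype=int)                    # topic assignments
    w = np.empty((D, N), dtype=int)                    # observed word ids
    for d in range(D):
        z[d] = rng.choice(K, size=N, p=theta[d])
        for n in range(N):
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return beta, theta, z, w

beta, theta, z, w = sample_lda_corpus(D=5, N=20, K=3, V=50)
```

Inference runs in the opposite direction: only `w` is observed, and `beta`, `theta`, and `z` have to be recovered.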

slide-30
SLIDE 30

PLSI/PLSA: EM for LDA

Generative Model (no priors)

Expectation Step

Maximization Step
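
The equations for these steps are missing from the extracted slide. A minimal sketch of the usual PLSA updates on a document-term count matrix follows; the array names, the random initialization, and the toy corpus are assumptions, and every document is assumed to be non-empty.

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """EM for PLSA. counts is a (D, V) matrix of word counts."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(K), size=D)   # p(topic k | doc d), (D, K)
    beta = rng.dirichlet(np.ones(V), size=K)    # p(word v | topic k), (K, V)
    log_liks = []
    for _ in range(n_iter):
        # E-step: phi[d, v, k] is proportional to theta[d, k] * beta[k, v].
        phi = theta[:, None, :] * beta.T[None, :, :]
        phi /= phi.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, :, None] * phi     # shape (D, V, K)
        theta = weighted.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = weighted.sum(axis=0).T
        beta /= beta.sum(axis=1, keepdims=True)
        log_liks.append(float((counts * np.log(theta @ beta)).sum()))
    return theta, beta, log_liks

# Toy corpus: 6 documents over 8 words, all counts positive (assumption).
toy_counts = np.random.default_rng(1).poisson(1.0, size=(6, 8)) + 1
theta_hat, beta_hat, log_liks = plsa_em(toy_counts, K=2)
```

Because there are no priors, this is plain maximum likelihood, and the log-likelihood recorded in `log_liks` should not decrease between iterations.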

slide-31
SLIDE 31

Variational Inference for LDA (sketch)

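Generative Model (LDA): $\alpha \to \theta_d \to Z_{d,n} \to W_{d,n} \leftarrow \beta_k \leftarrow \eta$, with plates $N$, $D$, $K$

Variational Approximation: $\gamma_d \to \theta_d$, $\phi_{d,n} \to Z_{d,n}$, $\lambda_k \to \beta_k$, with plates $N$, $D$, $K$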

slide-33
SLIDE 33

Variational Inference for LDA (sketch)

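One iteration of mean field variational inference for LDA

(1) For each topic $k$ and term $v$:

$$\lambda_{k,v}^{(t+1)} = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} 1(w_{d,n} = v)\, \phi_{d,n,k}^{(t)} \tag{8}$$

(2) For each document $d$:

(a) Update $\gamma_d$:

$$\gamma_{d,k}^{(t+1)} = \alpha_k + \sum_{n=1}^{N} \phi_{d,n,k}^{(t)} \tag{9}$$

(b) For each word $n$, update $\phi_{d,n}$:

$$\phi_{d,n,k}^{(t+1)} \propto \exp\left\{ \Psi\big(\gamma_{d,k}^{(t+1)}\big) + \Psi\big(\lambda_{k,w_{d,n}}^{(t+1)}\big) - \Psi\Big(\sum_{v=1}^{V} \lambda_{k,v}^{(t+1)}\Big) \right\} \tag{10}$$

where $\Psi$ is the digamma function, the first derivative of the $\log \Gamma$ function.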
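
A hedged sketch of one such iteration in NumPy, following updates (8) to (10). Representing documents as arrays of word ids, using scalar `alpha` and `eta`, and the function name `vi_iteration` are assumptions for illustration.

```python
import numpy as np
from scipy.special import digamma

def vi_iteration(docs, phi, alpha, eta, V):
    """One mean-field sweep for LDA.

    docs[d] is an integer array of word ids; phi[d] has shape (N_d, K).
    """
    K = phi[0].shape[1]
    # (8): lambda[k, v] = eta + sum over words equal to v of phi[d, n, k].
    lam_t = np.full((V, K), eta, dtype=float)
    for w, ph in zip(docs, phi):
        np.add.at(lam_t, w, ph)      # handles repeated word ids correctly
    lam = lam_t.T                    # shape (K, V)
    # (9): gamma[d, k] = alpha + sum_n phi[d, n, k].
    gamma = np.array([alpha + ph.sum(axis=0) for ph in phi])
    # (10): phi[d, n, k] is proportional to
    # exp(digamma(gamma[d, k]) + digamma(lam[k, w]) - digamma(sum_v lam[k, v])).
    topic_term = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    new_phi = []
    for d, w in enumerate(docs):
        log_phi = digamma(gamma[d])[None, :] + topic_term[:, w].T
        log_phi -= log_phi.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(log_phi)
        new_phi.append(p / p.sum(axis=1, keepdims=True))
    return lam, gamma, new_phi

# Tiny example: 2 documents, vocabulary of 4 terms, K = 2 topics.
rng = np.random.default_rng(0)
docs = [np.array([0, 1, 1, 2]), np.array([2, 3, 3])]
phi0 = [rng.dirichlet(np.ones(2), size=len(w)) for w in docs]
lam, gamma, phi1 = vi_iteration(docs, phi0, alpha=0.1, eta=0.01, V=4)
```

In practice the sweep is repeated, feeding `phi1` back in, until the parameters stop changing.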

slide-34
SLIDE 34

Example Inference

[Figure: bar plot; x-axis: Topics (1 to 96), y-axis: Probability (0.0 to 0.4)]

slide-35
SLIDE 35

Example Inference

Top words for four topics:

  • human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
  • evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
  • disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
  • computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

slide-36
SLIDE 36

Example Inference

slide-37
SLIDE 37

Example Inference

Top words for four topics:

  • problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
  • model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
  • selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
  • species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem

slide-38
SLIDE 38

Performance Metric: Perplexity

[Figure: perplexity (y-axis) against number of topics (x-axis) for two corpora, Nematode abstracts and Associated Press, comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold in pLSI]

$$\text{perplexity} = \exp\left\{ -\frac{\sum_d \log p(w_d)}{\sum_d N_d} \right\}$$

  • Marginal likelihood (evidence) of held out documents
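
The perplexity formula is simple to compute once the held-out log marginal likelihoods are available. In this sketch the function name and the uniform-unigram sanity check are assumptions for illustration; obtaining $\log p(w_d)$ from an actual topic model is the hard part and is not shown.

```python
import numpy as np

def perplexity(log_probs, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over held-out documents."""
    return float(np.exp(-np.sum(log_probs) / np.sum(doc_lengths)))

# Sanity check with a uniform unigram model over V words: every word has
# probability 1/V, so log p(w_d) = N_d * log(1/V).
V = 1000
doc_lengths = np.array([120, 80, 200])
log_probs = doc_lengths * np.log(1.0 / V)
uniform_perplexity = perplexity(log_probs, doc_lengths)
```

For the uniform model the perplexity equals the vocabulary size, which gives an intuition for the scale: lower perplexity means the model is effectively choosing among fewer words.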
slide-39
SLIDE 39

Extensions of LDA

  • EM inference (PLSA/PLSI) yields similar results to variational inference (LDA) on most data
  • Reason for popularity of LDA: can be embedded in more complicated models

slide-40
SLIDE 40

Extensions: Correlated Topic Model

Estimate a covariance matrix Σ that parameterizes correlations between topics in a document

Plate diagram: $(\mu, \Sigma) \to \eta_d \to Z_{d,n} \to W_{d,n} \leftarrow \beta_k$, with plates $N$, $D$, $K$

Nonconjugate prior on topic proportions
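
The slide does not spell out the prior. In the correlated topic model as usually presented, it is a logistic normal: draw $\eta_d \sim \mathcal{N}(\mu, \Sigma)$ and map it to the simplex with a softmax. The sketch below assumes that form; the function name and the example $\Sigma$ are illustrative.

```python
import numpy as np

def sample_ctm_proportions(mu, Sigma, size, seed=0):
    """Draw topic proportions from an (assumed) logistic-normal prior."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, Sigma, size=size)   # shape (size, K)
    eta = eta - eta.max(axis=1, keepdims=True)            # stable softmax
    theta = np.exp(eta)
    return theta / theta.sum(axis=1, keepdims=True)

# Example: topics 0 and 1 are strongly correlated, topic 2 is independent.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
ctm_theta = sample_ctm_proportions(mu, Sigma, size=2000)
```

A Dirichlet prior cannot express this kind of pairwise correlation, which is what $\Sigma$ adds; the price is that the prior is no longer conjugate to the multinomial.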
slide-41
SLIDE 41

Extensions: Dynamic Topic Models

Track changes in word distributions associated with a topic over time.

Inaugural addresses:

1789: "AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order..."

2009: "My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors..."

slide-42
SLIDE 42

Extensions: Dynamic Topic Models

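Plate diagram: one LDA model per time slice ($\alpha \to \theta_d \to Z_{d,n} \to W_{d,n}$, with plates $N$ and $D$), with the topics chained over time: $\beta_{k,1} \to \beta_{k,2} \to \ldots \to \beta_{k,T}$ (plate $K$)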

slide-44
SLIDE 44

Extensions: Dynamic Topic Models

Top words of one topic, by decade:

  • 1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
  • 1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
  • 1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
  • 1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
  • 1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
  • 1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
  • 1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
  • 1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
  • 1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
  • 1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
  • 1980: high, power, design, heat, system, systems, devices, instruments, control, large
  • 1990: materials, high, power, current, applications, technology, devices, design, device, heat
  • 2000: devices, device, materials, current, gate, high, light, silicon, material, technology

slide-45
SLIDE 45

Extensions: Dynamic Topic Models

[Figure: two panels with a time axis from 1880 to 2000, tracking individual words within a topic: "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON)]

slide-46
SLIDE 46

Extensions: Supervised LDA

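Plate diagram: $\alpha \to \theta_d \to Z_{d,n} \to W_{d,n} \leftarrow \beta_k$, with the response $Y_d$ depending on the topic assignments and on $\eta, \sigma^2$ (plates $N$, $D$, $K$)

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word:
  • Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
  • Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$.
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.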
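
The three steps above can be sketched for a single document as follows. The function name, the toy topics, and the parameter values are assumptions for illustration; $z_n$ is treated as a one-hot vector, so $\bar{z}$ is the vector of empirical topic frequencies.

```python
import numpy as np

def sample_slda_document(N, beta, alpha, eta, sigma2, rng):
    """Sample one document and its response from the sLDA generative model."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                          # step 1
    z = rng.choice(K, size=N, p=theta)                    # step 2, topics
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # step 2, words
    z_bar = np.bincount(z, minlength=K) / N               # mean of one-hot z_n
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))          # step 3
    return theta, z, w, z_bar, y

rng = np.random.default_rng(0)
toy_beta = rng.dirichlet(np.ones(30), size=4)             # 4 topics, 30 words
theta_d, z_d, w_d, z_bar_d, y_d = sample_slda_document(
    N=50, beta=toy_beta, alpha=np.full(4, 0.5),
    eta=np.array([-2.0, -0.5, 0.5, 2.0]), sigma2=0.25, rng=rng)
```

The response depends on the document only through the realized topic frequencies `z_bar`, so each coefficient in `eta` can be read as the effect of one topic on the response.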

slide-47
SLIDE 47

Extensions: Supervised LDA

[Figure: ten topics, each shown by its top words, placed along a horizontal axis with ticks from −30 to 20]

  • both, motion, simple, perfect, fascinating, power, complex
  • however, cinematography, screenplay, performances, pictures, effective, picture
  • his, their, character, many, while, performance, between
  • more, has, than, films, director, will, characters
  • one, from, there, which, who, much, what
  • awful, featuring, routine, dry, offered, charlie, paris
  • not, about, movie, all, would, they, its
  • have, like, you, was, just, some, out
  • bad, guys, watchable, its, not, one, movie
  • least, problem, unfortunately, supposed, worse, flat, dull

slide-48
SLIDE 48

Extensions: Ideal Point Topic Models

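Plate diagram variables: $\alpha$, $\theta_d$, $Z_{dn}$, $W_{dn}$, $\beta_k$, $\eta$ (plates $N$, $D$, $K$); $X_u$, $\sigma_u^2$ (plate $U$); $A_d$, $B_d$, $\sigma_d^2$; $V_{ud}$

  • $X_u$: Legislator ideal points
  • $V_{ud}$: Observed votes
  • $\theta_d$, $Z_{dn}$, $W_{dn}$, $\beta_k$: Bill content (topic model)
  • $A_d$, $B_d$: Bill sentiment variables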

slide-49
SLIDE 49

Extensions: Ideal Point Topic Models

Top terms per topic:

  • dod, defense, defense and appropriation, military, subtitle
  • veteran, veterans, bills, care, injury
  • people, woman, american, nation, school
  • producer, eligible, crop, farm, subparagraph
  • coin, inspector, designee, automobile, lebanon
  • bills, iran, official, company, sudan
  • human, vietnam, united nations, call, people
  • drug, pediatric, product, device, medical
  • child, fire, attorney, internet, bills
  • surveillance, director, court, electronic, flood
  • energy, bills, price, commodity, market
  • land, site, bills, interior, river
  • child, center, poison, victim, abuse
  • coast guard, vessel, space, administrator, requires
  • science, director, technology, mathematics, bills
  • computer, alien, bills, user, collection
  • head, start, child, technology, award
  • loss, crop, producer, agriculture, trade
  • bills, tax, subparagraph, loss, taxable
  • cover, bills, bridge, transaction, following
  • transportation, rail, railroad, passenger, homeland security
  • business, administrator, bills, business concern, loan
  • defense, iraq, transfer, expense, chapter
  • medicare, medicaid, child, chip, coverage
  • student, loan, institution, lender, school
  • energy, fuel, standard, administrator, lamp
  • housing, mortgage, loan, family, recipient
  • bank, transfer, requires, holding company, industrial
  • county, eligible, ballot, election, jurisdiction
  • tax credit, budget authority, energy, outlays, tax