Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 11

Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 11 Jan-Willem van de Meent (credit: Yijun Zhao, Dave Blei)

PROJECT GUIDELINES ( updated )

Project Goals • Select a dataset / prediction problem • Perform exploratory analysis and preprocessing • Apply one or more algorithms • Critically evaluate results • Submit a report and present project

Proposals • Due: 28 October • Presentation: 10+5 mins • Proposal: 1-2 pages • Describe: dataset, prediction task, proposed methods

Presentation and Report • Due: 2 December • Presentation: 20 mins + 10 mins discussion • Report: 8-10 pages, 11 pt • Code

Grading • Proposal: 15% • Problem and Results: 20% • Data and Code: 15% • Report: 35% • Presentation: 15%

Grading • Problem and Results: 20% • Novelty of task • Own dataset vs UCI dataset • Number of algorithms tested • Novelty of algorithms

Grading • Data and Code: 15% • Documentation and readability • TAs should be able to run code • Reproducibility (can figures and tables be generated by running code?)

Grading • Report: 35% • Exploratory analysis of data • Explain how properties of data relate to choice of algorithm • Description of algorithms and methodology • Discussion of results • Which methods work well, which do not, and why? • Comparison to state of the art?

Example: Minimum Viable Project • Get 2-3 datasets from the UCI repository • Figure out what pre-processing (if any) is needed • Run every applicable algorithm in scikit-learn • Explain which algorithms work well on which datasets and why

Example: More Ambitious Projects • Find a new dataset or define a novel task (i.e. not classification or clustering) • Attack a problem from a Kaggle competition • Implement a recently published method (talk to me for suggestions)

Homework Updates • HW3 now due on 2 November (after midterm and proposals) • Removed HW5 to give more time to work on projects

MIDTERM REVIEW

List of Topics for Midterm http://www.ccs.neu.edu/course/cs6220f16/sec3/midterm-topics.html • Everything up until last Friday   (expect final to emphasize later topics) • Open book, focus on understanding

BINOMIAL MIXTURES

Mixture of Binomials • Suppose we have two weighted coins, A and B, and we want to estimate the bias of each: p_A(head) = µ_A, p_B(head) = µ_B • Pick a coin at random (simplified version: an equal mixture) • Flip it 10 times and record 'H' and 'T' • Repeat the process until we have a training set of reasonable size
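The data-generating process described on this slide can be sketched in a few lines of Python. The bias values 0.8 and 0.3, the function name, and the seed are illustrative assumptions, not values from the lecture.

```python
import random

def sample_dataset(mu_a, mu_b, n_trials, n_flips=10, seed=0):
    """Repeat n_trials times: pick coin A or B with equal probability,
    flip it n_flips times, and record the sequence of 'H' and 'T'."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_trials):
        coin = rng.choice("AB")                 # equal mixture of the two coins
        mu = mu_a if coin == "A" else mu_b      # probability of heads for this coin
        flips = "".join("H" if rng.random() < mu else "T" for _ in range(n_flips))
        data.append((coin, flips))
    return data

# Hypothetical biases, chosen only for illustration.
rounds = sample_dataset(mu_a=0.8, mu_b=0.3, n_trials=5)
for coin, flips in rounds:
    print(coin, flips, flips.count("H"))
```

In the estimation problem the coin identity is hidden; only the flip sequences (or their head counts) are observed.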

Mixture of Binomials

Gaussian Mixture Model • Generative Model • Expectation Maximization: initialize θ, then repeat until convergence: 1. Expectation Step 2. Maximization Step

Binomial Mixture Model • Generative Model • Expectation Maximization: initialize θ, then repeat until convergence: 1. Expectation Step 2. Maximization Step
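The slide's update equations did not survive extraction, so the following is a minimal sketch of EM for the two-coin setting under the stated simplification (equal mixing weights). The head counts and starting values are made up for illustration, and the function name is an assumption.

```python
def em_two_coins(heads, n_flips, mu_a, mu_b, n_iter=50):
    """EM for an equal-weight mixture of two binomials.
    heads: number of heads observed in each round of n_flips flips."""
    for _ in range(n_iter):
        # E-step: posterior probability that each round used coin A.
        # The binomial coefficient and the equal mixing weights cancel.
        resp_a = []
        for h in heads:
            like_a = mu_a ** h * (1 - mu_a) ** (n_flips - h)
            like_b = mu_b ** h * (1 - mu_b) ** (n_flips - h)
            resp_a.append(like_a / (like_a + like_b))
        # M-step: responsibility-weighted fraction of heads for each coin.
        weight_a = sum(resp_a)
        weight_b = len(heads) - weight_a
        mu_a = sum(r * h for r, h in zip(resp_a, heads)) / (n_flips * weight_a)
        mu_b = sum((1 - r) * h for r, h in zip(resp_a, heads)) / (n_flips * weight_b)
    return mu_a, mu_b

# Illustrative data: heads out of 10 flips in five rounds.
mu_a, mu_b = em_two_coins(heads=[5, 9, 8, 4, 7], n_flips=10, mu_a=0.6, mu_b=0.5)
print(round(mu_a, 2), round(mu_b, 2))
```

A fixed iteration count stands in for a convergence check here; in practice one would monitor the change in log-likelihood.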

TOPIC MODELS Borrowing from: David Blei (Columbia)

Review: Naive Bayes • Features: Words in E-mail, x_n = (a: 1, aardvark: 0, aardwolf: 0, ..., buy: 1, ..., zygmurgy: 0) • Labels: Spam or not Spam • Generative Model • Maximum Likelihood

Review: Naive Bayes • Features: Words in E-mail, x_n = (a: 1, aardvark: 0, aardwolf: 0, ..., buy: 1, ..., zygmurgy: 0) • Labels: Spam or not Spam • Generative Model (with prior) • Maximum Likelihood • Posterior Mean for Parameters
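The contrast between the maximum likelihood estimate and the posterior mean can be illustrated with a small sketch. It assumes binary word indicators with a Beta(a, b) prior on each per-class word probability; the prior choice, the function name, and the toy e-mails are assumptions for illustration, since the slide's equations were not preserved.

```python
def word_probabilities(docs, vocab, a=1.0, b=1.0):
    """Estimate P(word present | class) from the e-mails of one class.
    docs: list of sets of words. Returns (maximum likelihood estimates,
    posterior means under an assumed Beta(a, b) prior)."""
    n = len(docs)
    max_likelihood, posterior_mean = {}, {}
    for word in vocab:
        count = sum(word in doc for doc in docs)   # e-mails containing the word
        max_likelihood[word] = count / n
        posterior_mean[word] = (count + a) / (n + a + b)
    return max_likelihood, posterior_mean

# Three toy spam e-mails over a four-word vocabulary.
spam = [{"buy", "a"}, {"buy"}, {"a", "aardvark"}]
ml, pm = word_probabilities(spam, ["a", "aardvark", "buy", "zygmurgy"])
print(ml["zygmurgy"], pm["zygmurgy"])
```

A word never seen in the class gets probability zero under maximum likelihood, which would zero out the whole class likelihood for any e-mail containing it; the posterior mean keeps it strictly positive.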

Mixtures of Documents • Observations: Bag of Words, x_n = word counts over the vocabulary (a, aardvark, aardwolf, ..., buy, ..., zygmurgy), with counts such as 24, 1 and 4 replacing the binary indicators • Clusters: Types of Documents

Mixtures of Documents • Observations: Bag of Words, x_n = word counts over the vocabulary (a, aardvark, aardwolf, ..., buy, ..., zygmurgy), with counts such as 24, 1 and 4 replacing the binary indicators • Clusters: Types of Documents • Generative Model (with prior) • Maximum Likelihood • How should we modify the generative model?

Mixtures of Documents • Observations: Bag of Words, x_n = word counts over the vocabulary (a, aardvark, aardwolf, ..., buy, ..., zygmurgy), with counts such as 24, 1 and 4 replacing the binary indicators • Clusters: Types of Documents • Generative Model (with prior)

Topic Modeling • Topics (word probabilities): (gene 0.04, dna 0.02, genetic 0.01, ...); (life 0.02, evolve 0.01, organism 0.01, ...); (brain 0.04, neuron 0.02, nerve 0.01, ...); (data 0.02, number 0.02, computer 0.01, ...) • Documents • Topic proportions and assignments • Naive Bayes: Documents belong to a class • Topic Models: Words belong to a class

Latent Dirichlet Allocation [Plate diagram: α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η; plates N, D, K] • α: proportions parameter • θ_d: per-document topic proportions • z_{d,n}: per-word topic assignment • w_{d,n}: observed word • β_k: topics • η: topic parameter
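The generative process encoded by the plate diagram can be sketched as a sampler. This is a minimal illustration using symmetric Dirichlet priors; the vocabulary, hyperparameter values, and function names are assumptions, not taken from the slides.

```python
import random

def dirichlet(rng, alpha):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    rng = random.Random(seed)
    # beta_k ~ Dirichlet(eta): one distribution over words per topic.
    beta = [dirichlet(rng, [eta] * len(vocab)) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions for this document.
        theta = dirichlet(rng, [alpha] * n_topics)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # z_{d,n} ~ Discrete(theta_d)
            w = rng.choices(vocab, weights=beta[z])[0]          # w_{d,n} ~ Discrete(beta_z)
            words.append(w)
        docs.append(words)
    return beta, docs

# Hypothetical six-word vocabulary, for illustration only.
vocab = ["gene", "dna", "life", "brain", "data", "number"]
beta, docs = sample_corpus(vocab, n_topics=2, n_docs=3, doc_len=8)
for doc in docs:
    print(" ".join(doc))
```

Each word draws its own topic assignment, which is what separates this model from a mixture of documents, where one cluster is drawn per document.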

PLSI/PLSA: EM for LDA • Generative Model (no priors) • Expectation Step • Maximization Step
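The equations on this slide were not preserved. As a sketch, the standard EM updates for the model without priors can be written in the notation of the LDA slides, where φ_{d,n,k} denotes the posterior probability that word n of document d was generated by topic k and N_d is the length of document d (both symbols are introduced here for illustration):

```latex
% E-step: posterior over the topic assignment of each word
\phi_{d,n,k} = \frac{\theta_{d,k} \, \beta_{k, w_{d,n}}}
                    {\sum_{j=1}^{K} \theta_{d,j} \, \beta_{j, w_{d,n}}}

% M-step: re-estimate topic proportions and topics from expected counts
\theta_{d,k} = \frac{1}{N_d} \sum_{n=1}^{N_d} \phi_{d,n,k}
\qquad
\beta_{k,v} = \frac{\sum_{d=1}^{D} \sum_{n=1}^{N_d} 1(w_{d,n} = v) \, \phi_{d,n,k}}
                   {\sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,k}}
```

Both M-step updates are normalized expected counts, so each θ_d sums to one over topics and each β_k sums to one over the vocabulary.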

Variational Inference for LDA (sketch) • Generative Model (LDA): α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η; plates N, D, K • Variational Approximation: γ_d → θ_d, φ_{d,n} → z_{d,n}, λ_k → β_k; plates N, D, K

Variational Inference for LDA (sketch)
One iteration of mean field variational inference for LDA:
(1) For each topic k and term v:
    \lambda_{k,v}^{(t+1)} = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} 1(w_{d,n} = v) \, \phi_{d,n,k}^{(t)}    (8)
(2) For each document d:
  (a) Update \gamma_d:
    \gamma_{d,k}^{(t+1)} = \alpha_k + \sum_{n=1}^{N} \phi_{d,n,k}^{(t)}    (9)
  (b) For each word n, update \phi_{d,n}:
    \phi_{d,n,k}^{(t+1)} \propto \exp\{ \Psi(\gamma_{d,k}^{(t+1)}) + \Psi(\lambda_{k,w_n}^{(t+1)}) - \Psi(\sum_{v=1}^{V} \lambda_{k,v}^{(t+1)}) \}    (10)
where Ψ is the digamma function, the first derivative of the log Γ function.
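The three updates on this slide can be sketched in plain Python. This is an illustrative, unoptimized version: the digamma helper is a standard recurrence-plus-asymptotic-series approximation written here because the standard library has none, and the toy corpus, hyperparameters, and function names are assumptions.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vi_iteration(docs, phi, n_topics, vocab_size, alpha, eta):
    """docs[d][n] is a word id; phi[d][n][k] is the current q(z_{d,n} = k)."""
    # (1) lambda_{k,v} = eta + sum_d sum_n 1(w_{d,n} = v) * phi_{d,n,k}
    lam = [[eta] * vocab_size for _ in range(n_topics)]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            for k in range(n_topics):
                lam[k][w] += phi[d][n][k]
    psi_lam_total = [digamma(sum(row)) for row in lam]
    gamma, new_phi = [], []
    for d, doc in enumerate(docs):
        # (2a) gamma_{d,k} = alpha_k + sum_n phi_{d,n,k}
        g = [alpha[k] + sum(phi[d][n][k] for n in range(len(doc)))
             for k in range(n_topics)]
        gamma.append(g)
        # (2b) phi_{d,n,k} proportional to
        #      exp(psi(gamma_{d,k}) + psi(lambda_{k,w_n}) - psi(sum_v lambda_{k,v}))
        doc_phi = []
        for w in doc:
            logits = [digamma(g[k]) + digamma(lam[k][w]) - psi_lam_total[k]
                      for k in range(n_topics)]
            top = max(logits)                      # subtract max for numerical stability
            unnorm = [math.exp(l - top) for l in logits]
            norm = sum(unnorm)
            doc_phi.append([u / norm for u in unnorm])
        new_phi.append(doc_phi)
    return lam, gamma, new_phi

# Toy corpus of word ids and an arbitrary asymmetric initialization of phi.
docs = [[0, 1, 0], [2, 3, 3]]
phi = [[[0.6, 0.4] for _ in doc] for doc in docs]
lam, gamma, phi = vi_iteration(docs, phi, n_topics=2, vocab_size=4,
                               alpha=[0.5, 0.5], eta=0.1)
print(gamma)
```

Calling `vi_iteration` repeatedly, feeding the returned `phi` back in, gives the full coordinate ascent loop; a real implementation would also track the evidence lower bound to decide when to stop.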

Example Inference [Figure: bar chart of probability (y-axis, 0.0 to 0.4) by topic (x-axis, topics 1 to 96)]

Example Inference
Topic 1: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
Topic 2: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
Topic 3: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
Topic 4: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

Example Inference

Example Inference
Topic 1: problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
Topic 2: model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
Topic 3: selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
Topic 4: species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem

Performance Metric: Perplexity
[Figure: perplexity (y-axis) vs. number of topics (x-axis) on two corpora, Nematode abstracts and Associated Press, comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold in pLSI]
perplexity = \exp\left\{ - \frac{\sum_d \log p(w_d)}{\sum_d N_d} \right\}
where p(w_d) is the marginal likelihood (evidence) of held-out document d and N_d is its number of words.
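The perplexity formula translates directly into code. The sketch below assumes the per-document held-out log-likelihoods have already been computed by some model; the uniform model in the sanity check and all names are illustrative assumptions.

```python
import math

def perplexity(log_probs, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over held-out documents.
    log_probs[d] is log p(w_d); doc_lengths[d] is the word count N_d."""
    return math.exp(-sum(log_probs) / sum(doc_lengths))

# Sanity check: a model that assigns every word probability 1/V
# should have perplexity equal to the vocabulary size V.
vocab_size = 1000
lengths = [50, 120]
log_probs = [n * math.log(1.0 / vocab_size) for n in lengths]
print(perplexity(log_probs, lengths))
```

Perplexity can be read as the effective number of equally likely words the model is choosing among at each position, so lower is better.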

Extensions of LDA • EM inference (PLSA/PLSI) yields similar results to variational inference (LDA) on most data • Reason for popularity of LDA: can be embedded in more complicated models

Extensions: Correlated Topic Model [Plate diagram: µ, Σ → η_d → z_{d,n} → w_{d,n} ← β_k; plates N, D, K] • Nonconjugate prior on topic proportions • Estimate a covariance matrix Σ that parameterizes correlations between topics in a document

Extensions: Dynamic Topic Models
Inaugural addresses, 1789 to 2009
1789: "AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order..."
2009: "My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors..."
Track changes in word distributions associated with a topic over time.
