SLIDE 1

Classification with generative models 2

DSE 210

Classification with parametrized models

Classifiers with a fixed number of parameters can represent a limited set of functions. Learning a model is about picking a good approximation.

Typically the x’s are points in d-dimensional Euclidean space, R^d. Two ways to classify:

  • Generative: model the individual classes.
  • Discriminative: model the decision boundary between the classes.
SLIDE 2

The Bayes-optimal prediction

[Figure: a mixture density Pr(x) built from P1(x), P2(x), P3(x) with weights π1 = 10%, π2 = 50%, π3 = 40%.]

Labels Y = {1, 2, . . . , k}, density Pr(x) = π1P1(x) + · · · + πkPk(x). For any x ∈ X and any label j,

Pr(y = j | x) = Pr(y = j) Pr(x | y = j) / Pr(x) = πjPj(x) / (π1P1(x) + · · · + πkPk(x))

Bayes-optimal prediction: h∗(x) = arg max_j πjPj(x).
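As a concrete illustration, here is a minimal Python sketch of Bayes’ rule and the Bayes-optimal prediction at a single point x (the class weights and density values are made up for the example):

    import numpy as np

    # Made-up class weights pi_j and class-conditional densities P_j
    # evaluated at one query point x.
    pi = np.array([0.1, 0.5, 0.4])          # pi_1, pi_2, pi_3
    P_at_x = np.array([0.05, 0.40, 0.20])   # P_1(x), P_2(x), P_3(x)

    # Bayes' rule: Pr(y = j | x) = pi_j P_j(x) / sum_i pi_i P_i(x)
    posterior = pi * P_at_x / np.sum(pi * P_at_x)
    print(posterior)                        # [0.0175..., 0.7017..., 0.2807...]

    # Bayes-optimal prediction: h*(x) = argmax_j pi_j P_j(x)
    print(1 + np.argmax(pi * P_at_x))       # 2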

The winery prediction problem

Which winery is it from: 1, 2, or 3? Using one feature (’Alcohol’), the error rate is 29%. What if we use two features?

SLIDE 3

The data set, again

Training set obtained from 130 bottles

  • Winery 1: 43 bottles
  • Winery 2: 51 bottles
  • Winery 3: 36 bottles
  • For each bottle, 13 features:

’Alcohol’, ’Malic acid’, ’Ash’, ’Alcalinity of ash’, ’Magnesium’, ’Total phenols’, ’Flavanoids’, ’Nonflavanoid phenols’, ’Proanthocyanins’, ’Color intensity’, ’Hue’, ’OD280/OD315 of diluted wines’, ’Proline’

Also, a separate test set of 48 labeled points. This time: ’Alcohol’ and ’Flavanoids’.

Why it helps to add features

Better separation between the classes! Error rate drops from 29% to 8%.

SLIDE 4

Bivariate distributions

Simplest option: treat each variable as independent. Example: for a large collection of people, measure the two variables H = height and W = weight. Independence would mean Pr(H = h, W = w) = Pr(H = h) Pr(W = w), which would also imply E(HW) = E(H)E(W). Is this an accurate approximation? No: we’d expect height and weight to be positively correlated.

Types of correlation

[Figure: three scatter plots.]

  • H, W positively correlated. This also implies E(HW) > E(H)E(W).
  • X, Y negatively correlated.
  • X, Y uncorrelated.

SLIDE 5

Pearson (1903): fathers and sons

[Figure: “Heights of fathers and their full grown sons”: scatter plot of father’s height (inches) vs. son’s height (inches), both ranging over 58–78.]

How to quantify the degree of correlation?

Correlation pictures

[Figure: sample scatter plots with correlation r = 0, 1, 0.25, −0.25, 0.5, −0.5, 0.75, −0.75.]

SLIDE 6

Covariance and correlation

Suppose X has mean µX and Y has mean µY .

  • Covariance

cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y

This is maximized when X = Y, in which case it is var(X). In general, it is at most std(X) std(Y).

  • Correlation

corr(X, Y) = cov(X, Y) / (std(X) std(Y))

This is always in the range [−1, 1].
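As a quick check of these formulas, here is a small numpy sketch of mine, on synthetic data rather than anything from the course:

    import numpy as np

    # Synthetic positively correlated sample (X, Y).
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 0.5 * x + rng.normal(scale=0.5, size=1000)

    cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - mu_X mu_Y
    corr_xy = cov_xy / (np.std(x) * np.std(y))          # always in [-1, 1]

    # np.corrcoef computes the same correlation (its internal
    # 1/(m-1) normalizations cancel in the ratio).
    print(corr_xy, np.corrcoef(x, y)[0, 1])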

Covariance and correlation: example 1

cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y,  corr(X, Y) = cov(X, Y) / (std(X) std(Y))

  x    y    Pr(x, y)
 −1   −1      1/3
 −1    1      1/6
  1   −1      1/3
  1    1      1/6

µ_X = 0, µ_Y = −1/3, var(X) = 1, var(Y) = 8/9, cov(X, Y) = 0, corr(X, Y) = 0.

In this case, X, Y are independent. Independent variables always have zero covariance and correlation.

SLIDE 7

Covariance and correlation: example 2

cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y,  corr(X, Y) = cov(X, Y) / (std(X) std(Y))

  x     y    Pr(x, y)
 −1   −10      1/6
 −1    10      1/3
  1   −10      1/3
  1    10      1/6

µ_X = 0, µ_Y = 0, var(X) = 1, var(Y) = 100, cov(X, Y) = −10/3, corr(X, Y) = −1/3.

In this case, X and Y are negatively correlated.

Return to winery example

Better separation between the classes! Error rate drops from 29% to 8%.

SLIDE 8

The bivariate Gaussian

Model class 1 by a bivariate Gaussian, parametrized by:

mean µ = (13.7, 3.0) and covariance matrix Σ = [ 0.20  0.06
                                                 0.06  0.12 ]

The bivariate (2-d) Gaussian

A distribution over (x1, x2) ∈ R², parametrized by:

  • Mean (µ1, µ2) ∈ R², where µ1 = E(X1) and µ2 = E(X2)
  • Covariance matrix Σ = [ Σ11  Σ12
                            Σ21  Σ22 ]
    where Σ11 = var(X1), Σ22 = var(X2), Σ12 = Σ21 = cov(X1, X2).

Density is highest at the mean, and falls off in ellipsoidal contours.
SLIDE 9

Density of the bivariate Gaussian

  • Mean (µ1, µ2) ∈ R², where µ1 = E(X1) and µ2 = E(X2)
  • Covariance matrix Σ = [ Σ11  Σ12
                            Σ21  Σ22 ]
  • Density, writing x = (x1, x2) and µ = (µ1, µ2):

    p(x1, x2) = (1 / (2π |Σ|^(1/2))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
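As a sanity check of this formula, scipy can evaluate the same density; the sketch below (mine, using the winery class-1 parameters from the previous slide and a made-up query point) compares the two:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([13.7, 3.0])
    Sigma = np.array([[0.20, 0.06],
                      [0.06, 0.12]])
    x = np.array([13.5, 2.8])          # arbitrary query point

    # Library evaluation of p(x).
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

    # Direct evaluation of the formula above (d = 2).
    diff = x - mu
    manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
        2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    print(manual)                      # same value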

Bivariate Gaussian: examples

In either case, the mean is (1, 1).

  • Σ = [ 4  0
          0  1 ]

  • Σ = [ 4    1.5
          1.5  1 ]

SLIDE 10

The decision boundary

Go from 1 to 2 features: error rate goes from 29% to 8%. What kind of function is this? And, can we use more features?

SLIDE 11

DSE 210: Probability and statistics Winter 2018

Worksheet 6 — Generative models 2

  • 1. Would you expect the following pairs of random variables to be uncorrelated, positively correlated, or negatively correlated?
    (a) The weight of a new car and its price.
    (b) The weight of a car and the number of seats in it.
    (c) The age in years of a second-hand car and its current market value.

  • 2. Consider a population of married couples in which every wife is exactly 0.9 of her husband’s age. What is the correlation between husband’s age and wife’s age?

  • 3. Each of the following scenarios describes a joint distribution (x, y). In each case, give the parameters of the (unique) bivariate Gaussian that satisfies these properties.
    (a) x has mean 2 and standard deviation 1, y has mean 2 and standard deviation 0.5, and the correlation between x and y is −0.5.
    (b) x has mean 1 and standard deviation 1, and y is equal to x.

  • 4. Roughly sketch the shapes of the following Gaussians N(µ, Σ). For each, you only need to show a representative contour line which is qualitatively accurate (has approximately the right orientation, for instance).
    (a) µ = (0, 0) and Σ = [ 9  0
                             0  1 ]
    (b) µ = (0, 0) and Σ = [ 1      −0.75
                             −0.75   1 ]

  • 5. For each of the two Gaussians in the previous problem, check your answer using Python: draw 100 random samples from that Gaussian and plot them.

SLIDE 12

Linear algebra primer

DSE 210

Data as vectors and matrices

[Figure: a data set displayed as a matrix, one data point per row.]

SLIDE 13

Matrix-vector notation

Vector x ∈ R^d:

x = [ x1
      x2
      ⋮
      xd ]

Matrix M ∈ R^(r×d):

M = [ M11  M12  · · ·  M1d
      M21  M22  · · ·  M2d
       ⋮    ⋮    ⋱     ⋮
      Mr1  Mr2  · · ·  Mrd ]

Mij = entry at row i, column j

Transpose of vectors and matrices

x = [ 1
      6
      3 ]  has transpose xᵀ = [ 1  6  3 ].

M = [ 1  2  4  3  9
      1  6  8  7  2 ]  has transpose Mᵀ = [ 1  1
                                            2  6
                                            4  8
                                            3  7
                                            9  2 ].

  • (AT)ij = Aji
  • (AT)T = A
SLIDE 14

Adding and subtracting vectors and matrices

Dot product of two vectors

Dot product of vectors x, y ∈ R^d: x · y = x1y1 + x2y2 + · · · + xdyd. What is the dot product between these two vectors?

[Figure: two vectors x and y drawn in the plane, with coordinates 1–4 marked on each axis.]

SLIDE 15

Dot products and angles

Dot product of vectors x, y ∈ R^d: x · y = x1y1 + x2y2 + · · · + xdyd. It tells us the angle θ between x and y:

cos θ = (x · y) / (‖x‖ ‖y‖)

x is orthogonal (at right angles) to y if and only if x · y = 0. When x, y are unit vectors (length 1): cos θ = x · y. What is x · x?
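A small numpy sketch of these facts (the vectors are my own examples):

    import numpy as np

    x = np.array([1.0, 2.0])
    y = np.array([2.0, -1.0])
    print(np.dot(x, y))     # 0.0, so x and y are orthogonal

    u = np.array([1.0, 0.0])
    v = np.array([1.0, 1.0])
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(np.degrees(np.arccos(cos_theta)))   # 45.0 degrees between u and v

    print(np.dot(x, x))     # x . x = ||x||^2 = 5.0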

Linear and quadratic functions

In one dimension:

  • Linear: f(x) = 3x + 2
  • Quadratic: f(x) = 4x² − 2x + 6

In higher dimension, e.g. x = (x1, x2, x3):

  • Linear: 3x1 − 2x2 + x3 + 4
  • Quadratic: x1² − 2x1x3 + 6x2² + 7x1 + 9

SLIDE 16

Linear functions and dot products

Linear separator 4x1 + 3x2 = 12:

[Figure: the line 4x1 + 3x2 = 12 plotted in the plane.]

For x = (x1, . . . , xd) ∈ R^d, linear separators are of the form: w1x1 + w2x2 + · · · + wdxd = c. Can write as w · x = c, for w = (w1, . . . , wd).

More general linear functions

A linear function from R⁴ to R: f(x1, x2, x3, x4) = 3x1 − 2x3.

A linear function from R⁴ to R³: f(x1, x2, x3, x4) = (4x1 − x2, x3, −x1 + 6x4).

SLIDE 17

Matrix-vector product

Product of matrix M ∈ R^(r×d) and vector x ∈ R^d: a vector Mx ∈ R^r whose ith entry is the dot product of the ith row of M with x, i.e. (Mx)i = Mi1x1 + · · · + Midxd.

The identity matrix

The d × d identity matrix Id sends each x ∈ R^d to itself: Id x = x.

Id = [ 1  0  · · ·  0
       0  1  · · ·  0
       ⋮  ⋮   ⋱    ⋮
       0  0  · · ·  1 ]

SLIDE 18

Matrix-matrix product

Product of matrix A ∈ R^(r×k) and matrix B ∈ R^(k×p): an r × p matrix AB.

SLIDE 19

Matrix products

If A ∈ R^(r×k) and B ∈ R^(k×p), then AB is an r × p matrix with (i, j) entry

(AB)ij = (dot product of ith row of A and jth column of B) = Ai1B1j + Ai2B2j + · · · + AikBkj

  • Ik B = B and A Ik = A
  • Can check: (AB)ᵀ = BᵀAᵀ
  • For two vectors u, v ∈ R^d, what is uᵀv?

Some special cases

For vector x ∈ R^d, what are xᵀx and xxᵀ?
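A small numpy sketch of these special cases, plus the transpose rule from the previous slide (example values are mine):

    import numpy as np

    x = np.array([1.0, 6.0, 3.0])
    print(x @ x)            # x^T x: the scalar ||x||^2 = 46.0
    print(np.outer(x, x))   # x x^T: a 3x3 (rank-one) matrix

    # (AB)^T = B^T A^T on arbitrary conforming matrices.
    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(12.0).reshape(3, 4)
    print(np.allclose((A @ B).T, B.T @ A.T))   # True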

SLIDE 20

Associative but not commutative

  • Multiplying matrices is not commutative: in general, AB ≠ BA. For instance:

    [ 1  2   [ 0  1          [ 0  1   [ 1  2
      0  1 ]   1  0 ]  = ?     1  0 ]   0  1 ]  = ?

  • But it is associative: ABCD = (AB)(CD) = (A(BC))D, etc.

Example: if x ∈ R^d has length 2, what is xᵀx xᵀx xᵀx xᵀx?

A special case

Recall: for vector x ∈ R^d, we have xᵀx = ‖x‖². What about xᵀMx, for an arbitrary d × d matrix M?

SLIDE 21

What is xᵀMx for M = [ 1  2
                       0  3 ] ?

Quadratic functions

Let M be any d × d (square) matrix. For x ∈ R^d, the mapping x ↦ xᵀMx is a quadratic function from R^d to R:

xᵀMx = ∑_{i,j=1}^d Mij xi xj

What is the quadratic function associated with M = [ 1  2  3
                                                     4  5  1
                                                     …       ] ?
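A one-line numpy check of this quadratic form (M and x are example values of mine):

    import numpy as np

    M = np.array([[1.0, 2.0],
                  [0.0, 3.0]])
    x = np.array([2.0, 1.0])

    # x^T M x = sum_{i,j} M_ij x_i x_j
    print(x @ M @ x)   # 1*(2*2) + 2*(2*1) + 0*(1*2) + 3*(1*1) = 11.0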

SLIDE 22

Write the quadratic function f(x1, x2) = x1² + 2x1x2 + 3x2² using matrices and vectors.

Special cases of square matrices

  • Symmetric: M = Mᵀ

    [ 1  2  3      [ 1  2  3
      2  4  5        1  2  4
      3  5  6 ]      3  4  6 ]

    (The first matrix is symmetric; the second is not.)

  • Diagonal: M = diag(m1, m2, . . . , md)

    diag(1, 4, 7) = [ 1  0  0
                      0  4  0
                      0  0  7 ]

SLIDE 23

Determinant of a square matrix

Determinant of A = [ a  b
                     c  d ]  is |A| = ad − bc.

Example: A = [ 3  1
               1  2 ]  has |A| = 3 · 2 − 1 · 1 = 5.

SLIDE 24

Inverse of a square matrix

The inverse of a d × d matrix A is a d × d matrix B for which AB = BA = Id. Notation: A⁻¹.

Example: if A = [ −1  2
                   2  0 ]  then A⁻¹ = [ 0    1/2
                                        1/2  1/4 ].  Check!

Inverse of a square matrix, cont’d

The inverse of a d × d matrix A is a d × d matrix B for which AB = BA = Id. Notation: A⁻¹.

  • Not all square matrices have an inverse.
  • Square matrix A is invertible if and only if |A| ≠ 0.
  • What is the inverse of A = diag(a1, . . . , ad)?
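A numpy sketch of determinants and inverses, including the diagonal case asked about above (example matrices are mine):

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    print(np.linalg.det(A))                   # ad - bc = 5 (up to rounding)

    A_inv = np.linalg.inv(A)
    print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^{-1} = I

    # A diagonal matrix inverts entrywise (provided no diagonal entry is 0).
    D = np.diag([1.0, 4.0, 7.0])
    print(np.linalg.inv(D))                   # diag(1, 1/4, 1/7)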
SLIDE 25

DSE 210: Probability and statistics Winter 2018

Worksheet 7 — Linear algebra primer

  • 1. Find the unit vector in the same direction as x = (1, 2, 3).
  • 2. Find all unit vectors in R² that are orthogonal to (1, 1).
  • 3. How would you describe the set of all points x ∈ R^d with x · x = 25?
  • 4. The function f(x) = 2x1 − x2 + 6x3 can be written as w · x for x ∈ R³. What is w?
  • 5. For a certain pair of matrices A, B, the product AB has dimension 10 × 20. If A has 30 columns, what are the dimensions of A and B?

  • 6. We have n data points x(1), . . . , x(n) ∈ R^d and we store them in a matrix X, one point per row.
    (a) What is the dimension of X?
    (b) What is the dimension of XXᵀ?
    (c) What is the (i, j) entry of XXᵀ, simply?

  • 7. Vector x has length 10. What is xᵀx xᵀx xᵀx?
  • 8. For x = (1, 3, 5) compute xᵀx and xxᵀ.
  • 9. Vectors x, y ∈ R^d both have length 2. If xᵀy = 2, what is the angle between x and y?
  • 10. The quadratic function f : R³ → R given by f(x) = 3x1² + 2x1x2 − 4x1x3 + 6x3² can be written in the form xᵀMx for some symmetric matrix M. What is M?

  • 11. Which of the following matrices is necessarily symmetric?
    (a) AAᵀ for an arbitrary matrix A.
    (b) AᵀA for an arbitrary matrix A.
    (c) A + Aᵀ for an arbitrary square matrix A.
    (d) A − Aᵀ for an arbitrary square matrix A.

  • 12. Let A = diag(1, 2, 3, 4, 5, 6, 7, 8).
    (a) What is |A|?
    (b) What is A⁻¹?

  • 13. Vectors u1, . . . , ud ∈ R^d all have unit length and are orthogonal to each other. Let U be the d × d matrix whose rows are the ui.
    (a) What is UUᵀ?
    (b) What is U⁻¹?

  • 14. Matrix A = [ 1  2
                     3  z ]  is singular. What is z?

SLIDE 26

Classification with generative models 3

DSE 210

Recall: the bivariate Gaussian

Bivariate Gaussian, parametrized by: mean µ = (13.7, 3.0) and covariance matrix Σ = [ 0.20  0.06
                                                                                      0.06  0.12 ]

SLIDE 27

The multivariate Gaussian

N(µ, Σ): Gaussian in R^d

  • mean: µ ∈ R^d
  • covariance: d × d matrix Σ

Generates points X = (X1, X2, . . . , Xd).

  • µ is the vector of coordinate-wise means: µ1 = EX1, µ2 = EX2, . . . , µd = EXd.
  • Σ is a matrix containing all pairwise covariances: Σij = Σji = cov(Xi, Xj) if i ≠ j, and Σii = var(Xi).

Density: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

Special case: independent features

Suppose the Xi are independent, and var(Xi) = σi².

What is the covariance matrix Σ, and what is its inverse Σ⁻¹?

SLIDE 28

Diagonal Gaussian

Diagonal Gaussian: the Xi are independent, with variances σi². Thus

Σ = diag(σ1², . . . , σd²)  (off-diagonal elements zero)

Each Xi is an independent one-dimensional Gaussian N(µi, σi²):

Pr(x) = Pr(x1) Pr(x2) · · · Pr(xd) = (1 / ((2π)^(d/2) σ1 · · · σd)) exp( −∑_{i=1}^d (xi − µi)² / (2σi²) )

Contours of equal density are axis-aligned ellipsoids centered at µ, with half-axes proportional to σ1, σ2, . . .

Even more special case: spherical Gaussian

The Xi are independent and all have the same variance σ²:

Σ = σ² Id = diag(σ², σ², . . . , σ²)  (diagonal elements σ², rest zero)

Each Xi is an independent univariate Gaussian N(µi, σ²):

Pr(x) = Pr(x1) Pr(x2) · · · Pr(xd) = (1 / ((2π)^(d/2) σ^d)) exp( −‖x − µ‖² / (2σ²) )

Density at a point depends only on its distance from µ.
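To see the difference between the two special cases, here is a small sampling sketch (mine; the variances are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.zeros(2)

    # Diagonal Gaussian: independent coordinates, variances 9 and 1.
    diag_samples = rng.multivariate_normal(mu, np.diag([9.0, 1.0]), size=500)

    # Spherical Gaussian: Sigma = sigma^2 I with sigma = 2.
    sph_samples = rng.multivariate_normal(mu, 4.0 * np.eye(2), size=500)

    # Coordinate-wise standard deviations: roughly (3, 1) for the diagonal
    # case (axis-aligned ellipsoidal contours), (2, 2) for the spherical one.
    print(diag_samples.std(axis=0), sph_samples.std(axis=0))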

SLIDE 29

How to fit a Gaussian to data

Fit a Gaussian to data points x(1), . . . , x(m) ∈ R^d.

  • Empirical mean: µ = (1/m) (x(1) + · · · + x(m))
  • Empirical covariance matrix has (i, j) entry:

    Σij = ( (1/m) ∑_{k=1}^m x(k)i x(k)j ) − µi µj
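In numpy, the two estimates are one line each; a sketch on made-up data:

    import numpy as np

    # Hypothetical data matrix: m points in R^d, one per row.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    m = len(X)

    mu = X.mean(axis=0)                        # empirical mean
    Sigma = (X.T @ X) / m - np.outer(mu, mu)   # mean of x_i x_j, minus mu_i mu_j

    # Same matrix as np.cov with the 1/m normalization.
    print(np.allclose(Sigma, np.cov(X.T, bias=True)))   # True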

Back to the winery data

Go from 1 to 2 features: test error goes from 29% to 8%. With all 13 features: test error rate goes to zero.

SLIDE 30

The multivariate Gaussian

N(µ, Σ): Gaussian in R^d

  • mean: µ ∈ R^d
  • covariance: d × d matrix Σ

Density: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

If we write S = Σ⁻¹, then S is a d × d matrix and

(x − µ)ᵀ Σ⁻¹ (x − µ) = ∑_{i,j} Sij (xi − µi)(xj − µj),

a quadratic function of x.

Binary classification with Gaussian generative model

  • Estimate class probabilities π1, π2.
  • Fit a Gaussian to each class: P1 = N(µ1, Σ1), P2 = N(µ2, Σ2).

Given a new point x, predict class 1 if

π1P1(x) > π2P2(x)  ⇔  xᵀMx + 2wᵀx ≥ θ

where

M = (1/2) (Σ2⁻¹ − Σ1⁻¹),  w = Σ1⁻¹ µ1 − Σ2⁻¹ µ2,

and θ is a threshold depending on the various parameters. Linear or quadratic decision boundary.
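A short numpy sketch computing M and w from fitted parameters (class 1 uses the winery numbers seen earlier; the class-2 parameters are made up):

    import numpy as np

    mu1 = np.array([13.7, 3.0])
    Sigma1 = np.array([[0.20, 0.06], [0.06, 0.12]])
    mu2 = np.array([13.2, 2.6])                       # made-up class 2
    Sigma2 = np.array([[0.30, 0.10], [0.10, 0.20]])

    S1, S2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    M = 0.5 * (S2 - S1)
    w = S1 @ mu1 - S2 @ mu2

    def lhs(x):
        # Left-hand side of the rule: x^T M x + 2 w^T x >= theta
        return x @ M @ x + 2 * w @ x

    print(lhs(np.array([13.5, 2.8])))   # compare against a fitted theta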

SLIDE 31

Common covariance: Σ1 = Σ2 = Σ

Linear decision boundary: choose class 1 if x · w ≥ θ, where w = Σ⁻¹(µ1 − µ2).

Example 1: Spherical Gaussians with Σ = Id and π1 = π2. [Figure: the boundary is the perpendicular bisector of the line joining the means µ1 and µ2, with w along µ1 − µ2.]

Example 2: Again spherical, but now π1 > π2. [Figure: same w, but the boundary shifts toward µ2.]

SLIDE 32

Example 3: Non-spherical. [Figure: means µ1 and µ2; here w = Σ⁻¹(µ1 − µ2) need not point along µ1 − µ2.]

Classification rule: w · x ≥ θ

  • Choose w as above.
  • Common practice: fit θ to minimize training or validation error.

Different covariances: Σ1 ≠ Σ2

Quadratic boundary: choose class 1 if xᵀMx + 2wᵀx ≥ θ, where:

M = (1/2) (Σ2⁻¹ − Σ1⁻¹),  w = Σ1⁻¹ µ1 − Σ2⁻¹ µ2

Example 1: Σ1 = σ1² Id and Σ2 = σ2² Id with σ1 > σ2. [Figure: the boundary is a circle enclosing µ2; class 2 is chosen inside, class 1 outside.]

SLIDE 33

Example 2: The same thing in one dimension, X = R. [Figure: class 1 and class 2 densities on the line.]

Example 3: A parabolic boundary. [Figure: means µ1 and µ2 separated by a parabola.]

SLIDE 34

Multiclass discriminant analysis

k classes: weights πj; class-conditional densities Pj = N(µj, Σj). Each class has an associated quadratic function fj(x) = log(πj Pj(x)). To classify point x, pick arg max_j fj(x). If Σ1 = · · · = Σk, the boundaries are linear.
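The whole procedure fits in a few lines; here is a hedged sketch (the function names and synthetic data are mine, not the course’s):

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_generative(X, y, k):
        # One weight pi_j and one Gaussian N(mu_j, Sigma_j) per class.
        pis, dists = [], []
        for j in range(k):
            Xj = X[y == j]
            pis.append(len(Xj) / len(X))
            dists.append(multivariate_normal(mean=Xj.mean(axis=0),
                                             cov=np.cov(Xj.T, bias=True)))
        return np.array(pis), dists

    def predict(x, pis, dists):
        # f_j(x) = log(pi_j P_j(x)); pick the class with the largest score.
        scores = [np.log(pi) + dist.logpdf(x) for pi, dist in zip(pis, dists)]
        return int(np.argmax(scores))

    # Tiny synthetic two-class demo.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    pis, dists = fit_generative(X, y, 2)
    print(predict(np.array([2.5, 2.5]), pis, dists))   # most likely class 1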

Beyond Gaussians

The generative methodology:

  • Fit a distribution to each class separately
  • Use Bayes’ rule to classify new data

What distribution to use? Are Gaussians enough?

SLIDE 35

Exponential families of distributions

[Figure: examples of exponential-family distributions: gamma, beta, Poisson, categorical.]

Multivariate distributions

We’ve described a variety of distributions for one-dimensional data. What about higher dimensions?

1. Naive Bayes: treat coordinates as independent. For x = (x1, . . . , xd), fit a separate model Pri to each xi, and assume Pr(x1, . . . , xd) = Pr1(x1) Pr2(x2) · · · Prd(xd). This assumption is typically inaccurate.

2. Multivariate Gaussian: model correlations between features. We’ve seen this in detail.

3. Graphical models: arbitrary dependencies between coordinates.

SLIDE 36

Handling text data

Bag-of-words: vectorial representation of text documents.

Example document (the opening of A Tale of Two Cities):

  “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.”

[Figure: the passage annotated with counts of selected vocabulary words such as ‘despair’, ‘evil’, ‘happiness’, ‘foolishness’.]

  • Fix V = some vocabulary.
  • Treat each document as a vector of length |V|: x = (x1, x2, . . . , x|V|), where xi = # of times the ith word appears in the document.

A standard distribution over such document-vectors x: the multinomial.

Multinomial naive Bayes

Multinomial distribution over a vocabulary V: p = (p1, . . . , p|V|), such that pi ≥ 0 and p1 + · · · + p|V| = 1. Document x = (x1, . . . , x|V|) has probability ∝ p1^x1 p2^x2 · · · p|V|^x|V|.

For naive Bayes: one multinomial distribution per class.

  • Class probabilities π1, . . . , πk
  • Multinomials p(1) = (p11, . . . , p1|V|), . . . , p(k) = (pk1, . . . , pk|V|)

Classify document x as

arg max_j πj ∏_{i=1}^{|V|} pji^xi

(As always, take logs to avoid underflow: a linear classifier.)
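A minimal log-space sketch of this classifier (the tiny vocabulary and the parameter values are made up):

    import numpy as np

    pi = np.array([0.5, 0.5])                 # class probabilities pi_j
    # One (smoothed) multinomial per class over a 5-word vocabulary;
    # each row sums to 1.
    p = np.array([[0.4, 0.3, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.2, 0.3, 0.3]])

    def classify(x):
        # argmax_j [ log pi_j + sum_i x_i log p_ji ] -- linear in the
        # count vector x, and no underflow.
        scores = np.log(pi) + x @ np.log(p).T
        return int(np.argmax(scores))

    x = np.array([3, 2, 0, 1, 0])   # word counts for a new document
    print(classify(x))              # 0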

SLIDE 37

Improving performance of multinomial naive Bayes

A variety of heuristics that are standard in text retrieval, such as:

1. Compensating for burstiness.
   Problem: once a word has appeared in a document, it has a much higher chance of appearing again.
   Solution: instead of the number of occurrences f of a word, use log(1 + f).

2. Downweighting common words.
   Problem: common words can have an unduly large influence on classification.
   Solution: weight each word w by its inverse document frequency: log( #docs / #(docs containing w) ).
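Both transforms are easy to apply to a documents-by-vocabulary count matrix; a sketch with made-up counts:

    import numpy as np

    counts = np.array([[3.0, 0.0, 1.0],
                       [0.0, 2.0, 1.0],
                       [1.0, 1.0, 0.0]])   # rows: documents, cols: words

    # 1. Dampen burstiness: replace each count f by log(1 + f).
    damped = np.log1p(counts)

    # 2. Inverse document frequency: log(#docs / #(docs containing w)).
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)    # assumes every word occurs somewhere
    idf = np.log(n_docs / doc_freq)

    print(damped * idf)                    # reweighted features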

SLIDE 38

DSE 210: Probability and statistics Winter 2018

Worksheet 8 — Generative models 3

  • 1. Consider the linear classifier w · x ≥ θ, where w = (−3, 4) and θ = 12. Sketch the decision boundary in R². Make sure to label precisely where the boundary intersects the coordinate axes, and also indicate which side of the boundary is the positive side.

  • 2. How many parameters are needed to specify a diagonal Gaussian in R^d?
  • 3. Text classification using multinomial Naive Bayes.
    (a) For this problem, you’ll be using the 20 Newsgroups data set. There are several versions of it on the web. You should download “20news-bydate.tar.gz” from http://qwone.com/~jason/20Newsgroups/. Unpack it and look through the directories at some of the files. Overall, there are roughly 19,000 documents, each from one of 20 newsgroups. The label of a document is the identity of its newsgroup. The documents are divided into a training set and a test set.
    (b) The same website has a processed version of the data, “20news-bydate-matlab.tgz”, that is particularly convenient to use. Download this and also the file “vocabulary.txt”. Look at the first training document in the processed set and the corresponding original text document to understand the relation between the two.
    (c) The words in the documents constitute an overall vocabulary V of size 61188. Build a multinomial Naive Bayes model using the training data. For each of the 20 classes j = 1, 2, . . . , 20, you must have the following:
        • πj, the fraction of documents that belong to that class; and
        • Pj, a probability distribution over V that models the documents of that class.
        In order to fit Pj, imagine that all the documents of class j are strung together. For each word w ∈ V, let Pjw be the fraction of this concatenated document occupied by w. Well, almost: you will need to do smoothing (just add one to the count of how often w occurs).
    (d) Write a routine that uses this naive Bayes model to classify a new document. To avoid underflow, work with logs rather than multiplying together probabilities.
    (e) Evaluate the performance of your model on the test data. What error rate do you achieve?
    (f) If you have the time and inclination, see if you can get a better-performing model.
  • Split the training data into a smaller training set and a validation set. The split could be 80-20, for instance. You’ll use this training set to estimate parameters and the validation set to decide between different options.

SLIDE 39


  • Think of 2–3 ways in which you might improve your earlier model. Examples include: (i) replacing the frequency f of a word in a document by log(1 + f); (ii) removing stopwords; (iii) reducing the size of the vocabulary; etc. Estimate a revised model for each of these, and use the validation set to choose between them.

  • Evaluate your final model on the test data. What error rate do you achieve?
  • 4. Handwritten digit recognition using a Gaussian generative model. In class, we mentioned the MNIST data set of handwritten digits. You can obtain it from: http://yann.lecun.com/exdb/mnist/index.html. In this problem, you will build a classifier for this data, by modeling each class as a multivariate (784-dimensional) Gaussian.
    (a) Upon downloading the data, you should have two training files (one with images, one with labels) and two test files. Unzip them. In order to load the data into Python you will find the following code helpful: http://cseweb.ucsd.edu/~dasgupta/dse210/loader.py. For instance, to load in the training data, you can use:

        x,y = loadmnist('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')

        This will set x to a 60000 × 784 array where each row corresponds to an image, and y to a length-60000 array where each entry is a label (0-9). There is also a routine to display images: use displaychar(x[0]) to show the first data point, for instance.
    (b) Split the training set into two pieces – a training set of size 50000, and a separate validation set of size 10000. Also load in the test data.
    (c) Now fit a Gaussian generative model to the training data of 50000 points:
        • Determine the class probabilities: what fraction π0 of the training points are digit 0, for instance? Call these values π0, . . . , π9.
        • Fit a Gaussian to each digit, by finding the mean and the covariance of the corresponding data points. Let the Gaussian for the jth digit be Pj = N(µj, Σj).
        Using these two pieces of information, you can classify new images x using Bayes’ rule: simply pick the digit j for which πjPj(x) is largest.
    (d) One last step is needed: it is important to smooth the covariance matrices, and the usual way to do this is to add in cI, where c is some constant and I is the identity matrix. What value of c is right? Use the validation set to help you choose. That is, choose the value of c for which the resulting classifier makes the fewest mistakes on the validation set. What value of c did you get?
    (e) Turn in an iPython notebook that includes:
        • All your code.
        • Error rate on the MNIST test set.
        • Out of the misclassified test digits, pick five at random and display them. For each instance, list the posterior probabilities Pr(y|x) of each of the ten classes.