CS480/680 Lecture 8 (June 3, 2019): Classification by Logistic Regression, Generalized Linear Models


SLIDE 1

CS480/680 Lecture 8: June 3, 2019

Classification by Logistic Regression, Generalized Linear Models
Readings: [RN] Sec. 18.6.4, [B] Sec. 4.3, [M] Chap. 8, [HTF] Sec. 4.4

Pascal Poupart, University of Waterloo, Spring 2019

slide-2
SLIDE 2

Beyond Mixtures of Gaussians

  • Mixture of Gaussians:
    – Restrictive assumption: each class is Gaussian
    – Picture: (figure omitted)
  • Can we consider distributions other than Gaussians?


SLIDE 3

Exponential Family

  • More generally, when the Pr(x|C_k) are members of the exponential family (e.g., Gaussian, exponential, Bernoulli, categorical, Poisson, Beta, Dirichlet, Gamma, etc.):

    Pr(x|θ_k) = exp( θ_k^T T(x) − A(θ_k) + B(x) )

    where θ_k are the parameters of class k, and T(x), A(θ_k), B(x) are arbitrary functions of the inputs and parameters,

  • the posterior is a sigmoid (logistic) linear function in x:

    Pr(C_k|x) = σ(w^T x + w_0)
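
Why this holds (a sketch, assuming the natural case T(x) = x so the exponent is linear in x): for two classes, Bayes' rule gives

  Pr(C_1|x) = σ(a),  where  a = ln [ Pr(x|θ_1) Pr(C_1) / ( Pr(x|θ_2) Pr(C_2) ) ]
                              = (θ_1 − θ_2)^T x − A(θ_1) + A(θ_2) + ln( Pr(C_1) / Pr(C_2) )

(the B(x) terms cancel), so w = θ_1 − θ_2 and w_0 collects the remaining constants.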


SLIDE 4

Probabilistic Discriminative Models

  • Instead of learning Pr(C_k) and Pr(x|C_k) by maximum likelihood and finding Pr(C_k|x) by Bayesian inference, why not learn Pr(C_k|x) directly by maximum likelihood?
  • We know the general form of Pr(C_k|x):
    – Logistic sigmoid (binary classification)
    – Softmax (general classification)


SLIDE 5

Logistic Regression

  • Consider a single data point (x, y):

    w* = argmax_w σ(w^T x)^y (1 − σ(w^T x))^(1−y)

  • Similarly, for an entire dataset (X, y):

    w* = argmax_w ∏_n σ(w^T x_n)^(y_n) (1 − σ(w^T x_n))^(1−y_n)

  • Objective: negative log likelihood (minimization)

    L(w) = − ∑_n [ y_n ln σ(w^T x_n) + (1 − y_n) ln(1 − σ(w^T x_n)) ]

  • Tip: dσ(a)/da = σ(a)(1 − σ(a))
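
A minimal NumPy sketch of this objective (the eps guard against log(0) is an implementation detail, not part of the slide):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y, eps=1e-12):
    """Negative log likelihood L(w) for labels y in {0, 1}.

    X is the (N, d) design matrix, w the (d,) weight vector.
    """
    s = sigmoid(X @ w)
    return -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))
```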


SLIDE 6

Logistic Regression

  • NB: Despite the name, logistic regression is a form of classification.
  • However, it can be viewed as regression where the goal is to estimate the posterior Pr(C_k|x), which is a continuous function.


SLIDE 7

Maximum likelihood

  • Convex loss: set derivative to 0

  0 = ∂L/∂w = − ∑_n y_n [ σ(w^T x_n)(1 − σ(w^T x_n)) / σ(w^T x_n) ] x_n
              + ∑_n (1 − y_n) [ σ(w^T x_n)(1 − σ(w^T x_n)) / (1 − σ(w^T x_n)) ] x_n

  ⟹ 0 = − ∑_n y_n x_n + ∑_n y_n σ(w^T x_n) x_n + ∑_n σ(w^T x_n) x_n − ∑_n y_n σ(w^T x_n) x_n

  ⟹ 0 = ∑_n ( σ(w^T x_n) − y_n ) x_n

  • Sigmoid prevents us from isolating w, so we use an iterative method instead
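
The final expression translates directly to NumPy (sigmoid is from the earlier sketch):

```python
def gradient(w, X, y):
    """Gradient of the negative log likelihood: sum_n (sigma(w^T x_n) - y_n) x_n."""
    return X.T @ (sigmoid(X @ w) - y)
```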


SLIDE 8

Newton’s method

  • Iterative reweighted least squares:

    w ← w − H^(−1) ∇L(w)

    where ∇L is the gradient (column vector) and H is the Hessian (matrix):

    H = [ ∂²L/∂w_1²      ⋯  ∂²L/∂w_1∂w_d ]
        [      ⋮         ⋱       ⋮       ]
        [ ∂²L/∂w_d∂w_1   ⋯  ∂²L/∂w_d²    ]
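
One Newton update as a sketch (hessian is defined with the next slide; solving the linear system is preferable to forming H^(−1) explicitly):

```python
def newton_step(w, X, y):
    """One Newton / IRLS update: w <- w - H^(-1) grad L(w)."""
    g = gradient(w, X, y)
    H = hessian(w, X)
    return w - np.linalg.solve(H, g)
```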


SLIDE 9

Hessian

  H = ∇∇^T L(w) = ∑_{n=1}^N σ(w^T x_n)(1 − σ(w^T x_n)) x_n x_n^T = X^T R X

  where R = diag( σ_1(1 − σ_1), …, σ_N(1 − σ_N) ) and σ_n = σ(w^T x_n)
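
In NumPy, together with a small IRLS loop tying the previous sketches together (no convergence check, for brevity):

```python
def hessian(w, X):
    """H = X^T R X with R_nn = sigma_n (1 - sigma_n), without forming diag(R)."""
    s = sigmoid(X @ w)
    return X.T @ ((s * (1 - s))[:, None] * X)

def fit_irls(X, y, iters=20):
    """Run Newton / IRLS updates starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = newton_step(w, X, y)
    return w
```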


SLIDE 10

Case study

  • Applications: recommender systems, ad placement
  • Used by all major companies
  • Advantages: logistic regression is simple, flexible, and efficient


SLIDE 11

App Recommendation

  • Flexibility: millions of features (binary & numerical)
    – Examples:
  • Efficiency: classification by dot products (see the sketch after this list)

    Multiple classes:  k* = argmax_k exp(w_k^T x) / ∑_{k'} exp(w_{k'}^T x) = argmax_k w_k^T x

    Two classes:  y* = 1 if σ(w^T x) ≥ 0.5, 0 otherwise
                  (equivalently, y* = 1 if w^T x ≥ 0, 0 otherwise)

    – Sparsity:
    – Parallelization:
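
Both decision rules reduce to dot products; a sketch (W stacks the class weight vectors w_k as rows; names are illustrative):

```python
def predict_binary(w, x):
    """sigma(w^T x) >= 0.5 exactly when w^T x >= 0, so the sigmoid can be skipped."""
    return 1 if w @ x >= 0 else 0

def predict_multiclass(W, x):
    """The softmax denominator is shared by all classes, so the argmax
    reduces to an argmax of dot products w_k^T x."""
    return int(np.argmax(W @ x))
```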


SLIDE 12

Numerical Issues

  • Logistic regression is subject to overfitting
    – Without enough data, logistic regression can classify each data point arbitrarily well (i.e., Pr(correct class) → 1)
  • Problems:
    – weights → ±∞
    – Hessian → singular
  • Picture: (figure omitted)
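
A small demonstration of the blow-up, running the unregularized IRLS sketch above on hypothetical, linearly separable toy data:

```python
# Two separable points (bias feature plus one input): the MLE does not
# exist, so the weight norm grows with every Newton iteration.
X_toy = np.array([[1.0, -1.0],
                  [1.0,  1.0]])
y_toy = np.array([0.0, 1.0])

w = np.zeros(2)
for i in range(5):
    w = newton_step(w, X_toy, y_toy)
    print(i, np.linalg.norm(w))   # increases without bound; H drifts toward singular
```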


SLIDE 13

Regularization

  • Solution: penalize large weights
  • Objective:

    min_w L(w) + (λ/2) ‖w‖²_2
    = min_w − ∑_n [ y_n ln σ(w^T x_n) + (1 − y_n) ln(1 − σ(w^T x_n)) ] + (λ/2) w^T w

  • Hessian:

    H = X^T R X + λI

    where R_nn = σ(w^T x_n)(1 − σ(w^T x_n)); the term λI ensures that H is not singular (eigenvalues ≥ λ)
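
The penalized gradient and Hessian as a sketch (lam stands for λ):

```python
def gradient_reg(w, X, y, lam):
    """Gradient of the penalized objective: X^T (sigma - y) + lam * w."""
    return X.T @ (sigmoid(X @ w) - y) + lam * w

def hessian_reg(w, X, lam):
    """H = X^T R X + lam * I; the lam * I term keeps H nonsingular."""
    s = sigmoid(X @ w)
    return X.T @ ((s * (1 - s))[:, None] * X) + lam * np.eye(X.shape[1])
```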


SLIDE 14

Generalized Linear Models

  • How can we do non-linear regression and classification while using the same machinery?
  • Idea: map inputs to a different space and do linear regression/classification in that space


SLIDE 15

Example

  • Suppose the underlying function is quadratic


SLIDE 16

Basis functions

  • Use non-linear basis functions (a concrete sketch follows):
    – Let φ_j denote a basis function, e.g.
        φ_0(x) = 1,  φ_1(x) = x,  φ_2(x) = x²
    – Let the hypothesis space H be
        H = { x ↦ w_0 φ_0(x) + w_1 φ_1(x) + w_2 φ_2(x) | w_j ∈ ℝ }
  • If the basis functions are non-linear in x, then a non-linear hypothesis can still be found by linear regression
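
The quadratic basis above as a feature map (the name phi_quadratic is illustrative):

```python
def phi_quadratic(x):
    """Basis (phi_0, phi_1, phi_2) = (1, x, x^2) for a scalar input x."""
    return np.array([1.0, x, x**2])
```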


SLIDE 17

Common basis functions

  • Polynomial: φ_j(x) = x^j
  • Gaussian: φ_j(x) = exp( −(x − μ_j)² / (2s²) )
  • Sigmoid: φ_j(x) = σ( (x − μ_j) / s ), where σ(a) = 1 / (1 + e^(−a))
  • Also Fourier basis functions, wavelets, etc.
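
The three families as NumPy one-liners (mu and s are the centre and scale; sigmoid is from the earlier sketch):

```python
def phi_poly(x, j):
    """Polynomial basis: x^j."""
    return x**j

def phi_gauss(x, mu, s):
    """Gaussian basis: exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-(x - mu)**2 / (2 * s**2))

def phi_sig(x, mu, s):
    """Sigmoidal basis: sigma((x - mu) / s)."""
    return sigmoid((x - mu) / s)
```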


SLIDE 18

Generalized Linear Models

  • Linear regression:

    w* = argmin_w (1/2) ∑_{n=1}^N ( y_n − w^T x_n )² + (λ/2) ‖w‖²_2

  • Generalized linear regression:

    w* = argmin_w (1/2) ∑_{n=1}^N ( y_n − w^T φ(x_n) )² + (λ/2) ‖w‖²_2

  • Linear separator (classification):

    w* = argmin_w − ∑_n [ y_n ln σ(w^T x_n) + (1 − y_n) ln(1 − σ(w^T x_n)) ] + (λ/2) ‖w‖²_2

  • Generalized linear separator (classification):

    w* = argmin_w − ∑_n [ y_n ln σ(w^T φ(x_n)) + (1 − y_n) ln(1 − σ(w^T φ(x_n))) ] + (λ/2) ‖w‖²_2
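
A sketch of generalized linear regression via the regularized normal equations (phi and lam are assumptions of this sketch, not fixed by the slide):

```python
def fit_glm_regression(xs, ys, phi, lam=0.1):
    """Solve (Phi^T Phi + lam * I) w = Phi^T y, where Phi_nj = phi_j(x_n)."""
    Phi = np.array([phi(x) for x in xs])
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ ys)

# Example: recovering a quadratic with the basis (1, x, x^2):
#   xs = np.linspace(-1, 1, 50)
#   ys = 2 - 3 * xs + xs**2
#   fit_glm_regression(xs, ys, phi_quadratic)   # roughly [2, -3, 1]
```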
