CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2


SLIDE 1

CS6220: DATA MINING TECHNIQUES

Matrix Data: Classification: Part 2

Instructor: Yizhou Sun

yzsun@ccs.neu.edu

September 27, 2015

SLIDE 2

Methods to Learn

2

Classification
  • Matrix data: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN
  • Sequence data: HMM
  • Graph & network: Label Propagation*
  • Images: Neural Network

Clustering
  • Matrix data: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means*
  • Text data: PLSA
  • Graph & network: SCAN*; Spectral Clustering*

Frequent Pattern Mining
  • Matrix data: Apriori; FP-growth
  • Sequence data: GSP; PrefixSpan

Prediction
  • Matrix data: Linear Regression
  • Time series: Autoregression

Similarity Search
  • Time series: DTW
  • Graph & network: P-PageRank

Ranking
  • Graph & network: PageRank

SLIDE 3

Matrix Data: Classification: Part 2

  • Bayesian Learning
  • Naïve Bayes
  • Bayesian Belief Network
  • Logistic Regression
  • Summary

3

SLIDE 4

Basic Probability Review

  • Have two dice, h1 and h2
  • The probability of rolling an i given die h1 is denoted P(i|h1). This is a conditional probability.
  • Pick a die at random with probability P(hj), j = 1 or 2. The probability of picking die hj and rolling an i with it is called the joint probability and is P(i, hj) = P(hj) P(i|hj).
  • If we know P(i|hj), the so-called marginal probability P(i) can be computed as: P(i) = Σj P(i, hj)
  • For any X and Y, P(X,Y) = P(X|Y) P(Y)
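To make the joint/marginal relationship concrete, here is a minimal Python sketch; the particular die probabilities (a fair h1 and a hypothetical loaded h2) are illustrative assumptions, not values from the slides.

```python
# Minimal sketch of joint and marginal probabilities with two dice.
# The die probabilities below are illustrative assumptions.

P_h = {"h1": 0.5, "h2": 0.5}                          # P(hj): pick a die at random
P_i_given_h = {
    "h1": {i: 1/6 for i in range(1, 7)},              # fair die
    "h2": {1: 0.5, **{i: 0.1 for i in range(2, 7)}},  # hypothetical loaded die
}

def joint(i, hj):
    """P(i, hj) = P(hj) * P(i | hj)"""
    return P_h[hj] * P_i_given_h[hj][i]

def marginal(i):
    """P(i) = sum over j of P(i, hj)"""
    return sum(joint(i, hj) for hj in P_h)

print(joint(1, "h2"))   # 0.25
print(marginal(1))      # 0.5*(1/6) + 0.25 = 0.3333...
```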

4

SLIDE 5

Bayes' Theorem: Basics

  • Bayes' Theorem:
    P(h|X) = P(X|h) P(h) / P(X)
  • Let X be a data sample ("evidence")
  • Let h be a hypothesis that X belongs to class C
  • P(h) (prior probability): the initial probability
  • E.g., X will buy a computer, regardless of age, income, …
  • P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  • E.g., given that X will buy a computer, the prob. that X is 31..40 with medium income
  • P(X): marginal probability that sample data X is observed
  • P(X) = Σh P(X|h) P(h)
  • P(h|X) (i.e., posterior probability): the probability that the hypothesis holds given the observed data sample X

5

SLIDE 6

Classification: Choosing Hypotheses

  • Maximum Likelihood (maximize the likelihood):
    hML = argmax_{h∈H} P(X|h)
  • Maximum A Posteriori (maximize the posterior):
    hMAP = argmax_{h∈H} P(h|X) = argmax_{h∈H} P(X|h) P(h)
  • Useful observation: the MAP hypothesis does not depend on the denominator P(X)

6

SLIDE 7

7

Classification by Maximum A Posteriori

  • Let D be a training set of tuples and their associated class labels, and each tuple is represented by a p-D attribute vector X = (x1, x2, …, xp)
  • Suppose there are m classes, Y ∈ {C1, C2, …, Cm}
  • Classification is to derive the maximum posteriori, i.e., the maximal P(Y = Cj|X)
  • This can be derived from Bayes' theorem:
    P(Y = Cj|X) = P(X|Y = Cj) P(Y = Cj) / P(X)
  • Since P(X) is constant for all classes, only P(X, Y = Cj) = P(X|Y = Cj) P(Y = Cj) needs to be maximized

SLIDE 8

Example: Cancer Diagnosis

  • A patient takes a lab test with two possible results (+ve, -ve), and the result comes back positive. It is known that the test returns
  • a correct positive result in only 98% of the cases;
  • a correct negative result in only 97% of the cases.
  • Furthermore, only 0.008 of the entire population has this disease.
  • 1. What is the probability that this patient has cancer?
  • 2. What is the probability that he does not have cancer?
  • 3. What is the diagnosis?

8

SLIDE 9

Solution

9

P(cancer) = .008        P(¬cancer) = .992
P(+ve|cancer) = .98     P(-ve|cancer) = .02
P(+ve|¬cancer) = .03    P(-ve|¬cancer) = .97

Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) × P(cancer) / P(+ve) = 0.98 × 0.008 / P(+ve) = .00784 / P(+ve)
P(¬cancer|+ve) = P(+ve|¬cancer) × P(¬cancer) / P(+ve) = 0.03 × 0.992 / P(+ve) = .02976 / P(+ve)

Since .02976 > .00784, the patient most likely does not have cancer.
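The same computation as a short Python sketch, additionally normalizing by P(+ve) to obtain the actual posteriors; the input numbers are exactly those on the slide.

```python
# Bayes' rule for the cancer-test example.
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalized posteriors (the numerators of Bayes' rule)
num_cancer = p_pos_given_cancer * p_cancer              # 0.00784
num_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# P(+ve) is the sum of the numerators (law of total probability)
p_pos = num_cancer + num_not_cancer

print(num_cancer / p_pos)      # P(cancer | +ve)  ~ 0.2085
print(num_not_cancer / p_pos)  # P(not cancer | +ve) ~ 0.7915
```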

SLIDE 10

Matrix Data: Classification: Part 2

  • Bayesian Learning
  • Naïve Bayes
  • Bayesian Belief Network
  • Logistic Regression
  • Summary

10

SLIDE 11

Naïve Bayes Classifier

  • Let D be a training set of tuples and their associated class labels, and each tuple is represented by a p-D attribute vector X = (x1, x2, …, xp)
  • Suppose there are m classes, Y ∈ {C1, C2, …, Cm}
  • Goal: find the Y that maximizes P(Y|X) = P(X, Y)/P(X) ∝ P(X|Y) P(Y)
  • A simplified assumption: attributes are conditionally independent given the class (class conditional independence):
    P(X|Cj) = ∏_{k=1}^{p} P(xk|Cj) = P(x1|Cj) × P(x2|Cj) × … × P(xp|Cj)

11

SLIDE 12

Estimate Parameters by MLE

  • Given a dataset D = {(xi, yi)}, the goal is to
  • find the best estimators P(Cj) and P(Xk = xk|Cj), for every j = 1, …, m and k = 1, …, p
  • that maximize the likelihood of observing D:
    L = ∏i P(xi, yi) = ∏i P(xi|yi) P(yi) = ∏i (∏k P(xik|yi)) P(yi)
  • Estimators of the parameters:
  • P(Cj) = |Cj,D| / |D|, where |Cj,D| is the # of tuples of class Cj in D (why?)
  • P(Xk = xk|Cj): Xk can be either discrete or numerical

12
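A minimal counting-based sketch of these MLE estimators; the tiny dataset below is made up purely for illustration.

```python
from collections import Counter, defaultdict

# Toy dataset assumed for illustration: (attribute dict, class label) pairs.
D = [
    ({"income": "high"},   "yes"),
    ({"income": "medium"}, "yes"),
    ({"income": "medium"}, "no"),
    ({"income": "low"},    "yes"),
]

class_counts = Counter(y for _, y in D)      # |Cj,D|
value_counts = defaultdict(Counter)          # counts of Xk = xk within class Cj
for x, y in D:
    for attr, val in x.items():
        value_counts[(y, attr)][val] += 1

def p_class(c):
    """MLE estimate P(Cj) = |Cj,D| / |D|."""
    return class_counts[c] / len(D)

def p_value_given_class(attr, val, c):
    """MLE estimate P(Xk = xk | Cj) = count(xk within Cj) / |Cj,D|."""
    return value_counts[(c, attr)][val] / class_counts[c]

print(p_class("yes"))                                  # 3/4
print(p_value_given_class("income", "medium", "yes"))  # 1/3
```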

SLIDE 13

Discrete and Continuous Attributes

  • If Xk is discrete, with V possible values
  • P(xk|Cj) is the # of tuples in Cj having value xk for Xk, divided by |Cj,D|
  • If Xk is continuous, with observations of real values
  • P(xk|Cj) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ
  • Estimate (μ, σ²) from the observed values of Xk within category Cj
  • Sample mean and sample variance
  • P(xk|Cj) is then the Gaussian density evaluated at xk:
    P(Xk = xk|Cj) = g(xk, μCj, σCj), where g(x, μ, σ) = exp(−(x − μ)² / (2σ²)) / (√(2π) σ) is the Gaussian density function

13
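A small sketch of the continuous case: estimate (μ, σ) from the class's observed values and evaluate the Gaussian density. The sample ages are made-up values for illustration.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Illustrative ages observed within class Cj = "buys_computer = yes" (made-up numbers).
ages_in_class = [25, 32, 38, 41, 35, 30]

mu = sum(ages_in_class) / len(ages_in_class)                                  # sample mean
var = sum((a - mu) ** 2 for a in ages_in_class) / (len(ages_in_class) - 1)    # sample variance
sigma = math.sqrt(var)

# P(age = 33 | Cj) under the Gaussian assumption
print(gaussian(33, mu, sigma))
```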

SLIDE 14

Naïve Bayes Classifier: Training Dataset

Class:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"

Data to be classified:
  X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no

14

SLIDE 15

Naïve Bayes Classifier: An Example

  • P(Ci):
    P(buys_computer = "yes") = 9/14 = 0.643
    P(buys_computer = "no") = 5/14 = 0.357
  • Compute P(X|Ci) for each class:
    P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
    P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
    P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
    P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
    P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
    P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
    P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
    P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
  • X = (age <= 30, income = medium, student = yes, credit_rating = fair)
    P(X|Ci):
    P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
    P(X|Ci) × P(Ci):
    P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.028
    P(X|buys_computer = "no") × P(buys_computer = "no") = 0.007
    Therefore, X belongs to class "buys_computer = yes"

(Training data: the same 14-tuple dataset shown on Slide 14.)
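A short Python sketch that reproduces the comparison above directly from the probabilities estimated on this slide.

```python
# Naive Bayes prediction for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the conditional probabilities estimated on this slide.
p_yes = 9 / 14
p_no = 5 / 14

cond_yes = [2/9, 4/9, 6/9, 6/9]   # P(xk | buys_computer = "yes") for the four attributes
cond_no  = [3/5, 2/5, 1/5, 2/5]   # P(xk | buys_computer = "no")

def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

score_yes = product(cond_yes) * p_yes   # ~ 0.028
score_no  = product(cond_no) * p_no     # ~ 0.007

print("yes" if score_yes > score_no else "no")   # -> "yes"
```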

15

SLIDE 16

16

Avoiding the Zero-Probability Problem

  • Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability
    P(X|Cj) = ∏_{k=1}^{p} P(xk|Cj)
    will be zero
  • Use the Laplacian correction (or Laplacian smoothing)
  • Add 1 to each count:
    P(xk = v|Cj) = (nj,k,v + 1) / (|Cj,D| + V), where nj,k,v is the # of tuples in Cj having value v for Xk and V is the total number of values that Xk can take
  • Ex. Suppose a training dataset with 1000 tuples in the category "buys_computer = yes": income = low (0 tuples), income = medium (990), and income = high (10)
    Prob(income = low | buys_computer = "yes") = 1/1003
    Prob(income = medium | buys_computer = "yes") = 991/1003
    Prob(income = high | buys_computer = "yes") = 11/1003
  • The "corrected" probability estimates are close to their "uncorrected" counterparts
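The slide's example in a few lines of Python (add-one smoothing with V = 3 possible income values):

```python
# Laplacian (add-one) smoothing for the income attribute within "buys_computer = yes".
counts = {"low": 0, "medium": 990, "high": 10}   # counts among 1000 "yes" tuples
V = len(counts)                                  # 3 possible income values
total = sum(counts.values())                     # 1000

smoothed = {v: (c + 1) / (total + V) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```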

SLIDE 17

*Smoothing and Prior on Attribute Distribution

  • Discrete distribution: Xk|Cj ~ θ
  • P(Xk = v|Cj, θ) = θv
  • Put a prior on θ
  • In the discrete case, the prior can be chosen as a symmetric Dirichlet distribution: θ ~ Dir(β), i.e., P(θ) ∝ ∏v θv^(β−1)
  • Posterior distribution:
  • P(θ|x1k, …, xnk, Cj) ∝ P(x1k, …, xnk|Cj, θ) P(θ), another Dirichlet distribution, with new parameters (β + d1, …, β + dv, …, β + dV)
  • dv is the number of observations taking value v
  • Inference: P(Xk = v|x1k, …, xnk, Cj) = ∫ P(Xk = v|θ) P(θ|x1k, …, xnk, Cj) dθ = (dv + β) / (Σv' dv' + Vβ)
  • Equivalent to adding β to each observed value v

17

SLIDE 18

*Notes on Parameter Learning

  • Why is the probability P(Xk|Cj) estimated in this way?
  • http://www.cs.columbia.edu/~mcollins/em.pdf
  • http://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/NB.pdf

18

SLIDE 19

Naïve Bayes Classifier: Comments

  • Advantages
  • Easy to implement
  • Good results obtained in most of the cases
  • Disadvantages
  • Assumption of class conditional independence, therefore loss of accuracy
  • Practically, dependencies exist among variables
  • E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.) are all related
  • Dependencies among these cannot be modeled by a Naïve Bayes classifier
  • How to deal with these dependencies? Bayesian Belief Networks

19

SLIDE 20

Matrix Data: Classification: Part 2

  • Bayesian Learning
  • Naïve Bayes
  • Bayesian Belief Network
  • Logistic Regression
  • Summary

20

SLIDE 21

21

Bayesian Belief Networks (BNs)

  • Bayesian belief network (also known as Bayesian network, probabilistic network): allows class conditional independencies between subsets of variables
  • Two components: (1) a directed acyclic graph (called a structure) and (2) a set of conditional probability tables (CPTs)
  • A (directed acyclic) graphical model of causal influence relationships
  • Represents dependency among the variables
  • Gives a specification of the joint probability distribution
  • Example structure over nodes X, Y, Z, P:
  • Nodes: random variables
  • Links: dependency
  • X and Y are the parents of Z, and Y is the parent of P
  • No dependency between Z and P conditional on Y
  • The graph has no cycles

21

SLIDE 22

22

A Bayesian Network and Some of Its CPTs

Network nodes (from the figure): Fire (F), Tampering (T), Smoke (S), Alarm (A), Leaving (L), Report (R); the CPTs below give P(S|F) and P(A|F,T).

CPT: Conditional Probability Tables

A CPT gives the conditional probability of a node for each possible combination of values of its parents. The probability of a particular combination of values of X (the joint probability) is derived from the CPTs by:

P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(xi))

CPT for P(S|F):
        F      ¬F
S       .90    .01
¬S      .10    .99

CPT for P(A|F,T):
        F,T    F,¬T   ¬F,T   ¬F,¬T
A       .5     .99    .85    .0001
¬A      .5     .01    .15    .9999
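A sketch of how CPTs combine through the factorization above, using only the Fire/Tampering/Alarm/Smoke part of the network; the priors P(F) and P(T) are assumed values, since the slide shows only the two CPTs.

```python
# Chain-rule factorization P(f, t, a, s) = P(f) * P(t) * P(a | f, t) * P(s | f).
# P(F) and P(T) below are assumed values for illustration.

P_F = {True: 0.01, False: 0.99}      # assumed prior on Fire
P_T = {True: 0.02, False: 0.98}      # assumed prior on Tampering

P_S_given_F = {True: 0.90, False: 0.01}                       # P(Smoke=True | Fire)
P_A_given_FT = {(True, True): 0.5, (True, False): 0.99,
                (False, True): 0.85, (False, False): 0.0001}  # P(Alarm=True | Fire, Tampering)

def cond(event, table, *parents):
    """Look up P(event=True | parents) in a CPT, flipping it if event is False."""
    p = table[parents[0]] if len(parents) == 1 else table[parents]
    return p if event else 1 - p

def joint(f, t, a, s):
    return (P_F[f] * P_T[t]
            * cond(a, P_A_given_FT, f, t)
            * cond(s, P_S_given_F, f))

print(joint(True, False, True, True))   # P(fire, no tampering, alarm, smoke)
```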

SLIDE 23

Inference in Bayesian Networks

  • Infer the probability of values of some variable given the observations of other variables
  • E.g., P(Fire = True|Report = True, Smoke = True)?
  • Computation
  • Exact computation by enumeration
  • In general, the problem is NP-hard
  • *Approximation algorithms are needed

23

SLIDE 24

Inference by enumeration

  • To compute the posterior marginal P(Xi|E = e):
  • Add all of the terms (atomic event probabilities) from the full joint distribution
  • If E are the evidence (observed) variables and Y are the other (unobserved) variables, then:
    P(X|e) = α P(X, e) = α Σy P(X, e, y), where α = 1/P(e) is a normalization constant
  • Each P(X, e, y) term can be computed using the chain rule
  • Computationally expensive!

24

SLIDE 25

Example: Enumeration

  • P(d|e) = α ΣA ΣB ΣC P(a, b, c, d, e)
           = α ΣA ΣB ΣC P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
  • With simple iteration to compute this expression, there is going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C = true)
  • *A solution: variable elimination

(Figure: the five-node network with edges a → b, a → c, b → d, c → d, c → e, matching the factorization above.)
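A minimal sketch of inference by enumeration for P(d|e) on this network; the CPT numbers are made-up placeholders, since the slide gives only the structure and factorization.

```python
import itertools

# Inference by enumeration for P(d | e) on the network a -> {b, c}, {b, c} -> d, c -> e.
# CPT numbers are placeholders for illustration.

P_a   = {True: 0.6, False: 0.4}                        # P(a)
P_b_a = {True: 0.7, False: 0.2}                        # P(b=True | a)
P_c_a = {True: 0.3, False: 0.9}                        # P(c=True | a)
P_d_bc = {(True, True): 0.95, (True, False): 0.6,
          (False, True): 0.5, (False, False): 0.05}    # P(d=True | b, c)
P_e_c = {True: 0.8, False: 0.1}                        # P(e=True | c)

def p(event, table, key):
    """Look up P(event=True | key) in a CPT, flipping it if event is False."""
    q = table[key]
    return q if event else 1 - q

def joint(a, b, c, d, e):
    # Chain rule: P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
    return (P_a[a] * p(b, P_b_a, a) * p(c, P_c_a, a)
            * p(d, P_d_bc, (b, c)) * p(e, P_e_c, c))

def unnormalized(d, e=True):
    # Sum the joint over the hidden variables A, B, C (the repetitive part)
    return sum(joint(a, b, c, d, e)
               for a, b, c in itertools.product([True, False], repeat=3))

alpha = 1.0 / (unnormalized(True) + unnormalized(False))   # alpha = 1 / P(e)
print(alpha * unnormalized(True))                           # P(d=True | e=True)
```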

25

SLIDE 26

26

*How Are Bayesian Networks Constructed?

  • Subjective construction: identification of (direct) causal structure
  • People are quite good at identifying direct causes from a given set of variables & at judging whether the set contains all relevant direct causes
  • Markovian assumption: each variable becomes independent of its non-effects once its direct causes are known
  • E.g., S ← F → A ← T: the path from S to A is blocked once we know F → A
  • Synthesis from other specifications
  • E.g., from a formal system design: block diagrams & info flow
  • Learning from data
  • E.g., from medical records or student admission records
  • Learn parameters given the structure, or learn both the structure and the parameters
  • Maximum likelihood principle: favors Bayesian networks that maximize the probability of observing the given data set

SLIDE 27

27

*Learning Bayesian Networks: Several Scenarios

  • Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries (easiest case!)
  • Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function
  • Weights are initialized to random probability values
  • At each iteration, it moves towards what appears to be the best solution at the moment, without backtracking
  • Weights are updated at each iteration & converge to a local optimum
  • Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  • Scenario 4: Unknown structure, all hidden variables: no good algorithms known for this purpose
  • D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed. MIT Press, 1999.

SLIDE 28

Matrix Data: Classification: Part 2

  • Bayesian Learning
  • Naïve Bayes
  • Bayesian Belief Network
  • Logistic Regression
  • Summary

28

SLIDE 29

Linear Regression vs. Logistic Regression

  • Linear Regression
  • Y: continuous value in (−∞, +∞)
  • Y = x^T β = β0 + x1β1 + x2β2 + ⋯ + xpβp
  • Y|x, β ~ N(x^T β, σ²)
  • Logistic Regression
  • Y: discrete value from M classes
  • P(Y = Cj) ∈ (0, 1) and Σj P(Y = Cj) = 1

29

SLIDE 30

Logistic Function

  • Logistic function / sigmoid function:
    σ(x) = 1 / (1 + e^(−x))

30

SLIDE 31

Modeling Probabilities of Two Classes

  • P(Y = 1|X, β) = σ(X^T β) = 1 / (1 + exp(−X^T β)) = exp(X^T β) / (1 + exp(X^T β))
  • P(Y = 0|X, β) = 1 − σ(X^T β) = exp(−X^T β) / (1 + exp(−X^T β)) = 1 / (1 + exp(X^T β))
  • where β = (β0, β1, …, βp)^T
  • In other words,
  • Y|X, β ~ Bernoulli(σ(X^T β))

31
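A tiny Python sketch of the two-class model above; the feature vector carries a leading 1 for the intercept β0, and the β values are placeholders.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(x, beta):
    """Return (P(Y=1|x, beta), P(Y=0|x, beta)) under the logistic model."""
    z = sum(xi * bi for xi, bi in zip(x, beta))   # x^T beta
    p1 = sigmoid(z)
    return p1, 1.0 - p1

# x = (1, x1, x2): the leading 1 corresponds to the intercept beta0; values are placeholders.
print(class_probabilities((1.0, 2.0, -1.0), (0.5, 1.2, 0.7)))
```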

SLIDE 32

The 1-d Situation

  • P(Y = 1|x, β0, β1) = σ(β1x + β0)

32

SLIDE 33

Parameter Estimation

  • MLE estimation
  • Given a dataset D with n data points
  • For a single data object with attributes xi and class label yi
  • Let p(xi; β) = P(Y = 1|xi, β), the probability that i is in class 1
  • The probability of observing yi would be
  • If yi = 1, then p(xi; β)
  • If yi = 0, then 1 − p(xi; β)
  • Combining the two cases: p(xi; β)^yi (1 − p(xi; β))^(1−yi)
  • The likelihood of the whole dataset is then
    L = ∏i p(xi; β)^yi (1 − p(xi; β))^(1−yi) = ∏i [exp(xi^T β) / (1 + exp(xi^T β))]^yi [1 / (1 + exp(xi^T β))]^(1−yi)

33

SLIDE 34

Optimization

  • Equivalent to maximizing the log likelihood:
    L(β) = Σi [ yi xi^T β − log(1 + exp(xi^T β)) ]
  • Gradient ascent update:
    β_new = β_old + η ∂L(β)/∂β, where η is the step size, usually set to around 0.1
  • Newton-Raphson update:
    β_new = β_old − (∂²L(β)/∂β ∂β^T)^(−1) ∂L(β)/∂β
  • where the derivatives are evaluated at β_old

34
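A compact gradient-ascent sketch of this MLE procedure; the toy data are made up, η = 0.1 follows the step size suggested on the slide, and the gradient Σi (yi − p(xi; β)) xi is the first derivative given on the next slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy one-feature dataset (made up): each x has a leading 1 for the intercept.
X = [(1.0, 0.5), (1.0, 1.5), (1.0, 2.5), (1.0, 3.5)]
y = [0, 0, 1, 1]

beta = [0.0, 0.0]
eta = 0.1                      # step size

for _ in range(1000):
    # Gradient of the log likelihood: sum_i (y_i - p(x_i; beta)) x_i
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        for j in range(len(beta)):
            grad[j] += (yi - p) * xi[j]
    beta = [b + eta * g for b, g in zip(beta, grad)]

print(beta)   # learned coefficients (beta0, beta1)
```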

SLIDE 35

First Derivative

35

∂L(β)/∂βj = Σi (yi − p(xi; β)) xij,  for j = 0, 1, …, p

where p(xi; β) = P(Y = 1|xi, β) as defined above.

SLIDE 36

Second Derivative

  • It is a (p+1) by (p+1) matrix, the Hessian matrix, with jth row and nth column given by
    ∂²L(β)/∂βj ∂βn = −Σi xij xin p(xi; β)(1 − p(xi; β))

36

SLIDE 37

What about Multiclass Classification?

  • It is easy to handle under logistic regression; say there are M classes
  • P(Y = j|X) = exp(X^T βj) / (1 + Σ_{k=1}^{M−1} exp(X^T βk)), for j = 1, …, M − 1
  • P(Y = M|X) = 1 / (1 + Σ_{k=1}^{M−1} exp(X^T βk))

37
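A small sketch of the M-class probabilities above; class M acts as the reference class with no β vector of its own, and the β values and feature vector are placeholders.

```python
import math

def multiclass_probs(x, betas):
    """P(Y=j|x) for j = 1..M-1 plus the reference class M, given M-1 beta vectors."""
    scores = [math.exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]   # the M values sum to 1

# Placeholder example: 3 classes (M = 3), so two beta vectors; x has a leading 1 (intercept).
x = (1.0, 0.4, -1.2)
betas = [(0.2, 1.0, 0.5), (-0.3, 0.8, -0.4)]
print(multiclass_probs(x, betas))
```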

SLIDE 38

Summary

  • Bayesian Learning
  • Bayes' theorem
  • Naïve Bayes, class conditional independence
  • Bayesian Belief Network: DAG, conditional probability tables
  • Logistic Regression
  • Logistic function, two-class logistic regression, MLE estimation, gradient ascent update, Newton-Raphson update, multiclass classification under logistic regression

38