
CS 391L: Machine Learning: Bayesian Learning: Beyond Naïve Bayes
Raymond J. Mooney
University of Texas at Austin

Logistic Regression
• Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is:

$$P(Y=1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$

$$P(Y=0 \mid X) = 1 - P(Y=1 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$

• Equivalent to a one-layer backpropagation neural net.
– Logistic regression is the source of the sigmoid function used in backpropagation.
– Objective function for training is somewhat different.

Logistic Regression as a Log-Linear Model
• Logistic regression is basically a linear model, which is demonstrated by taking logs. Assign label Y = 0 iff:

$$1 < \frac{P(Y=0 \mid X)}{P(Y=1 \mid X)}$$

$$1 < \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)$$

$$0 < w_0 + \sum_{i=1}^{n} w_i X_i$$

or equivalently

$$w_0 > -\sum_{i=1}^{n} w_i X_i$$

• Also called a maximum entropy model (MaxEnt) because it can be shown that standard training for logistic regression gives the distribution with maximum entropy that is consistent with the training data.

Logistic Regression Training
• Weights are set during training to maximize the conditional data likelihood:

$$W \leftarrow \operatorname*{argmax}_{W} \prod_{d \in D} P(Y^d \mid X^d, W)$$

where D is the set of training examples and Y^d and X^d denote, respectively, the values of Y and X for example d.
• Equivalently viewed as maximizing the conditional log likelihood (CLL):

$$W \leftarrow \operatorname*{argmax}_{W} \sum_{d \in D} \ln P(Y^d \mid X^d, W)$$

• Like neural nets, can use standard gradient descent to find the parameters (weights) that optimize the CLL objective function.
• Many other more advanced training methods are possible to speed convergence.
– Conjugate gradient
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
– Limited-memory quasi-Newton (L-BFGS)

Preventing Overfitting in Logistic Regression
• To prevent overfitting, one can use regularization (a.k.a. smoothing) by penalizing large weights by changing the training objective:

$$W \leftarrow \operatorname*{argmax}_{W} \sum_{d \in D} \ln P(Y^d \mid X^d, W) - \frac{\lambda}{2} \lVert W \rVert^2$$

where λ is a constant that determines the amount of smoothing.
• This can be shown to be equivalent to assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.
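As an illustration of the training objective above, the following is a minimal sketch of gradient ascent on the regularized conditional log likelihood, using the slides' sign convention in which P(Y=1 | X) = 1 / (1 + exp(w0 + Σ w_i X_i)). The function names, the toy data, the learning rate, the step count, and the choice to leave the bias w0 unpenalized are all assumptions made for the example, not part of the original slides.

```python
import numpy as np

def prob_y1(X, w0, w):
    """P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i X_i)), the slides' sign convention."""
    return 1.0 / (1.0 + np.exp(w0 + X @ w))

def train_logistic(X, y, lam=0.0, lr=0.5, steps=2000):
    """Gradient ascent on sum_d ln P(y_d | x_d, W) - (lam / 2) * ||w||^2."""
    n_examples, n_features = X.shape
    w0, w = 0.0, np.zeros(n_features)
    for _ in range(steps):
        # With z = w0 + w.x, d/dz ln P(y | x) = P(Y=1 | x) - y under this convention.
        residual = prob_y1(X, w0, w) - y
        grad_w = X.T @ residual - lam * w  # penalty on feature weights only (assumption)
        grad_w0 = residual.sum()           # bias left unpenalized (assumption)
        w += (lr / n_examples) * grad_w
        w0 += (lr / n_examples) * grad_w0
    return w0, w

def accuracy(X, y, w0, w):
    """Fraction of examples where thresholding P(Y=1 | X) at 0.5 matches the label."""
    return float(np.mean((prob_y1(X, w0, w) > 0.5) == (y == 1)))

# Hypothetical toy data: two Gaussian features, linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w0_plain, w_plain = train_logistic(X, y, lam=0.0)
w0_reg, w_reg = train_logistic(X, y, lam=10.0)

acc_plain = accuracy(X, y, w0_plain, w_plain)
acc_reg = accuracy(X, y, w0_reg, w_reg)
norm_plain = float(np.linalg.norm(w_plain))
norm_reg = float(np.linalg.norm(w_reg))
print(acc_plain, acc_reg, norm_plain, norm_reg)
```

On separable data like this, the unregularized weights keep growing as training continues, while the penalty keeps the regularized weights small; both runs should still classify the toy set well.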

Relation Between Naïve Bayes and Logistic Regression
• Naïve Bayes with Gaussian distributions for features (GNB) can be shown to give the same functional form for the conditional distribution P(Y | X).
– But the converse is not true, so logistic regression makes a weaker assumption.
• Logistic regression is a discriminative rather than generative model, since it models the conditional distribution P(Y | X) and directly attempts to fit the training data for predicting Y from X. It does not specify a full joint distribution.
• When conditional independence is violated, logistic regression gives better generalization if it is given sufficient training data.
• GNB converges to accurate parameter estimates faster (O(log n) examples for n features) compared to logistic regression (O(n) examples).
– Experimentally, GNB is better when training data is scarce, logistic regression is better when it is plentiful.

Multinomial Logistic Regression
• Logistic regression can be generalized to multi-class problems (where Y has a multinomial distribution).
• Effectively constructs a linear classifier for each category.

Graphical Models
• If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference.
• No realistic amount of training data is sufficient to estimate so many parameters.
• If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted.
• Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated.
– Bayesian Networks: Directed acyclic graphs that indicate causal structure.
– Markov Networks: Undirected graphs that capture general dependencies.

Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences

[Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

Conditional Probability Tables
• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior probabilities.

P(B) = .001        P(E) = .002

B  E  P(A)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(J)        A  P(M)
T  .90         T  .70
F  .05         F  .01

CPT Comments
• Probability of false is not given since rows must add to 1.
• Example requires 10 parameters rather than 2^5 − 1 = 31 for specifying the full joint distribution.
• Number of parameters in the CPT for a node is exponential in the number of parents (fan-in).
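The parameter counts quoted in the CPT comments can be checked with a short sketch. It assumes every variable is Boolean, so a node with k parents needs 2^k CPT entries; the dictionary layout and function names are illustrative choices, not from the slides.

```python
# Parents of each Boolean node in the burglary network from the slides.
parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

def cpt_parameters(parents):
    """One free parameter per conditioning case: 2**k for a node with k Boolean parents."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())

def full_joint_parameters(n_vars):
    """A full joint table over n Boolean variables has 2**n entries that sum to 1."""
    return 2 ** n_vars - 1

bn_params = cpt_parameters(parents)
joint_params = full_joint_parameters(len(parents))
print(bn_params, joint_params)  # 10 31
```

The Alarm node alone accounts for 4 of the 10 parameters, which is the fan-in effect noted above.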

Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines a joint distribution.

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$$

• Example

$$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B \wedge \neg E)\, P(\neg B)\, P(\neg E)$$

$$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.000628$$

• Therefore an inefficient approach to inference is:
– 1) Compute the joint distribution using this equation.
– 2) Compute any desired conditional probability using the joint distribution.

Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net.

[Figure: Y → X_1, Y → X_2, …, Y → X_n]

• Priors P(Y) and conditionals P(X_i | Y) for Naïve Bayes provide CPTs for the network.

Bayes Net Inference
• Given known values for some evidence variables, determine the posterior probability of some query variables.
• Example: Given that John calls, what is the probability that there is a Burglary? The relevant table entries are P(B) = .001, P(J | A) = .90, and P(J | ¬A) = .05.
– John calls 90% of the time there is an Alarm and the Alarm detects 94% of Burglaries, so people generally think it should be fairly high. However, this ignores the prior probability of John calling.
– John also calls 5% of the time when there is no Alarm. So over 1,000 days we expect 1 Burglary and John will probably call. However, he will also call with a false report 50 times on average. So the call is about 50 times more likely a false report: P(Burglary | JohnCalls) ≈ 0.02.
– Actual probability of Burglary is 0.016 since the alarm is not perfect (an Earthquake could have set it off or it could have gone off on its own). On the other side, even if there was not an alarm and John called incorrectly, there could have been an undetected Burglary anyway, but this is unlikely.

Complexity of Bayes Net Inference
• In general, the problem of Bayes Net inference is NP-hard (known exact algorithms are exponential in the size of the graph in the worst case).
• For singly-connected networks or polytrees, in which there are no undirected loops, there are linear-time algorithms based on belief propagation.
– Each node sends local evidence messages to its children and parents.
– Each node updates belief in each of its possible values based on incoming messages from its neighbors and propagates evidence on to its neighbors.
• There are approximations to inference for general networks based on loopy belief propagation, which iteratively refines probabilities that often converge to accurate values, although convergence is not guaranteed in general.
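The two-step "inefficient approach" described earlier (build the joint from the CPTs, then condition on the evidence) can be sketched directly for the burglary network. The CPT values are the ones given in the slides; the helper names and the enumeration-based query function are illustrative assumptions, and this brute-force method is exponential in the number of variables, unlike belief propagation.

```python
from itertools import product

# CPT values from the slides' burglary network.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls = T | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls = T | Alarm)

def bern(p_true, value):
    """Probability that a Boolean variable takes `value` given P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Joint probability as the product of each node's CPT entry given its parents."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def query(target, evidence):
    """P(target = True | evidence) by summing the joint over all consistent worlds."""
    names = ["b", "e", "a", "j", "m"]
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator

p_example = joint(b=False, e=False, a=True, j=True, m=True)
p_burglary_given_john = query("b", {"j": True})
print(round(p_example, 6), round(p_burglary_given_john, 3))
```

The query reproduces the value of about 0.016 for P(Burglary | JohnCalls), slightly below the rough estimate of 0.02 from the counting argument.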
