CS6220: Data Mining Techniques
Matrix Data: Classification: Part 2
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
October 1, 2013
Matrix Data: Classification: Part 2
- Bayesian Learning
- Naïve Bayes
- Bayesian Belief Network
- Logistic Regression
- Summary
Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
- Foundation: Based on Bayes’ Theorem.
- Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and selected neural network classifiers
- Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
- Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making against which other methods can be measured
Basic Probability Review
- Have two dice, h1 and h2
- The probability of rolling an i given die h1 is denoted
P(i|h1). This is a conditional probability
- Pick a die at random with probability P(hj), j=1 or 2. The
probability for picking die hj and rolling an i with it is called joint probability and is P(i, hj)=P(hj)P(i| hj).
- If we know P(X,Y), then the so-called marginal probability P(X) can be computed as P(X) = ∑_Y P(X, Y)
- For any events X and Y, P(X,Y) = P(X|Y)P(Y)
Bayes’ Theorem: Basics
- Bayes’ Theorem: P(h|X) = P(X|h) P(h) / P(X)
- Let X be a data sample (“evidence”)
- Let h be a hypothesis that X belongs to class C
- P(h) (prior probability): the initial probability
- E.g., X will buy computer, regardless of age, income, …
- P(X|h) (likelihood): the probability of observing the
sample X, given that the hypothesis holds
- E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
- P(X): marginal probability that sample data is observed, P(X) = ∑_h P(X|h) P(h)
- P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X
Classification: Choosing Hypotheses
- Maximum Likelihood (maximize the likelihood):
  h_ML = argmax_{h∈H} P(D|h)
- Maximum a posteriori (maximize the posterior):
  h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
  - Useful observation: the maximization does not depend on the denominator P(D)
- D: the whole training data set
Classification by Maximum A Posteriori
- Let D be a training set of tuples and their associated class labels, and each tuple is represented by a p-D attribute vector X = (x1, x2, …, xp)
- Suppose there are m classes Y∈{C1, C2, …, Cm}
- Classification is to derive the maximum posteriori, i.e., the
maximal P(Y=Cj|X)
- This can be derived from Bayes’ theorem: P(Y=Cj|X) = P(X|Y=Cj) P(Y=Cj) / P(X)
- Since P(X) is constant for all classes, only P(X, Y=Cj) = P(X|Y=Cj) P(Y=Cj) needs to be maximized
Example: Cancer Diagnosis
- A patient takes a lab test with two possible results
(+ve, -ve), and the result comes back positive. It is known that the test returns
- a correct positive result in only 98% of the cases (true
positive);
- a correct negative result in only 97% of the cases (true
negative).
- Furthermore, only 0.008 of the entire population has this
disease.
- 1. What is the probability that this patient has cancer?
- 2. What is the probability that he does not have cancer?
- 3. What is the diagnosis?
Solution
P(cancer) = .008          P(¬cancer) = .992
P(+ve|cancer) = .98       P(-ve|cancer) = .02
P(+ve|¬cancer) = .03      P(-ve|¬cancer) = .97

Using Bayes’ formula:
P(cancer|+ve) = P(+ve|cancer) × P(cancer) / P(+ve) = 0.98 × 0.008 / P(+ve) = .00784 / P(+ve)
P(¬cancer|+ve) = P(+ve|¬cancer) × P(¬cancer) / P(+ve) = 0.03 × 0.992 / P(+ve) = .0298 / P(+ve)

So, the patient most likely does not have cancer.
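As a quick sanity check, here is a minimal Python sketch of the computation above (the variable names are our own):

```python
# Posterior for the cancer-diagnosis example via Bayes' theorem.
p_cancer = 0.008              # prior P(cancer)
p_pos_given_cancer = 0.98     # P(+ve | cancer), the true positive rate
p_pos_given_healthy = 0.03    # P(+ve | ~cancer) = 1 - true negative rate

# Unnormalized posteriors: P(h | +ve) is proportional to P(+ve | h) * P(h)
score_cancer = p_pos_given_cancer * p_cancer            # 0.00784
score_healthy = p_pos_given_healthy * (1 - p_cancer)    # 0.02976

p_pos = score_cancer + score_healthy                    # marginal P(+ve)
print(score_cancer / p_pos)    # P(cancer | +ve)  ~ 0.21
print(score_healthy / p_pos)   # P(~cancer | +ve) ~ 0.79
```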
Matrix Data: Classification: Part 2
- Bayesian Learning
- Naïve Bayes
- Bayesian Belief Network
- Logistic Regression
- Summary
Naïve Bayes Classifier
- Let D be a training set of tuples and their associated class labels, and each tuple is represented by a p-D attribute vector X = (x1, x2, …, xp)
- Suppose there are m classes Y∈{C1, C2, …, Cm}
- Goal: find the class Y that maximizes
  P(Y|X) = P(Y, X) / P(X) ∝ P(X|Y) P(Y)
- A simplified assumption: attributes are conditionally independent given the class (class conditional independence):
  P(X|Cj) = ∏_{k=1}^{p} P(xk|Cj) = P(x1|Cj) × P(x2|Cj) × … × P(xp|Cj)
Estimate Parameters by MLE
- Given a dataset D = {(Xi, Yi)}, the goal is to
  - Find the best estimators P(Cj) and P(Xk = xk|Cj), for every j = 1, …, m and k = 1, …, p
  - that maximize the likelihood of observing D:
    L = ∏_i P(Xi, Yi) = ∏_i P(Xi|Yi) P(Yi) = ∏_i (∏_k P(xik|Yi)) P(Yi)
- Estimators of Parameters:
  - P(Cj) = |Cj,D| / |D| (|Cj,D| = # of tuples of Cj in D) (why?)
  - P(Xk = xk|Cj): Xk can be either discrete or numerical
Discrete and Continuous Attributes
- If Xk is discrete, with V possible values
- P(xk|Cj) is the # of tuples in Cj having value xk for
Xk divided by |Cj, D|
- If Xk is continuous, with observations of real values
- P(xk|Cj) is usually computed based on Gaussian
distribution with a mean μ and standard deviation σ
- Estimate (μ, σ²) according to the observed values of Xk in the category Cj
- Sample mean and sample variance
- P(xk|Cj) is then
  P(Xk = xk | Cj) = g(xk, μ_Cj, σ_Cj), where g is the Gaussian density function
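For illustration, a small Python sketch of this estimate, assuming a handful of made-up observed values of a continuous attribute within one class Cj:

```python
import math

def gaussian_density(x, mu, sigma):
    """Gaussian density g(x, mu, sigma), used as P(xk | Cj) for continuous attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical observed values of attribute Xk among tuples of class Cj
values = [35.0, 38.0, 40.0, 42.0, 45.0]
mu = sum(values) / len(values)                                  # sample mean
var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)    # sample variance
print(gaussian_density(37.0, mu, math.sqrt(var)))               # estimate of P(Xk = 37 | Cj)
```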
Naïve Bayes Classifier: Training Dataset
Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayes Classifier: An Example
- P(Ci):
  P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”) = 5/14 = 0.357
- Compute P(X|Ci) for each class
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):
  P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci):
  P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
  P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.007
  Therefore, X belongs to class (“buys_computer = yes”)
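The same computation as a minimal Python sketch, with the probabilities above hard-coded from the table:

```python
from functools import reduce

# Conditional probabilities for X = (age<=30, income=medium, student=yes, credit=fair)
p_x_given_yes = [2/9, 4/9, 6/9, 6/9]    # read off the buys_computer = "yes" rows
p_x_given_no  = [3/5, 2/5, 1/5, 2/5]    # read off the buys_computer = "no" rows
p_yes, p_no = 9/14, 5/14                # class priors

# Naive Bayes score: P(X|Ci) * P(Ci), under the conditional independence assumption
score_yes = reduce(lambda a, b: a * b, p_x_given_yes) * p_yes   # ~ 0.028
score_no  = reduce(lambda a, b: a * b, p_x_given_no) * p_no     # ~ 0.007
print("buys_computer =", "yes" if score_yes > score_no else "no")
```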
Avoiding the Zero-Probability Problem
- Naïve Bayesian prediction requires each conditional prob. be non-zero.
Otherwise, the predicted prob. will be zero
- Use Laplacian correction (or Laplacian smoothing)
- Adding 1 to each case
- P(xk = v | Cj) = (n_{jk,v} + 1) / (|Cj,D| + V), where n_{jk,v} is the # of tuples in Cj having value xk = v, and V is the total number of values that xk can take
- Ex. Suppose a training dataset with 1000 tuples; for category “buys_computer = yes”: income = low (0), income = medium (990), and income = high (10)
  Prob(income = low | buys_computer = “yes”) = 1/1003
  Prob(income = medium | buys_computer = “yes”) = 991/1003
  Prob(income = high | buys_computer = “yes”) = 11/1003
- The “corrected” prob. estimates are close to their “uncorrected” counterparts, while every factor in P(X|Cj) = ∏_{k=1}^{p} P(xk|Cj) stays non-zero
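A minimal sketch of the correction on the income example above:

```python
# Laplacian correction: add 1 to each value's count within the class.
counts = {"low": 0, "medium": 990, "high": 10}   # counts within buys_computer = "yes"
V = len(counts)                                  # number of distinct values
total = sum(counts.values())                     # 1000 tuples in the class

smoothed = {v: (c + 1) / (total + V) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```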
*Smoothing and Prior on Attribute Distribution
- Discrete distribution: Xk|Cj ~ θ
  - P(Xk = v | Cj, θ) = θ_v
- Put a prior on θ
  - In the discrete case, the prior can be chosen as a symmetric Dirichlet distribution: θ ~ Dir(α), i.e., P(θ) ∝ ∏_v θ_v^{α−1}
- Posterior distribution:
  - P(θ | x1k, …, xnk, Cj) ∝ P(x1k, …, xnk | Cj, θ) P(θ), another Dirichlet distribution, with new parameter (α + c_1, …, α + c_v, …, α + c_V)
  - c_v is the number of observations taking value v
- Inference:
  P(Xk = v | x1k, …, xnk, Cj) = ∫ P(Xk = v | θ) P(θ | x1k, …, xnk, Cj) dθ = (c_v + α) / (∑_v c_v + Vα)
- Equivalent to adding α to the count of each observation value v
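A short sketch of the resulting estimator; α = 1 recovers the Laplacian correction of the previous slide:

```python
def dirichlet_smoothed_prob(counts, value, alpha=1.0):
    """Posterior predictive P(Xk = v | observations, Cj) under a symmetric Dir(alpha) prior."""
    V = len(counts)
    return (counts[value] + alpha) / (sum(counts.values()) + V * alpha)

counts = {"low": 0, "medium": 990, "high": 10}
print(dirichlet_smoothed_prob(counts, "low", alpha=1.0))   # 1/1003, the Laplacian correction
print(dirichlet_smoothed_prob(counts, "low", alpha=0.5))   # lighter smoothing
```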
*Notes on Parameter Learning
- Why is the probability P(Xk|Cj) estimated in this way?
- http://www.cs.columbia.edu/~mcollins/em.pdf
- http://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall06/reading/NB.pdf
Naïve Bayes Classifier: Comments
- Advantages
- Easy to implement
- Good results obtained in most of the cases
- Disadvantages
- Assumption: class conditional independence, therefore loss of
accuracy
- Practically, dependencies exist among variables
- E.g., in hospitals, patients have a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
- Dependencies among these cannot be modeled by
Naïve Bayes Classifier
- How to deal with these dependencies? Bayesian Belief Networks
Matrix Data: Classification: Part 2
- Bayesian Learning
- Naïve Bayes
- Bayesian Belief Network
- Logistic Regression
- Summary
Bayesian Belief Networks (BNs)
- Bayesian belief network (also known as Bayesian network, probabilistic
network): allows class conditional independencies between subsets of variables
- Two components: (1) a directed acyclic graph (called a structure) and (2) a set of conditional probability tables (CPTs)
- A (directed acyclic) graphical model of causal influence relationships
- Represents dependency among the variables
- Gives a specification of joint probability distribution
[Figure: a small DAG with nodes X, Y, Z, P and links X → Z, Y → Z, Y → P]
- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No direct dependency between Z and P (they are conditionally independent given Y)
- Has no cycles
A Bayesian Network and Some of Its CPTs
[Figure: the fire-alarm network — Tampering (T) → Alarm (A) ← Fire (F) → Smoke (S), Alarm (A) → Leaving (L) → Report (R)]
CPT: Conditional Probability Tables
- A CPT shows the conditional probability for each possible combination of values of a node’s parents
- Derivation of the probability of a particular combination of values of X from CPTs (joint probability):
  P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(xi))
P(S|F):
         F      ¬F
  S      .90    .01
  ¬S     .10    .99

P(A|F,T):
         F,T    F,¬T   ¬F,T   ¬F,¬T
  A      .5     .99    .85    .0001
  ¬A     .5     .01    .15    .9999
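To make the factorization concrete, a sketch that multiplies CPT entries for one assignment; P(S|F) and P(A|F,T) come from the tables above, while the priors P(F) and P(T) are made-up numbers for illustration:

```python
# Joint probability of one assignment via P(x1,...,xn) = prod_i P(xi | Parents(xi)).
p_fire, p_tamper = 0.01, 0.02                     # hypothetical priors P(F), P(T)
p_smoke_given_fire = {True: 0.90, False: 0.01}    # P(S = true | F) from the CPT
p_alarm_given = {(True, True): 0.5, (True, False): 0.99,
                 (False, True): 0.85, (False, False): 0.0001}  # P(A = true | F, T)

# P(F=true, T=false, S=true, A=true) = P(F) P(~T) P(S|F) P(A|F,~T)
joint = p_fire * (1 - p_tamper) * p_smoke_given_fire[True] * p_alarm_given[(True, False)]
print(joint)   # 0.01 * 0.98 * 0.90 * 0.99 ~ 0.0087
```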
Inference in Bayesian Networks
- Infer the probability of values of some
variable given the observations of other variables
- E.g., P(Fire = True|Report = True, Smoke =
True)?
- Computation
- Exact computation by enumeration
- In general, the problem is NP hard
- *Approximation algorithms are needed
Inference by enumeration
- To compute posterior marginal P(Xi | E=e)
- Add all of the terms (atomic event
probabilities) from the full joint distribution
- If E are the evidence (observed) variables and Y are the other (unobserved) variables, then:
  P(X|e) = α P(X, e) = α ∑_y P(X, e, y), where α is the normalization constant 1/P(e)
- Each P(X, e, y) term can be computed using the chain rule
- Computationally expensive!
Example: Enumeration
- P(d|e) = α ∑_{a,b,c} P(a, b, c, d, e)
  = α ∑_{a,b,c} P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
- With simple iteration to compute this
expression, there’s going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C=true)
- *A solution: variable elimination
[Figure: DAG with links a → b, a → c, b → d, c → d, c → e]
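A sketch of the enumeration, assuming each CPT is supplied as a plain Python function (hypothetical placeholders, not a fixed API):

```python
from itertools import product

def unnormalized_posterior(d, e, P_a, P_b, P_c, P_d, P_e):
    """P(D=d, E=e) = sum over a, b, c of P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    total = 0.0
    for a, b, c in product([True, False], repeat=3):
        # P_e(e, c) is re-evaluated on every pass through the loop --
        # exactly the repeated work that variable elimination avoids.
        total += P_a(a) * P_b(b, a) * P_c(c, a) * P_d(d, b, c) * P_e(e, c)
    return total

# P(d|e) = unnormalized_posterior(d, e, ...) normalized over the two values of d
```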
*How Are Bayesian Networks Constructed?
- Subjective construction: Identification of (direct) causal structure
- People are quite good at identifying direct causes from a given set of variables &
whether the set contains all relevant direct causes
- Markovian assumption: Each variable becomes independent of its non-effects once its direct causes are known
  - E.g., S ← F → A ← T: the path between S and A is blocked once we know F
- Synthesis from other specifications
- E.g., from a formal system design: block diagrams & info flow
- Learning from data
- E.g., from medical records or student admission records
- Learn parameters given its structure, or learn both structure and parameters
- Maximum likelihood principle: favors Bayesian networks that maximize the
probability of observing the given data set
*Learning Bayesian Networks: Several Scenarios
- Scenario 1: Given both the network structure and all variables observable:
compute only the CPT entries (Easiest case!)
- Scenario 2: Network structure known, some variables hidden: gradient descent
(greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function
- Weights are initialized to random probability values
- At each iteration, it moves towards what appears to be the best solution at the moment, without backtracking
- Weights are updated at each iteration & converge to local optimum
- Scenario 3: Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
- Scenario 4: Unknown structure, all hidden variables: No good algorithms
known for this purpose
- D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in
Graphical Models, M. Jordan, ed. MIT Press, 1999.
Matrix Data: Classification: Part 2
- Bayesian Learning
- Naïve Bayes
- Bayesian Belief Network
- Logistic Regression
- Summary
Linear Regression VS. Logistic Regression
- Linear Regression
  - Y: continuous value in (−∞, +∞)
  - Y = x^T β = β0 + x1 β1 + x2 β2 + … + xp βp
  - Y | x, β ~ N(x^T β, σ²)
- Logistic Regression
  - Y: discrete value from m classes
  - P(Y = Cj) ∈ (0, 1) and ∑_j P(Y = Cj) = 1
Logistic Function
- Logistic function: f(x) = 1 / (1 + e^{−x})
- A special case of sigmoid function
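A one-liner sketch, just to show the squashing behavior:

```python
import math

def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^{-x}); maps (-inf, +inf) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))   # 0.5
print(logistic(4.0))   # ~ 0.982
print(logistic(-4.0))  # ~ 0.018
```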
Modeling Probabilities of Two Classes
- P(Y = 1 | X, β) = 1 / (1 + exp{−X^T β}) = exp{X^T β} / (1 + exp{X^T β})
- P(Y = 0 | X, β) = exp{−X^T β} / (1 + exp{−X^T β}) = 1 / (1 + exp{X^T β})
- In other words,
  Y | X, β ~ Bernoulli(1 / (1 + exp{−X^T β}))
Parameter Estimation
- MLE estimation
  - Given a dataset D with n data points
  - For a single data object with attributes x_i and class label y_i
    - Let p(x_i; β) = p_i = P(Y = 1 | x_i, β), the probability of being in class 1
    - The probability of observing y_i would be
      - If y_i = 1, then p_i
      - If y_i = 0, then 1 − p_i
      - Combining the two cases: p_i^{y_i} (1 − p_i)^{1−y_i}
  L = ∏_i p_i^{y_i} (1 − p_i)^{1−y_i} = ∏_i (exp{x_i^T β} / (1 + exp{x_i^T β}))^{y_i} (1 / (1 + exp{x_i^T β}))^{1−y_i}
Optimization
- Equivalent to maximizing the log likelihood
  l(β) = ∑_i [ y_i x_i^T β − log(1 + exp{x_i^T β}) ]
- Newton-Raphson update:
  β_new = β_old − (∂²l / ∂β ∂β^T)^{−1} ∂l/∂β
  where the derivatives are evaluated at β_old
First Derivative
- ∂l/∂β_j = ∑_i x_ij (y_i − p(x_i; β)), for j = 0, 1, …, p
Second Derivative
- It is a (p+1) × (p+1) matrix (the Hessian matrix), with the entry in the j-th row and n-th column given by
  ∂²l / ∂β_j ∂β_n = −∑_i x_ij x_in p(x_i; β) (1 − p(x_i; β))
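Putting the gradient and Hessian together, here is a minimal NumPy sketch of the Newton-Raphson update; the design matrix is assumed to carry an explicit intercept column, and the data is made up:

```python
import numpy as np

def logistic_newton(X, y, n_iter=10):
    """Two-class logistic regression fit by Newton-Raphson.
    X: n x (p+1) matrix whose first column is all ones; y: 0/1 labels."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x_i; beta) for every data point
        gradient = X.T @ (y - p)              # first derivative from the slide
        W = np.diag(p * (1.0 - p))
        hessian = -X.T @ W @ X                # second derivative (Hessian)
        beta = beta - np.linalg.solve(hessian, gradient)  # beta_new = beta_old - H^{-1} g
    return beta

# Toy, non-separable data: one attribute plus an intercept column.
X = np.array([[1, 0.5], [1, 1.5], [1, 2.5], [1, 3.5], [1, 1.0], [1, 3.0]])
y = np.array([0, 0, 1, 1, 1, 0], dtype=float)
print(logistic_newton(X, y))
```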
What about Multiclass Classification?
- It is easy to handle under logistic regression, say with M classes
- P(Y = j | X) = exp{X^T β_j} / (1 + ∑_{m=1}^{M−1} exp{X^T β_m}), for j = 1, …, M − 1
- P(Y = M | X) = 1 / (1 + ∑_{m=1}^{M−1} exp{X^T β_m})
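A small sketch of these probabilities, treating class M as the reference class whose coefficient vector is fixed at zero (the numbers here are made up):

```python
import numpy as np

def multiclass_probs(x, betas):
    """Return [P(Y=1|x), ..., P(Y=M|x)] given beta_1, ..., beta_{M-1};
    class M is the reference class with beta implicitly zero."""
    scores = np.exp([x @ b for b in betas])        # exp{x^T beta_m}, m = 1..M-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)  # last entry is P(Y=M|x)

# Toy usage: 3 classes, one data point with an intercept term.
x = np.array([1.0, 0.7])
betas = [np.array([0.2, 1.0]), np.array([-0.5, 0.3])]
probs = multiclass_probs(x, betas)
print(probs, probs.sum())   # probabilities sum to 1
```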
Summary
- Bayesian Learning
- Bayes theorem
- Naïve Bayes, class conditional independence
- Bayesian Belief Network, DAG, conditional
probability table
- Logistic Regression
- Logistic function, two-class logistic regression,
MLE estimation, Newton-Raphson update, multiclass classification under logistic regression