 
CS 4100: Artificial Intelligence
Naïve Bayes
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Machine Learning
• Up until now: how to use a model to make optimal decisions
• Machine learning: how to acquire a model from data / experience
  • Learning parameters (e.g. probabilities)
  • Learning structure (e.g. BN graphs)
  • Learning hidden concepts (e.g. clustering, neural nets)
• Today: model-based classification with Naive Bayes
Classification Example: Spam Filter
• Input: an email
• Output: spam/ham
• Setup:
  • Get a large collection of example emails, each labeled “spam” or “ham”
  • Note: someone has to hand label all this data!
  • Want to learn to predict labels of new, future emails
• Features: The attributes used to make the ham / spam decision
  • Words: FREE!
  • Text Patterns: $dd, CAPS
  • Non-text: SenderInContacts, WidelyBroadcast

Example emails:
  Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
  TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99
  Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
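The feature types listed on the spam filter slide can be sketched as a small extractor. This is only an illustration: the function name `extract_features`, the feature keys, the regular expression used for the "$dd" pattern, and the four-letter threshold for CAPS are assumptions, not part of the slides.

```python
import re

def extract_features(email_text, sender_in_contacts=False):
    """Map an email to a dict of binary features (hypothetical feature set)."""
    words = email_text.split()
    return {
        # Word feature: does the exact token "FREE!" appear?
        "contains_FREE!": "FREE!" in words,
        # Text pattern: a dollar sign followed by digits, e.g. "$99" (assumed reading of "$dd")
        "has_dollar_amount": re.search(r"\$\d+", email_text) is not None,
        # Text pattern: at least one all-caps alphabetic word of 4+ letters (assumed threshold)
        "has_caps_word": any(w.isalpha() and w.isupper() and len(w) >= 4 for w in words),
        # Non-text feature: supplied by the caller, it cannot be derived from the text
        "SenderInContacts": sender_in_contacts,
    }

features = extract_features("99 MILLION EMAIL ADDRESSES FOR ONLY $99")
```

Each feature is binary here, which matches the on/off style of features used later for Naive Bayes; real systems typically use many more features.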
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  • Get a large collection of example images, each labeled with a digit
  • Note: someone has to manually label all this data!
  • Want to learn to predict labels of new, future digit images
• Features: The attributes used to make the digit decision
  • Pixels: (6,8)=ON
  • Shape Patterns: NumComponents, AspectRatio, NumLoops
  • …
  • Features are increasingly learned rather than crafted
[Figure: example digit images labeled 0, 1, 2, 1, and a new image labeled ??]

Other Classification Tasks
• Classification: given inputs x, predict labels (classes) y
• Examples:
  • Medical diagnosis (input: symptoms, classes: diseases)
  • Fraud detection (input: account activity, classes: fraud / no fraud)
  • Automatic essay grading (input: document, classes: grades)
  • Customer service email routing
  • Review sentiment analysis
  • Language ID
  • … many more
• Classification is an important commercial technology!
Model-Based Classification
• Model-based approach
  • Build a model (e.g. Bayes’ net) where both the output label and input features are random variables
  • Instantiate observed features
  • Infer (query) the posterior distribution over the label conditioned on the features
• Challenges
  • What structure should the BN have?
  • How should we learn its parameters?
Naïve Bayes for Digits
• Naïve Bayes: Assume all features are independent effects of the label Y
• Simple digit recognition version:
  • One feature (variable) F_ij for each grid position <i,j>
  • Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  • Each input maps to a feature vector (one on/off value per grid position)
  • Here: lots of features, each is binary valued
• Naïve Bayes model: P(Y | F) ∝ P(Y) ∏_{i,j} P(F_ij | Y)
• What do we need to learn?
[Figure: Bayes’ net with label Y as the parent of features F_1, F_2, …, F_n]

General Naïve Bayes
• A general Naive Bayes model: P(Y, F_1 … F_n) = P(Y) ∏_i P(F_i | Y)
  • Full joint P(Y, F_1 … F_n): |Y| x |F|^n values
  • P(Y): |Y| parameters
  • P(F_i | Y) tables: n x |F| x |Y| parameters
• We only have to specify how each feature depends on the class
• Total number of parameters is linear in n
• Model is very simplistic, but often works anyway
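The parameter counts on the General Naïve Bayes slide can be checked with a few lines of arithmetic. The function names and the 64-feature example (a hypothetical 8 x 8 grid of binary pixels) are assumptions chosen for illustration.

```python
def full_joint_size(num_labels, num_feature_values, num_features):
    """Entries in the full joint table P(Y, F_1 ... F_n): |Y| x |F|^n."""
    return num_labels * num_feature_values ** num_features

def naive_bayes_size(num_labels, num_feature_values, num_features):
    """Entries in P(Y) plus all P(F_i | Y) tables: |Y| + n x |F| x |Y|."""
    return num_labels + num_features * num_feature_values * num_labels

# Hypothetical setting: 10 digit labels, binary pixels, 64 grid positions.
nb_params = naive_bayes_size(10, 2, 64)
joint_entries = full_joint_size(10, 2, 64)
print(nb_params, joint_entries)
```

The Naive Bayes count grows linearly in the number of features, while the full joint grows exponentially, which is the point of the slide.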
Inference for Naïve Bayes
• Goal: compute posterior distribution over label variable Y
  • Step 1: get joint probability of label and evidence for each label: P(y, f_1 … f_n) = P(y) ∏_i P(f_i | y)
  • Step 2: sum to get (marginal) probability of evidence: P(f_1 … f_n) = Σ_y P(y, f_1 … f_n)
  • Step 3: normalize by dividing Step 1 by Step 2: P(Y | f_1 … f_n) = P(Y, f_1 … f_n) / P(f_1 … f_n)

General Naïve Bayes
• What do we need in order to use Naïve Bayes?
  • Inference method (we just saw this part)
    • Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
    • Use standard inference to compute P(Y | F_1 … F_n)
    • Nothing new here
  • Estimates of local conditional probability tables
    • P(Y), the prior over labels
    • P(F_i | Y) for each feature (evidence variable)
    • These probabilities are collectively called the parameters of the model and denoted by θ
    • Up until now, we assumed these appeared by magic, but…
    • …they typically come from training data counts: we’ll look at this soon
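The three inference steps can be sketched directly in code. The function name `posterior`, the data layout (one dict per feature), and all probability values below are made up for illustration; they are not taken from the slides.

```python
def posterior(prior, cpts, evidence):
    """Compute P(Y | f_1 ... f_n) for a Naive Bayes model.

    prior:    dict label -> P(label)
    cpts:     list with one dict per feature, label -> {value -> P(value | label)}
    evidence: sequence of observed feature values, aligned with cpts
    """
    # Step 1: joint probability of each label together with the evidence.
    joint = {}
    for label, p_label in prior.items():
        p = p_label
        for i, value in enumerate(evidence):
            p *= cpts[i][label][value]
        joint[label] = p
    # Step 2: marginal probability of the evidence.
    p_evidence = sum(joint.values())
    # Step 3: normalize Step 1 by Step 2.
    return {label: p / p_evidence for label, p in joint.items()}

# Made-up parameters for two binary features.
prior = {"spam": 0.5, "ham": 0.5}
cpts = [
    {"spam": {True: 0.8, False: 0.2}, "ham": {True: 0.1, False: 0.9}},
    {"spam": {True: 0.6, False: 0.4}, "ham": {True: 0.3, False: 0.7}},
]
result = posterior(prior, cpts, evidence=(True, False))
```

With many features the product in Step 1 underflows quickly, so practical implementations usually sum log probabilities instead.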
Example: Conditional Probabilities

  Y   P(Y)   P(F_3,1 = on | Y)   P(F_5,5 = on | Y)
  1   0.1    0.01                0.05
  2   0.1    0.05                0.01
  3   0.1    0.05                0.90
  4   0.1    0.30                0.80
  5   0.1    0.80                0.90
  6   0.1    0.90                0.90
  7   0.1    0.05                0.25
  8   0.1    0.60                0.85
  9   0.1    0.50                0.60
  0   0.1    0.80                0.80

Naïve Bayes for Text
• Bag-of-words Naïve Bayes:
  • Features: W_i is the word at position i
  • As before: predict label conditioned on feature variables (spam vs. ham)
  • As before: assume features are conditionally independent given label
  • New: each W_i is identically distributed
  • Note: W_i is the word at position i, not the i-th word in the dictionary!
• Generative model: P(Y, W_1 … W_n) = P(Y) ∏_i P(W_i | Y)
• “Tied” distributions and bag-of-words
  • Usually, each variable gets its own conditional probability distribution P(F | Y)
  • In a bag-of-words model
    • Each position is identically distributed
    • All positions share the same conditional probabilities P(W | Y)
    • Why make this assumption?
  • Called “bag-of-words” because the model is insensitive to word order or reordering
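Because every position shares the same table P(W | Y), the joint probability depends only on how often each word occurs, not on where. A minimal sketch of that property follows; the function name `log_joint` and the probability values are hypothetical.

```python
import math
from collections import Counter

def log_joint(label, words, prior, word_probs):
    """log P(label, w_1 ... w_n) under a bag-of-words Naive Bayes model."""
    total = math.log(prior[label])
    # Only word counts matter: each occurrence contributes log P(word | label).
    for word, count in Counter(words).items():
        total += count * math.log(word_probs[label][word])
    return total

# Made-up parameters for a three-word vocabulary.
prior = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.5, "money": 0.4, "hello": 0.1},
    "ham": {"free": 0.1, "money": 0.2, "hello": 0.7},
}

original = ["free", "money", "free", "hello"]
reordered = ["hello", "free", "free", "money"]
same_score = math.isclose(
    log_joint("spam", original, prior, word_probs),
    log_joint("spam", reordered, prior, word_probs),
)
```

The two orderings get the same score (up to floating-point rounding), which is what "insensitive to word order" means in practice.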
Example: Spam Filtering
• Model: P(Y, W_1 … W_n) = P(Y) ∏_i P(W_i | Y)
• What are the parameters?

  P(Y)            P(W | spam)       P(W | ham)
  ham : 0.66      the : 0.0156      the : 0.0210
  spam: 0.33      to  : 0.0153      to  : 0.0133
                  and : 0.0115      of  : 0.0119
                  of  : 0.0095      2002: 0.0110
                  you : 0.0093      with: 0.0108
                  a   : 0.0086      from: 0.0107
                  with: 0.0080      and : 0.0105
                  from: 0.0075      a   : 0.0100
                  ...               ...

• Where do these tables come from?

Spam Example
(Tot Spam and Tot Ham are running sums of log probabilities.)

  Word      P(w|spam)   P(w|ham)   Tot Spam   Tot Ham
  (prior)   0.33333     0.66666     -1.1       -0.4
  Gary      0.00002     0.00021    -11.8       -8.9
  would     0.00069     0.00084    -19.1      -16.0
  you       0.00881     0.00304    -23.8      -21.8
  like      0.00086     0.00083    -30.9      -28.9
  to        0.01517     0.01339    -35.1      -33.2
  lose      0.00008     0.00002    -44.5      -44.0
  weight    0.00016     0.00002    -53.3      -55.0
  while     0.00027     0.00027    -61.5      -63.2
  you       0.00881     0.00304    -66.2      -69.0
  sleep     0.00006     0.00001    -76.0      -80.5

P(spam | w) = 98.9%
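The running totals in the Spam Example table can be reproduced by summing log probabilities. The sketch below assumes natural logarithms (consistent with ln 0.33333 ≈ -1.1) and uses the rounded probabilities printed in the table, so its totals and final posterior land close to, but not exactly on, the slide's figures. The variable and function names are illustrative.

```python
import math

PRIOR = {"spam": 0.33333, "ham": 0.66666}
# (word, P(w | spam), P(w | ham)) rows copied from the table.
ROWS = [
    ("Gary", 0.00002, 0.00021), ("would", 0.00069, 0.00084),
    ("you", 0.00881, 0.00304), ("like", 0.00086, 0.00083),
    ("to", 0.01517, 0.01339), ("lose", 0.00008, 0.00002),
    ("weight", 0.00016, 0.00002), ("while", 0.00027, 0.00027),
    ("you", 0.00881, 0.00304), ("sleep", 0.00006, 0.00001),
]

def running_totals(prior, rows):
    """Return a list of (word, total_spam, total_ham) cumulative log probabilities."""
    tot_spam, tot_ham = math.log(prior["spam"]), math.log(prior["ham"])
    totals = [("(prior)", tot_spam, tot_ham)]
    for word, p_spam, p_ham in rows:
        tot_spam += math.log(p_spam)
        tot_ham += math.log(p_ham)
        totals.append((word, tot_spam, tot_ham))
    return totals

totals = running_totals(PRIOR, ROWS)
_, final_spam, final_ham = totals[-1]
# Normalize in log space: P(spam | w) = 1 / (1 + exp(log_ham - log_spam)).
p_spam_given_w = 1.0 / (1.0 + math.exp(final_ham - final_spam))

for word, s, h in totals:
    print(f"{word:8s} {s:7.1f} {h:7.1f}")
print(f"P(spam | w) = {p_spam_given_w:.3f}")
```

Working in log space avoids underflow: the raw joint probabilities here are on the order of e^-76, and longer emails would push them below what floating point can represent.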
Training and Testing

Empirical Risk Minimization
• Empirical risk minimization
  • Basic principle of machine learning
  • Want: The model (classifier, regressor) that does best on the test distribution
  • Don’t know: The true data distribution
  • Solution: Pick the best model based on the training data
  • Finding “the best” model on the training set is an optimization problem
• Main worry: Overfitting to the training set
  • Better with more training data (less sampling variance, training more like test)
  • Better if we limit the complexity of our hypotheses (regularization and/or small hypothesis spaces)
Important Concepts
• Data: labeled instances (e.g. emails marked spam/ham)
  • Training set
  • Held out set
  • Test set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle
  • Learn parameters (e.g. model probabilities) on training set
  • Tune hyperparameters on held-out set
  • Compute accuracy on test set
  • Very important: never “peek” at the test set!
• Evaluation (many metrics possible, e.g. accuracy)
  • Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  • Want a classifier which does well on test data
  • Overfitting: fitting the training data very closely, but not generalizing well
  • We’ll investigate overfitting and generalization formally in a few lectures
[Figure: data divided into Training Data, Held-Out Data, and Test Data]

Generalization and Overfitting
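The data split and the accuracy metric listed under Important Concepts can be sketched as follows. The 80/10/10 proportions, the fixed seed, and the function names are assumptions made for illustration; the slides do not prescribe particular fractions.

```python
import random

def split_data(instances, train_frac=0.8, held_out_frac=0.1, seed=0):
    """Shuffle once, then cut into training, held-out, and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_held = int(len(shuffled) * held_out_frac)
    train = shuffled[:n_train]
    held_out = shuffled[n_train:n_train + n_held]
    test = shuffled[n_train + n_held:]  # touched only once, for the final evaluation
    return train, held_out, test

def accuracy(predictions, labels):
    """Fraction of instances predicted correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

train, held_out, test = split_data(range(100))
```

Parameters would be learned on `train`, hyperparameters compared on `held_out`, and `accuracy` reported on `test` only at the very end.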
Overfitting
[Figure: four plots of fits with increasing complexity (constant function, linear function, degree 3 polynomial, degree 15 polynomial); x-axis from 0 to 20, y-axis from -15 to 30]