CS 4100: Artificial Intelligence - Naïve Bayes


1. CS 4100: Artificial Intelligence - Naïve Bayes. Jan-Willem van de Meent, Northeastern University. [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Machine Learning
• Up until now: how to use a model to make optimal decisions
• Machine learning: how to acquire a model from data / experience
  • Learning parameters (e.g. probabilities)
  • Learning structure (e.g. BN graphs)
  • Learning hidden concepts (e.g. clustering, neural nets)
• Today: model-based classification with Naïve Bayes

2. Classification Example: Spam Filter
• Input: an email
• Output: spam / ham
• Setup:
  • Get a large collection of example emails, each labeled "spam" or "ham"
  • Note: someone has to hand-label all this data!
  • Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham / spam decision
  • Words: FREE!
  • Text patterns: $dd, CAPS
  • Non-text: SenderInContacts, WidelyBroadcast
• Example emails shown on the slide:
  • (spam) "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
  • (spam) "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
  • (ham) "Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."

3. Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
  • Get a large collection of example images, each labeled with a digit
  • Note: someone has to manually label all this data!
  • Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
  • Pixels: (6,8) = ON
  • Shape patterns: NumComponents, AspectRatio, NumLoops
  • …
  • Features are increasingly learned rather than crafted

Other Classification Tasks
• Classification: given inputs x, predict labels (classes) y
• Examples:
  • Medical diagnosis (input: symptoms, classes: diseases)
  • Fraud detection (input: account activity, classes: fraud / no fraud)
  • Automatic essay grading (input: document, classes: grades)
  • Customer service email routing
  • Review sentiment analysis
  • Language ID
  • … many more
• Classification is an important commercial technology!

4. Model-Based Classification
• Model-based approach
  • Build a model (e.g. a Bayes' net) where both the output label and input features are random variables
  • Instantiate observed features
  • Infer (query) the posterior distribution over the label conditioned on the features
• Challenges
  • What structure should the BN have?
  • How should we learn its parameters?

5. Naïve Bayes for Digits
• Naïve Bayes: assume all features are independent effects of the label Y
• Simple digit recognition version:
  • One feature (variable) F_ij for each grid position <i,j>
  • Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  • Each input image maps to a feature vector
  • Here: lots of features, each is binary valued
• Naïve Bayes model: P(Y, F_1, …, F_n) = P(Y) ∏_i P(F_i | Y)
• What do we need to learn?

General Naïve Bayes
• A general Naïve Bayes model: P(Y, F_1, …, F_n) = P(Y) ∏_i P(F_i | Y)
  • P(Y) has |Y| parameters; each P(F_i | Y) table has |F| × |Y| values, giving n × |F| × |Y| parameters for the features
• We only have to specify how each feature depends on the class
• Total number of parameters is linear in n (a small code sketch follows below)
• Model is very simplistic, but often works anyway
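A minimal sketch of this model in code, assuming a tiny two-label problem with three binary pixel features; the prior and cond tables below are made-up placeholder values, not numbers from the slides:

    # Naive Bayes with binary features: P(Y, F_1..F_n) = P(Y) * prod_i P(F_i | Y).
    # The tables are hypothetical; in practice they are estimated from data.
    prior = {0: 0.5, 1: 0.5}              # P(Y = y)
    cond = {                              # P(F_i = on | Y = y), indexed [y][i]
        0: [0.80, 0.80, 0.10],
        1: [0.05, 0.01, 0.90],
    }

    def joint(y, features):
        """P(Y=y, F_1=f_1, ..., F_n=f_n) = P(y) * prod_i P(f_i | y)."""
        p = prior[y]
        for i, f in enumerate(features):
            p_on = cond[y][i]
            p *= p_on if f else (1.0 - p_on)
        return p

    # Parameter count: |Y| values for P(Y) plus n * |F| * |Y| values for the
    # conditional tables -- linear in the number of features n.
    n, size_Y, size_F = 3, 2, 2
    print(size_Y + n * size_F * size_Y)   # 14

For comparison, a full joint table over Y and n binary features would need |Y| × |F|^n entries, which is why the Naïve Bayes factorization matters.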

6. Inference for Naïve Bayes
• Goal: compute the posterior distribution over the label variable Y
• Step 1: get the joint probability of label and evidence, for each label: P(y, f_1, …, f_n) = P(y) ∏_i P(f_i | y)
• Step 2: sum over labels to get the (marginal) probability of the evidence: P(f_1, …, f_n) = Σ_y P(y, f_1, …, f_n)
• Step 3: normalize by dividing Step 1 by Step 2: P(y | f_1, …, f_n) = P(y, f_1, …, f_n) / P(f_1, …, f_n)
  (these three steps are sketched in code below)

General Naïve Bayes
• What do we need in order to use Naïve Bayes?
  • Inference method (we just saw this part)
    • Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables
    • Use standard inference to compute P(Y | F_1 … F_n)
    • Nothing new here
  • Estimates of local conditional probability tables
    • P(Y), the prior over labels
    • P(F_i | Y) for each feature (evidence variable)
    • These probabilities are collectively called the parameters of the model and denoted by θ
    • Up until now, we assumed these appeared by magic, but…
    • …they typically come from training data counts: we'll look at this soon
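The three inference steps can be written directly as a short function; this sketch reuses hypothetical prior and cond tables in the same format as the previous snippet (no values here come from the slides):

    def posterior(features, prior, cond):
        """Naive Bayes inference: P(Y | f_1, ..., f_n)."""
        # Step 1: joint probability of label and evidence, for each label.
        joint = {}
        for y in prior:
            p = prior[y]
            for i, f in enumerate(features):
                p_on = cond[y][i]
                p *= p_on if f else (1.0 - p_on)
            joint[y] = p
        # Step 2: (marginal) probability of the evidence, summed over labels.
        evidence = sum(joint.values())
        # Step 3: normalize Step 1 by Step 2.
        return {y: joint[y] / evidence for y in joint}

    # Example: posterior([1, 0, 1], prior, cond) returns a distribution over
    # labels that sums to 1.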

7. Example: Conditional Probabilities
• Example tables for the digit model: the prior P(Y) and P(F = on | Y) for two example pixel features:

  y   P(Y)   P(pixel A = on | Y)   P(pixel B = on | Y)
  1   0.1    0.01                  0.05
  2   0.1    0.05                  0.01
  3   0.1    0.05                  0.90
  4   0.1    0.30                  0.80
  5   0.1    0.80                  0.90
  6   0.1    0.90                  0.90
  7   0.1    0.05                  0.25
  8   0.1    0.60                  0.85
  9   0.1    0.50                  0.60
  0   0.1    0.80                  0.80

Naïve Bayes for Text
• Bag-of-words Naïve Bayes:
  • Features: W_i is the word at position i
  • As before: predict label conditioned on feature variables (spam vs. ham)
  • As before: assume features are conditionally independent given the label
  • New: each W_i is identically distributed
  • Note: W_i is the word at position i, not the i-th word in the dictionary!
• Generative model: P(Y, W_1, …, W_n) = P(Y) ∏_i P(W_i | Y)
• "Tied" distributions and bag-of-words
  • Usually, each variable gets its own conditional probability distribution P(F | Y)
  • In a bag-of-words model:
    • Each position is identically distributed
    • All positions share the same conditional probabilities P(W | Y)
    • Why make this assumption?
  • Called "bag-of-words" because the model is insensitive to word order or reordering (a code sketch follows below)
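A sketch of the bag-of-words scoring described above: a single shared table P(W | Y) is applied at every position, so reordering the words cannot change the score. The vocabulary and probabilities are illustrative placeholders, not the slide's values:

    import math

    # Shared ("tied") conditional table P(w | y): every position uses the same
    # distribution, which is what makes this a bag-of-words model.
    p_word = {
        "spam": {"free": 0.030, "money": 0.020, "meeting": 0.001},
        "ham":  {"free": 0.001, "money": 0.002, "meeting": 0.020},
    }
    p_label = {"spam": 0.33, "ham": 0.66}

    def log_joint(label, words):
        """log P(Y=label) + sum_i log P(W_i = words[i] | Y=label)."""
        total = math.log(p_label[label])
        for w in words:
            total += math.log(p_word[label][w])
        return total

    # Word order does not matter: both orderings give the same score.
    print(log_joint("spam", ["free", "money"]) == log_joint("spam", ["money", "free"]))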

8. Example: Spam Filtering
• Model: P(Y, W_1, …, W_n) = P(Y) ∏_i P(W_i | Y)
• What are the parameters?

  P(Y):        ham: 0.66, spam: 0.33
  P(W | spam): the: 0.0156, to: 0.0153, and: 0.0115, of: 0.0095, you: 0.0093, a: 0.0086, with: 0.0080, from: 0.0075, ...
  P(W | ham):  the: 0.0210, to: 0.0133, of: 0.0119, 2002: 0.0110, with: 0.0108, from: 0.0107, and: 0.0105, a: 0.0100, ...

• Where do these tables come from?

Spam Example (the running totals are cumulative log probabilities; see the sketch below)

  Word      P(w|spam)   P(w|ham)   Tot Spam   Tot Ham
  (prior)   0.33333     0.66666    -1.1       -0.4
  Gary      0.00002     0.00021    -11.8      -8.9
  would     0.00069     0.00084    -19.1      -16.0
  you       0.00881     0.00304    -23.8      -21.8
  like      0.00086     0.00083    -30.9      -28.9
  to        0.01517     0.01339    -35.1      -33.2
  lose      0.00008     0.00002    -44.5      -44.0
  weight    0.00016     0.00002    -53.3      -55.0
  while     0.00027     0.00027    -61.5      -63.2
  you       0.00881     0.00304    -66.2      -69.0
  sleep     0.00006     0.00001    -76.0      -80.5

  P(spam | w) = 98.9%
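The "Tot Spam" and "Tot Ham" columns are cumulative (natural) log probabilities; here is a short sketch of that calculation using the first few words of the table, with the posterior recovered by normalizing at the end:

    import math

    # Per-word probabilities for the first few words of the slide's table.
    p_spam_w = {"Gary": 0.00002, "would": 0.00069, "you": 0.00881}
    p_ham_w  = {"Gary": 0.00021, "would": 0.00084, "you": 0.00304}
    prior = {"spam": 0.33333, "ham": 0.66666}

    words = ["Gary", "would", "you"]

    # Accumulate log probabilities (the running "Tot Spam" / "Tot Ham" columns).
    tot_spam = math.log(prior["spam"]) + sum(math.log(p_spam_w[w]) for w in words)
    tot_ham  = math.log(prior["ham"])  + sum(math.log(p_ham_w[w])  for w in words)

    # Normalize to recover the posterior P(spam | words).
    p_spam = math.exp(tot_spam) / (math.exp(tot_spam) + math.exp(tot_ham))
    print(round(tot_spam, 1), round(tot_ham, 1), round(p_spam, 3))

After these three words the totals already roughly match the table (about -23.9 and -21.8); extending the loop over all eleven rows yields the slide's totals of about -76.0 and -80.5, which normalize to P(spam | w) ≈ 98.9%.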

9. Training and Testing: Empirical Risk Minimization
• Empirical risk minimization
  • Basic principle of machine learning
  • Want: the model (classifier, regressor) that does best on the test distribution
  • Don't know: the true data distribution
  • Solution: pick the best model based on the training data
    • Finding "the best" model on the training set is an optimization problem
• Main worry: overfitting to the training set
  • Better with more training data (less sampling variance, training more like test)
  • Better if we limit the complexity of our hypotheses (regularization and/or small hypothesis spaces)

10. Important Concepts
• Data: labeled instances (e.g. emails marked spam/ham), split into
  • Training set
  • Held-out set
  • Test set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle (see the skeleton code below)
  • Learn parameters (e.g. model probabilities) on the training set
  • Tune hyperparameters on the held-out set
  • Compute accuracy on the test set
  • Very important: never "peek" at the test set!
• Evaluation (many metrics possible, e.g. accuracy)
  • Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  • Want a classifier which does well on test data
  • Overfitting: fitting the training data very closely, but not generalizing well
  • We'll investigate overfitting and generalization formally in a few lectures
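A skeleton of that experimentation cycle, with a deliberately trivial stand-in model (predict the most common training label) just to make the data flow concrete; the train and accuracy helpers are hypothetical placeholders, not part of the course code:

    # Skeleton of the train / held-out / test cycle. Datasets are lists of
    # (features, label) pairs; the "model" here is a trivial placeholder.
    def train(dataset, hyperparameter):
        # Placeholder: ignore the hyperparameter, predict the most common label.
        labels = [y for _, y in dataset]
        return max(set(labels), key=labels.count)

    def accuracy(model, dataset):
        return sum(1 for _, y in dataset if y == model) / len(dataset)

    def experiment(training_set, held_out_set, test_set, hyperparameter_values):
        best_model, best_score = None, -1.0
        for h in hyperparameter_values:
            model = train(training_set, h)         # learn parameters on the training set
            score = accuracy(model, held_out_set)  # tune hyperparameters on the held-out set
            if score > best_score:
                best_model, best_score = model, score
        # The test set is used exactly once, at the very end: never "peek" at it.
        return best_model, accuracy(best_model, test_set)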

Generalization and Overfitting

11. Overfitting
[Figure: a noisy data set fit with a constant function, then with a linear function; axes run roughly 0 to 20 horizontally and -15 to 30 vertically.]

12. Overfitting
[Figure: the same data fit with a degree-3 polynomial, then with a degree-15 polynomial; axes as before.]
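A quick numerical illustration of the fits shown in these plots, assuming some synthetic noisy data (the actual points from the slides are not recoverable here); as the polynomial degree grows, training error falls while the fit behaves wildly outside the data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 20, 30)
    y = 2.0 * x - 5.0 + rng.normal(scale=4.0, size=x.size)   # synthetic noisy data

    for degree in [0, 1, 3, 15]:   # constant, linear, degree-3, degree-15 fits
        # numpy may warn that the high-degree fit is poorly conditioned.
        coeffs = np.polyfit(x, y, degree)                     # least-squares fit
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        # Evaluate just outside the training range to see extrapolation blow up.
        print(degree, round(train_err, 2), round(float(np.polyval(coeffs, 22.0)), 1))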
