INF4080 – 2020 FALL
NATURAL LANGUAGE PROCESSING
Jan Tore Lønning
(Mostly Text) Classification, Naive Bayes
Lecture 3, 31 Aug
Today:
Motivation
Classification
Naive Bayes classification
NB for text classification: the multinomial model; the Bernoulli model
Experiments: training, test and cross-validation
Evaluation
Sholokhov, 1905-1984
"And Quiet Flows the Don", published 1928-1940
Nobel prize in literature, 1965
Authorship contested, e.g. by Aleksandr Solzhenitsyn, 1974
Geir Kjetsaa (UiO) et al., 1984, refuted the contestants
Nils Lid Hjort, 2007, confirmed Kjetsaa's conclusion
https://en.wikipedia.org/wiki/Mikhail_Sholokhov
Kjetsaa, according to Hjort: "In addition to various linguistic analyses and several doses of detective work, quantitative data were gathered and organised, for example relating to word lengths, frequencies of certain words and phrases, sentence lengths, grammatical characteristics, etc."
"unbelievably disappointing"
"Full of zany characters and richly applied satire, and some great plot twists"
"this is the greatest screwball comedy ever filmed"
"It was pathetic. The worst part about it was the boxing scenes."
From Jurafsky & Martin
Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, …
From Jurafsky & Martin
Classification can be rule-based, but is mostly machine learned. Text classification is a sub-class. Text classification examples:
Spam detection
Genre classification
Language identification
Sentiment analysis: positive vs. negative
Other types of classification in NLP: word sense disambiguation, sentence splitting, tagging, named-entity recognition.
Supervised: the classes are given, together with examples of correct classification.
Unsupervised: the classes themselves must be constructed.
Given:
a well-defined set of observations, O
a given set of classes, C
Goal: a classifier, γ, a mapping from O to C.
For supervised training one also needs examples with known class labels.
To represent the objects in O, be explicit about:
which features are used
for each feature:
its type: categorical or numeric (discrete/continuous)
its value space
A given set of classes, C = {c_1, c_2, …, c_k}
A well-defined class of observations, O
Some features f_1, f_2, …, f_n
For each feature: a set of possible values V_1, V_2, …, V_n
The set of feature vectors: V = V_1 × V_2 × … × V_n
Each observation in O is represented by some member of V, written (f_1=v_1, f_2=v_2, …, f_n=v_n), or simply (v_1, v_2, …, v_n) once we have fixed the order.
A classifier, γ, can then be considered a mapping from V to C.
k-Nearest Neighbors, Rocchio, Naive Bayes, Logistic regression (Maximum entropy), Support Vector Machines, Decision Trees, Perceptron, Multi-layered neural nets ("Deep learning")
Survey: we asked all the students of 2020 whether they enjoyed the course; 200 answered:
130 yes
70 no
Baseline classifier: choose the majority class.
Accuracy 0.65 = 65% (with two classes, this baseline always scores ≥ 0.5); see the sketch below.
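A minimal sketch of such a majority-class baseline in Python (the labels list is hypothetical, mirroring the survey counts):

from collections import Counter

labels = ["yes"] * 130 + ["no"] * 70   # hypothetical survey answers

majority_class, count = Counter(labels).most_common(1)[0]
print(majority_class, count / len(labels))   # yes 0.65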
Did you enjoy the course? (yes/no)
Do you like mathematics? (yes/no)
Do you have programming experience? (none/some/good, where good = 3 or more courses)
Have you taken advanced machine learning courses? (yes/no)
And many more questions, but we have to simplify here.
Student no  Maths  Programming  AdvML  Enjoy
1           Y      Good         N      Y
2           Y      Some         N      Y
3           N      Good         Y      N
4           N      None         N      N
5           N      Good         N      Y
6           N      Good         Y      Y
…
We ask our incoming new student the same questions.
From the table we can see, e.g., that she has good programming experience, no AdvML course, and does not like maths.
There is a 40/44 chance she will enjoy the course.
What we do is that we consider
P(enjoy = yes | prog = good, AdvML = no, Maths = no) and
P(enjoy = no | prog = good, AdvML = no, Maths = no)
and decide on the class which has the largest probability, in symbols
argmax_{y ∈ {yes, no}} P(enjoy = y | prog = good, AdvML = no, Maths = no)
But there may be many more features:
the number of possible combinations grows exponentially, and we might not have seen all combinations, or they may be rare.
Therefore we apply Bayes' theorem, and we make a simplifying assumption.
Given an observation (f_1=v_1, f_2=v_2, …, f_n=v_n),
consider P(s_m | f_1=v_1, …, f_n=v_n) for each class s_m in S,
and choose the class with the largest value, i.e. the most probable class given the observation:
ŝ = argmax_{s_m ∈ S} P(s_m | f_1=v_1, f_2=v_2, …, f_n=v_n)
Bayes' formula:
P(s_m | f_1=v_1, …, f_n=v_n) = P(f_1=v_1, …, f_n=v_n | s_m) P(s_m) / P(f_1=v_1, …, f_n=v_n)
Sparse data: we may not even have seen the whole feature vector (f_1=v_1, …, f_n=v_n) in training.
We therefore assume (wrongly) independence:
P(f_1=v_1, …, f_n=v_n | s_m) ≈ ∏_{i=1}^{n} P(f_i=v_i | s_m)
Putting it together, choose:
argmax_{s_m ∈ S} P(s_m | f_1=v_1, …, f_n=v_n) = argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(f_i=v_i | s_m)
A sketch of this decision rule follows below.
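A minimal sketch of the rule in Python, assuming the priors and conditional probabilities are already estimated and stored in dictionaries (all names here are hypothetical):

def nb_classify(observation, priors, cond_probs):
    # observation: list of (feature, value) pairs
    # priors: dict class -> P(class)
    # cond_probs: dict (feature, value, class) -> P(feature=value | class)
    best_class, best_score = None, 0.0
    for c, prior in priors.items():
        score = prior
        for f, v in observation:
            score *= cond_probs[(f, v, c)]
        if best_class is None or score > best_score:
            best_class, best_score = c, score
    return best_class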
Maximum likelihood estimate of the priors:
P̂(s_m) = C(s_m) / N
where C(s_m) is the number of training observations in class s_m and N is the total number of training observations.
Observe what we are doing: we are looking for the true probability P(s_m). Maximum likelihood means choosing the model which makes the observed training set most likely.
Maximum likelihood estimate of the conditional probabilities:
P̂(f_i=v_i | s_m) = C(f_i=v_i, s_m) / C(s_m)
where C(f_i=v_i, s_m) is the number of observations that belong to class s_m and where the feature f_i takes the value v_i, and C(s_m) is the number of observations belonging to class s_m.
argmax_{c_m ∈ C} P(c_m) ∏_{i=1}^{n} P(f_i=v_i | c_m)

P(yes) × P(prog=good | yes) × P(AdvML=no | yes) × P(Maths=no | yes)
= 130/200 × 100/130 × 115/130 × 59/130 ≈ 0.2
P(no) × P(prog=good | no) × P(AdvML=no | no) × P(Maths=no | no)
= 70/200 × 22/70 × 53/70 × 39/70 ≈ 0.046
So we predict that the student will most likely enjoy the course; the arithmetic is checked below.
Accuracy on training data: 75%. Compare to the baseline: 65%. The best classifier reached 80%.
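Checking the numbers above in Python (counts taken from the table):

p_yes = 130/200 * 100/130 * 115/130 * 59/130
p_no = 70/200 * 22/70 * 53/70 * 39/70
print(round(p_yes, 3), round(p_no, 3))   # 0.201 0.046 -> predict "yes"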
MLE estimate: c_i / N
Laplace estimate: (c_i + 1) / (N + V)
Lidstone smoothing, add k, e.g. k = 0.5: (c_i + k) / (N + kV)
where c_i is the count of value i, N is the total count, and V is the number of possible values.
nltk.NaiveBayesClassifier uses Lidstone (0.5) as default.
Laplace-smoothed estimates from the survey example (see the sketch below):
P̂(prog=good | yes) = (100+1)/(130+3)
P̂(prog=some | yes) = (25+1)/(130+3)
P̂(prog=none | yes) = (5+1)/(130+3)
P̂(AdvML=yes | yes) = (15+1)/(130+2)
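A small sketch of these estimators (the function name is mine):

def lidstone(count, total, num_values, k=0.5):
    # k = 0 gives MLE, k = 1 gives Laplace
    return (count + k) / (total + k * num_values)

print(lidstone(100, 130, 3, k=1))   # Laplace: 101/133
print(lidstone(15, 130, 2, k=1))    # Laplace: 16/132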
For calculations with products of many small probabilities, avoid underflow: use logarithms; a log-space sketch follows after the formulas.
argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(f_i=v_i | s_m)
= argmax_{s_m ∈ S} log( P(s_m) ∏_{i=1}^{n} P(f_i=v_i | s_m) )
= argmax_{s_m ∈ S} ( log P(s_m) + ∑_{i=1}^{n} log P(f_i=v_i | s_m) )
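The same decision rule in log space, using the same hypothetical dictionaries as in the sketch above:

import math

def nb_classify_log(observation, priors, cond_probs):
    # Sums of logs replace products of probabilities, avoiding underflow
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond_probs[(f, v, c)]) for f, v in observation)
    return max(priors, key=log_score)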
Naive Bayes is a probabilistic classifier.
It is a multi-class classifier, i.e. it can handle more than two classes.
It handles categorical features natively and can be adapted to numeric features.
NLTK contains an implementation.
The independence assumption is wrong:
P(v_1, v_2, …, v_n | c) is far from P(v_1 | c) × P(v_2 | c) ∙∙∙ × P(v_n | c)
Still, NB works reasonably well, and it is not prone to overfitting. Other classifiers may work better on particular tasks.
Naive Bayes may be applied to various NLP tasks. Text classification:
Goal: classify the text on the basis of the words in the text.
What are the features? What are the possible values?
Two possible answers: the multinomial model and the Bernoulli model.
In the multinomial model:
f_i refers to position i in the text
v_i is the word occurring in this position
n is the number of tokens in the text
Simplifying assumption: a word is equally likely in all positions.
Hence we count how many times each word occurs in the texts of each class.
argmax_{s_m ∈ S} P(s_m | f_1=v_1, …, f_n=v_n) = argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(f_i=v_i | s_m)
= argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(w_i | s_m)
where w_i is the word in position i.
P̂(s_m) = C(s_m) / N
where C(s_m) is the number of texts in class s_m and N is the total number of texts.
P̂(w_i | s_m) = C(w_i, s_m) / ∑_j C(w_j, s_m)
where C(w_i, s_m) is the number of occurrences of word w_i in all texts in class s_m.
>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
Considered 1900 documents for training.
'pitt' occurs in 15 'pos' and 6 'neg' reviews.
'pitt' occurs 31 times in the 'pos' reviews and 25 times in the 'neg'.
There are 798,742 words in the 'pos' reviews and 705,726 in the 'neg'.
P̂(pitt | pos) = 31/798,742
P̂(pitt | neg) = 25/705,726
In [63]: pos_docs['pitt']
Out[63]: 15
In [64]: neg_docs['pitt']
Out[64]: 6
In [65]: neg_docs['spacey']
Out[65]: 4
In [66]: pos_docs['spacey']
Out[66]: 17
In [71]: pos_docs['terrible']
Out[71]: 26
In [72]: neg_docs['terrible']
Out[72]: 85
In [73]: neg_docs['terrific']
Out[73]: 19
In [74]: pos_docs['terrific']
Out[74]: 75
P(pos) = 959/1900
P(w = pitt | pos) = 31/798,742
P(w = terrible | pos) = 26/798,742
P(pos | 3 × pitt, 2 × terrible, 0 × terrific)
= k′ × 959/1900 × (31/798,742)³ × (26/798,742)² = k′ × 3.12 × 10⁻²³

P(neg) = 941/1900
P(w = pitt | neg) = 25/705,726
P(w = terrible | neg) = 104/705,726
P(neg | 3 × pitt, 2 × terrible, 0 × terrific)
= k′ × 941/1900 × (25/705,726)³ × (104/705,726)² = k′ × 4.78 × 10⁻²²
The 'neg' class gets the larger score (4.78 × 10⁻²² > 3.12 × 10⁻²³), so the multinomial model predicts 'neg'; see the check below.
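Checking these numbers in Python (counts from the slides):

p_pos = 959/1900 * (31/798_742)**3 * (26/798_742)**2
p_neg = 941/1900 * (25/705_726)**3 * (104/705_726)**2
print(f"{p_pos:.2e} {p_neg:.2e}")   # 3.13e-23 4.78e-22 -> predict 'neg'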
How are words turned into features?
A vocabulary of words, W.
Each word w_j makes a feature f_j.
The possible values for f_j are True and False (1 and 0).
f_j = 1 in a document if and only if the document contains w_j.
In the Bernoulli model:
f_i refers to a word in the vocabulary
v_i is 1 or 0, depending on whether the word occurs in the text or not
n is the number of words in the vocabulary
See the feature-extraction sketch below.
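A sketch of this feature extraction, close to the NLTK book's document_features function (here word_features, the chosen vocabulary, is passed in as a parameter):

def document_features(document, word_features):
    # Map a document (a list of words) to Bernoulli features
    document_words = set(document)
    return {f'contains({word})': (word in document_words)
            for word in word_features}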
argmax_{s_m ∈ S} P(s_m | f_1=v_1, …, f_n=v_n) = argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(f_i=v_i | s_m)
where the product now runs over the whole vocabulary.
'pitt' occurs in 15 of the 959 'pos' and 6 of the 941 'neg' reviews:
P̂(pitt=1 | pos) = 15/959    P̂(pitt=1 | neg) = 6/941
P̂(pitt=0 | pos) = 944/959   P̂(pitt=0 | neg) = 935/941
P(pos) = 959/1900
P(pitt = True | pos) = 15/959
P(terrible = True | pos) = 26/959
P(terrific = False | pos) = (959 − 75)/959 = 884/959
P(pos | pitt = True, terrible = True, terrific = False)
= k × 959/1900 × 15/959 × 26/959 × 884/959 = k × 0.00020

P(neg) = 941/1900
P(pitt = True | neg) = 6/941
P(terrible = True | neg) = 85/941
P(terrific = False | neg) = (941 − 19)/941 = 922/941
P(neg | pitt = True, terrible = True, terrific = False)
= k × 941/1900 × 6/941 × 85/941 × 922/941 = k × 0.00028
(k = 1 / P(pitt = True, terrible = True, terrific = False), the same for both classes.)
Again the 'neg' class gets the larger score (0.00028 > 0.00020), so the Bernoulli model also predicts 'neg'; see the check below.
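And the corresponding check in Python:

p_pos = 959/1900 * 15/959 * 26/959 * 884/959
p_neg = 941/1900 * 6/941 * 85/941 * 922/941
print(f"{p_pos:.5f} {p_neg:.5f}")   # 0.00020 0.00028 -> predict 'neg'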
Multinomial model:
Jurafsky and Martin, 3rd ed., ch. 4 (sentiment analysis); related to n-gram models.
Bernoulli model:
NLTK book, sec. 6.1, 6.2, 6.5, including the movie review example.
Jurafsky and Martin, 2nd ed., sec. 20.2 (WSD).
Both:
Manning, Raghavan, Schütze, Introduction to Information Retrieval, sec. 13.0-13.3.
Multinomial:
counts how many times a term occurs in the text
considers only the present terms; ignores absent terms
tends to be the better of the two
Bernoulli:
registers whether a term is present or not
considers both the present terms and the absent terms
comparable on shorter snippets
Before you start: split the data into training, development and test sets; a sketch follows below.
Hide the test set.
Split the development set into dev-train and dev-test.
Use the training set for training a classifier.
Use the dev(-test) set for repeated evaluations during development.
Finally, test on the test set!
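A minimal sketch of such a split (the proportions are my choice; documents is the list built above):

import random

random.shuffle(documents)
test_set = documents[:200]          # hide this until the very end
dev_test_set = documents[200:400]   # for repeated evaluation
train_set = documents[400:]         # for training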
When you have run out of ideas, test on the test set. Then stop!
Small test sets give large variation in results. N-fold cross-validation:
Split the development set into n equally sized parts (e.g. n = 10).
Conduct n experiments: in experiment m, use part m as the test set and the n−1 other parts as the training set.
This yields n results:
we can consider the mean of the results, and the variation between them.
Statistics! A sketch follows below.
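A sketch of n-fold cross-validation with NLTK's Naive Bayes, assuming labeled_data is a list of (features, label) pairs (the helper name is mine):

import statistics
import nltk

def cross_validate(labeled_data, n=10):
    # Returns the accuracy of each of the n folds
    fold_size = len(labeled_data) // n
    accuracies = []
    for m in range(n):
        test = labeled_data[m * fold_size:(m + 1) * fold_size]
        train = labeled_data[:m * fold_size] + labeled_data[(m + 1) * fold_size:]
        classifier = nltk.NaiveBayesClassifier.train(train)
        accuracies.append(nltk.classify.accuracy(classifier, test))
    return accuracies

# accs = cross_validate(labeled_data)
# print(statistics.mean(accs), statistics.stdev(accs))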
What does accuracy 0.81 tell us? Given a test set of 500 sentences, the classifier will classify about 405 correctly and 95 incorrectly.
Accuracy is a good measure provided that:
the 2 classes are equally important
the 2 classes are roughly equally sized
Examples: woman/man; movie reviews: pos/neg.
For some tasks, the classes aren't equally important:
it is worse to lose an important mail than to receive yet another spam mail.
For some tasks, the different classes have different sizes.
Traditional IR, e.g. a library:
Goal: find all the documents on a particular topic out of 100,000 documents.
Say there are 5 relevant documents, and the system delivers 10 documents, all irrelevant.
What is the accuracy? (100,000 − 15)/100,000 = 99.985%, even though the system found nothing of value.
For these tasks, focus on:
the relevant documents
the documents returned by the system
and forget the irrelevant documents which are not returned.
Beware what the rows and the columns represent: NLTK's ConfusionMatrix puts the reference (gold) labels in the rows and the system's predictions in the columns; other presentations may transpose this.
Accuracy: (tp + tn)/N
Precision: P = tp/(tp + fp)
Recall: R = tp/(tp + fn)
The F-score combines P and R:
F₁ = 2PR/(P + R) = 1 / ((1/2)(1/P + 1/R))
F₁ is called the "harmonic mean" of P and R.
General form: F = 1 / (α(1/P) + (1 − α)(1/R)) for some 0 < α < 1.
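These measures in code, from the counts for one class:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))   # hypothetical counts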
Precision, recall and F-score are computed per class.
Summary:
Motivation
Classification
Naive Bayes classification
NB for text classification: the multinomial model; the Bernoulli model
Experiments: training, test and cross-validation
Evaluation