CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu 3324 Siebel Center
Lecture 4: Introduction to Classification for NLP
Lecture 04, Part 1: Review and Overview
Review: Lecture 03
Language models define a probability distribution over all strings w = w(1)…w(K) in a language:
∑_{w∈L} P(w) = 1
N-gram language models define the probability of a string w = w(1)…w(K) as the product of the probabilities of each word w(i), conditioned on the n−1 preceding words:
P_n-gram(w(1)…w(K)) = ∏_{i=1..K} P(w(i) | w(i−1), …, w(i−n+1))
Unigram: P_unigram(w(1)…w(K)) = ∏_{i=1..K} P(w(i))
Bigram: P_bigram(w(1)…w(K)) = ∏_{i=1..K} P(w(i) | w(i−1))
Trigram: P_trigram(w(1)…w(K)) = ∏_{i=1..K} P(w(i) | w(i−1), w(i−2))
Review: Lecture 03
How do we… …estimate the parameters of a language model?
Relative frequency estimation (aka Maximum Likelihood estimation)
… compute the probability of the first n–1 words?
By padding the start of the sentence with n–1 BOS tokens
… obtain one distribution over strings of any length?
By adding an EOS token to the end of each sentence.
… handle unknown words?
By replacing rare words in the training data and unknown words with an UNK token
… evaluate language models?
Intrinsically with perplexity of test data, extrinsically e.g. with word error rate
Overview: Lecture 04
Part 1: Review and Overview
Part 2: What is classification?
Part 3: The Naive Bayes classifier
Part 4: Running & evaluating classification experiments
Part 5: Features for sentiment analysis
Reading: Chapter 4, 3rd edition of Jurafsky and Martin
Lecture 04’s questions
What is classification?
What is binary/multiclass/multilabel classification?
What is supervised learning?
And why do we want to learn classifiers (instead of writing down some rules, say)?
Feature engineering: from data to vectors
How is the Naive Bayes Classifier defined?
How do you evaluate a classifier?
Lecture 04, Part 2: What is Classification?
Spam Detection
Spam detection is a binary classification task: Assign one of two labels (e.g. {SPAM, NOSPAM}) to the input (here, an email message)
Spam Detection
A classifier is a function that maps inputs to a predefined (finite) set of class labels:
Spam Detector: Email ⟼ {SPAM, NOSPAM} Classifier: Input ⟼ {LABEL1, …, LABELK}
The importance of generalization
We need to be able to classify items we have never seen before (e.g. new incoming emails).
[Screenshot: an incoming email flagged with the banner “Mail thinks this message is junk mail.”]
The importance of adaptation
The classifier needs to adapt/change based on user feedback (e.g. when a user marks a flagged message as “Not junk”).
[Screenshot: the same junk-mail banner, now with a “Not junk” button for user feedback]
Text classification more generally
This is a multiclass classification task: Assign one of K labels to the input {SPAM, CONFERENCES, VACATIONS,…}
Classification more generally
Item (Data Point) → Classifier → Class Label(s)
But: The data we want to classify could be anything:
Emails, words, sentences, images, image regions, sounds, database entries, sets of measurements, ….
We assume that any data point can be represented as a vector
Classification more generally
Before we can use a classifier on our data, we have to map the data to “feature” vectors:
Raw Data → Feature function → Feature vector → Classifier → Class Label(s)
Feature engineering as a prerequisite for classification
To talk about classification mathematically, we assume each input item is represented as a ‘feature’ vector x = (x1, …, xN)
— Each element in x is one feature.
— The number of elements/features N is fixed, and may be very large.
— x has to capture all the information about the item that the classifier needs.
But the raw data points (e.g. documents to classify) are typically not in vector form. Before we can train a classifier, we therefore have to first define a suitable feature function that maps raw data points to vectors. In practice, feature engineering (designing suitable feature functions) is very important for accurate classification.
From texts to vectors
In NLP, input items are documents, sentences, words, …. ⇒ How do we represent these items as vectors?
Bag-of-Words representation: (this ignores word order)
Assume that each element xi in (x1, …, xN) corresponds to one word vi in the vocabulary.
There are many different ways to represent a piece of text as a vector over the vocabulary, e.g.:
— If xi ∈ {0,1}: Does word vi occur (yes: xi = 1, no: xi = 0) in the input document?
— If xi ∈ {0, 1, 2, …}: How often does word vi occur in the input document?
[We will see many other ways to map text to vectors this semester]
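To make this concrete, here is a minimal sketch of a bag-of-words feature function in Python; the toy vocabulary and document are illustrative examples, not part of the lecture's own code:

from collections import Counter

def bag_of_words(tokens, vocabulary, binary=False):
    """Map a tokenized document (a list of words) to a vector over `vocabulary`.
    binary=True gives the {0,1} representation; otherwise xi is a word count."""
    counts = Counter(tokens)
    if binary:
        return [1 if v in counts else 0 for v in vocabulary]
    return [counts[v] for v in vocabulary]

vocab = ["apple", "banana", "coffee", "drink", "eat", "fish"]
doc = "fish fish eat eat fish".split()
print(bag_of_words(doc, vocab))               # [0, 0, 0, 0, 2, 3]
print(bag_of_words(doc, vocab, binary=True))  # [0, 0, 0, 0, 1, 1]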
Now, back to classification…:
A classifier is a function f(x) that maps input items x ∈ X to class labels y ∈ Y (X is a vector space, Y is a finite set).
Binary classification: Each input item is mapped to exactly one of 2 classes.
Multi-class classification: Each input item is mapped to exactly one of K classes (K > 2).
Multi-label classification: Each input item is mapped to N of K classes (N ≥ 1, varies per input item).
Classification as supervised machine learning
Classification tasks: Map inputs to a fixed set of class labels
Underlying assumption: Each input really has one (or N) correct label(s).
Corollary: The correct mapping is a function (aka the ‘target function’).
How do we obtain a classifier (model) for a given task?
— If the target function is very simple (and known), implement it directly.
— Otherwise, if we have enough correctly labeled data, estimate (aka learn/train) a classifier based on that labeled data.
Supervised machine learning: Given (correctly) labeled training data, obtain a classifier that predicts these labels as accurately as possible.
Learning is supervised because the learning algorithm can get feedback about how accurate its predictions are from the labels in the training data.
Supervised learning: Training
Give the learning algorithm the labeled training examples in D_train = {(x1, y1), (x2, y2), …, (xN, yN)}; the learning algorithm returns a learned model g(x):
Labeled Training Data D_train → Learning Algorithm → Learned model g(x)
Supervised learning: Testing
Reserve some labeled data for testing: Labeled Test Data D_test = {(x′1, y′1), (x′2, y′2), …, (x′M, y′M)}
Split the labeled test data D_test into the raw test data X_test = (x′1, x′2, …, x′M) and the test labels Y_test = (y′1, y′2, …, y′M).
Apply the learned model g(x) to the raw test data X_test to obtain the predicted labels g(X_test) = (g(x′1), g(x′2), …, g(x′M)).
Evaluate the learned model by comparing the predicted labels against the (correct) test labels
Supervised machine learning
The supervised learning task (for classification):
Given (correctly) labeled data D = {(xi, yi)}, where each item xi is a vector (x1, …, xN) with label yi (which we assume is given by the target function, f(xi) = yi), return a classifier g(xi) that predicts these labels as accurately as possible (i.e. such that g(xi) = yi = f(xi)).
To make this more concrete, we need to specify:
— what class of functions g(xi) to consider (many classifiers assume g(xi) is a linear function)
— what learning algorithm we will use to learn g(xi) (many learning algorithms assume a particular class of functions)
Classifiers in vector spaces
Binary classification: Learn a function f that best separates the positive and negative examples:
— Assign y = 1 to all x where f(x) > 0
— Assign y = 0 to all x where f(x) < 0
Linear classifier: f(x) = w·x + b is a linear function of x
[Figure: points in the (x1, x2) plane separated by the decision boundary f(x) = 0, with f(x) > 0 on one side and f(x) < 0 on the other]
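As a minimal sketch, the decision rule of a linear classifier with given weights w and bias b looks as follows; how w and b are learned is not shown here, and the weights in the example are made up for illustration:

def linear_classify(x, w, b):
    """Predict 1 if f(x) = w·x + b > 0, else 0."""
    f_x = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if f_x > 0 else 0

# Hypothetical weights for a 2-dimensional feature space:
print(linear_classify([2.0, 1.0], w=[0.5, -1.0], b=0.3))  # f(x) = 0.3 > 0, so predict 1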
Lecture 04, Part 3: The Naive Bayes Classifier
Probabilistic classifiers
We want to find the most likely class y for the input x:
y* = argmax_y P(Y = y | X = x)
P(Y = y | X = x): the probability that the class label is y when the input feature vector is x.
y* = argmax_y f(y): y* is the y that maximizes f(y).
Modeling with Bayes Rule
Bayes Rule relates the posterior P(Y|X) to the likelihood P(X|Y) and the prior P(Y):
P(Y|X) = P(Y, X) / P(X) = P(X|Y)·P(Y) / P(X) ∝ P(X|Y)·P(Y)
Bayes rule: The posterior is proportional to the prior times the likelihood.
Using Bayes Rule for our classifier
y* = argmax_y P(Y | X)
   = argmax_y P(X | Y)·P(Y) / P(X)   [Bayes Rule]
   = argmax_y P(X | Y)·P(Y)          [P(X) doesn’t change argmax_y]
Modeling P(Y = y)
P(Y = y) is the “prior” class probability. We can estimate this as the fraction of documents in the training data that have class y:
P̂(Y = y) = (#documents ⟨xi, yi⟩ ∈ D_train with yi = y) / (#documents ⟨xi, yi⟩ ∈ D_train)
Modeling P(X = x|Y = y)
P(X = x | Y = y) is the “likelihood” of the input x.
x = ⟨x1, …, xn⟩ is a vector; each xi represents a word (type) in our vocabulary.
Let’s make a (naive) independence assumption:
P(X = ⟨x1, …, xn⟩ | Y = y) := ∏_{i=1..n} P(Xi = xi | Y = y)
With this independence assumption, we now need to define (and multiply together) all P(Xi = xi | Y = y).
The Naive Bayes Classifier
Assign class y* to input x = (x1, …, xn) where
y* = argmax_y P(Y = y) ∏_{i=1..n} P(Xi = xi | Y = y)
P(Y = y) is the prior class probability (estimated as the fraction of items in the training data with class y).
P(Xi = xi | Y = y) is the (class-conditional) likelihood of the feature xi conditioned on the class y. There are different ways to model this probability.
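A minimal sketch of this decision rule in Python, assuming the prior and the per-feature likelihoods have already been estimated and are stored in dictionaries (this data layout is an illustrative choice, not the only one); log probabilities are summed rather than multiplying probabilities, to avoid numerical underflow:

import math

def naive_bayes_classify(x, prior, likelihood):
    """Return y* = argmax_y P(Y=y) * prod_i P(Xi = xi | Y = y).
    prior[y] is P(Y = y); likelihood[y][i][v] is P(Xi = v | Y = y)
    (assumed to be smoothed, so no probability is exactly zero)."""
    best_y, best_score = None, float("-inf")
    for y in prior:
        score = math.log(prior[y])
        for i, x_i in enumerate(x):
            score += math.log(likelihood[y][i][x_i])
        if score > best_score:
            best_y, best_score = y, score
    return best_y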
P(Xi = xi | Y = y) as Bernoulli
Capture whether a word occurs in a document or not: P(Xi = xi | Y = y) is a Bernoulli distribution (xi ∈ {0, 1}).
P(Xi = 1 | Y = y): probability that word vi occurs in a document of class y.
P(Xi = 0 | Y = y): probability that word vi does not occur in a document of class y.
Estimation: Compute the fraction of documents of class y with/without vi:
P̂(Xi = 1 | Y = y) = (#docs ⟨xi, yi⟩ ∈ D_train with yi = y in which vi occurs) / (#docs ⟨xi, yi⟩ ∈ D_train with yi = y)
P̂(Xi = 0 | Y = y) = (#docs ⟨xi, yi⟩ ∈ D_train with yi = y in which vi does not occur) / (#docs ⟨xi, yi⟩ ∈ D_train with yi = y)
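A sketch of this estimator, assuming the training documents are given as binary bag-of-words vectors paired with labels (in practice these counts would also be smoothed to avoid zero probabilities):

def estimate_bernoulli(train_data, y, i):
    """Estimate P(Xi = 1 | Y = y): the fraction of class-y documents in which
    word vi occurs. train_data is a list of (x, label) pairs, where x is a
    binary bag-of-words vector."""
    docs_y = [x for x, label in train_data if label == y]
    return sum(x[i] for x in docs_y) / len(docs_y)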
P(X | Y = y) as a Multinomial
What if we want to capture how often a word appears in a document? Let’s represent each document as a vector of word frequencies, xi = C(vi):
Vocabulary: V = {apple, banana, coffee, drink, eat, fish}
A document: “fish fish eat eat fish”
Vector representation of this document: x = ⟨0, 0, 0, 0, 2, 3⟩
P(Xi = xi | Y = y): probability that word vi occurs with frequency xi = C(vi) in a document of class y. We can model this by treating P(X | Y) as a Multinomial distribution.
Multinomial Distribution: Rolling Dice
Before we look at language, let’s assume we’re rolling dice, where the probability of getting any one side (e.g. a 4) when rolling the die once is equal to that of any other side (e.g. a 6). A multinomial computes the probability of, say, getting two 5s and three 6s if you roll a die five times:
#Occurrences of 5 and 6: 2 and 3
#of sequences of two 5s and three 6s: 5!/(0!·0!·0!·0!·2!·3!)
P(⟨0,0,0,0,2,3⟩) = 5!/(0!·0!·0!·0!·2!·3!) · (1/6)² · (1/6)³
NB: Note that we can ignore the probabilities of any sides (i.e. 1, 2, 3, 4) that didn’t come up in our trial (unlike in the Bernoulli model)
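The dice example can be checked with a small sketch of the multinomial probability mass function (the function name and layout are illustrative):

from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(counts) = n! / (c1! · … · ck!) · prod_i p_i^c_i, with n = sum(counts)."""
    n = sum(counts)
    coefficient = factorial(n) // prod(factorial(c) for c in counts)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# Two 5s and three 6s in five rolls of a fair die:
print(multinomial_pmf([0, 0, 0, 0, 2, 3], [1/6] * 6))  # 10 * (1/6)^5 ≈ 0.00129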
P(Xi = xi | Y = y) as Multinomial
We want to know P(X = ⟨0,0,0,0,2,3⟩ | Y = y), where ⟨0,0,0,0,2,3⟩ = ⟨C(apple), …, C(eat), C(fish)⟩.
Unlike the sides of a die, words don’t have uniform probability (cf. Zipf’s Law). So we need to estimate the class-conditional unigram probability P(apple | Y = y), … in documents of class y, and multiply that probability xi times (xi = frequency of vi in our document):
P(⟨0,0,0,0,2,3⟩ | Y = y) = P(eat | Y = y)² · P(fish | Y = y)³
Or more generally:
P(X = x | Y = y) = ∏_i P(vi | Y = y)^xi
Unigram probabilities P(vi | Y = y)
We can estimate the unigram probability P(vi | Y = y) by relative frequency, or with add-one smoothing (with N words in vocab V):
P̂(vi | Y = y) = (#vi in all docs ∈ D_train of class y) / (#words in all docs ∈ D_train of class y)
P̂(vi | Y = y) = ((#vi in all docs ∈ D_train of class y) + 1) / ((#words in all docs ∈ D_train of class y) + N)
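A sketch of the smoothed estimator, assuming the class-y training documents are given as lists of tokens:

from collections import Counter

def estimate_unigrams(class_y_docs, vocabulary):
    """Estimate P(v | Y = y) for each word v with add-one smoothing:
    (count of v in class-y docs + 1) / (total tokens in class-y docs + N)."""
    counts = Counter(token for doc in class_y_docs for token in doc)
    total = sum(counts.values())
    N = len(vocabulary)
    return {v: (counts[v] + 1) / (total + N) for v in vocabulary}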
Lecture 04, Part 4: Running and Evaluating Classification Experiments
Evaluation setup:
Split data into separate training, (development) and test sets.
Better setup: n-fold cross validation:
Split the data into n sets of equal size. Run n experiments, using set i for testing and the remainder for training. This gives average, maximal and minimal accuracies.
When comparing two classifiers:
Use the same test and training data with the same classes
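A minimal sketch of n-fold cross-validation; `train` and `accuracy` are placeholders for an actual learning algorithm and evaluation metric, not specific library functions:

def cross_validate(data, n, train, accuracy):
    """n-fold cross-validation: split `data` into n folds, train on n-1 folds,
    test on the held-out fold. Returns the n per-fold accuracies, from which
    the average, maximum and minimum can be reported."""
    folds = [data[i::n] for i in range(n)]   # n (roughly) equal-sized folds
    scores = []
    for i in range(n):
        held_out = folds[i]
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return scores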
[Figure: the labeled data is split into TRAINING, DEV, and TEST portions]
Evaluation Metrics
Accuracy: What fraction of items in the test data were classified correctly? It’s easy to get high accuracy if one class is very common (just label everything as that class), but that would be a pretty useless classifier.
Precision and recall
Precision and recall were originally developed as evaluation metrics for information retrieval:
Precision: What fraction of the retrieved documents are relevant to the query?
Recall: What fraction of the documents that are relevant to the query were retrieved?
In NLP, they are often used in addition to accuracy:
Precision: What fraction of the items that the system assigned label X actually have label X in the test data?
Recall: What fraction of the items with label X in the test data were assigned label X by the system?
Precision and Recall are particularly useful when there are more than two labels.
True vs. false positives, false negatives
True Positives (TP): items that were labeled X by the system and should be labeled X.
False Positives (FP): items that were labeled X by the system but should not be labeled X.
False Negatives (FN): items that were not labeled X by the system but should be labeled X.
Items labeled X in the gold standard (‘truth’) = TP + FN
Items labeled X by the system = TP + FP
Precision, Recall, F-Measure
Precision: P = TP ∕ (TP + FP)
Recall: R = TP ∕ (TP + FN)
F-measure: harmonic mean of precision and recall, F = (2·P·R) ∕ (P + R)
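These formulas translate directly into code; the example counts are taken from the ‘urgent’ class in the confusion-matrix example below:

def precision_recall_f(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# 'urgent' class from the confusion-matrix example below: TP=8, FP=11, FN=8
print(precision_recall_f(8, 11, 8))  # ≈ (0.42, 0.50, 0.46)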
Confusion Matrices
A confusion matrix tabulates how many items that are labeled with class y in the gold data are labeled with class y’ by the classifier.
                  gold: urgent   gold: normal   gold: spam
system: urgent          8             10             1
system: normal          5             60            50
system: spam            3             30           200
This can be useful for understanding what kinds of mistakes a (multi-class) classifier makes
Only 8/16 ‘urgent’ messages are classified correctly. But 200/251 ’spam’ messages are classified correctly. And only 8/19 messages labeled ‘urgent’ are actually urgent
Reading off Precision and Recall
Precision for a class is read off the corresponding row of the confusion matrix (everything the system labeled with that class); recall is read off the corresponding column (everything the gold standard labels with that class):

                  gold: urgent   gold: normal   gold: spam
system: urgent          8             10             1          precision_u = 8 / (8+10+1)
system: normal          5             60            50          precision_n = 60 / (5+60+50)
system: spam            3             30           200          precision_s = 200 / (3+30+200)

recall_u = 8 / (8+5+3)     recall_n = 60 / (10+60+30)     recall_s = 200 / (1+50+200)
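A sketch of reading per-class precision and recall off such a matrix, with rows as system labels and columns as gold labels as in the table above:

def per_class_precision_recall(matrix, labels):
    """matrix[i][j] = number of items with system label i and gold label j."""
    scores = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        system_k = sum(matrix[k])               # row sum: items the system labeled k
        gold_k = sum(row[k] for row in matrix)  # column sum: items truly labeled k
        scores[label] = (tp / system_k, tp / gold_k)   # (precision, recall)
    return scores

confusion = [[8, 10, 1],    # system: urgent
             [5, 60, 50],   # system: normal
             [3, 30, 200]]  # system: spam
print(per_class_precision_recall(confusion, ["urgent", "normal", "spam"]))
# urgent: (0.42, 0.50), normal: (0.52, 0.60), spam: (0.86, 0.80)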
Reading off Precision and Recall
Class 1: Urgent
                  true: urgent   true: not
system: urgent          8            11
system: not             8           340
precision = 8 / (8+11) = .42

Class 2: Normal
                  true: normal   true: not
system: normal         60            55
system: not            40           212
precision = 60 / (60+55) = .52

Class 3: Spam
                  true: spam   true: not
system: spam          200           33
system: not            51           83
precision = 200 / (200+33) = .86
Macro-average vs Micro-average
How do we aggregate precision and recall across classes?
Macro-average: average the precision over all K classes (regardless of how common each class is):
macro-averaged precision = (.42 + .52 + .86) / 3 = .60
Macro-average vs Micro-average
How do we aggregate precision and recall across classes?
Micro-average: average the precision over all N items (regardless of what class they have), by pooling the counts of all classes into a single table:

Pooled
                  true: yes   true: no
system: yes          268         99
system: no            99        635

micro-averaged precision = 268 / (268+99) = .73
Macro-average vs. Micro-average
Which average should you report?
Macro-average (average P/R of all classes): useful if performance on all classes is equally important.
Micro-average (average P/R of all items): useful if performance on all items is equally important.
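A small sketch computing both averages from per-class (TP, FP) counts, using the numbers from the tables above:

def macro_micro_precision(per_class):
    """per_class: list of (TP, FP) pairs, one pair per class."""
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    micro = sum(tp for tp, _ in per_class) / sum(tp + fp for tp, fp in per_class)
    return macro, micro

# (TP, FP) for urgent, normal and spam, from the tables above:
print(macro_micro_precision([(8, 11), (60, 55), (200, 33)]))  # ≈ (0.60, 0.73)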