CS546: Machine Learning in NLP (Spring 2020)
http://courses.engr.illinois.edu/cs546/
Julia Hockenmaier
juliahmr@illinois.edu 3324 Siebel Center Office hours: Monday, 11am—12:30pm
Lecture 2: More Intro
Wrap-up: Syllabus for this class
You will receive an email with a link to a Google form where you can sign up for slots to present.
— Please sign up for at least three slots so that I have some flexibility in assigning you to a presentation
We will give you one week to fill this in. You will have to meet with me the Monday before your presentation to go over your slides.
— Clarity of exposition and presentation
— Analysis (don’t just regurgitate what’s in the paper)
— Quality of slides (and effort that went into making them — just re-using other people’s slides is not enough)
How do you represent (or predict) words?
— Do you treat words in the input as atomic categories, as continuous vectors, or as structured objects?
— How do you handle rare/unseen words, typos, spelling variants, morphological information?
— Lexical semantics: do you capture word meanings/senses?
How do you represent (or predict) word sequences?
Sequences = sentences, paragraphs, documents, dialogs,… As a vector, or as a structured object?
How do you represent (or predict) structures?
Structures = labeled sequences, trees, graphs, formal languages (e.g. DB records/queries, logical representations)
How do you represent “meaning”?
Ambiguity: Natural language is highly ambiguous.
Coverage (compounded by Zipf’s Law): We will always encounter words and constructions that did not occur during training.
Our systems need to generalize from what they have seen during training to unseen events that occur during testing (i.e. when we actually use the system).
[Figure: word frequency distributions (log-log). How many words occur N times (number of words vs. word frequency)? English words, sorted by frequency: w1 = the, w2 = to, …, w5346 = computer, …]
In natural language:
— A few words are very frequent
— Most words are very rare
Zipf’s law: the r-th most common word wr has P(wr) ∝ 1/r
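As a rough illustration, a few lines of Python (toy corpus, purely for the example) that count word frequencies and print the rank-frequency table; on a large corpus, log frequency plotted against log rank comes out roughly linear, as Zipf’s law predicts.

```python
# Illustrative sketch (toy corpus): count word frequencies and list rank vs. frequency.
from collections import Counter

def rank_frequency(tokens):
    counts = Counter(tokens)
    # Sort words from most to least frequent
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        yield rank, word, freq

if __name__ == "__main__":
    # Toy "corpus"; in practice, read a large text file instead.
    text = "the cat sat on the mat and the dog sat on the rug"
    for rank, word, freq in rank_frequency(text.split()):
        print(rank, word, freq)
```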
The good:
Any text will contain a number of words that are very common. We have seen these words often enough that we know (almost) everything about them. These words will help us get at the structure (and possibly meaning) of this text.
The bad:
Any text will contain a number of words that are rare. We know something about these words, but haven’t seen them often enough to know everything about them: they may occur with a meaning or a part of speech we haven’t seen before.
The ugly:
Any text will contain a number of words that are unknown to us. We have never seen them before, but we still need to get at the structure (and meaning) of these texts.
Our systems need to be able to generalize from what they have seen to unseen events. There are two (complementary) approaches to generalization:
— Linguistics provides us with insights about the rules and structures in language that we can exploit in the (symbolic) representations we use
E.g.: a finite set of grammar rules is enough to describe an infinite language
— Machine Learning/Statistics allows us to learn models (and/or representations) from real data that often work well empirically on unseen data
E.g. most statistical or neural NLP
Option 1: Words are atomic symbols
— Each (surface) word form is its own symbol
— Or map different forms of a word to the same symbol:
  - map each form to its lemma (esp. in English, the lemma is still a word in the language, but lemmatized text is no longer grammatical)
  - map each form to its stem (no guarantee that the resulting symbol is an actual word)
  - map each form to the same canonical variant (e.g. lowercase everything, normalize spellings, perhaps spell-check)
Drawback: can’t capture syntactic/semantic relations between words
Option 2: Represent the structure of each word
“books” => “book N pl” (or “book V 3rd sg”)
This requires a morphological analyzer.
The output is often a lemma plus morphological information.
This is particularly useful for highly inflected languages (less so for English or Chinese).
Aims:
— the lemma/stem captures core (semantic) information
— reduce the vocabulary of highly inflected languages
Option 3: Each word is a (high-dimensional) vector
Advantage: Neural nets need vectors as input!
How do we represent words as vectors?
— Naive solution: as one-hot vectors
— Distributional similarity solution: as very high-dimensional sparse vectors
— Static word embedding solution (word2vec etc.): by a dictionary that maps words to fixed lower-dimensional dense vectors
— Dynamic embedding solution (ELMo etc.): compute context-dependent dense embeddings
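A small illustrative sketch of the first and third options (toy vocabulary, random stand-in embeddings rather than learned ones): a one-hot vector is a sparse indicator of vocabulary position, while a dense embedding is just a row of a |V| × d matrix.

```python
# Illustrative sketch: one-hot vectors vs. dense embedding lookup.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Represent a word as a one-hot vector of size |V|."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Dense embeddings: a random |V| x d matrix stands in for learned vectors;
# looking up a word is just selecting a row.
rng = np.random.default_rng(0)
emb_dim = 4
E = rng.normal(size=(len(vocab), emb_dim))

def embed(word):
    return E[word_to_id[word]]

print(one_hot("cat"))   # sparse, high-dimensional, no notion of similarity
print(embed("cat"))     # dense, low-dimensional; similar words can get similar rows
```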
Systems that use machine learning may need to have a unique representation of each word.
Option 1: the UNK token
Replace all rare words (in your training data) with an UNK token (for Unknown word). Replace all unknown words that you come across after training (including rare training words) with the same UNK token
Option 2: substring-based representations
Represent (rare and unknown) words as sequences of characters or substrings that are common in the vocabulary of your language
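A minimal sketch of the UNK strategy (the frequency threshold and the toy data are assumptions made for the example):

```python
# Illustrative sketch: replace rare training words with an UNK token, and map
# any word unseen at test time to the same token.
from collections import Counter

UNK = "<UNK>"
MIN_COUNT = 2  # assumed frequency threshold

def build_vocab(train_tokens, min_count=MIN_COUNT):
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def unk_replace(tokens, vocab):
    return [w if w in vocab else UNK for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)
print(unk_replace(train, vocab))                  # rare training words -> <UNK>
print(unk_replace("the dog sat".split(), vocab))  # unseen test words   -> <UNK>
```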
What does this sentence mean? “I made her duck”
“duck”: noun or verb? “make”: “cook X” or “cause X to do Y”? “her”: “for her” or “belonging to her”?
Language has different kinds of ambiguity, e.g.:
Structural ambiguity
“I eat sushi with tuna” vs. “I eat sushi with chopsticks” “I saw the man with the telescope on the hill”
Lexical (word sense) ambiguity
“I went to the bank”: financial institution or river bank?
Referential ambiguity
“John saw Jim. He was drinking coffee.”
Open the pod door, Hal.
“open”: verb, adjective, or noun?
Verb: “open the door”
Adjective: “the open door”
Noun: “in the open”
We want to know the most likely tags T for the sentence S:
argmax_T P(T | S) = argmax_T P(T) P(S | T)
We need to define a statistical model of P(T | S), e.g.:
P(T) =def ∏i P(ti | ti−1)
P(S | T) =def ∏i P(wi | ti)
We need to estimate the parameters of P(T | S), e.g.: P( ti = V | ti−1 = N ) = 0.3
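A toy illustration of this model (all probabilities are invented for the example; a real tagger would use the Viterbi algorithm rather than brute-force enumeration):

```python
# Toy HMM-style tagger: score every tag sequence T for a short sentence S with
# P(T)P(S|T) = prod_i P(t_i | t_{i-1}) P(w_i | t_i) and return the argmax.
from itertools import product

tags = ["N", "V", "D"]
# P(t_i | t_{i-1}); "BOS" marks the start of the sentence (invented numbers)
trans = {("BOS", "D"): 0.5, ("BOS", "N"): 0.4, ("BOS", "V"): 0.1,
         ("D", "N"): 0.8, ("D", "V"): 0.1, ("D", "D"): 0.1,
         ("N", "V"): 0.5, ("N", "N"): 0.4, ("N", "D"): 0.1,
         ("V", "D"): 0.5, ("V", "N"): 0.4, ("V", "V"): 0.1}
# P(w_i | t_i) (invented numbers)
emit = {("the", "D"): 0.9, ("dog", "N"): 0.4, ("barks", "V"): 0.3,
        ("dog", "V"): 0.01, ("barks", "N"): 0.02, ("the", "N"): 0.001}

def score(words, tag_seq):
    p, prev = 1.0, "BOS"
    for w, t in zip(words, tag_seq):
        p *= trans.get((prev, t), 1e-6) * emit.get((w, t), 1e-6)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(tags, repeat=len(words)), key=lambda T: score(words, T))
print(best, score(words, best))  # expected: ('D', 'N', 'V')
```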
The second major problem in NLP is coverage: we will always encounter unfamiliar words and constructions (e.g. “cassoulet”, a French bean casserole). Our models need to be able to deal with this. This means that our models need to be able to generalize from what they have been trained on to what they will be used on.
Starting in the early 1990s, NLP became very empirical and data-driven due to
— success of statistical methods in machine translation (IBM systems)
— availability of large(ish) annotated corpora (Susanne Treebank, Penn Treebank, etc.)
Advantages over rule-based approaches:
— Common benchmarks to compare models against
— Empirical (objective) evaluation is possible
— Better coverage
— Principled way to handle ambiguity
NLP makes heavy use of statistical models as a way to handle both the ambiguity and the coverage issues.
Basic approach:
— Define a statistical model for the task (may depend on available labeled training data)
— Often, decompose the task into a sequence of processing steps, i.e. a pipeline
A language model defines a distribution P(w) over the strings w = w1w2…wi… in a language.
Typically we factor P(w) such that we compute the probability word by word:
P(w) = P(w1) P(w2 | w1) … P(wi | w1…wi−1)
Standard n-gram models make the Markov assumption that wi depends only on the preceding n−1 words:
P(wi | w1…wi−1) := P(wi | wi−n+1…wi−1)
We know that this independence assumption is invalid (there are many long-range dependencies), but it is computationally and statistically necessary (we can’t store or estimate larger models).
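A minimal sketch of an unsmoothed bigram model (n = 2), estimated by maximum likelihood from a toy corpus; the zero probability on the second test sentence is exactly the coverage problem that smoothing (or neural models) must address.

```python
# Illustrative sketch: maximum-likelihood bigram language model, no smoothing.
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        toks = [BOS] + sent.split() + [EOS]
        unigram.update(toks[:-1])
        bigram.update(zip(toks[:-1], toks[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

def sentence_prob(p, sentence):
    toks = [BOS] + sentence.split() + [EOS]
    prob = 1.0
    for prev, w in zip(toks[:-1], toks[1:]):
        prob *= p(prev, w)
    return prob

p = train_bigram(["the cat sat", "the cat slept", "the dog sat"])
print(sentence_prob(p, "the cat sat"))    # nonzero
print(sentence_prob(p, "the dog slept"))  # zero without smoothing
```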
Many statistical NLP systems use explicit features (e.g. for semantic role labeling, etc.).
Feature design is usually a big component of building any particular NLP system.
Determining which features are useful for a particular task and model typically requires experimentation, but there are a number of commonly used ones (words, POS tags, syntactic dependencies, NER labels, etc.).
Traditional sequence models (n-gram language models, HMMs, MEMMs, CRFs) make rigid Markov assumptions (bigram/trigram/n-gram). Recurrent neural nets (RNNs, LSTMs) and transformers can capture arbitrary-length histories without requiring more parameters.
Word-based features:
How do we handle unseen/rare words?
Many features are produced by other NLP systems (POS tags, dependencies, NER output, etc.) These systems are often trained on labeled data.
Producing labeled data can be very expensive. We typically don’t have enough labeled data from the domain of interest.
We might not get accurate features for our domain of interest.
Many of the current successful neural approaches to NLP do not use traditional discrete features.
— End-to-end models: no pipeline!
— Words in the input are often represented as dense vectors (aka word embeddings, e.g. word2vec).
Traditional approaches: each word in the vocabulary is a separate feature; no generalization across words that have similar meanings.
Neural approaches: words with similar meanings have similar vectors.
Other kinds of features (POS tags, dependencies, etc.) are often ignored.
Neural networks, typically with several hidden layers (depth = # of hidden layers).
Single-layer neural nets are linear classifiers.
Multi-layer neural nets are more expressive.
Very impressive performance gains in computer vision (ImageNet) and speech recognition over the last several years. Neural nets have been around for decades. Why have they suddenly made a comeback?
Fast computers (GPUs!) and (very) large datasets have made it possible to train these very complex models.
Simplest variant: single-layer feedforward net
For binary classification tasks:
— Input layer: vector x; output unit: scalar y
— Return 1 if y > 0.5, return 0 otherwise
For multiclass classification tasks:
— Input layer: vector x; output layer: vector y with K output units (each output unit yi corresponds to class i)
— Return argmaxi(yi)
Multiclass classification = predict one of K classes.
Return the class i with the highest score: argmaxi(yi).
In neural networks, this is typically done by using the softmax function, which maps a real-valued vector into a probability distribution:
For a vector z = (z0…zK): P(i) = softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk)
(NB: This is just logistic regression)
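A minimal numpy version of this (the scores are arbitrary example values; subtracting the maximum is a standard trick for numerical stability and does not change the result):

```python
# Illustrative sketch of the softmax function defined above.
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, probs.sum())     # a proper distribution that sums to 1
print(int(np.argmax(probs)))  # predicted class = index of the highest score
```

Note that argmax over the softmax probabilities is the same as argmax over the raw scores; the softmax matters when we need an actual distribution (e.g. for training with a cross-entropy loss).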
Single-layer (linear) feedforward network
y = wx + b (binary classification)
w is a weight vector, b is a bias term (a scalar)
This is just a linear classifier (aka Perceptron)
(the output y is a linear function of the input x)
Single-layer non-linear feedforward networks: Pass wx + b through a non-linear activation function, e.g. y = tanh(wx + b)
Sigmoid (logistic function): σ(x) = 1/(1 + e−x)
Useful for output units (probabilities) [0,1] range
Hyperbolic tangent: tanh(x) = (e2x −1)/(e2x+1)
Useful for internal units: [-1,1] range
Hard tanh (approximates tanh): htanh(x) = −1 for x < −1, 1 for x > 1, x otherwise
Rectified Linear Unit: ReLU(x) = max(0, x)
Useful for internal units
Softmax: softmax(zi) = exp(zi) ∕ ∑k=0..K exp(zk)
Special case for output units (multiclass classification)
[Plots of sigmoid(x), tanh(x), hardtanh(x), and ReLU(x)]
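Minimal numpy versions of these activation functions, applied in a single-layer forward pass with made-up weights:

```python
# Illustrative sketch: the activation functions above, plus y = g(Wx + b).
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def hardtanh(x): return np.clip(x, -1.0, 1.0)
def relu(x):     return np.maximum(0.0, x)

x = np.array([0.5, -1.0, 2.0])         # toy input vector
W = np.array([[0.1, -0.2, 0.4],
              [0.3,  0.1, 0.0]])       # made-up weights
b = np.array([0.0, -0.1])
print(relu(W @ x + b))                 # output of a single non-linear layer
```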
We can generalize this to multi-layer feedforward nets
Input layer: vector x; hidden layers: vectors h1, …, hn; output layer: vector y
In NLP, the input and output variables are discrete: words, labels, structures. NNs work best with continuous vectors.
We typically want to learn a mapping (embedding) from discrete words (input) to dense vectors. We can do this with (simple) neural nets and related methods.
The input to a NN is (traditionally) a fixed-length vector. How do we represent a variable-length sequence as a vector?
With recurrent neural nets: read in one word at a time to predict a vector, use that vector and the next word to predict a new vector, etc. With convolutional nets: use a sliding (fixed-length) window.
Word embeddings (word2vec, GloVe, etc.)
Train a NN to predict a word from its context (or the context from a word) to get a dense vector representation of each word
Neural language models:
Use recurrent neural networks (RNNs, GRUs, LSTMs) to predict word sequences, or to obtain context-sensitive embeddings (ELMo)
Sequence-to-sequence (seq2seq) models:
From machine translation: use one RNN to encode the source string, and another RNN to decode this into a target string. Also used for automatic image captioning, etc.
Convolutional neural nets
Used e.g. for text classification
Transformers
Probability distribution over the strings in a language, typically factored into distributions P(wi | …) for each word:
P(w) = P(w1…wn) = ∏i P(wi | w1…wi−1)
N-gram models assume each word depends only on the preceding n−1 words:
P(wi | w1…wi−1) =def P(wi | wi−n+1…wi−1)
To handle variable-length strings, we assume each string starts with n−1 start-of-sentence symbols (BOS), or〈S〉, and ends in a special end-of-sentence symbol (EOS), or〈\S〉.
— The vocabulary V contains n types (incl. UNK, BOS, EOS)
— We want to condition each word on k preceding words
— [Naive] Each input word wi ∈ V (that we’re conditioning on) is an n-dimensional one-hot vector v(w) = (0,…,0,1,0,…,0)
— Our input layer x = [v(w1),…,v(wk)] has n×k elements
— To predict the probability over output words, the output layer is a softmax over n elements: P(w | w1…wk) = softmax(hW2 + b2)
With (say) one hidden layer h, we’ll need two sets of parameters: one for the input-to-hidden mapping and one for the hidden-to-output mapping.
Architecture:
Input layer: x = [v(w1)…v(wk)], where v(w) is a one-hot vector of size dim(V) = |V|
Hidden layer: h = g(xW1 + b1)
Output layer: P(w | w1…wk) = softmax(hW2 + b2)
Parameters:
Weight matrices and biases:
first layer: W1 ∈ R^(k·dim(V) × dim(h)), b1 ∈ R^dim(h)
second layer: W2 ∈ R^(dim(h) × |V|), b2 ∈ R^|V|
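A minimal numpy sketch of this architecture with toy sizes and random, untrained parameters (a real model would learn W1, b1, W2, b2 by backpropagation):

```python
# Illustrative forward pass of the feedforward n-gram LM described above.
import numpy as np

rng = np.random.default_rng(0)
V, k, dim_h = 10, 3, 8          # toy vocabulary size, context size, hidden size

def one_hot(i, n=V):
    v = np.zeros(n); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

W1 = rng.normal(scale=0.1, size=(k * V, dim_h)); b1 = np.zeros(dim_h)
W2 = rng.normal(scale=0.1, size=(dim_h, V));     b2 = np.zeros(V)

context = [2, 5, 7]                               # word ids of the k context words
x = np.concatenate([one_hot(i) for i in context]) # input layer: k concatenated one-hots
h = np.tanh(x @ W1 + b1)                          # hidden layer: h = g(x W1 + b1)
p = softmax(h @ W2 + b2)                          # output layer: P(w | w1...wk)
print(p.shape, p.sum())                           # (10,) 1.0
```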
How many parameters do we need to learn?
Traditional n-gram model: dim(V)^k parameters
With dim(V) = 10,000 and k=3: 1,000,000,000,000
Naive neural n-gram model (one-hot encoding of vocabulary): #parameters going to hidden layer: k∙dim(V)∙dim(h),
with dim(h) = 300, dim(V) = 10,000 and k=3: 9,000,000
plus #parameters going to output layer: dim(h)∙dim(V)
with dim(h) = 300, dim(V) = 10,000: 3,000,000
The neural model still requires a lot of parameters, but far fewer than the traditional n-gram model.
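The arithmetic above can be checked directly (same assumed sizes):

```python
# Reproducing the parameter counts above.
dim_V, dim_h, k = 10_000, 300, 3

print(dim_V ** k)         # traditional n-gram model:       1,000,000,000,000
print(k * dim_V * dim_h)  # neural model, input-to-hidden:   9,000,000
print(dim_h * dim_V)      # neural model, hidden-to-output:  3,000,000
```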
Advantages over traditional n-gram models:
— The hidden layer captures interactions among context words
— Increasing the order of the n-gram requires only a small linear increase in the number of parameters:
dim(W1) goes from k·dim(emb)×dim(h) to (k+1)·dim(emb)×dim(h), whereas a traditional k-gram model requires dim(V)^k parameters
— Increasing the vocabulary also leads only to a linear increase in the number of parameters
Naive neural models have the same shortcomings as standard n-gram models:
— Models get very large (and sparse) as n increases
— We can’t generalize across similar contexts
— N-gram Markov (independence) assumptions are too strict
Better neural language models overcome these by…
… using word embeddings instead of one-hots as input:
Instead of representing context words as distinct, discrete symbols (i.e. very high-dimensional one-hot vectors), use a dense low-dimensional vector representation where similar words have similar vectors [next]
… using recurrent nets instead of feedforward nets:
Instead of a fixed-length (n-gram) context, use recurrent nets to encode variable-length contexts [later class]
Basic RNN: Modify the standard feedforward architecture (which predicts a string w0…wn one word at a time) such that the output of the current step (wi) is given as additional input to the next time step (when predicting the output for wi+1).
“Output” — typically (the last) hidden layer.
[Diagram: a feedforward net vs. a recurrent net (input and hidden layers); in the recurrent net, the hidden layer is fed back in as additional input at the next time step]
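A minimal numpy sketch of one recurrent step (Elman-style), with random untrained parameters and toy word vectors, to make the recurrence concrete:

```python
# Illustrative basic RNN step: the previous hidden state is fed back in as
# additional input at each time step.
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 5
W_xh = rng.normal(scale=0.1, size=(dim_x, dim_h))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))   # hidden -> hidden (recurrence)
b_h  = np.zeros(dim_h)

def rnn_step(x_t, h_prev):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Process a toy sequence of 3 "word vectors", one time step at a time.
h = np.zeros(dim_h)
for x_t in rng.normal(size=(3, dim_x)):
    h = rnn_step(x_t, h)
print(h)   # final hidden state: a fixed-size encoding of the whole sequence
```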
Main idea: If you use a feedforward network to predict the probability of words that appear in the context of (near) an input word, the hidden layer of that network provides a dense vector representation of the input word. Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations. These models can be trained on large amounts of raw text (and pretrained embeddings can be downloaded).
Task (e.g. machine translation):
Given one variable length sequence as input, return another variable length sequence as output
Main idea:
Use one RNN to encode the input sequence (“encoder”).
Feed the last hidden state as input to a second RNN (“decoder”) that then generates the output sequence.
Use attention mechanisms (e.g. to focus on certain parts of the input when generating output).
Non-recurrent architecture for seq2seq tasks:
— Also has an encoder and a decoder, but these process the entire input at once (the decoder may mask future outputs so it can generate the output sequentially)
— May use positional embeddings to capture sequence information
— May use multiple self-attention (attention within a sequence) mechanisms in parallel
— Can be (pre)trained on very large amounts of data, and then fine-tuned for specific tasks
— Yields state-of-the-art context-dependent encodings (BERT) and language models (GPT-2)
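A minimal numpy sketch of (single-head) scaled dot-product self-attention, the core operation inside a transformer layer; the projection matrices here are random stand-ins for learned parameters:

```python
# Illustrative single-head self-attention over a toy sequence of embeddings.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))            # one embedding per input position

W_q = rng.normal(scale=0.1, size=(d_model, d_k))   # random stand-ins for learned
W_k = rng.normal(scale=0.1, size=(d_model, d_k))   # query/key/value projections
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)                    # how much each position attends to the others
A = softmax(scores)                                # each row is an attention distribution
out = A @ V                                        # new representation for each position
print(A.shape, out.shape)                          # (5, 5) (5, 8)
```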