Algorithms for NLP · CS 11-711 Fall 2020 · Lecture 1: Introduction


SLIDE 1

Emma Strubell

Algorithms for NLP

CS 11-711 · Fall 2020

Lecture 1: Introduction

SLIDE 2

Welcome!

Emma, Yulia, Bob, Sanket, Han, Jiateng

SLIDE 3

Course website:

http://demo.clab.cs.cmu.edu/11711fa20/


Piazza:

https://piazza.com/cmu/fall2020/11711/

SLIDE 4

Communicating with machines


~1950s-1970s:

SLIDE 5

Communicating with machines


~1980s:

SLIDE 6

Communicating with machines


Today:

SLIDE 7

What is natural language processing (NLP)?

■ NL ∈ {Mandarin, Hindi, Spanish, Arabic, English, … Inuktitut}
■ Automation of NLs:
  ■ analysis (NL → R)
  ■ generation (R → NL)
  ■ acquisition of R from knowledge and data

SLIDE 8

What is natural language processing (NLP)?


Image credit: Noah Smith

SLIDE 9

Language technologies

What technologies are required to write such a program?

■ A conversational agent contains:
  ■ Speech recognition
  ■ Language analysis
  ■ Dialog processing
  ■ Information retrieval
  ■ Text to speech

SLIDE 10

Language technologies

Machine translation


SLIDE 11

Language technologies

Question answering

■ What does “divergent” mean?
■ What year was Abraham Lincoln born?
■ How many states were in the United States that year?
■ How much Chinese silk was exported to England at the end of the 18th century?
■ What do scientists think about the ethics of human cloning?

SLIDE 12

Natural language processing

Applications

■ Machine Translation
■ Information Retrieval
■ Question Answering
■ Dialogue Systems
■ Information Extraction
■ Summarization
■ Sentiment Analysis
■ ...

Core technologies

■ Language modeling
■ Part-of-speech tagging
■ Syntactic parsing
■ Named-entity recognition
■ Coreference resolution
■ Word sense disambiguation
■ Semantic role labelling
■ ...

SLIDE 13

What does an NLP system need to “know”?

■ Language consists of many levels of structure
■ Humans fluently integrate all of these in producing and understanding language
■ Ideally, so would a computer!

SLIDE 14

What does it mean to “know” a language?


SLIDE 15

Levels of linguistic knowledge


Image credit: Noah Smith

SLIDE 16

Speech, phonetics, phonology


/ ðɪs ɪz ə ˈsɪmpl ˈsɛntəns /.

This is a simple sentence .

SLIDE 17

Words

■ Morphological analysis
■ Tokenization
■ Lemmatization

[Diagram: “This is a simple sentence .” with a token layer and a morphology layer (“is” → be, 3sg present)]

SLIDE 18

Syntax

■ Part-of-speech tagging

[Diagram: the same sentence with layers for tokens, morphology (“is” → be, 3sg present), and part-of-speech tags (DT VBZ DT JJ NN PUNC)]
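As a concrete sketch of this step (not from the slides), NLTK’s pretrained English tagger produces the same Penn Treebank-style tags; resource names vary slightly across NLTK versions, and its punctuation tag is “.” rather than PUNC:

```python
# Minimal POS-tagging sketch, assuming the nltk package and its
# downloadable models (names may differ slightly by NLTK version).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("This is a simple sentence.")
print(nltk.pos_tag(tokens))
# [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'),
#  ('sentence', 'NN'), ('.', '.')]
```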

SLIDE 19

Syntax

■ Part-of-speech tagging
■ Syntactic parsing

[Diagram: the same sentence with layers for tokens, morphology, part-of-speech tags, and syntactic structure (NP, VP, S)]

SLIDE 20

Semantics

■ Named entity recognition
■ Word sense disambiguation
■ Semantic role labeling

[Diagram: the same sentence with layers for tokens, morphology, part-of-speech tags, syntax, and now word senses:
  SIMPLE1: having few parts
  SENTENCE1: string of words satisfying the grammatical rules of a language]
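The numbered senses above (SIMPLE1, SENTENCE1) suggest a sense inventory such as WordNet; a minimal lookup sketch, assuming NLTK’s WordNet interface:

```python
# List candidate senses for "sentence"; disambiguation means picking
# the right one in context (here we just print WordNet's inventory).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for synset in wn.synsets("sentence")[:3]:
    print(synset.name(), ":", synset.definition())
```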

SLIDE 21

Discourse

■ Reference resolution
■ Discourse parsing

[Diagram: “This is a simple sentence .” followed by “But an instructive one .”, with all previous layers (tokens, morphology, part-of-speech tags, syntax, and the word senses SIMPLE1 “having few parts” and SENTENCE1 “string of words satisfying the grammatical rules of a language”) plus a discourse layer marking “one” as coreferent with “sentence”]

SLIDE 22

Where are we now?


https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html

SLIDE 23

Where are we now?


Conversation A

Human: say you’re in an empty giant office and you see a black bird flying indoors. what do you do?
Meena: What kind of bird? Is it friendly? How large? How big is the office? Is it indoors or outdoors?
Human: well it’s a predator bird and it’s gonna eat you
Meena: I would grab my phone and call 911.

Towards a Human-like Open-Domain Chatbot. Adiwardana, Luong, So, Hall, Fiedel, Thoppilan, Yang, Kulshreshtha, Nemade, Lu, Le. arXiv preprint 2001.09977. 2020.

Cross-turn Repetition Example 1

Human: Hi!
Meena: Hi! How are you?! :)
Human: doing good, what would be the best city to visit in Europe?
Meena: Paris, Barcelona, Amsterdam, Prague
Human: great list! why?
Meena: Paris is a beautiful city, and Barcelona is too. And I’ve always wanted to go to Amsterdam and Prague and have never been.
Human: have you even been to India?
Meena: No, but I’d love to go.
Human: how about Japan?
Meena: I’d love to go to Japan too.
Human: how about Paris?
Meena: I’d love to go to Paris too.
Human: are you lying to me?
Meena: I’d love to go to Japan too.
Human: i think you’re lying to me, that makes me sad
Meena: I’d love to go to Japan, too.
Human: you’re blocking me out! you have to talk to me
Meena: I’d love to go to Japan.
Human: where else?
Meena: I’d love to go to Japan, too!

SLIDE 24

Where are we now?


SLIDE 25

Why is NLP hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled variables
7. Unknown representation R

SLIDE 26

Ambiguity

■ Ambiguity at multiple levels:
  ■ Word sense: bank (finance or river?)
  ■ Part of speech: chair (noun or verb?)
  ■ Syntactic structure: I saw the man with the telescope (see the grammar sketch below)
  ■ Multiple: I saw her duck
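To make the telescope example concrete, here is a sketch with a toy grammar (made up for illustration) under which the sentence has exactly two parses, one attaching the prepositional phrase to the noun phrase and one to the verb phrase:

```python
# Structural ambiguity with a toy CFG: NLTK's chart parser enumerates
# both analyses of "I saw the man with the telescope".
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  PP -> P NP
  NP -> Det N | Det N PP | 'I'
  VP -> V NP | VP PP
  Det -> 'the'
  N  -> 'man' | 'telescope'
  V  -> 'saw'
  P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)  # two trees: PP attached to the NP vs. to the VP
```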

SLIDE 27

Ambiguity + Scale


SLIDE 28

Ambiguity: Tokenization


SLIDE 29

Ambiguity: Word sense disambiguation


SLIDE 30

Ambiguity: Tokenization + disambiguation


SLIDE 31

Ambiguity: Morphology

■ unfriend
■ Obamacare
■ Manfuckinghattan

SLIDE 32

Ambiguity: Semantic analysis

■ Every language sees the world in a different way
■ For example, this could depend on historical conditions
■ Russian has very few words for colors; Japanese has hundreds
■ Idioms: happy as a clam, it’s raining cats and dogs, les carottes sont cuites (French: “the carrots are cooked”, i.e., it’s all over)

SLIDE 33

Ambiguity: Semantics


Every fifteen minutes a woman in this country gives birth.

SLIDE 34

Ambiguity: Semantics


Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!

— Groucho Marx

SLIDE 35

Ambiguity: Syntax + semantics

■ I saw the woman with the telescope wrapped in paper.
  ■ Who has the telescope?
  ■ Who or what is wrapped in paper?
  ■ Event of perception or assault?

SLIDE 36

Dealing with ambiguity

■ How can we model ambiguity and choose the correct analysis in context?
■ Non-probabilistic methods return all possible analyses.
■ Probabilistic models return the best possible analysis, i.e. the most probable one according to the model.

But the “best” analysis is only good if our probabilities are accurate. Where do they come from?
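As a sketch of the probabilistic option, the same toy telescope grammar can be given made-up rule probabilities; a Viterbi parser then returns only the single most probable analysis:

```python
# Probabilistic disambiguation with a toy PCFG (probabilities invented
# for illustration; each nonterminal's rule probabilities sum to 1).
import nltk

pcfg = nltk.PCFG.fromstring("""
  S  -> NP VP  [1.0]
  PP -> P NP   [1.0]
  NP -> Det N [0.4] | Det N PP [0.2] | 'I' [0.4]
  VP -> V NP  [0.7] | VP PP    [0.3]
  Det -> 'the' [1.0]
  N  -> 'man' [0.5] | 'telescope' [0.5]
  V  -> 'saw' [1.0]
  P  -> 'with' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree.prob(), tree)  # only the highest-probability parse
```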

SLIDE 37

Corpora

■ A corpus is a collection of text
  ■ Often annotated in some way
  ■ Sometimes just lots of raw text
■ Examples:
  ■ Penn Treebank: 1M words of parsed Wall Street Journal
  ■ Canadian Hansards: 10M+ words of aligned French/English sentences
  ■ Yelp reviews
  ■ The Web / Common Crawl: billions of words of who knows what

SLIDE 38

Corpus-based methods

■ Give us statistical information

[Figure: empirical distributions of NP expansions, for all NPs, for NPs under S, and for NPs under VP]
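A sketch of how such counts can be gathered, assuming the small Penn Treebank sample bundled with NLTK:

```python
# Count how NP-flavored nodes (NP, NP-SBJ, ...) expand in the
# Penn Treebank sample that ships with NLTK.
import nltk
from collections import Counter

nltk.download("treebank", quiet=True)

counts = Counter()
for tree in nltk.corpus.treebank.parsed_sents():
    for prod in tree.productions():
        if prod.lhs().symbol().startswith("NP"):
            counts[str(prod)] += 1

for rule, n in counts.most_common(5):
    print(n, rule)
```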

SLIDE 39

Statistical NLP

■ Like most other parts of AI, NLP is dominated by statistical methods
■ Typically more robust than earlier rule-based methods
■ Relevant statistics/probabilities are learned from data
■ Normally requires lots of data about any particular phenomenon

SLIDE 40

Why is NLP hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled variables
7. Unknown representation R

SLIDE 41

Sparsity

■ Sparse data due to Zipf’s Law
■ To illustrate, let’s look at the frequencies of different words in a large text corpus
■ Assume a “word” is a string of letters separated by spaces

SLIDE 42

Sparsity

■ Most frequent words in the English Europarl corpus (out of 24m word tokens)

SLIDE 43

Word counts

■ Out of 93,638 distinct words (types), 36,231 (~40%) occur only once.
■ Examples:
  ■ cornflakes, mathematicians, fuzziness, jumbling
  ■ pseudo-rapporteur, lobby-ridden, perfunctorily
  ■ Lycketoft, UNCITRAL, H-0695
  ■ policyfor, Commissioneris, 145.95, 27a
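A sketch of how such type/token counts are computed; the corpus filename here is hypothetical:

```python
# Tally word frequencies and find the hapax legomena (types that
# occur exactly once). "Word" = whitespace-separated string, as above.
from collections import Counter

with open("europarl.en.txt", encoding="utf-8") as f:  # hypothetical file
    counts = Counter(f.read().split())

hapaxes = [w for w, c in counts.items() if c == 1]
print(f"{len(counts)} types, {sum(counts.values())} tokens, "
      f"{len(hapaxes)} types occurring only once")
```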

SLIDE 44

Plotting word frequencies

■ Order words by frequency. What is the frequency of the nth-ranked word?
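A plotting sketch (hypothetical corpus file, matplotlib assumed); on log-log axes, Zipf’s law predicts a roughly straight line, with frequency falling off as roughly 1/rank:

```python
# Rank-frequency plot: sort frequencies in decreasing order and plot
# frequency against rank on log-log axes.
from collections import Counter
import matplotlib.pyplot as plt

with open("europarl.en.txt", encoding="utf-8") as f:  # hypothetical file
    counts = Counter(f.read().split())

freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()
```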

SLIDE 45

Zipf’s Law

■ Implications:
  ■ Regardless of how large our corpus is, there will be a lot of infrequent (and zero-frequency!) words
  ■ This means we need to find clever ways to estimate probabilities for things we have rarely or never seen (one simple fix is sketched below)
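The simplest such trick, shown as a sketch: add-one (Laplace) smoothing, which reserves a little probability mass for words we have never seen:

```python
# Add-one smoothing: every word, seen or unseen, gets count + 1, so
# zero-frequency words receive a small nonzero probability.
from collections import Counter

def laplace_prob(word: str, counts: Counter, vocab_size: int) -> float:
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter("the cat sat on the mat".split())
print(laplace_prob("the", counts, vocab_size=10_000))  # seen word
print(laplace_prob("dog", counts, vocab_size=10_000))  # unseen, still > 0
```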

SLIDE 46

Why is NLP hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled variables
7. Unknown representation R

SLIDE 47

Variation

■ Suppose we train a part-of-speech tagger or a parser on the Wall Street Journal…

( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .)))

■ What will happen if we try to use this tagger/parser on social media?
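A sketch of reading such a bracketing with NLTK (using the S-rooted subtree; raw PTB files wrap trees in an extra empty outer bracket that needs stripping first):

```python
# Parse a Penn Treebank bracketing into a tree object.
import nltk

ptb = """(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
            (ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,))
           (VP (MD will)
             (VP (VB join) (NP (DT the) (NN board))
               (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
               (NP-TMP (NNP Nov.) (CD 29))))
           (. .))"""

tree = nltk.Tree.fromstring(ptb)
print(tree.label())  # S
print(tree.pos())    # (word, POS-tag) pairs recovered from the leaves
```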

SLIDE 48

Why is NLP hard?

SLIDE 49

Why is NLP hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled variables
7. Unknown representation R

SLIDE 50

Expressivity

■ Not only can one form have different meanings (ambiguity), but the same meaning can be expressed with different forms:

She gave the book to Deni. vs. She gave Deni the book.
Some kids popped by. vs. A few children visited.
Is that window still open? vs. Please close the window.

SLIDE 51

Unmodeled variables

■ World knowledge
  ■ I dropped the glass on the floor and it broke
  ■ I dropped the hammer on the glass and it broke

[Image captions: “drink this milk.”; “skater eats pavement”]

SLIDE 52

Unknown representation

■ Very difficult to capture what R is, since we don’t even know how to represent the knowledge a human has/needs:
  ■ What is the “meaning” of a word, sentence, or utterance?
  ■ How do we model context?
  ■ Other general knowledge?

SLIDE 53

Desiderata for NLP models

■ Sensitivity to a wide range of phenomena and constraints in human language
■ Generality across languages, modalities, genres, styles
■ Strong formal guarantees (e.g., convergence, statistical efficiency, consistency)
■ High accuracy when judged against expert annotations or test data
■ Efficient
■ Ethical

SLIDE 54

Symbolic and probabilistic NLP

[Timeline: ~1990s, the shift from logic/rule-based NLP to statistical NLP]

SLIDE 55

Probabilistic and Connectionist NLP

[Timeline: ~mid-2010s, the shift from engineered features to learned features]

SLIDE 56

NLP ≟ Machine Learning

■ To be successful, a machine learner needs bias/assumptions; for NLP, that might be linguistic theory/representations.
■ Symbolic, probabilistic, and connectionist ML have all seen NLP as a source of inspiring applications.

SLIDE 57

What is nearby NLP?

■ Computational linguistics
  ■ Using computational methods to better understand how language works
  ■ We end up doing this and using it
■ Cognitive science
  ■ How does the human brain work?
  ■ Includes the bits that do language
  ■ Humans: the only working NLP prototype!
■ Speech processing
  ■ Mapping audio signals to text
  ■ Traditionally separate from NLP; converging?
  ■ Two components: acoustic models and language models
  ■ Language models are in the domain of statistical NLP

SLIDE 58

Course logistics


SLIDE 59

What is this class?

Three aspects to the course:

■ Linguistic issues
  ■ What is the range of language phenomena?
  ■ What are the knowledge sources that let us disambiguate?
  ■ What representations are appropriate?
  ■ How do you know what to model and what not to model?
■ Statistical modeling methods
  ■ Increasingly complex model structures
  ■ Learning and parameter estimation
  ■ Efficient inference: dynamic programming, search, sampling
■ Engineering methods
  ■ Issues of scale: we’ll focus on what makes problems hard, and what works in practice

SLIDE 60

What is this class?

Models and algorithms:

■ Models
  ■ State machines (finite state automata / transducers)
  ■ Logic (first-order logic)
  ■ Probabilistic models (WFST, language models, HMM, SVM, CRF, …)
  ■ Vector space models (embeddings, Transformers)
■ Algorithms
  ■ State space search (DFS, BFS, A*, dynamic programming: Viterbi, CKY)
  ■ Supervised learning
  ■ Unsupervised learning
■ Methodological tools
  ■ Training/test sets, evaluation, cross-validation

SLIDE 61

What is this class?

Outline of topics:

■ Words and sequences
■ Classification
■ Language modeling
■ Vector semantics and word embeddings
■ Sequence labeling
■ Structured classification
■ Parsing
■ Morphology
■ Semantics
■ Discourse
■ Applications
  ■ Machine translation, summarization, sentiment analysis, computational ethics

SLIDE 62

What is this class?

Grading

■ This is a project-based course. Grading is based on 5 homework assignments (four projects plus a paper review), as well as class participation.
■ Four projects, each worth 20% of your final grade: classification, sequence labeling, dependency parsing, and coreference.
■ One review/presentation of a paper from this year’s EMNLP: 10%
■ Participation on Piazza: 10%

SLIDE 63

What is this class?

Requirements and goals

■ Class requirements
  ■ Uses a variety of skills/knowledge:
    ■ Probability and statistics, graphical models
    ■ Basic linguistics background
    ■ Strong coding skills (Python)
  ■ Most people are probably missing one of the above. You will have to work on your own to fill the gaps.
■ Class goals
  ■ Learn the issues and techniques of statistical NLP
  ■ Build realistic NLP tools
  ■ Be able to read current research papers in the field

SLIDE 64

What is this class?

Readings

■ Primary texts
  ■ Jurafsky and Martin, Speech and Language Processing, 2nd and 3rd edition (not 1st). 3rd ed.: https://web.stanford.edu/~jurafsky/slp3/
  ■ Eisenstein, Introduction to Natural Language Processing: https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
■ Prerequisites
  ■ Mastery of basic probability
  ■ Strong skills in Python or equivalent
  ■ Deep interest in language

SLIDE 65

Other announcements

■ Course contacts:
  ■ Webpage: materials and announcements
  ■ Piazza: discussion forum
  ■ Canvas: project submissions
■ Homework questions: recitation, Piazza, TA office hours
■ Enrollment: we’ll try to take everyone who meets the requirements
■ Computing resources:
  ■ Homeworks should run fine on relatively modern hardware.
  ■ If you need access to more compute for extra-credit experimentation, we may be able to arrange for cloud credits.
■ Questions?