SLIDE 1

Natural Language Processing

Info 159/259
Lecture 5: Truth and ethics (Sept 6, 2018)
David Bamman, UC Berkeley

SLIDE 2

Natural Language Processing

Info 159/259
Lecture 5: Truth and ethics (Sept 6, 2018)
David Bamman, UC Berkeley

Hwæt! Wé Gárdena in géardagum, þéodcyninga þrym gefrúnon, hú ðá æþelingas ellen fremedon. Oft Scyld Scéfing sceaþena
SLIDE 3

SLIDE 4

[Figure: hidden units h1, h2, h3 computed from input positions x1 … x7]

h1 = σ(x1W1 + x2W2 + x3W3)
h2 = σ(x3W1 + x4W2 + x5W3)
h3 = σ(x5W1 + x6W2 + x7W3)

h1 = f(I, hated, it)
h2 = f(it, I, really)
h3 = f(really, hated, it)

I hated it I really hated it

[Figure: the same computation as a matrix operation, combining an input window (x1, x2, x3) with weight vectors W1, W2, W3]

SLIDE 5

Convolutional networks

[Figure: a convolution over tokens x1 … x7 followed by max pooling; this defines one filter.]
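To make the picture concrete, here is a minimal numpy sketch of one such filter (not code from the lecture; the input values and weights are invented, and each token is reduced to a single feature rather than an embedding vector):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Invented input: 7 tokens, each reduced to one feature value
# (in practice each position would be a word-embedding vector).
x = np.array([1.0, 10.0, 2.0, 1.0, 5.0, 10.0, 3.0])

# One filter of width 3 with invented weights (W1, W2, W3).
W = np.array([0.5, -1.0, 0.25])

# Convolution: apply the filter to each window of 3 adjacent tokens.
h = np.array([sigmoid(x[i:i + 3] @ W) for i in range(len(x) - 2)])

# Max pooling: collapse the feature map to one value for this filter.
print(h, h.max())
```

Real text CNNs apply many such filters in parallel over word-embedding vectors; each filter contributes one pooled feature to the classifier.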

SLIDE 6

Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

SLIDE 7

Modern NLP is driven by annotated data

  • Penn Treebank (1993; 1995; 1999): morphosyntactic annotations of WSJ
  • OntoNotes (2007–2013): syntax, predicate-argument structure, word sense, coreference
  • FrameNet (1998–): frame-semantic lexica/annotations
  • MPQA (2005): opinion/sentiment
  • SQuAD (2016): annotated questions + spans of answers in Wikipedia

SLIDE 8

Modern NLP is driven by annotated data

  • In most cases, the data we have is the product of human judgments.
  • What’s the correct part of speech tag?
  • Syntactic structure?
  • Sentiment?

SLIDE 9

Ambiguity

“One morning I shot an elephant in my pajamas”

Animal Crackers

SLIDE 10

Dogmatism

Fast and Horvitz (2016), “Identifying Dogmatism in Social Media: Signals and Models”

SLIDE 11

Sarcasm

https://www.nytimes.com/2016/08/12/opinion/an-even-stranger-donald-trump.html?ref=opinion

SLIDE 12

Fake News

http://www.fakenewschallenge.org

SLIDE 13

Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning

Annotation pipeline

SLIDE 14

Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning

Annotation pipeline

SLIDE 15

Annotation Guidelines

  • Our goal: given the constraints of our problem, how can we formalize our description of the annotation process to encourage multiple annotators to provide the same judgment?

SLIDE 16

Annotation guidelines

  • What is the goal of the project?
  • What is each tag called and how is it used? (Be specific: provide examples, and discuss gray areas.)
  • What parts of the text do you want annotated, and what should be left alone?
  • How will the annotation be created? (For example, explain which tags or documents to annotate first, how to use the annotation tools, etc.)

Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning

SLIDE 17

Practicalities

  • Annotation takes time and concentration (you can’t do it 8 hours a day)
  • Annotators get better as they annotate (earlier annotations are not as good as later ones)

SLIDE 18

Why not do it yourself?

  • Expensive/time-consuming
  • Multiple people provide a measure of consistency: is the task well enough defined?
  • Low agreement = not enough training, guidelines not well enough defined, task is bad

SLIDE 19

Adjudication

  • Adjudication is the process of deciding on a single annotation for a piece of text, using information about the independent annotations.
  • Can be as time-consuming as (or more so than) a primary annotation.
  • Does not need to be identical with a primary annotation (both annotators can be wrong by chance).

SLIDE 20

Interannotator agreement

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                6          3
  fried chicken        2          5

  • Observed agreement = 11/16 = 68.75%

https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

SLIDE 21

Cohen’s kappa

  • If classes are imbalanced, we can get high interannotator agreement simply by chance

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                7          4
  fried chicken        8         81

SLIDE 22

Cohen’s kappa

  • If classes are imbalanced, we can get high interannotator agreement simply by chance

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                7          4
  fried chicken        8         81

κ = (p_o − p_e) / (1 − p_e) = (0.88 − p_e) / (1 − p_e)

SLIDE 23

Cohen’s kappa

  • Expected probability of agreement is how often we would expect two annotators to agree, assuming independent annotations:

p_e = P(A = puppy, B = puppy) + P(A = chicken, B = chicken)
    = P(A = puppy)P(B = puppy) + P(A = chicken)P(B = chicken)

SLIDE 24

Cohen’s kappa

p_e = P(A = puppy)P(B = puppy) + P(A = chicken)P(B = chicken)

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                7          4
  fried chicken        8         81

P(A = puppy)   = 15/100 = 0.15
P(B = puppy)   = 11/100 = 0.11
P(A = chicken) = 85/100 = 0.85
P(B = chicken) = 89/100 = 0.89

p_e = 0.15 × 0.11 + 0.85 × 0.89 = 0.773

SLIDE 25

Cohen’s kappa

  • If classes are imbalanced, we can get high interannotator agreement simply by chance

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                7          4
  fried chicken        8         81

κ = (p_o − p_e) / (1 − p_e) = (0.88 − 0.773) / (1 − 0.773) = 0.471
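The arithmetic on these slides is easy to check in a few lines of Python (a sketch assuming numpy; the counts are those from the table above):

```python
import numpy as np

# Counts from the slide; columns = annotator A, rows = annotator B,
# in the order (puppy, fried chicken).
table = np.array([[ 7,  4],
                  [ 8, 81]])

n = table.sum()
p_o = np.trace(table) / n           # observed agreement: 0.88

p_a = table.sum(axis=0) / n         # A's marginals: [0.15, 0.85]
p_b = table.sum(axis=1) / n         # B's marginals: [0.11, 0.89]
p_e = (p_a * p_b).sum()             # chance agreement: 0.773

print((p_o - p_e) / (1 - p_e))      # kappa ≈ 0.471
```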

SLIDE 26

Cohen’s kappa

  • “Good” values are subject to interpretation, but a rule of thumb:

0.80–1.00   Very good agreement
0.60–0.80   Good agreement
0.40–0.60   Moderate agreement
0.20–0.40   Fair agreement
< 0.20      Poor agreement

SLIDE 27

                         annotator A
                     puppy   fried chicken
annotator B
  puppy                –          –
  fried chicken        –         100

SLIDE 28

                         annotator A
                     puppy   fried chicken
annotator B
  puppy               50          –
  fried chicken        –         50

SLIDE 29

                         annotator A
                     puppy   fried chicken
annotator B
  puppy               50          –
  fried chicken        –         50

SLIDE 30

Interannotator agreement

  • Cohen’s kappa can be used for any number of classes.
  • Still requires two annotators who evaluate the same items.
  • Fleiss’ kappa generalizes to multiple annotators, each of whom may evaluate different items (e.g., crowdsourcing)

SLIDE 31

Fleiss’ kappa

  • Same fundamental idea of measuring the observed agreement compared to the agreement we would expect by chance.
  • With N > 2 annotators, we calculate agreement among pairs of annotators.

κ = (P̄_o − P̄_e) / (1 − P̄_e)

SLIDE 32

Fleiss’ kappa

n_ij = number of annotators who assign category j to item i

P_i = (1 / (n(n − 1))) Σ_{j=1}^{K} n_ij(n_ij − 1)

For item i with n annotations, how many annotators agree, among all n(n − 1) possible pairs.

SLIDE 33

Fleiss’ kappa

P_i = (1 / (n(n − 1))) Σ_{j=1}^{K} n_ij(n_ij − 1)

For item i with n annotations, how many annotators agree, among all n(n − 1) possible pairs.

Annotator:  A  B  C  D
Label:      +  +  +  −

n_i,+ = 3, n_i,− = 1

Agreeing pairs of annotators: A-B, B-A, A-C, C-A, B-C, C-B

P_i = (1 / (4(3))) × (3(2) + 1(0)) = 6/12 = 0.5

SLIDE 34

Fleiss’ kappa

p_j = (1 / (Nn)) Σ_{i=1}^{N} n_ij   (probability of category j)

P̄_e = Σ_{j=1}^{K} p_j²   (expected agreement by chance: the joint probability that two raters pick the same label is the product of their independent probabilities of picking that label)

P̄_o = (1 / N) Σ_{i=1}^{N} P_i   (average agreement among all items)
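Putting the three definitions together, here is a short illustrative implementation (my sketch, not lecture code), which takes an N × K matrix of counts n_ij:

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of annotators assigning category j to item i;
    every item is assumed to have the same number of annotations n."""
    counts = np.asarray(counts, dtype=float)
    N, K = counts.shape
    n = counts[0].sum()

    # Per-item agreement: P_i = sum_j n_ij (n_ij - 1) / (n (n - 1))
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_o = P_i.mean()                     # average agreement over items

    # Chance agreement from the category probabilities p_j
    p_j = counts.sum(axis=0) / (N * n)
    P_e = (p_j ** 2).sum()
    return (P_o - P_e) / (1 - P_e)

# Made-up 3-item, 4-annotator example; the first row is the slide's
# item, where annotators A-D chose +, +, +, - (n_+ = 3, n_- = 1).
print(fleiss_kappa([[3, 1], [4, 0], [2, 2]]))
```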

SLIDE 35

Annotator bias correction

  • Dawid, A. P. and Skene, A. M. (1979), "Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm," Journal of the Royal Statistical Society, 28(1):20–28.
  • Wiebe et al. (1999), "Development and use of a gold-standard data set for subjectivity classifications," ACL (for sentiment).
  • Carpenter (2010), "Multilevel Bayesian Models of Categorical Data Annotation."
  • Snow, O'Connor, Jurafsky, and Ng (2008), "Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks," EMNLP.
  • Sheng et al. (2008), "Get another label? Improving data quality and data mining using multiple, noisy labelers," KDD.
  • Raykar et al. (2009), "Supervised learning from multiple experts: whom to trust when everyone lies a bit," ICML.
  • Hovy et al. (2013), "Learning Whom to Trust with MACE," NAACL.
SLIDE 36

Annotator bias correction

positive negative mixed unknown positive 0.95 0.03 0.02 negative 0.80 0.10 0.10 mixed 0.20 0.05 0.50 0.25 unknown 0.15 0.10 0.10 0.70

truth annotator label confusion matrix for a single annotator (David) P(label | truth)

SLIDE 37

Annotator bias correction (Dawid and Skene 1979)

[Figure: observed labels linked to the unobserved truth through each annotator’s confusion matrix P(label | truth)]

Basic idea: the true label is unobserved; what we observe are noisy judgments by annotators.
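A compact sketch of the EM recipe behind Dawid and Skene (1979): the E-step infers a posterior over each item's unobserved true label, and the M-step re-estimates the class prior and each annotator's confusion matrix. This is an illustrative reimplementation, not the authors' code, and the data at the end is invented:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels[i][a] = label annotator a gave item i (or -1 if unseen)."""
    labels = np.asarray(labels)
    N, A = labels.shape

    # Initialize the posterior over true labels with a majority vote.
    T = np.zeros((N, n_classes))
    for i in range(N):
        for a in range(A):
            if labels[i, a] >= 0:
                T[i, labels[i, a]] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator,
        # conf[a, t, l] = P(annotator a says l | truth is t).
        prior = T.mean(axis=0)
        conf = np.full((A, n_classes, n_classes), 1e-6)   # smoothing
        for i in range(N):
            for a in range(A):
                if labels[i, a] >= 0:
                    conf[a, :, labels[i, a]] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's (unobserved) true label.
        logT = np.tile(np.log(prior), (N, 1))
        for i in range(N):
            for a in range(A):
                if labels[i, a] >= 0:
                    logT[i] += np.log(conf[a, :, labels[i, a]])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T, conf

# Invented data: three annotators label four items with classes 0/1;
# the third annotator disagrees with the other two half the time.
obs = [[0, 0, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]]
posteriors, confusion = dawid_skene(obs, n_classes=2)
print(posteriors.round(2))
```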

SLIDE 38

Evaluation

  • A critical part of developing new algorithms and methods and demonstrating that they work

SLIDE 39

Classification

A mapping h from input data x (drawn from instance space 𝓨) to a label (or labels) y from some enumerable output space 𝒵

𝓨 = set of all documents
𝒵 = {english, mandarin, greek, …}

x = a single document
y = ancient greek

SLIDE 40

[Figure: the instance space 𝓨 partitioned into train, dev, and test sets]

SLIDE 41

Experiment design

            training          development       testing
size        80%               10%               10%
purpose     training models   model selection   evaluation; never look at it until the very end

SLIDE 42

Metrics

  • Evaluations presuppose that you have some metric to evaluate the fitness of a model.

  • Language model: perplexity
  • POS tagging/NER: accuracy, precision, recall, F1
  • Phrase-structure parsing: PARSEVAL (bracketing overlap)
  • Dependency parsing: Labeled/unlabeled attachment score
  • Machine translation: BLEU, METEOR
  • Summarization: ROUGE
SLIDE 43

Multiclass confusion matrix

                       Predicted (ŷ)
                   POS    NEG    NEUT
True (y)   POS     100      2     15
           NEG       –    104     30
           NEUT     30     40     70

SLIDE 44

Accuracy

                       Predicted (ŷ)
                   POS    NEG    NEUT
True (y)   POS     100      2     15
           NEG       –    104     30
           NEUT     30     40     70

Accuracy = (1 / N) Σ_{i=1}^{N} I[ŷ_i = y_i]

I[x] = 1 if x is true, 0 otherwise
SLIDE 45

Precision

Precision: proportion of the predicted class that are actually that class.

                       Predicted (ŷ)
                   POS    NEG    NEUT
True (y)   POS     100      2     15
           NEG       –    104     30
           NEUT     30     40     70

Precision(POS) = Σ_{i=1}^{N} I(y_i = ŷ_i = POS) / Σ_{i=1}^{N} I(ŷ_i = POS)

SLIDE 46

Recall

Recall: proportion of the true class that are predicted to be that class.

                       Predicted (ŷ)
                   POS    NEG    NEUT
True (y)   POS     100      2     15
           NEG       –    104     30
           NEUT     30     40     70

Recall(POS) = Σ_{i=1}^{N} I(y_i = ŷ_i = POS) / Σ_{i=1}^{N} I(y_i = POS)
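All three metrics read directly off the confusion matrix; a sketch in numpy, treating the blank cell of the table as 0:

```python
import numpy as np

# Confusion matrix from the slides: rows = true (y), columns = predicted (ŷ),
# in the order POS, NEG, NEUT; the blank cell is treated as 0 here.
C = np.array([[100,   2,  15],
              [  0, 104,  30],
              [ 30,  40,  70]])

accuracy = np.trace(C) / C.sum()          # correct / all predictions
precision = np.diag(C) / C.sum(axis=0)    # per class: correct / predicted as class
recall = np.diag(C) / C.sum(axis=1)       # per class: correct / truly in class

print(accuracy)
for k, name in enumerate(["POS", "NEG", "NEUT"]):
    print(name, precision[k], recall[k])
```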

SLIDE 47

Significance

  • If we observe a difference in performance, what’s the cause? Is it because one system is better than another, or is it a function of randomness in the data? If we had tested it on other data, would we get the same result?

Your work: 58%
Current state of the art: 50%

SLIDE 48

Hypothesis testing

  • Hypothesis testing measures our confidence in what we can say about a null hypothesis from a sample.

SLIDE 49

Hypothesis testing

  • Current state of the art = 50%; your model = 58%. Both evaluated on the same test set of 1000 data points.
  • Null hypothesis: there is no difference, so we would expect your model to get 500 of the 1000 data points right.
  • If we make parametric assumptions, we can model this with a Binomial distribution (number of successes in n trials).

SLIDE 50

Example

[Figure: Binomial probability distribution for the number of correct predictions in n = 1000 trials with p = 0.5]

SLIDE 51

Example

[Figure: the same Binomial distribution, with sample statistics 510 and 580 marked]

At what point is a sample statistic unusual enough to reject the null hypothesis?

SLIDE 52

Example

  • The form we assume for the null hypothesis lets us quantify that level of surprise.
  • We can do this for many parametric forms that allow us to measure P(X ≤ x) for some sample of size n; for large n, we can often make a normal approximation.

SLIDE 53

Tests

  • Decide on the level of significance α (e.g., 0.05 or 0.01).
  • Testing is evaluating whether the sample statistic falls in the rejection region defined by α.
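For the running example, the test can be carried out directly, assuming scipy is available:

```python
from scipy.stats import binomtest

# Your model got 580 of 1000 test items right; under the null hypothesis
# (no difference from the 50% state of the art) we would expect 500.
print(binomtest(k=580, n=1000, p=0.5, alternative='greater').pvalue)
# Tiny p-value: 580 falls deep in the rejection region for α = 0.05.

# 510 of 1000, by contrast, is unsurprising under the null.
print(binomtest(k=510, n=1000, p=0.5, alternative='greater').pvalue)
```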

SLIDE 54

1. “jobs” is predictive of presidential approval rating
2. “job” is predictive of presidential approval rating
3. “war” is predictive of presidential approval rating
4. “car” is predictive of presidential approval rating
5. “the” is predictive of presidential approval rating
6. “star” is predictive of presidential approval rating
7. “book” is predictive of presidential approval rating
8. “still” is predictive of presidential approval rating
9. “glass” is predictive of presidential approval rating
…
1,000. “bottle” is predictive of presidential approval rating

SLIDE 55

Errors

  • For any significance level α and n hypothesis tests, we can expect α × n type I errors.
  • α = 0.01, n = 1000 → 10 “significant” results simply by chance

SLIDE 56

Multiple hypothesis corrections

  • Bonferroni correction: for family-wise significance level α0 with n hypothesis tests, test each hypothesis at α ← α0 / n. [Very strict; controls the probability of at least one type I error.]
  • False discovery rate
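A tiny sketch of the correction in code (the words and p-values are invented for illustration):

```python
# Bonferroni: test each of n hypotheses at level α0 / n.
alpha0 = 0.05
n_tests = 1000
alpha = alpha0 / n_tests              # 5e-05 per test

# Invented p-values for a few of the word predictors above.
p_values = {"jobs": 1e-6, "war": 3e-4, "glass": 0.02}
significant = {w: p for w, p in p_values.items() if p < alpha}
print(significant)                    # only "jobs" survives the correction
```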

SLIDE 57

Nonparametric tests

  • Many hypothesis tests rely on parametric assumptions (e.g., normality)
  • Alternatives that don’t rely on those assumptions:
  • the permutation test
  • the bootstrap (sketched below)
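Here is the promised sketch of the bootstrap applied to the system comparison from earlier: resample the test items with replacement many times and see how often the performance gap disappears. The per-item correctness vectors are simulated, and this is an illustration of the idea rather than a canonical recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-item correctness (1 = right) for two systems on the
# same hypothetical 1000-item test set (invented data for illustration).
yours = rng.random(1000) < 0.58
theirs = rng.random(1000) < 0.50
print(yours.mean() - theirs.mean())        # observed gap, roughly 0.08

# Bootstrap: resample test items with replacement and recompute the gap.
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, 1000, size=1000)
    diffs.append(yours[idx].mean() - theirs[idx].mean())
diffs = np.array(diffs)

# How often does the resampled gap vanish or reverse? Rarely, so the
# observed difference is unlikely to be an artifact of this test set.
print((diffs <= 0).mean())
```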
SLIDE 58

Issues

  • Evaluation performance may not hold across domains (e.g., WSJ → literary texts)
  • Covariates may explain performance (MT/parsing, sentences up to length n)
  • Multiple metrics may offer competing results

Søgaard et al. 2014

SLIDE 59

Task                       In-domain          Out-of-domain
English POS                WSJ: 97.0          Shakespeare: 81.9
German POS                 Modern: 97.0       Early Modern: 69.6
English POS                WSJ: 97.3          Middle English: 56.2
Italian POS                News: 97.0         Dante: 75.0
English POS                WSJ: 97.3          Twitter: 73.7
English NER                CoNLL: 89.0        Twitter: 41.0
Phrase structure parsing   WSJ: 89.5          GENIA: 79.3
Dependency parsing         WSJ: 88.2          Patent: 79.6
Dependency parsing         WSJ: 86.9          Magazines: 77.1

SLIDE 60

Ethics

Why does a discussion about ethics need to be a part of NLP?

SLIDE 61

Conversational Agents

SLIDE 62

Question Answering

http://searchengineland.com/according-google-barack-obama-king-united-states-209733

SLIDE 63

Language Modeling

SLIDE 64

Vector semantics

SLIDE 65
  • The decisions we make about our methods (training data, algorithm, evaluation) are often tied up with their use and impact in the world.

SLIDE 66

Scope

I saw the man with the telescope

[Figure: dependency parse of the sentence, with labels nsubj, dobj, det, prep, pobj]

  • NLP often operates on text divorced from the context in which it is uttered.
  • It’s now being used more and more to reason about human behavior.

SLIDE 67

Privacy

SLIDE 68

SLIDE 69

SLIDE 70

Interventions

SLIDE 71

SLIDE 72

Exclusion

  • Focus on data from one domain/demographic
  • State-of-the-art models perform worse for young (Hovy and Søgaard 2015) and minorities (Blodgett et al. 2016)

SLIDE 73

Exclusion

Language identification; dependency parsing

Blodgett et al. (2016), "Demographic Dialectal Variation in Social Media: A Case Study of African-American English" (EMNLP)

SLIDE 74

Overgeneralization

  • Managing and communicating the uncertainty of our predictions
  • Is a false answer worse than no answer?
SLIDE 75

Dual Use

  • Authorship attribution (author of the Federalist Papers vs. author of a ransom note vs. author of political dissent)
  • Fake review detection vs. fake review generation
  • Censorship evasion vs. enabling more robust censorship

SLIDE 76

Homework 2

  • Derive the updates for a CNN and implement the functions for the forward/backward pass
  • Out today, due Sept 20
  • Be sure to check Piazza for any updates
  • Start early! Another short homework will come out next Thursday.