Natural Language Processing
Info 159/259 Lecture 5: Truth and ethics (Sept 6, 2018) David Bamman, UC Berkeley
Hwæt! Wé Gárdena in géardagum, þéodcyninga þrym gefrúnon, hú ðá æþelingas ellen fremedon. ("Listen! We have heard of the glory of the Spear-Danes' kings in days of yore, how those princes performed deeds of valor.")
A convolutional layer over the token sequence x1 … x7 = "I hated it I really hated it" (window size 3, stride 2):

h1 = σ(x1W1 + x2W2 + x3W3) = f(I, hated, it)
h2 = σ(x3W1 + x4W2 + x5W3) = f(it, I, really)
h3 = σ(x5W1 + x6W2 + x7W3) = f(really, hated, it)
[Figure: a filter with weights W1, W2, W3 is convolved over the input x1 … x7; max pooling then takes the maximum of the resulting values. This defines one filter.]
Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”
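As a sketch of the convolution-plus-max-pooling filter described above (the embedding size, random weights, and stride-1 windows here are illustrative choices, not values from the slides):

```python
import numpy as np

np.random.seed(0)

# Toy embeddings for 7 tokens ("I hated it I really hated it"), dimension 4.
x = np.random.randn(7, 4)

# One convolutional filter spanning windows of 3 consecutive tokens.
W = np.random.randn(3, 4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Convolution: slide the filter over every window of 3 consecutive tokens.
h = np.array([sigmoid(np.sum(x[i:i + 3] * W)) for i in range(7 - 3 + 1)])

# Max pooling: one feature for the whole sentence, per filter.
feature = h.max()
print(h.shape, feature)
```

In practice a model uses many such filters (of several widths), and the pooled features feed a classifier; Zhang and Wallace (2016) survey those design choices.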
- annotations of WSJ (structure, word sense, coreference)
- Wikipedia
- human judgments
“One morning I shot an elephant in my pajamas” (Animal Crackers)
Fast and Horvitz (2016), “Identifying Dogmatism in Social Media: Signals and Models”
https://www.nytimes.com/2016/08/12/opinion/an-even-stranger-donald-trump.html?ref=opinion
http://www.fakenewschallenge.org
Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning
Can we formalize our description of the annotation process to encourage multiple annotators to provide the same judgment?
- Be specific: provide examples, and discuss gray areas.
- Say what should be annotated, and what should be left alone.
- Explain which tags or documents to annotate first, how to use the annotation tools, etc.
- Annotators can typically only work a few hours a day.
- Early annotations are often not as good as later ones.
- Consistency: is the task well enough defined? If annotators cannot agree, the task is not well enough defined.
- Adjudication: deciding on a single annotation for a piece of text, using information about the independent annotations; this becomes the primary annotation.
- Agreement alone does not guarantee a correct annotation (both annotators can be wrong by chance).
                  annotator A
                  puppy   fried chicken
annotator B
  puppy             6           3
  fried chicken     2           5
https://twitter.com/teenybiscuit/status/705232709220769792/photo/1
How much annotator agreement would we expect simply by chance?

                  annotator A
                  puppy   fried chicken
annotator B
  puppy             7           4
  fried chicken     8          81
Cohen's kappa:

κ = (po − pe) / (1 − pe)

Observed agreement: po = (7 + 81)/100 = 0.88, so κ = (0.88 − pe) / (1 − pe).
Expected chance agreement, assuming independent annotators:

pe = P(A = puppy, B = puppy) + P(A = chicken, B = chicken)
   = P(A = puppy)P(B = puppy) + P(A = chicken)P(B = chicken)
P(A = puppy) = 15/100 = 0.15      P(B = puppy) = 11/100 = 0.11
P(A = chicken) = 85/100 = 0.85    P(B = chicken) = 89/100 = 0.89

pe = 0.15 × 0.11 + 0.85 × 0.89 = 0.773
κ = (po − pe) / (1 − pe) = (0.88 − 0.773) / (1 − 0.773) = 0.471
0.80–1.00  very good agreement
0.60–0.80  good agreement
0.40–0.60  moderate agreement
0.20–0.40  fair agreement
< 0.20     poor agreement
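The worked example above can be checked with a short function (a sketch; following the slides, rows index annotator B's labels and columns annotator A's):

```python
def cohens_kappa(table):
    """table[i][j] = number of items annotator B labeled i and annotator A labeled j."""
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal.
    p_o = sum(table[i][i] for i in range(len(table))) / total
    # Expected chance agreement: product of the two annotators' marginals.
    p_e = sum(
        (sum(row[j] for row in table) / total) * (sum(table[j]) / total)
        for j in range(len(table))
    )
    return (p_o - p_e) / (1 - p_e)

# The puppy / fried chicken table from the slides.
print(round(cohens_kappa([[7, 4], [8, 81]]), 3))  # → 0.471
```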
                  annotator A
                  puppy   fried chicken
annotator B
  puppy             0           0
  fried chicken     0         100

                  annotator A
                  puppy   fried chicken
annotator B
  puppy            50           0
  fried chicken     0          50
Fleiss' kappa generalizes this to K classes, N items, and many annotators, each of whom may evaluate different items (e.g., crowdsourcing), again measuring the observed agreement compared to the agreement we would expect by chance.
Agreement among pairs of annotators:

κ = (P̄o − P̄e) / (1 − P̄e)

nij = number of annotators who assign category j to item i

Pi = 1/(n(n − 1)) Σ_{j=1}^{K} nij(nij − 1)

For item i with n annotations, Pi is the fraction of all n(n − 1) ordered pairs of annotators that agree.
Example: annotators A, B, C, D label one item +, +, +, −:

n_{i,+} = 3    n_{i,−} = 1

Agreeing pairs: A-B, B-A, A-C, C-A, B-C, C-B

Pi = 1/(4(3)) (3(2) + 1(0)) = 6/12 = 0.5
P̄o = (1/N) Σ_{i=1}^{N} Pi — average agreement among all items

pj = 1/(Nn) Σ_{i=1}^{N} nij — probability of category j

P̄e = Σ_{j=1}^{K} pj² — expected agreement by chance: the joint probability that two raters pick the same label is the product of their independent probabilities of picking that label
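The formulas above can be sketched in a few lines, under the simplifying assumption that every item receives the same number of annotations n (the toy counts below are invented for illustration):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators assigning category j to item i.
    Assumes every item has the same number of annotations n."""
    N = len(counts)        # number of items
    n = sum(counts[0])     # annotations per item
    K = len(counts[0])     # number of categories

    # P_i: fraction of the n(n-1) ordered annotator pairs that agree on item i.
    P = [sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts]
    P_o = sum(P) / N

    # p_j: overall probability of category j; chance agreement is the sum of squares.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(K)]
    P_e = sum(pj ** 2 for pj in p)

    return (P_o - P_e) / (1 - P_e)

# Three items, four annotators each, two categories (+, −).
print(round(fleiss_kappa([[3, 1], [4, 0], [0, 4]]), 3))
```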
Dawid, A. P. and Skene, A. M. (1979), "Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm," Journal of the Royal Statistical Society, 28(1):20–28.
classifications," ACL (for sentiment)
Snow, O'Connor, Jurafsky and Ng (2008), "Cheap and Fast - But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks," EMNLP.
Sheng, Provost and Ipeirotis (2008), "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers," KDD.
Raykar et al. (2009), "Supervised Learning from Multiple Experts: Whom to Trust When Everyone Lies a Bit," ICML.
Annotator bias correction (Dawid and Skene 1979)

Confusion matrix for a single annotator (David), P(label | truth):

                      label
truth        positive  negative  mixed  unknown
positive       0.95      0.03     0.02
negative                 0.80     0.10    0.10
mixed          0.20      0.05     0.50    0.25
unknown        0.15      0.10     0.10    0.70
[Figure: the latent true label generates each annotator's observed label through that annotator's confusion matrix P(label | truth).]
Basic idea: the true label is unobserved; what we observe are noisy judgments by annotators
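A Dawid-and-Skene-style model can be estimated with EM; this is a minimal sketch (the toy data, majority-vote initialization, and smoothing constant are my own choices, and the original paper also handles missing annotations):

```python
import numpy as np

def dawid_skene(labels, K, n_iter=50):
    """labels[i][a] = label that annotator a gave item i (N items, A annotators).
    Returns a posterior over true labels and per-annotator confusion matrices."""
    labels = np.asarray(labels)
    N, A = labels.shape

    # Initialize the posterior over true labels with per-item vote frequencies.
    T = np.zeros((N, K))
    for i in range(N):
        for a in range(A):
            T[i, labels[i, a]] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and annotator confusion matrices P(label | truth).
        prior = np.clip(T.mean(axis=0), 1e-6, None)
        conf = np.zeros((A, K, K)) + 1e-6          # smoothing
        for a in range(A):
            for i in range(N):
                conf[a, :, labels[i, a]] += T[i]
            conf[a] /= conf[a].sum(axis=1, keepdims=True)

        # E-step: posterior over each item's true label given all annotations.
        logT = np.log(prior)[None, :].repeat(N, axis=0)
        for a in range(A):
            for i in range(N):
                logT[i] += np.log(conf[a, :, labels[i, a]])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T, conf

# Toy data: annotators 0 and 1 are reliable; annotator 2 sometimes disagrees.
labels = [[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
T, conf = dawid_skene(labels, K=2)
print(T.argmax(axis=1))  # estimated true labels
```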
Evaluation is a critical part of developing new methods and demonstrating that they work.
A mapping h from input data x (drawn from an instance space 𝓨) to a label (or labels) y from some enumerable output space 𝒵.

𝓨 = set of all documents; x = a single document
𝒵 = {english, mandarin, greek, …}; y = ancient greek
train / dev / test

          training    development   testing
size        80%           10%         10%
purpose   training      model       evaluation; never look at it
          models        selection   until the very end

Evaluation metrics measure the fitness of a model.
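A minimal sketch of the 80/10/10 split above (the function name and fixed shuffle seed are illustrative):

```python
import random

def split(data, seed=42):
    """Shuffle and split data into 80% train / 10% dev / 10% test."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    train = data[: int(0.8 * n)]
    dev = data[int(0.8 * n): int(0.9 * n)]
    test = data[int(0.9 * n):]
    return train, dev, test

train, dev, test = split(range(1000))
print(len(train), len(dev), len(test))  # → 800 100 100
```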
              Predicted (ŷ)
True (y)      POS    NEG    NEUT
POS           100      2      15
NEG                  104      30
NEUT           30     40      70
Accuracy = (1/N) Σ_{i=1}^{N} I[ŷi = yi], where I[x] = 1 if x is true and 0 otherwise.
Precision: the proportion of items predicted to be a class that are actually that class.

Precision(POS) = Σ_{i=1}^{N} I(yi = ŷi = POS) / Σ_{i=1}^{N} I(ŷi = POS)
Recall: the proportion of items truly in a class that are predicted to be that class.

Recall(POS) = Σ_{i=1}^{N} I(yi = ŷi = POS) / Σ_{i=1}^{N} I(yi = POS)
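These metrics fall straight out of the confusion matrix; a sketch with rows as true classes and columns as predictions (reading the blank NEG→POS cell as 0):

```python
# Confusion matrix from the slides: rows = true class, columns = predicted class.
labels = ["POS", "NEG", "NEUT"]
conf = [
    [100,   2,  15],   # true POS
    [  0, 104,  30],   # true NEG (blank cell read as 0)
    [ 30,  40,  70],   # true NEUT
]

def precision(conf, k):
    # Of everything predicted as class k, how much is truly class k?
    return conf[k][k] / sum(conf[i][k] for i in range(len(conf)))

def recall(conf, k):
    # Of everything truly in class k, how much is predicted as class k?
    return conf[k][k] / sum(conf[k])

def accuracy(conf):
    return sum(conf[k][k] for k in range(len(conf))) / sum(map(sum, conf))

print(precision(conf, 0), recall(conf, 0), accuracy(conf))
```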
Your work: 58%.  Current state of the art: 50%.

What's the cause of the difference? Is it because one system is better than another, or due to chance? If we tested it on other data, would we get the same result?

Hypothesis testing tells us what we can say about a null hypothesis from a sample.
Both systems are evaluated on the same test set of 1,000 data points. Under the null hypothesis (accuracy p = 0.5), we would expect your model to get 500 of the 1,000 data points right. We can model this with a Binomial distribution (the number of successes in n trials): the Binomial probability distribution for the number of correct predictions in n = 1000 trials with p = 0.5.
[Figure: Binomial(n = 1000, p = 0.5) probability distribution over the number of correct predictions, roughly 400–600.]
At what point is a sample statistic unusual enough to reject the null hypothesis? 510 correct? 580? A p-value lets us quantify that level of surprise.
The Binomial CDF allows us to measure P(X ≤ x) for a sample of size n; for large n, we can often make a normal approximation. We reject the null hypothesis when the sample statistic falls in the rejection region defined by α.
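An exact upper-tail binomial test for the numbers above can be computed with the standard library (`p_value_upper` is a name introduced here for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def p_value_upper(k, n, p=0.5):
    """P(X >= k) under the null Binomial(n, p): how surprising is k or more correct?"""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

n = 1000
print(p_value_upper(510, n))  # ≈ 0.27: 510/1000 correct is unsurprising
print(p_value_upper(580, n))  # far below α = 0.05: reject the null
```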
1. "jobs" is predictive of presidential approval rating
2. "job" is predictive of presidential approval rating
3. "war" is predictive of presidential approval rating
4. "car" is predictive of presidential approval rating
5. "the" is predictive of presidential approval rating
6. "star" is predictive of presidential approval rating
7. "book" is predictive of presidential approval rating
8. "still" is predictive of presidential approval rating
9. "glass" is predictive of presidential approval rating
…
1,000. "bottle" is predictive of presidential approval rating
If we run n hypothesis tests at significance level α, we can expect α × n type I errors simply by chance.

To achieve a family-wise significance level α0 (the probability of at least one type I error) with n hypothesis tests, apply the Bonferroni correction:

α ← α0 / n
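In code, the correction is just a division (numbers from the slides):

```python
# With n = 1,000 tests at α = 0.05, we expect about α × n = 50 type I
# errors by chance.
n_tests = 1000
alpha_0 = 0.05

expected_false_positives = alpha_0 * n_tests   # ≈ 50

# Bonferroni: test each hypothesis at α0 / n to control the family-wise
# error rate (probability of at least one type I error) at α0.
alpha_corrected = alpha_0 / n_tests            # ≈ 5e-05
print(expected_false_positives, alpha_corrected)
```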
How far do such results generalize? Significance tests make assumptions (e.g., normality), and results may not hold for other domains (e.g., WSJ → literary texts) or other samples (e.g., sentences up to length n).
Søgaard et al. 2014
Task                       In-domain           Out-of-domain
English POS                WSJ 97.0            Shakespeare 81.9
German POS                 Modern 97.0         Early Modern 69.6
English POS                WSJ 97.3            Middle English 56.2
Italian POS                News 97.0           Dante 75.0
English POS                WSJ 97.3            Twitter 73.7
English NER                CoNLL 89.0          Twitter 41.0
Phrase structure parsing   WSJ 89.5            GENIA 79.3
Dependency parsing         WSJ 88.2            Patent 79.6
Dependency parsing         WSJ 86.9            Magazines 77.1
Why does a discussion about ethics need to be a part of NLP?
http://searchengineland.com/according-google-barack-obama-king-united-states-209733
The components of an NLP system (training data, algorithm, evaluation) are often tied up with its use and impact in the world.
"I saw the man with the telescope"

[Figure: two dependency parses (nsubj, dobj, det, prep, pobj), with the prepositional phrase attaching either to "saw" or to "man".]

The intended structure depends on the context in which the sentence is uttered.
NLP systems are trained on, and make predictions about, human behavior. Their performance is often worse for some groups, e.g., younger authors (Hovy and Søgaard 2015) and minorities (Blodgett et al. 2016): both language identification and dependency parsing degrade on African-American English. Blodgett et al. (2016), "Demographic Dialectal Variation in Social Media: A Case Study of African-American English."
Potential misuses: identifying (and suppressing) dissent; censorship.
Homework: implement functions for the forward/backward pass, due next Thursday.