Evaluation measures in NLP
Zdeněk Žabokrtský
October 30, 2020
NPFL070 Language Data Resources
Charles University Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise stated
Outline
- Evaluation goals and basic principles
- Selected good practices in experiment evaluation
- Selected task-specific measures
- Big picture
- Final remarks
system.
NLP problem (don’t underestimate this!).
reasons why we need language data resources).
percentage of correctly predicted answers
quality of a system, based on a number of criteria
impossible (e.g., when inter-annotator agreement is insufficient)
accuracy = (correctly predicted instances / all instances) × 100 %
data
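The accuracy formula can be sketched in a few lines of Python; the POS-tag labels below are just toy data for illustration:

```python
def accuracy(gold, predicted):
    """Accuracy in percent: correctly predicted instances / all instances."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

gold = ["NOUN", "VERB", "DET", "NOUN"]
pred = ["NOUN", "VERB", "NOUN", "NOUN"]
print(accuracy(gold, pred))  # 75.0
```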
evaluation phases
evaluations (the more evaluations, the worse)
ultimate evaluation
hyperparameter values)
have huge impact on the measured quantities
number …
this interval is a reasonable result
given by
performance of which is supposed to be easily surpassed)
annotators
methods
argmax(P(tag | word))
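A most-frequent-tag baseline (pick, for each word, the tag that maximizes P(tag | word) estimated from training counts) can be sketched as follows; the tiny tagged corpus is hypothetical:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Count (word, tag) pairs; the baseline tag is the argmax over P(tag | word)."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
        all_tags[tag] += 1
    # for unseen words, fall back to the most frequent tag overall
    fallback = all_tags.most_common(1)[0][0]
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, fallback

def tag(words, model, fallback):
    return [model.get(w, fallback) for w in words]

corpus = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("run", "NOUN"), ("run", "VERB"), ("run", "NOUN")]
model, fallback = train_baseline(corpus)
print(tag(["the", "run", "cat"], model, fallback))  # ['DET', 'NOUN', 'NOUN']
```

Despite its simplicity, this baseline is known to be hard to beat by a large margin in POS tagging, which is exactly why it belongs in an evaluation.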
viewpoint of our evaluated component)
imply that even oracle’s performance is less than 100 %
tags for the given word
delivered by human experts (annotators)
are done … No, this is too naive!
stable in decision making.
detailed specification of) annotation instructions.
instructions.
issues etc., like with any other work.
(to measure the reliability of manual annotations)
real decision making).
Example:
desired measure: 1 if they agree in all decisions, 0 if their agreement is equal to agreement by chance
κ = (P_a − P_e) / (1 − P_e)
P_a = observed (actual) agreement
P_e = expected agreement by chance (the baseline)
0.80–0.90: strong agreement …
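Cohen's κ = (P_a − P_e)/(1 − P_e) for two annotators can be sketched as below; the chance agreement P_e assumes each annotator picks labels independently according to their own label distribution (the yes/no annotations are made-up toy data):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """kappa = (P_a - P_e) / (1 - P_e): observed vs. chance agreement."""
    n = len(ann1)
    p_a = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # chance agreement: both annotators independently pick the same label
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(ann1) | set(ann2))
    return (p_a - p_e) / (1 - p_e)

a1 = ["yes", "yes", "no", "yes", "no", "no"]
a2 = ["yes", "no", "no", "yes", "no", "yes"]
print(round(cohens_kappa(a1, a2), 3))  # 0.333
```

Here the annotators agree on 4 of 6 items (P_a = 2/3), but with balanced yes/no distributions half of that agreement is expected by chance (P_e = 1/2), so κ is only 1/3.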
if it has made 3 errors in 7 task instances (you would indicate more certainty about the result than is justified).
result uncertainty.
Actually very simple:
- For multiplication/division, the result should keep the least number of significant digits in any one of the numbers being multiplied/divided.
- For addition/subtraction, the number of decimal places in the answer should be the same as the least number of decimal places in any of the numbers being added or subtracted.
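A quick illustration of the point about over-precise reporting (the 4-correct-of-7 numbers are toy values):

```python
# Reporting too many digits overstates certainty: 4 correct answers out of
# 7 instances does not justify claiming "57.142857 %" accuracy.
correct, total = 4, 7
acc = 100 * correct / total
print(acc)             # 57.142857142857146 -- spuriously precise
print(f"{acc:.0f} %")  # 57 % -- a more defensible number of digits here
```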
accuracy = (correctly predicted instances / all instances) × 100 %
Easy if
But what if not?
precision = (correct answers given / all answers given) × 100 %
recall = (correct answers given / all possible correct answers) × 100 %
the other one can be computed as 1 − X)
F_β = ((1 + β²) · precision · recall) / (β² · precision + recall)
F_1 = (2 · precision · recall) / (precision + recall)
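The precision/recall/F1 definitions can be sketched for a task where system output and gold standard are sets of answers (the NER-style span labels below are invented toy data):

```python
def precision_recall_f1(gold, predicted):
    """precision = correct/|predicted|, recall = correct/|gold|,
    F1 = harmonic mean of the two."""
    correct = len(gold & predicted)
    p = correct / len(predicted)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {"(1,2,PER)", "(5,6,LOC)", "(9,10,ORG)", "(12,13,PER)"}
pred = {"(1,2,PER)", "(5,6,LOC)", "(7,8,ORG)"}
p, r, f1 = precision_recall_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Note the trade-off: a system that outputs only its single most confident answer gets high precision but low recall, and vice versa; F1 punishes both extremes.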
WER = (S + D + I) / (S + D + C)
(S = substitutions, D = deletions, I = insertions, C = correct words; the denominator is the number of true words in the reference)
we do with that?
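A WER computation via word-level Levenshtein alignment can be sketched as below; this simplified version returns the total edit distance over the reference length without splitting it into S/D/I counts, and the sentences are toy data:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# 1 substitution (sat -> sit) + 1 deletion (the) over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.3333333333333333
```

Because insertions appear only in the numerator, a hypothesis much longer than the reference can push WER above 100 %.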
BLEU = BP · exp((1/4) · Σ_{n=1..4} log p_n)
BP = min(1, exp(1 − r/c))
BP = brevity penalty (multiplicative!)
p_n = n-gram precision (ref-clipped counts)
r = total length (#words) of the reference
c = total length of the candidate translation
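A minimal sentence-level BLEU sketch with reference-clipped n-gram counts and the brevity penalty; real implementations (e.g. sacreBLEU) add smoothing and aggregate counts over a whole corpus, so treat this only as an illustration of the formula:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngr, ref_ngr = ngrams(cand, n), ngrams(ref, n)
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_ngr[g]) for g, c in cand_ngr.items())
        total = max(1, len(cand) - n + 1)
        if clipped == 0:
            return 0.0  # a zero precision zeroes BLEU (no smoothing here)
        log_p_sum += math.log(clipped / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_p_sum / max_n)

ref = "the cat is on the mat"
print(bleu(ref, ref))  # 1.0 -- identical sentences get a perfect score
```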
between two or more criteria pushing in different directions
and shiny method developed for a given NLP task, using a given dataset and a given evaluation measure:
[Figure: a 0 %–100 % scale marking baseline 1 (a naive one), baseline 2 (less naive), baseline 3 (a sophisticated one), the current state of the art, and inter-annotator agreement]
What if I am here?
evaluation script.
What if I am here?
and/or its hyper-parameters are not set properly, and/or the performance is spoiled by underfitting or overfitting. Revise the code to avoid bugs and apply standard ML diagnostics (e.g. examine learning curves). Before beating the baselines, it's unfortunately not worth publishing.
What if I am here?
quality measures in most NLP tasks nowadays. Didn’t you e.g. mix training and evaluation data by mistake?
What if I am here?
for publishing it! But first double-check that the whole experimental setup is correct and fair (again, didn't some pieces of information from the evaluation portion penetrate the training data? Didn't you use the evaluation data too many times?)
What if I am here?
the baselines, which is nice, but you did not beat the current champion, which is … usual. Depending on details, your approach might still be worth using by others, e.g. because
representative hand-annotated data is virtually impossible (e.g. a path through a man-machine dialogue, if more dialogue turns are considered)
dependency parsing seems more important than attaching a punctuation mark); if weighting is needed, then we always risk arbitrariness.
already above that of average annotators (i.e., even hand-annotated gold data might not be gold enough)
various assumptions (even using plain percentage in the simplest cases implies that we assume all errors are equally severe, which is unrealistic)
considerably across datasets/genres/languages…
a final value of the measured quantity – you should always anchor it in a bigger picture!