Evaluation measures in NLP
Zdeněk Žabokrtský
8th April 2020
NPFL124 Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Outline: Evaluation goals and basic principles; Selected good practices in experiment evaluation; Selected task-specific measures; Final remarks
system.
NLP problem (don’t underestimate this!).
reasons why we need language data resources).
percentage of correctly predicted answers
quality of a system, based on a number of criteria
impossible (e.g., when inter-annotator agreement is insufficient)
accuracy = \frac{\text{correctly predicted instances}}{\text{all instances}} \times 100\%
data
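The accuracy formula can be sketched in a few lines of Python. This is only an illustration: the function name `accuracy` and the toy part-of-speech labels are assumptions, not taken from the slides.

```python
def accuracy(gold, predicted):
    """Percentage of instances whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Toy data (hypothetical): one of four tags is wrong.
gold = ["NOUN", "VERB", "DET", "NOUN"]
pred = ["NOUN", "VERB", "DET", "ADJ"]
print(accuracy(gold, pred))  # 75.0
```

The sketch assumes exactly one gold label and exactly one predicted label per instance, which is the situation in which plain accuracy is well defined.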
evaluation phases
evaluations (the more evaluations, the worse)
ultimate evaluation
hyperparameter values)
have huge impact on the measured quantities
number …
this interval is a reasonable result
given by
performance of which is supposed to be easily surpassed)
annotators
methods
\arg\max P(\text{tag} \mid \text{word})
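A most-frequent-tag baseline of this kind might look as follows. This is a minimal sketch under stated assumptions: the names `train_most_frequent_tag` and `predict`, the fallback tag for unseen words, and the toy training pairs are all hypothetical. Picking the tag with the highest count for a word is equivalent to maximising the relative-frequency estimate of P(tag | word).

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_words):
    """Map each word to the tag it was seen with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def predict(model, words, fallback_tag):
    """Tag each word independently; unseen words get the fallback tag."""
    return [model.get(word, fallback_tag) for word in words]


# Toy training data (hypothetical).
train = [("the", "DET"), ("can", "AUX"), ("can", "AUX"),
         ("can", "NOUN"), ("fish", "NOUN")]
model = train_most_frequent_tag(train)
print(predict(model, ["the", "can", "swim"], fallback_tag="NOUN"))
# ['DET', 'AUX', 'NOUN']
```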
viewpoint of our evaluated component)
imply that even oracle’s performance is less than 100 %
tags for the given word
delivered by human experts (annotators)
are done … No, this is too naive!
stable in decision making.
detailed specification of) annotation instructions.
instructions.
issues etc., like with any other work.
(to measure the reliability of manual annotations)
real decision making).
Example:
desired measure: 1 if they agree in all decisions, 0 if their agreement is equal to agreement by chance
\kappa = \frac{P_a - P_e}{1 - P_e}
agreement)
baseline
0.80-0.90 strong agreements …
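For two annotators, the kappa statistic can be sketched as below. The slide fragments do not say how the chance agreement P_e is estimated; this sketch assumes Cohen's variant, where P_e is computed from each annotator's own label distribution. The function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """kappa = (P_a - P_e) / (1 - P_e) for two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_a = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement (assumption: Cohen's estimate from the marginals).
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_a - p_e) / (1 - p_e)


# Toy annotations (hypothetical): the annotators agree on 3 of 4 items.
ann1 = ["A", "A", "B", "B"]
ann2 = ["A", "B", "B", "B"]
print(cohen_kappa(ann1, ann2))  # 0.5
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5: noticeably lower than the raw agreement suggests.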
if it has made 3 errors in 7 task instances (you indicate more certainty about the result than justified).
result uncertainty.
Actually very simple:
least number of significant digits in any one of the numbers being multiplied/divided.
answer should be the same as the least number of decimal places in any of the numbers being added or subtracted.
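Rounding a reported score to a chosen number of significant digits can be sketched as follows. The helper name `round_sig` is hypothetical, and the sketch only performs the rounding; deciding how many digits are justified is left to the rules above.

```python
from math import floor, log10


def round_sig(x, digits):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit decides how many decimals to keep.
    return round(x, digits - 1 - floor(log10(abs(x))))


# 4 correct answers out of 7 give 57.142857...%, which suggests far more
# precision than such a small test set supports.
print(round_sig(100 * 4 / 7, 2))  # 57.0
print(round_sig(0.0123456, 3))    # 0.0123
```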
accuracy = \frac{\text{correctly predicted instances}}{\text{all instances}} \times 100\%
Easy if
But what if not?
precision = \frac{\text{correct answers given}}{\text{all answers given}} \times 100\%
recall = \frac{\text{correct answers given}}{\text{all possible correct answers}} \times 100\%
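Precision and recall can be sketched by treating the system's answers and the gold answers as sets. The function name `precision_recall`, the toy city names, and the convention of returning 0 when a denominator is empty are assumptions made for this illustration.

```python
def precision_recall(given, gold):
    """Return (precision, recall) in percent for a set of given answers."""
    given, gold = set(given), set(gold)
    correct = len(given & gold)  # correct answers given
    # Assumption: report 0 when nothing was given or nothing was expected.
    precision = 100.0 * correct / len(given) if given else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example (hypothetical): 3 answers given, 2 of them correct,
# out of 4 possible correct answers.
gold_answers = {"Prague", "Brno", "Ostrava", "Plzen"}
given_answers = {"Prague", "Brno", "Vienna"}
p, r = precision_recall(given_answers, gold_answers)
print(round(p, 1), round(r, 1))  # 66.7 50.0
```

Unlike accuracy, the two denominators differ: precision divides by what the system chose to output, recall by what it should have found.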
the other one can be computed 1-X)
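F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}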
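The F-measure is a weighted harmonic mean of precision and recall, and F_1 is the special case with beta = 1. A minimal sketch follows; the function name `f_beta` and the convention of returning 0 when both inputs are 0 are assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (both on the same scale)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumption: define F as 0 when both inputs are 0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_beta(0.5, 0.5))                      # 0.5
print(round(f_beta(0.5, 1.0), 4))            # 0.6667
print(round(f_beta(0.5, 1.0, beta=2.0), 4))  # 0.8333
```

With beta greater than 1 the score moves toward recall, and with beta smaller than 1 toward precision, as the last two lines illustrate.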
true words
WER = \frac{S + D + I}{S + D + C} \quad (1)
we do with that?
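Word error rate can be sketched with a word-level edit distance. The minimum number of substitutions, deletions, and insertions gives S + D + I, and the denominator S + D + C equals the number of reference words. The function name `wer` and whitespace tokenisation are assumptions of this sketch.

```python
def wer(reference, hypothesis):
    """Word error rate: minimal (S + D + I) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion of r
                            curr[j - 1] + 1,           # insertion of h
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(wer("a b c", "a b c"))  # 0.0
print(wer("a", "a b c"))      # 2.0
```

The second call shows that WER is not bounded by 1: a hypothesis with many insertions can contain more errors than the reference has words.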
BLEU = BP \cdot \exp\left(\frac{1}{4} \sum_{n=1}^{4} \log p_n\right)
BP = \min\left(1, \exp\left(1 - \frac{r}{c}\right)\right)
BP = brevity penalty (multiplicative!)
p_n = n-gram precision (ref-clipped counts)
r = total length (#words) of the reference
c = total length of the candidate translation
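The BLEU formula can be sketched at corpus level as below. This is a simplified illustration, not a replacement for an established implementation: it assumes exactly one reference per candidate, pre-tokenised input, and no smoothing, so the score is 0 whenever some n-gram order has no match. The names `ngrams` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate (assumption)."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # log p_n is undefined; no smoothing in this sketch
        log_sum += math.log(matched / total)
    c = sum(len(cand) for cand in candidates)  # total candidate length
    r = sum(len(ref) for ref in references)    # total reference length
    bp = min(1.0, math.exp(1 - r / c))         # brevity penalty
    return bp * math.exp(log_sum / max_n)


ref = "the cat sat on the mat".split()
short = "the cat sat on".split()
print(bleu([ref], [ref]))              # 1.0
print(round(bleu([short], [ref]), 4))  # 0.6065
```

In the second call every n-gram precision is 1, so the whole loss comes from the brevity penalty exp(1 - 6/4): precision alone would not punish a translation that is simply too short.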
between two or more criteria pushing in different directions
representative hand-annotated data is virtually impossible (e.g. a path through a man-machine dialogue, if more dialogue turns are considered)
dependency parsing seems more important than attaching a punctuation mark); if weighting is needed, then we always risk arbitrariness.
already above that of average annotators (i.e., even hand-annotated gold data might not be gold enough)