DUC 2006 Pyramid Evaluation
Rebecca J. Passonneau
Center for Computational Learning Systems
Columbia University
June 8, 2006 DUC Workshop 2
Acknowledgments
- Hoa Dang
- Columbia University (Kathy McKeown)
- Guideline contributors, testers (Lucy Vanderwende, Adam Goodkind, Guy LaPalme, ...)
- Pyramid creators (Adam Goodkind, Sergey Sigelman, Lucy Vanderwende, Inderjeet Mani, Qui Long)
- Participants (21 sites)
Pyramid Overview
- Human summarizers select overlapping content
- A pyramid represents and quantifies the overlap of Summary Content Units (SCUs) found in multiple model summaries
- Two pyramid scores based on SCU annotations
  - Original: precision
  - Modified: recall
- Manual annotation reliability assessment
  - Pyramid annotations (LREC 2006)
  - Peer annotations (DUC 2005)
Sample SCU from D0631
[Label: The Concorde crossed the Atlantic in less than 4 hours]
Sum1 < making the transatlantic flight in 3 and ½ hrs >
Sum2 < The Concorde could make the flight in between New York and London or Paris in less than four hours >
Sum3 < completing its journey from London to New York in about 3 hours, 30 minutes >
Sum4 < took less than 4 hrs to cross the Atlantic >
Building a Pyramid from Model Summaries (N=4)
[Figure: pyramid whose tiers are labeled by SCU weight: W=4, W=3, W=2, W=1]
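As a rough illustration of how the tiers arise, the sketch below assumes that an SCU's weight W is the number of model summaries that express it, so with N=4 models the weights run from 1 to 4. The input format (a dict from SCU label to the set of contributing summaries) and the second, placeholder SCU are assumptions made for this example, not part of the DUC annotation files.

```python
def build_pyramid(scu_contributors):
    """Return (weights, tiers) for a set of annotated SCUs.

    scu_contributors: dict mapping an SCU label to the set of model
    summaries that express it (hypothetical input format).
    """
    # Weight of an SCU = number of model summaries contributing to it.
    weights = {label: len(models) for label, models in scu_contributors.items()}
    # Group SCU labels into tiers keyed by weight.
    tiers = {}
    for label, w in weights.items():
        tiers.setdefault(w, []).append(label)
    return weights, tiers


concorde = "The Concorde crossed the Atlantic in less than 4 hours"
annotations = {
    # The sample SCU from D0631 appears in all four model summaries.
    concorde: {"Sum1", "Sum2", "Sum3", "Sum4"},
    # Invented placeholder, only here to populate a lower tier.
    "placeholder SCU (hypothetical)": {"Sum2"},
}

weights, tiers = build_pyramid(annotations)
for w in sorted(tiers, reverse=True):
    print(f"W={w}: {tiers[w]}")
```

SCUs found in every model summary land in the top tier, while content mentioned by only one summarizer falls to the bottom tier.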
2006 Pyramid effort
- New version of DUCView, annotation guidelines
- Pyramids for 20 of the document sets
  - High clarity ratings
  - Even distribution of assessors (summary writers)
- Pyramid annotation
  - 6 individuals at 3 sites, 2 with prior experience
- Peer annotation: 21 peers plus the baseline
  - New procedure: “peer” review
- Only modified pyramid score (normalized to average # SCUs per model for each pyramid)
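The normalization mentioned above can be sketched as follows. This is a minimal illustration, not the DUCView implementation: it assumes the score is the total weight of the SCUs matched in a peer summary divided by the best total weight reachable with as many SCUs as an average model summary contains. The pyramid weights, the peer matches, and the rounding of the average are all assumptions made for the example.

```python
def modified_pyramid_score(peer_weights, pyramid_weights, n_models):
    """Observed SCU weight divided by the ideal weight for an average-size summary."""
    # Average number of SCUs per model summary: each model contributes once
    # to every SCU it expresses, so the weights sum to the total SCU count.
    avg_scus = sum(pyramid_weights) / n_models
    # Assumption: round the average to a whole number of SCUs.
    target_size = round(avg_scus)
    # Ideal total weight for that many SCUs: take the heaviest ones first.
    ideal = sum(sorted(pyramid_weights, reverse=True)[:target_size])
    observed = sum(peer_weights)
    return observed / ideal if ideal else 0.0


pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]  # hypothetical pyramid, N=4 models
peer_weights = [4, 2, 1]                    # hypothetical SCUs matched in a peer
score = modified_pyramid_score(peer_weights, pyramid_weights, n_models=4)
print(round(score, 3))  # 0.583
```

Here the average model summary holds 17/4 = 4.25 SCUs, so the ideal is the sum of the four heaviest weights (12) and the peer scores 7/12.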
Brief Comparison with 2005
- Same characteristics for document clusters
- 4 instead of 7 model summaries
  - 2005: mean of mean SCU weight = 1.9
  - 2006: mean of mean SCU weight = 1.56
- Possibly simpler task (cf. Litowski, DUC 2006)
- Possibly more coherent pyramids
- Improved systems
  - 19/25 (76%) beat the baseline in 2005
  - 17/21 (81%) beat the baseline in 2006
ANOVA Results
- Dependent variable: modified score
- 9 factors:
  - Peerid (p~0)
  - Setid (p~0)
  - 5 LingQuality ratings
  - Content responsiveness (p=0.0001)
  - Overall responsiveness (includes readability)
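For readers unfamiliar with the test, the sketch below computes the F statistic of a one-way ANOVA with peer ID as the only factor. It is a deliberate simplification of the nine-factor analysis summarized above, and the scores are invented toy values, not DUC 2006 data.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of score lists."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical modified scores for three peers on three document sets.
scores_by_peer = {
    "peerA": [0.10, 0.20, 0.30],
    "peerB": [0.20, 0.30, 0.40],
    "peerC": [0.50, 0.60, 0.70],
}
f_stat = one_way_anova_f(list(scores_by_peer.values()))
print(round(f_stat, 2))  # 13.0
```

A large F means the differences between peer means are large relative to the spread within each peer; the p-value then comes from the F distribution with the two degrees of freedom shown.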
System Differences (Tukey’s HSD)
Peers                     > peers
10, 23                    1, 35, 17, 18, 25, 29, 32, 22, 14, 19, 5, 33, 24, 3, 6, 2, 15 (N=17)
8                         1, 35, 17, 18, 25, 29, 32, 22, 14 (N=9)
27                        1, 35, 17, 18, 25, 29, 32, 22 (N=8)
28                        1, 35, 17, 18, 25, 29 (N=6)
2, 3, 6, 14, 15 (N=5)     1, 35, 17, 18, 25 (N=5)
19, 24, 33 (N=3)          1, 35, 17, 18 (N=4)
22, 29, 32 (N=3)          1
1, 17, 18, 25, 35 (N=5)   NIL
For Illustration: Group Means
Peers                     Mean modified score
10, 23                    .241 (~ .03)
8                         .214
27                        .210
28                        .205
2, 3, 6, 14, 15 (N=5)     .199
19, 24, 33 (N=3)          .176
22, 29, 32 (N=3)          .169
1, 17, 18, 25, 35 (N=5)   .113 (~ .06)
DOCSET Differences
Docsets            Mean pyramid score
31                 .357 (~.07)
24                 .286
40                 .269
43                 .252
14                 .229 (~.03)
27                 .197
16, 17, 20, 29     .172
28                 .164
45, 30             .158
50                 .135
1, 3, 8, 15, 47    .133
5                  .065 (~.06)
Content Evaluation
- Perfect correlation with mean pyramid score per content level

Content assessment   Mean pyramid score
5                    .22
4                    .21
3                    .19
2                    .17
1                    .12
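The slide does not say which correlation coefficient is meant. As a quick check on the tabulated (rounded) values, the sketch below computes both a rank correlation (Spearman) and a linear one (Pearson) in plain Python. The helper names are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest); assumes no ties, as in the table."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


content_level = [5, 4, 3, 2, 1]
mean_pyramid_score = [0.22, 0.21, 0.19, 0.17, 0.12]

rho = spearman(content_level, mean_pyramid_score)
r = pearson(content_level, mean_pyramid_score)
print(round(rho, 3))  # 1.0
print(round(r, 3))    # 0.958
```

For these five rows the rank correlation is exactly 1 because the mean score rises with every step in content level, while the linear coefficient on the rounded values is about 0.96.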
Comparison with DUC 2005
- Many more significant differences among peers using Tukey
  - 2005: 2 distinct comparison sets
  - 2006: 8 distinct comparison sets
- Better correlation with responsiveness
  - 2 assessors in 2005, r=.81; .90
  - 1 assessor in 2006, r=1
Factors Affecting System Scores
- Differences in document set difficulty/coherence
- Pyramid characteristics
  - Mean SCU weight
  - Pyramid size and proportion of weight 1 SCUs
- Score variability
  - 2005: sd = .14
  - 2006: sd = .09
- Better systems
  - 2005 mean system score range: .20 to .06
  - 2006 mean system score range: .24 to .11
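The pyramid characteristics listed above (mean SCU weight, pyramid size, proportion of weight 1 SCUs) are simple summaries of a pyramid's weight list. The sketch below shows one way to compute them; the function name and the example weights are hypothetical and not taken from any DUC 2006 pyramid.

```python
def pyramid_characteristics(weights):
    """Summarize a pyramid given the list of its SCU weights."""
    size = len(weights)
    return {
        # Number of distinct SCUs in the pyramid.
        "size": size,
        # Average number of model summaries expressing each SCU.
        "mean_scu_weight": sum(weights) / size,
        # Share of SCUs that only one model summary expresses.
        "proportion_weight_1": sum(1 for w in weights if w == 1) / size,
    }


example_weights = [4, 3, 3, 2, 2, 1, 1, 1]  # hypothetical pyramid, N=4 models
stats = pyramid_characteristics(example_weights)
print(stats)
```

A lower mean SCU weight or a larger share of weight 1 SCUs means the model summaries agree less, which leaves a peer summary fewer high-value SCUs to match.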
Semantics of Pyramids
- More highly weighted SCUs
  - more general
  - less dependent on meaning of other SCUs
Generality of Highly Weighted SCUs
- W=4
  - D0603: Wetlands help control floods
  - D0605: Exercise helps arthritis
- W=1
  - D0603: In underdeveloped countries the increase of rice-planting has negative impacts on wetlands
  - D0605: Arthroscopic knee surgery appears to reduce pain, for unknown reasons
Semantic Independence of Highly Weighted SCUs
- W=4
  - D0640: The Kursk sank in the Barents Sea
  - D0617: Egypt Air Flight 990 crashed
- W=1
  - D0640: The escape hatch [of *] was too badly damaged to dock in 7 attempts
  - D0617: Tail elevators [of *] were in an uneven position, indicating a possible malfunction
Impressions/Questions
- Does greater difficulty of a docset correlate with greater specificity/interrelatedness?
  - D0647 is associated with lower mean pyramid scores
  - 9 SCUs of W=4 are all very specific (about sea rescue of Cuban child, Elian Gonzales)
  - 5 of 9 SCUs of W=4 refer to other SCUs