

SLIDE 1

DUC 2006 Pyramid Evaluation

Rebecca J. Passonneau
Center for Computational Learning Systems
Columbia University

SLIDE 2

June 8, 2006 DUC Workshop 2

Acknowledgments

• Hoa Dang
• Columbia University (Kathy McKeown)
• Guideline contributors, testers (Lucy Vanderwende, Adam Goodkind, Guy LaPalme, . . .)
• Pyramid Creators (Adam Goodkind, Sergey Sigelman, Lucy Vanderwende, Inderjeet Mani, Qui Long)
• Participants (21 sites)

SLIDE 3

Pyramid Overview

• Human summarizers select overlapping content
• A pyramid represents and quantifies the overlap of Summary Content Units (SCUs) found in multiple model summaries
• Two pyramid scores based on SCU annotations
    ◦ Original: Precision
    ◦ Modified: Recall
• Manual annotation reliability assessment
    ◦ Pyramid annotations (LREC 2006)
    ◦ Peer annotations (DUC 2005)
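For reference, the two scores are usually written along the following lines in the pyramid literature. The notation is introduced here for illustration and is not taken from the slides: $n$ is the number of model summaries, $D_i$ is the number of SCUs of weight $i$ found in the peer summary, $X_p$ is the number of SCUs in the peer, $X_a$ is the average number of SCUs per model summary, and $\mathrm{Max}(X)$ is the largest total weight obtainable from any $X$ SCUs in the pyramid.

```latex
D = \sum_{i=1}^{n} i \cdot D_i,
\qquad
P_{\text{original}} = \frac{D}{\mathrm{Max}(X_p)},
\qquad
P_{\text{modified}} = \frac{D}{\mathrm{Max}(X_a)}
```

Normalizing by the peer's own size makes the original score behave like precision, while normalizing by the size of a typical model summary makes the modified score behave like recall.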

SLIDE 4

Sample SCU from D0631

Label: The Concorde crossed the Atlantic in less than 4 hours
Sum1: <making the transatlantic flight in 3 and ½ hrs>
Sum2: <The Concorde could make the flight in between New York and London or Paris in less than four hours>
Sum3: <completing its journey from London to New York in about 3 hours, 30 minutes>
Sum4: <took less than 4 hrs to cross the Atlantic>

SLIDE 5

Building a Pyramid from Model Summaries (N=4)

[Figure: pyramid with tiers labeled W=4, W=3, W=2, W=1]
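The weighting idea behind the figure can be sketched in a few lines of Python. This is a minimal illustration, not the DUCView implementation. It assumes each model summary has already been annotated as a set of SCU labels (the labels below are hypothetical), and it takes an SCU's weight W to be the number of model summaries that express it, so with N=4 models the weights range from 1 to 4.

```python
from collections import Counter


def build_pyramid(model_annotations):
    """Map each SCU label to its weight: the number of model summaries expressing it.

    model_annotations: list of sets, one set of SCU labels per model summary.
    """
    weights = Counter()
    for scus in model_annotations:
        # A set guarantees an SCU is counted at most once per summary.
        weights.update(set(scus))
    return dict(weights)


def tiers(pyramid):
    """Group SCU labels by weight, highest tier first."""
    by_weight = {}
    for scu, w in pyramid.items():
        by_weight.setdefault(w, []).append(scu)
    return {w: sorted(by_weight[w]) for w in sorted(by_weight, reverse=True)}


# Hypothetical annotations for N=4 model summaries (labels are illustrative).
models = [
    {"crossed Atlantic in < 4 hrs", "scu_b", "scu_c"},
    {"crossed Atlantic in < 4 hrs", "scu_b", "scu_d"},
    {"crossed Atlantic in < 4 hrs", "scu_b", "scu_c"},
    {"crossed Atlantic in < 4 hrs", "scu_e"},
]
pyramid = build_pyramid(models)
```

Calling `tiers(pyramid)` groups the labels into the W=4 down to W=1 layers that the figure depicts.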

SLIDE 6

2006 Pyramid effort

• New version of DUCView, annotation guidelines
• Pyramids for 20 of the document sets
    ◦ High clarity ratings
    ◦ Even distribution of assessors (summary writers)
• Pyramid annotation
    ◦ 6 individuals at 3 sites, 2 with prior experience
• Peer annotation: 21 peers plus the baseline
    ◦ New procedure: “peer” review
• Only modified pyramid score (normalized to average # SCUs per model for each pyramid)
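The normalization in the last bullet can be sketched as follows. This is a hedged illustration rather than the scoring code used at DUC. The function names are hypothetical, the numbers are toy values, and rounding the average SCU count to the nearest integer is an assumption made here for simplicity.

```python
def max_weight_sum(pyramid_weights, n_scus):
    """Largest total weight obtainable by choosing n_scus distinct SCUs from the pyramid."""
    ranked = sorted(pyramid_weights, reverse=True)
    return sum(ranked[:n_scus])


def modified_pyramid_score(peer_weights, pyramid_weights, model_scu_counts):
    """Observed peer weight divided by the ideal weight for an average-sized model.

    peer_weights: weights of the pyramid SCUs matched in the peer summary.
    pyramid_weights: weights of all SCUs in the pyramid.
    model_scu_counts: number of SCUs in each model summary.
    """
    # Assumption: the average SCU count is rounded to the nearest integer.
    avg_scus = round(sum(model_scu_counts) / len(model_scu_counts))
    ideal = max_weight_sum(pyramid_weights, avg_scus)
    return sum(peer_weights) / ideal if ideal else 0.0


# Toy pyramid (illustrative numbers, not DUC data).
pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]
model_scu_counts = [4, 5, 4, 4]   # average 4.25, rounded to 4 SCUs
peer_weights = [4, 2, 1]          # the peer expresses three pyramid SCUs
score = modified_pyramid_score(peer_weights, pyramid_weights, model_scu_counts)
```

Because the denominator depends only on the pyramid and the model summaries, every peer scored against the same pyramid is measured against the same ideal.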

SLIDE 7

Brief Comparison with 2005

• Same characteristics for document clusters
• 4 instead of 7 model summaries
    ◦ 2005: mean of mean SCU weight = 1.9
    ◦ 2006: mean of mean SCU weight = 1.56
• Possibly simpler task (cf. Litkowski, DUC 2006)
• Possibly more coherent pyramids
• Improved systems
    ◦ 19/25 (76%) beat the baseline in 2005
    ◦ 17/21 (81%) beat the baseline in 2006

SLIDE 8

ANOVA Results

• Dependent variable: modified score
• 9 Factors:
    ◦ Peerid (p ≈ 0)
    ◦ Setid (p ≈ 0)
    ◦ 5 LingQuality ratings
    ◦ Content responsiveness (p = 0.0001)
    ◦ Overall responsiveness (includes readability)
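To make the ANOVA concrete, here is a minimal one-way sketch in Python for a single factor such as Peerid. It is a simplification under stated assumptions: the analysis on the slide uses nine factors, the scores below are made-up toy values rather than DUC results, and the function returns only the F statistic and degrees of freedom. Turning F into a p-value needs the F distribution, for example from a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, each holding the scores observed for one factor level.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)

    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy modified scores for three hypothetical peers (not DUC data).
scores_by_peer = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
f_stat, df_b, df_w = one_way_anova_f(scores_by_peer)
```

A large F relative to its degrees of freedom indicates that differences between peers are large compared with the variation within each peer's scores.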

SLIDE 9

System Differences (Tukey’s HSD)

Peers                        > peers
10, 23                       1, 35, 17, 18, 25, 29, 32, 22, 14, 19, 5, 33, 24, 3, 6, 2, 15 (N=17)
8                            1, 35, 17, 18, 25, 29, 32, 22, 14 (N=9)
27                           1, 35, 17, 18, 25, 29, 32, 22 (N=8)
28                           1, 35, 17, 18, 25, 29 (N=6)
2, 3, 6, 14, 15 (N=5)        1, 35, 17, 18, 25 (N=5)
19, 24, 33 (N=3)             1, 35, 17, 18 (N=4)
22, 29, 32 (N=3)             1
1, 17, 18, 25, 35 (N=5)      NIL
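As a reminder of what the comparison rests on, Tukey's HSD in its textbook form for $k$ groups of equal size $n$ is shown below. This is the standard formula, not a detail taken from the slides, and the symbols are introduced here for illustration: $q$ is the critical value of the studentized range distribution at level $\alpha$, $N$ is the total number of observations, and $MS_{\text{within}}$ is the within-group mean square from the ANOVA.

```latex
\mathrm{HSD} = q_{\alpha;\,k,\,N-k}\,\sqrt{\frac{MS_{\text{within}}}{n}},
\qquad
\text{peers } i \text{ and } j \text{ differ significantly if }
\lvert \bar{y}_i - \bar{y}_j \rvert > \mathrm{HSD}
```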

SLIDE 10

For Illustration: Group Means

Peers                        Mean modified score
10, 23                       .241 (~.03)
8                            .214
27                           .210
28                           .205
2, 3, 6, 14, 15 (N=5)        .199
19, 24, 33 (N=3)             .176
22, 29, 32 (N=3)             .169
1, 17, 18, 25, 35 (N=5)      .113 (~.06)

SLIDE 11

DOCSET Differences

Docsets               Mean pyramid score
31                    .357 (~.07)
24                    .286
40                    .269
43                    .252
14                    .229 (~.03)
27                    .197
16, 17, 20, 29        .172
28                    .164
45, 30                .158
50                    .135
1, 3, 8, 15, 47       .133
5                     .065 (~.06)

SLIDE 12

Content Evaluation

• Perfect correlation with mean pyramid score per content level

Content Assessment    Mean Pyr Score
5                     .22
4                     .21
3                     .19
2                     .17
1                     .12
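The slide does not say which correlation coefficient is meant. One reading under which the five table rows correlate perfectly is rank correlation, since the mean pyramid score rises strictly with the content level. The Python sketch below checks this with a small hand-written Spearman routine (an illustration with hypothetical function names, assuming no tied values) and computes Pearson's r on the raw values for comparison.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Values from the slide's table.
content_level = [5, 4, 3, 2, 1]
mean_pyramid_score = [0.22, 0.21, 0.19, 0.17, 0.12]

rho = spearman(content_level, mean_pyramid_score)
r = pearson(content_level, mean_pyramid_score)
```

On these five pairs the rank correlation `rho` is exactly 1, while Pearson's `r` on the raw values comes out a little lower, at about 0.96.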

SLIDE 13

Comparison with DUC 2005

• Many more significant differences among peers using Tukey
    ◦ 2005: 2 distinct comparison sets
    ◦ 2006: 8 distinct comparison sets
• Better correlation with responsiveness
    ◦ 2 assessors in 2005, r=.81; .90
    ◦ 1 assessor in 2006, r=1

SLIDE 14

Factors Affecting System Scores

• Differences in document set difficulty/coherence
• Pyramid characteristics
    ◦ Mean SCU weight
    ◦ Pyramid size and proportion of weight 1 SCUs
• Score variability
    ◦ 2005: sd = .14
    ◦ 2006: sd = .09
• Better systems
    ◦ 2005 mean system score range: .20 to .06
    ◦ 2006 mean system score range: .24 to .11

SLIDE 15

Semantics of Pyramids

• More highly weighted SCUs
    ◦ more general
    ◦ less dependent on meaning of other SCUs

SLIDE 16

Generality of Highly Weighted SCUs

• W=4
    ◦ D0603: Wetlands help control floods
    ◦ D0605: Exercise helps arthritis
• W=1
    ◦ D0603: In underdeveloped countries the increase of rice-planting has negative impacts on wetlands
    ◦ D0605: Arthroscopic knee surgery appears to reduce pain, for unknown reasons

SLIDE 17

Semantic Independence of Highly Weighted SCUs

• W=4
    ◦ D0640: The Kursk sank in the Barents Sea
    ◦ D0617: Egypt Air Flight 990 crashed
• W=1
    ◦ D0640: The escape hatch [of *] was too badly damaged to dock in 7 attempts
    ◦ D0617: Tail elevators [of *] were in an uneven position, indicating a possible malfunction

SLIDE 18

Impressions/Questions

• Does greater difficulty of a docset correlate with greater specificity/interrelatedness?
    ◦ D0647 is associated with lower mean pyramid scores
    ◦ 9 SCUs of W=4 are all very specific (about sea rescue of Cuban child, Elian Gonzales)
    ◦ 5 of 9 SCUs of W=4 refer to other SCUs

SLIDE 19

Conclusion

• Systems have improved: DUC roadmap has been successful
• Evaluation document sets have good coverage; but can we begin to characterize document set difficulty?
• Would pyramid scores (intrinsic) correlate with any extrinsic measures?