An Annotated Corpus of Picture Stories Retold by Language Learners


slide-1
SLIDE 1

Faculty of Mathematics, Informatics and Natural Sciences

Christine Köhn and Arne Köhn {ckoehn,koehn}@informatik.uni-hamburg.de

An Annotated Corpus of Picture Stories Retold by Language Learners

slide-2
SLIDE 2

August 26th, 2018 An Annotated Corpus of Picture Stories Retold by Language Learners, C. Köhn, A. Köhn 2

Many available, but
  small coverage
  mainly essays
    only marginally constrained
    → low agreement between error annotations (e.g. Fitzpatrick and Seegmiller (2004))
  languages other than English are underrepresented

Learner Corpora Today

slide-3
SLIDE 3

Assumption: reliable interpretation supports reliable annotation
Foster reliable interpretation by collecting a learner corpus with explicit task context (Ott et al., 2012)
  knowing the context of an utterance facilitates interpreting it

Annotation Reliability

slide-4
SLIDE 4

Reading comprehension exercise
  reasonable inter-annotator agreement for meaning assessment (Ott et al., 2012)
  strongly influences learner’s choice of words / structures

Picture description
  no textual influence
  pictures with single activity constrain the answers to a sensible degree for extracting verb(subj,obj) triples (King and Dickinson, 2013)
  sentences are conceptually simple

Tasks with Explicit Task Contexts

slide-5
SLIDE 5

Task design criteria
  capture real language use, no textual influence
  free-form answers
  strong visual context
  elicit variety of sentence structures
→ Picture story retelling (with appropriate choice of story)

Moral mit Wespen “moral with wasps” by Erich Ohser

Exploring the Middle Ground

slide-8
SLIDE 8

prohibited from working under the Nazi regime
father and son comics published under a pseudonym (e.o.plauen)
arrested together with Erich Knauf for making political jokes in 1944
committed suicide the day before his trial
Knauf was executed
(https://en.wikipedia.org/wiki/E._O._Plauen)

Erich Ohser (1903-1944)

slide-9
SLIDE 9

Comic Strips Retold by Learners of German

The ComiGS Corpus

slide-10
SLIDE 10

Task and User Interface

slide-11
SLIDE 11

~90 min for 2–3 stories
CEFR levels from A2 (upper beginner) to B2/C1 (upper intermediate/lower advanced)
70 texts from 30 learners of German
  30 texts each for stories 1 and 2
  10 texts for story 3
18k tokens, nearly 1.5k sentences
tokens/sentence: 12.2 (mean), 11 (median)

The ComiGS corpus
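
The tokens/sentence figures on this slide are plain descriptive statistics. A minimal sketch of how such numbers could be computed is given here; the three tokenized sentences are a hypothetical toy sample assembled from examples in this talk, not corpus data, and `length_stats` is an illustrative helper name.

```python
from statistics import mean, median

# Hypothetical toy sample of pre-tokenized sentences (not the real corpus).
sentences = [
    "Der Sohn schlisst sein Mund mit der Hand , er sieht ängstlich und "
    "überraschend gleichzeitig .".split(),
    "Das Kind liegt .".split(),
    "Der Mann geht weiter .".split(),
]


def length_stats(sents):
    """Return (mean, median) of the number of tokens per sentence."""
    lengths = [len(s) for s in sents]
    return mean(lengths), median(lengths)


avg, med = length_stats(sentences)
print(f"tokens/sentence: {avg:.1f} (mean), {med} (median)")
```

A mean above the median, as reported for the corpus, suggests a few long sentences pulling the average up.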

slide-12
SLIDE 12

Moral mit Wespen “moral with wasps” by Erich Ohser

Der Sohn schlisst sein Mund mit der Hand ,
The son closes his mouth with the hand ,
er sieht ängstlich und überraschend gleichzeitig .
he sees anxious and surprising simultaneously .

Example

slide-14
SLIDE 14

target hypothesis (TH): reconstruction of the original utterance
THs annotated with
  PoS tags using STTS tag set (Schiller et al., 1999)
  syntactic annotation: labeled dependencies using scheme by Foth (2006)
  lemmas (base form of words)

Multi-purpose Annotations
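
One way to picture the three annotation layers on a target hypothesis is a record per token. The sketch below is an assumption about how such data could be held in memory, not the corpus file format: the `Token` class and its field names are hypothetical, the PoS values are STTS tags, and the lemma and dependency-label values are illustrative guesses rather than annotations taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # surface form in the target hypothesis
    lemma: str   # base form of the word
    pos: str     # STTS part-of-speech tag
    head: int    # index of the governing token, 0 for the root
    deprel: str  # dependency label (illustrative value, see lead-in)


# First part of the TH1 example sentence, annotated by hand for illustration.
th1 = [
    Token(1, "Der", "der", "ART", 2, "DET"),
    Token(2, "Sohn", "Sohn", "NN", 3, "SUBJ"),
    Token(3, "schließt", "schließen", "VVFIN", 0, "S"),
    Token(4, "seinen", "sein", "PPOSAT", 5, "DET"),
    Token(5, "Mund", "Mund", "NN", 3, "OBJA"),
]


def dependents(tokens, head_index):
    """Return the surface forms of all tokens governed by head_index."""
    return [t.form for t in tokens if t.head == head_index]


print(dependents(th1, 3))
```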

slide-15
SLIDE 15

There are many ways to correct a sentence. Reznicek et al. (2012):

Minimal Target Hypothesis (TH1)
  minimal changes
  normalization for automatic processing
  adheres to morphological, syntactic and orthographic rules
  rules, e.g. if verb and arguments don’t match: change arguments
  structurally similar to learner utterance

Extended Target Hypothesis (TH2)
  minimal changes
  as similar as possible to a native speaker’s utterance
  also: semantics, pragmatics, information structure
  only rough guidelines
  similar to learner’s intention

Target Hypotheses

slide-18
SLIDE 18

orig  Der Sohn schlisst sein Mund mit der Hand ,
      The son closes his mouth with the hand ,
TH1   Der Sohn schließt seinen Mund mit der Hand ,
TH2   Der Sohn hält seinen Mund mit der Hand zu ,
      “The son covers his mouth with his hand”

Example (cont’d)
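
To see which tokens a target hypothesis changes, the learner utterance and the TH can be compared token by token. The sketch below uses Python's standard `difflib.SequenceMatcher` on the orig and TH1 lines of this example. It is only an illustration of the idea: the corpus itself aligns tokens by annotation, not by an automatic diff, and `token_changes` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

orig = "Der Sohn schlisst sein Mund mit der Hand ,".split()
th1 = "Der Sohn schließt seinen Mund mit der Hand ,".split()


def token_changes(source, target):
    """List (operation, source tokens, target tokens) for every changed span."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (tag, source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for change in token_changes(orig, th1):
    print(change)
```

An automatic diff like this reports the two adjacent corrections as one replaced span; it cannot tell that "schlisst" is a spelling error and "sein" a case error, which is part of why explicit annotation is needed.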

slide-20
SLIDE 20

orig  er sieht ängstlich und überraschend gleichzeitig .
      he sees anxious and surprising simultaneously .
TH1   er sieht ängstlich und überrascht gleichzeitig aus .
      he looks-1 anxious and surprised simultaneously looks-2 .
TH2   er sieht gleichzeitig ängstlich und überrascht aus .
      “he looks anxious and surprised at the same time.”

Example (cont’d)

slide-21
SLIDE 21

mostly adhere to Falko annotation manual (Reznicek et al., 2012)
minor changes mainly due to differences between tasks and language levels
  e.g. colloquial language is not discouraged in general in TH2
most changes are extensions or clarifications
→ annotations are mainly compatible with Falko

Adaptations for annotating THs

slide-23
SLIDE 23

if a token is moved & changed, this information cannot be fully recovered later
→ introduce unique identifier to indicate token movement (tmid)

tokens can be merged or split
  orig  Die  Kind  ist liegend  […]
  TH2   Das  Kind  liegt        […]
        The child is lying/lies […]

if (merged or split) tokens aren’t next to each other: use tmid as well
  orig  Die  Kind  ist    […]  liegend
  TH2   Das  Kind  liegt  […]
  tmid             1      […]  1

use tmid if tokens aren’t changed in isolation
  orig  Der  Mann  geht   weiter  […]
  TH2   Der  Mann  fährt  fort    […]
        The man walks/goes on […]
  tmid             1      1       […]

Extension for Movements, Splits, Merges
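
The role of the tmid can be sketched with a small column-based alignment: every column holds an orig cell, a TH2 cell, and an optional tmid, and columns sharing a tmid belong to one change. The tuple layout and the helper name `group_by_tmid` are assumptions made for illustration; the actual storage format of the corpus may differ.

```python
from collections import defaultdict

# One tuple per alignment column: (orig token, TH2 token, tmid or None).
# An empty string marks a cell with no token on that layer.
columns = [
    ("Die", "Das", None),
    ("Kind", "Kind", None),
    ("ist", "liegt", 1),
    ("[…]", "[…]", None),
    ("liegend", "", 1),
]


def group_by_tmid(cols):
    """Collect the orig and TH2 tokens that share a tmid."""
    groups = defaultdict(lambda: {"orig": [], "TH2": []})
    for orig_tok, th_tok, tmid in cols:
        if tmid is None:
            continue  # column was changed (or kept) in isolation
        if orig_tok:
            groups[tmid]["orig"].append(orig_tok)
        if th_tok:
            groups[tmid]["TH2"].append(th_tok)
    return dict(groups)


print(group_by_tmid(columns))
```

With the shared identifier, the non-adjacent learner tokens "ist" and "liegend" can later be recovered as the source of the single TH2 token "liegt".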

slide-27
SLIDE 27

Split corpus into 2 sets, 2 annotators

Set 1
  remaining 51 texts from 22 learners
  jointly annotated

Set 2
  19 texts from 8 learners with different mother languages and proficiency levels, all stories covered
  annotated independently

Annotation of THs

slide-32
SLIDE 32

Set 2: 5.4k tokens
manual alignment (underestimation likely)
changes per annotator: 1.2k (TH1), 1.9k (TH2)
For comparison: κ=0.3877 for the agreement of which token to change on NUCLE corpus (Dahlmeier et al., 2013)

IAA (Cohen’s κ)
       identical  position
TH1    0.7736     0.8640
TH2    0.5172     0.7388

Inter-annotator Agreement for THs
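
Cohen’s κ corrects the observed agreement p_o for the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for the "position" setting, under the assumption that each annotator's decision is reduced to one label per token (1 = changed, 0 = unchanged). The two label sequences are invented toy data, not corpus figures, and `cohen_kappa` is an illustrative helper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: per-token decisions of two annotators (1 = token changed).
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0]
print(round(cohen_kappa(annotator_1, annotator_2), 4))
```

In this toy case the annotators agree on 6 of 8 tokens (p_o = 0.75), but because most tokens are unchanged, chance agreement is already p_e = 0.53125, so κ is only 7/15.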

slide-34
SLIDE 34

task which mildly constrains learner texts
encourages complex utterances
utterances can be recovered quite well due to visual context
reliable annotation of THs
visual context could be used as additional feature for automatic processing (cf. Köhn and Menzel (2015))

Summary and Conclusions

slide-35
SLIDE 35

Thank you!

Soon on https://nats.gitlab.io/comigs (License CC-BY)
