A Sentence is a Sentence is a Sentence? (Zarah Weiss)



slide-1
SLIDE 1

Segmenting Oral and Historical Language Data

Zarah Weiss

Introduction: Defining Sentences, Written Language Bias, Defining Sentential Units, Historical Continuity

The LangBank T-Unit: Background, Annotation Principles, Basic Definition, Sentential Properties, Handling Ambiguities, Inter-Rater Reliability

Segmenting Speech: The SegCor Project, Maximal Syntactic Unit, Generalizable Solutions, Open Issues

Conclusion

References

A Sentence is a Sentence is a Sentence? Parallels and Differences between the Segmentation of Oral and Historical Language Data

Zarah Weiss

based on work done in collaboration with Gohar Schnelle, Laura Perlitz, Carolin Odebrecht, Hagen Hirschmann, Anke Lüdeling, and Detmar Meurers

“Unit segmentation in Spoken Interaction” Segcor Workshop Orléans, June 3-5 2019

1 / 67

slide-2
SLIDE 2

What is a Sentence?

◮ Sentence is a fundamental unit of linguistic analysis
◮ More than 200 sentence definitions (Eugen 1935)
◮ Naive understanding influenced by written language (Schmidt 2016; Bredel 2008, 2011)
◮ Assumes well-formedness (and graphematic markers)
◮ Confounds syntactic and graphematic notion
◮ Written language bias in linguistics (Linell 1982; Hennig 2008; Harris 1980)

2 / 67

slide-3
SLIDE 3

A Sentence is a Sentence is a Sentence?

(1) It was dark outside. Everything was silent.
(2) It was dark outside, everything was silent.
(3) Olive oil, two shelves full.

3 / 67

slide-5
SLIDE 5

What is a Syntactic Sentence?

According to Duden Grammar (Gallmann 2009, p. 763f)

1. Sentences are the maximal projection of a finite verb including all their arguments and adjuncts.
2. Sentences are syntactically well-formed units.
3. Sentences are the largest unit that can be derived by syntactically well-formed rules.

→ Well-formedness in grammar refers to written language (Hennig 2009, 2008)

4 / 67

slide-6
SLIDE 6

What is a Graphematic Sentence?

A sentence is a suprasegmental writing unit started by a sentence-initial majuscule and ended by a sentence-final punctuation mark. It does not contain these itself. The sentence-initial majuscule and the sentence-closing sign are terms that can be specified intragraphematically and are mutually dependent. Their alternation constitutes a sentence.

(Schmidt 2016, p. 247)

◮ Follows the punctuation theory of Bredel (2008, 2011)
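
A minimal sketch of how a purely graphematic definition behaves in practice: it treats a sentence-final mark followed by whitespace and a majuscule as a boundary. The function name `graphematic_sentences` and the restriction to `.`, `!` and `?` are assumptions made for this illustration; they are not part of Schmidt's definition.

```python
import re
from typing import List

# Assumption for this sketch: sentence-final marks are '.', '!' and '?'.
# A boundary is whitespace that follows a final mark and precedes a majuscule.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")

def graphematic_sentences(text: str) -> List[str]:
    """Split text into units closed by final punctuation and opened by a majuscule."""
    return [unit for unit in BOUNDARY.split(text.strip()) if unit]

print(graphematic_sentences("It was dark outside. Everything was silent."))
print(graphematic_sentences("It was dark outside, everything was silent."))
```

The first call yields two units and the second only one, although both strings contain the same two clauses. The split follows orthography, not syntax.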

5 / 67

slide-7
SLIDE 7

Written Language Bias

◮ Both definitions assume written language norm
→ Example of written language bias in linguistics (Schmidt 2016; Hennig 2008; Linell 1982)
◮ Does not account for non-standard varieties of language: spoken language, learner language, etc.
◮ Relates to comparative fallacy in SLA

6 / 67

slide-8
SLIDE 8

Comparative Fallacy

◮ Coined by Bley-Vroman (1983) for SLA
◮ Criticizes analysis of L2 in comparison to target language
→ Misrepresents particular systematicity of L2
◮ Linguistic analysis should expose linguistic properties

7 / 67

slide-9
SLIDE 9

From Sentences to Sentential Units

Segmentation purpose

◮ Which properties should be exposed by an analysis?

→ Influences the decisions made in segmentation process

Sentential units (SU)

◮ Notion of ‘sentence’ is biased by theory and conventions
◮ May not be in interest of purpose-driven segmentation
◮ Sentential unit as maximal unit of related clauses
◮ Keep clause in its traditional sense as syntactic unit

8 / 67

slide-10
SLIDE 10

Use Examples for Segmentation Units

◮ Study of syntactic phenomena and linguistic analysis
◮ Pragmatic analysis of discourse structures
◮ Sociolinguistic analysis of interlocutor interaction
◮ Unit for Natural Language Processing tools
◮ Unit for corpus representations (visualization, query)

9 / 67

slide-11
SLIDE 11

Which Segmentation Do We Want?

T  Uhm (.) yes, well, now try to use the sentence
A  If the whole thing is drawn the same size then (.) is (.) you go on you go on I made the start
B  Then you can see better
A  What big what
B  What is bigger and what is smaller
T  Very good

7th graders in math class (Prediger & Wessel 2018)

10 / 67

slide-13
SLIDE 13

Requirements of Sentential Segmentation Units

By Auer (2010)

1. Exhaustivity: Capture all the linguistic material
2. Atomism: Do not include other segments of their type
3. Discreteness: Do not allow overlapping segments
4. Coherence on Linguistic Level: Use linguistic descriptions

Additional criteria

◮ Be reliably applicable across and within annotators
◮ Avoid written language bias and comparative fallacy
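
The first three requirements can be checked mechanically once a segmentation is encoded as token spans. The following sketch assumes half-open `(start, end)` spans over token indices; this encoding and the name `check_segmentation` are hypothetical and not taken from Auer (2010). Coherence on the linguistic level is a matter of how the units are defined and is not covered here.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end); an assumed encoding

def check_segmentation(n_tokens: int, spans: List[Span]) -> Dict[str, bool]:
    """Check exhaustivity, atomism and discreteness of spans over n_tokens tokens."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    exhaustive = covered == set(range(n_tokens))  # every token is in some unit

    atomic, discrete = True, True
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            if max(s1, s2) >= min(e1, e2):
                continue  # the two units share no token
            nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
            if nested:
                atomic = False    # one unit contains another unit
            else:
                discrete = False  # the units overlap only partially
    return {"exhaustive": exhaustive, "atomic": atomic, "discrete": discrete}

print(check_segmentation(6, [(0, 3), (3, 6)]))
print(check_segmentation(6, [(0, 4), (1, 3), (4, 6)]))
```

In this sketch a nested unit counts against atomism and a partial overlap counts against discreteness, so the two violations are reported separately.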

11 / 67

slide-14
SLIDE 14

Define Sentential Units Beyond A Written Norm

◮ Define sentential units for non-standard language
◮ Here for Early New High German (ENHG; 1482–1652)
◮ Bridge from ENHG to spoken language data
→ ENHG shares linguistic properties with spoken German
→ Historical continuity of non-normative syntactic patterns (Hennig 2007; Sandig 1973)

12 / 67

slide-15
SLIDE 15

Dependent Clause with Verb-Last (VL)

(4) genannt  Absinthium  /  welcher  Name  bis auf  den  heutigen  Tag  in  den  Apotheken   geblieben  ist
    called   Absinthium  /  which    name  until    the  present   day  in  the  pharmacies  remained   is

    ‘called Absinthium, which to the present day is the name used in pharmacies.’ Fuchs (1543)

13 / 67

slide-16
SLIDE 16

weil-2 in Spoken German

(5) a. weil ich Sie hierauf Aufmerksam machen will
    b. weil ich will Sie hierauf Aufmerksam machen
    c. ‘because I want to call your attention to this’

STUTTGART21 (2010)
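
The difference between (5a) and (5b) is only the position of the finite verb after weil. A toy sketch of that contrast, assuming the caller names the finite verb and that exactly one single-token constituent precedes it in verb-second order; both are simplifications, and `verb_position` is a hypothetical helper rather than part of any annotation tool:

```python
from typing import List

def verb_position(tokens: List[str], finite_verb: str) -> str:
    """Classify a clause introduced by a conjunction by where its finite verb stands.

    Assumes tokens[0] is the conjunction and finite_verb occurs in the clause.
    """
    clause = tokens[1:]               # material after the conjunction
    idx = clause.index(finite_verb)
    if idx == len(clause) - 1:
        return "verb-last"            # finite verb closes the clause, as in (5a)
    if idx == 1:
        return "verb-second"          # one token precedes the verb, as in (5b)
    return "other"

a = "weil ich Sie hierauf Aufmerksam machen will".split()
b = "weil ich will Sie hierauf Aufmerksam machen".split()
print(verb_position(a, "will"))
print(verb_position(b, "will"))
```

Real data would need constituent boundaries, since the first constituent of a verb-second clause is often longer than one token.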

14 / 67

slide-17
SLIDE 17

weil-2 in ENHG

(6) M.  nennt  sie  Rapiens  Vitam  /  wann     sie  ist  giftig
    M.  calls  her  Rapiens  Vitam  /  because  she  is   poisonous

    ‘M. calls it Rapiens Vitam because it is poisonous.’ Tallat (1532)

15 / 67

slide-18
SLIDE 18

The LangBank Project

◮ From 09/2015 to 09/2018
◮ Create annotated corpora for teaching and learning
◮ Languages: Classical Latin and ENHG
◮ Complement teaching with corpus-based work
◮ Investigate applicability of NLP for non-standard data
◮ ENHG Corpus available online: https://korpling.org/annis3/

16 / 67

slide-19
SLIDE 19

Why Early New High German?

◮ Part of the German Arts and Literature curriculum
◮ Shares morpho-syntactic properties with contemporary German, e.g. verb asymmetry
◮ Interesting diachronic differences with respect to linguistic properties
→ Written ENHG register is forming
◮ Highly variable non-standard language variety

17 / 67

slide-20
SLIDE 20

The Ridges Corpus

◮ Register in Diachronic German Science (RIDGES)
◮ Serves as empirical basis for project
◮ Corpus of herbal texts (1482 to 1914) (Lüdeling et al. 2018)
◮ LangBank time period: 1482 to 1652
◮ LangBank core: 14 books (80,095 dipl tokens)
→ Sentence-ending punctuation emerges (Hartweg & Wegera 2005)

18 / 67

slide-22
SLIDE 22

Punctuation Marks in ENHG (Bock, 1539)

Virgule (sentence final)

19 / 67

slide-23
SLIDE 23

Punctuation Marks in ENHG (Bock, 1539)

Punctuation mark (sentence final)

19 / 67

slide-24
SLIDE 24

Punctuation Marks in ENHG (cont.)

(7) das  Wasser  [...]  braucht  [...]  Hieronymus  von  Braunschweig  für  das  Abnehmen       .
    the  water   [...]  needs    [...]  Hieronymus  of   Braunschweig  for  the  losing weight  .

    Für  den  Hauptschwindel  .
    For  the  dizziness       .

    Denen  so   Blut   speien  .
    those  who  blood  vomit   .

    ‘Hieronymus von Braunschweig uses this water against phthisis, dizziness, and to heal those who vomit blood’ Megenberg (1482)

20 / 67

slide-25
SLIDE 25

Sentential Units in LangBank

Approach

◮ No graphematic definition possible
◮ Syntactic definition required

Purpose

1. Annotation of syntactic phenomena
2. Basis for automatic and manual linguistic analysis
3. Unit for corpus representation (visualization, query)
4. (Investigation of diachronic changes from 1482 to 1914)

21 / 67

slide-26
SLIDE 26

Assumptions from Contemporary German

Linguistic Units

◮ Words and Parts-of-Speech
◮ Phrases and constituents
◮ Clauses
→ Sentential units based on original t-unit (Hunt 1965)
→ ENHG t-unit, henceforth TU

Linguistic theories

◮ X-bar theory
◮ Topological field model

22 / 67

slide-27
SLIDE 27

X-bar Theory

[XP spec [X’ [X’ [X head] comp] adj]]

23 / 67

slide-28
SLIDE 28

X-bar Theory (cont.)

[VP [NP Julia] [V’ [V’ [V meets] [NP Franca]] [PP at noon]]]

24 / 67

slide-30
SLIDE 30

Topological Field Model

◮ Non-hierarchical model of German clause structure
◮ Helps identify word order and constituency structure
→ Arranged around German sentence bracket
→ Model core applicable to ENHG, but with limitations
◮ Wöllstein (2014); Pittner & Berman (2004); Drach (1937)

25 / 67

slide-34
SLIDE 34

Topological Field Model (cont.)

N   Pre  LSB      Middle Field      RSB          Post
S0  Ich  habe     heute früher      gegessen     S3
    I    have     today earlier     eaten
S2  —    Ist      das ein Satz?     —            —
         Is       this a sentence?
S3  —    damit    ich mir Zeit      lassen kann  —
         so.that  I me time         leave can
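
One way to make the field table machine-readable is a small record per clause. This is an illustrative sketch only: the class `Clause`, its field names and the `lsb_is_finite_verb` flag are assumptions, and the clause-type rule is the usual textbook reading of the model (filled prefield plus finite verb in the left bracket gives verb-second, empty prefield gives verb-first, a subordinator in the left bracket gives verb-last).

```python
from dataclasses import dataclass

@dataclass
class Clause:
    """One row of a topological field table; an empty string marks an unfilled field."""
    pre: str = ""
    lsb: str = ""
    middle: str = ""
    rsb: str = ""
    post: str = ""
    lsb_is_finite_verb: bool = True  # assumption: provided by the annotator

    def clause_type(self) -> str:
        if not self.lsb_is_finite_verb:
            return "verb-last"       # left bracket holds a subordinator such as 'damit'
        return "verb-second" if self.pre else "verb-first"

    def linearize(self) -> str:
        fields = [self.pre, self.lsb, self.middle, self.rsb, self.post]
        return " ".join(f for f in fields if f)

s0 = Clause(pre="Ich", lsb="habe", middle="heute früher", rsb="gegessen")
s2 = Clause(lsb="Ist", middle="das ein Satz?")
s3 = Clause(lsb="damit", middle="ich mir Zeit", rsb="lassen kann", lsb_is_finite_verb=False)

for clause in (s0, s2, s3):
    print(clause.clause_type(), "|", clause.linearize())
```

Placing S3 in the postfield of S0 could be modelled by setting `post` to the linearized subordinate clause.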

26 / 67

slide-36
SLIDE 36

Starting with a T-unit

A t-unit is the “shortest grammatically allowable sentence into which (writing can be split) or minimally terminable unit” (Hunt 1965)

→ Circular if used as a sentence definition

27 / 67

slide-38
SLIDE 38

Adjusting the Traditional T-unit

A TU consists of the head of a phrase and all of its arguments and adjuncts and nothing else.

(Weiss & Schnelle 2016)

◮ Does not require a predefined notion of a sentence
◮ Uses X-bar terminology to express “grammatically” and “minimally terminable”
◮ On its own misses crucial sentence properties
→ Elaborated on in the remaining rules
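
The basic definition (a head with all of its arguments and adjuncts and nothing else) can be sketched as a closure over government relations. Note the swapped-in representation: the slides state the definition in X-bar terms, whereas this sketch assumes a flat list `governor` that stores, for each token, the index of the token governing it. The analysis of the example sentence is hypothetical.

```python
from typing import List, Optional

def projection(head: int, governor: List[Optional[int]]) -> List[int]:
    """Return the sorted token indices of `head` plus all direct and indirect dependents.

    governor[i] is the index of the token that governs token i, or None.
    """
    members = {head}
    changed = True
    while changed:  # repeat until no further dependent can be added
        changed = False
        for tok, gov in enumerate(governor):
            if gov in members and tok not in members:
                members.add(tok)
                changed = True
    return sorted(members)

# 'Julia meets Franca at noon' with an assumed analysis:
# 'meets' governs 'Julia', 'Franca' and 'at'; 'at' governs 'noon'.
tokens = ["Julia", "meets", "Franca", "at", "noon"]
governor = [1, None, 1, 1, 3]

print([tokens[i] for i in projection(1, governor)])
print([tokens[i] for i in projection(3, governor)])
```

The projection of `meets` spans the whole clause, while the projection of `at` is only the prepositional phrase. Which heads may open a TU at all is what the further rules have to settle.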

28 / 67

slide-39
SLIDE 39

Atomism

Property

◮ SU do not contain other SU

→ they are atomic (Auer 2010)

Rule

◮ TU do not govern other TU
◮ The head of a TU may not be the argument or the adjunct of another head itself
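
Read operationally, the rule says that only ungoverned heads open a TU. A hedged sketch under that reading, assuming each token records the index of its governor (or `None`) and that government is acyclic; `segment_into_tus` is a hypothetical name and the analysis of the example is made up for illustration:

```python
from typing import Dict, List, Optional

def segment_into_tus(governor: List[Optional[int]]) -> Dict[int, List[int]]:
    """Group token indices by the ungoverned head they ultimately depend on."""
    def root(tok: int) -> int:
        # Follow government upwards until a head without governor is reached.
        while governor[tok] is not None:
            tok = governor[tok]
        return tok

    units: Dict[int, List[int]] = {}
    for tok in range(len(governor)):
        units.setdefault(root(tok), []).append(tok)
    return units

# 'It was dark outside , everything was silent' without the comma token:
# It(0) was(1) dark(2) outside(3) everything(4) was(5) silent(6)
governor = [1, None, 1, 1, 5, None, 5]
print(segment_into_tus(governor))
```

Because every token is assigned to exactly one ungoverned head, the resulting units cannot nest or overlap in this encoding. A dependent clause whose head is governed by a matrix verb would stay inside the matrix TU.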

29 / 67

slide-40
SLIDE 40

Atomism in ENHG

(8)

    dass  auch  /  hoch    notwendig  zu  wissen  [...]  welche  in  ihrem  Leben  /
    that  also  /  highly  necessary  to  know    [...]  which   in  their  life   /

    das   ist  /  wann  sie   noch   grün   und  saftig  sind  ihre   Tugend  bald  erzeigen
    that  is   /  when  they  still  green  and  juicy   are   their  virtue  soon  show

    ‘that it is also necessary to know [...] which ones show their virtue while they are alive, i.e. while they are green and juicy’ von Bodenstein (1557)

30 / 67

slide-41
SLIDE 41

Atomism in Speech

A  If the whole thing is drawn the same size then (.) is (.) you go on you go on I made the start
B  Then you can see better
A  [...]
B  What is bigger and what is smaller

7th graders in math class (Prediger & Wessel 2018)

31 / 67

slide-42
SLIDE 42

Discreteness

Property

◮ SU do not overlap with other SU

→ they are discrete (Auer 2010)

Rule

◮ TU may not overlap
◮ No phrase is part of more than one ENHG-TU.

32 / 67

slide-43
SLIDE 43

Discreteness in ENHG

(9) das  andere  [...]  /  hat  Blättlein     sind  ein  wenig  rauher
    the  other   [...]  /  has  small.leaves  are   a    bit    rougher

    ‘The other one has leaves are a bit more rough.’ Fuchs (1543)

33 / 67

slide-44
SLIDE 44

Discreteness in Speech

(10) and there he is on the roof was he

FOLK corpus: https://dgd.ids-mannheim.de/DGD2Web/ExternalAccessServlet?command=displayTranscript&id=FOLK_E_00084_SE_01_T_01_DF_01&cID=c28&wID=c28

34 / 67

slide-45
SLIDE 45

Exhaustivity

Property

◮ SU exhaustively partition a text

→ they are exhaustive (Auer 2010)

Rule

◮ A text has to be partitioned exhaustively into TU
◮ No material is left over
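
Exhaustivity can be verified in the same way as discreteness. This is a minimal sketch under the same assumption as before, namely that TUs are available as hypothetical `(start, end)` token spans with an exclusive end; it lists every token that no TU covers.

```python
from typing import List, Tuple


def uncovered_tokens(n_tokens: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Return the indices of all tokens that belong to no span."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return [i for i in range(n_tokens) if i not in covered]
```

An empty result means that the text is partitioned without leftover material.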

35 / 67

slide-46
SLIDE 46

Exhaustivity in ENHG

(11)
Da ich aber das Widerspiel erfahren habe / dass auf eine Zeit da der Samen unter dem / [...]
Because I though the counter-play experienced have / that at a time that the seeds under the / [...]
Grund hervorgekommen ist / der über den Winter grün verblieben ist / und nachwärts im Sommer sehr groß geworden ist
carried ground emerge is / which over the winter green remained is / and afterwards in the summer very big become is

‘but because I experienced the following counter-play: that after the seeds emerged from the earth [...] they remain green throughout winter and then grow up in summer’ Fuchs (1543)

36 / 67

slide-47
SLIDE 47

Exhaustivity in Speech

A  On the strip board [...] you can see it better because #
B  # Because of Mehmet
A  ((laughs)) # because it is in a row hence in a section #
B  # a board #
A  # or whatever you call it

7th graders in math class (Prediger & Wessel 2018)

37 / 67

slide-48
SLIDE 48

(Dis)continuity

Question

◮ Is a sentential unit continuous or discontinuous?
◮ Depends on the purpose of the segmentation
◮ Here focus on syntactic analysis

Rule

◮ TU are continuous strings of tokens
◮ Exception: meta text artificially inserted by OCR
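
The continuity rule and its exception can be sketched as a simple check. The example assumes that each TU is given as a collection of token indices and that tokens flagged as OCR meta text are known in advance; both the representation and the name `is_continuous` are illustrative assumptions.

```python
from typing import Iterable


def is_continuous(token_ids: Iterable[int], meta_ids: Iterable[int] = ()) -> bool:
    """True if the token indices form one unbroken run.

    Gaps are tolerated only where the missing index is meta text.
    """
    ids = sorted(token_ids)
    if not ids:
        return True  # an empty unit has no gaps
    gaps = set(range(ids[0], ids[-1] + 1)) - set(ids)
    return gaps <= set(meta_ids)
```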

38 / 67

slide-49
SLIDE 49

(Dis)continuity in ENHG

(12)
dass auch / hoch notwendig zu wissen ... welche in ihrem Leben / das ist / wann sie noch grün und saftig sind ihre Tugend bald erzeigen
that also / highly necessary to know ... which in their life / that is / when they still green and juicy are their virtue soon show

‘that it is also necessary to know ... which ones show their virtue while they are alive, i.e. while they are green and juicy’ von Bodenstein (1557)

39 / 67

slide-50
SLIDE 50

(Dis)continuity in Speech

A  On the strip board [...] you can see it better because #
B  # Because of Mehmet
A  ((laughs)) # because it is in a row hence in a section #
B  # a board #
A  # or whatever you call it

7th graders in math class (Prediger & Wessel 2018)

40 / 67

slide-51
SLIDE 51

Punctuation

Question

◮ Where is punctuation located in a SU?
◮ In practice, it is often SU-final

→ Does not play a role for speech

Rule

◮ Punctuation is located at a TU’s outermost right periphery
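
One way to enforce this rule on pre-segmented data is to shift every TU start rightwards past punctuation, so that the punctuation closes the preceding TU. The sketch below is illustrative only: the punctuation inventory `PUNCT` and the representation of TUs by their start indices are assumptions, not part of the guidelines.

```python
from typing import List

# Assumed punctuation inventory; the virgule "/" is the mark used in the ENHG examples.
PUNCT = {"/", ".", ",", ":", ";"}


def attach_punctuation(tokens: List[str], starts: List[int]) -> List[int]:
    """Move each TU start past punctuation tokens so they end the previous TU."""
    fixed = []
    for start in starts:
        while start < len(tokens) and tokens[start] in PUNCT:
            start += 1
        fixed.append(start)
    return fixed
```

Applied to the tokens `Das heimisch Eppich ist wohlschmeckend / aber es ist dem Haupt böse` with a tentative second TU starting at the virgule, the start moves to `aber` and the virgule stays with the first TU.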

41 / 67

slide-52
SLIDE 52

Punctuation in ENHG

(13)
Das heimisch Eppich ist wohlschmeckend / aber es ist dem Haupt böse
the local ivy is tasty / but it is the head evil

‘The local ivy is tasty. But it is bad for the head.’ von Megenberg (1482)

42 / 67

slide-54
SLIDE 54

Handling Attachment Ambiguities

◮ Attachment ambiguities of phrases and clauses

→ Common in non-standard language, in particular ENHG
→ Challenges the uniqueness and discreteness of TU

43 / 67

slide-55
SLIDE 55

TU Attachment Ambiguity

Ambiguity

◮ Is a unit a subordinate clause or a TU on its own?

Rule

◮ Minimize the maximal TU length in words
◮ Two shorter TU are preferred over one longer TU

→ Ambiguous cases are analyzed as TU
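
The brevity rule amounts to a min-max choice among candidate segmentations. A minimal sketch, assuming that each candidate is a list of TUs and each TU a list of words with punctuation already excluded; the function name `prefer_brevity` is hypothetical.

```python
from typing import List

Segmentation = List[List[str]]  # a list of TUs, each TU a list of words


def prefer_brevity(candidates: List[Segmentation]) -> Segmentation:
    """Pick the candidate whose longest TU is shortest.

    On a tie, the candidate listed first is returned.
    """
    return min(candidates, key=lambda seg: max(len(tu) for tu in seg))
```

For a 14-word stretch that can be read either as one TU or as a 9-word TU followed by a 5-word TU, the maximal lengths are 14 and 9, so the two-TU reading is selected.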

44 / 67

slide-56
SLIDE 56

TU Attachment Ambiguity in ENHG

(15)
Wermut ist ein Kraut mit vielen Zinken und Ästen / an welchem sind aschefarbene Blätter
vermouth is a herb with many branches and branches / on which/these are ash colored leaves

‘Vermouth is a herb with many branches ...’

A. ‘... on which there are ash colored leaves.’
B. ‘... On these there are ash colored leaves.’

Excerpt from Fuchs (1543)

45 / 67

slide-57
SLIDE 57

TU Attachment Ambiguity in Speech

(16)
Larry is grad fast ausm Fenster gefalln voll schlimm
Larry is just nearly out window fell really bad

A. ‘Larry just nearly fell out of the window really badly’
B. ‘Larry just nearly fell out of the window. So bad!’

FOLK corpus³

³ https://dgd.ids-mannheim.de/DGD2Web/ExternalAccessServlet?command=displayTranscript&id=FOLK_E_00084_SE_01_T_01_DF_01&cID=c15&wID=&textSize=200&contextSize=4

46 / 67

slide-58
SLIDE 58

Phrase Attachment Ambiguity

Ambiguity

◮ To which TU does a phrase (or clause) belong?

Rule

◮ Minimize the maximal TU length in words
◮ Two shorter TU are preferred over one longer TU

→ Attachment to the shorter TU
→ Attachment to the preceding TU for equal TU length
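
The attachment preference can be written as a small decision function. This is a sketch under stated assumptions: TUs and the ambiguous phrase are plain word lists, length is counted in words, and the name `attach_phrase` is hypothetical.

```python
from typing import List, Tuple


def attach_phrase(
    preceding: List[str], phrase: List[str], following: List[str]
) -> Tuple[List[str], List[str]]:
    """Attach the phrase to the shorter neighbouring TU.

    If both neighbours are equally long, the preceding TU wins.
    """
    if len(preceding) <= len(following):
        return preceding + phrase, following
    return preceding, phrase + following
```

Attaching to the shorter neighbour keeps the longer of the two resulting TUs as short as possible, which is how this rule follows from minimizing the maximal TU length.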

47 / 67

slide-59
SLIDE 59

Phrase Attachment Ambiguity in ENHG

(17)
So bedeuten nun diese zwei Blätter / das Gesetz in zwei Tafeln fest / von Gott gegeben / die anderen drei grünen Blättlein / die drei Personen dir zeigen
so mean now these two leaves / the law in two tablets firm / by god given / the other three green leaves / which three people you show

‘These two leaves symbolize the law in two tablets ...

A. that were given by god. The other three green leaves
B. The other three green leaves given by god

... show you three people’ Rosbachs (1588)

48 / 67

slide-62
SLIDE 62

Phrase Attachment Ambiguity in Speech

(20)
ihr könnt euch das mal gegenseitig [...] erklären was ihr da gemacht habt am besten ihr habt das ja alle richtig gemacht
you can you this now each.other [...] explain what you there done have the best you have this yes all correctly done

A. ‘It would be best you explain to each other what you did. All of you did everything correct anyway.’
B. ‘Now explain to each other what you did. Ideally, all of you did everything correct anyway.’

Prediger & Wessel (2018)

49 / 67

slide-65
SLIDE 65

Well-Formedness vs. Brevity

Issue

◮ SU in non-standard language may (not) be well-formed
◮ What to do if a well-formed analysis is possible?

→ Ambiguities between well-formed and short TU arise

Rule

◮ When in doubt, prefer well-formedness over brevity

50 / 67

slide-66
SLIDE 66

Well-Formedness vs. Brevity in ENHG

(23)
aber sie trocknen die Zungen [...] wenn man Köl- und Haselbaum pflanzt zu Weinrebenwurzeln / so verderben sie die Reben
but they dry the tongues [...] if you köl- and hazeltree plant next.to grape.vine.roots / then spoil they the grape.vine

‘But they dry the tongues [...]. If you plant köl tree and hazel tree next to grape vine roots, then they spoil the grape vine.’ von Megenberg (1482)

51 / 67

slide-67
SLIDE 67

Well-Formedness vs. Brevity in Speech

(24) I have already all the time been hanging in the no not all the time a minute I have been hanging in the line it’s quite funny when you don’t hear anything (FOLK corpus⁴)

⁴ https://dgd.ids-mannheim.de/DGD2Web/ExternalAccessServlet?command=displayTranscript&id=FOLK_E_00084_SE_01_T_01_DF_01&cID=c10&wID=c10

52 / 67

slide-68
SLIDE 68

Our TU Definition in one Slide

1. A TU consists of a phrasal head and all its arguments and adjuncts
2. TUs are independent, unique, exhaustive
3. TUs are continuous and do not start with punctuation
4. Ambiguous clause and phrase attachment is resolved by preference of brevity
5. Well-formedness trumps brevity

53 / 67

slide-69
SLIDE 69

Inter-Rater Reliability

◮ End of every word is sentence boundary candidate

→ Binary ± sentence boundary annotation

◮ 3 annotators
◮ 5 text excerpts from 1532 to 1639
◮ 2,609 tokens → ca. 5% sentence boundaries
◮ Cohen’s Kappa (κ) = .82

→ Almost perfect agreement: κ ≥ .80 (Landis & Koch 1977)
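
For readers who want to reproduce this kind of figure on their own data, the sketch below computes Cohen's κ for two annotators from binary boundary labels (1 = boundary after the token, 0 = none). The slide does not say how the three annotators were combined; averaging κ over all annotator pairs, as `mean_pairwise_kappa` does, is an assumption made here for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    # Chance agreement from each annotator's label proportions.
    p_e = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    # Undefined (division by zero) if both annotators use one identical label only.
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations: List[Sequence[int]]) -> float:
    """Average Cohen's kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Because boundaries are rare (about 5% of candidates), raw agreement would be high even for careless annotation; κ corrects for this by subtracting the agreement expected by chance.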

54 / 67

slide-70
SLIDE 70

The SegCor Project

◮ Segmentation of Oral Corpora (06/16 to 02/19)
◮ Develop methods for segmentation of spoken language
◮ Languages: Spoken French and German
◮ Corpora: FOLK (German), CLAPI and ESLO (French)
◮ Westpfahl et al. (2018); Westpfahl & Gorisch (2018); Schmidt & Westpfahl (2018)

55 / 67

slide-71
SLIDE 71

SegCor Guidelines: Purpose and Means

◮ Based on syntactic criteria
◮ Find phenomena typical for spoken language
◮ Offer information about syntactic phenomena
◮ Practical approach
◮ See Westpfahl et al. (2018, p. 3)

56 / 67

slide-72
SLIDE 72

Syntactic Segmentation Layers

◮ Topological fields
◮ Clause type (based on verb position)
◮ Maximal syntactic unit (MSU)
◮ (Certain speech phenomena)

→ Hierarchical multi-layer annotation approach
→ MSU serve as ‘maximal unit of related clauses’ (SU)
→ Based on same linguistic units and theories as TU

57 / 67

slide-74
SLIDE 74

Types of Maximal Syntactic Units

◮ Simple sentential unit (complete, 1 clause)
◮ Complex sentential unit (complete, 2+ clauses)
◮ Abandoned unit (incomplete clause)
◮ Non-sentential unit (independent phrases, non-verbal)

→ Currently no similar differentiation for TU
→ Applicable to current TU definition

58 / 67

slide-75
SLIDE 75

Similarities between MSU and TU

◮ Maximal units of connected clauses (syntactic definition)
◮ Atomic, discrete, exhaustive, and coherent (Auer 2010)
◮ Differ in linguistic theories used to express concept

→ TU: maximal phrasal projection
→ MSU: all related clauses or independent/abandoned units

59 / 67

slide-76
SLIDE 76

Historical and Spoken Language Phenomena

Shared phenomena

◮ Dialect and non-standard orthography
◮ Parentheses and ellipses, apo koinou
◮ Non-sentential and abandoned language material
◮ Syntactic similarities between ENHG and spoken German

Different challenges and cues

◮ Speech: collaborative or interrupted speech, pauses
◮ ENHG: no native speakers’ judgments, sections

60 / 67

slide-77
SLIDE 77

Similar Solutions to Similar Challenges

Apo koinou

(25) and there he is on the roof was he (FOLK corpus)

Parentheses and insertions

(26)
a. I’ve been waiting all the time in the no not all the time for a minute in the line (FOLK corpus)
b. because (.) madam minister (.) you don’t only know the conservatory (FOLK corpus)

61 / 67

slide-78
SLIDE 78

Similar Solutions to Similar Challenges (cont.)

Free noun phrase

(27) a reasonable cat (.) the big black one (FOLK corpus)

Non-language material

(28) and then ((laughter)) ((breathing)) (FOLK corpus)
(29) then they spoil the vine - ¶ (von Megenberg, 1482)

62 / 67

slide-79
SLIDE 79

Discontinuous Speech Continuity

T  Why is it difficult to formulate?
A  On the strip board, on the strip board you can see it better because #
B  # Because of Mehmet
A  ((laughs)) # because it is in a row hence in a section #
B  # a board #
A  # or whatever you call it
T  Uhm what do you mean #

7th graders in math class (Prediger & Wessel 2018)

63 / 67

slide-81
SLIDE 81

Collaborative Speech Continuity

T  Uhm (.) yes, well, now try to use the sentence
A  If the whole thing is drawn the same size then (.) is (.) you go on you go on I made the start
B  Then you can see better
A  What big what
B  What is bigger and what is smaller
T  Very good

7th graders in math class (Prediger & Wessel 2018)

64 / 67

slide-83
SLIDE 83

Conclusion

Sentence segmentation

◮ Sentence definitions suffer from written language bias
◮ Different purposes call for different SU
◮ Yet many principles can be shared across purposes
◮ Only few phenomena require specialized rules

→ Multi-layer annotation of SU (see SegCor annotation)

65 / 67

slide-84
SLIDE 84

Conclusion (cont.)

Generalizability of TU and MSU

◮ Share assumptions and principles
◮ Often yield same or comparable segmentation
◮ TU lack applicability to collaborative speech
◮ MSU relies on topological fields and native-speaker intuition → limits generalizability

66 / 67

slide-85
SLIDE 85

Outlook

◮ COLD project: 04/2019 to 04/2021⁵
◮ Investigate interaction of teachers and students
◮ Adaptation strategies of teachers to language learners
◮ Challenges

→ How do we segment our classroom interactions?
→ Multiple interlocutors and parallel discourses
→ Incomplete and collaborative utterances

◮ Currently: decide transcription and segmentation

⁵ https://www.die-bonn.de/COLD/

67 / 67

References

Auer, P. (2010). Zum Segmentierungsproblem in der Gesprochenen Sprache. InLiSt – Interaction and Linguistic Structures 49, 1–19.

Bley-Vroman, R. (1983). The comparative fallacy in interlanguage studies: The case of systematicity. Language Learning 33(1), 1–17.

Bredel, U. (2008). Die Interpunktion des Deutschen. Ein kompositionelles System zur Online-Steuerung des Lesens. Tübingen: Niemeyer.

Bredel, U. (2011). Interpunktion. Heidelberg: Winter.

Drach, E. (1937). Grundgedanken der deutschen Satzlehre. Frankfurt am Main, Germany: Diesterweg.

Eugen, S. (1935). Geschichte und Kritik der wichtigsten Satzdefinitionen. Jena: Fromannsche Buchhandlung.

Feilke, H. (2010). Schriftliches Argumentieren zwischen Nähe und Distanz – am Beispiel wissenschaftlichen Schreibens. Nähe und Distanz im Kontext variationslinguistischer Forschung 35, 209–231.

Foster, P. & P. Skehan (1996). The Influence of Planning and Task Type on Second Language Performance. Studies in Second Language Acquisition 18(3), 299–323.

Gallmann, P. (2009). Duden. Die Grammatik, Mannheim, Leipzig, Wien, Zürich: Dudenverlag, chap. Der Satz, pp. 763–1056.

Harris, R. (1980). The Language-Makers. Ithaca, N.Y.: Cornell University Press.

Hartweg, F. & K.-P. Wegera (2005). Frühneuhochdeutsch: eine Einführung in die deutsche Sprache des Spätmittelalters und der frühen Neuzeit. Germanistische Arbeitshefte 33.

Hennig, M. (2007). Thesen zur Erforschung historischer Nähesprachlichkeit. In M. Balaskó & P. Szatmári (eds.), Sprach- und Literaturwissenschaftliche Brückenschläge. Vorträge der 13. Jahrestagung der GESUS in Szombathely, München: Lincom (Edition Linguistik 59), pp. 13–26.

Hennig, M. (2008). Grammatik der gesprochenen Sprache in Theorie und Praxis.

Hennig, M. (2009). Nähe und Distanzierung. Verschriftlichung und Reorganisation des Nähebereichs im Neuhochdeutschen. Kassel: Kassel University Press.

Hunt, K. W. (1965). Grammatical Structures Written at Three Grade Levels. NCTE Research Report No. 3.

Landis, J. R. & G. G. Koch (1977). The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174.

Linell, P. (1982). The Written Language Bias in Linguistics. Linköping: University of Linköping.

Lüdeling, A., C. Odebrecht, L. Perlitz & A. Zeldes (2018). RIDGES-HErbology (Version 8.0). http://korpling.org/ridges/. http://hdl.handle.net/11022/0000-0007-C6A3-1.

Pittner, K. & J. Berman (2004). Deutsche Syntax: ein Arbeitsbuch. Tübingen, Germany: Narr.

Prediger, S. & L. Wessel (2018). Brauchen mehrsprachige Jugendliche eine andere fach- und sprachintegrierte Förderung als einsprachige? Zeitschrift für Erziehungswissenschaft 21(2), 361–382.

Reichmann, O. & R. P. Ebert (1993). Frühneuhochdeutsche Grammatik, vol. 12. Tübingen, Germany: Niemeyer.

Sandig, B. (1973). Zur historischen Kontinuität normativ diskriminierter syntaktischer Muster in spontaner Sprechsprache. Deutsche Sprache pp. 37–56.

Schmidt, K. (2016). Der graphematische Satz. Zeitschrift für germanistische Linguistik 44(2), 215–256.

Schmidt, T. & S. Westpfahl (2018). A Study on Gaps and Syntactic Boundaries in Spoken Interaction. In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 14). Vienna, Austria, pp. 40–49.

STUTTGART21 (2010). STUTTGART21: German panel discussion. Phoenix TV.

Weiss, Z. & G. Schnelle (2016). Early New High German Sentence Segmentation Guidelines.

Westpfahl, S. & J. Gorisch (2018). A Syntax-Based Scheme for the Annotation and Segmentation of German Spoken Language Interactions. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions. Santa Fe, New Mexico, USA, pp. 109–120.

Westpfahl, S., N. Proske, M. Hobich, A. Borlinghaus & H. Strub (2018). Syntactic Segmentation in the SegCor project. Version 1.

Wöllstein, A. (2014). Topologisches Satzmodell. Heidelberg: Winter, 2 ed.

A sentence is a sentence is a sentence

Written Language Bias and Comparative Fallacy

“The learner’s system is worthy of study in its own right, not just as a degenerate form of the target system.”
Bley-Vroman (1983, p. 4)

“In grammar description, linguists tend to regard the peculiarities of the former [= spoken language] as ‘deviations’ rather than as independent structural principles.”
(translated from Small 1985a: 13)

Particles After Comparative in ENHG

(30)

Die Wasser [...] sind gut / jedoch nicht so stark und kräftig / als wie durch das Instrument destilliert
the waters [...] are good / but not as strong and vigorous / than as through the instrument distilled

‘The waters are good but not as strong and vigorous as waters distilled with the instrument.’ Libavius (1603)

Particles After Comparative in Spoken German

(31)

Eine Eisenbahnfahrt im Tiefbahnhof findet unter keinen anderen eisenbahntechnischen Voraussetzungen statt als wie im Kopfbahnhof
a train.ride in.the underground.station takes under not other technical.railway conditions place than as in.the terminus.station

‘A railway journey in an underground station does not take place under any other technical railway conditions than in a terminus station.’ STUTTGART21 (2010)

Another Punctuation Example

(32)

das is mehr für gelehrte Leute / als die sich mit dem Werk zu belustigen begehre . In welchem er aus Plinius sehr viel genommen
this is more for educated people / as those themselves with the work to entertain seek . in which he from Plinius very much took

‘this is rather for researchers than for those who read for pleasure the book in which he adopted a lot from Plinius’ Rhagor (1693c)

VFIN-initial Verb Cluster

(33)

derer man alle morgens und abends zwei Skrupel dem Schwindsüchtigen [soll geben] /
of.which you all mornings and evenings two scrupuli the tuberculosis.infected [shall give] /

Adam von Bodenstein (1557)

Another TU Attachment Ambiguity in ENHG

(34)

aber von dem Baum der Erkenntnis Gutes und Böses sollst du nicht essen / denn welches Tages du davon isst / sollst du des Todes sterben
but of the tree of knowledge good and bad shall you not eat / because which day you of.it eat / shall you the death die

‘but of the tree of knowledge about good and evil you shall not eat, because on the day that you eat of it, you will die’ Rosbachs (1588)

A Sentence is a Sentence is a Sentence?

"LEASH DOGS SHEEP" from Feilke (2010)

Spoken Learner Language

A: which which what is your opinion?
B: (1.0) maybe er (5.0) he (7.0)
A: long time? or it’s for for you it’s a major mistake or a small mistake?
B: maybe three months
A: three months for this one okay for me it’s ten
B: ten?
A: ten years
B: yeah ten years oh very long

L2 interaction on prison sentences (Foster & Skehan 1996)

Interrupted Turns L1 German

T: Why is it difficult to formulate?
A: On the strip board, on the strip board you can see it better because #
B: # Because of Mehmet
A: ((laughs)) # because it is in a row hence in a section #
B: # a board #
A: # or whatever you call it
T: Uhm what do you mean #

7th graders in math class (Prediger & Wessel 2018)

Interrupted Turns L1 German (cont.)

T: Uhm what do you mean #
C: # a line #
T: # a stripe?
B: Because #
A: # That #
B: # Mehmet is there
A: ((laughs))
T: No speak again constructively

7th graders in math class (Prediger & Wessel 2018)

Incomplete MSU

FR: he’s just a little bit of a klutz
EG: yes yes exactly (.) that’s the problem (.) Pepper obviously didn’t go out either
FR: yes
EG: and only Larry does such a thing here
FR: a reasonable cat (.) the big black one
EG: ((laughing)) yeah kind of

◮ MSU annotation as non-sentential unit (Westpfahl et al. 2018, p. 39)

Afinite VL-Clauses (Reichmann & Ebert 1993, §256)

(36)

[...] von dem L. C. meldet / dass er der erste gewesen ist / so den Feldbau die lateinische Sprache gelehrt hat
[...] of whom L. C. reports / that he the first been has / who the agriculture the Latin language taught has

‘of whom L. C. reports that he has been the first who brought Latin into agriculture’ Excerpt from von Bodenstein (1557)

Semantic and Syntactic Change of Connectives

(37)

doch mögen wir [...] dafür gebrauchen das Kraut so man Welsamen nennt / denn es der Kraft [...] nach dem rechten Seriphium ganz gleich ist
but want we [...] for.this use the herb which one Welsamen calls / because it the power [...] following the right Seriphium completely similar is

‘but for this we want to use the herb that is called Welsamen because it is very similar to the right Seriphium in terms of its powers’ Excerpt from Fuchs (1543)

Historical Continuity of Apo Koinu

(38)

Uns ist in alten mæren wunders vil geseit / von helden lobebæren, von grôzer arebeit, / von freuden, hôchgezîten, von weinen und von klagen, / von küener recken strîten muget ir nu wunder hœren sagen.
us is in old tales wonderful much told / of heroes laudable of great labor / of joys celebrations of crying and of laments / of bold warriors’ fights may you now wonderful hear told

‘In wonderful stories we have heard of laudable heroes, great labor, joy, celebrations, crying and laments, and fighting warriors may you now hear wonderful tales’

Heavy Right Periphery in Spoken German

(39)

Ich zeige es Ihnen deshalb, weil ich will Sie Aufmerksam machen, auf hier diesen Abschnitt hier
I show it you therefore because I want you attention make on here this paragraph here

‘I am showing you this, because I want to call your attention to this paragraph here’

STUTTGART21 (2010)

Heavy Right Periphery in ENHG

(40)

darin der Samen frei verborgen [liegt] eine lange Zeit
in.this the seed free hidden [lies] a long time

‘in this lies the seed openly hidden for a long time’

Rosbachs (1588)
