SLIDE 1

Introduction Parallel Corpora Corpus Structure Corpus Study Conclusion References

“A mas novas vos torn / Now I take you back to my tale” The Romance of Flamenca

Olga Scrivner, E.D. Blodgett*, Sandra Kübler, Michael McGuire

Indiana University *University of Alberta

June 2013

SLIDE 2

Introduction

In the past, historical documents and manuscripts were studied exclusively with a manual, paper-based approach. Recent advances in corpus linguistics have introduced state-of-the-art methods and tools for the digitization, semi-automatic annotation, and visualization of such resources.

SLIDE 3

Linguistic Annotation

“By accessing linguistic annotation, we can extend the range of phenomena that can be found with high precision” (Kübler and Zinsmeister, 2014)

1. Morphological annotation - collocations, spelling variation
2. Syntactic annotation - sentence structure in narratives vs. dialogues, prose vs. verse
3. Discourse annotation - analysis of scenes and characters (female vs. male speaker, King vs. servants)

SLIDE 5

Medieval Romance Languages

In recent years, a number of annotated corpora have been developed for Medieval Romance languages:
• Old Spanish (Davies, 2002)
• Old Portuguese (Davies and Ferreira, 2006)
• Old French (Stein, 2008; Martineau et al., 2010)

For Occitan, there exist (to our knowledge) two electronic databases:
1. “The Concordance of Medieval Occitan” (Ricketts and Reed, 2005)
2. “Provençal poetry” (ARTFL Project, 1998)

Users of those databases are limited to lexical search.

SLIDE 6

Le Roman de Flamenca - 13th century

Le Roman de Flamenca, a “universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995).

“Flamenca est la création d’un homme d’esprit qui a voulu faire une oeuvre agréable où fût représentée dans ce qu’elle avait de plus brillant la vie des cours au XIIe siècle. C’était un roman de moeurs contemporaines” (Meyer, 1865)

“Flamenca is the creation of a man of talent who wished to write an agreeable work representing the most brilliant aspects of courtly life in the twelfth century. It is a novel of manners” (Bradley, 1922)

SLIDE 7

Le Roman de Flamenca

This romance presents a very intriguing love story between the beautiful Flamenca, who is imprisoned in a tower by her jealous husband Archambaut, and the sharp-witted knight Guillem.

The photo of the tapestry is used by permission of FreeLargePhotos.com

SLIDE 8

Le Roman de Flamenca

The anonymous manuscript of Le Roman de Flamenca was accidentally discovered in Carcassonne (France) by Raynouard and was first fully edited by P. Meyer in 1865.

This romance is unique in its genre (“the first modern novel”) and in its use of setting, adventures, and character portrayal (Blodgett, 1995; Bradley, 1922; Meyer, 1865).

The potential value of this historical resource, however, is limited by the lack of an accessible digital format and linguistic annotation.

SLIDE 9

Goals

Our corpus is intended not only as material for linguistic research, but also to aid in broader studies:
• Interactive online database with access to a glossary, to translations of verses, and to comments (Meyer, 1901): http://nlp.indiana.edu/~obscrivn/Introduction.html
• Multiple-level annotation - morphological, syntactic and pragmatic (Scrivner et al., 2013)
• Parallel English-Occitan corpus (Blodgett, 1995)

SLIDE 10

What is a Parallel Corpus?

A parallel corpus is “an association between two texts (written or spoken) in different languages that represent translations of each other” (Tufis, 2006). Parallel alignment links reciprocal translation units, which encode valuable lexical and syntactic knowledge.

SLIDE 12

Alignment types

One-to-one: one word from a source language corresponds to only one word in a target language
One-to-many: one word from a source language corresponds to more than one word in a target language

SLIDE 13

Historical Parallel Corpora

Given that parallel words have the same content, we can identify forms that have not been studied (Koolen et al., 2006; Enrique-Arias, 2012):
• Spelling and lexical variation
• Morphosyntactic variation
• Null occurrences

SLIDE 14

Null Occurrences

A mas novas Ø vos torn
Now I take you back to my tale
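As a rough illustration, a verse pair like this one could be stored as two token lists plus a set of word links, from which the alignment type of each Occitan token follows. This is only a sketch: the link set is a hand-made guess for illustration, not the corpus's corrected alignment, and the name `classify_links` is hypothetical. English tokens left without a link, such as the subject pronoun "I", are where a null occurrence (Ø) shows up on the Occitan side.

```python
from collections import defaultdict

# The verse pair from the slide. The links are illustrative assumptions,
# not the gold alignment of the Flamenca corpus.
occitan = ["A", "mas", "novas", "vos", "torn"]
english = ["Now", "I", "take", "you", "back", "to", "my", "tale"]

# Each link is (index into occitan, index into english).
links = [(0, 5), (1, 6), (2, 7), (3, 3), (4, 2), (4, 4)]


def classify_links(n_source, n_target, links):
    """Label each source token and list the target tokens with no link."""
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for s, t in links:
        targets_of[s].add(t)
        sources_of[t].add(s)

    kinds = {}
    for s in range(n_source):
        n = len(targets_of.get(s, ()))
        if n == 0:
            kinds[s] = "unaligned"
        elif n == 1:
            # Simplification: the target side is not checked for other sources.
            kinds[s] = "one-to-one"
        else:
            kinds[s] = "one-to-many"

    unlinked_targets = [t for t in range(n_target) if t not in sources_of]
    return kinds, unlinked_targets


kinds, unlinked = classify_links(len(occitan), len(english), links)
for s, kind in kinds.items():
    print(occitan[s], kind)
print("English tokens without an Occitan counterpart:",
      [english[t] for t in unlinked])
```

In this toy link set, "torn" is the only one-to-many case ("take ... back"), and both "Now" and "I" stay unlinked. Telling a translator's addition apart from a genuine null subject would still need the syntactic annotation.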

SLIDE 16

Automatic Alignment

• English and Occitan texts are aligned by lines of verse
• A bilingual lexicon is generated by NATools¹ (matrix of word-to-word probabilities)
• Automatic alignment via the Berkeley Aligner (Liang et al., 2006)
• Manual correction of the alignment

¹ http://linguateca.di.uminho.pt/natools/
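To give a feel for what a matrix of word-to-word probabilities looks like, the sketch below estimates one from line-aligned token lists. The slides do not describe how NATools computes its lexicon, so this substitutes plain IBM Model 1 trained with EM. The function name `ibm_model1` and the two toy fragments are assumptions for illustration, not verse lines taken from the corpus.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t[(e, f)] = P(English word e | Occitan word f) with EM."""
    # Every co-occurring word pair starts with the same weight.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for source, target in pairs:
            for e in target:
                # E-step: split one unit of evidence for e among the source words.
                z = sum(t[(e, f)] for f in source)
                for f in source:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalise so the probabilities for each source word sum to 1.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}
    return t


# Toy line-aligned fragments (hypothetical, for illustration only).
pairs = [
    (["mas", "novas"], ["my", "tale"]),
    (["novas"], ["tale"]),
]

lexicon = ibm_model1(pairs)
for (e, f), p in sorted(lexicon.items(), key=lambda item: (item[0][1], -item[1])):
    print(f"P({e} | {f}) = {p:.3f}")
```

Because "novas" also appears alone with "tale", the estimate pushes "novas" toward "tale" and, by elimination, "mas" toward "my". A word aligner can start from a lexicon of this kind before the manual correction step.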

SLIDE 17

Morphological Annotation - TnT Tagger

SLIDE 18

Syntactic Annotation - Berkeley Parser

SLIDE 19

Syntactic Annotation

“...nor did he want to omit Flamenca”

SLIDE 20

Discourse Annotation - Speakers

The labels correspond to the main characters' names, namely Flamenca, Archambaut, Guillem, Father, King, and Queen. Less important characters are marked as FemaleSpeakers and MaleSpeakers.
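The labelling rule can be sketched as a small lookup. The label names come from the slide; the function `speaker_label` and its `gender` argument are hypothetical and only show how minor characters might fall back to the two generic labels.

```python
# Main characters named on the slide; each keeps its own label.
MAIN_CHARACTERS = {"Flamenca", "Archambaut", "Guillem", "Father", "King", "Queen"}


def speaker_label(name, gender):
    """Return the discourse label for a speaker.

    `gender` ("female" or "male") is a hypothetical input used only to pick
    the fallback label for minor characters.
    """
    if name in MAIN_CHARACTERS:
        return name
    if gender == "female":
        return "FemaleSpeakers"
    if gender == "male":
        return "MaleSpeakers"
    raise ValueError(f"unknown gender: {gender!r}")


print(speaker_label("Flamenca", "female"))  # main character keeps her name
print(speaker_label("servant", "male"))     # minor character gets a generic label
```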

SLIDE 21

Parallel Alignment Annotation

SLIDE 22

Corpus Design

Since we are targeting two different types of users, linguists and non-linguists, with different needs, the corpus is made available in two different modes:
• Web Interface: Users can browse the text and look up translations, glosses, and comments
• Query Search: Users interested in the linguistic annotation can query the corpus online

SLIDE 23

1. Web Database

Glossary definitions, comments, and footnotes are linked to tokens and are made visible when the user hovers over a marked word.

SLIDE 24

2. Search Tool (ANNIS)

Our web search, based on ANNIS, supports basic queries for a word or phrase as well as more complex queries over the syntactic, morphosyntactic, discourse, and alignment annotation.

SLIDE 25

Corpus Study: Null Subjects, Corpus Search, Results

Null Subject

Modern Occitan varieties are null subject languages (Hinzelin and Kaiser, 2012)

(1) Ø Era pertot, dintrava pertot
    Ø was everywhere, entered everywhere
    ‘the light was everywhere and it was coming from everywhere’
    (Lo bòn de la nuòch, Max Rouquette)

SLIDE 26

Previous Findings - Overt Subject

Overt subjects - disambiguation or “mise en relief” (emphasis)

(2) Femna que ieu ame illuminada de non rèn
    ‘Woman who I love illuminated from nothingness’
    (Saume dins lo vent, Serge Bec)

• Person - more frequent with 1st person (Vance, 2009)
• Genre - more frequent in prose
• No difference by clause types (Sitaridou, 2005)
• Subjunctive clause - preference for null subject (Vance, 2009)

SLIDE 27

Search by Genre

Discourse annotation: Flamenca, King, etc.
ex. speaker="Flamenca"

• Narrative vs. dialogues
• Male vs. female
• High social rank vs. low social rank

SLIDE 28

Search by Person

Token annotation: I, you, it, etc.
ex. tok="it"

• Personal pronouns
• Impersonal pronouns

SLIDE 29

Search by Clause Type

Syntactic annotation:
ex. matrix: cat="IP" >[func="MAT"]
    embedded: cat="IP" >[func="SUB"]

• Main clause
• Embedded clause

SLIDE 30

Overt vs. Null Subjects

Searching for null subjects:

SLIDE 31

Overt vs. Null Subjects

Searching for overt subjects:

SLIDE 32

Overt vs. Null Subjects - 1000 lines

Factor                Null: n (%)   Overt: n (%)
Total                 308 (87)      45 (13)
Matrix clause         34 (87)       5 (13)
Embedded clause       107 (84)      21 (16)
Impersonal pronouns   23 (92)       2 (8)
1st person            32 (70)       14 (30)
2nd person            28 (88)       4 (12)
3rd person            200 (91)      19 (9)
Narration             187 (91)      19 (9)
Discourse             121 (88)      26 (12)
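The percentages can be recomputed from the raw counts as a quick consistency check. The sketch below assumes each percentage is the row-wise share of null or overt subjects; the helper name `shares` is hypothetical. Under that assumption every row reproduces the printed figures except Discourse, where 121 null and 26 overt subjects give 82% and 18% rather than 88% and 12%. That may be a typo on the slide or a different denominator.

```python
# (null, overt) counts copied from the table on this slide.
counts = {
    "Total": (308, 45),
    "Matrix clause": (34, 5),
    "Embedded clause": (107, 21),
    "Impersonal pronouns": (23, 2),
    "1st person": (32, 14),
    "2nd person": (28, 4),
    "3rd person": (200, 19),
    "Narration": (187, 19),
    "Discourse": (121, 26),
}


def shares(null, overt):
    """Return the rounded row-wise percentages of null and overt subjects."""
    total = null + overt
    return round(100 * null / total), round(100 * overt / total)


for factor, (null, overt) in counts.items():
    null_pct, overt_pct = shares(null, overt)
    print(f"{factor:<20} {null:>4} ({null_pct}%) {overt:>4} ({overt_pct}%)")
```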

SLIDE 33

Explicit Impersonal Pronouns

Only tonic pronouns

1. mais aisso -m par causa tro brava
   ‘but it seems to me hard’
2. mais so fon sos meillors thesaurs
   ‘and it was her greatest treasure’

SLIDE 34

Social Variation

Factor               Null: n (%)   Overt: n (%)
Male speakers        46 (87)       7 (13)
Female speakers      29 (78)       8 (12)
High social status   54 (86)       9 (13)
Low social status    29 (83)       6 (17)
Author               11 (92)       1 (8)

SLIDE 35

Conclusion

In contrast to traditional corpora, this corpus is structured to fulfill two objectives:
• The web design facilitates the reading and understanding of The Romance of Flamenca. Words are interactively linked to the glossary, comments, and translations.
• The corpus search design, via its ANNIS interface, allows for visualization and for complex queries of the (morpho-)syntactic, discourse, and parallel word-aligned annotations.

SLIDE 36

Future Directions

• Completion of manual correction
• Preservation and annotation of other Old Occitan manuscripts
• Building a collaborative effort to continue this project

SLIDE 37

Bibliography I

ARTFL Project. Provençal Poetry database (American and French Research on the Treasury of the French Language), Robert Morrissey, director, with F.R. Akehurst, 1998. URL http://artfl-project.uchicago.edu/content/proven%5Cc%7Bc%7Dal.

E.D. Blodgett. The Romance of Flamenca. Garland, New York, 1995.

W.A. Bradley. The Story of Flamenca. Harcourt Brace, New York, 1922.

Mark Davies. Corpus del Español: 100 million words, 1200s-1900s. Available online at http://www.corpusdelespanol.org, 2002.

Mark Davies and Michael Ferreira. Corpus do Português: 45 million words, 1300s-1900s. Available online at http://www.corpusdoportugues.org, 2006.

Andrés Enrique-Arias. Parallel texts in diachronic investigations: insights from a parallel corpus of Spanish medieval Bible translations. In Exploring Ancient Languages through Corpora (EALC), 2012.

Suzanne Fleischmann. The non-lyric texts. In F.R.P. Akehurst and Judith M. Davis, editors, A Handbook of the Troubadours, pages 176–184. University of California Press, 1995.

SLIDE 38

Bibliography II

Marc-Olivier Hinzelin and Georg A. Kaiser. Etudes de linguistique gallo-romane, chapter Le paramètre du sujet nul dans les variétés dialectales de l’occitan et du francoprovençal, pages 247–260. Presses Universitaires de Vincennes, Saint-Denis, 2012.

Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In M. Lalmas et al., editors, ECIR 2006, LNCS 3936, pages 407–419. Springer-Verlag, 2006.

Sandra Kübler and Heike Zinsmeister. Corpus Linguistics and Linguistically Annotated Corpora. Bloomsbury, 2014.

Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, HLT-NAACL ’06, pages 104–111, New York, NY, 2006.

France Martineau, Paul Hirschbühler, Anthony Kroch, and Yves Charles Morin. Corpus MCVF (parsed corpus), Modéliser le changement: les voies du français, Département de français, University of Ottawa. CD-ROM, first edition, http://www.arts.uottawa.ca/voies/voies_fr.html, 2010.

Paul Meyer. Le Roman de Flamenca. Béziers, 1865.

Paul Meyer. Le Roman de Flamenca. Librairie Emile Bouillon, 2nd edition, 1901.

Peter T. Ricketts and Alan Reed. Concordance de l’Occitan Médiéval. COM 2: Les Troubadours, Les Textes Narratifs en vers. Brepols, Turnhout, 2005.

SLIDE 39

Bibliography III

Olga Scrivner, Sandra Kübler, Barbara Vance, and Eric Beuerlein. Le Roman de Flamenca: An annotated corpus of Old Occitan. In Francesco Mambrini, Marco Passarotti, and Caroline Sporleder, editors, Proceedings of the Third Workshop on Annotation of Corpora for Research in Humanities, pages 85–96, 2013.

Ioanna Sitaridou. Corpora and Diachronic Linguistics, chapter A corpus-based study of null subjects in Old French and Old Occitan, pages 359–374. Narr, Tübingen, 2005.

Achim Stein. Syntactic annotation of Old French text corpora. Corpus, 7: 157–161, 2008.

Dan Tufis. From word alignment to word senses, via multilingual wordnets. In Computer Science Journal of Moldova, volume 14, pages 3–33, 2006.

Barbara Vance. The evolution of subject pronoun systems in Medieval Occitan. Manuscript. Indiana University, 2009.
