Example Sentences and Making them Useful for Theoretical and Computational Linguistics
Stefan Müller
Email: Stefan.Mueller@cl.uni-bremen.de
http://www.cl.uni-bremen.de/stefan/
DGfS-Jahrestagung Mainz, 27.02.2004
Outline
- Why test suites / data collections?
- What do we have?
- B-Ger-TS
- Demo
- Suggestions for using test suites / data collections
- Guidelines
- Conclusions
Why are Test Suites Needed for NLP?
- Language is very complex → minimal changes to a grammar may have unexpected effects
- Check improvement in grammar development
  – coverage
  – processing speed
  – memory requirements
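A regression check of this kind can be sketched as follows. This is a minimal illustration, not the setup used in the talk: `run_test_suite`, the two toy parsers, and the item list are hypothetical stand-ins, and a parser is assumed to be any callable that returns the number of analyses for a sentence.

```python
import time
import tracemalloc


def run_test_suite(parse, items):
    """Run `parse` over (sentence, expected_grammatical) pairs and report
    coverage, overgeneration, elapsed time and peak memory."""
    good = sum(1 for _, ok in items if ok)
    bad = len(items) - good
    accepted_good = 0
    rejected_bad = 0
    tracemalloc.start()
    started = time.perf_counter()
    for sentence, expected in items:
        parsed = parse(sentence) > 0
        if expected and parsed:
            accepted_good += 1
        elif not expected and not parsed:
            rejected_bad += 1
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "coverage": accepted_good / good if good else 1.0,
        "overgeneration": (bad - rejected_bad) / bad if bad else 0.0,
        "seconds": elapsed,
        "peak_bytes": peak,
    }


def toy_parser_v1(sentence):
    # Hypothetical grammar version 1: accepts exactly one phrase.
    return 1 if sentence == "die alte Wand" else 0


def toy_parser_v2(sentence):
    # Hypothetical "minimal change" that silently starts to overgenerate.
    return 1 if sentence.endswith("Wand") else 0


ITEMS = [
    ("die alte Wand", True),
    ("der alte Wand", False),
    ("das alte Wand", False),
]

before = run_test_suite(toy_parser_v1, ITEMS)
after = run_test_suite(toy_parser_v2, ITEMS)
```

Comparing `before` and `after` shows that coverage is unchanged while overgeneration rises, which is the kind of unexpected effect a test suite is meant to catch.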
What Test Suites and Data Bases are There?
- Test Suites developed in TSNLP (Oepen, Netter and Klein, 1997)
  – English
  – German
  – French
- Test Suites that come with [incr TSDB()], which is part of the LKB (Copestake, 2002)
  – English (Lingo, CSLI)
  – German (VM, DFKI)
  – Spanish
  – Japanese
  – Norwegian
- Babel Test Suite
- A3 database (A3-Datenbank) in Tübingen (Sternefeld et al.)
- Others?
Why Should we Have Additional Ones? (I)
- Babel Test Suite is unsystematic, naturally grown from a diploma thesis
- TSNLP is very systematic:
(1) a.   die alte Wand
    b. * der alte Wand
    c. * das alte Wand
    d. * des alte Wand
    e. * den alte Wand
    f. * dem alte Wand
    g. * die alte Wände
    h. * der alte Wände
    i. * das alte Wände
    j. * des alte Wände
    k. * den alte Wände
    l. * dem alte Wände
    m. * der alte Wänden
    n. * die alte Wänden
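A systematic paradigm like (1) can be generated rather than typed by hand. The sketch below is only an illustration and not how TSNLP was built: `build_paradigm` and the constants are hypothetical names, and the set of grammatical combinations is specified by hand as an assumption. The full cross product has 18 rows, of which the slide shows 14.

```python
from itertools import product

DETERMINERS = ["die", "der", "das", "des", "den", "dem"]
NOUN_FORMS = ["Wand", "Wände", "Wänden"]
ADJECTIVE = "alte"

# Assumption: with the adjective form fixed to "alte", only this
# combination is grammatical; everything else gets a star.
GRAMMATICAL = {("die", "alte", "Wand")}


def build_paradigm():
    """Return (phrase, is_grammatical) for every determiner/noun pairing."""
    rows = []
    for noun, det in product(NOUN_FORMS, DETERMINERS):
        phrase = (det, ADJECTIVE, noun)
        rows.append((" ".join(phrase), phrase in GRAMMATICAL))
    return rows


paradigm = build_paradigm()
for phrase, ok in paradigm:
    print(("  " if ok else "* ") + phrase)
```

Varying the adjective form as a further dimension would only require one more list in the `product` call and a correspondingly larger hand-checked `GRAMMATICAL` set.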
Why Should we Have Additional Ones? (II)
TSNLP is systematic, but it covers only a part of what is needed:
- phenomena are missing
- There are tons of strange ungrammatical sentences that are relevant only in the context of a discussion of a particular analysis. Such things are not in TSNLP. Examples:
  – Agreement as head feature and coordination.
  – Haider’s Designated Argument as a head feature and coordination of unergatives and unaccusatives
B-Ger-TS (I)
- B-Ger-TS developed from Babel-TS
- contains examples I gathered over the past ten years
- I started to systematize it and to cross-classify items with regard to phenomena
- extended the database by examples from the literature
- provided references to bibliographic sources
- eliminated lexical ambiguity
B-Ger-TS (II)
- verb position, scrambling, fronting and island data, extraposition, subjacency, . . .
- coherent/incoherent constructions, complex predicates, particle verbs, control and raising, AcI constructions
- incomplete category fronting with adjectives and verbs, multiple frontings
- adjunction in the nominal and verbal area
  – attributive adjectives and participles
  – prepositional phrases
  – relative clauses
- free relative clauses
- left dislocation
- topic drop
B-Ger-TS (III)
- depictive secondary predicates
- passive in various forms (e.g., stative passive, dative passive, lassen passive)
- modal infinitives
- coordination
- and the interaction between all of this!
- items are cross-classified according to the phenomena
- retrieval with respect to various aspects is possible
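Retrieval over cross-classified items can be pictured as an inverted index from phenomenon labels to items. The following is a minimal sketch under that assumption and says nothing about how [incr TSDB()] stores its data; `build_index` and `retrieve` are hypothetical helpers, and the three items with their labels are taken from the examples in these slides.

```python
from collections import defaultdict

# (sentence, phenomena) pairs.
ITEMS = [
    ("daß der Mann schläft, der stirbt.", ["Extraposition"]),
    ("Den Mann liebt Maria, der ihn verachtet.", ["Extraposition"]),
    ("* daß ich nicht weiß, dieses Buch warum ich lesen sollte.",
     ["Extraktion", "w-Satz"]),
]


def build_index(items):
    """Map each phenomenon label to the set of item positions carrying it."""
    index = defaultdict(set)
    for item_id, (_, phenomena) in enumerate(items):
        for phenomenon in phenomena:
            index[phenomenon].add(item_id)
    return index


def retrieve(index, *phenomena):
    """Return the positions of items classified under ALL given phenomena."""
    sets = [index.get(p, set()) for p in phenomena]
    return set.intersection(*sets) if sets else set()


index = build_index(ITEMS)
extraposition = sorted(retrieve(index, "Extraposition"))
extraction_in_wh_clause = sorted(retrieve(index, "Extraktion", "w-Satz"))
```

Intersecting the sets is what makes the interaction of phenomena searchable: a query with two labels returns only items that were classified under both.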
Demo of TSDB
Suggestions for Using Test Suites / Data Collections
- All published grammar fragments should come with a list of the test suites used and the results. (Many already do, mainly those connected to the CSLI/DFKI groups.)
- example: http://www.cl.uni-bremen.de/Fragments/b-ger-gram.html
- Journal articles can be written and reviewed with reference to publicly available data collections.
The Format
- simple ASCII text
- a line starting with ‘;;;’ names a phenomenon, which applies until the next line with ‘;;;’

    ;;; Extraposition
    daß der Mann schläft, der stirbt. ;; Extraposition aus Subjekt
    Der Mann liebt Maria, der ihn verachtet. ;; Extraposition aus Subjekt im Vorfeld
    Den Mann liebt Maria, der ihn verachtet. ;; Extraposition aus Objekt im Vorfeld
    Daß Karl schläft, ist dem Mann aufgefallen, der ihn kennt. ;; @ nach Haider94

- everything that follows ‘;;’ and precedes ‘@’ is a comment
- everything that follows ‘@’ is the source of the example
- cross-classification of phenomena: listing phenomena separated by ‘+’

    ;;; Extraktion + w-Satz
    * daß ich nicht weiß, dieses Buch warum ich lesen sollte. ;; @GMueller98a:244
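The format described above is simple enough to read with a few lines of code. The following is a minimal sketch, not the reader used for B-Ger-TS: `Item` and `parse_test_suite` are hypothetical names, and treating a leading `*` as the mark of an ungrammatical item is an assumption inferred from the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    grammatical: bool
    phenomena: list
    comment: str = ""
    source: str = ""


def parse_test_suite(lines):
    """Parse lines in the ';;;' / ';;' / '@' format into Item records."""
    items = []
    phenomena = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";;;"):
            # New phenomenon header; '+' separates cross-classified labels.
            phenomena = [p.strip() for p in line[3:].split("+") if p.strip()]
            continue
        sentence, _, annotation = line.partition(";;")
        comment, _, source = annotation.partition("@")
        sentence = sentence.strip()
        items.append(Item(
            text=sentence.lstrip("*").strip(),
            grammatical=not sentence.startswith("*"),  # assumed convention
            phenomena=list(phenomena),
            comment=comment.strip(),
            source=source.strip(),
        ))
    return items


sample = """;;; Extraposition
daß der Mann schläft, der stirbt. ;; Extraposition aus Subjekt
Daß Karl schläft, ist dem Mann aufgefallen, der ihn kennt. ;; @ nach Haider94
;;; Extraktion + w-Satz
* daß ich nicht weiß, dieses Buch warum ich lesen sollte. ;; @GMueller98a:244
""".splitlines()

items = parse_test_suite(sample)
```

Each item keeps its own copy of the current phenomenon list, so a later header does not change items that were read earlier.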
Lexical Ambiguity and Efficiency
Ambiguity in case does not hurt, but ambiguity in number does.
(2) a. Will der Manager lachen?
    b. Will der Mann lachen?
Since Manager can also be a plural form, it projects to a full NP on its own, and Manager lachen to a full VP + sentence.
Even worse: If the verb has an optional object, we get unwanted ambiguities:
(3) Will der Manager essen? (der = subject, Manager = object)
(4) a. Will der Manager essen? → 307 passive edges
    b. Will der Mann essen? → 114 passive edges
Lexical Ambiguity and Usability of Test Suites (Grammatical Sentences)
ihr is ambiguous between dative feminine, second person plural, and the possessive pronoun. A theory/grammar that makes wrong claims about case could analyze (5) as a sentence with two nominatives.
(5) Ihr helfen wir.
So the grammatical sentence could be parsed although the theory assigns a wrong structure/wrong case values.
→ general rule for grammatical sentences: Be as specific as possible!
ihr → ihm
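A test suite can be screened for such ambiguous word forms before it is used. The sketch below is purely illustrative: `TOY_LEXICON` and `ambiguous_forms` are hypothetical, and the lexicon records only the distinctions mentioned in this talk rather than a full morphological analysis.

```python
# Hypothetical toy lexicon: word form -> readings discussed in the talk.
TOY_LEXICON = {
    "Manager": ["noun, singular", "noun, plural"],
    "Mann": ["noun, singular"],
    "ihr": [
        "pronoun, dative feminine",
        "pronoun, second person plural",
        "possessive pronoun",
    ],
    "ihm": ["pronoun, dative"],
}


def ambiguous_forms(sentence, lexicon=TOY_LEXICON):
    """Return the words of `sentence` that have more than one reading.

    Words missing from the lexicon are ignored, so an empty result only
    means that no KNOWN ambiguity was found.
    """
    hits = {}
    for token in sentence.split():
        word = token.strip(".,?!")
        readings = lexicon.get(word) or lexicon.get(word.lower()) or []
        if len(readings) > 1:
            hits[word] = readings
    return hits


flagged = ambiguous_forms("Ihr helfen wir.")
clean = ambiguous_forms("Ihm helfen wir.")
```

A check like this flags items for manual review; whether a particular ambiguity is harmful still has to be decided by the person maintaining the test suite.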
Lexical Ambiguity and Usability of Test Suites (Ungrammatical Sentences)
For ungrammatical examples we have to distinguish two cases:
- Ungrammaticality due to wrong case assignments:
(6) * der Mann, den ihn liebt
    (admitted by subsumption-based approaches to Free Relative Clauses)
  Here we have to be specific. Using das Buch instead of den Mann would make the sentence grammatical.
- Ungrammaticality due to other reasons:
(7) a. * Die Frau ist ihm zu helfen.
    b. * Die Frau ist das Buch zu lesen.
  The reverse situation: The case specifications should be as unspecific as possible, so that we can find analyses with wrong case assignment.
→ no minimal pairs in the traditional sense
Conclusion
- For computational linguistics we need
  – large systematic test suites that vary paradigms along several dimensions, e.g. agreement
  – systematic test suites that contain bizarre (ungrammatical) examples from the literature that are only discussed in the context of particular analyses
- For theoretical linguistics we need
  – everything that is now stored in various Zettelkästen (card files) all over the world
- It would be great to have a central place where one could look for data relevant for a particular phenomenon.
- There is a start at http://www.cl.uni-bremen.de/Software/TS/b-ger-ts.html
References
Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars. CSLI Lecture Notes, No. 110, Stanford: CSLI Publications, http://cslipublications.stanford.edu/site/1575862603.html. 12.07.2003.
Ingria, Robert J. P. 1990. The Limits of Unification. In Proceedings of the Twenty-Eighth Annual Meeting of the ACL, pages 194–204, Association for Computational Linguistics, Pittsburgh, Pennsylvania.
Müller, Stefan. 1996. The Babel-System—An HPSG Prolog Implementation. In Proceedings of the Fourth International Conference on the Practical Application of Prolog, pages 263–277, London, http://www.cl.uni-bremen.de/~stefan/Pub/babel.html. 27.02.2004.
Oepen, Stephan, Netter, Klaus and Klein, Judith. 1997. tsnlp — Test Suites for Natural Language Processing. In John Nerbonne (ed.), Linguistic Databases, pages 13–36, Stanford: CSLI Publications, http://www.coli.uni-sb.de/itsdb/publications/tsnlp.ps.gz.