
Accelerated Natural Language Processing, Lecture 2: Morphology

Accelerated Natural Language Processing (ANLP), Lecture 2: Morphology. Sharon Goldwater (based on slides by Philipp Koehn), 17 September 2019.


  1. Two plots from last time

     How Many Different Words?

     10,000 sentences from the Europarl corpus:

     | Language   | Different words |
     |------------|-----------------|
     | English    | 16k             |
     | French     | 22k             |
     | Dutch      | 24k             |
     | Italian    | 25k             |
     | Portuguese | 26k             |
     | Spanish    | 26k             |
     | Danish     | 29k             |
     | Swedish    | 30k             |
     | German     | 32k             |
     | Greek      | 33k             |
     | Finnish    | 55k             |

     Why the difference? Morphology.

     Today’s Lecture

     • What is morphology, how does it differ across languages, and why does it matter for NLP?
     • What’s the difference between a stem, lemma, and affix?
     • What are the characteristics of derivational and inflectional morphology?
     • What is an FSM, and what is the relationship between FSMs and regular languages?

  2. Interlude/reminder: types and tokens

     The word word is ambiguous.

     • Word type: “10k sentences from English Europarl have 16k different words” (unique strings, lexical items)
     • Word token: “English Europarl has 54m words” (possibly repeated instances)

     a cat and a brown dog chased a black dog: 10 tokens, 7 types.

     What is morphology?

     The study of wordforms and word formation.

     • Structured relationships between words:
       – play, played, replay, player
       – played, walked, jumped
     • Units of meaning (morphemes) and their ordering (morphotactics):
       – de+salin+ate+ion but not ate+salin+ion+de

     Why does morphology matter?

     • Information retrieval: return pages with related forms.
     • Language modelling: make predictions about unseen words.
     • Machine translation and language understanding: signals differences in meaning (might be expressed using word order in other languages).

     Why does morphology matter? Example (Russian):

     zhenshina devochke dala knigu
     woman+NOM girl+DAT gave book+ACC
     ‘the woman gave the girl a book’

     vs.

     zhenshine devochka dala knigu
     woman+DAT girl+NOM gave book+ACC
     ‘the girl gave the woman a book’

     A noun’s case marking (a kind of morphology) indicates its role in the sentence, where English uses word order and prepositions.
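
The type/token distinction from the interlude above can be checked on the slide's own example sentence. This is a minimal sketch: the function name `count_tokens_and_types` is made up for illustration, and splitting on whitespace is an assumption that ignores punctuation and case.

```python
def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a whitespace-split text.

    Whitespace tokenisation is an assumption made for this sketch; real
    corpora such as Europarl need proper tokenisation first.
    """
    tokens = text.split()   # every occurrence counts as a token
    types = set(tokens)     # unique strings only
    return len(tokens), len(types)


sentence = "a cat and a brown dog chased a black dog"
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, n_types)  # 10 7
```

The repeated words (a three times, dog twice) account for the gap between the two counts.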

  3. Morphemes: Stems and Affixes

     • Two types of morphemes
       – stems: small, cat, walk
       – affixes: +ed, un+
     • Four types of affixes
       – suffix
       – prefix
       – infix
       – circumfix

     Stems vs. Lemmas

     • Lemma: the canonical form or dictionary form of a set of words
       – fly, flies, flew and flying all have the lemma fly.
       – walk, walks, walked and walking all have the lemma walk.
       – walker, walkers have the lemma walker.
     • Stem: definitions can vary, but often: the part of the word that is common to all its variants
       – stem of produce, production is produc.
       – stem of walk, walks, walked, walking, walker, walkers is walk.
       – Do fly, flies, flew, flying have a common stem fl? Or maybe only fly and flying share a stem: fly.

     Decision may depend on application.

     Suffix

     • Plural of nouns: cat+s
     • Comparative and superlative of adjectives: small+er
     • Formation of adverbs: great+ly
     • Verb tenses: walk+ed
     • All inflectional morphology in English uses suffixes
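
One literal reading of "the part of the word that is common to all its variants" is the longest shared prefix. The sketch below takes that reading as an assumption (it is only one of the possible stem definitions mentioned above), and the helper name `common_prefix` is hypothetical. It reproduces the slide's examples, including the awkward fl for the fly family.

```python
def common_prefix(words):
    """Longest string that every word in `words` starts with.

    Assumes `words` is a non-empty list of strings.
    """
    prefix = words[0]
    for word in words[1:]:
        i = 0
        # Advance while both strings agree character by character.
        while i < min(len(prefix), len(word)) and prefix[i] == word[i]:
            i += 1
        prefix = prefix[:i]
    return prefix


print(common_prefix(["produce", "production"]))         # produc
print(common_prefix(["fly", "flies", "flew", "flying"]))  # fl
print(common_prefix(["fly", "flying"]))                 # fly
```

Whether fl is a useful stem is exactly the application-dependent decision the slide points to.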

  4. Prefix

     • In English: these typically change the meaning
     • Adjectives: un+friendly, dis+interested
     • Verbs: re+consider
     • Some languages use prefixing much more widely

     Other types of morphology

     Mainly in non-English languages; check textbook or online.

     • Infixes
     • Circumfixes
     • Reduplication
     • Root and pattern

     Not that easy...

     • Affixes are not always simply attached
     • In writing, some letters may be changed/added/removed
       – walk+ed
       – frame+d
       – emit+ted
       – carr(-y)+ied
     • In speaking, some sounds may be changed/added/removed
       – Compare the final sound: cats [s] vs dogs [z] vs foxes [əz]

     Irregular Forms

     • Some words have irregular forms:
       – is, was, been
       – eat, ate, eaten
       – go, went, gone
     • Irregular forms tend to be the most frequent (and vice versa)
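
The spelling changes and irregular forms listed above can be sketched as a small rule cascade for the +ed suffix. This is an illustrative sketch, not a complete account of English spelling: the function name `past_tense`, the tiny `IRREGULAR_PAST` table, and the `double_final` flag are all assumptions. The flag is explicit because consonant doubling (emit+ted) depends on stress, which the spelling alone does not show.

```python
VOWELS = set("aeiou")

# Irregular past forms must simply be listed (two examples from the slide).
IRREGULAR_PAST = {"eat": "ate", "go": "went"}


def past_tense(stem, double_final=False):
    """Attach +ed to `stem`, applying a few spelling adjustments.

    `double_final` is supplied by the caller because doubling depends on
    stress, which this sketch does not model.
    """
    if stem in IRREGULAR_PAST:
        return IRREGULAR_PAST[stem]
    if stem.endswith("e"):                      # frame+d
        return stem + "d"
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"                # carr(-y)+ied
    if double_final:                            # emit+ted
        return stem + stem[-1] + "ed"
    return stem + "ed"                          # walk+ed


for stem in ["walk", "frame", "carry", "go"]:
    print(stem, "->", past_tense(stem))
print("emit ->", past_tense("emit", double_final=True))
```

The irregular table is checked first, so listed exceptions always override the regular rules.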

  5. Inflectional Morphology

     • In English, we inflect
       – nouns for count (plural: +s) and for possessive case (+’s)
       – verbs for tense (+ed, +ing) and a special 3rd person singular present form (+s)
       – adjectives in comparative (+er) and superlative (+est) forms.
     • In German, we inflect
       – nouns for count and case
       – verbs for tense, person, and count
       – adjectives for count, case, gender, and definiteness
       – determiners for count, case and gender

     Forms of the German the

     | Case                          | Sg. male | Sg. fem. | Sg. n. | Pl. male | Pl. fem. | Pl. n. |
     |-------------------------------|----------|----------|--------|----------|----------|--------|
     | nominative (s: subject)       | der      | die      | das    | die      | die      | die    |
     | genitive (p: possessive)      | des      | der      | des    | der      | der      | der    |
     | dative (io: indirect object)  | dem      | der      | dem    | den      | den      | den    |
     | accusative (o: direct object) | den      | die      | das    | die      | die      | die    |

     Phrase/role: [The A]/s put [the B]/o [of the C]/p [on the D]/io

     Not only many different forms, but each form is highly ambiguous.

     Inflectional vs. Derivational Morphology

     • Inflectional morphology typically
       – does not change basic meaning or part of speech
       – expresses grammatical features or relations between words
       – applies to all words of the same part of speech
     • Derivational morphology
       – may change the part of speech or meaning of a word
       – is not driven by syntactic relations outside the word
       – may be “picky”: drama+(t)ize but not traged(-y)+ize
       – applies closer to the stem; whereas inflection occurs at word edges: govern+ment+s, centr+al+ize+d

     Derivational Morphology

     • Changing the part of speech, e.g. noun to verb: word → wordify
     • Is it a real word?
     • Consulting Google (a few years ago):
       – 8,840 hits: e.g., wordify mugs, tshirts and magnets
     • Google now returns over 75k hits. (Why?)
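
The ambiguity claim about the German article table above can be made concrete by inverting the table: for each surface form, list every (case, number, gender) cell it fills. This is a sketch under the assumption that the table is encoded as below. The names `ARTICLE_TABLE` and `analyses_by_form` are illustrative, and the gender labels follow the slide.

```python
from collections import defaultdict

GENDERS = ["male", "fem.", "n."]

# case -> (singular forms, plural forms), each ordered as in GENDERS
ARTICLE_TABLE = {
    "nominative": (["der", "die", "das"], ["die", "die", "die"]),
    "genitive":   (["des", "der", "des"], ["der", "der", "der"]),
    "dative":     (["dem", "der", "dem"], ["den", "den", "den"]),
    "accusative": (["den", "die", "das"], ["die", "die", "die"]),
}


def analyses_by_form():
    """Map each article form to all (case, number, gender) cells it fills."""
    result = defaultdict(list)
    for case, (singular, plural) in ARTICLE_TABLE.items():
        for number, forms in (("singular", singular), ("plural", plural)):
            for gender, form in zip(GENDERS, forms):
                result[form].append((case, number, gender))
    return dict(result)


forms = analyses_by_form()
print(len(forms))         # 6 distinct forms cover all 24 cells
print(len(forms["die"]))  # 8
print(len(forms["der"]))  # 6
```

Six forms spread over 24 cells means a tagger seeing die alone cannot tell which of eight analyses applies without context.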

  6. Derivational Morphology

     • Changing the verb back to a noun: wordify → wordification (8k hits on Google)
     • A person/thing who engages in wordification: wordification → wordificator (was 8 hits, now 21k: another app!)
     • A person/thing who wordifies: wordify → wordifier (1500 hits on Google)
     • What is the difference between a wordifier and a wordificator?

     Derivational Morphology

     • Turning wordification into an ideology: wordification → wordificationism (was just 1 hit:)

       I think you’re confusing the term “Democracy” with “Capitalism”; I think you mean “Has Capitalism failed”? No. It hasn’t.

       I agree, Hambone; I’m just trying to correct the wordificationism.

       Where in the world did you get the word “wordificationism”? Not in the Merriam-Webster dictionary, not in the Thesaurus...

     Derivational Morphology

     • An adherent of wordificationism: wordificationism → wordificationist
     • Used to have 0 hits on Google, now you get these slides!
     • We created a new word!

     Compounds

     • Creating new words by merging multiple words
     • (Somewhat) rare in English
       – home work → homework
       – web site → website
     • More common in other languages (like German)
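
The wordify chain above shows derivational suffixes stacking, with each step possibly adjusting the spelling of its base. A rough sketch of that idea follows. The rule table `RULES` and the function `derive` are invented for illustration and cover only the forms on these slides. Each rule names the ending it removes from the base and the material it adds.

```python
# rule name -> (ending removed from the base, material added)
RULES = {
    "ify":     ("", "ify"),          # word -> wordify
    "ication": ("y", "ication"),     # wordify -> wordification
    "er":      ("y", "ier"),         # wordify -> wordifier
    "ator":    ("ation", "ator"),    # wordification -> wordificator
    "ism":     ("", "ism"),          # wordification -> wordificationism
    "ist":     ("ism", "ist"),       # wordificationism -> wordificationist
}


def derive(base, rule):
    """Apply one derivational rule; reject bases the rule does not fit."""
    strip, add = RULES[rule]
    if not base.endswith(strip):
        raise ValueError(f"rule {rule!r} does not apply to {base!r}")
    return base[:len(base) - len(strip)] + add


word = "word"
for rule in ["ify", "ication", "ism", "ist"]:
    word = derive(word, rule)
    print(word)
```

Raising an error for an unsuitable base is a crude stand-in for the "picky" behaviour of derivational affixes; real restrictions are lexical, not just orthographic.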

  7. Acronyms/Initialisms

     • Wikileaks / Guardian, document 2007-081-100110-0444:

       OGA operating in TF Catamount sector moved into Malekshay for operation. LN Shum Khan ran at the sight of the approaching CFA’s. CF utilized the escalation of force doctrine and shouted to stop, fired warning shots and then fired to wound. The LN was hit in the ankle and treated by Element medics on scene. It was determined through discussions with local Elders that the man was a deaf mute that was nervous of the CF operation. Solatia was made in the form of supplies and the Element mission progressed

     Morphology differs across languages

     • Usually a trade-off between morphology and syntax (word order)
       – Some languages have no verb tenses
         → use explicit time references (yesterday)
       – Case inflection determines roles of noun phrase
         → use fixed word order instead
         → use prepositional phrases instead of cased noun phrases
     • Examples from the World Atlas of Language Structures (wals.info)
       – prefixes vs. suffixes
       – cases (zero to more than ten)
       – past tense remoteness distinctions
