
Natural Language Processing Diachronics Dan Klein UC Berkeley - PowerPoint PPT Presentation



  1. 12/1/2014 Natural Language Processing: Diachronics. Dan Klein, UC Berkeley. Includes joint work with Alex Bouchard-Côté, Tom Griffiths, and David Hall.

  2. The Task

  3. Lexical Reconstruction. Latin focus; French feu, Spanish fuego, Italian fuoco, Portuguese fogo.

  4. Tree of Languages. We assume the phylogeny is known. Much work in biology, e.g. work by Warnow, Felsenstein, Steele… Also in linguistics, e.g. Warnow et al., Gray and Atkinson… http://andromeda.rutgers.edu/~jlynch/language.html

  5. Evolution through Sound Changes. Latin camera /kamera/ became French chambre /ʃambʁ/. Deletion: /e/, /a/. Change: /k/ .. /tʃ/ .. /ʃ/. Insertion: /b/. Eng. camera is from Latin (“camera obscura”); Eng. chamber is from Old Fr., before the initial /t/ dropped.

  6. Changes are Systematic. Rule: e → _. camera /kamera/ → camra /kamra/; numerus /numerus/ → numrus /numrus/.

  7. Changes are Contextual. Rule: e → _ / after stress. camera /kamera/ → camra /kamra/.

  8. Changes Have Structure. Rule: _ → b / m_r, or more generally _ → [stop x] / [nasal x]_r. camra /kamra/ → cambra /kambra/.
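
The two rules on slides 6 to 8 can be read as ordered string rewrites. The following is a minimal sketch, not the model from the talk: the function names are hypothetical, phonemes are assumed to be single characters, and stress is assumed to be supplied as the index of the stressed vowel.

```python
import re


def delete_e_after_stress(phonemes: str, stress: int) -> str:
    """Rule: e -> _ / after stress.

    `stress` is the index of the stressed vowel (an assumption of this
    sketch; the slides do not say how stress is represented).
    """
    head, tail = phonemes[: stress + 1], phonemes[stress + 1:]
    return head + tail.replace("e", "")


def insert_b_between_m_and_r(phonemes: str) -> str:
    """Rule: _ -> b / m_r (insert b between m and a following r)."""
    return re.sub(r"m(?=r)", "mb", phonemes)


def evolve(phonemes: str, stress: int) -> str:
    """Apply the two rules in order: deletion first, then insertion."""
    return insert_b_between_m_and_r(delete_e_after_stress(phonemes, stress))


print(evolve("kamera", 1))
```

Because the rules are applied to every word, the same deletion also turns /numerus/ into /numrus/, which is the sense in which the changes are systematic.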

  9. Changes are Systematic. English Great Vowel Shift (Simplified!): “time” = teem, later “time” = taim. [Vowel chart: i, e, a]

  10. Diachronic Evidence. Appendix Probi [ca 300]: tonitru non tonotru. Yahoo! Answers [ca 2000]: tonight not tonite.

  11. Synchronic (Comparative) Evidence. Key idea: changes occur uniformly across the lexicon.

  12. The Data

  13. The Data. Data sets. Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets (FR, IT, PT, ES); target: (Vulgar) Latin.

  14. The Data. Data sets. Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets (FR, IT, PT, ES); target: (Vulgar) Latin. Large: Austronesian; 637 languages; 140K words; incomplete cognate sets; target: Proto-Austronesian.

  15. Austronesian

  16. Austronesian Examples. From the Austronesian Basic Vocabulary Database.

  17. The Model

  18. Simple Model: Single Characters [cf. Felsenstein 81]. [Figure: a tree whose nodes each carry a single character, G or C]
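
For the single-character case, the standard technique associated with Felsenstein is an upward (pruning) pass over the tree. The sketch below is illustrative only: the two-letter alphabet, the per-branch change probability `mu`, the uniform root prior, and the nested-tuple tree encoding are all assumptions made here, not details from the slide.

```python
ALPHABET = ("C", "G")


def transition(parent: str, child: str, mu: float) -> float:
    """P(child | parent) on one branch: change with probability mu."""
    return 1.0 - mu if parent == child else mu / (len(ALPHABET) - 1)


def upward(tree, mu: float) -> dict:
    """Return {state: P(observed leaves below this node | node = state)}.

    A tree is either a leaf character (str) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {s: 1.0 if s == tree else 0.0 for s in ALPHABET}
    left, right = (upward(child, mu) for child in tree)
    message = {}
    for s in ALPHABET:
        from_left = sum(transition(s, t, mu) * left[t] for t in ALPHABET)
        from_right = sum(transition(s, t, mu) * right[t] for t in ALPHABET)
        message[s] = from_left * from_right
    return message


def root_posterior(tree, mu: float = 0.1) -> dict:
    """Posterior over the root character under a uniform root prior."""
    message = upward(tree, mu)
    total = sum(message.values())
    return {s: p / total for s, p in message.items()}


print(root_posterior((("G", "G"), "C")))
```

The string-valued model in the later slides is harder precisely because each node holds a whole word rather than one symbol from a small alphabet.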

  19. Changes are Systematic. [Figure: two trees. Latin /fokus/ with intermediate /fogo/ and modern forms /fwɔko/, /fweɣo/, /fogo/; Latin /kentrum/ with intermediate /sentro/ and modern forms /tʃɛntro/, /sentro/, /sentro/]

  20. Parameters are Branch-Specific. [Figure: tree with LA focus /fokus/ at the root, IB /fogo/ as an internal node, and leaves IT fuoco /fwɔko/, ES fuego /fweɣo/, PT fogo /fogo/; each branch has its own parameters θ_IT, θ_IB, θ_ES, θ_PT] [Bouchard-Côté, Griffiths, Klein, 07]

  21. Edits are Contextual, Structured. [Figure: on the IT branch (θ_IT), /fokus/ becomes /fwɔko/; the o in the context # f is rewritten as w ɔ]

  22. Inference

  23. Learning: Objective. [Figure: the tree over /fokus/, /fogo/, /fwɔko/, /fweɣo/, /fogo/, with variables labeled z and w]

  24. Learning: EM. M-Step: find parameters which fit (expected) sound change counts. Easy: gradient ascent on theta. E-Step: find (expected) change counts given parameters. Hard: variables are string-valued. [Figure: the tree over /fokus/, /fogo/, /fwɔko/, /fweɣo/, /fogo/]
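
To make the "easy" M-step concrete, here is a deliberately simplified sketch: a single softmax-parameterized distribution over outcomes, fit to expected counts by gradient ascent on theta. The talk's model is richer (contextual, structured, branch-specific), so the flat parameterization, the function names, the step size, and the example counts are all assumptions for illustration.

```python
import math


def softmax(theta: dict) -> dict:
    """Turn unnormalized scores into a probability distribution."""
    z = sum(math.exp(v) for v in theta.values())
    return {k: math.exp(v) / z for k, v in theta.items()}


def m_step(expected_counts: dict, steps: int = 2000, lr: float = 0.5) -> dict:
    """Gradient ascent on the expected log-likelihood.

    For a softmax, the (count-normalized) gradient with respect to
    theta[k] is the empirical proportion of k minus the model
    probability of k.
    """
    total = sum(expected_counts.values())
    theta = {k: 0.0 for k in expected_counts}
    for _ in range(steps):
        probs = softmax(theta)
        for k in theta:
            theta[k] += lr * (expected_counts[k] / total - probs[k])
    return softmax(theta)


# Hypothetical expected counts, as an E-step might produce them.
print(m_step({"keep": 7.0, "change": 3.0}))
```

In this flat case the optimum is just the normalized counts, so gradient ascent is overkill; it becomes the natural choice once theta is shared across contexts through features.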

  25. Computing Expectations. Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence. [Figure: ‘grass’ example] [Holmes 01, Bouchard-Côté, Griffiths, Klein 07]

  26. A Gibbs Sampler. [Figure: ‘grass’ example]

  27. A Gibbs Sampler. [Figure: ‘grass’ example]

  28. A Gibbs Sampler. [Figure: ‘grass’ example]

  29. Getting Stuck. How could we jump to a state where the liquids /r/ and /l/ have a common ancestor?

  30. Getting Stuck

  31. Efficient Sampling: Vertical Slices. Single Sequence Resampling versus Ancestry Resampling. [Bouchard-Côté, Griffiths, Klein, 08]

  32. Results

  33. Results: Romance

  34. Learned Rules / Mutations

  35. Learned Rules / Mutations

  36. Results: Austronesian

  37. Examples: Austronesian [Bouchard-Côté, Hall, Griffiths, Klein, 13]

  38. Result: More Languages Help. [Figure: distance from Blust [1993] reconstructions, measured as mean edit distance, plotted against the number of modern languages used]
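
The evaluation metric on slide 38 is a mean edit distance between system reconstructions and reference reconstructions. A minimal sketch follows, assuming unit-cost Levenshtein distance over phoneme strings; the slide does not spell out the cost scheme, and the function names and example pairs are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def mean_edit_distance(pairs) -> float:
    """Average distance over (reconstruction, reference) pairs."""
    return sum(edit_distance(r, g) for r, g in pairs) / len(pairs)


print(mean_edit_distance([("fokus", "fokus"), ("fokus", "fogo")]))
```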

  39. Visualization: Learned Universals. *The model did not have features encoding natural classes.

  40. Regularity and Functional Load. In a language, some pairs of sounds are more contrastive than others (higher functional load). Example: English p/d versus t/th. High load: p/d: pot/dot, pin/din, dress/press, pew/dew, ... Low load: t/th: thin/tin.

  41. Functional Load: Timeline. 1955: Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]. 1967: Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67] (based on 4 languages, as noted by [Hockett, 67; Surendran et al., 06]). Our approach: we reexamined the question with two orders of magnitude more data [Bouchard-Côté, Hall, Griffiths, Klein, 13].

  42. Regularity and Functional Load. Data: only 4 languages from the Austronesian data. [Figure: merger posterior probability plotted against functional load as computed by [King, 67]; each dot is a sound change identified by the system]

  43. Regularity and Functional Load. Data: all 637 languages from the Austronesian data. [Figure: merger posterior probability plotted against functional load as computed by [King, 67]]

  44. Extensions

  45. Cognate Detection. [Figure: ‘fire’ alongside a mix of word forms: /fwɔko/, /vɛrbo/, /tʃɛntro/, /sentro/, /berβo/, /fweɣo/, /vɛrbo/, /fogo/, /sɛntro/] [Hall and Klein, 11]

  46. Grammar Induction. Avg rel gain: 29%. [Figure: bar chart over Portuguese, Swedish, Chinese, Spanish, Slovene, English, Danish, Dutch, with language groupings labeled GL, IE, G, RM, WG, NG] [Berg-Kirkpatrick and Klein, 07]

  47. Language Diversity. Why are the languages of the world so similar? Universal grammar answer: hardware constraints. Common source answer: not much time has passed. [Rafferty, Griffiths, and Klein, 09]
