Sign Clustering and Topic Extraction in Proto-Elamite

  1. Sign Clustering and Topic Extraction in Proto-Elamite
     Logan Born (1), Kate Kelley (2), Nishant Kambhatla (1), Carolyn Chen (1), Anoop Sarkar (1)
     (1) Natural Language Laboratory, School of Computing Science, Simon Fraser University
     (2) Department of Classical, Near Eastern, and Religious Studies, University of British Columbia
     7 June 2019

  2. Outline
     Introduction to Proto-Elamite
     Experiments
       Sign Clustering
       n-Gram Frequency
       LDA Topic Modeling
     Summary
     References

  3. Introduction

  4. Proto-Elamite Overview

  5. Proto-Elamite Overview
     &P008016 = MDP 06, 217
     #atf: lang qpc
     @tablet
     @obverse
     1. |M218+M218| , # header
     2. M056~f M288 , 1(N14) 3(N01)
     3. |M054+M384~i+M054~i| M365 , 5(N01)
     4. M111~e , 4(N14) 1(N01) 3(N39B)
     5. M365 , 1(N14) 3(N01)
     6. M075~g , 1(N14) 3(N01)
     7. M387~l M348 , 1(N14) 3(N01)
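The transliteration above follows a line-oriented layout that is straightforward to read programmatically. The following is a minimal sketch, not the authors' code: it assumes that numbered lines are entries, that the comma separates non-numeric signs from the numeric notation, and that anything after `#` on an entry line is an annotation. The names `parse_atf`, `ENTRY`, and `NUMERAL` are hypothetical.

```python
import re

ATF = """\
&P008016 = MDP 06, 217
#atf: lang qpc
@tablet
@obverse
1. |M218+M218| , # header
2. M056~f M288 , 1(N14) 3(N01)
3. |M054+M384~i+M054~i| M365 , 5(N01)
4. M111~e , 4(N14) 1(N01) 3(N39B)
"""

ENTRY = re.compile(r"^(\d+)\.\s+(.*)$")            # e.g. "2. M056~f M288 , ..."
NUMERAL = re.compile(r"^(\d+)\((N\d+[A-Z]?)\)$")   # e.g. "3(N39B)"

def parse_atf(text):
    """Return a list of (line_no, signs, numerals), one per numbered line."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # skip the &, #atf and @ metadata lines
        # Assumption: text after '#' on an entry line is an annotation.
        body = m.group(2).split("#", 1)[0]
        left, _, right = body.partition(",")
        signs = left.split()
        numerals = []
        for tok in right.split():
            nm = NUMERAL.match(tok)
            if nm:
                numerals.append((int(nm.group(1)), nm.group(2)))
        entries.append((int(m.group(1)), signs, numerals))
    return entries

entries = parse_atf(ATF)
for line_no, signs, numerals in entries:
    print(line_no, signs, numerals)
```

Each entry comes back as a tuple, so later steps can work on the sign lists alone and ignore the numerals.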

  11. Proto-Elamite Overview
      [Figure: numeric signs shown side by side for Proto-Elamite and Proto-Cuneiform: N08A, N01, N14, N34, N48, N45, N50]

  12. Proto-Elamite Overview

  20. Proto-Elamite Data
      ◮ Corpus transcribed by CDLI
      ◮ 1399 texts containing ≥ 1 readable non-numeric sign
      ◮ Average tablet length is 27 signs (10 non-numeric)
      ◮ 1623 sign types
        ◮ 49 numeric
        ◮ 287 basic non-numeric
        ◮ 1087 variants
        ◮ 249 complex graphemes
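The four sign categories listed above can be approximated from the sign names alone. This is a rough sketch under assumptions that the slides do not confirm: numeric signs are taken to be those named `N` plus digits, complex graphemes those wrapped in `|...|`, and variants those carrying a `~` suffix. The function name `sign_class` is hypothetical, and the counts it yields need not match the figures reported above.

```python
import re
from collections import Counter

def sign_class(sign):
    """Classify a sign name by its surface form (assumed naming conventions)."""
    if sign.startswith("|") and sign.endswith("|"):
        return "complex grapheme"       # e.g. |M218+M218|
    if re.fullmatch(r"N\d+[A-Z]*", sign):
        return "numeric"                # e.g. N14, N39B
    if "~" in sign:
        return "variant"                # e.g. M056~f
    return "basic non-numeric"          # e.g. M365

# Sign types taken from the example tablet shown earlier in the deck.
sign_types = ["|M218+M218|", "M056~f", "M288", "M365", "N14", "N01", "N39B"]
counts = Counter(sign_class(s) for s in sign_types)
print(counts)
```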

  24. Data Exploration in Proto-Elamite
      ◮ Goal: Extract information to assist human decipherment experts
      ◮ Hierarchical clustering of signs
      ◮ n-gram frequencies
      ◮ LDA topic modeling

  27. Contributions
      ◮ Rediscover results from manual investigation of the corpus
      ◮ Highlight novel patterns to inform future decipherment attempts
      ◮ Provide code for other groups to work with proto-Elamite

  28. Sign Clustering

  34. Sign Clustering Methodology
      Goal:
      ◮ Group signs with similar distributions.
      Three different clustering techniques:
      ◮ Co-occurrence vectors (left and right neighbors)
      ◮ Hidden Markov Model (HMM) emission probabilities
      ◮ Brown clustering
      Reduce impact of noise by finding common groupings across all three techniques.
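To make the first technique concrete, here is a minimal sketch of left/right neighbor co-occurrence vectors and a cosine similarity between them. It is an illustration only: the slides do not state the exact features, weighting, or distance used, and the toy entries and the names `neighbor_vectors` and `cosine` are hypothetical. A hierarchical clustering step would then operate on these pairwise similarities.

```python
from collections import Counter, defaultdict
from math import sqrt

def neighbor_vectors(entries):
    """Map each sign to counts of its left ("L", s) and right ("R", s) neighbors."""
    vecs = defaultdict(Counter)
    for signs in entries:
        # Boundary markers let first/last position count as a context too.
        padded = ["<s>"] + list(signs) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][("L", padded[i - 1])] += 1
            vecs[padded[i]][("R", padded[i + 1])] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[k] for k, c in u.items())  # Counter yields 0 for missing keys
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Toy entries with placeholder sign names (not real corpus data).
toy_entries = [["A", "X", "B"], ["A", "Y", "B"], ["C", "Z", "D"]]
vecs = neighbor_vectors(toy_entries)
sim_xy = cosine(vecs["X"], vecs["Y"])  # same neighbors on both sides
sim_xz = cosine(vecs["X"], vecs["Z"])  # no shared neighbors
print(sim_xy, sim_xz)
```

Signs that appear in the same contexts, like X and Y here, get a similarity near 1, while signs with disjoint contexts get 0.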

  35. Sign Clustering Results
      Rediscover results from manual work:
      ◮ Groups variants believed to have similar/identical function
      ◮ Groups “syllabic” signs (Dahl 2019, Desset 2016, Meriggi 1971)
      [Figure: clusters from the Neighbor, HMM, and Brown methods]
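One simple way to find groupings shared by the Neighbor, HMM, and Brown clusterings is to keep only the pairs of signs that land in the same cluster under every method. The sketch below illustrates that idea and is not necessarily the procedure the authors used; the cluster contents are placeholders, and `co_clustered_pairs` and `consensus_pairs` are hypothetical names.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Return the set of unordered sign pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def consensus_pairs(*clusterings):
    """Pairs of signs grouped together by every clustering given."""
    pair_sets = [co_clustered_pairs(c) for c in clusterings]
    return set.intersection(*pair_sets)

# Placeholder clusterings over four signs a, b, c, d (not real results).
neighbor = [{"a", "b", "c"}, {"d"}]
hmm = [{"a", "b"}, {"c", "d"}]
brown = [{"a", "b", "d"}, {"c"}]

shared = consensus_pairs(neighbor, hmm, brown)
print(shared)
```

Only the pair (a, b) survives here, since it is the one grouping on which all three toy clusterings agree.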

  38. Sign Clustering Results
      Novel grouping: signs resembling numerals or written with rounded stylus.
      [Figure: clusters from the Neighbor, HMM, and Brown methods]

  39. n-Gram Frequency

  43. n-Gram Frequency Methodology
      Goal:
      ◮ Identify important (i.e. frequently repeated) signs and phrases.
      ◮ See signs in wider context.
      Did not count n-grams containing numeric signs.
      ◮ Want to focus on undeciphered signs.
      ◮ Do not want n-grams spanning multiple entries.
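The two counting rules above can be sketched as follows. This is an illustrative version, assuming each entry is already a list of tokens and that numeric notations look like `3(N01)`; the name `count_ngrams` and the `NUMERIC` pattern are hypothetical. Counting within one entry at a time keeps n-grams from spanning entries.

```python
import re
from collections import Counter

# Assumed shape of a numeric token, e.g. "1(N14)" or "3(N39B)".
NUMERIC = re.compile(r"\d+\(N\d+[A-Z]*\)")

def count_ngrams(entries, n):
    """Count n-grams of non-numeric signs, never crossing an entry boundary."""
    counts = Counter()
    for signs in entries:                      # one entry at a time
        for i in range(len(signs) - n + 1):
            gram = tuple(signs[i:i + n])
            if any(NUMERIC.fullmatch(s) for s in gram):
                continue                       # skip n-grams containing numerals
            counts[gram] += 1
    return counts

# Two entries taken from the example tablet shown earlier in the deck.
tablet = [
    ["M056~f", "M288", "1(N14)", "3(N01)"],
    ["M387~l", "M348", "1(N14)", "3(N01)"],
]
bigrams = count_ngrams(tablet, 2)
print(bigrams)
```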

  47. n-Gram Frequency Results
      Can group n-grams with low edit distance:
        M305 M388 M240 M097~h M004 M218
        M305 M388 M146 M097~h M004 M218
        M305 M388 M347 M097~h M004 M218
      Highlighted signs (the third sign in each line, the only position where the three n-grams differ) may...
      ◮ Qualify M388, identifying specific classes of individual?
      ◮ Form series of names built on M097~h M004 M218?
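Grouping n-grams by edit distance can be sketched with a standard Levenshtein distance computed over signs rather than characters. The grouping step below is a simple greedy pass chosen for brevity, not a method taken from the slides: it does not merge groups that only become linked by a later n-gram. The names `edit_distance` and `group_by_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sign sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def group_by_distance(ngrams, max_dist=1):
    """Greedily place each n-gram in the first group with a close member."""
    groups = []
    for g in ngrams:
        for group in groups:
            if any(edit_distance(g, h) <= max_dist for h in group):
                group.append(g)
                break
        else:
            groups.append([g])
    return groups

ngrams = [
    tuple("M305 M388 M240 M097~h M004 M218".split()),
    tuple("M305 M388 M146 M097~h M004 M218".split()),
    tuple("M305 M388 M347 M097~h M004 M218".split()),
    ("M387~l", "M348"),  # unrelated bigram from the example tablet
]
groups = group_by_distance(ngrams)
print(len(groups), [len(g) for g in groups])
```

The three six-sign sequences differ in a single position, so they fall into one group, while the unrelated bigram forms its own.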
