slide-1
SLIDE 1

Sign Clustering and Topic Extraction in Proto-Elamite

Logan Born¹, Kate Kelley², Nishant Kambhatla¹, Carolyn Chen¹, Anoop Sarkar¹

¹ Natural Language Laboratory, School of Computing Science, Simon Fraser University
² Department of Classical, Near Eastern, and Religious Studies, University of British Columbia

7 June 2019

1 / 37

slide-2
SLIDE 2

Outline

◮ Introduction to Proto-Elamite
◮ Experiments
    ◮ Sign Clustering
    ◮ n-Gram Frequency
    ◮ LDA Topic Modeling
◮ Summary
◮ References

2 / 37

slide-3
SLIDE 3

Introduction

3 / 37

slide-4
SLIDE 4

Proto-Elamite

Overview

4 / 37

slide-5
SLIDE 5

Proto-Elamite

Overview

&P008016 = MDP 06, 217
#atf: lang qpc
@tablet
@obverse
1. |M218+M218| , # header
2. M056~f M288 , 1(N14) 3(N01)
3. |M054+M384~i+M054~i| M365 , 5(N01)
4. M111~e , 4(N14) 1(N01) 3(N39B)
5. M365 , 1(N14) 3(N01)
6. M075~g , 1(N14) 3(N01)
7. M387~l M348 , 1(N14) 3(N01)
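
To make the structure of the transliteration concrete, here is a minimal Python sketch that splits one entry line into its sign string and its numeric notation. The function name `parse_entry` and both regular expressions are illustrative assumptions, not part of CDLI's tooling or the authors' toolkit; the sketch assumes every entry has the shape `<index>. <signs> , <numerals>`.

```python
import re

# Assumed entry shape: "<index>. <non-numeric signs> , <numeric notation>"
ENTRY = re.compile(r"^(\d+)\.\s*(.*?)\s*,\s*(.*)$")
NUMERAL = re.compile(r"^(\d+)\((N\w+)\)$")  # e.g. "3(N01)" -> 3 impressions of N01

def parse_entry(line):
    """Split one entry line into (index, signs, numerals)."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None  # metadata lines such as "@tablet" or "#atf: lang qpc"
    index, sign_part, number_part = match.groups()
    signs = sign_part.split()
    numerals = []
    for token in number_part.split():
        m = NUMERAL.match(token)
        if m:  # tokens that are not numerals (e.g. "# header") are skipped
            numerals.append((int(m.group(1)), m.group(2)))
    return int(index), signs, numerals

entry = parse_entry("4. M111~e , 4(N14) 1(N01) 3(N39B)")
```

Here `entry` holds the entry index, the list of non-numeric signs, and a list of (count, numeric sign) pairs. Lines that carry only metadata return `None`.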

5 / 37

slide-11
SLIDE 11

Proto-Elamite

Overview

[Figure: numeric signs N08A, N01, N14, N34, N48, N45, N50 as written in Proto-Elamite and in Proto-Cuneiform]

6 / 37

slide-12
SLIDE 12

Proto-Elamite

Overview

7 / 37

slide-20
SLIDE 20

Proto-Elamite

Data

◮ Corpus transcribed by CDLI
◮ 1399 texts containing ≥ 1 readable non-numeric sign
◮ Average tablet length is 27 signs (10 non-numeric)
◮ 1623 sign types
    ◮ 49 numeric
    ◮ 287 basic non-numeric
    ◮ 1087 variants
    ◮ 249 complex graphemes

8 / 37
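
Counts like these can be derived from a tokenized corpus. The Python sketch below is only an illustration: it assumes that a sign's category can be read off its transliterated name (`N...` for numeric, `|...|` for complex graphemes, `~` for variants), which may not match how the authors counted. The tablets are toy data and the function names are hypothetical.

```python
from collections import Counter

def sign_category(sign):
    """Classify a sign name by its transliteration shape (assumed convention)."""
    if sign.startswith("|") and sign.endswith("|"):
        return "complex grapheme"   # e.g. |M218+M218|
    if sign.startswith("N"):
        return "numeric"            # e.g. N14
    if "~" in sign:
        return "variant"            # e.g. M056~f
    return "basic non-numeric"      # e.g. M288

def corpus_summary(tablets):
    """tablets: list of tablets, each a list of sign names."""
    types = {sign for tablet in tablets for sign in tablet}
    by_category = Counter(sign_category(s) for s in types)
    mean_length = sum(len(t) for t in tablets) / len(tablets)
    return len(types), dict(by_category), mean_length

# Toy data, not the real corpus
toy_tablets = [
    ["|M218+M218|", "M056~f", "M288", "N14", "N01"],
    ["M365", "N14", "N01", "M288"],
]
n_types, by_category, mean_length = corpus_summary(toy_tablets)
```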

slide-24
SLIDE 24

Data Exploration in Proto-Elamite

◮ Goal: Extract information to assist human decipherment experts
    ◮ Hierarchical clustering of signs
    ◮ n-gram frequencies
    ◮ LDA topic modelling

9 / 37

slide-27
SLIDE 27

Contributions

◮ Rediscover results from manual investigation of the corpus
◮ Highlight novel patterns to inform future decipherment attempts
◮ Provide code for other groups to work with proto-Elamite

10 / 37

slide-28
SLIDE 28

Sign Clustering

11 / 37

slide-34
SLIDE 34

Sign Clustering

Methodology

Goal:

◮ Group signs with similar distributions.

Three different clustering techniques:

◮ Co-occurrence vectors (left and right neighbors)
◮ Hidden Markov Model (HMM) emission probabilities
◮ Brown clustering
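
As an illustration of the first technique, the Python sketch below builds a left/right neighbor count vector for each sign and compares two signs by cosine similarity. The similarity measure and the toy entries are assumptions made for this example; the slides do not state which distance was used before hierarchical clustering.

```python
import math
from collections import Counter, defaultdict

def neighbor_vectors(entries):
    """Map each sign to counts of its left ("L", x) and right ("R", x) neighbors."""
    vectors = defaultdict(Counter)
    for entry in entries:
        for i, sign in enumerate(entry):
            if i > 0:
                vectors[sign][("L", entry[i - 1])] += 1
            if i + 1 < len(entry):
                vectors[sign][("R", entry[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy entries (lists of non-numeric signs)
entries = [
    ["M305", "M388", "M240", "M097~h"],
    ["M305", "M388", "M146", "M097~h"],
    ["M004", "M218", "M288"],
]
vectors = neighbor_vectors(entries)
similarity = cosine(vectors["M240"], vectors["M146"])
```

In the toy data M240 and M146 occur in identical contexts, so their similarity is (up to rounding) 1. A hierarchical clustering step would then merge the most similar signs first.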

Reduce impact of noise by finding common groupings across all three techniques.
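
One simple way to find groupings common to all three techniques is to keep only the pairs of signs that every clustering places together. The slides do not specify the exact procedure, so the Python sketch below is an assumption: it works on flat clusters (for example, obtained by cutting each hierarchy at some level), and the cluster contents are invented for illustration.

```python
from itertools import combinations

def co_clustered_pairs(clustering):
    """Return every unordered pair of signs that share a cluster."""
    pairs = set()
    for cluster in clustering:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def consensus_pairs(*clusterings):
    """Keep only the pairs that every clustering groups together."""
    pair_sets = [co_clustered_pairs(c) for c in clusterings]
    return set.intersection(*pair_sets)

# Hypothetical flat clusterings from the three techniques
neighbor = [{"M240", "M146", "M347"}, {"M288", "M218"}]
hmm = [{"M240", "M146"}, {"M347", "M288", "M218"}]
brown = [{"M240", "M146", "M218"}, {"M347", "M288"}]
agreed = consensus_pairs(neighbor, hmm, brown)
```

Only the pair M240/M146 is grouped by all three toy clusterings, so it is the only pair that survives.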

12 / 37

slide-35
SLIDE 35

Sign Clustering

Results

Rediscover results from manual work:

◮ Groups variants believed to have similar/identical function

13 / 37

slide-36
SLIDE 36

Sign Clustering

Results

Rediscover results from manual work:

◮ Groups “syllabic” signs (Dahl 2019, Desset 2016, Meriggi 1971)

[Figure: example clusters, with panels Neighbor, HMM, Brown]

13 / 37

slide-38
SLIDE 38

Sign Clustering

Results

Novel grouping: signs resembling numerals or written with rounded stylus.

[Figure: example clusters, with panels Neighbor, HMM, Brown]

14 / 37

slide-39
SLIDE 39

n-Gram Frequency

15 / 37

slide-43
SLIDE 43

n-Gram Frequency

Methodology

Goal:

◮ Identify important (i.e. frequently repeated) signs and phrases.
◮ See signs in wider context.

Did not count n-grams containing numeric signs.

◮ Want to focus on undeciphered signs.
◮ Do not want n-grams spanning multiple entries.
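
The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes numeric signs can be recognized by an `N` prefix, and it treats each entry as a separate list so that no n-gram crosses an entry boundary.

```python
from collections import Counter

def is_numeric(sign):
    """Assumed convention: numeric signs are named N..., e.g. N14."""
    return sign.startswith("N")

def count_ngrams(entries, n):
    """Count n-grams within each entry, skipping any that contain a numeric sign."""
    counts = Counter()
    for entry in entries:               # never slide across an entry boundary
        for i in range(len(entry) - n + 1):
            gram = tuple(entry[i:i + n])
            if not any(is_numeric(s) for s in gram):
                counts[gram] += 1
    return counts

# Toy entries with their numeric signs left in place
entries = [
    ["M056~f", "M288", "N14", "N01"],
    ["M387~l", "M348", "N14", "N01"],
    ["M056~f", "M288", "N01"],
]
bigrams = count_ngrams(entries, 2)
```

In the toy data only two bigrams survive the filter: M056~f M288 (twice) and M387~l M348 (once).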

16 / 37

slide-48
SLIDE 48

n-Gram Frequency

Results

Can group n-grams with low edit distance:

M305 M388 M240 M097~h M004 M218
M305 M388 M146 M097~h M004 M218
M305 M388 M347 M097~h M004 M218

Highlighted signs (M240, M146, M347) may...

◮ Qualify M388?
    ◮ Identifying specific classes of individual
◮ Form series of names built on M097~h M004 M218?
    ◮ Alternating initial syllable/logogram, as in Old Elamite Tem-Sanit, Kuk-Sanit

17 / 37
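
Grouping by edit distance can be sketched as follows in Python. The distance here is standard Levenshtein distance computed over whole signs rather than characters; the threshold of one edit and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sign sequences (tuples of sign names)."""
    previous = list(range(len(b) + 1))
    for i, sign_a in enumerate(a, start=1):
        current = [i]
        for j, sign_b in enumerate(b, start=1):
            cost = 0 if sign_a == sign_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_matches(ngrams, max_distance=1):
    """List pairs of n-grams within max_distance edits of each other."""
    return [(x, y) for i, x in enumerate(ngrams) for y in ngrams[i + 1:]
            if edit_distance(x, y) <= max_distance]

ngrams = [
    ("M305", "M388", "M240", "M097~h", "M004", "M218"),
    ("M305", "M388", "M146", "M097~h", "M004", "M218"),
    ("M305", "M388", "M347", "M097~h", "M004", "M218"),
]
pairs = near_matches(ngrams)
```

All three sequences differ from one another by a single substitution in the third position, so every pair is reported.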

slide-49
SLIDE 49

Bigram Frequency

Bigram frequency (with constituent unigram counts)

Bigram           Count  Unigram counts
M371 M288        22     (308, 829)
M259 M218        22     (69, 525)
M377~e M347      23     (56, 84)
M096 M288        26     (212, 829)
M009 M371        30     (223, 308)
M347 M371        31     (84, 308)
M388 M218        32     (620, 525)
M218 M288        36     (525, 829)
M305 M388        40     (127, 620)
M004 M218        45     (115, 525)

18 / 37

slide-50
SLIDE 50

Trigram Frequency

Trigram frequency (with constituent bigram counts)

Trigram                   Count  Bigram counts
M347 M219 M101            5      (6, 8)
M219 M218 M288            5      (16, 36)
M371 M009 M371            5      (6, 30)
M259 M218 M288            5      (22, 36)
|M131+M388| M101 M066     5      (5, 8)
M386~a M240 M096          6      (12, 16)
M004 M263 M218            7      (9, 18)
M340 M054 M388            7      (14, 17)
M097~h M004 M218          11     (13, 45)
M377~e M347 M371          17     (23, 31)
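
Reporting each trigram alongside the counts of its two constituent bigrams shows how tightly its parts are bound: M097~h M004 M218 occurs 11 times, while its first bigram M097~h M004 occurs only 13 times in total. A minimal Python sketch of such a report follows; the function names and the toy entries are hypothetical, and filtering of numeric signs is omitted for brevity.

```python
from collections import Counter

def ngram_counts(entries, n):
    """Count n-grams inside each entry (no filtering, for brevity)."""
    return Counter(tuple(e[i:i + n]) for e in entries for i in range(len(e) - n + 1))

def trigram_report(entries):
    """Pair each trigram count with the counts of its two constituent bigrams."""
    bigrams, trigrams = ngram_counts(entries, 2), ngram_counts(entries, 3)
    return {t: (c, bigrams[t[:2]], bigrams[t[1:]]) for t, c in trigrams.items()}

# Toy entries, not the real corpus
entries = [
    ["M097~h", "M004", "M218"],
    ["M097~h", "M004", "M218"],
    ["M263", "M004", "M218"],
    ["M097~h", "M004"],
]
report = trigram_report(entries)
```

Each value in `report` has the same layout as a row of the table: trigram count, then the counts of the first and second bigram.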

19 / 37

slide-54
SLIDE 54

n-Gram Frequency

Results

Little repetition relative to the size of the corpus:

◮ Many frequent n-grams contain object signs (esp. M288)

⇒ probably span a “word” boundary

Suggests complex sign strings do not encode information of wide importance to the PE administration.

20 / 37

slide-55
SLIDE 55

LDA Topic Modeling

21 / 37

slide-59
SLIDE 59

LDA Topic Modeling

Methodology

Goal:

◮ Group related signs into interpretable topics.
◮ Identify genres of related texts.

Omit numeric signs.

Used 10 topics for ease of interpretation.
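
The slides do not say which LDA implementation was used. As a rough illustration of the setup, the Python sketch below treats each tablet as a bag of non-numeric signs and fits topics with a toy collapsed Gibbs sampler. The hyperparameters `alpha` and `beta`, the `N` prefix test for numeric signs, the two-topic setting, and the tablets themselves are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of sign names."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]          # n(d, k)
    topic_word = [[0] * V for _ in range(n_topics)]     # n(k, w)
    topic_total = [0] * n_topics                        # n(k)
    assignments = []
    for d, doc in enumerate(docs):                      # random initialisation
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = assignments[d][i], word_id[w]
                doc_topic[d][k] -= 1                    # remove current assignment
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                           / (topic_total[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k                   # resample, restore counts
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Per-document topic proportions, smoothed by alpha
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    return vocab, topic_word, theta

def drop_numeric(tablet):
    """Assumed convention: numeric signs are named N..."""
    return [s for s in tablet if not s.startswith("N")]

# Toy tablets, not the real corpus
tablets = [["M346", "N01", "M346", "M297"], ["M388", "M124", "N14", "M388"],
           ["M346", "M297", "N01"], ["M124", "M388", "M124"]]
docs = [drop_numeric(t) for t in tablets]
vocab, topic_word, theta = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` gives one tablet's topic proportions, and each row of `topic_word` gives the sign counts assigned to one topic. With the real corpus, `n_topics=10` would match the setting on the slide.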

22 / 37

slide-60
SLIDE 60

LDA Topic Modeling

Overview

◮ Few overlapping/redundant topics.

[Figure: topic map with axes PC1 and PC2, showing topics 1–10]

23 / 37
slide-61
SLIDE 61

LDA Topic Modeling

Results

◮ Small livestock husbandry/slaughter (Dahl, 2005)

[Figure: topic map with axes PC1 and PC2; topics 4 and 7 labeled Livestock]

24 / 37
slide-62
SLIDE 62

LDA Topic Modeling

Results

◮ Labor administration (Damerow and Englund, 1989; Nissen et al., 1994)

[Figure: topic map with axes PC1 and PC2; topics 4 and 7 labeled Livestock, topic 10 Labor Administration]

25 / 37
slide-63
SLIDE 63

LDA Topic Modeling

Results

◮ Rediscovered the “syllabary” (again!)

[Figure: topic map with axes PC1 and PC2; topics 4 and 7 labeled Livestock, topic 6 Syllabary, topic 10 Labor Administration]

26 / 37
slide-64
SLIDE 64

LDA Topic Modeling

Results

◮ Sealed tablets. Not clear what content these tablets have in common.

[Figure: topic map with axes PC1 and PC2; topics 4 and 7 labeled Livestock, topic 5 Sealed, topic 6 Syllabary, topic 10 Labor Administration]

27 / 37
slide-65
SLIDE 65

LDA Topic Modeling

Results

◮ Same number systems (B, B#); associated with rationing.

[Figure: topic map with axes PC1 and PC2; topics 4 and 7 labeled Livestock, topic 5 Sealed, topic 6 Syllabary, topic 9 Rationing, topic 10 Labor Administration]

28 / 37
slide-66
SLIDE 66

LDA Topic Modeling

Results

◮ Agriculture, including cattle?

[Figure: topic map with axes PC1 and PC2; topic 1 labeled Cattle?, topics 4 and 7 Livestock, topic 5 Sealed, topic 6 Syllabary, topic 9 Rationing, topic 10 Labor Administration]

29 / 37
slide-67
SLIDE 67

LDA Topic Modeling

Results

◮ Other topics require further interpretation:

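[Figure: topic map with axes PC1 and PC2; topic 1 labeled Cattle?, topic 2 Agriculture?, topic 3 Beer?, topics 4 and 7 Livestock, topic 5 Sealed, topic 6 Syllabary, topic 8 ?, topic 9 Rationing, topic 10 Labor Administration]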

slide-70
SLIDE 70

Summary

Techniques

◮ Sign clustering
◮ n-gram frequencies
◮ LDA topics

31 / 37

slide-76
SLIDE 76

Summary

Results

✓ Rediscovered known groups of related signs and tablets.
✓ Discovered new groups of related signs and tablets
    ◮ Signs written with circular stylus
    ◮ Sealed tablets with unclear relationship to one another
✓ Group number systems based on tablet content.
✓ Low number of repeated n-grams.

32 / 37

slide-81
SLIDE 81

Future Work

◮ Cross-lingual comparisons
    ◮ Sumerian or Akkadian accounting tablets
    ◮ Proto-Cuneiform
◮ Collapse sign variants
    ◮ Determine which variants are meaningfully distinct.

33 / 37

slide-85
SLIDE 85

Resources

github.com/sfu-natlang/pe-decipher-toolkit

◮ Additional figures
◮ Discussion
◮ Code

34 / 37

slide-86
SLIDE 86

Acknowledgments

Thank you for your attention!

We would like to thank Jacob Dahl and the anonymous reviewers for their helpful remarks. Thanks also to Barbara Winter of the SFU Museum of Archaeology and Ethnology for putting the authors in contact and hosting our first meetings. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada grants NSERC RGPIN-2018-06437 and RGPAS-2018-522574 and a Department of National Defence (DND) and NSERC grant DGDND-2018-00025 to the last author.

35 / 37

slide-87
SLIDE 87

References I

Jacob L. Dahl. Animal husbandry in Susa during the proto-Elamite period. Studi Micenei ed Egeo-Anatolici, 47:81–134, 2005.

Jacob L. Dahl. Tablettes et fragments proto-élamites / proto-Elamite tablets and fragments. Textes Cunéiform Tomes XXXII Musée de Louvre, 2019.

Peter Damerow and Robert K. Englund. The Proto-Elamite Texts from Tepe Yahya. Bulletin (American School of Prehistoric Research). Peabody Museum of Archaeology and Ethnology, Harvard University, 1989. ISBN 9780873655422. URL http://www.hup.harvard.edu/catalog.php?isbn=9780873655422.

François Desset. Proto-Elamite writing in Iran. Archéo-nil. Revue de la société pour l’étude des cultures prépharaoniques de la vallée du Nil, 26:67–104, 2016. URL https://www.academia.edu/30228260/Proto-Elamite_writing_in_Iran.

Piero Meriggi. La scrittura proto-elamica. Parte Ia: La scrittura e il contenuto dei testi. Accademia Nazionale dei Lincei, Rome, 1971.

36 / 37

slide-88
SLIDE 88

References II

Hans J. Nissen, Peter Damerow, and Robert K. Englund. Archaic Bookkeeping: Writing and Techniques of Economic Administration in the Ancient Near East. University of Chicago Press, 1994.

37 / 37

slide-89
SLIDE 89

Unigram Frequency

Sign frequency

Sign    Count
M066    243
M387    249
M346    249
M054    258
M297    265
M124    294
M371    308
M218    525
M388    620
M288    829

38 / 37

slide-90
SLIDE 90

Bigram Frequency

Suspected Anthroponyms

Bigram frequency (with constituent unigram counts)

Bigram           Count  Unigram counts
M242~b M096      13     (25, 212)
M097~h M004      13     (46, 115)
M240 M096        16     (49, 212)
M219 M218        16     (88, 525)
M066 M352~o      18     (243, 40)
M263 M218        18     (174, 525)
M387 M218        21     (249, 525)
M259 M218        22     (69, 525)
M009 M371        30     (223, 308)
M004 M218        45     (115, 525)

39 / 37

slide-91
SLIDE 91

Trigram Frequency

Suspected Anthroponyms

Trigram frequency (with constituent bigram counts)

Trigram                   Count  Bigram counts
M032 M387 M218            3      (3, 21)
M262 M259 M218            3      (4, 22)
|M131+M388| M004 M263     3      (3, 9)
M101 M066 M263            3      (8, 5)
M066 M352~o M218          4      (18, 4)
M371 M009 M371            5      (6, 30)
|M131+M388| M101 M066     5      (5, 8)
M386~a M240 M096          6      (12, 16)
M004 M263 M218            7      (9, 18)
M097~h M004 M218          11     (13, 45)

40 / 37