slide-1
SLIDE 1

Searching Sanskrit Texts in SARIT

Patrick McAllister June 6, 2017

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

slide-2
SLIDE 2

Searching Sanskrit Texts in SARIT Patrick McAllister Introduction SARIT’s full text search Indexing SARIT’s texts Conclusions

Searching Sanskrit Texts in SARIT

SARIT’s search facilities

Patrick McAllister

Institute for the Cultural and Intellectual History of Asia (IKGA)

2017-05-23

slide-3
SLIDE 3

Outline

  • 1. Introduction
  • 2. SARIT’s full text search
  • 3. Indexing SARIT’s texts
  • 4. Conclusions

slide-4
SLIDE 4

The TEI Guidelines and SARIT

▶ Toolset for the analysis of many texts common in humanities
▶ Not a technology to create/edit/display TEI documents

slide-6
SLIDE 6

What might make or break SARIT?

  • 1. clear and simple way to add texts

▶ Various editorial systems possible, e.g.:

▶ Series-, area-, text-editor
▶ Open source software development models

  • 2. basic toolset for dealing with TEI encoded texts

▶ Toolset ≠ Finished application

slide-12
SLIDE 12

Tools

Burghart 2016: Introduction: The various mechanisms offered by the TEI schema and Guidelines for the encoding of critical editions suffer from one major shortcoming: the lack of user-friendly tools allowing philologists and their readers to display and process TEI-encoded editions. After witnessing–and personally experiencing–this frustration, I decided to develop an application especially dedicated to supporting philologists in their work, and helping them to fully benefit from their encoding work.

slide-13
SLIDE 13

Grepping?

(when (string-match "limited.*utility" "limited inutility")
  "A match!")

A match!

(when (not (string-match "limited.*utility"
                         "the utility is limited for searching XML documents"))
  "No match!")

No match!
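The two grep cases can be reproduced outside Emacs. The following is a minimal Python sketch, not the author's code: the first two inputs are the ones from the slide, while the third (a phrase split by markup and a line break) is an added assumption meant to show why line-oriented patterns are fragile on XML.

```python
import re

# Same pattern as on the slide; '.' does not cross line breaks by default.
PATTERN = re.compile(r"limited.*utility")


def grep(text):
    """Return True when the pattern is found anywhere in text."""
    return PATTERN.search(text) is not None


# Matches, although the phrase says the opposite of what was searched for.
false_positive = grep("limited inutility")

# Same words in a different order: no match.
missed = grep("the utility is limited for searching XML documents")

# Hypothetical XML snippet (not from the slides): the markup is harmless here,
# but the line break between the two words prevents the match.
split_by_markup = grep("<p>limited\n<hi>utility</hi></p>")

print(false_positive, missed, split_by_markup)
```

A full-text index sidesteps both problems because it works on tokens rather than on raw lines of markup.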

slide-14
SLIDE 14

General indexing?

Figure: Recoll search for “(liṅga OR hetu) AND *numān*”

slide-15
SLIDE 15

SARIT’s framework

  • 1. The SARIT texts (https://github.com/sarit/sarit-data)
  • 2. A dedicated XML database (http://exist-db.org/)
  • 3. Two applications that ‘speak’ to the database:

3.1 Loader/manager of SARIT etext library (https://github.com/sarit/sarit-data)
3.2 Interface to the texts (https://github.com/sarit/sarit-pm), which is what currently allows you to read and search the texts.

slide-20
SLIDE 20

SARIT’s full text search

SARIT’s documents have two encodings:

  • 1. Devanāgarī
  • 2. IAST, International Alphabet of Sanskrit Transliteration

asti + eva = asty eva <–> अस्त्य् एव <–> astyeva <–> अस्त्येव

slide-23
SLIDE 23

Wildcards and encoding

▶ Wildcard: ast*eva -> asty eva + अस्त्येव
▶ Regular expression: ast.*eva -> asty eva + अस्त्येव

(list (string-to-list "अस्त्येव")
      (mapcar 'char-to-string (string-to-list "अस्त्येव"))
      (string-to-list "astyeva")
      (mapcar 'char-to-string (string-to-list "astyeva")))

2309 2360 2381 2340 2381 2351 2375 2357
अ स ् त ् य े व
97 115 116 121 101 118 97
a s t y e v a
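The same code point listing can be produced in Python. This is a small sketch for illustration only; the helper name `code_points` is hypothetical, and the Devanāgarī string is written with escapes so that the exact code points are unambiguous.

```python
def code_points(text):
    """Return the Unicode code points of text as decimal integers."""
    return [ord(ch) for ch in text]


# अस्त्येव, spelled out as: अ स ् त ् य े व
deva = "\u0905\u0938\u094d\u0924\u094d\u092f\u0947\u0935"
iast = "astyeva"

print(code_points(deva))
print(code_points(iast))

# The two spellings of the same word differ in length: Devanāgarī needs
# explicit virāma signs (2381) and has no character for the final "a".
print(len(deva), len(iast))
```

Because the units do not line up one to one, a wildcard that stands for "one character" means something different in each encoding.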

slide-26
SLIDE 26

Difficulty with inherent “a” vowel

(let ((regex-dev "व")
      (regex-iast "va"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("va" "व") ("vi" "वि"))))

matched iast  matched deva
nil           matched deva

slide-27
SLIDE 27

Analysis of “a” vowel difficulty

(list (cons "va:" (string-to-list "व"))
      (cons "vi:" (string-to-list "वि")))

va: 2357
vi: 2357 2367

slide-28
SLIDE 28

Workaround for “a” vowel difficulty

(let ((regex-dev (rx-to-string
                  '(and "व"
                        (or line-end
                            (not (any "ा" "ि" "ी" "ु" "ू" "ृ" "ॄ" "ॅ" "ॆ" "े" "ै" "ॉ" "ॊ" "ो" "ौ" "्" "ॎ"))))))
      (regex-iast "va"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("va" "व") ("vi" "वि") ("vastu" "वस्तु") ("vistara" "विस्तर"))))

matched iast  matched deva
nil           nil
matched iast  matched deva
nil           nil
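A Python counterpart of this workaround can use a negative lookahead instead of the `line-end` alternative. This is a sketch under assumptions, not the author's code: the character range U+093E to U+094D (dependent vowel signs plus virāma) is chosen here for illustration, and the test words are the four cases from the slide.

```python
import re

# Any of these after a consonant cancels its inherent "a":
# dependent vowel signs U+093E..U+094C and the virāma U+094D.
NO_INHERENT_A = "\u093e-\u094d"

# व (U+0935) not followed by a vowel sign or virāma, i.e. "va".
VA_DEVA = re.compile("\u0935(?![" + NO_INHERENT_A + "])")
VA_IAST = re.compile("va")

cases = [
    ("va", "\u0935"),                                      # व
    ("vi", "\u0935\u093f"),                                # वि
    ("vastu", "\u0935\u0938\u094d\u0924\u0941"),           # वस्तु
    ("vistara", "\u0935\u093f\u0938\u094d\u0924\u0930"),   # विस्तर
]

results = [
    (bool(VA_IAST.search(iast)), bool(VA_DEVA.search(deva)))
    for iast, deva in cases
]
print(results)
```

The lookahead also succeeds at the end of the string, so no separate end-of-line branch is needed.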

slide-29
SLIDE 29

Analysis ast*eva in Devanāgarī

;; ast*eva
(list (string-to-list "अस्त्*एव")
      (mapcar 'char-to-string (string-to-list "अस्त्*एव")))

2309 2360 2381 2340 2381 42 2319 2357
अ स ् त ् * ए व

slide-30
SLIDE 30

Matching any (star)

(let ((regex-dev (rx-to-string
                  '(and "अस्"
                        (and "त" ;; or, more correctly: (or "त" "त्")
                             (0+ anything)
                             (or "ए" "े"))
                        "व")))
      (regex-iast (rx-to-string '(and "ast" (0+ anything) "eva"))))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("astyeva" "अस्त्येव")
            ("astītyeva" "अस्तीत्येव")
            ("asti. tadeva" "अस्ति। तदेव"))))

matched iast  matched deva
matched iast  matched deva
matched iast  matched deva
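The star pattern translates directly to Python's `re` module. The sketch below is illustrative only: the Devanāgarī spelling of the third test case (with a daṇḍa, U+0964) is an assumption, since the slide's string did not survive text extraction, and `re.DOTALL` is used to mimic `anything`, which also matches line breaks.

```python
import re

# ast.*eva
STAR_IAST = re.compile(r"ast.*eva", re.DOTALL)

# अस् + त + anything + (ए | े) + व: the "e" may be an independent letter
# (U+090F) or a dependent vowel sign (U+0947).
STAR_DEVA = re.compile(
    "\u0905\u0938\u094d\u0924.*[\u090f\u0947]\u0935", re.DOTALL
)

cases = [
    # astyeva / अस्त्येव
    ("astyeva", "\u0905\u0938\u094d\u0924\u094d\u092f\u0947\u0935"),
    # astītyeva / अस्तीत्येव
    ("ast\u012btyeva",
     "\u0905\u0938\u094d\u0924\u0940\u0924\u094d\u092f\u0947\u0935"),
    # asti. tadeva / अस्ति। तदेव (assumed spelling)
    ("asti. tadeva",
     "\u0905\u0938\u094d\u0924\u093f\u0964 \u0924\u0926\u0947\u0935"),
]

results = [
    (bool(STAR_IAST.search(iast)), bool(STAR_DEVA.search(deva)))
    for iast, deva in cases
]
print(results)
```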

slide-31
SLIDE 31

Single char matching (dot)

(let ((regex-dev (rx-to-string
                  '(and "अस्"
                        (and (or "त" "त्")
                             anything ;; one only
                             (or "ए" "े"))
                        "व")))
      (regex-iast (rx-to-string '(and "ast" anything "eva"))))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("astyeva" "अस्त्येव"))))

matched iast  matched deva

slide-32
SLIDE 32

Analysis single character match

(list ;; the string to search
      (string-to-list "अस्त्येव")
      (mapcar 'char-to-string (string-to-list "अस्त्येव"))
      ;; simplified regex
      (string-to-list "अस्त्.ेव")
      (mapcar 'char-to-string (string-to-list "अस्त्.ेव")))
;; difference: 2351 (य) --> 46 (.)

2309 2360 2381 2340 2381 2351 2375 2357
अ स ् त ् य े व
2309 2360 2381 2340 2381 46 2375 2357
अ स ् त ् . े व

slide-33
SLIDE 33

Failing regex Devanāgarī (no virāma)

(let ((regex-dev (rx-to-string '(and "त" (or "त्र्" "त्र") anything)))
      (regex-iast "tatr."))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("tatra" "तत्र") ("tatraiva" "तत्रैव"))))

matched iast  nil
matched iast  matched deva

slide-34
SLIDE 34

Analysis failing regex Devanāgarī (no virāma)

(list (string-to-list "तत्रैव")
      (mapcar 'char-to-string (string-to-list "तत्रैव"))
      ;; simplified regex
      (string-to-list "तत्र्?.")
      (mapcar 'char-to-string (string-to-list "तत्र्?."))
      ;; the failing case
      (string-to-list "तत्र")
      (mapcar 'char-to-string (string-to-list "तत्र"))
      (string-to-list "तत्र्.")
      (mapcar 'char-to-string (string-to-list "तत्र्.")))

2340 2340 2381 2352 2376 2357
त त ् र ै व
2340 2340 2381 2352 2381 63 46
त त ् र ् ? .
2340 2340 2381 2352
त त ् र
2340 2340 2381 2352 2381 46
त त ् र ् .

slide-35
SLIDE 35

Regex saving “a”

(let ((regex-dev (rx-to-string '(and "त" (or (and "त्र्" anything) "त्र"))))
      (regex-iast "tatr."))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("tatra" "तत्र") ("tatraiva" "तत्रैव"))))

matched iast  matched deva
matched iast  matched deva
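The failing and the repaired translation of `tatr.` can be compared side by side in Python. This is a sketch for illustration, assuming the two test words तत्र and तत्रैव; the variable names are hypothetical.

```python
import re

TATRA = "\u0924\u0924\u094d\u0930"                # तत्र   (tatra)
TATRAIVA = "\u0924\u0924\u094d\u0930\u0948\u0935"  # तत्रैव (tatraiva)

# Naive translation of "tatr.": the dot must consume a character after र
# (or after र्), so a word ending in inherent "a" cannot match.
naive = re.compile("\u0924\u0924\u094d\u0930\u094d?.", re.DOTALL)

# Variant that saves the inherent "a": the dot is only required after र्,
# while a bare र is accepted as it stands.
fixed = re.compile(
    "\u0924(?:\u0924\u094d\u0930\u094d.|\u0924\u094d\u0930)", re.DOTALL
)

naive_hits = [bool(naive.search(w)) for w in (TATRA, TATRAIVA)]
fixed_hits = [bool(fixed.search(w)) for w in (TATRA, TATRAIVA)]
print(naive_hits, fixed_hits)
```

The repaired pattern matches both words, but only because the dot was moved inside one alternative; the translation from IAST is no longer mechanical.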

slide-36
SLIDE 36

Workaround final non-a vowel with consonant quantifier

(let ((regex-dev (rx-to-string '(and "वृ" (or "त" "त्" "त्त्" "त्त") "ि")))
      (regex-iast "vṛtt?i"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("vṛti" "वृति") ("vṛtti" "वृत्ति") ("vṛtta" "वृत्त"))))

matched iast  matched deva
matched iast  matched deva
nil           nil

slide-37
SLIDE 37

Analysis final non-a vowel with consonant quantifier

(list (string-to-list "वृति")
      (mapcar 'char-to-string (string-to-list "वृति"))
      (string-to-list "वृत्ति")
      (mapcar 'char-to-string (string-to-list "वृत्ति")))

2357 2371 2340 2367
व ृ त ि
2357 2371 2340 2381 2340 2367
व ृ त ् त ि

slide-38
SLIDE 38

Impossible to search for final “a”

(let ((regex-dev (rx-to-string '(and "वृ" (or "त" "त्त"))))
      (regex-iast "vṛtt?a"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("vṛti" "वृति") ("vṛtti" "वृत्ति") ("vṛtta" "वृत्त"))))

nil           matched deva
nil           matched deva
matched iast  matched deva
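The false positives can be reproduced in Python as well. This sketch mirrors the slide's three test words; it only restates the problem and does not attempt a fix. The pattern and variable names are hypothetical.

```python
import re

# IAST: vṛtt?a (ṛ written as U+1E5B)
FINAL_A_IAST = re.compile("v\u1e5btt?a")

# Devanāgarī: वृ + (त्त | त). There is no character for the final "a",
# so nothing in the pattern rules out a following vowel sign.
FINAL_A_DEVA = re.compile("\u0935\u0943(?:\u0924\u094d\u0924|\u0924)")

cases = [
    ("v\u1e5bti", "\u0935\u0943\u0924\u093f"),               # vṛti  / वृति
    ("v\u1e5btti", "\u0935\u0943\u0924\u094d\u0924\u093f"),  # vṛtti / वृत्ति
    ("v\u1e5btta", "\u0935\u0943\u0924\u094d\u0924"),        # vṛtta / वृत्त
]

results = [
    (bool(FINAL_A_IAST.search(iast)), bool(FINAL_A_DEVA.search(deva)))
    for iast, deva in cases
]
print(results)
```

Only the last word should match, yet the Devanāgarī pattern reports all three.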

slide-39
SLIDE 39

A unified SLP1 index

Figure: SARIT lucene index on tei:p elements

slide-40
SLIDE 40

Benefits of unified index

  • 1. search terms match on both Devanāgarī and IAST encoded texts, and
  • 2. terms are combined into a single weighting system: relevance is unrelated to encoding
  • 3. full Lucene index, with all of its query syntax: [AND], [OR], [NOT], brackets, proximity searches, etc.
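The idea behind a unified index can be sketched in a few lines of Python: both encodings are transcoded to SLP1 before indexing, so one query term hits either. This is an illustrative sketch only, not SARIT's analyzer; the function names and the deliberately reduced character tables are assumptions, and ambiguous IAST sequences (for example "ai" in hiatus) are not handled.

```python
import unicodedata

# IAST -> SLP1; digraphs are tried before single letters.
_IAST = {
    "ai": "E", "au": "O", "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ṅ": "N", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "ś": "S", "ṣ": "z", "ṃ": "M", "ḥ": "H",
}
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _IAST.items()}

# Devanāgarī consonants U+0915..U+0939, skipping four letters that are
# outside the scope of this sketch (U+0929, U+0931, U+0933, U+0934).
_CONS = [chr(c) for c in range(0x0915, 0x093A)
         if c not in (0x0929, 0x0931, 0x0933, 0x0934)]
CONSONANTS = dict(zip(_CONS, "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"))
VOWEL_LETTERS = dict(zip(
    map(chr, (0x905, 0x906, 0x907, 0x908, 0x909, 0x90A, 0x90B,
              0x90F, 0x910, 0x913, 0x914)),
    "aAiIuUfeEoO"))
VOWEL_SIGNS = dict(zip(
    map(chr, (0x93E, 0x93F, 0x940, 0x941, 0x942, 0x943,
              0x947, 0x948, 0x94B, 0x94C)),
    "AiIuUfeEoO"))
OTHER = {chr(0x902): "M", chr(0x903): "H"}  # anusvāra, visarga
VIRAMA = chr(0x94D)


def iast_to_slp1(text):
    """Transcode IAST to SLP1 (simplified, longest match first)."""
    text = unicodedata.normalize("NFC", text.lower())
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


def deva_to_slp1(text):
    """Transcode Devanāgarī to SLP1, restoring the inherent 'a'."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries "a" unless a vowel sign or virāma follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch in VOWEL_LETTERS:
            out.append(VOWEL_LETTERS[ch])
        elif ch in OTHER:
            out.append(OTHER[ch])
        elif ch != VIRAMA:
            out.append(ch)  # spaces, punctuation, anything unknown
    return "".join(out)


print(iast_to_slp1("bhavati"), deva_to_slp1("\u092d\u0935\u0924\u093f"))
```

Once both spellings reduce to the same SLP1 token, problems such as the unwritten final "a" disappear, because SLP1 writes every vowel explicitly.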

slide-43
SLIDE 43

Sample search: [yatra AND tatra]

Figure: SARIT search: [yatra AND tatra] (mixed Devanāgarī and IAST results)

slide-44
SLIDE 44

Basic problem of indexing mixed data

<div>
  <p n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</p>
  <p n="2">न हि भवति, तत्र ...</p>
  <p n="3">न <hi>हि</hi> भवति, तत्र ...</p>
</div>

slide-45
SLIDE 45

Sample eXistDB index

<index xmlns:tei="http://www.tei-c.org/ns/1.0">
  <lucene>
    <!-- index configuration: slp1 and standard analyzers -->
    <analyzer class="SLP1TranscodingAnalyzer"/>
    <analyzer id="standard-analyzer" class="StandardAnalyzer"/>
    <!-- TEI headers are usually more English than Sanskrit -->
    <text qname="tei:teiHeader" analyzer="standard-analyzer"/>
    <!-- be sure to catch all div elements -->
    <text qname="tei:div"/>
    <!-- sample for paragraph with hi elements inlined -->
    <text qname="tei:p">
      <inline match="tei:hi"/>
    </text>
    <!-- sample for line-groups with notes ignored -->
    <text qname="tei:lg">
      <ignore qname="tei:note"/>
    </text>
    <text qname="tei:quote"/>
    <text qname="tei:note"/>
  </lucene>
</index>

slide-46
SLIDE 46

Source and index states for eXistdb

<div>
  <p n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</p>
  <!-- index contains: tatra yatra S hi ... -->
  <l n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</l>
  <!-- index contains: tatra hi ... -->
  <p n="2">न हि भवति, तत्र ...</p>
  <!-- index contains: na hi Bavati, tatra ... -->
  <p n="3">न <hi>हि</hi> भवति, तत्र ...</p>
  <!-- index contains: na hi Bavati, tatra ... -->
</div>

slide-47
SLIDE 47

Problems with this index

  • 1. Possible to miss cases
  • 2. Results sometimes not clear to users
  • 3. No possibility to index variations in text separately
  • 4. No possibility to index based on [@xml:lang]
slide-51
SLIDE 51

Distribution of [//text()] in SARIT’s documents

Table: Characters per element in SARIT library

Characters  TEI element (in body)  Percent
22248297    p                      49.907421
14347985    l                      32.185426
1633839     seg                    3.6650307
1162691     label                  2.6081506
1122531     quote                  2.5180636
911228      q                      2.0440683
896495      hi                     2.0110192
686775      ab                     1.5405749
463133      note                   1.0389008
336104      ref                    0.75394911
…           …
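A count of this kind can be approximated with Python's standard library. The sketch below is an assumption about the method, not the query used for the slide: each text node is credited to its immediate parent element, whitespace is counted like any other character, and the function name `chars_per_element` is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def chars_per_element(root):
    """Count characters of text nodes, grouped by their parent element name."""
    counts = Counter()
    for elem in root.iter():
        name = elem.tag.rsplit("}", 1)[-1]   # strip a namespace, if any
        counts[name] += len(elem.text or "")
        # In ElementTree, text following a child is stored as the child's
        # tail, but it belongs to the parent element.
        for child in elem:
            counts[name] += len(child.tail or "")
    return counts


sample = ET.fromstring("<body><p>abc<hi>de</hi>f</p><l>gh</l></body>")
counts = chars_per_element(sample)
total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items() if n}
print(counts.most_common(), shares)
```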

slide-52
SLIDE 52

Special text types: Commentaries

<titleStmt>
  <title type="main" subtype="base-text">Pramāṇavārttika</title>
  <title type="sub" subtype="commentary">Pramāṇavārttikavṛtti</title>
  <author role="base-author">Dharmakīrti</author>
  <author role="commentator">Manorathanandin</author>
</titleStmt>

slide-53
SLIDE 53

Special text types: Commentaries (contd.)

Figure: Base text embedded in commentary (Pramāṇavārttika and Vṛtti; image prepared by Liudmila Olalde)

slide-54
SLIDE 54

Special text types: Commentaries (contd.)

<div>
  <quote type="base-text">
    <lg xml:id="pv.3.206b" prev="#pv.2.206a">
      <l>ते</l>
    </lg>
  </quote>
  <quote type="base-text">
    <lg xml:id="pv.3.207a" next="#pv.3.207b">
      <l> त ्व े </l>
    </lg>
  </quote>
  <p><q type="lemma">ते</q> (व)े <q type="lemma"> े </q> त ... </p>
</div>

slide-55
SLIDE 55

Sample use case: Tattvasaṅgraha with Pañjikā

Figure: List lemmas in Pañjikā together with verses of Tattvasaṅgraha (results)
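A listing of this kind can be sketched in Python against the commentary encoding shown earlier (base-text quotes followed by a paragraph with lemma quotations). This is a hypothetical illustration, not the query behind the figure: the sample uses placeholder English text, omits the TEI namespace, and the function name `lemmas_by_verse` is an assumption.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Placeholder document following the base-text / lemma pattern.
SAMPLE = """<div>
  <quote type="base-text"><lg xml:id="v1"><l>verse one</l></lg></quote>
  <p><q type="lemma">verse</q> gloss <q type="lemma">one</q> gloss</p>
</div>"""


def flat_text(elem):
    """Return the text content of elem with whitespace normalised."""
    return " ".join("".join(elem.itertext()).split())


def lemmas_by_verse(div):
    """Pair each commentary paragraph's lemmas with the verses preceding it."""
    rows, verses = [], []
    for child in div:
        if child.tag == "quote" and child.get("type") == "base-text":
            lg = child.find("lg")
            verse_id = lg.get(XML_ID) if lg is not None else None
            verses.append((verse_id, flat_text(child)))
        elif child.tag == "p":
            lemmas = [flat_text(q) for q in child.findall(".//q[@type='lemma']")]
            rows.append((verses, lemmas))
            verses = []
    return rows


rows = lemmas_by_verse(ET.fromstring(SAMPLE))
print(rows)
```

A query of this shape depends on the encoding conventions being applied consistently across commentaries.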

slide-56
SLIDE 56

Achievements

  • 1. a working index that allows us to search Indic documents independently of their encoding

  • 2. Lucene’s powerful query syntax
  • 3. Basic indexing rules
slide-59
SLIDE 59

What still needs to be done

  • 1. Make indexing rules more flexible
  • 2. Develop custom queries that ‘understand’ specific text types (commentaries, e.g.)

  • 3. Add more texts!