Searching Sanskrit Texts in SARIT Patrick McAllister June 6, 2017 This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License . To view a copy of this license, visit
Searching Sanskrit Texts in SARIT Patrick McAllister Introduction SARIT’s full text search Indexing SARIT’s texts Conclusions
Searching Sanskrit Texts in SARIT
SARIT’s search facilities Patrick McAllister
Institute for the Cultural and Intellectual History of Asia (IKGA)
2017-05-23
Outline
1. Introduction
2. SARIT’s full text search
3. Indexing SARIT’s texts
4. Conclusions
The TEI Guidelines and SARIT
▶ Toolset for the analysis of many texts common in humanities
▶ Not a technology to create/edit/display TEI documents
What might make or break SARIT?
1. Clear and simple way to add texts
   ▶ Various editorial systems possible, e.g.:
     ▶ Series-, area-, text-editor
     ▶ Open source software development models
2. Basic toolset for dealing with TEI-encoded texts
   ▶ Toolset ≠ finished application
Tools
Burghart 2016: Introduction: The various mechanisms offered by the TEI schema and Guidelines for the encoding of critical editions suffer from one major shortcoming: the lack of user-friendly tools allowing philologists and their readers to display and process TEI-encoded editions. After witnessing–and personally experiencing–this frustration, I decided to develop an application especially dedicated to supporting philologists in their work, and helping them to fully benefit from their encoding work.
Grepping?
(when (string-match "limited.*utility" "limited inutility")
  "A match!")
A match!
(when (not (string-match "limited.*utility"
                         "the utility is limited for searching XML documents"))
  "No match!")
No match!
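The same two checks can be reproduced outside Emacs. The following is a minimal Python sketch (Python is used here only for illustration; the slides use Emacs Lisp). The third check, where a `note` element interrupts a phrase, uses an invented sample string and is not taken from SARIT's data.

```python
import re

pattern = re.compile(r"limited.*utility")

# False positive: "inutility" ends in "utility", so the pattern matches.
false_positive = pattern.search("limited inutility") is not None

# False negative: the same words in a different order are not found.
false_negative = pattern.search(
    "the utility is limited for searching XML documents") is None

# Markup gets in the way as well: a phrase search on the raw XML source
# fails as soon as an element intervenes between the two words.
xml_source = "<p>tatra<note>yatra</note> hi</p>"  # hypothetical sample
phrase_found = re.search(r"tatra hi", xml_source) is not None

print(false_positive, false_negative, phrase_found)
```

A plain regular expression has no notion of words or of element boundaries, which is presumably why the slides turn to an index next.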
General indexing?
Figure: Recoll search for “(liṅga OR hetu) AND *numān*”
SARIT’s framework
1. The SARIT texts (https://github.com/sarit/sarit-data)
2. A dedicated XML database (http://exist-db.org/)
3. Two applications that ‘speak’ to the database:
   3.1 Loader/manager of SARIT etext library (https://github.com/sarit/sarit-data)
   3.2 Interface to the texts (https://github.com/sarit/sarit-pm), which is what currently allows you to read and search the texts.
SARIT’s full text search
SARIT’s documents have two encodings:
1. Devanāgarī
2. IAST, International Alphabet of Sanskrit Transliteration
asti + eva = asty eva <–> अस्त्य् एव <–> astyeva <–> अस्त्येव
Wildcards and encoding
▶ Wildcard: ast*eva -> asty eva + अस्त्येव
▶ Regular expression: ast.*eva -> asty eva + अस्त्येव
(list (string-to-list "अस्त्येव")
      (mapcar 'char-to-string (string-to-list "अस्त्येव"))
      (string-to-list "astyeva")
      (mapcar 'char-to-string (string-to-list "astyeva")))
2309 2360 2381 2340 2381 2351 2375 2357
अ स ् त ् य े व
97 115 116 121 101 118 97
a s t y e v a
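As a rough cross-check of the code points above, here is a small Python sketch (not part of the original slides). It rebuilds the Devanāgarī word from the numbers in the output table and shows that a pattern written in one encoding finds nothing in the other; the variable names are illustrative only.

```python
import re

# Code points copied from the output table for the Devanagari word.
deva = "".join(map(chr, [2309, 2360, 2381, 2340, 2381, 2351, 2375, 2357]))
iast = "astyeva"

pattern = re.compile(r"ast.*eva")

iast_hit = pattern.search(iast) is not None  # same encoding as the query
deva_hit = pattern.search(deva) is not None  # other encoding: nothing to match

# The two spellings of the same word share no code point at all.
shared = set(iast) & set(deva)

print(len(iast), len(deva), iast_hit, deva_hit, shared)
```

The lengths differ as well (7 against 8 code points), so even a position-by-position translation of a query from one script to the other is not straightforward.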
Difficulty with inherent “a” vowel
(let ((regex-dev "व")
      (regex-iast "va"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("va" "व") ("vi" "वि"))))
matched iast matched deva
nil matched deva
Analysis of “a” vowel difficulty
(list (cons "va:" (string-to-list "व"))
      (cons "vi:" (string-to-list "वि")))
va: 2357
vi: 2357 2367
Workaround for “a” vowel difficulty
(let ((regex-dev (rx-to-string
                  '(and "व"
                        (or line-end
                            (not (any "ा" "ि" "ी" "ु" "ू" "ृ" "ॄ" "ॅ" "ॆ" "े"
                                      "ै" "ॉ" "ॊ" "ो" "ौ" "्" "ॎ"))))))
      (regex-iast "va"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("va" "व") ("vi" "वि") ("vastu" "वस्तु") ("vistara" "विस्तर"))))
matched iast matched deva
nil nil
matched iast matched deva
nil nil
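The workaround can also be phrased with a negative lookahead, which avoids consuming the following character. This is a hedged Python sketch, not the slides' own code; the range U+093E to U+094D is an assumption about which code points count as dependent vowel signs or virāma.

```python
import re

VA = "\u0935"  # DEVANAGARI LETTER VA
# Assumption: U+093E..U+094D covers the dependent vowel signs and the virama.
NOT_BARE = "[\u093e-\u094d]"

# "va" in Devanagari: the letter VA not followed by a vowel sign or virama.
va_regex = re.compile(VA + "(?!" + NOT_BARE + ")")

cases = {
    "va": VA,
    "vi": VA + "\u093f",                               # VA + vowel sign I
    "vastu": VA + "\u0938\u094d\u0924\u0941",          # va + s + virama + t + u
    "vistara": VA + "\u093f\u0938\u094d\u0924\u0930",  # vi + s + virama + t + ra
}
results = {word: va_regex.search(deva) is not None
           for word, deva in cases.items()}
print(results)
```

The results agree with the table above: only `va` and `vastu` match.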
Analysis ast*eva in Devanāgarī
;; ast*eva
(list (string-to-list "अस्त्*एव")
      (mapcar 'char-to-string (string-to-list "अस्त्*एव")))
2309 2360 2381 2340 2381 42 2319 2357
अ स ् त ् * ए व
Matching any (star)
(let ((regex-dev (rx-to-string
                  '(and "अस्"
                        (and "त" ;; or, more correctly: (or "त" "त्")
                             (0+ anything)
                             (or "ए" "े"))
                        "व")))
      (regex-iast (rx-to-string '(and "ast" (0+ anything) "eva"))))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("astyeva" "अस्त्येव")
            ("astītyeva" "अस्तीत्येव")
            ("asti. tadeva" "अस्ति. तदेव"))))
matched iast matched deva
matched iast matched deva
matched iast matched deva
Single char matching (dot)
(let ((regex-dev (rx-to-string
                  '(and "अस्"
                        (and (or "त" "त्")
                             anything ;; one only
                             (or "ए" "े"))
                        "व")))
      (regex-iast (rx-to-string '(and "ast" anything "eva"))))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("astyeva" "अस्त्येव"))))
matched iast matched deva
Analysis single character match
(list ;; the string to search
      (string-to-list "अस्त्येव")
      (mapcar 'char-to-string (string-to-list "अस्त्येव"))
      ;; simplified regex
      (string-to-list "अस्त्.ेव")
      (mapcar 'char-to-string (string-to-list "अस्त्.ेव")))
;; difference: 2351 (य) --> 46 (.)
2309 2360 2381 2340 2381 2351 2375 2357
अ स ् त ् य े व
2309 2360 2381 2340 2381 46 2375 2357
अ स ् त ् . े व
Failing regex Devanāgarī (no virāma)
(let ((regex-dev (rx-to-string '(and "त" (or "त्र्" "त्र") anything)))
      (regex-iast "tatr."))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("tatra" "तत्र") ("tatraiva" "तत्रैव"))))
matched iast nil
matched iast matched deva
Analysis failing regex Devanāgarī (no virāma)
(list (string-to-list "तत्रैव")
      (mapcar 'char-to-string (string-to-list "तत्रैव"))
      ;; simplified regex
      (string-to-list "तत्र्?.")
      (mapcar 'char-to-string (string-to-list "तत्र्?."))
      ;; the failing case
      (string-to-list "तत्र")
      (mapcar 'char-to-string (string-to-list "तत्र"))
      (string-to-list "तत्र्.")
      (mapcar 'char-to-string (string-to-list "तत्र्.")))
2340 2340 2381 2352 2376 2357
त त ् र ै व
2340 2340 2381 2352 2381 63 46
त त ् र ् ? .
2340 2340 2381 2352
त त ् र
2340 2340 2381 2352 2381 46
त त ् र ् .
Regex saving “a”
(let ((regex-dev (rx-to-string '(and "त" (or (and "त्र्" anything) "त्र"))))
      (regex-iast "tatr."))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("tatra" "तत्र") ("tatraiva" "तत्रैव"))))
matched iast matched deva
matched iast matched deva
Workaround final non-a vowel with consonant quantifier
(let ((regex-dev (rx-to-string '(and "वृ" (or "त" "त्" "त्त्" "त्त") "ि")))
      (regex-iast "vṛtt?i"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("vṛti" "वृति") ("vṛtti" "वृत्ति") ("vṛtta" "वृत्त"))))
matched iast matched deva
matched iast matched deva
nil nil
Analysis final non-a vowel with consonant quantifier
(list (string-to-list "वृति")
      (mapcar 'char-to-string (string-to-list "वृति"))
      (string-to-list "वृत्ति")
      (mapcar 'char-to-string (string-to-list "वृत्ति")))
2357 2371 2340 2367
व ृ त ि
2357 2371 2340 2381 2340 2367
व ृ त ् त ि
Impossible to search for final “a”
(let ((regex-dev (rx-to-string '(and "वृ" (or "त" "त्त"))))
      (regex-iast "vṛtt?a"))
  (mapcar (lambda (case)
            (list (when (string-match regex-iast (car case)) "matched iast")
                  (when (string-match regex-dev (cadr case)) "matched deva")))
          '(("vṛti" "वृति") ("vṛtti" "वृत्ति") ("vṛtta" "वृत्त"))))
nil matched deva
nil matched deva
matched iast matched deva
A unified SLP1 index
Figure: SARIT lucene index on tei:p elements
Benefits of unified index
1. Searches match on both Devanāgarī- and IAST-encoded texts, and
2. terms combined into a single weighting system: relevance unrelated to encoding
3. full Lucene index, with all of its query syntax: [AND], [OR], [NOT], brackets, proximity searches, etc.
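To make the first benefit concrete, here is a minimal Python sketch of the underlying idea: both scripts are transcoded to SLP1 before indexing, so one term serves both. It is not SARIT's analyzer; the function names and the character tables are illustrative assumptions, cover only the common Sanskrit letters, and ignore rare cases such as a hiatus "a" + "i" in IAST.

```python
import unicodedata

CONSONANTS = dict(zip("कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
                      "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"))
VOWELS = dict(zip("अआइईउऊऋॠएऐओऔ", "aAiIuUfFeEoO"))
# Dependent vowel signs, written as escapes to keep them unambiguous.
SIGNS = dict(zip("\u093e\u093f\u0940\u0941\u0942\u0943\u0944"
                 "\u0947\u0948\u094b\u094c", "AiIuUfFeEoO"))
OTHER = {"\u0902": "M", "\u0903": "H"}  # anusvara, visarga
VIRAMA = "\u094d"

IAST = {unicodedata.normalize("NFC", k): v for k, v in {
    "ai": "E", "au": "O", "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ṅ": "N", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "ś": "S", "ṣ": "z", "ṃ": "M", "ḥ": "H",
}.items()}


def deva_to_slp1(text):
    """Transcode Devanagari to SLP1, making the inherent 'a' explicit."""
    out, pending_a = [], False
    for ch in text:
        if ch in SIGNS:            # vowel sign replaces the inherent 'a'
            out.append(SIGNS[ch])
            pending_a = False
        elif ch == VIRAMA:         # virama suppresses the inherent 'a'
            pending_a = False
        else:
            if pending_a:
                out.append("a")
            pending_a = ch in CONSONANTS
            out.append(CONSONANTS.get(ch) or VOWELS.get(ch)
                       or OTHER.get(ch) or ch)
    return "".join(out) + ("a" if pending_a else "")


def iast_to_slp1(text):
    """Transcode IAST to SLP1, trying two-letter units (bh, ai, ...) first."""
    text = unicodedata.normalize("NFC", text).lower()
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in IAST:
            out.append(IAST[text[i:i + 2]])
            i += 2
        else:
            out.append(IAST.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(deva_to_slp1("भवति"), iast_to_slp1("bhavati"))
```

Both calls yield the same SLP1 string, so a query typed in either script can be looked up against the same index entry.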
Sample search: [yatra AND tatra]
Figure: SARIT search: [yatra AND tatra] (mixed Devanāgarī and IAST results)
Basic problem of indexing mixed data
<div>
  <p n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</p>
  <p n="2">न हि भवति, तत्र ...</p>
  <p n="3">न <hi>हि</hi> भवति, तत्र ...</p>
</div>
Sample eXistDB index
<index xmlns:tei="http://www.tei-c.org/ns/1.0">
  <lucene>
    <!-- index configuration: slp1 and standard analyzers -->
    <analyzer class="SLP1TranscodingAnalyzer"/>
    <analyzer id="standard-analyzer" class="StandardAnalyzer"/>
    <!-- TEI headers are usually more English than Sanskrit -->
    <text qname="tei:teiHeader" analyzer="standard-analyzer"/>
    <!-- be sure to catch all div elements -->
    <text qname="tei:div"/>
    <!-- sample for paragraph with hi elements inlined -->
    <text qname="tei:p">
      <inline match="tei:hi"/>
    </text>
    <!-- sample for line-groups with notes ignored -->
    <text qname="tei:lg">
      <ignore qname="tei:note"/>
    </text>
    <text qname="tei:quote"/>
    <text qname="tei:note"/>
  </lucene>
</index>
Source and index states for eXistdb
<div>
  <p n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</p>
  <!-- index contains: tatra yatra S hi ... -->
  <l n="1">तत्र<note n="1-2"><p>यत्र<ref target="#avaya-S" xml:lang="en">S</ref></p></note> हि ...</l>
  <!-- index contains: tatra hi ... -->
  <p n="2">न हि भवति, तत्र ...</p>
  <!-- index contains: na hi Bavati, tatra ... -->
  <p n="3">न <hi>हि</hi> भवति, तत्र ...</p>
  <!-- index contains: na hi Bavati, tatra ... -->
</div>
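The effect of the `ignore` and `inline` rules on what ends up in the index can be imitated in a few lines. The following Python sketch is a simplification under stated assumptions: it uses un-namespaced sample elements with IAST text, treats every element boundary as a token break unless the element is listed as inline, and leaves out the SLP1 transcoding. The function `indexed_text` is hypothetical and is not part of eXist-db.

```python
import xml.etree.ElementTree as ET


def indexed_text(elem, ignore=(), inline=()):
    """Collect the text an index on `elem` would see.

    Children named in `ignore` are skipped entirely; children named in
    `inline` are joined without a token break; all others add a break.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in ignore:
            sep = "" if child.tag in inline else " "
            parts.append(sep + indexed_text(child, ignore, inline) + sep)
        parts.append(child.tail or "")  # text after the child stays indexed
    return " ".join("".join(parts).split())


source = ET.fromstring(
    "<p>tatra<note><p>yatra<ref>S</ref></p></note> hi ...</p>")
everything = indexed_text(source)
without_notes = indexed_text(source, ignore={"note"})

word = ET.fromstring("<p>bha<hi>va</hi>ti</p>")
inlined = indexed_text(word, inline={"hi"})
broken = indexed_text(word)

print(everything, "|", without_notes, "|", inlined, "|", broken)
```

With notes included, the paragraph yields `tatra yatra S hi ...`; with notes ignored it yields `tatra hi ...`, mirroring the two index comments above. The last two values show why a highlighted syllable inside a word needs the inline rule.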
Problems with this index
1. Possible to miss cases
2. Results sometimes not clear to users
3. No possibility to index variations in text separately
4. No possibility to index based on [@xml:lang]
Distribution of [//text()] in SARIT’s documents
Table: Characters per element in SARIT library

Characters   TEI element (in body)   Percent
22248297     p                       49.907421
14347985     l                       32.185426
 1633839     seg                      3.6650307
 1162691     label                    2.6081506
 1122531     quote                    2.5180636
  911228     q                        2.0440683
  896495     hi                       2.0110192
  686775     ab                       1.5405749
  463133     note                     1.0389008
  336104     ref                      0.75394911
…            …
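A distribution of this kind can be approximated by attributing every text node to its parent element. The Python sketch below shows one way to do that; it is an assumption about the method, not the query actually used for the table, and it counts whitespace like any other character. The sample document is invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def chars_per_element(root):
    """Count characters of text nodes, grouped by their parent element."""
    counts = Counter()
    for elem in root.iter():
        counts[elem.tag] += len(elem.text or "")
        for child in elem:
            # Text following a child element belongs to the parent.
            counts[elem.tag] += len(child.tail or "")
    return counts


sample = ET.fromstring("<div><p>ab<hi>cd</hi>e</p><l>xyz</l></div>")
counts = chars_per_element(sample)
total = sum(counts.values())
percent = {tag: 100 * n / total for tag, n in counts.items()}

for tag, n in counts.most_common():
    print(n, tag, round(percent[tag], 2))
```

In the sample, `p` directly contains 3 of the 8 characters; the 2 characters inside `hi` are counted for `hi`, not for `p`.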
Special text types: Commentaries
<titleStmt>
  <title type="main" subtype="base-text">Pramāṇavārttika</title>
  <title type="sub" subtype="commentary">Pramāṇavārttikavṛtti</title>
  <author role="base-author">Dharmakīrti</author>
  <author role="commentator">Manorathanandin</author>
</titleStmt>
Special text types: Commentaries (contd.)
Figure: Base text embedded in commentary (Pramāṇavārttika and Vṛtti; image prepared by Liudmila Olalde)
Special text types: Commentaries (contd.)
<div>
  <quote type="base-text">
    <lg xml:id="pv.3.206b" prev="#pv.2.206a">
      <l>ते</l>
    </lg>
  </quote>
  <quote type="base-text">
    <lg xml:id="pv.3.207a" next="#pv.3.207b">
      <l> त ्व े </l>
    </lg>
  </quote>
  <p><q type="lemma">ते</q> (व)े <q type="lemma"> े </q> त ... </p>
</div>
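An encoding like this lets a query pair each lemma in the commentary with the base-text verses quoted before it. The Python sketch below illustrates the idea on an invented, un-namespaced sample in which placeholder strings stand in for the Sanskrit text. The function `lemmas_with_verses` is hypothetical and is not SARIT's implementation.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: placeholder strings stand in for the Sanskrit text.
SAMPLE = """
<div>
  <quote type="base-text"><lg xml:id="pv.3.206b"><l>verse-text-1</l></lg></quote>
  <quote type="base-text"><lg xml:id="pv.3.207a"><l>verse-text-2</l></lg></quote>
  <p><q type="lemma">lemma-1</q> gloss <q type="lemma">lemma-2</q> gloss ...</p>
</div>
"""


def lemmas_with_verses(div):
    """Pair every lemma with the ids of the verses quoted earlier in `div`."""
    verses, pairs = [], []
    for child in div:
        if child.tag == "quote" and child.get("type") == "base-text":
            verses.extend(lg.get(XML_ID) for lg in child.iter("lg"))
        elif child.tag == "p":
            for q in child.iter("q"):
                if q.get("type") == "lemma":
                    lemma = "".join(q.itertext()).strip()
                    pairs.append((lemma, tuple(verses)))
    return pairs


pairs = lemmas_with_verses(ET.fromstring(SAMPLE))
for lemma, verse_ids in pairs:
    print(lemma, "<-", ", ".join(verse_ids))
```

A real query would additionally have to handle the TEI namespace and decide how far back a lemma may refer; both are left out here.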
Sample use case: Tattvasaṅgraha with Pañjikā
Figure: List lemmas in Pañjikā together with verses of Tattvasaṅgraha (results)
Achievements
1. A working index that allows us to search Indic documents independently of their encoding
2. Lucene’s powerful query syntax
3. Basic indexing rules
What still needs to be done
1. Make indexing rules more flexible
2. Develop custom queries that ‘understand’ specific text types (e.g., commentaries)
3. Add more texts!