SLIDE 1

India's Ancient Manuscripts - complexities and challenges for language informatics

Girish Nath Jha

Professor in Computational Linguistics, Special Center for Sanskrit Studies; Professor & Concurrent Faculty, Center of Linguistics, J.N.U., New Delhi-67

12/9/2015 Talk at Institute of Linguistics, Adam Mickiewicz University, Poznan, Poland

SLIDE 2

In this presentation…

1) Big Data, Language Informatics, Digital Humanities and the desirable goals for ancient manuscripts
2) Levels of digitization and the complexity involved
3) Standards, tools and technologies required
4) Work done in India in general and at JNU
5) Digitization and beyond
6) Suggestions and conclusion

SLIDE 3

Big data is here

• Big data can get bigger in India
• Language informatics
• Opportunities for Indian languages
    • Less resourced and fringe languages
    • Scheduled languages
    • Classical and heritage languages

SLIDE 4

DH – Digital Humanities

• Applying IT to the various sub-disciplines of the humanities
• India is a curious case for DH research, as we have a multitude of languages, literatures, arts, traditions, etc.
• All of these can potentially lead to big data and data-oriented informatics and intelligence

SLIDE 5

Indian Language Families and % Speakers

• Indo-Aryan: 76.87%
• Dravidian: 20.82%
• Austro-Asiatic: 1.11%
• Tibeto-Burman: 1%
• Andamanese*: 0%

SLIDE 6

Official languages and scripts of India

SLIDE 7

Why Sanskrit?

• The language with the most heritage material
• Predominantly handwritten texts in Devanagari, but other scripts are also used (Odia, Maithili, Bangla, Grantha, Sharada, Brahmi and other major scripts)
• More than 30 million manuscripts waiting to be digitized
• 95% estimated to be in the domain of Science & Technology
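Because the material is spread over several scripts, a digitized text usually needs its script identified before any further processing. The sketch below is only an illustration, not a tool from the talk: it tallies scripts by reading the first word of each Unicode character name, which is a rough stand-in for a proper script property (the helper name `script_counts` is hypothetical, and Unicode labels Odia as ORIYA).

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Tally characters by script, using the first word of the Unicode name.

    Heuristic only: the first word is the script for Indic blocks and Latin,
    but not for every block in Unicode.
    """
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


# "dharma" written in Devanagari, in Bengali script, and in Latin transliteration.
sample = "\u0927\u0930\u094d\u092e \u09a7\u09b0\u09cd\u09ae dharma"
print(script_counts(sample))
```

A page whose tally shows more than one script would be a candidate for the mixed-script handling that handwritten manuscripts often require.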

SLIDE 8

Tasks at hand

• Digitize manuscripts
• Editing/limited processing
• Enable search and cross linking
• Enable readability
• Text processing
• Translation
• Research & Development
• Promotion

SLIDE 9

The problem

Definition of Sanskrit or Indian manuscript?

• Sanskrit vs Indian manuscript
• Geographical expansion
    • Whole of South Asia, South East Asia, China, and other countries that are culturally related or where any mss/text/translation is found
• Older rough counts: 30 million (David Pingree), 6.2 million (NMM)
• 67% or more in Sanskrit
• Estimated loss: several hundred per week (Dominik Wujastyk)

SLIDE 10

National Manuscript Mission (NMM)

• Liberal definition of manuscript
• Effort to collect copies of mss
• Good survey of libraries in northern India (Orissa, Bihar and Uttar Pradesh): 35000 repositories
• Cataloguing and microfilming
• Online search for some
• Training in ancient scripts

SLIDE 11

NMM - problems

• Poor quality of catalogues
• Missing manuscripts, incomplete folios, access issues
• No work on creating technology and standards
• Over-dependence on manpower
• No retention of trained manpower

SLIDE 12

Desirable goals…

• A right mix of human labor and computing technologies
• Digitizing, archiving, search, cross-linking
• Reading help, translation
• Fundamental research, experimentation
• Promotion (popular media, targeting younger readers, multilingual delivery, internationalization)

SLIDE 13

Levels of digitization

• Online/interactive catalogues with multilingual/multi-script search
• Scanned images → e-books/download
• Human-transcribed e-texts → e-books/downloadable
• OCR-transcribed/human-edited e-texts → e-books/downloadable
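One way to picture multi-script search in a catalogue is to fold every title into a script-neutral key before comparing. The sketch below is an assumption-laden illustration, not the system described in the talk: it relies on the fact that the Unicode blocks from Devanagari (U+0900) through Malayalam (U+0D7F) share a largely parallel layout, so the offset within each 128-code-point block can serve as a rough common key. The mapping is only approximate (not every script has every letter), and the names `script_neutral_key` and `search` are hypothetical.

```python
INDIC_START = 0x0900   # first code point of the Devanagari block
INDIC_END = 0x0D7F     # last code point of the Malayalam block
BLOCK_SIZE = 0x80      # each of these script blocks spans 128 code points


def script_neutral_key(text: str) -> str:
    """Fold characters from the parallel Indic blocks onto Devanagari positions."""
    folded = []
    for ch in text:
        cp = ord(ch)
        if INDIC_START <= cp <= INDIC_END:
            cp = INDIC_START + (cp - INDIC_START) % BLOCK_SIZE
        folded.append(chr(cp))
    return "".join(folded)


def search(catalogue: list[str], query: str) -> list[str]:
    """Return catalogue entries containing the query, regardless of Indic script."""
    key = script_neutral_key(query)
    return [entry for entry in catalogue if key in script_neutral_key(entry)]


# Two catalogue entries: one in Devanagari, one in Bengali script.
catalogue = ["\u0915\u092e\u0932", "\u09b0\u09be\u09ae"]
# A Devanagari query for "rama" finds the entry stored in Bengali script.
print(search(catalogue, "\u0930\u093e\u092e"))
```

A production catalogue would more likely go through a proper transliteration scheme; the block-offset trick is used here only because it needs no external data.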

SLIDE 14

What is required?

• Standards for
    • Uni-, bi- and multimodal data encoding
    • Metadata, storage, search
• Tools
    • Data input/output mechanisms
    • Editing, spelling and grammar checking
    • Text readers
    • Translation
    • E-learning/multimedia

SLIDE 15

Standards for Digital technologies

• Do we have one?
• How difficult it is to get standards in India
• MSR initiative
• Efforts under BIS
• Sanskrit POS background
• The ILCI corpora and the first national standard in POS

SLIDE 16

Input mechanisms

• With Unicode available for most of our major languages, texts can be entered
• However, most of the heritage exists as handwritten manuscripts
• Do we have a mechanism for it?
• The printed text recognition consortium under IIT Delhi
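Unicode entry brings one practical wrinkle: the same visible letter can be typed as different code point sequences, so entered texts generally need normalizing before they can be searched or compared. A minimal sketch, assuming NFC is the chosen form (the talk does not say which form, if any, is used; the helper name `canonical` is hypothetical):

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the NFC-normalized form of the text for storage and comparison."""
    return unicodedata.normalize("NFC", text)


# The Devanagari letter QA can arrive as one code point or as KA + NUKTA,
# depending on the keyboard or input method used.
single_code_point = "\u0958"
ka_plus_nukta = "\u0915\u093c"

print(single_code_point == ka_plus_nukta)                        # raw strings differ
print(canonical(single_code_point) == canonical(ka_plus_nukta))  # normalized strings match
```

Applying the same normalization at data entry and at query time keeps typed transcriptions comparable across input tools.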

SLIDE 17

Input mechanisms…

• Oliver Hellwig’s OCR
• OLHWR: consortium under I.I.Sc Bangalore
• OLHWR (Hindi) for tablets: Microsoft Windows group (Redmond)
• How difficult it is
• Resources needed

SLIDE 18

Efforts at JNU - OLHWR

• Microsoft consultancy
• Ink collection
    • Hindi states of UP, Rajasthan, Delhi
    • 2 million ink samples
• Lexical resources
    • System dictionary (basic wordlist, corpora of newspapers, literature, frequency-marked words, offensive words, NEs)

SLIDE 19

Efforts at JNU – OLHWR…

• Devanagari/Hindi model
• Tablet PCs are no longer a focus at Microsoft, so further development is on hold
• We can start from where MS left off and adapt it for Sanskrit

SLIDE 20

Complexity of Sanskrit handwritten texts

• Historical documents with scanty information on date/authorship
• Physical condition of the manuscript
• Quality of the writing in the manuscript
• Can be in multiple languages and scripts
• Have non-linguistic marks

SLIDE 21

What if we develop the handwriting OCR for Sanskrit?

• Text readers
• Searches
• Inter-linking
• Translation
• Research, experimentation

SLIDE 22

Next Steps…

• Promotion
    • Multimedia content creation
    • Electronic media
    • Films, documentaries

SLIDE 23

Work done at JNU

Critical edition, translation and publication of rare manuscripts

Digitize rare manuscripts

Efforts to promote ancient scripts

Computer Tools and resources for Sanskrit

Machine Translation

E-learning/multimedia presentation of texts

Research on fundamental texts

SLIDE 24

Tools…

• Text To Speech for Sanskrit
• NERs
• Language analyzers and generators
• Lexical resources, multimedia content
• Corpora & standards
• Emotion detection

SLIDE 25

Machine Translation

• English-Urdu MT released by Microsoft in Feb 2013
• English-Sindhi is complete; to be released soon
• SaHiT (Sanskrit Hindi Translator), JNU’s simple rule-based system for split prose, will be out this year
• SHMT (Sanskrit consortium system): basic version is out
• Sanskrit-English Translation (SETrans) is being developed using the Microsoft Translator Hub platform
• English to Gujarati, Maithili and Bengali systems are being developed

SLIDE 26

Lexical Resources for interpretation

• Koshas: Amara, Apte, Halayudha, Mankha, Medini, Nirukta, Nighantus, Ayurveda dictionary
• Textual search: Vedas, Upanishadas, Ayurveda, Mahabharata, Ramayana, Kalidasa

SLIDE 27

Corpora & Standards

• ILCI consortium: parallel corpora in 17 Indian languages (including English); Sanskrit is going to be added
• Tagged Sanskrit corpora and tagset (some of it already published by LDC, U Penn)
• LDC (Univ. of Pennsylvania): multimodal corpora in 8 languages for training security systems (Indian English, Hindi, Urdu, Bangla, Tamil, Malayalam, Pushto, Dari)

SLIDE 28

Manual Translation, experimentation, publication

• Manual translation and documentation of scientific procedures
• Laboratory experimentation
• Publications

SLIDE 29

What after digitization of texts? → Collaborative research

SLIDE 30

JNU-UMASSD collaboration

• University of Massachusetts Dartmouth initiative
• Under an endowment from Mukesh & Priti Chatter for exploring provable scientific concepts in ancient Indian texts
• We have translated the following texts in the area of metallurgy
    • Rasayana Saara (into English)
    • Rasasanketa Kalika (into Hindi)
• Two Ph.D. students at UMASSD are researching nanoparticles in traditionally made bhasma

SLIDE 31

Promoting Heritage

• Workshops
• Training programs
• Sanskrit teaching/learning (Vag-Vardhini initiative by research students of JNU)
• Courses, collaborative teaching/research
• Content creation and e-learning/multimedia
• TV programs and other electronic and print media

SLIDE 32

Promoting non Devanagari ancient scripts

Sharada, Grantha, Brahmi

SLIDE 33

Efforts for Brahmi, Sharada, Grantha

• A workshop was organized in March 2012
• Experts on Grantha, Sharada and Brahmi were invited
• The current status of digitization in each script was discussed

SLIDE 34

Current focus on Sharada

• Subsequently, an AoC (Agreement of Collaboration) was signed with an NGO (Millennium India Education Foundation) to accelerate the Sharada-related activities
• We have prepared a detailed activity list for the next 5 years and are looking for potential support

SLIDE 35

Our Technology

• Server-based Java delivery system
• The server named ‘Sanskrit’ (created and managed by us and located in the Sanskrit center) has two load distributors named ‘Panini’ and ‘Patanjali’
• Limited crowdsourcing-based model for large consortium-based corpora and resource development
• Remote management/monitoring of project staff and consortium partners
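The talk names the two load distributors but does not say how requests are shared between them. Purely as an illustration, the sketch below assumes the simplest policy, round-robin; the class `RoundRobinDispatcher` and its method are hypothetical, and it is written in Python for brevity rather than in the Java of the actual system.

```python
from itertools import cycle


class RoundRobinDispatcher:
    """Hand out incoming requests to distributor nodes in strict rotation.

    Round-robin is an assumption made for this sketch; the real policy is not stated.
    """

    def __init__(self, nodes: list[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def assign(self, request_id: int) -> tuple[int, str]:
        """Pair a request with the next node in the rotation."""
        return request_id, next(self._rotation)


dispatcher = RoundRobinDispatcher(["Panini", "Patanjali"])
assignments = [dispatcher.assign(request_id)[1] for request_id in range(4)]
print(assignments)
```

Round-robin ignores how busy each node is; a load-aware policy would need per-node metrics that the slides do not describe.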

SLIDE 36

Some Suggestions

Sanskrit Manuscripts Recognition Consortium (SMaRC) with members from the traditional institutes, technology institutes/groups and computational linguists

Major goals could be Manuscript OCR, proofing, content creation, dissemination, collaboration with various groups or institutes which have digitized/microfilmed Indian manuscripts to bring all of them on one platform

Each partner institute/organization will have a well crafted task-set to accomplish

Follow a sound methodology and include the best experts in the world in the advisory committee

Create a stra panel with members from the best of our experts on manuscripts (in each area of traditional knowledge).

Create trained manpower by partnering with Sanskrit institutions and revising curriculum to include applied courses

SLIDE 37

• Demo – http://sanskrit.jnu.ac.in
• Student projects on manuscripts
• Critical edition/translation
• Easy accessibility and search of digitized text
• Text conversion
• Popularization - E-learning
• Research

SLIDE 38

Thank you
