CLSW 2020: Revisiting Tibetan Word Segmentation with Neural Networks

SLIDE 1

CLSW 2020

Revisiting Tibetan Word Segmentation with Neural Networks

Sangjie Duanzhu, Cizhen Jiacuo, Cairang Jia

Key Laboratory of Tibetan Information Processing and Machine Translation Qinghai Normal University

May 2020

Sangjie et al. 2020 (Qinghai Normal University), CLSW 2020, May 2020

SLIDE 2

Outline

1 Introduction to Tibetan Word Segmentation
  • Tibetan alphabet and word-formation
  • Word formation
  • Tibetan Word Segmentation
  • TWS research background
  • Our work

2 Tibetan Word Segmentation with Neural Networks
  • Tagging schemes
  • Model architecture
  • CRF for tag inference

3 Experiments
  • Datasets
  • Results

4 Conclusion

5 Acknowledgment

SLIDE 3

Introduction to Tibetan Word Segmentation Tibetan alphabet and word-formation

Tibetan alphabet

[Figure: the Tibetan alphabet shown alongside the Sanskrit alphabet]

SLIDE 4

Introduction to Tibetan Word Segmentation Word formation

Word formation

  • A word is composed of syllables
  • A syllable is composed of one or more characters
  • Syllables are separated by a special character, ་ (tsheg)

This word means "peace"; it is composed of 2 syllables, of which the first contains 2 characters and the second contains 3. Syllable composition can get more complex, e.g. a single syllable may contain 7 characters.

SLIDE 5

Introduction to Tibetan Word Segmentation Tibetan Word Segmentation

Tibetan Word Segmentation

  • Different from European languages such as English, Tibetan has no explicit delimiter between words
  • Tibetan word segmentation (TWS) is usually the first and essential sub-task in a Tibetan NLP workflow
  • The performance of TWS has a crucial impact on many downstream tasks, because errors propagate through a multi-stage NLP pipeline

Example
ཇ་ཁང་ (restaurant)/ ནང་ (inside)/ ན་ (functional case)/ བ་མ་ (pots)/ མང་ (many)/ ། (punctuation)/ → There are many pots in the restaurant.
ནམ་མཁ (sky)/ ར་ (functional case)/ འཕཱུར (fly)/ ། (punctuation)/ → Fly into the sky.
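To make the contrast concrete, here is a minimal sketch (not from the slides): syllable boundaries are trivially recoverable by splitting on the tsheg, while word boundaries are not, and supplying the latter is exactly the job of TWS.

```python
# Syllables are delimited by the tsheg (U+0F0B), but words are not:
# splitting on the tsheg yields syllables, never words.

TSHEG = "\u0f0b"  # ་

def syllables(text: str) -> list[str]:
    """Split a Tibetan string into syllables on the tsheg delimiter."""
    return [s for s in text.split(TSHEG) if s]

# "Fly into the sky." from the slide: three syllables, but the first
# two syllables together form a single word ("sky" + case marker).
print(syllables("ནམ་མཁར་འཕཱུར།"))
# → ['ནམ', 'མཁར', 'འཕཱུར།']
```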

SLIDE 6

Introduction to Tibetan Word Segmentation TWS research background

TWS research background

  • Given the significance of TWS, researchers began to address it using maximum matching methods back in 1999 [Tsering, 1999]
  • Dictionary-based, rule-based, or hybrid approaches later became the main methods in this field
  • Currently, traditional statistical models, e.g. HMM, CRF, or EM, are the primary choice for implementing TWS systems
  • In recent years, with the widespread adoption of deep learning methods in NLP, the Tibetan NLP community has started to embrace the new paradigm shift
  • Some initial work has been done to explore TWS with neural networks [Li et al., 2018]
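As background for the maximum matching methods mentioned above, a minimal sketch of forward maximum matching over syllables (the lexicon and syllable names here are invented placeholders, not data from the paper):

```python
# Hypothetical forward-maximum-matching segmenter: at each position,
# greedily take the longest lexicon entry, else a single syllable.

def max_match(sylls, lexicon, max_len=4):
    words, i = [], 0
    while i < len(sylls):
        # Try the longest candidate first, fall back to one syllable.
        for n in range(min(max_len, len(sylls) - i), 0, -1):
            cand = tuple(sylls[i:i + n])
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

# Placeholder lexicon; greedy matching commits to ("a", "b") first,
# so the overlapping entry ("b", "c", "d") is never reached.
lexicon = {("a", "b"), ("b", "c", "d")}
print(max_match(["a", "b", "c", "d"], lexicon))
# → [('a', 'b'), ('c',), ('d',)]
```

This greedy failure mode on overlapping entries is one reason dictionary-based methods were later supplemented by statistical and neural models.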


SLIDE 8

Introduction to Tibetan Word Segmentation TWS research background

TWS research background

  • Dictionary-based methods rely heavily on dictionaries, linguistic rules, and other forms of knowledge hand-crafted with great care by linguistic experts
  • Statistical methods hold strong assumptions of conditional independence and take discrete representations of basic language units as input, which
    • limit the capacity for feature selection
    • limit the capacity for modeling contextual signals
    • lead to a moderate amount of semantic information loss
    • impose constraints on the modeling capacity

SLIDE 9

Introduction to Tibetan Word Segmentation Our work

Our work

  • We used pre-trained models for both character-level and syllable-level contextual representations to better capture semantic information
  • A combined CNN and Bi-LSTM network stack is used to fully capture sentence-level representations
  • A subsequent CRF layer is appended as the inference component of our model, to tag syllables
  • In experiments, the accuracy, recall and F-score reach 93.4%, 95.4% and 94.1% on the test set, surpassing our base models by a large margin

SLIDE 10

Tibetan Word segmentation with neural networks Tagging schemes

Tagging schemes

  • The BMES tagging scheme is commonly used in both TWS and CWS (Chinese word segmentation) tasks
  • In Tibetan, many functional suffixes such as འི འང ས འོ ར can be agglutinated with certain syllables, so a Tibetan syllable is not necessarily the smallest language unit that requires tagging
  • The BMES scheme can potentially produce a large number of invalid Tibetan character combinations such as ནམ་མཁ
  • To eliminate this, an extra tag can be introduced to label agglutinated Tibetan syllables

Different tagging schemes for TWS

Original sentence: ནམ་མཁར་འཕཱུར། (Fly into the sky)
Tokenized sentence: ནམ་མཁ | ར་ | འཕཱུར | །

Tagging scheme        Tagged sentence
BMES                  ནམ[B]་[M]མཁ[E]ར[B]་[E]འཕཱུར[S]།[S]
BMESN with tsheg      ནམ[B]་[M]མཁར་[N]འཕཱུར[S]།[S]
BMESN without tsheg   ནམ[B]མཁར[N]འཕཱུར[S]།[S]
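The BMES labeling rule in the table above can be sketched as follows (the standard scheme, not code from the paper; the Latin syllable names are placeholders):

```python
# BMES: a single-syllable word -> S; a multi-syllable word -> B, M..., E.
# The extra N tag of the BMESN scheme would mark agglutinated syllables.

def bmes_tags(words):
    """words: a tokenized sentence, each word a list of syllables."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# A two-syllable word, a one-syllable word, and a punctuation token:
print(bmes_tags([["nam", "mkhar"], ["phur"], ["shad"]]))
# → ['B', 'E', 'S', 'S']
```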

SLIDE 11

Tibetan Word segmentation with neural networks Model architecture

Model architecture

  • A CNN is applied to the characters of each syllable to capture character-level information
  • The CNN output is then fed into subsequent Bi-LSTM networks, which encode Tibetan sentences based on syllable-level signals
  • The Bi-LSTM outputs are passed into the final inference layer to predict the correct tag for each syllable

[Figure: model architecture. For each syllable (ནམ, མཁར, འཕཱུར), the character-CNN output is concatenated with the syllable embedding, fed through a Bi-LSTM (forward and backward LSTM states concatenated), and the SOFTMAX/CRF layer emits the tags B, N, S.]

SLIDE 12

Tibetan Word segmentation with neural networks CRF for tag inference

CRF for tag inference

  • Word segmentation can be formalized as a sequence labeling task, which requires modeling not only the mapping from each input token to its label, but also the dependencies between predicted labels
  • The CRF implements sequential dependencies in the predictions, which allows unconstrained features to model the conditional probability of the output y for a given input x

[Figure: linear-chain CRF. Inputs x_{i-1}, x_i, x_{i+1} (ནམ, མཁར, འཕཱུར) connect to labels y_{i-1}, y_i, y_{i+1} (B, N, S), with edges between adjacent labels.]
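Tag inference in a linear-chain CRF is typically done with Viterbi decoding. The sketch below uses invented emission and transition scores (not the paper's learned parameters) just to show how transition scores rule out invalid tag sequences:

```python
# Minimal Viterbi decoder for a linear-chain CRF. All scores here are
# hypothetical; in the full model they would come from the Bi-LSTM
# emissions and a learned transition matrix.

def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list over positions of {tag: score}
    transitions: {(prev_tag, tag): score}
    """
    score = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        prev, new = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            prev[t] = best
            new[t] = score[best] + transitions[(best, t)] + em[t]
        back.append(prev)
        score = new
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

tags = ["B", "E", "S"]
# Large negative transition scores forbid invalid moves like B -> B.
trans = {(p, t): 0.0 for p in tags for t in tags}
for bad in [("B", "B"), ("B", "S"), ("S", "E"), ("E", "E")]:
    trans[bad] = -100.0
ems = [{"B": 2.0, "E": 0.0, "S": 1.0},
       {"B": 0.0, "E": 1.5, "S": 1.0},
       {"B": 0.0, "E": 0.0, "S": 2.0}]
print(viterbi(ems, trans, tags))
# → ['B', 'E', 'S']
```

Note how the decoder picks a globally consistent path: even when a locally high-scoring tag is available, forbidden transitions steer it toward a valid B/E/S sequence.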

SLIDE 13

Experiments Datasets

Dataset

Syllable tag distribution and data sizes for training/validation/test sets

Dataset      B (%)   M (%)   E (%)   S (%)   N (%)   Data size (sentences, K)
Training     27.60   7.35    26.79   32.29   5.98    150
Validation   27.66   7.29    26.83   32.33   5.89    10
Testing      27.71   7.26    26.77   32.23   6.03    10
Overall      27.61   7.34    26.79   32.29   5.97    170

Data set for pre-training syllable and character representations

Embedding type        Total tokens   Unique tokens
Syllable embedding    14M            10206
Character embedding   60M            306

SLIDE 14

Experiments Results

Results

Experimental results of four types of models

Model           Embedding    Accuracy (%)   Recall (%)   F1 (%)
CRF             -            89.0           86.6         89.5
LSTM+SOFTMAX    Random       89.7           87.9         88.6
LSTM+SOFTMAX    Pretrained   90.1           88.6         87.5
LSTM+CRF        Random       90.9           89.6         89.7
LSTM+CRF        Pretrained   92.0           90.4         90.5
CNN+LSTM+CRF    Random       92.5           91.3         90.0
CNN+LSTM+CRF    Pretrained   93.4           94.2         94.1

SLIDE 15

Conclusion

Conclusion

  • In this work, we explored Tibetan word segmentation models built on multiple neural network architectures, compared them with traditional statistical methods, and verified that the CNN + LSTM + CRF architecture performs best on the test data set
  • Due to limited labeled data, the model can currently only be used for the Tibetan word segmentation task. We plan to use this work as the basis for a general Tibetan sequence labeling framework covering word segmentation, part-of-speech tagging, and NER
  • Recently, Transformer-based models have truly changed the way NLP researchers work with text data. There is potential to further improve our model by using Transformers to encode Tibetan syllable or even character sequences, but we leave this for future work.

SLIDE 16

Bibliography

▶ Li, B., Liu, H., Long, C., & Wu, J. (2018). Deep learning based Tibetan word segmentation methods. Computer Engineering and Design, (1), 194–198.
▶ Tsering, T. (1999). Design of an interactive Tibetan word segmentation and word registering system.

SLIDE 17

Acknowledgment

Acknowledgment

  • This work was supported by the National Natural Science Foundation of China (grant numbers 61063033, 61662061) and the National Key Research and Development Program of China (grant number 2017YFB1402200).
  • Thank you, CLSW organizers!
  • Thank you, anonymous reviewers!

SLIDE 18

Thanks

Hope you are all well and safe

ཐུགས་རྗེ་གནང་། (Thank you)
