CLSW 2020: Revisiting Tibetan Word Segmentation with Neural Networks

SLIDE 1

CLSW 2020

Revisiting Tibetan Word Segmentation with Neural Networks

Sangjie Duanzhu, Cizhen Jiacuo, Cairang Jia

Key Laboratory of Tibetan Information Processing and Machine Translation Qinghai Normal University

May 2020

Sangjie et al. 2020 (Qinghai Normal University), CLSW 2020, May 2020

SLIDE 2

Outline

1 Introduction to Tibetan Word Segmentation
  • Tibetan alphabet and word-formation
  • Word formation
  • Tibetan Word Segmentation
  • TWS research background
  • Our work

2 Tibetan Word Segmentation with Neural Networks
  • Tagging schemes
  • Model architecture
  • CRF for tag inference

3 Experiments
  • Datasets
  • Results

4 Conclusion

5 Acknowledgment

SLIDE 3

Introduction to Tibetan Word Segmentation Tibetan alphabet and word-formation

Tibetan alphabet

[Figure: the Tibetan alphabet shown alongside the Sanskrit alphabet]

SLIDE 4

Introduction to Tibetan Word Segmentation Word formation

Word formation

  • A word is composed of syllables
  • A syllable is composed of one or more characters
  • Syllables are separated by a special character, ་ (tsheg)

This word means "peace"; it is composed of 2 syllables, of which the first contains 2 characters and the second contains 3. Syllable composition can get more complex, e.g. a single syllable may contain 7 characters.

SLIDE 5

Introduction to Tibetan Word Segmentation Tibetan Word Segmentation

Tibetan Word Segmentation

  • Different from European languages such as English, Tibetan has no explicit delimiter between words
  • Tibetan word segmentation (TWS) is usually the first and essential sub-task in a Tibetan NLP workflow
  • The performance of TWS has a crucial impact on many downstream tasks, because errors propagate through a multi-stage NLP pipeline

Example
ཇ་ཁང་ (restaurant)/ ནང་ (inside)/ ན་ (functional case)/ བ་མ་ (pots)/ མང་ (many)/ ། (punctuation)/ → There are many pots in the restaurant.
ནམ་མཁ (sky)/ ར་ (functional case)/ འཕཱུར (fly)/ ། (punctuation)/ → Fly into the sky.
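To make the contrast concrete, here is a minimal sketch (not from the slides): syllable boundaries are trivially recoverable by splitting on the tsheg, while word boundaries are not, and supplying the latter is exactly the job of TWS.

```python
# Syllables are delimited by the tsheg (U+0F0B), but words are not:
# splitting on the tsheg yields syllables, never words.

TSHEG = "\u0f0b"  # ་

def syllables(text: str) -> list[str]:
    """Split a Tibetan string into syllables on the tsheg delimiter."""
    return [s for s in text.split(TSHEG) if s]

# "Fly into the sky." from the slide: three syllables, but the first
# two syllables together form a single word ("sky" + case marker).
print(syllables("ནམ་མཁར་འཕཱུར།"))
# → ['ནམ', 'མཁར', 'འཕཱུར།']
```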

SLIDE 6

Introduction to Tibetan Word Segmentation TWS research background

TWS research background

  • Given the significance of TWS, researchers began to address it using maximum matching methods back in 1999 [Tsering, 1999]
  • Dictionary-based, rule-based, or hybrid approaches later became the main methods in this field
  • Currently, traditional statistical models, e.g. HMM, CRF, or EM, are the primary choice for implementing TWS systems
  • In recent years, with the widespread adoption of deep learning methods in NLP, the Tibetan NLP community has started to embrace the new paradigm shift
  • Some initial work has been done to explore TWS with neural networks [Li et al., 2018]
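As background for the maximum matching methods mentioned above, a minimal sketch of forward maximum matching over syllables (the lexicon and syllable names here are invented placeholders, not data from the paper):

```python
# Hypothetical forward-maximum-matching segmenter: at each position,
# greedily take the longest lexicon entry, else a single syllable.

def max_match(sylls, lexicon, max_len=4):
    words, i = [], 0
    while i < len(sylls):
        # Try the longest candidate first, fall back to one syllable.
        for n in range(min(max_len, len(sylls) - i), 0, -1):
            cand = tuple(sylls[i:i + n])
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

# Placeholder lexicon; greedy matching commits to ("a", "b") first,
# so the overlapping entry ("b", "c", "d") is never reached.
lexicon = {("a", "b"), ("b", "c", "d")}
print(max_match(["a", "b", "c", "d"], lexicon))
# → [('a', 'b'), ('c',), ('d',)]
```

This greedy failure mode on overlapping entries is one reason dictionary-based methods were later supplemented by statistical and neural models.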


SLIDE 8

Introduction to Tibetan Word Segmentation TWS research background

TWS research background

  • Dictionary-based methods rely heavily on dictionaries, linguistic rules, and other forms of knowledge hand-crafted with great care by linguistic experts
  • Statistical methods hold strong assumptions of conditional independence and take discrete representations of basic language units as input, which
    • limit the capacity for feature selection
    • limit the capacity for modeling contextual signals
    • lead to a moderate amount of semantic information loss
    • impose constraints on the modeling capacity

SLIDE 9

Introduction to Tibetan Word Segmentation Our work

Our work

  • We used pre-trained models for both character-level and syllable-level contextual representations to better capture semantic information
  • A combined CNN and Bi-LSTM network stack is used to fully capture sentence-level representations
  • A subsequent CRF layer is appended as the inference component of our model, to tag syllables
  • In experiments, the accuracy, recall and F-score reach 93.4%, 95.4% and 94.1% on the test set, surpassing our base models by a large margin

SLIDE 10

Tibetan Word segmentation with neural networks Tagging schemes

Tagging schemes

  • The BMES tagging scheme is commonly used in both TWS and CWS (Chinese word segmentation) tasks
  • In Tibetan, many functional suffixes such as འི འང ས འོ ར can be agglutinated with certain syllables, so a Tibetan syllable is not necessarily the smallest language unit that requires tagging
  • The BMES scheme can potentially produce a large number of invalid Tibetan character combinations such as ནམ་མཁ
  • To eliminate this, an extra tag can be introduced to label agglutinated Tibetan syllables

Different tagging schemes for TWS

Original sentence: ནམ་མཁར་འཕཱུར། (Fly into the sky)
Tokenized sentence: ནམ་མཁ | ར་ | འཕཱུར | །

Tagging scheme        Tagged sentence
BMES                  ནམ[B]་[M]མཁ[E]ར[B]་[E]འཕཱུར[S]།[S]
BMESN with tsheg      ནམ[B]་[M]མཁར་[N]འཕཱུར[S]།[S]
BMESN without tsheg   ནམ[B]མཁར[N]འཕཱུར[S]།[S]
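The BMES labeling rule in the table above can be sketched as follows (the standard scheme, not code from the paper; the Latin syllable names are placeholders):

```python
# BMES: a single-syllable word -> S; a multi-syllable word -> B, M..., E.
# The extra N tag of the BMESN scheme would mark agglutinated syllables.

def bmes_tags(words):
    """words: a tokenized sentence, each word a list of syllables."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# A two-syllable word, a one-syllable word, and a punctuation token:
print(bmes_tags([["nam", "mkhar"], ["phur"], ["shad"]]))
# → ['B', 'E', 'S', 'S']
```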

SLIDE 11

Tibetan Word segmentation with neural networks Model architecture

Model architecture

  • A CNN is applied to the characters of each syllable to capture character-level information
  • The CNN output is then fed into subsequent Bi-LSTM networks, which encode Tibetan sentences based on syllable-level signals
  • The Bi-LSTM outputs are passed into the final inference layer to predict the correct tag for each syllable

[Figure: model architecture. For each syllable (ནམ, མཁར, འཕཱུར), the character-CNN output is concatenated with the syllable embedding, fed through a Bi-LSTM (forward and backward LSTM states concatenated), and the SOFTMAX/CRF layer emits the tags B, N, S.]

SLIDE 12

Tibetan Word segmentation with neural networks CRF for tag inference

CRF for tag inference

  • Word segmentation can be formalized as a sequence labeling task, which requires modeling not only the mapping from each input token to its label, but also the dependencies between predicted labels
  • The CRF implements sequential dependencies in the predictions, which allows unconstrained features to model the conditional probability of the output y for a given input x

[Figure: linear-chain CRF. Inputs x_{i-1}, x_i, x_{i+1} (ནམ, མཁར, འཕཱུར) connect to labels y_{i-1}, y_i, y_{i+1} (B, N, S), with edges between adjacent labels.]
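Tag inference in a linear-chain CRF is typically done with Viterbi decoding. The sketch below uses invented emission and transition scores (not the paper's learned parameters) just to show how transition scores rule out invalid tag sequences:

```python
# Minimal Viterbi decoder for a linear-chain CRF. All scores here are
# hypothetical; in the full model they would come from the Bi-LSTM
# emissions and a learned transition matrix.

def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list over positions of {tag: score}
    transitions: {(prev_tag, tag): score}
    """
    score = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        prev, new = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            prev[t] = best
            new[t] = score[best] + transitions[(best, t)] + em[t]
        back.append(prev)
        score = new
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

tags = ["B", "E", "S"]
# Large negative transition scores forbid invalid moves like B -> B.
trans = {(p, t): 0.0 for p in tags for t in tags}
for bad in [("B", "B"), ("B", "S"), ("S", "E"), ("E", "E")]:
    trans[bad] = -100.0
ems = [{"B": 2.0, "E": 0.0, "S": 1.0},
       {"B": 0.0, "E": 1.5, "S": 1.0},
       {"B": 0.0, "E": 0.0, "S": 2.0}]
print(viterbi(ems, trans, tags))
# → ['B', 'E', 'S']
```

Note how the decoder picks a globally consistent path: even when a locally high-scoring tag is available, forbidden transitions steer it toward a valid B/E/S sequence.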

SLIDE 13

Experiments Datasets

Dataset

Syllable tag distribution and data sizes for training/validation/test sets

Dataset      B (%)   M (%)   E (%)   S (%)   N (%)   Data size (sentences, K)
Training     27.60   7.35    26.79   32.29   5.98    150
Validation   27.66   7.29    26.83   32.33   5.89    10
Testing      27.71   7.26    26.77   32.23   6.03    10
Overall      27.61   7.34    26.79   32.29   5.97    170

Data set for pre-training syllable and character representations

Embedding type        Total tokens   Unique tokens
Syllable embedding    14M            10206
Character embedding   60M            306

SLIDE 14

Experiments Results

Results

Experimental results of four types of models

Model           Embedding    Accuracy (%)   Recall (%)   F1 (%)
CRF             -            89.0           86.6         89.5
LSTM+SOFTMAX    Random       89.7           87.9         88.6
LSTM+SOFTMAX    Pretrained   90.1           88.6         87.5
LSTM+CRF        Random       90.9           89.6         89.7
LSTM+CRF        Pretrained   92.0           90.4         90.5
CNN+LSTM+CRF    Random       92.5           91.3         90.0
CNN+LSTM+CRF    Pretrained   93.4           94.2         94.1

SLIDE 15

Conclusion

Conclusion

  • In this work, we explored Tibetan word segmentation models built on multiple neural network architectures, compared them with traditional statistical methods, and verified that the CNN + LSTM + CRF architecture performs best on the test data set
  • Due to limited labeled data, the model can currently only be used for the Tibetan word segmentation task. We plan to use this work as the basis for a general Tibetan sequence labeling framework covering word segmentation, part-of-speech tagging, and NER
  • Recently, Transformer-based models have truly changed the way NLP researchers work with text data. There is potential to further improve our model by using Transformers to encode Tibetan syllable or even character sequences, but we leave this for future work.

SLIDE 16

Bibliography

▶ Li, B., Liu, H., Long, C., & Wu, J. (2018). Deep learning based Tibetan word segmentation methods. Computer Engineering and Design, (1), 194–198.
▶ Tsering, T. (1999). Design of an interactive Tibetan word segmentation and word registering system.

SLIDE 17

Acknowledgment

Acknowledgment

  • This work was supported by the National Natural Science Foundation of China (grant numbers 61063033, 61662061) and the National Key Research and Development Program of China (grant number 2017YFB1402200).
  • Thank you, CLSW organizers!
  • Thank you, anonymous reviewers!

SLIDE 18

Thanks

Hope you are all well and safe

ཐུགས་རྗེ་གནང་། (Thank you)
