SLIDE 1

Bidirectional LSTM-CRF Models for Sequence Tagging

ADVISOR: JIA-LING, KOH SOURCE: CORR 2015 SPEAKER: SHAO-WEI, HUANG DATE: 2020/01/15

SLIDE 2

OUTLINE

⚫ Introduction
⚫ Method
⚫ Experiment
⚫ Conclusion

SLIDE 3

INTRODUCTION

➢ Sequence tagging: tag each part (token) of a sentence (sequence).

  • POS tagging

(Ex): She (PRO) lives (V) in (Prep) Taiwan (N).

  • Chunking

(Ex): [NP He] [VP estimates] [NP the current account deficit] [VP will shrink] [PP to] [NP just 1.8 billion].

SLIDE 4

INTRODUCTION

➢ Sequence tagging: tag each token in the sentence (sequence).

  • Named entity recognition:

(Ex): EU (B-ORG) rejects (O) German (B-MISC) call (O) to (O) boycott (O) British (B-MISC) lamb (O) .
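The BIO tags in the example above can be grouped back into labeled entities. Below is a small illustrative helper (not from the paper; the function name is an assumption):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags (B-X begins an entity, I-X continues it, O is
    outside) into (label, text) entity spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes the last span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:                  # close the open span
                spans.append((label, " ".join(tokens[start:i])))
                start, label = None, None
            if tag.startswith("B-"):               # open a new span
                start, label = i, tag[2:]
        # an I- tag simply extends the open span
    return spans
```

On the sentence above, this yields the entities EU (ORG), German (MISC), and British (MISC).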

SLIDE 5

OUTLINE

Introduction Method Experiment Conclusion

SLIDE 6

METHOD

➢ Simple RNN:

[Figure: Simple RNN model — input x(t) and the previous hidden state h(t-1) produce the hidden state h(t) and output y(t).]
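The recurrence shown can be sketched as follows (a minimal NumPy sketch, not the paper's implementation; the weight names W_xh, W_hh, W_hy are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of a simple (Elman) RNN:
    h(t) = tanh(W_xh x(t) + W_hh h(t-1)),  y(t) = softmax(W_hy h(t))."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    scores = W_hy @ h_t
    exp = np.exp(scores - scores.max())   # numerically stable softmax over tags
    return h_t, exp / exp.sum()
```

For tagging, y(t) is a probability distribution over the tag set at position t.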

SLIDE 7

METHOD

➢ LSTM:
⚫ Forget gate: based on the previous time step's output and the current time step's input, decides how much of the cell state to forget.
⚫ Input gate: based on the previous time step's output and the current time step's input, decides how much new information to store in the cell.
⚫ Output gate: based on the cell state, the current time step's input, and the previous time step's output, decides the output h(t).

Note:
  • σ is the sigmoid function.
  • ⊙ is the element-wise product.

[Figure: LSTM model — input x(t) and the previous time step's hidden state h(t-1) produce h(t).]
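The three gates described above correspond to the standard LSTM update equations (standard formulation, with σ and ⊙ as defined in the note):

```latex
\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) && \text{output gate} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```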

SLIDE 8

METHOD

➢ LSTM:

[Figure: LSTM model]

Reference: https://www.itread01.com/content/1545027542.html

SLIDE 9

METHOD

➢ Bi-LSTM: a forward LSTM and a backward LSTM process the sequence in opposite directions, and their hidden states are combined at each time step.

[Figure: Bi-LSTM model]
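The bidirectional idea can be sketched as follows (a minimal sketch in which a generic recurrent `step_fn` stands in for the LSTM cell; the function names are illustrative):

```python
import numpy as np

def bidirectional_encode(xs, step_fn, h0):
    """Run a recurrent cell left-to-right and right-to-left over the
    sequence, then concatenate the two hidden states at each position."""
    fwd, h = [], h0
    for x in xs:                       # forward pass
        h = step_fn(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):             # backward pass
        h = step_fn(x, h)
        bwd.append(h)
    bwd.reverse()                      # realign with the input order
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output vector thus contains both left and right context for its position, which is what the tagging layer consumes.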

SLIDE 10

METHOD

➢ CRF: instead of modeling tagging decisions independently, a CRF models them jointly.

[Figure: CRF model]

➢ X = (x_1, x_2, …, x_n): an input sentence; y = (y_1, y_2, …, y_n): a sequence of predicted tags.
➢ Score:

P: P_{i, y_i} is the score of the y_i-th tag for the i-th word of the sentence (computed independently per word).

          tag1  tag2  tag3  tag4
    W1    0.7   0.1   0.1   0.1
    W2    0.1   0.1   0.1   0.7
    W3    0.1   0.7   0.1   0.1

A: a matrix of transition scores; A_{y_i, y_{i+1}} represents the score of a transition from tag y_i to tag y_{i+1}.

          tag1  tag2  tag3  tag4
    tag1  0.6   0.2   0.1   0.1
    tag2  0.1   0.1   0.1   0.7
    tag3  0.1   0.7   0.1   0.1
    tag4  0.5   0.1   0.1   0.3
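With the example matrices above, the score of a candidate tag sequence is the sum of its emission scores and transition scores. A sketch of the scoring function (the sample tag sequence is illustrative):

```python
import numpy as np

# Emission scores P (3 words x 4 tags) and transition scores A
# (4 tags x 4 tags), taken from the example tables above.
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7],
              [0.1, 0.7, 0.1, 0.1]])
A = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7],
              [0.1, 0.7, 0.1, 0.1],
              [0.5, 0.1, 0.1, 0.3]])

def crf_score(P, A, tags):
    """score(X, y) = sum_i P[i, y_i] + sum_i A[y_i, y_{i+1}]."""
    emission = sum(P[i, t] for i, t in enumerate(tags))
    transition = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return emission + transition

# e.g. (W1=tag1, W2=tag4, W3=tag2):
# emissions 0.7 + 0.7 + 0.7, transitions A[0,3] + A[3,1] = 0.1 + 0.1, total ≈ 2.3
```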

SLIDE 11

METHOD CRF model

➢ Normalization: p(y|X) = exp(score(X, y)) / Σ_y' exp(score(X, y'))
➢ Loss function: Loss = -log(p(y|X))
➢ Decoding: output the tag sequence with the maximum score.
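Finding the maximum-scoring tag sequence is done efficiently with the Viterbi algorithm. A minimal sketch over an emission matrix P and transition matrix A as defined on the previous slide (a standard dynamic-programming formulation, not the paper's code):

```python
import numpy as np

def viterbi(P, A):
    """Return (best tag sequence, best score) under emission matrix P
    (n_words x n_tags) and transition matrix A (n_tags x n_tags)."""
    n, T = P.shape
    score = P[0].copy()                  # best score ending in each tag
    back = np.zeros((n, T), dtype=int)   # backpointers
    for i in range(1, n):
        # candidate scores for every (previous tag, next tag) pair
        cand = score[:, None] + A + P[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]         # trace back from the best final tag
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    tags.reverse()
    return tags, float(score.max())
```

This takes O(n·T²) time instead of enumerating all Tⁿ tag sequences.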

SLIDE 12

METHOD CRF model, LSTM-CRF model, Bi-LSTM-CRF model

[Figure: CRF, LSTM-CRF, and Bi-LSTM-CRF architectures]

SLIDE 13

METHOD BERT (Transformer)

➢ Self-attention:

Reference: Attention Is All You Need
https://blog.csdn.net/jiaowoshouzi/article/details/89073944
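A single self-attention head computes softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of scaled dot-product self-attention (the projection matrices Wq, Wk, Wv are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (n x d):
    each position attends to every position, weighted by a softmax."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # n x n similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)          # each row sums to 1
    return w @ V                                   # weighted mix of values
```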

SLIDE 14

METHOD BERT (Transformer)

➢ Multi-head attention:

[Figure: multi-head attention — the per-head attention outputs are concatenated and multiplied by an output projection matrix.]

Reference: Attention Is All You Need
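Multi-head attention runs several such heads in parallel and projects their concatenation. A sketch building on the single-head computation (head count and sizes are illustrative):

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, heads, Wo):
    """Concatenate each head's output, then apply the output
    projection Wo, as in the Transformer's multi-head attention.
    heads is a list of (Wq, Wk, Wv) triples, one per head."""
    return np.concatenate([attention_head(X, *h) for h in heads], axis=-1) @ Wo
```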

SLIDE 15

METHOD BERT model

➢ BERT model:

[Figure: BERT model]

SLIDE 16

METHOD BERT-CRF model

➢ BERT-CRF model: connect a CRF layer after BERT's final hidden layer.

(Ex): EU (B-ORG) rejects (O) German (B-MISC) call (O)

Reference: Transfer learning for scientific data chain extraction in small chemical corpus with BERT-CRF model

SLIDE 17

OUTLINE

Introduction Method Experiment Conclusion

SLIDE 18

EXPERIMENT

Dataset

➢ Penn TreeBank (PTB): POS tagging
➢ CoNLL 2000: chunking
➢ CoNLL 2003: named entity tagging

SLIDE 19

EXPERIMENT

Features

➢ Spelling features
➢ Context features: uni-gram, bi-gram, tri-gram
➢ Word embedding: Senna word embedding (each word corresponds to a 50-dimensional embedding vector).

SLIDE 20

EXPERIMENT

➢ Comparison with other networks:

[Table: results — POS tagging accuracy, chunking F1, NER F1]

SLIDE 21

EXPERIMENT

➢ Performance with only word feature:

[Table: results — POS tagging accuracy, chunking F1, NER F1]

SLIDE 22

OUTLINE

Introduction Method Experiment Conclusion

SLIDE 23

CONCLUSION

➢ Systematically compares the performance of the aforementioned models.
➢ The first work to apply a bidirectional LSTM-CRF model to NLP benchmark sequence tagging data sets.
➢ Shows that the Bi-LSTM-CRF model is robust and has less dependence on word embeddings.
➢ BERT-CRF model (proposed in another paper).