Joint Word Segmentation and POS-Tagging Using a Single Perceptron



SLIDE 1

Joint Word Segmentation and POS-Tagging Using a Single Perceptron

Yue Zhang and Stephen Clark, Oxford University Computing Laboratory, June 5, 2008

Oxford University Computing Laboratory

SLIDE 2

Introduction to Chinese POS-tagging

  • Chinese sentences are written as character sequences
  • I like reading
  • Word segmentation is a necessary step before POS-tagging

Input

  • Ilikereading

Segment: I | like | reading
Tag: I/PN like/V reading/N

  • The traditional approach treats word segmentation and POS-tagging as two separate steps

SLIDE 3

Two observations

  • Segmentation errors propagate to the POS-tagging step

Input

  • Ilikereading

Segment: Ili | ke | reading
Tag: Ili/N ke/V reading/N

  • Information about POS helps to improve segmentation

[Examples (the Chinese characters were lost in extraction): the same character sequence can be segmented differently depending on its POS analysis, e.g. /CD /M /N vs. /CD /JJ /CD vs. /CD /CD /CD /CD /CD]

SLIDE 4

Joint segmentation and tagging

  • The observations lead to the solution of joint segmentation and POS-tagging

Input

  • Ilikereading

Output

  • I/PN like/V reading/N

  • Consider segmentation and POS information simultaneously
  • The most appropriate output is chosen from all possible segmented and tagged outputs

SLIDE 5

Challenges

  • How to evaluate the correctness of outputs – the model
  • How to perform decoding – choose the best from all possible outputs

  • Difficulty in the large combined search space: O(2^(n−1) · T^n). Depending on the feature set, dynamic programming can be inefficient too (which is the case for this paper).

  • How to automatically train parameters in the model

The challenge of training features for segmentation and POS-tagging simultaneously.
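For tiny inputs, the size of the combined search space can be checked by brute-force enumeration. A minimal Python sketch (illustrative, not from the paper) that generates every segmented-and-tagged candidate and compares the count to the O(2^(n−1) · T^n) bound:

```python
from itertools import product

def segmentations(chars):
    """Yield every segmentation of a character string.

    Each of the n-1 gaps between characters either is a word
    boundary or is not, giving 2**(n-1) segmentations in total.
    """
    n = len(chars)
    for gaps in product([False, True], repeat=n - 1):
        words, start = [], 0
        for i, boundary in enumerate(gaps, start=1):
            if boundary:
                words.append(chars[start:i])
                start = i
        words.append(chars[start:])
        yield words

def joint_candidates(chars, tags):
    """Yield every segmented-and-tagged candidate for the input."""
    for words in segmentations(chars):
        for assignment in product(tags, repeat=len(words)):
            yield list(zip(words, assignment))

chars, tags = "abc", ["N", "V"]
candidates = list(joint_candidates(chars, tags))
n, T = len(chars), len(tags)
# 4 segmentations and 18 joint candidates here, within the
# upper bound 2**(n-1) * T**n = 32.
assert len(candidates) <= 2 ** (n - 1) * T ** n
```

The bound is loose because a segmentation with k words admits only T^k taggings, but the exponential growth in n is what makes exact search expensive.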

SLIDE 6

Existing solutions

Ng and Low (2004)

  • The model: maps joint segmentation and POS-tagging to a character tagging problem, assigning each character a combined tag that indicates both its position in a word (s = single, b = begin, e = end) and the POS of that word, e.g. s-PN b-V e-V b-N e-N for the characters of (I) (like) (reading)

  • Decoding: beam search
  • Training: maximum entropy model for sequence labeling
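The character-tagging mapping can be sketched as below, assuming the usual s/b/m/e boundary scheme (the slide's two-character words only exercise b and e). Placeholder ASCII words stand in for the Chinese characters lost in extraction:

```python
def to_char_tags(tagged_words):
    """Map a segmented, POS-tagged sentence to per-character tags.

    Each character receives a combined tag: a boundary marker
    (s=single, b=begin, m=middle, e=end) plus the POS of its word.
    """
    out = []
    for word, pos in tagged_words:
        if len(word) == 1:
            out.append((word, f"s-{pos}"))
        else:
            out.append((word[0], f"b-{pos}"))
            for ch in word[1:-1]:
                out.append((ch, f"m-{pos}"))
            out.append((word[-1], f"e-{pos}"))
    return out

# Placeholder words mirroring the slide's (I)(like)(reading) example:
result = to_char_tags([("A", "PN"), ("BC", "V"), ("DE", "N")])
assert [tag for _, tag in result] == ["s-PN", "b-V", "e-V", "b-N", "e-N"]
```

Under this mapping, a standard sequence labeler over characters performs segmentation and tagging in one pass, which is exactly what makes beam search applicable.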

SLIDE 7

Existing solutions

Shi and Wang (2007)

  • The model:

takes the N-best outputs from the word segmentor and passes them to a separate POS-tagger, ranking candidates by the overall probability score from the segmentor and tagger.

  • Decoding:

A* for word segmentation and dynamic programming for tagging.

  • Training: conditional random field for sequence labeling.

SLIDE 8

Existing solutions

A potential disadvantage of both models above is the restriction on the interaction between segmentation and POS information.

  • For the character-based method, whole-word information is not explicitly associated with POS tags.

  • For the reranking method, interaction is limited to the N-best output list from the word segmentor.

SLIDE 9

Our proposed model

The motivation is to impose no restriction on the interaction between word and POS information during processing.

  • The model: a linear model with both word segmentation and POS-tagging features.

  • Decoding: a multiple-beam search algorithm.
  • Training: the generalized perceptron.

SLIDE 10

The baseline

  • Word segmentor from our previous research (Zhang and Clark, 2007)
  • The perceptron POS-tagger from Collins (2002)

SLIDE 11

The baseline word segmentor

  • Linear model trained by the generalized perceptron
  • Features are extracted from a word bigram context
  • Encompass both word and character information
  • Standard beam search decoder

SLIDE 12

Features from the baseline segmentor

 1  word w
 2  word bigram w1 w2
 3  single-character word w
 4  a word of length l with starting character c
 5  a word of length l with ending character c
 6  space-separated characters c1 and c2
 7  character bigram c1 c2 in any word
 8  the first / last characters c1 / c2 of any word
 9  word w immediately before character c
10  character c immediately before word w
11  the starting characters c1 and c2 of two consecutive words
12  the ending characters c1 and c2 of two consecutive words
13  a word of length l with previous word w
14  a word of length l with next word w
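A few of these templates can be sketched as feature-string extraction over a word bigram. The naming scheme below is illustrative, not the paper's internal encoding:

```python
def bigram_features(w1, w2):
    """Extract a few of the segmentor templates above for word w2
    in the context of previous word w1 (illustrative names only)."""
    feats = [
        f"word={w2}",                       # template 1
        f"bigram={w1}|{w2}",                # template 2
        f"len,start={len(w2)},{w2[0]}",     # template 4
        f"len,end={len(w2)},{w2[-1]}",      # template 5
        f"starts={w1[0]},{w2[0]}",          # template 11
        f"ends={w1[-1]},{w2[-1]}",          # template 12
    ]
    if len(w2) == 1:
        feats.append(f"single_char={w2}")   # template 3
    return feats

assert "bigram=like|reading" in bigram_features("like", "reading")
```

Each candidate's global feature vector is then the sum of such local features over all of its word bigrams.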

SLIDE 13

The baseline POS-tagger

  • Linear model trained by the generalized perceptron
  • Features redefined for Chinese, including tag trigrams
  • Standard beam search decoder

SLIDE 14

Features from the baseline POS-tagger

1, 2, 3  tag t with word w; tag bigram t1 t2; tag trigram t1 t2 t3
 4  tag t followed by word w
 5  word w followed by tag t
 6  word w with tag t and previous character c
 7  word w with tag t and next character c
 8  tag t on single-character word w in character trigram c1 w c2
 9  tag t on a word starting with character c
10  tag t on a word ending with character c
11  tag t on a word containing character c in the middle
12  tag t on a word starting with character c0 and containing character c
13  tag t on a word ending with character c0 and containing character c
14  tag t on a word containing repeated character cc
15  tag t on a word starting with character category g
16  tag t on a word ending with character category g

SLIDE 15

The joint segmentor and POS-tagger

  • Linear model trained by the generalized perceptron
  • Features are the union of baseline segmentor and tagger features
  • Multiple beam search decoder

SLIDE 16

The joint segmentor and POS-tagger

  • Formulation of the joint segmentation and tagging problem

Given an input sentence x, the output F(x) satisfies:

    F(x) = arg max_{y ∈ GEN(x)} Score(y)

  • The model (denoting the global feature vector for y with Φ(y)):

    Score(y) = Φ(y) · w
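With sparse feature vectors, scoring a candidate reduces to a dot product. A minimal sketch (illustrative names, not the paper's code):

```python
from collections import Counter

def score(phi, w):
    """Score(y) = Φ(y) · w with sparse vectors: phi is a Counter of
    feature counts for candidate y, w a dict of feature weights."""
    return sum(w.get(feat, 0.0) * count for feat, count in phi.items())

phi = Counter({"word=like": 1, "tag=V": 2})
w = {"word=like": 0.5, "tag=V": 1.5}
assert score(phi, w) == 3.5  # 0.5*1 + 1.5*2
```

Features absent from w contribute zero, so the weight vector only needs entries for features seen during training.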

SLIDE 17

The joint segmentor and POS-tagger

Inputs: training examples (xi, yi)
Initialization: set w = 0
Algorithm:
  for t = 1..T, i = 1..N:
    calculate zi = arg max_{y ∈ GEN(xi)} Φ(y) · w
    if zi ≠ yi:
      w = w + Φ(yi) − Φ(zi)
Outputs: w
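As a concrete toy instance of the generalized perceptron above, the sketch below trains on tag sequences over a fixed segmentation, with GEN enumerating all candidates exhaustively. The tag set and feature templates are illustrative, not the paper's:

```python
from collections import Counter
from itertools import product

TAGS = ["N", "V"]  # toy tag set

def gen(words):
    """GEN(x): all tagged candidates for a fixed word sequence."""
    return [list(zip(words, t)) for t in product(TAGS, repeat=len(words))]

def phi(candidate):
    """Φ(y): word-tag and tag-bigram feature counts."""
    feats = Counter()
    prev = "<s>"
    for word, tag in candidate:
        feats[f"w={word},t={tag}"] += 1
        feats[f"tt={prev},{tag}"] += 1
        prev = tag
    return feats

def train(examples, iterations=5):
    w = Counter()
    for _ in range(iterations):
        for x, y in examples:
            z = max(gen(x),
                    key=lambda c: sum(w[f] * v for f, v in phi(c).items()))
            if z != y:               # update only on mistakes
                w.update(phi(y))     # w += Φ(yi)
                w.subtract(phi(z))   # w -= Φ(zi)
    return w

examples = [(["dogs", "bark"], [("dogs", "N"), ("bark", "V")])]
w = train(examples)
best = max(gen(["dogs", "bark"]),
           key=lambda c: sum(w[f] * v for f, v in phi(c).items()))
assert best == examples[0][1]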

SLIDE 18

The joint segmentor and POS-tagger

  • Decoding algorithm is one of the biggest challenges.

– Exact inference would be very slow even with dynamic programming
– The standard beam search gave inferior accuracy

  • A multiple beam search decoding algorithm

– An agenda is given to each character in the input sentence, recording the best segmented and POS-tagged candidates ending with that character
– The input sentence is processed incrementally, character by character
– When each character is processed, all possible words ending with that character are considered, each being combined with the partial candidates ending at the previous character to form new partial candidates
– The system returns the best item from the last agenda
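The agenda scheme above can be sketched as follows. The lexicon, tag set, and scoring function are toy placeholders standing in for the paper's feature-based model:

```python
def decode(chars, lexicon, tags, score, beam=4):
    """Multiple-beam search: agendas[i] holds the best partial
    candidates covering chars[:i]; toy stand-in for the paper's
    decoder."""
    agendas = [[] for _ in range(len(chars) + 1)]
    agendas[0] = [[]]                        # empty partial candidate
    for end in range(1, len(chars) + 1):
        items = []
        for start in range(end):
            word = chars[start:end]          # word ending at this char
            if word not in lexicon:
                continue
            for prev in agendas[start]:      # combine with earlier agenda
                for tag in tags:
                    items.append(prev + [(word, tag)])
        items.sort(key=score, reverse=True)
        agendas[end] = items[:beam]          # prune to the beam
    return agendas[-1][0] if agendas[-1] else None

# Toy run: the score prefers candidates with fewer (longer) words.
lexicon = {"a", "b", "ab", "c", "abc"}
best = decode("abc", lexicon, ["T1", "T2"], score=lambda cand: -len(cand))
assert best == [("abc", "T1")]
```

In the real system the score is Φ(y) · w over the union of segmentor and tagger features, so word-level and tag-level information compete within every agenda.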

SLIDES 19–25

The joint segmentor and POS-tagger: decoding example

[Animation over the input characters A B C D E, processed left to right. After A, its agenda holds A/T1 and A/T2. After B: AB/T1, AB/T2, A/T1 B/T1, A/T2 B/T2. After C, candidates such as ABC/T1, ABC/T2, A/T1 BC/T1, A/T2 BC/T2, AB/T1 C/T1 and A/T2 B/T2 C/T1 appear, and so on through D and E, with each agenda pruned to the best candidates in the beam.]
SLIDE 26

Optimization techniques

  • The tag dictionary

– Frequent words
– Closed-set tags

  • The maximum word length record for each tag
  • Only the best is stored among candidates in the same context.
  • All the above information is updated online
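The tag-dictionary idea can be sketched as follows; the frequency threshold and class shape are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

class TagDictionary:
    """For frequently seen words, only tags observed with that word
    are tried during decoding; rare or unseen words get the full
    tag set. Counts are updated online during training."""

    def __init__(self, threshold=2):   # threshold is an assumption
        self.threshold = threshold
        self.word_counts = defaultdict(int)
        self.word_tags = defaultdict(set)

    def update(self, word, tag):
        self.word_counts[word] += 1
        self.word_tags[word].add(tag)

    def candidate_tags(self, word, all_tags):
        if self.word_counts[word] >= self.threshold:
            return self.word_tags[word]
        return set(all_tags)

d = TagDictionary(threshold=2)
for _ in range(3):
    d.update("the-word", "N")
assert d.candidate_tags("the-word", ["N", "V"]) == {"N"}
assert d.candidate_tags("unseen", ["N", "V"]) == {"N", "V"}
```

Pruning candidate tags this way shrinks each agenda's expansion step, which is consistent with the decoding speed-up reported on the next slide.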

SLIDE 27

Experiments

  • The experimental data: Chinese Treebank 4
  • Test set: 10-fold cross-validation on Chinese Treebank 3
  • Development set: the rest of the data, used to determine the number of training iterations, analyse the influence of various factors, and plot the distribution of typical errors

SLIDE 28

The learning curves

[Two plots: F-score against the number of training iterations (1–10); F-score ranges 0.88–0.92 and 0.86–0.90]

SLIDE 29

The learning curves

[Plot: F-score (0.80–0.92) against the number of training iterations (1–10), showing segmentation accuracy and overall accuracy]

SLIDE 30

The influence of the tag dictionary

[Plot: accuracy (0.80–0.92) against the number of training iterations (1–10), comparing segmentation (without tag dictionary), tagging (without tag dictionary), segmentation (with tag dictionary), and overall tagging (with tag dictionary)]

Decoding time: 416 sec vs. 256 sec.

SLIDE 31

Analysis of typical errors in the joint tagging system

Tag    Seg     NN      NR      VV      AD      JJ      CD
NN     20.47   –       0.78    4.80    0.67    2.49    0.04
NR     5.95    3.61    –       0.19    0.04    0.07
VV     12.13   6.51    0.11    –       0.93    0.56    0.04
AD     3.24    0.30    0.71    –       0.33    0.22
JJ     3.09    0.93    0.15    0.26    0.26    –       0.04
CD     1.08    0.04    0.07    –

Segmentation errors: 51.47%

SLIDE 32

Comparison with the baseline

       Baseline               Joint
#      SF     TF     TA       SF     TF     TA
1      96.98  92.91  94.14    97.21  93.46  94.66
2      97.16  93.20  94.34    97.62  93.85  94.79
3      95.02  89.53  91.28    95.94  90.86  92.38
4      95.51  90.84  92.55    95.92  91.60  93.31
5      95.49  90.91  92.57    96.06  91.72  93.25
6      93.50  87.33  89.87    94.56  88.83  91.14
7      94.48  89.44  91.61    95.30  90.51  92.41
8      93.58  88.41  90.93    95.12  90.30  92.32
9      93.92  89.15  91.35    94.79  90.33  92.45
10     96.31  91.58  93.01    96.45  91.96  93.45
Av.    95.20  90.33  92.17    95.90  91.34  93.02
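The Av. row can be checked against the per-fold numbers; here for the segmentation F-scores (SF) of the baseline and joint systems:

```python
# Per-fold segmentation F-scores (SF) copied from the table above.
baseline_sf = [96.98, 97.16, 95.02, 95.51, 95.49,
               93.50, 94.48, 93.58, 93.92, 96.31]
joint_sf = [97.21, 97.62, 95.94, 95.92, 96.06,
            94.56, 95.30, 95.12, 94.79, 96.45]

def mean(xs):
    return sum(xs) / len(xs)

# The means agree with the reported Av. row to rounding precision.
assert abs(mean(baseline_sf) - 95.20) < 0.01
assert abs(mean(joint_sf) - 95.90) < 0.01
```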

SLIDE 33

Comparison with existing models

Model               SF      TF      TA
Baseline+ (Ng)      95.1    –       91.7
Joint+ (Ng)         95.2    –       91.9
Baseline+* (Shi)    95.85   91.67   –
Joint+* (Shi)       96.05   91.86   –
Baseline (ours)     95.20   90.33   92.17
Joint (ours)        95.90   91.34   93.02

+ knowledge about special characters; * knowledge from a semantic net outside CTB.

SLIDE 34

Conclusions and future work

  • Joint word segmentation and POS-tagging using a single linear model
  • The generalized perceptron training algorithm and a multiple-beam decoder
  • It is worth studying the loss from the beam and, with a properly defined range of features, experimenting with exact inference
  • There may be additional features that can improve joint accuracy; exploring such open features is left to future work
