MITSUBISHI ELECTRIC RESEARCH LABORATORIES
CAMBRIDGE RESEARCH CENTER

Deterministic Part-Of-Speech Tagging with Finite State Transducers

Emmanuel Roche and Yves Schabes
Mitsubishi Electric Research Laboratories
201 Broadway, Cambridge, MA 02139
e-mail: roche@merl.com and schabes@merl.com

TR-94-07. Version 3.0, March 1995

Abstract

Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved with a rule-based tagger by inferring rules from a training corpus. However, current implementations of the rule-based tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger inspired by the rule-based tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in a deterministic finite-state machine. This result is achieved by encoding the application of the rules found in the tagger as a non-deterministic finite-state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We then generalize the techniques to the class of transformation-based systems.

Published in Computational Linguistics, June 1995, 21(2), 227-253.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories of Cambridge, Massachusetts; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, 1995
201 Broadway, Cambridge, Massachusetts 02139
Revision history:

1. Version 1.0, May 2nd 1994.
2. Version 1.1, June 16th 1994.
3. Version 1.2, June 22nd 1994.
4. Version 1.3, July 27th 1994.
5. Version 1.4, July 1994.
6. Version 2.0, December 9th 1994.
7. This version is Revision 3.0 of Date: 95/03.
1 Introduction

Finite-state devices have important applications to many areas of computer science, including pattern matching, databases and compiler technology. Although their linguistic adequacy to natural language processing has been questioned in the past (Chomsky, 1964), there has recently been a dramatic renewal of interest in the application of finite-state devices to several aspects of natural language processing. This renewal of interest is due to the speed and the compactness of finite-state representations. This efficiency is explained by two properties: finite-state devices can be made deterministic, and they can be turned into a minimal form. Such representations have been successfully applied to different aspects of natural language processing, such as morphological analysis and generation (Karttunen, Kaplan, and Zaenen, 1992; Clemenceau, 1993), parsing (Roche, 1993; Tapanainen and Voutilainen, 1993), phonology (Laporte, 1993; Kaplan and Kay, 1994) and speech recognition (Pereira, Riley, and Sproat, 1994). Although finite-state machines have been used for part-of-speech tagging (Tapanainen and Voutilainen, 1993; Silberztein, 1993), none of these approaches has the same flexibility as stochastic techniques. Unlike stochastic approaches to part-of-speech tagging (Church, 1988; Kupiec, 1992; Cutting et al., 1992; Merialdo, 1990; DeRose, 1988; Weischedel et al., 1993), up to now the knowledge found in finite-state taggers has been handcrafted and cannot be automatically acquired.

Recently, Brill (1992) described a rule-based tagger which performs as well as taggers based upon probabilistic models and which overcomes the limitations common in rule-based approaches to language processing: it is robust and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, current implementations of Brill's tagger are considerably slower than the ones based on probabilistic models, since it may require RCn elementary steps to tag an input of n words with R rules requiring at most C tokens of context.

Although the speed of current part-of-speech taggers is acceptable for interactive systems where a sentence at a time is being processed, it is not adequate for applications where large bodies of text need to be tagged, such as in information retrieval, indexing applications and grammar checking systems. Furthermore, the space required for part-of-speech taggers is also an issue in commercial personal computer applications such as grammar checking systems. In addition, part-of-speech taggers are often being coupled with a syntactic analysis module. Usually these two modules are written in different frameworks, making it very difficult to integrate interactions between the two modules.

In this paper, we design a tagger that requires n steps to tag a sentence of length n, independent of the number of rules and the length of the context they require. The tagger is represented by a finite-state transducer, a framework which can also be the basis for syntactic analysis. This finite-state tagger will also be found useful combined with other language components since it can be naturally extended by composing it with finite-state transducers which could encode other aspects of natural language syntax.

Relying on algorithms and formal characterization described in later sections, we explain how each rule in Brill's tagger can be viewed as a non-deterministic finite-state transducer. We also show how the application of all rules in Brill's tagger is achieved by composing each of these non-deterministic transducers and why non-determinism arises in this transducer. We then prove the correctness of the general algorithm for determinizing (whenever possible) finite-state transducers and we successfully apply this algorithm to the previously obtained non-deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. We also show how the lexicon used by the tagger can be optimally encoded using a finite-state machine.

The techniques used for the construction of the finite-state tagger are then formalized and mathematically proven correct. We introduce a proof of soundness and completeness with a worst-case complexity analysis for an algorithm for determinizing finite-state transducers. We conclude by proving how the method can be applied to the class of transformation-based error-driven systems.

2 Overview of Brill's Tagger

Brill's tagger is comprised of three parts, each of which is inferred from a training corpus: a lexical tagger, an unknown word tagger and a contextual tagger. For purposes of exposition, we will postpone the discussion of the
unknown word tagger and focus mainly on the contextual rule tagger, which is the core of the tagger.

The lexical tagger initially tags each word with its most likely tag, estimated by examining a large tagged corpus, without regard to context. For example, assuming that vbn is the most likely tag for the word "killed" and vbd for "shot", the lexical tagger might assign the following part-of-speech tags:[1]

(1) Chapman/np killed/vbn John/np Lennon/np
(2) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(3) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np

Since the lexical tagger does not use any contextual information, many words can be tagged incorrectly. For example, in (1), the word "killed" is erroneously tagged as a verb in past participle form, and in (2), "shot" is incorrectly tagged as a verb in past tense.

Given the initial tagging obtained by the lexical tagger, the contextual tagger applies a sequence of rules in order and attempts to remedy the errors made by the initial tagging. For example, the rules in Figure 1 might be found in a contextual tagger.

1. vbn vbd PREVTAG np
2. vbd vbn NEXTTAG by

Figure 1: Sample rules

The first rule says to change tag vbn to vbd if the previous tag is np. The second rule says to change vbd to tag vbn if the next tag is by. Once the first rule is applied, the tag for "killed" in (1) and (3) is changed from vbn to vbd, and the following tagged sentences are obtained:

(4) Chapman/np killed/vbd John/np Lennon/np
(5) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(6) He/pps witnessed/vbd Lennon/np killed/vbd by/by Chapman/np

And once the second rule is applied, the tag for "shot" in (5) is changed from vbd to vbn, resulting in (8), and the tag for "killed" in (6) is changed back from vbd to vbn, resulting in (9):

(7) Chapman/np killed/vbd John/np Lennon/np
(8) John/np Lennon/np was/bedz shot/vbn by/by Chapman/np
(9) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np

It is relevant to our following discussion to note that the application of the NEXTTAG rule must look ahead one token in the sentence before it can be applied, and that the application of two rules may perform a series of operations resulting in no net change. As we will see in the next section, these two aspects are the source of local non-determinism in Brill's tagger.

The sequence of contextual rules is automatically inferred from a training corpus. A list of tagging errors (with their counts) is compiled by comparing the output of the lexical tagger to the correct part-of-speech assignment. Then, for each error, it is determined which instantiation of a set of rule templates results in the greatest error reduction. Then the set of new errors caused by applying the rule is computed and the process is repeated until the error reduction drops below a given threshold.

After training on the Brown Corpus, using the set of contextual rule templates shown in Figure 2, 280 contextual rules are obtained. The resulting rule-based tagger performs as well as the state-of-the-art taggers based upon probabilistic models. It also overcomes the limitations common in rule-based approaches to language processing: it is robust, and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, as we will see in the next section, Brill's tagger is inherently slow.

[1] The notation for part-of-speech tags is adapted from the one used in the Brown Corpus (Francis and Kucera, 1982): pps stands for third singular nominative pronoun, vbd for verb in past tense, np for proper noun, vbn for verb in past participle form, by for the word "by", at for determiner, nn for singular noun and bedz for the word "was".

3 Complexity of Brill's Tagger

Once the lexical assignment is performed, in Brill's algorithm each contextual rule acquired during the training phase is applied to each sentence to be
tagged.

A B PREVTAG C          change A to B if previous tag is C
A B PREV1OR2OR3TAG C   change A to B if previous one or two or three tag is C
A B PREV1OR2TAG C      change A to B if previous one or two tag is C
A B NEXT1OR2TAG C      change A to B if next one or two tag is C
A B NEXTTAG C          change A to B if next tag is C
A B SURROUNDTAG C D    change A to B if surrounding tags are C and D
A B NEXTBIGRAM C D     change A to B if next bigram tag is C D
A B PREVBIGRAM C D     change A to B if previous bigram tag is C D

Figure 2: Contextual Rule Templates

For each individual rule, the algorithm scans the input from left to right while attempting to match the rule. This simple algorithm is computationally inefficient for two reasons.

The first reason for inefficiency is the fact that an individual rule is matched at each token of the input, regardless of the fact that some of the current tokens may have been previously examined when matching the same rule at a previous position. The algorithm treats each rule as a template of tags and slides it along the input, one word at a time. Consider, for example, the rule A B PREVBIGRAM C C that changes tag A to tag B if the previous two tags are C.

[Figure 3 shows the pattern C C A being slid along the input C D C C A in three steps, with the matching and mismatching positions marked at each step.]

Figure 3: Partial matches of A B PREVBIGRAM C C on the input C D C C A.

When applied to the input C D C C A, the pattern C C A is matched three times, as shown in Figure 3. At each step no record of previous partial matches or mismatches is remembered. In this example, C is compared with the second input token D during the first and second steps, and therefore, the
second step could have been skipped by remembering the comparisons from the first step. This method is similar to a naive pattern matching algorithm.

The second reason for inefficiency is the potential interaction between rules. For example, when the rules in Figure 1 are applied to sentence (3), the first rule results in a change (6) which is undone by the second rule as shown in (9). The algorithm may therefore perform unnecessary computation.

In summary, Brill's algorithm for implementing the contextual tagger may require RCn elementary steps to tag an input of n words with R contextual rules requiring at most C tokens of context.

4 Construction of the Finite-State Tagger

We show how the function represented by each contextual rule can be represented as a non-deterministic finite-state transducer, and how the sequential application of each contextual rule also corresponds to a non-deterministic finite-state transducer being the result of the composition of each individual transducer. We will then turn the non-deterministic transducer into a deterministic transducer. The resulting part-of-speech tagger operates in linear time, independent of the number of rules and the length of the context. The new tagger operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in the resulting deterministic finite-state machine.

Our work relies on two central notions: the notion of a finite-state transducer and the notion of a subsequential transducer. Informally speaking, a finite-state transducer is a finite-state automaton whose transitions are labeled by pairs of symbols. The first symbol is the input and the second is the output. Applying a finite-state transducer to an input consists of following a path according to the input symbols while storing the output symbols, the result being the sequence of output symbols stored. Section 8.1 formally defines the notion of transducer.

Finite-state transducers can be composed, intersected, merged with the union operation and sometimes determinized. Basically, one can manipulate finite-state transducers as easily as finite-state automata. However, whereas every finite-state automaton is equivalent to some deterministic finite-state automaton, there are finite-state transducers that are not equivalent to any deterministic finite-state transducer. Transductions that can be computed by
some deterministic finite-state transducer are called subsequential functions. We will see that the final step of the compilation of our tagger consists of transforming a finite-state transducer into an equivalent subsequential transducer.

We will use the following notation when pictorially describing a finite-state transducer: final states are depicted with two concentric circles; ε represents the empty string; on a transition from state i to state j, a/b indicates a transition on input symbol a and output symbol(s) b;[2] a question mark (?) on an arc transition (for example labeled ?/b) originating at state i stands for any input symbol that does not appear as an input symbol on any other outgoing arc from i. In this document, each depicted finite-state transducer will be assumed to have a single initial state, namely the leftmost state appearing in the figures (usually labeled 0).

We are now ready to construct the tagger. Given a set of rules, the tagger is constructed in four steps.

The first step consists of turning each contextual rule found in Brill's tagger into a finite-state transducer. Following the example discussed in Section 2, the functionality of the rule vbn vbd PREVTAG np is represented by the transducer shown in Figure 4 on the left.

[Figure 4 shows two transducers built from the arcs np/np, vbn/vbd and ?/?.]

Figure 4: Left: transducer T1 representing the contextual rule vbn vbd PREVTAG np. Right: local extension LocExt(T1) of T1.

[2] When multiple output symbols are emitted, a comma symbolizes the concatenation of the output symbols.
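For concreteness, the effect that the rule vbn vbd PREVTAG np has when applied at every position of a tag sequence (the behavior captured by the local extension of T1 on the right of Figure 4) can be sketched directly. The function name below is our own illustrative choice, not the paper's.

```python
# Sketch of the rule "vbn vbd PREVTAG np" applied over a whole tag sequence:
# change tag a to tag b wherever the previous tag is prev.
def apply_prevtag_rule(tags, a="vbn", b="vbd", prev="np"):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

# Tags of sentence (1): "Chapman killed John Lennon"
print(apply_prevtag_rule(["np", "vbn", "np", "np"]))
# → ['np', 'vbd', 'np', 'np'], the tagging shown in (4)
```

Note that this direct formulation rescans the context at every position, which is exactly the inefficiency that the transducer construction is designed to remove.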
Each of the contextual rules is defined locally; that is, the transformation it describes must be applied at each position of the input sequence. For instance, the rule A B PREV1OR2TAG C, that changes A into B if the previous tag or the one before is C, must be applied twice on C A A (resulting in the output C B B). As we have seen in the previous section, this method is not efficient.

The second step consists of turning the transducers produced by the preceding step into transducers that operate globally on the input in one pass. This transformation is performed for each transducer associated with each rule. Given a function f1 that transforms, say, a into b (i.e. f1(a) = b), we want to extend it to a function f2 such that f2(w) = w′ where w′ is the word built from the word w where each occurrence of a has been replaced by b. We say that f2 is the local extension[3] of f1 and we write f2 = LocExt(f1). Section 8.2 formally defines this notion and gives an algorithm for computing the local extension.

[3] This notion was introduced by Roche (1993).

Referring to the example of Section 2, the local extension of the transducer for the rule vbn vbd PREVTAG np is shown to the right of Figure 4. Similarly, the transducer for the contextual rule vbd vbn NEXTTAG by and its local extension are shown in Figure 5.

[Figure 5 shows two transducers built from the arcs vbd/vbn, by/by, vbd/vbd and ?/?.]

Figure 5: Left: transducer T2 representing vbd vbn NEXTTAG by. Right: local extension LocExt(T2) of T2.

The transducers obtained in the previous step still need to be applied one after the other. The third step combines all transducers into one single
transducer. This corresponds to the formal operation of composition defined on transducers. The formalization of this notion and an algorithm for computing the composed transducer are well known and were originally described by Elgot and Mezei (1965).

Returning to our running example of Section 2, the transducer obtained by composing the local extension of T2 (right in Figure 5) with the local extension of T1 (right in Figure 4) is shown in Figure 6.

[Figure 6 shows the composed transducer, built from the arcs np/np, vbn/vbn, vbn/vbd, vbd/vbn, vbd/vbd, by/by and ?/? over states 0-4.]

Figure 6: Composition T3 = LocExt(T1) ∘ LocExt(T2)

The fourth and final step consists of transforming the finite-state transducer obtained in the previous step into an equivalent subsequential (deterministic) transducer. The transducer obtained in the previous step may contain some non-determinism. The fourth step tries to turn it into a deterministic machine. This determinization is not always possible for any given finite-state transducer. For example, the transducer shown in Figure 7 is not equivalent to any subsequential transducer. Intuitively speaking, such a transducer has to look ahead an unbounded distance in order to correctly generate the output. This intuition will be formalized in Section 9.2.

However, as proven in Section 10, the rules inferred in Brill's tagger can always be turned into a deterministic machine. Section 9.1 describes an algorithm for determinizing finite-state transducers. This algorithm will not terminate when applied to transducers representing non-subsequential functions.
[Figure 7 shows a transducer over the arcs a:a, a:b, b:b and c:c among states 0-3.]

Figure 7: Example of a transducer not equivalent to any subsequential transducer.

In our running example, the transducer in Figure 6 has some non-deterministic paths. For example, from state 0 on input symbol vbd, two emissions are possible: vbn (from 0 to 2) and vbd (from 0 to 3). This non-determinism is due to the rule vbd vbn NEXTTAG by, since this rule has to read the second symbol before it can know which symbol must be emitted. The deterministic version of the transducer T3 is shown in Figure 8. Whenever non-determinism arises in T3, the deterministic machine emits the empty symbol ε and postpones the emission of the output symbol. For example, from the start state 0, the empty string is emitted on input vbd, while the current state is set to 2. If the following word is by, the two-token string vbn by is emitted (from 2 to 0), otherwise vbd is emitted (depending on the input, from 2 to 2 or from 2 to 0).

Using an appropriate implementation for finite-state transducers (see Section 11), the resulting part-of-speech tagger operates in linear time, independent of the number of rules and the length of the context. The new tagger therefore operates in optimal time.

We have shown how the contextual rules can be implemented very efficiently. We now turn our attention to lexical assignment, the step that precedes the application of the contextual transducer. This step can also be made very efficient.
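The delayed-emission behavior behind Figure 8 can be sketched as a small deterministic machine. The state names and transition table below are a simplification we introduce for illustration, not the paper's exact machine: state 0 is neutral, state 1 records that the previous tag was np, and state 2 records that a vbd tag is pending and must not be emitted until the next tag is known.

```python
# Sketch of delayed emission: on vbd (or on vbn right after np, which rule 1
# turns into vbd) nothing is emitted, because rule 2 would turn that vbd back
# into vbn if the next tag is by.
DELTA = {
    (0, "vbd"): ("", 2),         # postpone the emission
    (1, "vbd"): ("", 2),
    (1, "vbn"): ("", 2),         # rule 1 fires, but rule 2 might undo it
    (2, "vbd"): ("vbd", 2),      # pending tag resolves as vbd; new one pends
    (2, "by"):  ("vbn by", 0),   # rule 2 fires on the pending tag
}

def step(state, tok):
    if (state, tok) in DELTA:
        return DELTA[(state, tok)]
    pend = "vbd " if state == 2 else ""       # flush a pending tag unchanged
    return pend + tok, (1 if tok == "np" else 0)

def tag(tags):
    state, out = 0, []
    for tok in tags:
        emit, state = step(state, tok)
        out.extend(emit.split())
    if state == 2:
        out.append("vbd")                     # final emission at sentence end
    return out

# Tags of sentence (3): "He witnessed Lennon killed by Chapman"
print(tag(["pps", "vbd", "np", "vbn", "by", "np"]))
# → ['pps', 'vbd', 'np', 'vbn', 'by', 'np'], the tagging shown in (9)
```

Each input tag is consumed exactly once, so the whole sentence is tagged in a single left-to-right pass, which is the linear-time property claimed above.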
[Figure 8 shows the subsequential transducer, with delayed emissions such as vbd/ε and vbn/ε and arcs such as np/vbd,np, ?/vbd,? and by/vbn,by.]

Figure 8: Subsequential form for T3

5 Lexical Tagger

The first step of the tagging process consists of looking up each word in a dictionary. Since the dictionary is the largest part of the tagger in terms of space, a compact representation is crucial. Moreover, the lookup process has to be very fast too, otherwise the improvement in speed of the contextual manipulations would be of little practical interest. To achieve high speed for this procedure, the dictionary is represented by a deterministic finite-state automaton with both low access time and small storage space.

Suppose one wants to encode the sample dictionary of Figure 9. The algorithm, as described in (Revuz, 1991), consists of first building a tree whose branches are labeled by letters and whose leaves are labeled by a list of tags (such as nn vb), and then minimizing it into a directed acyclic graph (DAG). The result of applying this procedure to the sample dictionary of Figure 9 is the DAG of Figure 10. When a dictionary is represented as a DAG, looking up a word in it consists simply of following one path in the DAG. The complexity of the lookup procedure depends only on the length of the word; in particular, it is independent of the size of the
dictionary.

ads     nns
bag     nn vb
bagged  vbn vbd
bayed   vbn vbd
bids    nns

Figure 9: Sample Dictionary

[Figure 10 shows the DAG, in which the words of Figure 9 share letter arcs and lead to leaves labeled (nns), (nn,vb) and (vbd,vbn).]

Figure 10: DAG representation of the dictionary found in Figure 9.

The lexicon used in our system encodes 54,000 words. The corresponding DAG takes 360 Kbytes of space and it provides an access time of 12,000 words per second.[4]

[4] The size of the dictionary in ASCII form is 742KB.

6 Tagging unknown words

The rule-based system described by Brill (1992) contains a module that operates after all the known words, that is, words listed in the dictionary, have been tagged with their most frequent tag, and before the set of contextual rules are applied. This module guesses a tag for a word according to its suffix (e.g. a word with an "ing" suffix is likely to be a verb), its prefix (e.g. a word starting with an uppercase character is likely to be a proper noun) and other relevant properties. This module basically follows the same techniques as the ones used to implement the lexicon. Due to the similarity of the methods used, we do not provide further details about this module.
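The tree-building half of the construction above can be sketched as follows; minimizing the tree into a DAG (the Revuz 1991 step) is omitted here, but a plain trie already gives the key property of Section 5: lookup cost proportional to the length of the word, independent of dictionary size.

```python
# Build a letter trie for the sample dictionary of Figure 9; leaves carry the
# list of tags.  (DAG minimization, which shares common suffixes, is omitted.)
LEXICON = {
    "ads": ["nns"],
    "bag": ["nn", "vb"],
    "bagged": ["vbn", "vbd"],
    "bayed": ["vbn", "vbd"],
    "bids": ["nns"],
}

def build_trie(lexicon):
    root = {}
    for word, tags in lexicon.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$tags"] = tags          # leaf label, as in Figure 9
    return root

def lookup(trie, word):
    node = trie
    for ch in word:                   # follow one path, one arc per letter
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$tags")

trie = build_trie(LEXICON)
print(lookup(trie, "bagged"))   # ['vbn', 'vbd']
print(lookup(trie, "bad"))      # None
```

The `$tags` key is just a sentinel of this sketch; a real implementation would store the automaton in a compact array form to reach the space figures quoted above.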
7 Empirical Evaluation

The tagger we constructed has an accuracy identical[5] to that of Brill's tagger or of statistical-based methods; however, it runs at a much higher speed. The tagger runs nearly ten times faster than the fastest of the other systems. Moreover, the finite-state tagger inherits from the rule-based system its compactness compared to a stochastic tagger. In fact, whereas stochastic taggers have to store word-tag, bigram and trigram probabilities, the rule-based tagger, and therefore the finite-state one, only have to encode a small number of rules (between 200 and 300).

We empirically compared our tagger with Eric Brill's implementation of his tagger, and with our implementation of a trigram tagger adapted from the work of Church (1988) that we previously implemented for another purpose. We ran the three programs on large files and piped their output into a file. In the times reported, we included the time spent reading the input and writing the output. Figure 11 summarizes the results. All taggers were trained on a portion of the Brown corpus. The experiments were run on an HP720 with 32 Mbytes of memory. In order to conduct a fair comparison, the dictionary lookup part of the stochastic tagger has also been implemented using the techniques described in Section 5.

All three taggers have approximately the same precision (95% of the tags are correct).[6] By design, the finite-state tagger produces the same output as the rule-based tagger. The rule-based tagger, and the finite-state tagger, do not always produce the exact same tagging as the stochastic tagger (they don't make the same errors); however, no significant difference in performance between the systems was detected.[7] Independently, Cutting et al. (1992) quote a performance of 800 words/second for their part-of-speech tagger based on hidden Markov models.

The space required by the finite-state tagger (815KB) is decomposed as follows: 363KB for the dictionary, 440KB for the subsequential transducer and 12KB for the module for unknown words.

[5] Our current implementation is functionally equivalent to the tagger as described by Brill (1992). However, the tagger could be extended to include recent improvements described in more recent papers (Brill, 1994).
[6] For evaluation purposes, we randomly selected 90% of the Brown corpus for training purposes and 10% for testing. We used the Brown corpus set of part-of-speech tags.
[7] An extended discussion of the precision of the rule-based tagger can be found in (Brill, 1992).
         Stochastic Tagger   Rule-Based Tagger   Finite-State Tagger
Speed    1,200 w/s           500 w/s             10,800 w/s
Space    2,158KB             379KB               815KB

Figure 11: Overall performance comparison.

The speed of our system is decomposed in Figure 12.[8]

                      dictionary lookup   unknown words   contextual
Speed                 12,800 w/s          16,600 w/s      125,100 w/s
Percent of the time   85%                 6.5%            8.5%

Figure 12: Speed of the different parts of the program

Our system reaches a performance level in speed for which other very low level factors (such as storage access) may dominate the computation. At such speeds, the time spent reading the input file, breaking the file into sentences and sentences into words, and writing the result into a file is no longer negligible.

[8] In Figure 12, the dictionary lookup includes reading the file, splitting it into sentences, looking up each word in the dictionary and writing the final result to a file. The dictionary lookup and the tagging of unknown words take roughly the same amount of time, but since the second procedure only applies to unknown words (around 10% in our experiments), the percentage of time it takes is much smaller.

8 Finite-State Transducers

The methods used in the construction of the finite-state tagger described in the previous sections were described informally. In the following section, the notions of finite-state transducer and of local extension are defined. We also provide an algorithm for computing the local extension of a finite-state transducer. Issues related to the determinization of finite-state transducers are discussed in the section following this one.
8.1 Definition of Finite-State Transducers

A finite-state transducer T is a 5-tuple (Σ, Q, i, F, E) where: Σ is a finite alphabet; Q is the set of states or vertices; i ∈ Q is the initial state; F ⊆ Q is the set of final states; E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the set of edges or transitions. For instance, Figure 13 is the graphical representation of the transducer

T₁ = ({a, b, c, d, e}, {0, 1, 2, 3}, 0, {3}, {(0, a, b, 1), (0, a, c, 2), (1, d, d, 3), (2, e, e, 3)}).

A finite-state transducer T also defines a function on words in the following way: the extended set of edges Ê, the transitive closure of E, is defined by the following recursive relation:

• if e ∈ E then e ∈ Ê;
• if (q, a, b, q′), (q′, a′, b′, q″) ∈ Ê then (q, aa′, bb′, q″) ∈ Ê.

Then the function f from Σ* to Σ* defined by f(w) = w′ iff ∃q ∈ F such that (i, w, w′, q) ∈ Ê is the function defined by T. One says that T represents f and writes f = |T|. The functions on words that are represented by finite-state transducers are called rational functions. If, for some input w, more than one output is allowed (e.g. f(w) = {w₁, w₂, …}), then f is called a rational transduction. In the example of Figure 13, T₁ is defined by |T₁|(ad) = bd and |T₁|(ae) = ce.

[Figure 13: T₁: Example of a Finite-State Transducer. Transitions: a/b from 0 to 1, a/c from 0 to 2, d/d from 1 to 3, e/e from 2 to 3; state 3 is final.]

Given a finite-state transducer T = (Σ, Q, i, F, E), the following additional notions are useful: its state transition function d, mapping Q × (Σ ∪ {ε}) into 2^Q, defined by d(q, a) = {q′ ∈ Q | ∃w ∈ Σ* and (q, a, w, q′) ∈ E}; and its emission function δ, mapping Q × (Σ ∪ {ε}) × Q into 2^(Σ*), defined by δ(q, a, q′) = {w ∈ Σ* | (q, a, w, q′) ∈ E}.
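These definitions can be made concrete with a short sketch (ours, for illustration only; not the implementation described in this report). It encodes T₁ as its edge set and applies it nondeterministically, collecting the outputs of all paths that consume the input and end in a final state; ε-input transitions are omitted since T₁ has none.

```python
# T1 from the definition above: edges are (source, input, output, target).
T1_EDGES = {(0, 'a', 'b', 1), (0, 'a', 'c', 2),
            (1, 'd', 'd', 3), (2, 'e', 'e', 3)}

def apply_fst(edges, init, finals, word):
    """Return |T|(word): outputs of all paths consuming `word`
    that end in a final state (no ε-input transitions assumed)."""
    configs = {(init, '')}            # (current state, output so far)
    for letter in word:
        configs = {(q2, out + w)
                   for (q, out) in configs
                   for (p, a, w, q2) in edges
                   if p == q and a == letter}
    return {out for (q, out) in configs if q in finals}

print(apply_fst(T1_EDGES, 0, {3}, 'ad'))   # {'bd'}
print(apply_fst(T1_EDGES, 0, {3}, 'ae'))   # {'ce'}
```

Note that a single input letter can lead to several configurations at once (here, reading a leads to both states 1 and 2), which is precisely the nondeterminism the next sections remove.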
A finite-state transducer can be seen as a finite-state automaton, each of whose labels is a pair. In this respect, T₁ would be deterministic; however, since transducers are generally used to compute a function, a more relevant definition of determinism consists of saying that both the transition function d and the emission function δ lead to sets containing at most one element, that is, |d(q, a)| ≤ 1 and |δ(q, a, q′)| ≤ 1. With this notion, if a finite-state transducer is deterministic, one can apply the function to a given word by deterministically following a single path in the transducer. Deterministic transducers are called subsequential transducers (Schützenberger, 1977).⁹ Given a deterministic transducer, we can define the partial functions q ⊗ a = q′ iff d(q, a) = {q′}, and q ∗ a = w iff ∃q′ ∈ Q such that q ⊗ a = q′ and δ(q, a, q′) = {w}. This leads to the definition of subsequential transducers: a subsequential transducer T is a 7-tuple (Σ, Q, i, F, ⊗, ∗, ρ) where: Σ, Q, i, F are defined as above; ⊗ is the deterministic state transition function mapping Q × Σ into Q, one writes q ⊗ a = q′; ∗ is the deterministic emission function mapping Q × Σ into Σ*, one writes q ∗ a = w; and the final emission function ρ maps F into Σ*, one writes ρ(q) = w.

For instance, T₁ is not deterministic because d(0, a) = {1, 2}, but it is equivalent to T₂, represented in Figure 14, in the sense that they represent the same function, i.e. |T₁| = |T₂|. T₂ is defined by T₂ = ({a, b, c, d, e}, {0, 1, 2}, 0, {2}, ⊗, ∗, ρ) where 0 ⊗ a = 1, 0 ∗ a = ε, 1 ⊗ d = 2, 1 ∗ d = bd, 1 ⊗ e = 2, 1 ∗ e = ce, and where ρ(2) = ε.
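By contrast, applying a subsequential transducer is a single deterministic pass. The following sketch (again ours, with T₂'s ⊗, ∗ and ρ hard-coded as dictionaries) follows the unique path and concatenates the emissions:

```python
# Subsequential transducer T2: deterministic transition (⊗), emission (∗),
# and final emission (ρ), as defined above.
TRANS = {(0, 'a'): 1, (1, 'd'): 2, (1, 'e'): 2}          # q ⊗ a
EMIT  = {(0, 'a'): '', (1, 'd'): 'bd', (1, 'e'): 'ce'}   # q ∗ a
RHO   = {2: ''}                                          # ρ(q), q final

def apply_subsequential(word):
    """Follow the unique path for `word`, concatenating emissions."""
    state, out = 0, ''
    for letter in word:
        out += EMIT[(state, letter)]
        state = TRANS[(state, letter)]
    return out + RHO[state]        # defined only if the path ends final

print(apply_subsequential('ad'))   # bd
print(apply_subsequential('ae'))   # ce
```

The output on ad and ae agrees with |T₁|, illustrating |T₁| = |T₂|; the emission of b or c is simply delayed until the second letter disambiguates the path.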

[Figure 14: Subsequential Transducer T₂. Transitions: a/ε from 0 to 1, d/bd from 1 to 2, e/ce from 1 to 2; state 2 is final.]

8.2 Local Extension

In this section, we will see how a function which needs to be applied at all input positions can be transformed into a global function that needs to be applied once on the input. For instance, consider the transducer T₃ of Figure 15.

⁹ A sequential transducer is a deterministic transducer for which all states are final. Sequential transducers are also called generalized sequential machines (Eilenberg, 1974).
T₃ represents the function f₃ = |T₃| such that f₃(ab) = bc and f₃(bca) = dca. We want to build the function that, given a word w, each time w contains ab (i.e. ab is a factor of the word) (resp. bca), transforms this factor into its image bc (resp. dca). Suppose for instance that the input word is w₁ = aabcab, as shown in Figure 16, and that the factors that are in dom(f₃)¹⁰ can be found according to two different factorizations: w₁ = a · w₂ · c · w₂,¹¹ where w₂ = ab, and w₁ = aa · w₃ · b, where w₃ = bca. The local extension of f₃ will be the function that takes each possible factorization, transforms each factor according to f₃, i.e. f₃(w₂) = bc and f₃(w₃) = dca, and leaves the other parts unchanged; here this leads to two outputs: abccbc according to the first factorization, and aadcab according to the second factorization.

[Figure 15: T₃: a finite-state transducer to be extended. One path transduces ab into bc (transitions a/b, b/c); the other transduces bca into dca (transitions b/d, c/c, a/a).]

[Figure 16: Top: the input a a b c a b. Middle: the first factorization a (a b) c (a b), giving a bc c bc. Bottom: the second factorization a a (b c a) b, giving a a dca b.]

The notion of local extension is formalized through the following definition.

¹⁰ dom(f) denotes the domain of f, that is, the set of words that have at least one output through f.
¹¹ If w₁, w₂ ∈ Σ*, w₁ · w₂ denotes the concatenation of w₁ and w₂. It can also be written w₁w₂.
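The two outputs above can be reproduced with a small illustrative sketch (ours, not the construction of this section, which builds a transducer): it enumerates the factorizations of the input whose unchanged stretches contain no factor of dom(f), rewriting each chosen factor. Representing f as a dict from factors to images is our assumption.

```python
def local_extension_outputs(f, w):
    """All outputs of LocExt(f) on w; f maps each factor of dom(f)
    to its (single) image, e.g. f3 below."""
    results = set()

    def rec(i, out, run):
        # `run` is the stretch copied unchanged since the last factor;
        # by the definition of local extension it must contain no word of dom(f).
        if any(y in run for y in f):
            return
        if i == len(w):
            results.add(out)
            return
        rec(i + 1, out + w[i], run + w[i])       # copy one letter unchanged
        for y, img in f.items():                 # or rewrite a factor here
            if w.startswith(y, i):
                rec(i + len(y), out + img, '')
    rec(0, '', '')
    return results

f3 = {'ab': 'bc', 'bca': 'dca'}
print(sorted(local_extension_outputs(f3, 'aabcab')))   # ['aadcab', 'abccbc']
```

The pruning on `run` is what forces every occurrence of a domain factor in an unchanged stretch to be rejected, so only the two legal factorizations survive.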
Definition 1 If f is a rational transduction from Σ* to Σ*, the local extension F = LocExt(f) is the rational transduction from Σ* to Σ* defined in the following way: if u = a₁ · b₁ · a₂ · b₂ ⋯ aₙ · bₙ · aₙ₊₁ ∈ Σ*, then v = a₁ · b′₁ · a₂ · b′₂ ⋯ aₙ · b′ₙ · aₙ₊₁ ∈ F(u) if aᵢ ∈ Σ* − (Σ* · dom(f) · Σ*), bᵢ ∈ dom(f) and b′ᵢ ∈ f(bᵢ).¹²

Intuitively, if F = LocExt(f) and w ∈ Σ*, each factor of w in dom(f) is transformed into its image by f and the remaining part of w is left unchanged. If f is represented by a finite-state transducer T and LocExt(f) is represented by a finite-state transducer T′, one writes T′ = LocExt(T). It can also be seen that if Id_T is the identity function on Σ* − (Σ* · dom(T) · Σ*), then LocExt(T) = Id_T · (T · Id_T)*.¹³

¹² The dot '·' stands for the concatenation operation on strings.
¹³ In this last formula, the concatenation '·' stands for the concatenation of the graphs of the functions, that is, for the concatenation of the transducers viewed as automata whose labels are of the form a/b.

Figure 20 gives an algorithm that computes the local extension directly. The idea is that an input word is processed non-deterministically from left to right. Suppose for instance that we have the initial transducer T₄ of Figure 17 and that we want to build its local extension T₅ of Figure 18. When the input is read, if a current input letter cannot be transformed at the first state of T₄ (the letter c for instance), it is left unchanged: this is expressed by the looping transition on the initial state of T₅ labeled ?/?.¹⁴ On the other hand, if the input symbol, say a, can be processed at the initial state of T₄, one doesn't know yet whether a will be the beginning of a word that can be transformed (e.g. ab) or whether it will be followed by a sequence which makes it impossible to apply the transformation (e.g. ac). Hence one has to entertain two possibilities, namely (1) we are processing the input according to T₄ and the transition should be a/b, or (2) we are within the identity and the transition should be a/a. This leads to two kinds of states: the transduction states (marked transduction in the algorithm) and the identity states (marked identity in the algorithm). It can be seen in Figure 18 that this leads to a transducer that has a copy of the initial transducer and an additional part that processes the identity while making sure it could not have been transformed.

¹⁴ As explained before, a transition labeled by the symbol ? stands for all the transitions labeled with a letter that doesn't appear on any outgoing arc from this state. A transition labeled ?/? stands for all the diagonal pairs (a, a) s.t. a is not an input symbol on any outgoing arc from this state.
In other words, the algorithm consists of building a copy of the original transducer and, at the same time, the identity function that operates on Σ* − (Σ* · dom(T) · Σ*).

Let us now see how the algorithm of Figure 20 applies step by step to the transducer T₄ of Figure 17, producing the transducer T₅ of Figure 18. In Figure 20, C[0] = ({i}, identity) of line 1 states that the first state of the transducer to be built is of type identity and refers to the initial state i = 0 of T₄. q′ represents the current state and n the current number of states. In the loop do{…}while(q′ < n), one builds the transitions of each state one after the other: if a transition points to a state not already built, a new state is added, thus incrementing n. The program stops when all states have been inspected and when no additional state is created. The number of iterations is bounded by 2^(2‖T‖), where ‖T‖ = |Q| is the number of states of the original transducer.¹⁵

¹⁵ In fact, Q′ ⊆ 2^(Q × {transduction, identity}). Thus, |Q′| ≤ 2^(2|Q|).

Line 3 says that the current state within the loop will be q′ and that this state refers to the set of states S and is marked by the type type. In our example, at the first occurrence of this line, S is instantiated to {0} and type = identity. Line 5 adds the current identity state to the set of final states, together with a transition to the initial state for all letters that do not appear on any outgoing arc from this state. Lines 6 to 11 build the transitions from and to the identity states, keeping track of where this leads in the original transducer. For instance, a is a label that verifies the conditions of line 6. Thus a transition a/a is to be added toward the identity state 2, which refers to 1 (because of the transition a/b of T₄) and to i = 0 (because it is possible to start the transduction T₄ from any place of the identity). Line 7 checks whether this state already exists and adds it if necessary. e = n++ means that the arrival state for this transition, i.e. d(q′, w), will be the last added state and that the number of states being built has to be incremented. Line 11 actually builds the transition between 0 and e = 2, labeled a/a. Lines 12 through 17 describe the fact that it is possible to start a transduction from any identity state. Here one transition is added to one new state, i.e. a/b to 3. The next state to be considered is 2, and it is built like state 0 except that the symbol b should block the current output. In fact, state 2 refers to state 1 of T₄, meaning that we have already read a with a as output; thus if one reads b, this means that ab occurs at the current point, and since ab should be transformed into bc, the current identity transformation (that is, a → a) should be blocked: this is expressed
by the transition b/b that leads to state 1 (this state is a "trash" state; that is, it has no outgoing transition and it is not final). The following state is 3, which is marked as being of type transduction, which means that lines 19 through 27 should be applied. This consists simply of copying the transitions of the original transducer. If the original state was final, as for 4 = ({2}, transduction), an ε/ε transition back to the initial state is added (to get the behavior of T⁺). The transducer T₆ = LocExt(T₃) of Figure 19 gives a more complete (and slightly more complex) example of applying this algorithm.

[Figure 17: Sample Transducer T₄.]

[Figure 18: Local Extension T₅ of T₄: T₅ = LocExt(T₄). It contains a copy of T₄ (the transduction states), identity states such as ({0}, identity) and ({0,1}, identity), a trash state ({}, transduction), looping ?/? transitions, and a blocking b/b transition into the trash state.]
[Figure 19: Local Extension T₆ of T₃: T₆ = LocExt(T₃).]
LocalExtension(T′ = (Σ, Q′, i′, F′, E′), T = (Σ, Q, i, F, E))
 1  C[0] = ({i}, identity); q′ = 0; i′ = 0; F′ = ∅; E′ = ∅; Q′ = ∅;
    C[1] = (∅, transduction); n = 2;
 2  do {
 3    (S, type) = C[q′]; Q′ = Q′ ∪ {q′};
 4    if (type == identity)
 5      F′ = F′ ∪ {q′}; E′ = E′ ∪ {(q′, ?, ?, i′)};
 6      foreach w ∈ Σ ∪ {ε} s.t. ∃x ∈ S, d(x, w) ≠ ∅ and ∀y ∈ S, d(y, w) ∩ F = ∅
 7        if (∃r ∈ [0, n − 1] such that C[r] == ({i} ∪ ∪_{x∈S} d(x, w), identity))
 8          e = r;
 9        else
10          C[e = n++] = ({i} ∪ ∪_{x∈S} d(x, w), identity);
11        E′ = E′ ∪ {(q′, w, w, e)};
12      foreach (i, w, w′, x) ∈ E
13        if (∃r ∈ [0, n − 1] such that C[r] == ({x}, transduction))
14          e = r;
15        else
16          C[e = n++] = ({x}, transduction);
17        E′ = E′ ∪ {(q′, w, w′, e)};
18      foreach w ∈ Σ ∪ {ε} s.t. ∃x ∈ S, d(x, w) ∩ F ≠ ∅: E′ = E′ ∪ {(q′, w, w, 1)};
19    else if (type == transduction)
20      if ∃x₁ ∈ Q s.t. S == {x₁}
21        if (x₁ ∈ F) then E′ = E′ ∪ {(q′, ε, ε, 0)};
22        foreach (x₁, w, w′, y) ∈ E
23          if (∃r ∈ [0, n − 1] such that C[r] == ({y}, transduction))
24            e = r;
25          else
26            C[e = n++] = ({y}, transduction);
27          E′ = E′ ∪ {(q′, w, w′, e)};
28    q′++;
29  } while (q′ < n);

Figure 20: Local Extension Algorithm.
9 Determinization

The basic idea behind the determinization algorithm comes from Mehryar Mohri.¹⁶ In this section, after giving a formalization of the algorithm, we introduce a proof of its soundness and completeness together with a worst-case complexity analysis.

¹⁶ Mohri (1994b) also gives a formalization of the algorithm.

9.1 Determinization Algorithm

In the following, for w₁, w₂ ∈ Σ*, w₁ ∧ w₂ denotes the longest common prefix of w₁ and w₂. The finite-state transducers we use in our system have the property that they can be made deterministic; that is, there exists a subsequential transducer that represents the same function.¹⁷ If T = (Σ, Q, i, F, E) is such a finite-state transducer, the subsequential transducer T′ = (Σ, Q′, i′, F′, ⊗, ∗, ρ) defined as follows will later be proved equivalent to T:

¹⁷ As opposed to automata, a large class of finite-state transducers don't have any deterministic representation; they can't be determinized.

• Q′ ⊆ 2^(Q × Σ*). In fact, the determinization of the transducer is related to the determinization of FSAs in the sense that it also involves a power-set construction. The difference is that one has to keep track of the set of states of the original transducer one might be in and also of the words whose emission has been postponed. For instance, a state {(q₁, w₁), (q₂, w₂)} means that this state corresponds to a path that leads to q₁ and q₂ in the original transducer and that the emission of w₁ (resp. w₂) was delayed for q₁ (resp. q₂).

• i′ = {(i, ε)}. There is no postponed emission at the initial state.

• The emission function is defined by

  S ∗ a = ∧_{(q,u)∈S} ∧_{q′∈d(q,a)} u · δ(q, a, q′)

  This means that, for a given symbol, the set of possible emissions is obtained by concatenating the postponed emissions with the emissions at the current state.
Since one wants the transition to be deterministic, the actual emission is the longest common prefix of this set.

• The state transition function is defined by

  S ⊗ a = ∪_{(q,u)∈S} ∪_{q′∈d(q,a)} {(q′, (S ∗ a)⁻¹ · u · δ(q, a, q′))}

  Given u, v ∈ Σ*, u · v denotes the concatenation of u and v, and u⁻¹ · v = w if w is such that u · w = v; u⁻¹ · v = ∅ if no such w exists.

• F′ = {S ∈ Q′ | ∃(q, u) ∈ S and q ∈ F}.

• If S ∈ F′, ρ(S) = u s.t. ∃q ∈ F s.t. (q, u) ∈ S. We will see in the proof of correctness that ρ is properly defined.

The determinization algorithm of Figure 22 computes the above subsequential transducer. Let us now apply it to the finite-state transducer T₁ of Figure 13 and show how it builds the subsequential transducer T₅ of Figure 21. Line 1 of the algorithm builds the first state and instantiates it with the pair {(0, ε)}. q′ and n respectively denote the current state and the number of states built so far. At line 5, one takes all the possible input symbols w; here only a is possible. w′ of line 6 is the output: w′ = ∧_{q∈{1,2}} ε · δ(0, a, q), thus w′ = δ(0, a, 1) ∧ δ(0, a, 2) = b ∧ c = ε. Line 8 is then computed as follows: S′ = ∪_{q∈{0}} ∪_{q′∈{1,2}} {(q′, ε⁻¹ · δ(0, a, q′))}, thus S′ = {(1, δ(0, a, 1))} ∪ {(2, δ(0, a, 2))} = {(1, b), (2, c)}. Since no r verifies the condition of line 9, a new state e is created, to which the transition labeled a/w′ = a/ε points, and n is incremented. On line 15, the program goes on to the construction of the transitions of state 1. On line 5, d and e are then the two possible input symbols. The first symbol, d, at line 6, is such that w′ = ∧_{q′∈d(1,d)={2}} b · δ(1, d, q′) = bd. Henceforth, the computation of line 8 leads to S′ = ∪_{q∈{1}} ∪_{q′∈{2}} {(q′, (bd)⁻¹ · b · δ(1, d, q′))} = {(2, ε)}. State 2, labeled {(2, ε)}, is thus added, and a transition labeled d/bd pointing to state 2 is also added. The transition for the input symbol e is computed the same way.
[Figure 21: Subsequential transducer T₅ such that |T₅| = |T₁|. States: 0 = {(0, ε)}, 1 = {(1, b), (2, c)}, 2 = {(2, ε)}; transitions: a/ε from 0 to 1, d/bd from 1 to 2, e/ce from 1 to 2.]

DeterminizeTransducer(T′ = (Σ, Q′, i′, F′, ⊗, ∗, ρ), T = (Σ, Q, i, F, E))
 1  i′ = 0; q′ = 0; n = 1; C[0] = {(0, ε)}; F′ = ∅; Q′ = ∅;
 2  do {
 3    S = C[q′]; Q′ = Q′ ∪ {q′};
 4    if ∃(q, u) ∈ S s.t. q ∈ F then F′ = F′ ∪ {q′} and ρ(q′) = u;
 5    foreach w such that ∃(q, u) ∈ S and d(q, w) ≠ ∅ {
 6      w′ = ∧_{(q,u)∈S} ∧_{q″∈d(q,w)} u · δ(q, w, q″);
 7      q′ ∗ w = w′;
 8      S′ = ∪_{(q,u)∈S} ∪_{q″∈d(q,w)} {(q″, w′⁻¹ · u · δ(q, w, q″))};
 9      if ∃r ∈ [0, n − 1] such that C[r] == S′
10        e = r;
11      else
12        C[e = n++] = S′;
13      q′ ⊗ w = e;
14    }
15    q′++;
16  } while (q′ < n);

Figure 22: Determinization Algorithm.
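The construction of Figure 22 can be sketched in a few lines of Python (an illustrative rendering of ours, not the authors' code): states are sets of (state, delayed output) pairs, the emission at each step is the longest common prefix of all candidate emissions, and the remainders are carried along as delayed outputs.

```python
from os.path import commonprefix   # longest common prefix of strings

def determinize(sigma, edges, init, finals):
    """Power-set construction with delayed outputs (cf. Figure 22).
    edges: set of (q, a, w, q2) with no ε inputs. Returns the maps
    delta (S ⊗ a), star (S ∗ a), rho (final emissions) and the start state."""
    def d(q, a):
        return {q2 for (p, b, w, q2) in edges if p == q and b == a}
    def emissions(q, a, q2):
        return [w for (p, b, w, r) in edges if p == q and b == a and r == q2]

    start = frozenset({(init, '')})
    delta, star, rho = {}, {}, {}
    stack, seen = [start], {start}
    while stack:
        S = stack.pop()
        for (q, u) in S:
            if q in finals:
                rho[S] = u                      # unique by Lemma 3 below
        for a in sigma:
            outs = [u + w for (q, u) in S for q2 in d(q, a)
                    for w in emissions(q, a, q2)]
            if not outs:
                continue
            w_out = commonprefix(outs)          # S ∗ a
            S2 = frozenset((q2, (u + w)[len(w_out):])   # delayed remainder
                           for (q, u) in S for q2 in d(q, a)
                           for w in emissions(q, a, q2))
            delta[(S, a)], star[(S, a)] = S2, w_out
            if S2 not in seen:
                seen.add(S2)
                stack.append(S2)
    return delta, star, rho, start

def run(delta, star, rho, start, word):
    S, out = start, ''
    for a in word:
        out += star[(S, a)]
        S = delta[(S, a)]
    return out + rho[S]                         # defined iff S is final

T1 = {(0, 'a', 'b', 1), (0, 'a', 'c', 2), (1, 'd', 'd', 3), (2, 'e', 'e', 3)}
d_, s_, r_, i_ = determinize('ade', T1, 0, {3})
print(run(d_, s_, r_, i_, 'ad'), run(d_, s_, r_, i_, 'ae'))   # bd ce
```

On T₁ this produces exactly the three states of Figure 21. Like the algorithm in the text, this sketch loops forever on a transducer whose function is not subsequential.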
The subsequential transducer generated by this algorithm could in turn be minimized by an algorithm described in (Mohri, 1994a). However, in the case of the part-of-speech tagger, the transducer is nearly minimal.

9.2 Proof of Correctness

Although it is decidable whether a function is subsequential or not (Choffrut, 1977), the determinization algorithm described in the previous section does not terminate when run on a non-subsequential function. Two issues are addressed in this section. First, the proof of soundness: the fact that if the algorithm terminates, then the output transducer is deterministic and represents the same function. Second, the proof of completeness: the algorithm terminates in the case of subsequential functions. Soundness and completeness are a consequence of the main proposition, which states that if a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer representing the same function.

In order to simplify the proofs, we will only consider transducers that do not have ε-input transitions, that is, E ⊆ Q × Σ × Σ* × Q, and also, without loss of generality, transducers that are reduced and that are deterministic in the sense of finite-state automata.¹⁸

In order to prove this proposition, we need to establish some preliminary notations and lemmas. First we extend the definitions of the transition function d, the emission function δ, the deterministic transition function ⊗ and the deterministic emission function ∗ to words in the classical way. We then have the following properties:

d(q, ab) = ∪_{q′∈d(q,a)} d(q′, b)
δ(q₁, ab, q₂) = ∪_{q′∈d(q₁,a) s.t. q₂∈d(q′,b)} δ(q₁, a, q′) · δ(q′, b, q₂)
q ⊗ ab = (q ⊗ a) ⊗ b
q ∗ ab = (q ∗ a) · ((q ⊗ a) ∗ b)

¹⁸ A transducer defines an automaton whose labels are the pairs "input/output"; this automaton is assumed to be deterministic.
For the following, it is useful to note that if |T| is a function, then δ is a function too.

The following lemma states an invariant that holds for each state S built within the algorithm. The lemma will later be used for the proof of soundness.

Lemma 1 Let I = C[0] be the initial state. At each iteration of the "do" loop in DeterminizeTransducer, for each S = C[q′] and for each w ∈ Σ* such that I ⊗ w = S, the following holds:

(i) I ∗ w = ∧_{q∈d(i,w)} δ(i, w, q)
(ii) S = I ⊗ w = {(q, u) | q ∈ d(i, w) and u = (I ∗ w)⁻¹ · δ(i, w, q)}

Proof. (i) and (ii) are obviously true for S = I (since d(i, ε) = {i} and δ(i, ε, i) = ε), and we will show that, given some w ∈ Σ*, if the claim is true for S = I ⊗ w, then it is also true for S₁ = S ⊗ a = I ⊗ wa for all a ∈ Σ. Assuming that (i) and (ii) hold for S and w, then for each a ∈ Σ:

∧_{q′∈d(i,w), q″∈d(q′,a)} δ(i, w, q′) · δ(q′, a, q″)
  = (I ∗ w) · ∧_{q′∈d(i,w), q″∈d(q′,a)} ((I ∗ w)⁻¹ · δ(i, w, q′)) · δ(q′, a, q″)
  = (I ∗ w) · ∧_{(q′,u)∈S=I⊗w, q″∈d(q′,a)} u · δ(q′, a, q″)
  = (I ∗ w) · (S ∗ a)
  = (I ∗ w) · ((I ⊗ w) ∗ a)
  = I ∗ wa

This proves (i). We now turn to (ii). Assuming that (i) and (ii) hold for S and w, then for each a ∈ Σ, let S₁ = S ⊗ a; the algorithm (line 8) is such that

S₁ = {(q″, u′) | ∃(q′, u) ∈ S, q″ ∈ d(q′, a) and u′ = (S ∗ a)⁻¹ · u · δ(q′, a, q″)}

Let

S₂ = {(q″, u′) | q″ ∈ d(i, wa) and u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″)}

We show that S₁ ⊆ S₂. Let (q″, u′) ∈ S₁; then ∃(q′, u) ∈ S s.t. q″ ∈ d(q′, a) and u′ = (S ∗ a)⁻¹ · u · δ(q′, a, q″). Since u = (I ∗ w)⁻¹ · δ(i, w, q′), then
u′ = (S ∗ a)⁻¹ · (I ∗ w)⁻¹ · δ(i, w, q′) · δ(q′, a, q″), that is, u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″). Thus (q″, u′) ∈ S₂. Hence S₁ ⊆ S₂. We now show that S₂ ⊆ S₁. Let (q″, u′) ∈ S₂, and let q′ ∈ d(i, w) be s.t. q″ ∈ d(q′, a) and u = (I ∗ w)⁻¹ · δ(i, w, q′); then (q′, u) ∈ S, and since u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″) = (S ∗ a)⁻¹ · u · δ(q′, a, q″), (q″, u′) ∈ S₁. This concludes the proof of (ii). □

The following lemma states a common property of the states S, which will be used in the complexity analysis of the algorithm.

Lemma 2 Each S = C[q′] built within the "do" loop is s.t. ∀q ∈ Q, there is at most one pair (q, w′) ∈ S with q as first element.

Proof. Suppose (q, w₁) ∈ S and (q, w₂) ∈ S, and let w be s.t. I ⊗ w = S. Then w₁ = (I ∗ w)⁻¹ · δ(i, w, q) and w₂ = (I ∗ w)⁻¹ · δ(i, w, q). Thus w₁ = w₂. □

The following lemma will also be used for soundness. It states that the final state emission function is indeed a function.

Lemma 3 For each S built in the algorithm, if (q, u), (q′, u′) ∈ S, then q, q′ ∈ F ⇒ u = u′.

Proof. Let S be one state set built in line 8 of the algorithm. Suppose (q, u), (q′, u′) ∈ S and q, q′ ∈ F. According to (ii) of Lemma 1, u = (I ∗ w)⁻¹ · δ(i, w, q) and u′ = (I ∗ w)⁻¹ · δ(i, w, q′). Since |T| is a function and {δ(i, w, q), δ(i, w, q′)} ⊆ |T|(w), then δ(i, w, q) = δ(i, w, q′), therefore u = u′. □

The following lemma will be used for completeness.

Lemma 4 Given a transducer T representing a subsequential function, there exists a bound M s.t. for each S built at line 8, for each (q, u) ∈ S, |u| ≤ M.

We rely on the following theorem, proven by Choffrut (1978):

Theorem 1 A function f on Σ* is subsequential iff it has bounded variations and, for any rational language L ⊆ Σ*, f⁻¹(L) is also rational.

with the following two definitions:
Definition 2 The left distance between two strings u and v is ‖u, v‖ = |u| + |v| − 2|u ∧ v|.

Definition 3 A function f on Σ* has bounded variations iff for all k ≥ 0, there exists K ≥ 0 s.t. for u, v ∈ dom(f), ‖u, v‖ ≤ k ⇒ ‖f(u), f(v)‖ ≤ K.
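As a quick numeric illustration of Definition 2 (our example, not the paper's):

```python
def left_distance(u, v):
    """‖u, v‖ = |u| + |v| − 2|u ∧ v|, where u ∧ v is the longest common prefix."""
    p = 0
    while p < min(len(u), len(v)) and u[p] == v[p]:
        p += 1
    return len(u) + len(v) - 2 * p

print(left_distance('abcd', 'abef'))  # |u| = 4, |v| = 4, |u ∧ v| = 2, so 4
print(left_distance('abc', 'abc'))    # 0
```

Intuitively, ‖u, v‖ counts the letters outside the shared prefix, which is exactly the part of the output a deterministic device cannot yet commit to.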
Proof of Lemma 4. Let f = |T|. For each q ∈ Q, let c(q) be a string w s.t. d(q, w) ∩ F ≠ ∅ and s.t. |w| is minimal among such strings. Note that |c(q)| ≤ ‖T‖, where ‖T‖ = |Q| is the number of states in T. For each q ∈ Q, let s(q) ∈ Q be a state s.t. s(q) ∈ d(q, c(q)) ∩ F. Let us further define

M₁ = max_{q∈Q} |δ(q, c(q), s(q))|
M₂ = max_{q∈Q} |c(q)|

Since f is subsequential, it is of bounded variations; therefore there exists K s.t. if ‖u, v‖ ≤ 2M₂ then ‖f(u), f(v)‖ ≤ K. Let M = K + 2M₁. Let S be a state set built at line 8, let w be s.t. I ⊗ w = S, and let λ = I ∗ w. Let (q₁, u) ∈ S. Let (q₂, v) ∈ S be s.t. u ∧ v = ε. Such a pair always exists, since if not, |∧_{(q′,u′)∈S} u′| > 0, thus

|λ · ∧_{(q′,u′)∈S} u′| = |∧_{(q′,u′)∈S} λ · u′| > |λ|

and thus, because of (ii) in Lemma 1, |∧_{q∈d(i,w)} δ(i, w, q)| > |I ∗ w|, which contradicts (i) in Lemma 1.

Let ω = δ(q₁, c(q₁), s(q₁)) and ω′ = δ(q₂, c(q₂), s(q₂)). Moreover, for any a, b, c, d ∈ Σ*, ‖a, c‖ ≤ ‖ab, cd‖ + |b| + |d|. In fact,

‖ab, cd‖ = |ab| + |cd| − 2|ab ∧ cd|
         = |a| + |c| + |b| + |d| − 2|ab ∧ cd|
         = ‖a, c‖ + 2|a ∧ c| + |b| + |d| − 2|ab ∧ cd|

but |ab ∧ cd| ≤ |a ∧ c| + |b| + |d|, and since ‖ab, cd‖ = ‖a, c‖ − 2(|ab ∧ cd| − |a ∧ c| − |b| − |d|) − |b| − |d|, one has ‖a, c‖ ≤ ‖ab, cd‖ + |b| + |d|.

Therefore, in particular, |u| ≤ ‖u, v‖ ≤ ‖uω, vω′‖ + |ω| + |ω′|, thus |u| ≤ ‖f(w · c(q₁)), f(w · c(q₂))‖ + 2M₁. But ‖w · c(q₁), w · c(q₂)‖ ≤ |c(q₁)| + |c(q₂)| ≤ 2M₂, thus ‖f(w · c(q₁)), f(w · c(q₂))‖ ≤ K and therefore |u| ≤ K + 2M₁ = M. □
The time is now ripe for the main proposition, which proves soundness and completeness.

Proposition 1 If a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer τ representing the same function.

Proof. Lemma 4 shows that the algorithm always terminates if |T| is subsequential. Let us show that dom(|τ|) = dom(|T|). Let w ∈ Σ* s.t. w is not in dom(|T|); then d(i, w) ∩ F = ∅. Thus, according to (ii) of Lemma 1, for all (q, u) ∈ I ⊗ w, q is not in F; thus I ⊗ w is not terminal and therefore w is not in dom(|τ|). Conversely, let w ∈ dom(|T|). There exists a unique q_f ∈ F s.t. |T|(w) = δ(i, w, q_f) and s.t. q_f ∈ d(i, w). Therefore |T|(w) = (I ∗ w) · ((I ∗ w)⁻¹ · δ(i, w, q_f)), and according to (ii) of Lemma 1, (q_f, (I ∗ w)⁻¹ · δ(i, w, q_f)) ∈ I ⊗ w; and since q_f ∈ F, Lemma 3 shows that ρ(I ⊗ w) = (I ∗ w)⁻¹ · δ(i, w, q_f), thus |T|(w) = (I ∗ w) · ρ(I ⊗ w) = |τ|(w). □
9.3 Worst-case complexity

In this section we give a worst-case upper bound on the size of the subsequential transducer in terms of the size of the input transducer. Let L = {w ∈ Σ* s.t. |w| ≤ M}, where M is the bound defined in the proof of Lemma 4. Since, according to Lemma 2, each state set Q″ contains at most one pair (q, w) for each q ∈ Q, the maximal number N of states built in the algorithm is smaller than the sum, over all state sets, of the number of functions from states to strings in L, that is,

N ≤ Σ_{Q″∈2^Q} |L|^{|Q″|}

We thus have N ≤ 2^{|Q|} · |L|^{|Q|} = 2^{|Q|} · 2^{|Q| log₂|L|}, and therefore N ≤ 2^{|Q|(1+log₂|L|)}.

Moreover, |L| = 1 + |Σ| + ⋯ + |Σ|^M = (|Σ|^{M+1} − 1)/(|Σ| − 1) if |Σ| > 1, and |L| = M + 1 if |Σ| = 1. In this last formula, M = K + 2M₁, as described in Lemma 4. Note that if P = max_{a∈Σ} |δ(q, a, q′)| is the maximal length of the simple transition emissions, then M₁ ≤ |Q| · P, thus M ≤ K + 2 · |Q| · P. Therefore, if |Σ| > 1, the number of states N is bounded by

N ≤ 2^{|Q|(1 + log₂((|Σ|^{K+2|Q|P+1} − 1)/(|Σ| − 1)))}

and if |Σ| = 1, N ≤ 2^{|Q|(1 + log₂(K + 2|Q|P + 1))}.
10 Subsequentiality of Transformation-Based Systems

The proof of correctness of the determinization algorithm and the fact that the algorithm terminates on the transducer encoding Brill's tagger show that the final function is subsequential and equivalent to Brill's original tagger. In this section, we prove in general that any transformation-based system, such as those used by Brill, is a subsequential function. In other words, any transformation-based system can be turned into a deterministic finite-state transducer.

We define transformation-based systems as follows.

Definition 4 A transformation-based system is a finite sequence (f₁, …, fₙ) of subsequential functions whose domains are bounded.

Applying a transformation-based system consists of taking the functions fᵢ one after the other and, for each of them, looking for the first position in the input at which it applies and for the longest string starting at that position, transforming this string, going to the end of this string, and iterating until the end of the input.
It is not true that, in general, the local extension of a subsequential function is subsequential.¹⁹ For instance, consider the function f_a of Figure 23. The local extension of f_a is not a function. In fact, consider the input string daaaad: it can be decomposed either into d · aaa · ad or into da · aaa · d. The first decomposition leads to the output dbbbad and the second one to the output dabbbd.

¹⁹ However, the local extensions of the functions we had to compute were subsequential.

[Figure 23: Function f_a: a chain of three transitions a/b, a/b, a/b, mapping aaa to bbb.]

The intended use of the rules in the tagger defined by Brill is to apply each function from left to right. In addition, if several decompositions are possible, the one that occurs first is the one chosen. In our previous example, it means that only the output dbbbad is generated. This notion is now defined precisely.
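The left-to-right application regime just described can be sketched as follows (our illustration; representing each bounded-domain function as a dict from factors to images, and taking the longest match at each position, is our assumption):

```python
def apply_transform(f, w):
    """Apply one function f (finite domain, given as a dict) from left to
    right: at each position rewrite the longest matching factor, if any,
    continue after it; otherwise copy the letter."""
    out, i = '', 0
    while i < len(w):
        matches = [y for y in f if w.startswith(y, i)]
        if matches:
            y = max(matches, key=len)
            out += f[y]
            i += len(y)          # go to the end of the transformed string
        else:
            out += w[i]
            i += 1
    return out

def apply_system(fs, w):
    """Apply the functions of a transformation-based system in sequence."""
    for f in fs:
        w = apply_transform(f, w)
    return w

print(apply_transform({'aaa': 'bbb'}, 'daaaad'))   # dbbbad
```

With f_a encoded as {'aaa': 'bbb'}, only the leftmost decomposition fires, so only dbbbad is produced, matching the behavior described above.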
  • b
e the rational function dened b y (a) = a for a 2 , ([ ) = (]) =
  • n
the additional sym b
  • ls
'[ ' and '] ' with
  • suc
h that (u
  • v
) = (u)
  • (v
). Denition 5 L et Y
  • and
X =
  • Y
  • ,
a Y
  • de
c
  • mp
  • sition
  • f
x is a string y 2 X
  • ([
  • Y
  • ]
  • X
)
  • s.t.
(y ) = x F
  • r
instance, if Y = dom(f a ) = faaag, the set
  • f
Y
  • decomp
  • sitions
  • f
x = daaad is fd[aaa]ad; da[aaa ]dg. Denition 6 L et < b e a total
  • r
der
  • n
  • and
let
  • =
  • [
f[ ; ]g b e the alphab et
  • with
the two additional symb
  • ls
'[ ' and ']'. L et extend the
  • r
der > to
  • by
8a 2 , '[ '< a and a < '] '. < denes a lexic
  • gr
aphic
  • r
der
  • n
  • that
we also denote <. L et Y
  • and
x 2
  • ,
the minimal Y
  • decomp
  • sition
  • f
x is the Y
  • de
c
  • mp
  • sition
which is minimal in (
  • ;
<). F
  • r
instance, the minim al dom(f a )-decomp
  • sition
  • f
daaaad is d[aaa]ad. In fact, d[aaa]ad < da[aaa]d. Prop
  • sition
2 Given Y
  • +
nite, the function md Y that to e ach x 2
  • asso
ciates its minimal Y
  • de
c
  • mp
  • sition,
is subse quential and total. Pro
  • f.
Let de c b e dened b y de c (w ) = u
  • [
  • v
  • ]
  • de
c((uv ) 1
  • w
) where u; v 2
  • are
s.t. v 2 Y , 9v 2
  • with
w = uv v and juj is minimal among suc h strings. The function md Y is total b ecause the function de c alw a ys returns an
  • utput
whic h is a Y
  • decomp
  • sition
  • f
w . MERL-TR-94-07. V ersion 3.0 Marc h 1995
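Before continuing, the running example can be checked mechanically. The following brute-force sketch is our own illustration, not code from the paper: it enumerates the Y-decompositions of Definition 5, applies f_a inside the brackets to show that an unrestricted local extension would not be a function, and selects the minimal decomposition of Definition 6 under the order '[' < a < ']'.

```python
# Brute-force illustration of Y-decompositions (Definition 5) and the
# minimal Y-decomposition md_Y (Definition 6), for the running example
# Y = dom(f_a) = {"aaa"} with f_a(aaa) = bbb. All names are ours.

def in_X(s, Y):
    """s is in X = Sigma* - Sigma*.Y.Sigma*, i.e. has no factor in Y."""
    return not any(s[i:i + len(y)] == y for y in Y for i in range(len(s)))

def decompositions(x, Y):
    """All Y-decompositions of x: strings in X.([.Y.].X)* that give
    back x once the brackets are erased."""
    out = []
    def rec(rest, acc):
        for i in range(len(rest)):
            for y in Y:
                # bracket a factor y at position i; the segment before
                # the bracket must itself lie in X
                if rest[i:i + len(y)] == y and in_X(rest[:i], Y):
                    rec(rest[i + len(y):], acc + rest[:i] + "[" + y + "]")
        if in_X(rest, Y):              # final unbracketed segment
            out.append(acc + rest)
    rec(x, "")
    return out

def apply_inside(decomp, f):
    """Apply f to each bracketed factor and erase the brackets."""
    result, i = [], 0
    while i < len(decomp):
        if decomp[i] == "[":
            j = decomp.index("]", i)
            result.append(f[decomp[i + 1:j]])
            i = j + 1
        else:
            result.append(decomp[i])
            i += 1
    return "".join(result)

def md(x, Y):
    """Minimal Y-decomposition under '[' < letters < ']' (Definition 6)."""
    rank = {"[": 0, "]": 2}
    return min(decompositions(x, Y),
               key=lambda s: [(rank.get(c, 1), c) for c in s])

f_a = {"aaa": "bbb"}
ds = decompositions("daaaad", {"aaa"})
print(sorted(ds))                                   # ['d[aaa]ad', 'da[aaa]d']
print([apply_inside(d, f_a) for d in sorted(ds)])   # ['dbbbad', 'dabbbd']
print(md("daaaad", {"aaa"}))                        # d[aaa]ad
```

On the running example this confirms the two decompositions of daaaad, their two distinct outputs dbbbad and dabbbd, and that md_Y selects d[aaa]ad.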
We shall now prove that the function is rational and then that it has bounded variations; this will prove, according to theorem 1, that the function is subsequential. In the following, X = Σ* − Σ*·Y·Σ*.

The transduction T_Y that generates the set of Y-decompositions is defined by T_Y = Id_X · (ε/[ · Id_Y · ε/] · Id_X)*, where Id_X (resp. Id_Y) stands for the identity function on X (resp. Y). Furthermore, the transduction T_{Σ̄,>} that to each string w ∈ Σ̄* associates the set of strings strictly greater than w, that is T_{Σ̄,>}(w) = {w′ ∈ Σ̄* | w < w′}, is defined by the transducer of Figure 24, in which A = {(x, x) | x ∈ Σ̄}, B = {(x, y) ∈ Σ̄² | x < y}, C = Σ̄², D = {ε} × Σ̄ and E = Σ̄ × {ε}.²⁰
[Transducer diagram: states 1-4 with arcs labeled by the pair sets A, B, C, D, E defined above.]
Figure 24: Transduction T_{Σ̄,>}

Therefore, the right-minimal Y-decomposition function md_Y is defined by md_Y = T_Y − (T_{Σ̄,>} ∘ T_Y), which proves that md_Y is rational.

Let k > 0. Let K = 6·k + 6·M, where M = max_{x∈Y} |x|. Let u, v ∈ Σ* be s.t. ‖u, v‖ ≤ k. Let us consider two cases: (i) |u ∧ v| ≤ M and (ii) |u ∧ v| > M.

(i): |u ∧ v| ≤ M, thus |u|, |v| ≤ |u ∧ v| + ‖u, v‖ ≤ M + k. Moreover, for each w ∈ Σ* and each Y-decomposition w′ of w, |w′| ≤ 3·|w|. In fact, Y doesn't contain ε, thus the number of '[' (resp. ']') in w′ is smaller than |w|. Therefore, |md_Y(u)|, |md_Y(v)| ≤ 3·(M + k), thus ‖md_Y(u), md_Y(v)‖ ≤ K.

(ii): u ∧ v = α·ω with |ω| = M. Let β, γ be s.t. u = α·ω·β and v = α·ω·γ. Let α′, ω′, β′, α″, ω″ and γ″ be s.t. md_Y(u) = α′·ω′·β′, md_Y(v) = α″·ω″·γ″, φ(α′) = φ(α″) = α, φ(ω′) = φ(ω″) = ω, φ(β′) = β and φ(γ″) = γ. Suppose that α′ ≠ α″, for instance α′ < α″. Let i be the first index s.t. (α′)_i < (α″)_i.²¹

²⁰ This construction is similar to the transduction built within the proof of Eilenberg's cross-section theorem (Eilenberg, 1974).
²¹ (w)_i refers to the i-th letter in w.
We have two possible situations:

(ii.1) (α′)_i = '[' and (α″)_i ∈ Σ, or (α″)_i = ']'. In that case, since the length of the elements in Y is smaller than M = |ω|, one has α′·ω′ = λ₁·[·λ₂·]·λ₃ with |λ₁| = i, λ₂ ∈ Y and λ₃ ∈ Σ̄*. We also have α″·ω″ = λ₁·λ₂′·λ₃′ with φ(λ₂′) = φ(λ₂), and the first letter of λ₂′ is different from '['. Let λ₄ be a Y-decomposition of φ(λ₃′·γ″); then λ₁·[·λ₂·]·λ₄ is a Y-decomposition of v strictly smaller than λ₁·λ₂′·λ₃′·γ″ = md_Y(v), which contradicts the minimality of md_Y(v).

The second situation is (ii.2): (α′)_i ∈ Σ and (α″)_i = ']'. Then we have α′·ω′ = λ₁·[·λ₂·λ₃·]·λ₄ s.t. |λ₁·[·λ₂| = i, and α″·ω″ = λ₁·[·λ₂·]·λ₃′·λ₄′ s.t. φ(λ₃) = φ(λ₃′) and φ(λ₄) = φ(λ₄′). Let λ₅ be a Y-decomposition of φ(λ₄′·γ″); then λ₁·[·λ₂·λ₃·]·λ₅ is a Y-decomposition of v strictly smaller than α″·ω″·γ″, which leads to the same contradiction.

Therefore α′ = α″, and since |β′| + |γ″| ≤ 3·(|β| + |γ|) = 3·‖u, v‖ ≤ 3·k, we have ‖md_Y(u), md_Y(v)‖ ≤ |ω′| + |ω″| + |β′| + |γ″| ≤ 2·3·M + 3·k ≤ K. This proves that md_Y has bounded variations and therefore that it is subsequential. □

We can now define precisely what is the effect
of a function when one applies it from left to right, as was done in the original tagger.

Definition 7 If f is a rational function and Y = dom(f) ⊆ Σ⁺, the right-minimal local extension of f, denoted RmLocExt(f), is the composition of the right-minimal Y-decomposition md_Y with Id_X · ([/ε · f · ]/ε · Id_X)*.

RmLocExt being the composition of two subsequential functions, it is itself subsequential. This proves the following final proposition, which states that, given a rule-based system similar to Brill's system, one can build a subsequential transducer that represents it:

Proposition 3 If (f₁, …, fₙ) is a sequence of subsequential functions with bounded domains, then RmLocExt(f₁) ∘ ⋯ ∘ RmLocExt(fₙ) is subsequential.

We have proven in this section that our techniques apply to the class of transformation-based systems. We now turn our attention to the implementation of finite-state transducers.
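Concretely, the left-to-right behavior that RmLocExt formalizes can be simulated by greedy leftmost-longest matching, which coincides with the right-minimal decomposition on examples like those above; a transformation-based system is then the chaining of such passes. This is our own sketch with toy rules, not Brill's actual rule set:

```python
# Sketch of applying one rule leftmost-longest (the effect of
# RmLocExt(f)), then chaining several rules as in Proposition 3.
# Each rule f is a dict over its bounded domain; the rules are toy
# examples of ours.

def rm_local_ext(f, text):
    """Apply f from left to right, choosing at each position the
    longest matching factor of dom(f), then resuming after it."""
    out, i = [], 0
    while i < len(text):
        m = max((y for y in f if text.startswith(y, i)),
                key=len, default=None)
        if m is None:
            out.append(text[i])      # no rule applies here: copy letter
            i += 1
        else:
            out.append(f[m])         # transform the factor
            i += len(m)              # resume after the transformed factor
    return "".join(out)

def transformation_based_system(rules, text):
    """Apply the rules f_1, ..., f_n one after the other (Definition 4)."""
    for f in rules:
        text = rm_local_ext(f, text)
    return text

f_a = {"aaa": "bbb"}                          # running example of Figure 23
print(rm_local_ext(f_a, "daaaad"))            # -> dbbbad
print(transformation_based_system([f_a, {"bb": "c"}], "daaaad"))  # -> dcbad
```

The first call reproduces the intended single output dbbbad of the tagger's left-to-right application; the second shows two rules applied in sequence.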
11 Implementation of Finite-State Transducers

Once the final finite-state transducer is computed, applying it to an input is straightforward: it consists of following a unique path in the transducer whose left labels correspond to the input. However, in order to have a complexity fully independent of the size of the grammar, and in particular independent of the number of transitions at each state, one should carefully choose an appropriate representation for the transducer. In our implementation, the transitions can be accessed randomly. The transducer is first represented by a two-dimensional table whose rows are indexed by the states and whose columns are indexed by the alphabet of all possible input letters. The content of the table at line q and at column a is the word w such that the transition from q with the input label a outputs w. Since only a few transitions are allowed from many states, this table is very sparse and can be compressed. This compression is achieved using a procedure for sparse data tables following the method given by Tarjan and Yao (1979).

12 Acknowledgments

We thank Eric Brill for providing us with the code
of his tagger and for many useful discussions. We also thank Aravind K. Joshi, Mark Liberman and Mehryar Mohri for valuable discussions. We thank the anonymous reviewers for many helpful comments that led to improvements in both the content and the presentation of this paper.

13 Conclusion

The techniques described in this paper are more general than the problem
of part-of-speech tagging and are applicable to the class of problems dealing with local transformation rules.

We showed that any transformation-based program can be transformed into a deterministic finite-state transducer. This yields optimal-time implementations of transformation-based programs.
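To make the table representation of Section 11 concrete, here is a minimal sketch with a toy transducer of our own (the tagger's actual tables are sparse and compressed following Tarjan and Yao, 1979): each cell holds the emitted word and the next state, so processing an input of length n costs exactly n random accesses, independent of the number of rules.

```python
# Toy sketch of table-driven application of a deterministic
# (subsequential) transducer: rows = states, columns = input letters,
# each cell = (output word, next state). The final-output function is
# part of the standard subsequential-transducer definition.

def apply_transducer(table, final, start, text):
    """Follow the single deterministic path: O(len(text)) lookups."""
    state, out = start, []
    for ch in text:
        word, state = table[state][ch]   # random access into the table
        out.append(word)
    out.append(final[state])             # final emission at end of input
    return "".join(out)

# Hypothetical transducer rewriting every "ab" as "x":
# state 0 = nothing pending, state 1 = just read an 'a'.
table = {
    0: {"a": ("", 1), "b": ("b", 0)},
    1: {"a": ("a", 1), "b": ("x", 0)},
}
final = {0: "", 1: "a"}                  # flush a pending 'a' at the end
print(apply_transducer(table, final, 0, "abba"))   # -> xba
```

The loop touches one cell per input letter, which is exactly the single-path, grammar-size-independent behavior claimed above.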
As a case study, we applied these techniques to the problem of part-of-speech tagging and presented a finite-state tagger that requires n steps to tag a sentence of length n, independent of the number of rules and the length of the context they require. We achieved this result by representing the rules acquired for Brill's tagger as non-deterministic finite-state transducers. We composed each of these non-deterministic transducers and turned the resulting transducer into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. The tagger outperforms in speed both Brill's tagger and trigram-based taggers. Moreover, the finite-state tagger inherits from the rule-based system its compactness compared to a stochastic tagger. We also proved the correctness and the generality of the methods.

We believe that this finite-state tagger will also be found useful combined with other language components, since it can be naturally extended by composing it with finite-state transducers which could encode other aspects of natural language syntax.

Bibliography

Brill, Eric. 1992. A simple rule-based part
of speech tagger. In Third Conference on Applied Natural Language Processing, pages 152–155, Trento, Italy.

Brill, Eric. 1994. A report of recent progress in transformation error-driven learning. In AAAI'94, Tenth National Conference on Artificial Intelligence.

Choffrut, Christian. 1977. Une caractérisation des fonctions séquentielles et des fonctions sous-séquentielles en tant que relations rationnelles. Theoretical Computer Science, 5:325–338.

Choffrut, Christian. 1978. Contribution à l'étude de quelques familles remarquables de fonctions rationnelles. Ph.D. thesis, Université Paris VII (Thèse d'État).

Chomsky, N. 1964. Syntactic Structures. Mouton and Co., The Hague.

Church, Kenneth Ward. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, Austin, Texas.

Clemenceau, David. 1993. Structuration du Lexique et Reconnaissance de Mots Dérivés. Ph.D. thesis, Université Paris 7.

Cutting, Doug, Julian Kupiec, Jan Pederson, and Penelope Sibun. 1992. A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing, pages 133–140, Trento, Italy.

DeRose, S.J. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31–39.

Eilenberg, Samuel. 1974. Automata, Languages, and Machines. Academic Press, New York.

Elgot, C. C. and J. E. Mezei. 1965. On relations defined by generalized finite automata. IBM Journal of Research and Development, 9:47–65, January.

Francis, W. Nelson and Henry Kučera. 1982. Frequency Analysis of English Usage. Houghton Mifflin, Boston.

Kaplan, Ronald M. and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Karttunen, Lauri, Ronald M. Kaplan, and Annie Zaenen. 1992. Two-level morphology with composition. In Proceedings of the 14th International Conference on Computational Linguistics (COLING'92).

Kupiec, J. M. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Laporte, Eric. 1993. Phonétique et transducteurs. Technical report, Université Paris 7, June.

Merialdo, Bernard. 1990. Tagging text with a probabilistic model. Technical Report RC 15972, IBM Research Division.

Mohri, Mehryar. 1994a. Minimisation of sequential transducers. In Proceedings of the Conference on Computational Pattern Matching 1994.

Mohri, Mehryar. 1994b. On some applications of finite-state automata theory to natural language processing. Technical report, Institut Gaspard Monge.

Pereira, Fernando C. N., Michael Riley, and Richard W. Sproat. 1994. Weighted rational transductions and their application to human language processing. In ARPA Workshop on Human Language Technology. Morgan Kaufmann.

Revuz, Dominique. 1991. Dictionnaires et Lexiques, Méthodes et Algorithmes. Ph.D. thesis, Université Paris 7.

Roche, Emmanuel. 1993. Analyse Syntaxique Transformationnelle du Français par Transducteurs et Lexique-Grammaire. Ph.D. thesis, Université Paris 7, January.

Schützenberger, Marcel Paul. 1977. Sur une variante des fonctions séquentielles. Theoretical Computer Science, 4:47–57.

Silberztein, Max. 1993. Dictionnaires Electroniques et Analyse Lexicale du Français: Le Système INTEX. Masson.

Tapanainen, Pasi and Atro Voutilainen. 1993. Ambiguity resolution in a reductionistic parser. In Sixth Conference of the European Chapter of the ACL, Proceedings of the Conference, Utrecht, April.

Tarjan, Robert Endre and Andrew Chi-Chih Yao. 1979. Storing a sparse table. Communications of the ACM, 22(11):606–611, November.

Weischedel, Ralph, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):359–382, June.