MITSUBISHI ELECTRIC RESEARCH LABORATORIES
CAMBRIDGE RESEARCH CENTER

Deterministic Part-Of-Speech Tagging with Finite State Transducers

Emmanuel Roche and Yves Schabes
Mitsubishi Electric Research Laboratories
201 Broadway, Cambridge, MA 02139
e-mail: roche@merl.com and schabes@merl.com

TR-94-07. Version 3.0, March 1995

Abstract

Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved with a rule-based tagger by inferring rules from a training corpus. However, current implementations of the rule-based tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger inspired by the rule-based tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in a deterministic finite-state machine. This result is achieved by encoding the application of the rules found in the tagger as a non-deterministic finite-state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We then generalize the techniques to the class of transformation-based systems.

Published in Computational Linguistics, June 1995, 21(2), 227-253.

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories of Cambridge, Massachusetts; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, 1995
201 Broadway, Cambridge, Massachusetts 02139
Revision history:

1. Version 1.0, May 2nd 1994.
2. Version 1.1, June 16th 1994.
3. Version 1.2, June 22nd 1994.
4. Version 1.3, July 27th 1994.
5. Version 1.4, July 1994.
6. Version 2.0, December 9th 1994.
7. This version is Revision 3.0 of Date: 95/03.
1 Introduction

Finite-state devices have important applications to many areas of computer science, including pattern matching, databases and compiler technology. Although their linguistic adequacy to natural language processing has been questioned in the past (Chomsky, 1964), there has recently been a dramatic renewal of interest in the application of finite-state devices to several aspects of natural language processing. This renewal of interest is due to the speed and the compactness of finite-state representations. This efficiency is explained by two properties: finite-state devices can be made deterministic, and they can be turned into a minimal form. Such representations have been successfully applied to different aspects of natural language processing, such as morphological analysis and generation (Karttunen, Kaplan, and Zaenen, 1992; Clemenceau, 1993), parsing (Roche, 1993; Tapanainen and Voutilainen, 1993), phonology (Laporte, 1993; Kaplan and Kay, 1994) and speech recognition (Pereira, Riley, and Sproat, 1994). Although finite-state machines have been used for part-of-speech tagging (Tapanainen and Voutilainen, 1993; Silberztein, 1993), none of these approaches has the same flexibility as stochastic techniques. Unlike stochastic approaches to part-of-speech tagging (Church, 1988; Kupiec, 1992; Cutting et al., 1992; Merialdo, 1990; DeRose, 1988; Weischedel et al., 1993), up to now the knowledge found in finite-state taggers has been handcrafted and cannot be automatically acquired.

Recently, Brill (1992) described a rule-based tagger which performs as well as taggers based upon probabilistic models and which overcomes the limitations common in rule-based approaches to language processing: it is robust and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, current implementations of Brill's tagger are considerably slower than the ones based on probabilistic models, since it may require RCn elementary steps to tag an input of n words with R rules requiring at most C tokens of context.

Although the speed of current part-of-speech taggers is acceptable for interactive systems where a sentence at a time is being processed, it is not adequate for applications where large bodies of text need to be tagged, such as in information retrieval, indexing applications and grammar checking systems. Furthermore, the space required for part-of-speech taggers is also an issue in commercial personal computer applications such as grammar checking systems. In addition, part-of-speech taggers are often being coupled with a syntactic analysis module. Usually these two modules are written in different frameworks, making it very difficult to integrate interactions between the two modules.

In this paper, we design a tagger that requires n steps to tag a sentence of length n, independent of the number of rules and the length of the context they require. The tagger is represented by a finite-state transducer, a framework which can also be the basis for syntactic analysis. This finite-state tagger will also be found useful combined with other language components since it can be naturally extended by composing it with finite-state transducers which could encode other aspects of natural language syntax.

Relying on algorithms and formal characterization described in later sections, we explain how each rule in Brill's tagger can be viewed as a non-deterministic finite-state transducer. We also show how the application of all rules in Brill's tagger is achieved by composing each of these non-deterministic transducers and why non-determinism arises in this transducer. We then prove the correctness of the general algorithm for determinizing (whenever possible) finite-state transducers and we successfully apply this algorithm to the previously obtained non-deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. We also show how the lexicon used by the tagger can be optimally encoded using a finite-state machine.

The techniques used for the construction of the finite-state tagger are then formalized and mathematically proven correct. We introduce a proof of soundness and completeness with a worst-case complexity analysis for an algorithm for determinizing finite-state transducers. We conclude by proving how the method can be applied to the class of transformation-based error-driven systems.

2 Overview of Brill's Tagger

Brill's tagger is comprised of three parts, each of which is inferred from a training corpus: a lexical tagger, an unknown word tagger and a contextual tagger. For purposes of exposition, we will postpone the discussion of the
unknown word tagger and focus mainly on the contextual rule tagger, which is the core of the tagger.

The lexical tagger initially tags each word with its most likely tag, estimated by examining a large tagged corpus, without regard to context. For example, assuming that vbn is the most likely tag for the word "killed" and vbd for "shot", the lexical tagger might assign the following part-of-speech tags:[1]

(1) Chapman/np killed/vbn John/np Lennon/np
(2) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(3) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np

Since the lexical tagger does not use any contextual information, many words can be tagged incorrectly. For example, in (1), the word "killed" is erroneously tagged as a verb in past participle form, and in (2), "shot" is incorrectly tagged as a verb in past tense.

Given the initial tagging obtained by the lexical tagger, the contextual tagger applies a sequence of rules in order and attempts to remedy the errors made by the initial tagging. For example, the rules in Figure 1 might be found in a contextual tagger.

1. vbn vbd PREVTAG np
2. vbd vbn NEXTTAG by

Figure 1: Sample rules

The first rule says to change tag vbn to vbd if the previous tag is np. The second rule says to change vbd to tag vbn if the next tag is by. Once the first rule is applied, the tag for "killed" in (1) and (3) is changed from vbn to vbd, and the following tagged sentences are obtained:

(4) Chapman/np killed/vbd John/np Lennon/np
(5) John/np Lennon/np was/bedz shot/vbd by/by Chapman/np
(6) He/pps witnessed/vbd Lennon/np killed/vbd by/by Chapman/np

And once the second rule is applied, the tag for "shot" in (5) is changed from vbd to vbn, resulting in (8), and the tag for "killed" in (6) is changed back from vbd to vbn, resulting in (9):

(7) Chapman/np killed/vbd John/np Lennon/np
(8) John/np Lennon/np was/bedz shot/vbn by/by Chapman/np
(9) He/pps witnessed/vbd Lennon/np killed/vbn by/by Chapman/np

It is relevant to our following discussion to note that the application of the NEXTTAG rule must look ahead one token in the sentence before it can be applied, and that the application of two rules may perform a series of operations resulting in no net change. As we will see in the next section, these two aspects are the source of local non-determinism in Brill's tagger.

The sequence of contextual rules is automatically inferred from a training corpus. A list of tagging errors (with their counts) is compiled by comparing the output of the lexical tagger to the correct part-of-speech assignment. Then, for each error, it is determined which instantiation of a set of rule templates results in the greatest error reduction. Then the set of new errors caused by applying the rule is computed and the process is repeated until the error reduction drops below a given threshold.

After training on the Brown Corpus, using the set of contextual rule templates shown in Figure 2, 280 contextual rules are obtained. The resulting rule-based tagger performs as well as the state-of-the-art taggers based upon probabilistic models. It also overcomes the limitations common in rule-based approaches to language processing: it is robust, and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, as we will see in the next section, Brill's tagger is inherently slow.

[1] The notation for part-of-speech tags is adapted from the one used in the Brown Corpus (Francis and Kucera, 1982): pps stands for third singular nominative pronoun, vbd for verb in past tense, np for proper noun, vbn for verb in past participle form, by for the word "by", at for determiner, nn for singular noun and bedz for the word "was".

3 Complexity of Brill's Tagger

Once the lexical assignment is performed, in Brill's algorithm each contextual rule acquired during the training phase is applied to each sentence to be
tagged.

A B PREVTAG C          change A to B if previous tag is C
A B PREV1OR2OR3TAG C   change A to B if previous one or two or three tag is C
A B PREV1OR2TAG C      change A to B if previous one or two tag is C
A B NEXT1OR2TAG C      change A to B if next one or two tag is C
A B NEXTTAG C          change A to B if next tag is C
A B SURROUNDTAG C D    change A to B if surrounding tags are C and D
A B NEXTBIGRAM C D     change A to B if next bigram tag is C D
A B PREVBIGRAM C D     change A to B if previous bigram tag is C D

Figure 2: Contextual Rule Templates

For each individual rule, the algorithm scans the input from left to right while attempting to match the rule. This simple algorithm is computationally inefficient for two reasons.

The first reason for inefficiency is the fact that an individual rule is matched at each token of the input, regardless of the fact that some of the current tokens may have been previously examined when matching the same rule at a previous position. The algorithm treats each rule as a template of tags and slides it along the input, one word at a time. Consider, for example, the rule A B PREVBIGRAM C C that changes tag A to tag B if the previous two tags are C.

[Figure 3 shows the pattern C C A being slid along the input C D C C A in three steps, with the matching and mismatching positions marked at each step.]

Figure 3: Partial matches of A B PREVBIGRAM C C on the input C D C C A.

When applied to the input C D C C A, the pattern C C A is matched three times, as shown in Figure 3. At each step no record of previous partial matches or mismatches is remembered. In this example, C is compared with the second input token D during the first and second steps, and therefore, the
second step could have been skipped by remembering the comparisons from the first step. This method is similar to a naive pattern matching algorithm.

The second reason for inefficiency is the potential interaction between rules. For example, when the rules in Figure 1 are applied to sentence (3), the first rule results in a change (6) which is undone by the second rule as shown in (9). The algorithm may therefore perform unnecessary computation.

In summary, Brill's algorithm for implementing the contextual tagger may require RCn elementary steps to tag an input of n words with R contextual rules requiring at most C tokens of context.

4 Construction of the Finite-State Tagger

We show how the function represented by each contextual rule can be represented as a non-deterministic finite-state transducer, and how the sequential application of each contextual rule also corresponds to a non-deterministic finite-state transducer being the result of the composition of each individual transducer. We will then turn the non-deterministic transducer into a deterministic transducer. The resulting part-of-speech tagger operates in linear time, independent of the number of rules and the length of the context. The new tagger operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in the resulting deterministic finite-state machine.

Our work relies on two central notions: the notion of a finite-state transducer and the notion of a subsequential transducer. Informally speaking, a finite-state transducer is a finite-state automaton whose transitions are labeled by pairs of symbols. The first symbol is the input and the second is the output. Applying a finite-state transducer to an input consists of following a path according to the input symbols while storing the output symbols, the result being the sequence of output symbols stored. Section 8.1 formally defines the notion of transducer.

Finite-state transducers can be composed, intersected, merged with the union operation and sometimes determinized. Basically, one can manipulate finite-state transducers as easily as finite-state automata. However, whereas every finite-state automaton is equivalent to some deterministic finite-state automaton, there are finite-state transducers that are not equivalent to any deterministic finite-state transducer. Transductions that can be computed by
some deterministic finite-state transducer are called subsequential functions. We will see that the final step of the compilation of our tagger consists of transforming a finite-state transducer into an equivalent subsequential transducer.

We will use the following notation when pictorially describing a finite-state transducer: final states are depicted with two concentric circles; ε represents the empty string; on a transition from state i to state j, a/b indicates a transition on input symbol a and output symbol(s) b;[2] a question mark (?) on an arc transition (for example labeled ?/b) originating at state i stands for any input symbol that does not appear as an input symbol on any other outgoing arc from i. In this document, each depicted finite-state transducer will be assumed to have a single initial state, namely the leftmost state appearing in the figures (usually labeled 0).

We are now ready to construct the tagger. Given a set of rules, the tagger is constructed in four steps.

The first step consists of turning each contextual rule found in Brill's tagger into a finite-state transducer. Following the example discussed in Section 2, the functionality of the rule vbn vbd PREVTAG np is represented by the transducer shown in Figure 4 on the left.

[Figure 4 shows two transducers built from the arcs np/np, vbn/vbd and ?/?.]

Figure 4: Left: transducer T1 representing the contextual rule vbn vbd PREVTAG np. Right: local extension LocExt(T1) of T1.

[2] When multiple output symbols are emitted, a comma symbolizes the concatenation of the output symbols.
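For concreteness, the effect that the rule vbn vbd PREVTAG np has when applied at every position of a tag sequence (the behavior captured by the local extension of T1 on the right of Figure 4) can be sketched directly. The function name below is our own illustrative choice, not the paper's.

```python
# Sketch of the rule "vbn vbd PREVTAG np" applied over a whole tag sequence:
# change tag a to tag b wherever the previous tag is prev.
def apply_prevtag_rule(tags, a="vbn", b="vbd", prev="np"):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

# Tags of sentence (1): "Chapman killed John Lennon"
print(apply_prevtag_rule(["np", "vbn", "np", "np"]))
# → ['np', 'vbd', 'np', 'np'], the tagging shown in (4)
```

Note that this direct formulation rescans the context at every position, which is exactly the inefficiency that the transducer construction is designed to remove.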
Each of the contextual rules is defined locally; that is, the transformation it describes must be applied at each position of the input sequence. For instance, the rule A B PREV1OR2TAG C, that changes A into B if the previous tag or the one before is C, must be applied twice on C A A (resulting in the output C B B). As we have seen in the previous section, this method is not efficient.

The second step consists of turning the transducers produced by the preceding step into transducers that operate globally on the input in one pass. This transformation is performed for each transducer associated with each rule. Given a function f1 that transforms, say, a into b (i.e. f1(a) = b), we want to extend it to a function f2 such that f2(w) = w′ where w′ is the word built from the word w where each occurrence of a has been replaced by b. We say that f2 is the local extension[3] of f1 and we write f2 = LocExt(f1). Section 8.2 formally defines this notion and gives an algorithm for computing the local extension.

[3] This notion was introduced by Roche (1993).

Referring to the example of Section 2, the local extension of the transducer for the rule vbn vbd PREVTAG np is shown to the right of Figure 4. Similarly, the transducer for the contextual rule vbd vbn NEXTTAG by and its local extension are shown in Figure 5.

[Figure 5 shows two transducers built from the arcs vbd/vbn, by/by, vbd/vbd and ?/?.]

Figure 5: Left: transducer T2 representing vbd vbn NEXTTAG by. Right: local extension LocExt(T2) of T2.

The transducers obtained in the previous step still need to be applied one after the other. The third step combines all transducers into one single
transducer. This corresponds to the formal operation of composition defined on transducers. The formalization of this notion and an algorithm for computing the composed transducer are well known and were originally described by Elgot and Mezei (1965).

Returning to our running example of Section 2, the transducer obtained by composing the local extension of T2 (right in Figure 5) with the local extension of T1 (right in Figure 4) is shown in Figure 6.

[Figure 6 shows the composed transducer, built from the arcs np/np, vbn/vbn, vbn/vbd, vbd/vbn, vbd/vbd, by/by and ?/? over states 0-4.]

Figure 6: Composition T3 = LocExt(T1) ∘ LocExt(T2)

The fourth and final step consists of transforming the finite-state transducer obtained in the previous step into an equivalent subsequential (deterministic) transducer. The transducer obtained in the previous step may contain some non-determinism. The fourth step tries to turn it into a deterministic machine. This determinization is not always possible for any given finite-state transducer. For example, the transducer shown in Figure 7 is not equivalent to any subsequential transducer. Intuitively speaking, such a transducer has to look ahead an unbounded distance in order to correctly generate the output. This intuition will be formalized in Section 9.2.

However, as proven in Section 10, the rules inferred in Brill's tagger can always be turned into a deterministic machine. Section 9.1 describes an algorithm for determinizing finite-state transducers. This algorithm will not terminate when applied to transducers representing non-subsequential functions.
[Figure 7 shows a transducer over the arcs a:a, a:b, b:b and c:c among states 0-3.]

Figure 7: Example of a transducer not equivalent to any subsequential transducer.

In our running example, the transducer in Figure 6 has some non-deterministic paths. For example, from state 0 on input symbol vbd, two emissions are possible: vbn (from 0 to 2) and vbd (from 0 to 3). This non-determinism is due to the rule vbd vbn NEXTTAG by, since this rule has to read the second symbol before it can know which symbol must be emitted. The deterministic version of the transducer T3 is shown in Figure 8. Whenever non-determinism arises in T3, the deterministic machine emits the empty symbol ε and postpones the emission of the output symbol. For example, from the start state 0, the empty string is emitted on input vbd, while the current state is set to 2. If the following word is by, the two-token string vbn by is emitted (from 2 to 0), otherwise vbd is emitted (depending on the input, from 2 to 2 or from 2 to 0).

Using an appropriate implementation for finite-state transducers (see Section 11), the resulting part-of-speech tagger operates in linear time, independent of the number of rules and the length of the context. The new tagger therefore operates in optimal time.

We have shown how the contextual rules can be implemented very efficiently. We now turn our attention to lexical assignment, the step that precedes the application of the contextual transducer. This step can also be made very efficient.
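The delayed-emission behavior behind Figure 8 can be sketched as a small deterministic machine. The state names and transition table below are a simplification we introduce for illustration, not the paper's exact machine: state 0 is neutral, state 1 records that the previous tag was np, and state 2 records that a vbd tag is pending and must not be emitted until the next tag is known.

```python
# Sketch of delayed emission: on vbd (or on vbn right after np, which rule 1
# turns into vbd) nothing is emitted, because rule 2 would turn that vbd back
# into vbn if the next tag is by.
DELTA = {
    (0, "vbd"): ("", 2),         # postpone the emission
    (1, "vbd"): ("", 2),
    (1, "vbn"): ("", 2),         # rule 1 fires, but rule 2 might undo it
    (2, "vbd"): ("vbd", 2),      # pending tag resolves as vbd; new one pends
    (2, "by"):  ("vbn by", 0),   # rule 2 fires on the pending tag
}

def step(state, tok):
    if (state, tok) in DELTA:
        return DELTA[(state, tok)]
    pend = "vbd " if state == 2 else ""       # flush a pending tag unchanged
    return pend + tok, (1 if tok == "np" else 0)

def tag(tags):
    state, out = 0, []
    for tok in tags:
        emit, state = step(state, tok)
        out.extend(emit.split())
    if state == 2:
        out.append("vbd")                     # final emission at sentence end
    return out

# Tags of sentence (3): "He witnessed Lennon killed by Chapman"
print(tag(["pps", "vbd", "np", "vbn", "by", "np"]))
# → ['pps', 'vbd', 'np', 'vbn', 'by', 'np'], the tagging shown in (9)
```

Each input tag is consumed exactly once, so the whole sentence is tagged in a single left-to-right pass, which is the linear-time property claimed above.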
[Figure 8 shows the subsequential transducer, with delayed emissions such as vbd/ε and vbn/ε and arcs such as np/vbd,np, ?/vbd,? and by/vbn,by.]

Figure 8: Subsequential form for T3

5 Lexical Tagger

The first step of the tagging process consists of looking up each word in a dictionary. Since the dictionary is the largest part of the tagger in terms of space, a compact representation is crucial. Moreover, the lookup process has to be very fast too, otherwise the improvement in speed of the contextual manipulations would be of little practical interest. To achieve high speed for this procedure, the dictionary is represented by a deterministic finite-state automaton with both low access time and small storage space.

Suppose one wants to encode the sample dictionary of Figure 9. The algorithm, as described in (Revuz, 1991), consists of first building a tree whose branches are labeled by letters and whose leaves are labeled by a list of tags (such as nn vb), and then minimizing it into a directed acyclic graph (DAG). The result of applying this procedure to the sample dictionary of Figure 9 is the DAG of Figure 10. When a dictionary is represented as a DAG, looking up a word in it consists simply of following one path in the DAG. The complexity of the lookup procedure depends only on the length of the word; in particular, it is independent of the size of the
dictionary.

ads     nns
bag     nn vb
bagged  vbn vbd
bayed   vbn vbd
bids    nns

Figure 9: Sample Dictionary

[Figure 10 shows the DAG, in which the words of Figure 9 share letter arcs and lead to leaves labeled (nns), (nn,vb) and (vbd,vbn).]

Figure 10: DAG representation of the dictionary found in Figure 9.

The lexicon used in our system encodes 54,000 words. The corresponding DAG takes 360 Kbytes of space and it provides an access time of 12,000 words per second.[4]

[4] The size of the dictionary in ASCII form is 742KB.

6 Tagging unknown words

The rule-based system described by Brill (1992) contains a module that operates after all the known words, that is, words listed in the dictionary, have been tagged with their most frequent tag, and before the set of contextual rules are applied. This module guesses a tag for a word according to its suffix (e.g. a word with an "ing" suffix is likely to be a verb), its prefix (e.g. a word starting with an uppercase character is likely to be a proper noun) and other relevant properties. This module basically follows the same techniques as the ones used to implement the lexicon. Due to the similarity of the methods used, we do not provide further details about this module.
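The tree-building half of the construction above can be sketched as follows; minimizing the tree into a DAG (the Revuz 1991 step) is omitted here, but a plain trie already gives the key property of Section 5: lookup cost proportional to the length of the word, independent of dictionary size.

```python
# Build a letter trie for the sample dictionary of Figure 9; leaves carry the
# list of tags.  (DAG minimization, which shares common suffixes, is omitted.)
LEXICON = {
    "ads": ["nns"],
    "bag": ["nn", "vb"],
    "bagged": ["vbn", "vbd"],
    "bayed": ["vbn", "vbd"],
    "bids": ["nns"],
}

def build_trie(lexicon):
    root = {}
    for word, tags in lexicon.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$tags"] = tags          # leaf label, as in Figure 9
    return root

def lookup(trie, word):
    node = trie
    for ch in word:                   # follow one path, one arc per letter
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$tags")

trie = build_trie(LEXICON)
print(lookup(trie, "bagged"))   # ['vbn', 'vbd']
print(lookup(trie, "bad"))      # None
```

The `$tags` key is just a sentinel of this sketch; a real implementation would store the automaton in a compact array form to reach the space figures quoted above.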
7 Empirical Evaluation

The tagger we constructed has an accuracy identical[5] to that of Brill's tagger or of statistical-based methods; however, it runs at a much higher speed. The tagger runs nearly ten times faster than the fastest of the other systems. Moreover, the finite-state tagger inherits from the rule-based system its compactness compared to a stochastic tagger. In fact, whereas stochastic taggers have to store word-tag, bigram and trigram probabilities, the rule-based tagger, and therefore the finite-state one, only have to encode a small number of rules (between 200 and 300).

We empirically compared our tagger with Eric Brill's implementation of his tagger, and with our implementation of a trigram tagger adapted from the work of Church (1988) that we previously implemented for another purpose. We ran the three programs on large files and piped their output into a file. In the times reported, we included the time spent reading the input and writing the output. Figure 11 summarizes the results. All taggers were trained on a portion of the Brown corpus. The experiments were run on an HP720 with 32 Mbytes of memory. In order to conduct a fair comparison, the dictionary lookup part of the stochastic tagger has also been implemented using the techniques described in Section 5.

All three taggers have approximately the same precision (95% of the tags are correct).[6] By design, the finite-state tagger produces the same output as the rule-based tagger. The rule-based tagger, and the finite-state tagger, do not always produce the exact same tagging as the stochastic tagger (they don't make the same errors); however, no significant difference in performance between the systems was detected.[7] Independently, Cutting et al. (1992) quote a performance of 800 words/second for their part-of-speech tagger based on hidden Markov models.

The space required by the finite-state tagger (815KB) is decomposed as follows: 363KB for the dictionary, 440KB for the subsequential transducer and 12KB for the module for unknown words.

[5] Our current implementation is functionally equivalent to the tagger as described by Brill (1992). However, the tagger could be extended to include recent improvements described in more recent papers (Brill, 1994).
[6] For evaluation purposes, we randomly selected 90% of the Brown corpus for training purposes and 10% for testing. We used the Brown corpus set of part-of-speech tags.
[7] An extended discussion of the precision of the rule-based tagger can be found in (Brill, 1992).
         Stochastic Tagger   Rule-Based Tagger   Finite-State Tagger
Speed    1,200 w/s           500 w/s             10,800 w/s
Space    2,158KB             379KB               815KB

Figure 11: Overall performance comparison.

The speed of our system is decomposed in Figure 12.[8]

                      dictionary lookup   unknown words   contextual
Speed                 12,800 w/s          16,600 w/s      125,100 w/s
Percent of the time   85%                 6.5%            8.5%

Figure 12: Speed of the different parts of the program

Our system reaches a performance level in speed for which other very low level factors (such as storage access) may dominate the computation. At such speeds, the time spent reading the input file, breaking the file into sentences and sentences into words, and writing the result into a file is no longer negligible.

[8] In Figure 12, the dictionary lookup includes reading the file, splitting it into sentences, looking up each word in the dictionary and writing the final result to a file. The dictionary lookup and the tagging of unknown words take roughly the same amount of time, but since the second procedure only applies to unknown words (around 10% in our experiments), the percentage of time it takes is much smaller.

8 Finite-State Transducers

The methods used in the construction of the finite-state tagger described in the previous sections were described informally. In the following section, the notions of finite-state transducer and of local extension are defined. We also provide an algorithm for computing the local extension of a finite-state transducer. Issues related to the determinization of finite-state transducers are discussed in the section following this one.
8.1 Definition of Finite-State Transducers

A finite-state transducer T is a 5-tuple (Σ, Q, i, F, E) where: Σ is a finite alphabet; Q is the set of states or vertices; i ∈ Q is the initial state; F ⊆ Q is the set of final states; E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the set of edges or transitions. For instance, Figure 13 is the graphical representation of the transducer

T₁ = ({a, b, c, d, e}, {0, 1, 2, 3}, 0, {3}, {(0, a, b, 1), (0, a, c, 2), (1, d, d, 3), (2, e, e, 3)}).

A finite-state transducer T also defines a function on words in the following way: the extended set of edges Ê, the transitive closure of E, is defined by the following recursive relation:

• if e ∈ E then e ∈ Ê;
• if (q, a, b, q′), (q′, a′, b′, q″) ∈ Ê then (q, aa′, bb′, q″) ∈ Ê.

Then the function f from Σ* to Σ* defined by f(w) = w′ iff ∃q ∈ F such that (i, w, w′, q) ∈ Ê is the function defined by T. One says that T represents f and writes f = |T|. The functions on words that are represented by finite-state transducers are called rational functions. If, for some input w, more than one output is allowed (e.g. f(w) = {w₁, w₂, …}), then f is called a rational transduction. In the example of Figure 13, T₁ is defined by |T₁|(ad) = bd and |T₁|(ae) = ce.

[Figure 13: T₁: Example of a Finite-State Transducer. Transitions: a/b from 0 to 1, a/c from 0 to 2, d/d from 1 to 3, e/e from 2 to 3; state 3 is final.]

Given a finite-state transducer T = (Σ, Q, i, F, E), the following additional notions are useful: its state transition function d, mapping Q × (Σ ∪ {ε}) into 2^Q, defined by d(q, a) = {q′ ∈ Q | ∃w ∈ Σ* and (q, a, w, q′) ∈ E}; and its emission function δ, mapping Q × (Σ ∪ {ε}) × Q into 2^(Σ*), defined by δ(q, a, q′) = {w ∈ Σ* | (q, a, w, q′) ∈ E}.
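These definitions can be made concrete with a short sketch (ours, for illustration only; not the implementation described in this report). It encodes T₁ as its edge set and applies it nondeterministically, collecting the outputs of all paths that consume the input and end in a final state; ε-input transitions are omitted since T₁ has none.

```python
# T1 from the definition above: edges are (source, input, output, target).
T1_EDGES = {(0, 'a', 'b', 1), (0, 'a', 'c', 2),
            (1, 'd', 'd', 3), (2, 'e', 'e', 3)}

def apply_fst(edges, init, finals, word):
    """Return |T|(word): outputs of all paths consuming `word`
    that end in a final state (no ε-input transitions assumed)."""
    configs = {(init, '')}            # (current state, output so far)
    for letter in word:
        configs = {(q2, out + w)
                   for (q, out) in configs
                   for (p, a, w, q2) in edges
                   if p == q and a == letter}
    return {out for (q, out) in configs if q in finals}

print(apply_fst(T1_EDGES, 0, {3}, 'ad'))   # {'bd'}
print(apply_fst(T1_EDGES, 0, {3}, 'ae'))   # {'ce'}
```

Note that a single input letter can lead to several configurations at once (here, reading a leads to both states 1 and 2), which is precisely the nondeterminism the next sections remove.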
A finite-state transducer can be seen as a finite-state automaton, each of whose labels is a pair. In this respect, T₁ would be deterministic; however, since transducers are generally used to compute a function, a more relevant definition of determinism consists of saying that both the transition function d and the emission function δ lead to sets containing at most one element, that is, |d(q, a)| ≤ 1 and |δ(q, a, q′)| ≤ 1. With this notion, if a finite-state transducer is deterministic, one can apply the function to a given word by deterministically following a single path in the transducer. Deterministic transducers are called subsequential transducers (Schützenberger, 1977).⁹ Given a deterministic transducer, we can define the partial functions q ⊗ a = q′ iff d(q, a) = {q′}, and q ∗ a = w iff ∃q′ ∈ Q such that q ⊗ a = q′ and δ(q, a, q′) = {w}. This leads to the definition of subsequential transducers: a subsequential transducer T is a 7-tuple (Σ, Q, i, F, ⊗, ∗, ρ) where: Σ, Q, i, F are defined as above; ⊗ is the deterministic state transition function mapping Q × Σ into Q, one writes q ⊗ a = q′; ∗ is the deterministic emission function mapping Q × Σ into Σ*, one writes q ∗ a = w; and the final emission function ρ maps F into Σ*, one writes ρ(q) = w.

For instance, T₁ is not deterministic because d(0, a) = {1, 2}, but it is equivalent to T₂, represented in Figure 14, in the sense that they represent the same function, i.e. |T₁| = |T₂|. T₂ is defined by T₂ = ({a, b, c, d, e}, {0, 1, 2}, 0, {2}, ⊗, ∗, ρ) where 0 ⊗ a = 1, 0 ∗ a = ε, 1 ⊗ d = 2, 1 ∗ d = bd, 1 ⊗ e = 2, 1 ∗ e = ce, and where ρ(2) = ε.
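By contrast, applying a subsequential transducer is a single deterministic pass. The following sketch (again ours, with T₂'s ⊗, ∗ and ρ hard-coded as dictionaries) follows the unique path and concatenates the emissions:

```python
# Subsequential transducer T2: deterministic transition (⊗), emission (∗),
# and final emission (ρ), as defined above.
TRANS = {(0, 'a'): 1, (1, 'd'): 2, (1, 'e'): 2}          # q ⊗ a
EMIT  = {(0, 'a'): '', (1, 'd'): 'bd', (1, 'e'): 'ce'}   # q ∗ a
RHO   = {2: ''}                                          # ρ(q), q final

def apply_subsequential(word):
    """Follow the unique path for `word`, concatenating emissions."""
    state, out = 0, ''
    for letter in word:
        out += EMIT[(state, letter)]
        state = TRANS[(state, letter)]
    return out + RHO[state]        # defined only if the path ends final

print(apply_subsequential('ad'))   # bd
print(apply_subsequential('ae'))   # ce
```

The output on ad and ae agrees with |T₁|, illustrating |T₁| = |T₂|; the emission of b or c is simply delayed until the second letter disambiguates the path.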

[Figure 14: Subsequential Transducer T₂. Transitions: a/ε from 0 to 1, d/bd from 1 to 2, e/ce from 1 to 2; state 2 is final.]

8.2 Local Extension

In this section, we will see how a function which needs to be applied at all input positions can be transformed into a global function that needs to be applied once on the input. For instance, consider the transducer T₃ of Figure 15.

⁹ A sequential transducer is a deterministic transducer for which all states are final. Sequential transducers are also called generalized sequential machines (Eilenberg, 1974).
T₃ represents the function f₃ = |T₃| such that f₃(ab) = bc and f₃(bca) = dca. We want to build the function that, given a word w, each time w contains ab (i.e. ab is a factor of the word) (resp. bca), transforms this factor into its image bc (resp. dca). Suppose for instance that the input word is w₁ = aabcab, as shown in Figure 16, and that the factors that are in dom(f₃)¹⁰ can be found according to two different factorizations: w₁ = a · w₂ · c · w₂,¹¹ where w₂ = ab, and w₁ = aa · w₃ · b, where w₃ = bca. The local extension of f₃ will be the function that takes each possible factorization, transforms each factor according to f₃, i.e. f₃(w₂) = bc and f₃(w₃) = dca, and leaves the other parts unchanged; here this leads to two outputs: abccbc according to the first factorization, and aadcab according to the second factorization.

[Figure 15: T₃: a finite-state transducer to be extended. One path transduces ab into bc (transitions a/b, b/c); the other transduces bca into dca (transitions b/d, c/c, a/a).]

[Figure 16: Top: the input a a b c a b. Middle: the first factorization a (a b) c (a b), giving a bc c bc. Bottom: the second factorization a a (b c a) b, giving a a dca b.]

The notion of local extension is formalized through the following definition.

¹⁰ dom(f) denotes the domain of f, that is, the set of words that have at least one output through f.
¹¹ If w₁, w₂ ∈ Σ*, w₁ · w₂ denotes the concatenation of w₁ and w₂. It can also be written w₁w₂.
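The two outputs above can be reproduced with a small illustrative sketch (ours, not the construction of this section, which builds a transducer): it enumerates the factorizations of the input whose unchanged stretches contain no factor of dom(f), rewriting each chosen factor. Representing f as a dict from factors to images is our assumption.

```python
def local_extension_outputs(f, w):
    """All outputs of LocExt(f) on w; f maps each factor of dom(f)
    to its (single) image, e.g. f3 below."""
    results = set()

    def rec(i, out, run):
        # `run` is the stretch copied unchanged since the last factor;
        # by the definition of local extension it must contain no word of dom(f).
        if any(y in run for y in f):
            return
        if i == len(w):
            results.add(out)
            return
        rec(i + 1, out + w[i], run + w[i])       # copy one letter unchanged
        for y, img in f.items():                 # or rewrite a factor here
            if w.startswith(y, i):
                rec(i + len(y), out + img, '')
    rec(0, '', '')
    return results

f3 = {'ab': 'bc', 'bca': 'dca'}
print(sorted(local_extension_outputs(f3, 'aabcab')))   # ['aadcab', 'abccbc']
```

The pruning on `run` is what forces every occurrence of a domain factor in an unchanged stretch to be rejected, so only the two legal factorizations survive.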
Definition 1 If f is a rational transduction from Σ* to Σ*, the local extension F = LocExt(f) is the rational transduction from Σ* to Σ* defined in the following way: if u = a₁ · b₁ · a₂ · b₂ ⋯ aₙ · bₙ · aₙ₊₁ ∈ Σ*, then v = a₁ · b′₁ · a₂ · b′₂ ⋯ aₙ · b′ₙ · aₙ₊₁ ∈ F(u) if aᵢ ∈ Σ* − (Σ* · dom(f) · Σ*), bᵢ ∈ dom(f) and b′ᵢ ∈ f(bᵢ).¹²

Intuitively, if F = LocExt(f) and w ∈ Σ*, each factor of w in dom(f) is transformed into its image by f and the remaining part of w is left unchanged. If f is represented by a finite-state transducer T and LocExt(f) is represented by a finite-state transducer T′, one writes T′ = LocExt(T). It can also be seen that if Id_T is the identity function on Σ* − (Σ* · dom(T) · Σ*), then LocExt(T) = Id_T · (T · Id_T)*.¹³

¹² The dot '·' stands for the concatenation operation on strings.
¹³ In this last formula, the concatenation '·' stands for the concatenation of the graphs of the functions, that is, for the concatenation of the transducers viewed as automata whose labels are of the form a/b.

Figure 20 gives an algorithm that computes the local extension directly. The idea is that an input word is processed non-deterministically from left to right. Suppose for instance that we have the initial transducer T₄ of Figure 17 and that we want to build its local extension T₅ of Figure 18. When the input is read, if a current input letter cannot be transformed at the first state of T₄ (the letter c for instance), it is left unchanged: this is expressed by the looping transition on the initial state of T₅ labeled ?/?.¹⁴ On the other hand, if the input symbol, say a, can be processed at the initial state of T₄, one doesn't know yet whether a will be the beginning of a word that can be transformed (e.g. ab) or whether it will be followed by a sequence which makes it impossible to apply the transformation (e.g. ac). Hence one has to entertain two possibilities, namely (1) we are processing the input according to T₄ and the transition should be a/b, or (2) we are within the identity and the transition should be a/a. This leads to two kinds of states: the transduction states (marked transduction in the algorithm) and the identity states (marked identity in the algorithm). It can be seen in Figure 18 that this leads to a transducer that has a copy of the initial transducer and an additional part that processes the identity while making sure it could not have been transformed.

¹⁴ As explained before, a transition labeled by the symbol ? stands for all the transitions labeled with a letter that doesn't appear on any outgoing arc from this state. A transition labeled ?/? stands for all the diagonal pairs (a, a) s.t. a is not an input symbol on any outgoing arc from this state.
In other words, the algorithm consists of building a copy of the original transducer and, at the same time, the identity function that operates on Σ* − (Σ* · dom(T) · Σ*).

Let us now see how the algorithm of Figure 20 applies step by step to the transducer T₄ of Figure 17, producing the transducer T₅ of Figure 18. In Figure 20, C[0] = ({i}, identity) of line 1 states that the first state of the transducer to be built is of type identity and refers to the initial state i = 0 of T₄. q′ represents the current state and n the current number of states. In the loop do{…}while(q′ < n), one builds the transitions of each state one after the other: if a transition points to a state not already built, a new state is added, thus incrementing n. The program stops when all states have been inspected and when no additional state is created. The number of iterations is bounded by 2^(2‖T‖), where ‖T‖ = |Q| is the number of states of the original transducer.¹⁵

¹⁵ In fact, Q′ ⊆ 2^(Q × {transduction, identity}). Thus, |Q′| ≤ 2^(2|Q|).

Line 3 says that the current state within the loop will be q′ and that this state refers to the set of states S and is marked by the type type. In our example, at the first occurrence of this line, S is instantiated to {0} and type = identity. Line 5 adds the current identity state to the set of final states, together with a transition to the initial state for all letters that do not appear on any outgoing arc from this state. Lines 6 to 11 build the transitions from and to the identity states, keeping track of where this leads in the original transducer. For instance, a is a label that verifies the conditions of line 6. Thus a transition a/a is to be added toward the identity state 2, which refers to 1 (because of the transition a/b of T₄) and to i = 0 (because it is possible to start the transduction T₄ from any place of the identity). Line 7 checks whether this state already exists and adds it if necessary. e = n++ means that the arrival state for this transition, i.e. d(q′, w), will be the last added state and that the number of states being built has to be incremented. Line 11 actually builds the transition between 0 and e = 2, labeled a/a. Lines 12 through 17 describe the fact that it is possible to start a transduction from any identity state. Here one transition is added to one new state, i.e. a/b to 3. The next state to be considered is 2, and it is built like state 0 except that the symbol b should block the current output. In fact, state 2 refers to state 1 of T₄, meaning that we have already read a with a as output; thus if one reads b, this means that ab occurs at the current point, and since ab should be transformed into bc, the current identity transformation (that is, a → a) should be blocked: this is expressed
by the transition b/b that leads to state 1 (this state is a "trash" state; that is, it has no outgoing transition and it is not final). The following state is 3, which is marked as being of type transduction, which means that lines 19 through 27 should be applied. This consists simply of copying the transitions of the original transducer. If the original state was final, as for 4 = ({2}, transduction), an ε/ε transition back to the initial state is added (to get the behavior of T⁺). The transducer T₆ = LocExt(T₃) of Figure 19 gives a more complete (and slightly more complex) example of applying this algorithm.

[Figure 17: Sample Transducer T₄.]

[Figure 18: Local Extension T₅ of T₄: T₅ = LocExt(T₄). It contains a copy of T₄ (the transduction states), identity states such as ({0}, identity) and ({0,1}, identity), a trash state ({}, transduction), looping ?/? transitions, and a blocking b/b transition into the trash state.]
[Figure 19: Local Extension T₆ of T₃: T₆ = LocExt(T₃).]
LocalExtension(T′ = (Σ, Q′, i′, F′, E′), T = (Σ, Q, i, F, E))
 1  C[0] = ({i}, identity); q′ = 0; i′ = 0; F′ = ∅; E′ = ∅; Q′ = ∅;
    C[1] = (∅, transduction); n = 2;
 2  do {
 3    (S, type) = C[q′]; Q′ = Q′ ∪ {q′};
 4    if (type == identity)
 5      F′ = F′ ∪ {q′}; E′ = E′ ∪ {(q′, ?, ?, i′)};
 6      foreach w ∈ Σ ∪ {ε} s.t. ∃x ∈ S, d(x, w) ≠ ∅ and ∀y ∈ S, d(y, w) ∩ F = ∅
 7        if (∃r ∈ [0, n − 1] such that C[r] == ({i} ∪ ∪_{x∈S} d(x, w), identity))
 8          e = r;
 9        else
10          C[e = n++] = ({i} ∪ ∪_{x∈S} d(x, w), identity);
11        E′ = E′ ∪ {(q′, w, w, e)};
12      foreach (i, w, w′, x) ∈ E
13        if (∃r ∈ [0, n − 1] such that C[r] == ({x}, transduction))
14          e = r;
15        else
16          C[e = n++] = ({x}, transduction);
17        E′ = E′ ∪ {(q′, w, w′, e)};
18      foreach w ∈ Σ ∪ {ε} s.t. ∃x ∈ S, d(x, w) ∩ F ≠ ∅: E′ = E′ ∪ {(q′, w, w, 1)};
19    else if (type == transduction)
20      if ∃x₁ ∈ Q s.t. S == {x₁}
21        if (x₁ ∈ F) then E′ = E′ ∪ {(q′, ε, ε, 0)};
22        foreach (x₁, w, w′, y) ∈ E
23          if (∃r ∈ [0, n − 1] such that C[r] == ({y}, transduction))
24            e = r;
25          else
26            C[e = n++] = ({y}, transduction);
27          E′ = E′ ∪ {(q′, w, w′, e)};
28    q′++;
29  } while (q′ < n);

Figure 20: Local Extension Algorithm.
9 Determinization

The basic idea behind the determinization algorithm comes from Mehryar Mohri.¹⁶ In this section, after giving a formalization of the algorithm, we introduce a proof of its soundness and completeness together with a worst-case complexity analysis.

¹⁶ Mohri (1994b) also gives a formalization of the algorithm.

9.1 Determinization Algorithm

In the following, for w₁, w₂ ∈ Σ*, w₁ ∧ w₂ denotes the longest common prefix of w₁ and w₂. The finite-state transducers we use in our system have the property that they can be made deterministic; that is, there exists a subsequential transducer that represents the same function.¹⁷ If T = (Σ, Q, i, F, E) is such a finite-state transducer, the subsequential transducer T′ = (Σ, Q′, i′, F′, ⊗, ∗, ρ) defined as follows will later be proved equivalent to T:

¹⁷ As opposed to automata, a large class of finite-state transducers don't have any deterministic representation; they can't be determinized.

• Q′ ⊆ 2^(Q × Σ*). In fact, the determinization of the transducer is related to the determinization of FSAs in the sense that it also involves a power-set construction. The difference is that one has to keep track of the set of states of the original transducer one might be in and also of the words whose emission has been postponed. For instance, a state {(q₁, w₁), (q₂, w₂)} means that this state corresponds to a path that leads to q₁ and q₂ in the original transducer and that the emission of w₁ (resp. w₂) was delayed for q₁ (resp. q₂).

• i′ = {(i, ε)}. There is no postponed emission at the initial state.

• The emission function is defined by

  S ∗ a = ∧_{(q,u)∈S} ∧_{q′∈d(q,a)} u · δ(q, a, q′)

  This means that, for a given symbol, the set of possible emissions is obtained by concatenating the postponed emissions with the emissions at the current state.
Since one wants the transition to be deterministic, the actual emission is the longest common prefix of this set.

• The state transition function is defined by

  S ⊗ a = ∪_{(q,u)∈S} ∪_{q′∈d(q,a)} {(q′, (S ∗ a)⁻¹ · u · δ(q, a, q′))}

  Given u, v ∈ Σ*, u · v denotes the concatenation of u and v, and u⁻¹ · v = w if w is such that u · w = v; u⁻¹ · v = ∅ if no such w exists.

• F′ = {S ∈ Q′ | ∃(q, u) ∈ S and q ∈ F}.

• If S ∈ F′, ρ(S) = u s.t. ∃q ∈ F s.t. (q, u) ∈ S. We will see in the proof of correctness that ρ is properly defined.

The determinization algorithm of Figure 22 computes the above subsequential transducer. Let us now apply it to the finite-state transducer T₁ of Figure 13 and show how it builds the subsequential transducer T₅ of Figure 21. Line 1 of the algorithm builds the first state and instantiates it with the pair {(0, ε)}. q′ and n respectively denote the current state and the number of states built so far. At line 5, one takes all the possible input symbols w; here only a is possible. w′ of line 6 is the output: w′ = ∧_{q∈{1,2}} ε · δ(0, a, q), thus w′ = δ(0, a, 1) ∧ δ(0, a, 2) = b ∧ c = ε. Line 8 is then computed as follows: S′ = ∪_{q∈{0}} ∪_{q′∈{1,2}} {(q′, ε⁻¹ · δ(0, a, q′))}, thus S′ = {(1, δ(0, a, 1))} ∪ {(2, δ(0, a, 2))} = {(1, b), (2, c)}. Since no r verifies the condition of line 9, a new state e is created, to which the transition labeled a/w′ = a/ε points, and n is incremented. On line 15, the program goes on to the construction of the transitions of state 1. On line 5, d and e are then the two possible input symbols. The first symbol, d, at line 6, is such that w′ = ∧_{q′∈d(1,d)={2}} b · δ(1, d, q′) = bd. Henceforth, the computation of line 8 leads to S′ = ∪_{q∈{1}} ∪_{q′∈{2}} {(q′, (bd)⁻¹ · b · δ(1, d, q′))} = {(2, ε)}. State 2, labeled {(2, ε)}, is thus added, and a transition labeled d/bd pointing to state 2 is also added. The transition for the input symbol e is computed the same way.
[Figure 21: Subsequential transducer T₅ such that |T₅| = |T₁|. States: 0 = {(0, ε)}, 1 = {(1, b), (2, c)}, 2 = {(2, ε)}; transitions: a/ε from 0 to 1, d/bd from 1 to 2, e/ce from 1 to 2.]

DeterminizeTransducer(T′ = (Σ, Q′, i′, F′, ⊗, ∗, ρ), T = (Σ, Q, i, F, E))
 1  i′ = 0; q′ = 0; n = 1; C[0] = {(0, ε)}; F′ = ∅; Q′ = ∅;
 2  do {
 3    S = C[q′]; Q′ = Q′ ∪ {q′};
 4    if ∃(q, u) ∈ S s.t. q ∈ F then F′ = F′ ∪ {q′} and ρ(q′) = u;
 5    foreach w such that ∃(q, u) ∈ S and d(q, w) ≠ ∅ {
 6      w′ = ∧_{(q,u)∈S} ∧_{q″∈d(q,w)} u · δ(q, w, q″);
 7      q′ ∗ w = w′;
 8      S′ = ∪_{(q,u)∈S} ∪_{q″∈d(q,w)} {(q″, w′⁻¹ · u · δ(q, w, q″))};
 9      if ∃r ∈ [0, n − 1] such that C[r] == S′
10        e = r;
11      else
12        C[e = n++] = S′;
13      q′ ⊗ w = e;
14    }
15    q′++;
16  } while (q′ < n);

Figure 22: Determinization Algorithm.
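The construction of Figure 22 can be sketched in a few lines of Python (an illustrative rendering of ours, not the authors' code): states are sets of (state, delayed output) pairs, the emission at each step is the longest common prefix of all candidate emissions, and the remainders are carried along as delayed outputs.

```python
from os.path import commonprefix   # longest common prefix of strings

def determinize(sigma, edges, init, finals):
    """Power-set construction with delayed outputs (cf. Figure 22).
    edges: set of (q, a, w, q2) with no ε inputs. Returns the maps
    delta (S ⊗ a), star (S ∗ a), rho (final emissions) and the start state."""
    def d(q, a):
        return {q2 for (p, b, w, q2) in edges if p == q and b == a}
    def emissions(q, a, q2):
        return [w for (p, b, w, r) in edges if p == q and b == a and r == q2]

    start = frozenset({(init, '')})
    delta, star, rho = {}, {}, {}
    stack, seen = [start], {start}
    while stack:
        S = stack.pop()
        for (q, u) in S:
            if q in finals:
                rho[S] = u                      # unique by Lemma 3 below
        for a in sigma:
            outs = [u + w for (q, u) in S for q2 in d(q, a)
                    for w in emissions(q, a, q2)]
            if not outs:
                continue
            w_out = commonprefix(outs)          # S ∗ a
            S2 = frozenset((q2, (u + w)[len(w_out):])   # delayed remainder
                           for (q, u) in S for q2 in d(q, a)
                           for w in emissions(q, a, q2))
            delta[(S, a)], star[(S, a)] = S2, w_out
            if S2 not in seen:
                seen.add(S2)
                stack.append(S2)
    return delta, star, rho, start

def run(delta, star, rho, start, word):
    S, out = start, ''
    for a in word:
        out += star[(S, a)]
        S = delta[(S, a)]
    return out + rho[S]                         # defined iff S is final

T1 = {(0, 'a', 'b', 1), (0, 'a', 'c', 2), (1, 'd', 'd', 3), (2, 'e', 'e', 3)}
d_, s_, r_, i_ = determinize('ade', T1, 0, {3})
print(run(d_, s_, r_, i_, 'ad'), run(d_, s_, r_, i_, 'ae'))   # bd ce
```

On T₁ this produces exactly the three states of Figure 21. Like the algorithm in the text, this sketch loops forever on a transducer whose function is not subsequential.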
The subsequential transducer generated by this algorithm could in turn be minimized by an algorithm described in (Mohri, 1994a). However, in the case of the part-of-speech tagger, the transducer is nearly minimal.

9.2 Proof of Correctness

Although it is decidable whether a function is subsequential or not (Choffrut, 1977), the determinization algorithm described in the previous section does not terminate when run on a non-subsequential function. Two issues are addressed in this section. First, the proof of soundness: the fact that if the algorithm terminates, then the output transducer is deterministic and represents the same function. Second, the proof of completeness: the algorithm terminates in the case of subsequential functions. Soundness and completeness are a consequence of the main proposition, which states that if a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer representing the same function.

In order to simplify the proofs, we will only consider transducers that do not have ε-input transitions, that is, E ⊆ Q × Σ × Σ* × Q, and also, without loss of generality, transducers that are reduced and that are deterministic in the sense of finite-state automata.¹⁸

In order to prove this proposition, we need to establish some preliminary notations and lemmas. First we extend the definitions of the transition function d, the emission function δ, the deterministic transition function ⊗ and the deterministic emission function ∗ to words in the classical way. We then have the following properties:

d(q, ab) = ∪_{q′∈d(q,a)} d(q′, b)
δ(q₁, ab, q₂) = ∪_{q′∈d(q₁,a) s.t. q₂∈d(q′,b)} δ(q₁, a, q′) · δ(q′, b, q₂)
q ⊗ ab = (q ⊗ a) ⊗ b
q ∗ ab = (q ∗ a) · ((q ⊗ a) ∗ b)

¹⁸ A transducer defines an automaton whose labels are the pairs "input/output"; this automaton is assumed to be deterministic.
For the following, it is useful to note that if |T| is a function, then δ is a function too.

The following lemma states an invariant that holds for each state S built within the algorithm. The lemma will later be used for the proof of soundness.

Lemma 1 Let I = C[0] be the initial state. At each iteration of the "do" loop in DeterminizeTransducer, for each S = C[q′] and for each w ∈ Σ* such that I ⊗ w = S, the following holds:

(i) I ∗ w = ∧_{q∈d(i,w)} δ(i, w, q)
(ii) S = I ⊗ w = {(q, u) | q ∈ d(i, w) and u = (I ∗ w)⁻¹ · δ(i, w, q)}

Proof. (i) and (ii) are obviously true for S = I (since d(i, ε) = {i} and δ(i, ε, i) = ε), and we will show that, given some w ∈ Σ*, if the claim is true for S = I ⊗ w, then it is also true for S₁ = S ⊗ a = I ⊗ wa for all a ∈ Σ. Assuming that (i) and (ii) hold for S and w, then for each a ∈ Σ:

∧_{q′∈d(i,w), q″∈d(q′,a)} δ(i, w, q′) · δ(q′, a, q″)
  = (I ∗ w) · ∧_{q′∈d(i,w), q″∈d(q′,a)} ((I ∗ w)⁻¹ · δ(i, w, q′)) · δ(q′, a, q″)
  = (I ∗ w) · ∧_{(q′,u)∈S=I⊗w, q″∈d(q′,a)} u · δ(q′, a, q″)
  = (I ∗ w) · (S ∗ a)
  = (I ∗ w) · ((I ⊗ w) ∗ a)
  = I ∗ wa

This proves (i). We now turn to (ii). Assuming that (i) and (ii) hold for S and w, then for each a ∈ Σ, let S₁ = S ⊗ a; the algorithm (line 8) is such that

S₁ = {(q″, u′) | ∃(q′, u) ∈ S, q″ ∈ d(q′, a) and u′ = (S ∗ a)⁻¹ · u · δ(q′, a, q″)}

Let

S₂ = {(q″, u′) | q″ ∈ d(i, wa) and u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″)}

We show that S₁ ⊆ S₂. Let (q″, u′) ∈ S₁; then ∃(q′, u) ∈ S s.t. q″ ∈ d(q′, a) and u′ = (S ∗ a)⁻¹ · u · δ(q′, a, q″). Since u = (I ∗ w)⁻¹ · δ(i, w, q′), then
u′ = (S ∗ a)⁻¹ · (I ∗ w)⁻¹ · δ(i, w, q′) · δ(q′, a, q″), that is, u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″). Thus (q″, u′) ∈ S₂. Hence S₁ ⊆ S₂. We now show that S₂ ⊆ S₁. Let (q″, u′) ∈ S₂, and let q′ ∈ d(i, w) be s.t. q″ ∈ d(q′, a) and u = (I ∗ w)⁻¹ · δ(i, w, q′); then (q′, u) ∈ S, and since u′ = (I ∗ wa)⁻¹ · δ(i, wa, q″) = (S ∗ a)⁻¹ · u · δ(q′, a, q″), (q″, u′) ∈ S₁. This concludes the proof of (ii). □

The following lemma states a common property of the states S, which will be used in the complexity analysis of the algorithm.

Lemma 2 Each S = C[q′] built within the "do" loop is s.t. ∀q ∈ Q, there is at most one pair (q, w′) ∈ S with q as first element.

Proof. Suppose (q, w₁) ∈ S and (q, w₂) ∈ S, and let w be s.t. I ⊗ w = S. Then w₁ = (I ∗ w)⁻¹ · δ(i, w, q) and w₂ = (I ∗ w)⁻¹ · δ(i, w, q). Thus w₁ = w₂. □

The following lemma will also be used for soundness. It states that the final state emission function is indeed a function.

Lemma 3 For each S built in the algorithm, if (q, u), (q′, u′) ∈ S, then q, q′ ∈ F ⇒ u = u′.

Proof. Let S be one state set built in line 8 of the algorithm. Suppose (q, u), (q′, u′) ∈ S and q, q′ ∈ F. According to (ii) of Lemma 1, u = (I ∗ w)⁻¹ · δ(i, w, q) and u′ = (I ∗ w)⁻¹ · δ(i, w, q′). Since |T| is a function and {δ(i, w, q), δ(i, w, q′)} ⊆ |T|(w), then δ(i, w, q) = δ(i, w, q′), therefore u = u′. □

The following lemma will be used for completeness.

Lemma 4 Given a transducer T representing a subsequential function, there exists a bound M s.t. for each S built at line 8, for each (q, u) ∈ S, |u| ≤ M.

We rely on the following theorem, proven by Choffrut (1978):

Theorem 1 A function f on Σ* is subsequential iff it has bounded variations and, for any rational language L ⊆ Σ*, f⁻¹(L) is also rational.

with the following two definitions:
Definition 2 The left distance between two strings u and v is ‖u, v‖ = |u| + |v| − 2|u ∧ v|.

Definition 3 A function f on Σ* has bounded variations iff for all k ≥ 0, there exists K ≥ 0 s.t. for u, v ∈ dom(f), ‖u, v‖ ≤ k ⇒ ‖f(u), f(v)‖ ≤ K.
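As a quick numeric illustration of Definition 2 (our example, not the paper's):

```python
def left_distance(u, v):
    """‖u, v‖ = |u| + |v| − 2|u ∧ v|, where u ∧ v is the longest common prefix."""
    p = 0
    while p < min(len(u), len(v)) and u[p] == v[p]:
        p += 1
    return len(u) + len(v) - 2 * p

print(left_distance('abcd', 'abef'))  # |u| = 4, |v| = 4, |u ∧ v| = 2, so 4
print(left_distance('abc', 'abc'))    # 0
```

Intuitively, ‖u, v‖ counts the letters outside the shared prefix, which is exactly the part of the output a deterministic device cannot yet commit to.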
Proof of Lemma 4. Let f = |T|. For each q ∈ Q, let c(q) be a string w s.t. d(q, w) ∩ F ≠ ∅ and s.t. |w| is minimal among such strings. Note that |c(q)| ≤ ‖T‖, where ‖T‖ = |Q| is the number of states in T. For each q ∈ Q, let s(q) ∈ Q be a state s.t. s(q) ∈ d(q, c(q)) ∩ F. Let us further define

M₁ = max_{q∈Q} |δ(q, c(q), s(q))|
M₂ = max_{q∈Q} |c(q)|

Since f is subsequential, it is of bounded variations; therefore there exists K s.t. if ‖u, v‖ ≤ 2M₂ then ‖f(u), f(v)‖ ≤ K. Let M = K + 2M₁. Let S be a state set built at line 8, let w be s.t. I ⊗ w = S, and let λ = I ∗ w. Let (q₁, u) ∈ S. Let (q₂, v) ∈ S be s.t. u ∧ v = ε. Such a pair always exists, since if not, |∧_{(q′,u′)∈S} u′| > 0, thus

|λ · ∧_{(q′,u′)∈S} u′| = |∧_{(q′,u′)∈S} λ · u′| > |λ|

and thus, because of (ii) in Lemma 1, |∧_{q∈d(i,w)} δ(i, w, q)| > |I ∗ w|, which contradicts (i) in Lemma 1.

Let ω = δ(q₁, c(q₁), s(q₁)) and ω′ = δ(q₂, c(q₂), s(q₂)). Moreover, for any a, b, c, d ∈ Σ*, ‖a, c‖ ≤ ‖ab, cd‖ + |b| + |d|. In fact,

‖ab, cd‖ = |ab| + |cd| − 2|ab ∧ cd|
         = |a| + |c| + |b| + |d| − 2|ab ∧ cd|
         = ‖a, c‖ + 2|a ∧ c| + |b| + |d| − 2|ab ∧ cd|

but |ab ∧ cd| ≤ |a ∧ c| + |b| + |d|, and since ‖ab, cd‖ = ‖a, c‖ − 2(|ab ∧ cd| − |a ∧ c| − |b| − |d|) − |b| − |d|, one has ‖a, c‖ ≤ ‖ab, cd‖ + |b| + |d|.

Therefore, in particular, |u| ≤ ‖u, v‖ ≤ ‖uω, vω′‖ + |ω| + |ω′|, thus |u| ≤ ‖f(w · c(q₁)), f(w · c(q₂))‖ + 2M₁. But ‖w · c(q₁), w · c(q₂)‖ ≤ |c(q₁)| + |c(q₂)| ≤ 2M₂, thus ‖f(w · c(q₁)), f(w · c(q₂))‖ ≤ K and therefore |u| ≤ K + 2M₁ = M. □
The time is now ripe for the main proposition, which proves soundness and completeness.

Proposition 1 If a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer τ representing the same function.

Proof. Lemma 4 shows that the algorithm always terminates if |T| is subsequential. Let us show that dom(|τ|) = dom(|T|). Let w ∈ Σ* s.t. w is not in dom(|T|); then d(i, w) ∩ F = ∅. Thus, according to (ii) of Lemma 1, for all (q, u) ∈ I ⊗ w, q is not in F; thus I ⊗ w is not terminal and therefore w is not in dom(|τ|). Conversely, let w ∈ dom(|T|). There exists a unique q_f ∈ F s.t. |T|(w) = δ(i, w, q_f) and s.t. q_f ∈ d(i, w). Therefore |T|(w) = (I ∗ w) · ((I ∗ w)⁻¹ · δ(i, w, q_f)), and according to (ii) of Lemma 1, (q_f, (I ∗ w)⁻¹ · δ(i, w, q_f)) ∈ I ⊗ w; and since q_f ∈ F, Lemma 3 shows that ρ(I ⊗ w) = (I ∗ w)⁻¹ · δ(i, w, q_f), thus |T|(w) = (I ∗ w) · ρ(I ⊗ w) = |τ|(w). □
9.3 Worst-case complexity

In this section we give a worst-case upper bound on the size of the subsequential transducer in terms of the size of the input transducer. Let L = {w ∈ Σ* s.t. |w| ≤ M}, where M is the bound defined in the proof of Lemma 4. Since, according to Lemma 2, each state set Q″ contains at most one pair (q, w) for each q ∈ Q, the maximal number N of states built in the algorithm is smaller than the sum, over all state sets, of the number of functions from states to strings in L, that is,

N ≤ Σ_{Q″∈2^Q} |L|^{|Q″|}

We thus have N ≤ 2^{|Q|} · |L|^{|Q|} = 2^{|Q|} · 2^{|Q| log₂|L|}, and therefore N ≤ 2^{|Q|(1+log₂|L|)}.

Moreover, |L| = 1 + |Σ| + ⋯ + |Σ|^M = (|Σ|^{M+1} − 1)/(|Σ| − 1) if |Σ| > 1, and |L| = M + 1 if |Σ| = 1. In this last formula, M = K + 2M₁, as described in Lemma 4. Note that if P = max_{a∈Σ} |δ(q, a, q′)| is the maximal length of the simple transition emissions, then M₁ ≤ |Q| · P, thus M ≤ K + 2 · |Q| · P. Therefore, if |Σ| > 1, the number of states N is bounded by

N ≤ 2^{|Q|(1 + log₂((|Σ|^{K+2|Q|P+1} − 1)/(|Σ| − 1)))}

and if |Σ| = 1, N ≤ 2^{|Q|(1 + log₂(K + 2|Q|P + 1))}.
10 Subsequentiality of Transformation-Based Systems

The proof of correctness of the determinization algorithm and the fact that the algorithm terminates on the transducer encoding Brill's tagger show that the final function is subsequential and equivalent to Brill's original tagger. In this section, we prove in general that any transformation-based system, such as those used by Brill, is a subsequential function. In other words, any transformation-based system can be turned into a deterministic finite-state transducer.

We define transformation-based systems as follows.

Definition 4 A transformation-based system is a finite sequence (f₁, …, fₙ) of subsequential functions whose domains are bounded.

Applying a transformation-based system consists of taking the functions fᵢ one after the other and, for each of them, looking for the first position in the input at which it applies and for the longest string starting at that position, transforming this string, going to the end of this string, and iterating until the end of the input.
It is not true that, in general, the local extension of a subsequential function is subsequential.¹⁹ For instance, consider the function f_a of Figure 23. The local extension of f_a is not a function. In fact, consider the input string daaaad: it can be decomposed either into d · aaa · ad or into da · aaa · d. The first decomposition leads to the output dbbbad and the second one to the output dabbbd.

¹⁹ However, the local extensions of the functions we had to compute were subsequential.

[Figure 23: Function f_a: a chain of three transitions a/b, a/b, a/b, mapping aaa to bbb.]

The intended use of the rules in the tagger defined by Brill is to apply each function from left to right. In addition, if several decompositions are possible, the one that occurs first is the one chosen. In our previous example, it means that only the output dbbbad is generated. This notion is now defined precisely.
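The left-to-right application regime just described can be sketched as follows (our illustration; representing each bounded-domain function as a dict from factors to images, and taking the longest match at each position, is our assumption):

```python
def apply_transform(f, w):
    """Apply one function f (finite domain, given as a dict) from left to
    right: at each position rewrite the longest matching factor, if any,
    continue after it; otherwise copy the letter."""
    out, i = '', 0
    while i < len(w):
        matches = [y for y in f if w.startswith(y, i)]
        if matches:
            y = max(matches, key=len)
            out += f[y]
            i += len(y)          # go to the end of the transformed string
        else:
            out += w[i]
            i += 1
    return out

def apply_system(fs, w):
    """Apply the functions of a transformation-based system in sequence."""
    for f in fs:
        w = apply_transform(f, w)
    return w

print(apply_transform({'aaa': 'bbb'}, 'daaaad'))   # dbbbad
```

With f_a encoded as {'aaa': 'bbb'}, only the leftmost decomposition fires, so only dbbbad is produced, matching the behavior described above.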
  • b
e the rational function dened b y (a) = a for a 2 , ([ ) = (]) =
  • n
the additional sym b
  • ls
'[ ' and '] ' with
  • suc
h that (u
  • v
) = (u)
  • (v
). Denition 5 L et Y
  • and
X =
  • Y
  • ,
a Y
  • de
c
  • mp
  • sition
  • f
x is a string y 2 X
  • ([
  • Y
  • ]
  • X
)
  • s.t.
(y ) = x F
  • r
instance, if Y = dom(f a ) = faaag, the set
  • f
Y
  • decomp
  • sitions
  • f
x = daaad is fd[aaa]ad; da[aaa ]dg. Denition 6 L et < b e a total
  • r
der
  • n
  • and
let
  • =
  • [
f[ ; ]g b e the alphab et
  • with
the two additional symb
  • ls
'[ ' and ']'. L et extend the
  • r
der > to
  • by
8a 2 , '[ '< a and a < '] '. < denes a lexic
  • gr
aphic
  • r
der
  • n
  • that
we also denote <. L et Y
  • and
x 2
  • ,
the minimal Y
  • decomp
  • sition
  • f
x is the Y
  • de
c
  • mp
  • sition
which is minimal in (
  • ;
<). F
  • r
instance, the minim al dom(f a )-decomp
  • sition
  • f
daaaad is d[aaa]ad. In fact, d[aaa]ad < da[aaa]d. Prop
  • sition
2 Given Y
  • +
nite, the function md Y that to e ach x 2
  • asso
ciates its minimal Y
  • de
c
  • mp
  • sition,
is subse quential and total. Pro
  • f.
Let de c b e dened b y de c (w ) = u
  • [
  • v
  • ]
  • de
c((uv ) 1
  • w
) where u; v 2
  • are
s.t. v 2 Y , 9v 2
  • with
w = uv v and juj is minimal among suc h strings. The function md Y is total b ecause the function de c alw a ys returns an
  • utput
whic h is a Y
  • decomp
  • sition
  • f
w . MERL-TR-94-07. V ersion 3.0 Marc h 1995
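Before continuing, the running example can be checked mechanically. The following brute-force sketch is our own illustration, not code from the paper: it enumerates the Y-decompositions of Definition 5, applies f_a inside the brackets to show that an unrestricted local extension would not be a function, and selects the minimal decomposition of Definition 6 under the order '[' < a < ']'.

```python
# Brute-force illustration of Y-decompositions (Definition 5) and the
# minimal Y-decomposition md_Y (Definition 6), for the running example
# Y = dom(f_a) = {"aaa"} with f_a(aaa) = bbb. All names are ours.

def in_X(s, Y):
    """s is in X = Sigma* - Sigma*.Y.Sigma*, i.e. has no factor in Y."""
    return not any(s[i:i + len(y)] == y for y in Y for i in range(len(s)))

def decompositions(x, Y):
    """All Y-decompositions of x: strings in X.([.Y.].X)* that give
    back x once the brackets are erased."""
    out = []
    def rec(rest, acc):
        for i in range(len(rest)):
            for y in Y:
                # bracket a factor y at position i; the segment before
                # the bracket must itself lie in X
                if rest[i:i + len(y)] == y and in_X(rest[:i], Y):
                    rec(rest[i + len(y):], acc + rest[:i] + "[" + y + "]")
        if in_X(rest, Y):              # final unbracketed segment
            out.append(acc + rest)
    rec(x, "")
    return out

def apply_inside(decomp, f):
    """Apply f to each bracketed factor and erase the brackets."""
    result, i = [], 0
    while i < len(decomp):
        if decomp[i] == "[":
            j = decomp.index("]", i)
            result.append(f[decomp[i + 1:j]])
            i = j + 1
        else:
            result.append(decomp[i])
            i += 1
    return "".join(result)

def md(x, Y):
    """Minimal Y-decomposition under '[' < letters < ']' (Definition 6)."""
    rank = {"[": 0, "]": 2}
    return min(decompositions(x, Y),
               key=lambda s: [(rank.get(c, 1), c) for c in s])

f_a = {"aaa": "bbb"}
ds = decompositions("daaaad", {"aaa"})
print(sorted(ds))                                   # ['d[aaa]ad', 'da[aaa]d']
print([apply_inside(d, f_a) for d in sorted(ds)])   # ['dbbbad', 'dabbbd']
print(md("daaaad", {"aaa"}))                        # d[aaa]ad
```

On the running example this confirms the two decompositions of daaaad, their two distinct outputs dbbbad and dabbbd, and that md_Y selects d[aaa]ad.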
We shall now prove that the function is rational and then that it has bounded variations; this will prove, according to theorem 1, that the function is subsequential. In the following, X = Σ* − Σ*·Y·Σ*.

The transduction T_Y that generates the set of Y-decompositions is defined by T_Y = Id_X · (ε/[ · Id_Y · ε/] · Id_X)*, where Id_X (resp. Id_Y) stands for the identity function on X (resp. Y). Furthermore, the transduction T_{Σ̄,>} that to each string w ∈ Σ̄* associates the set of strings strictly greater than w, that is T_{Σ̄,>}(w) = {w′ ∈ Σ̄* | w < w′}, is defined by the transducer of Figure 24, in which A = {(x, x) | x ∈ Σ̄}, B = {(x, y) ∈ Σ̄² | x < y}, C = Σ̄², D = {ε} × Σ̄ and E = Σ̄ × {ε}.²⁰
[Transducer diagram: states 1-4 with arcs labeled by the pair sets A, B, C, D, E defined above.]
Figure 24: Transduction T_{Σ̄,>}

Therefore, the right-minimal Y-decomposition function md_Y is defined by md_Y = T_Y − (T_{Σ̄,>} ∘ T_Y), which proves that md_Y is rational.

Let k > 0. Let K = 6·k + 6·M, where M = max_{x∈Y} |x|. Let u, v ∈ Σ* be s.t. ‖u, v‖ ≤ k. Let us consider two cases: (i) |u ∧ v| ≤ M and (ii) |u ∧ v| > M.

(i): |u ∧ v| ≤ M, thus |u|, |v| ≤ |u ∧ v| + ‖u, v‖ ≤ M + k. Moreover, for each w ∈ Σ* and each Y-decomposition w′ of w, |w′| ≤ 3·|w|. In fact, Y doesn't contain ε, thus the number of '[' (resp. ']') in w′ is smaller than |w|. Therefore, |md_Y(u)|, |md_Y(v)| ≤ 3·(M + k), thus ‖md_Y(u), md_Y(v)‖ ≤ K.

(ii): u ∧ v = α·ω with |ω| = M. Let β, γ be s.t. u = α·ω·β and v = α·ω·γ. Let α′, ω′, β′, α″, ω″ and γ″ be s.t. md_Y(u) = α′·ω′·β′, md_Y(v) = α″·ω″·γ″, φ(α′) = φ(α″) = α, φ(ω′) = φ(ω″) = ω, φ(β′) = β and φ(γ″) = γ. Suppose that α′ ≠ α″, for instance α′ < α″. Let i be the first index s.t. (α′)_i < (α″)_i.²¹

²⁰ This construction is similar to the transduction built within the proof of Eilenberg's cross-section theorem (Eilenberg, 1974).
²¹ (w)_i refers to the i-th letter in w.
We have two possible situations:

(ii.1) (α′)_i = '[' and (α″)_i ∈ Σ, or (α″)_i = ']'. In that case, since the length of the elements in Y is smaller than M = |ω|, one has α′·ω′ = λ₁·[·λ₂·]·λ₃ with |λ₁| = i, λ₂ ∈ Y and λ₃ ∈ Σ̄*. We also have α″·ω″ = λ₁·λ₂′·λ₃′ with φ(λ₂′) = φ(λ₂), and the first letter of λ₂′ is different from '['. Let λ₄ be a Y-decomposition of φ(λ₃′·γ″); then λ₁·[·λ₂·]·λ₄ is a Y-decomposition of v strictly smaller than λ₁·λ₂′·λ₃′·γ″ = md_Y(v), which contradicts the minimality of md_Y(v).

The second situation is (ii.2): (α′)_i ∈ Σ and (α″)_i = ']'. Then we have α′·ω′ = λ₁·[·λ₂·λ₃·]·λ₄ s.t. |λ₁·[·λ₂| = i, and α″·ω″ = λ₁·[·λ₂·]·λ₃′·λ₄′ s.t. φ(λ₃) = φ(λ₃′) and φ(λ₄) = φ(λ₄′). Let λ₅ be a Y-decomposition of φ(λ₄′·γ″); then λ₁·[·λ₂·λ₃·]·λ₅ is a Y-decomposition of v strictly smaller than α″·ω″·γ″, which leads to the same contradiction.

Therefore α′ = α″, and since |β′| + |γ″| ≤ 3·(|β| + |γ|) = 3·‖u, v‖ ≤ 3·k, we have ‖md_Y(u), md_Y(v)‖ ≤ |ω′| + |ω″| + |β′| + |γ″| ≤ 2·3·M + 3·k ≤ K. This proves that md_Y has bounded variations and therefore that it is subsequential. □

We can now define precisely what is the effect
of a function when one applies it from left to right, as was done in the original tagger.

Definition 7 If f is a rational function and Y = dom(f) ⊆ Σ⁺, the right-minimal local extension of f, denoted RmLocExt(f), is the composition of the right-minimal Y-decomposition md_Y with Id_X · ([/ε · f · ]/ε · Id_X)*.

RmLocExt being the composition of two subsequential functions, it is itself subsequential. This proves the following final proposition, which states that, given a rule-based system similar to Brill's system, one can build a subsequential transducer that represents it:

Proposition 3 If (f₁, …, fₙ) is a sequence of subsequential functions with bounded domains, then RmLocExt(f₁) ∘ ⋯ ∘ RmLocExt(fₙ) is subsequential.

We have proven in this section that our techniques apply to the class of transformation-based systems. We now turn our attention to the implementation of finite-state transducers.
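Concretely, the left-to-right behavior that RmLocExt formalizes can be simulated by greedy leftmost-longest matching, which coincides with the right-minimal decomposition on examples like those above; a transformation-based system is then the chaining of such passes. This is our own sketch with toy rules, not Brill's actual rule set:

```python
# Sketch of applying one rule leftmost-longest (the effect of
# RmLocExt(f)), then chaining several rules as in Proposition 3.
# Each rule f is a dict over its bounded domain; the rules are toy
# examples of ours.

def rm_local_ext(f, text):
    """Apply f from left to right, choosing at each position the
    longest matching factor of dom(f), then resuming after it."""
    out, i = [], 0
    while i < len(text):
        m = max((y for y in f if text.startswith(y, i)),
                key=len, default=None)
        if m is None:
            out.append(text[i])      # no rule applies here: copy letter
            i += 1
        else:
            out.append(f[m])         # transform the factor
            i += len(m)              # resume after the transformed factor
    return "".join(out)

def transformation_based_system(rules, text):
    """Apply the rules f_1, ..., f_n one after the other (Definition 4)."""
    for f in rules:
        text = rm_local_ext(f, text)
    return text

f_a = {"aaa": "bbb"}                          # running example of Figure 23
print(rm_local_ext(f_a, "daaaad"))            # -> dbbbad
print(transformation_based_system([f_a, {"bb": "c"}], "daaaad"))  # -> dcbad
```

The first call reproduces the intended single output dbbbad of the tagger's left-to-right application; the second shows two rules applied in sequence.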
11 Implementation of Finite-State Transducers

Once the final finite-state transducer is computed, applying it to an input is straightforward: it consists of following a unique path in the transducer whose left labels correspond to the input. However, in order to have a complexity fully independent of the size of the grammar, and in particular independent of the number of transitions at each state, one should carefully choose an appropriate representation for the transducer. In our implementation, the transitions can be accessed randomly. The transducer is first represented by a two-dimensional table whose rows are indexed by the states and whose columns are indexed by the alphabet of all possible input letters. The content of the table at line q and at column a is the word w such that the transition from q with the input label a outputs w. Since only a few transitions are allowed from many states, this table is very sparse and can be compressed. This compression is achieved using a procedure for sparse data tables following the method given by Tarjan and Yao (1979).

12 Acknowledgments

We thank Eric Brill for providing us with the code
of his tagger and for many useful discussions. We also thank Aravind K. Joshi, Mark Liberman and Mehryar Mohri for valuable discussions. We thank the anonymous reviewers for many helpful comments that led to improvements in both the content and the presentation of this paper.

13 Conclusion

The techniques described in this paper are more general than the problem
of part-of-speech tagging and are applicable to the class of problems dealing with local transformation rules.

We showed that any transformation-based program can be transformed into a deterministic finite-state transducer. This yields optimal-time implementations of transformation-based programs.
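To make the table representation of Section 11 concrete, here is a minimal sketch with a toy transducer of our own (the tagger's actual tables are sparse and compressed following Tarjan and Yao, 1979): each cell holds the emitted word and the next state, so processing an input of length n costs exactly n random accesses, independent of the number of rules.

```python
# Toy sketch of table-driven application of a deterministic
# (subsequential) transducer: rows = states, columns = input letters,
# each cell = (output word, next state). The final-output function is
# part of the standard subsequential-transducer definition.

def apply_transducer(table, final, start, text):
    """Follow the single deterministic path: O(len(text)) lookups."""
    state, out = start, []
    for ch in text:
        word, state = table[state][ch]   # random access into the table
        out.append(word)
    out.append(final[state])             # final emission at end of input
    return "".join(out)

# Hypothetical transducer rewriting every "ab" as "x":
# state 0 = nothing pending, state 1 = just read an 'a'.
table = {
    0: {"a": ("", 1), "b": ("b", 0)},
    1: {"a": ("a", 1), "b": ("x", 0)},
}
final = {0: "", 1: "a"}                  # flush a pending 'a' at the end
print(apply_transducer(table, final, 0, "abba"))   # -> xba
```

The loop touches one cell per input letter, which is exactly the single-path, grammar-size-independent behavior claimed above.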
As a case study, we applied these techniques to the problem of part-of-speech tagging and presented a finite-state tagger that requires n steps to tag a sentence of length n, independent of the number of rules and the length of the context they require. We achieved this result by representing the rules acquired for Brill's tagger as non-deterministic finite-state transducers. We composed each of these non-deterministic transducers and turned the resulting transducer into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger which operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. The tagger outperforms in speed both Brill's tagger and trigram-based taggers. Moreover, the finite-state tagger inherits from the rule-based system its compactness compared to a stochastic tagger. We also proved the correctness and the generality of the methods.

We believe that this finite-state tagger will also be found useful combined with other language components, since it can be naturally extended by composing it with finite-state transducers which could encode other aspects of natural language syntax.

Bibliography

Brill, Eric. 1992. A simple rule-based part
of speech tagger. In Third Conference on Applied Natural Language Processing, pages 152–155, Trento, Italy.

Brill, Eric. 1994. A report of recent progress in transformation error-driven learning. In AAAI'94, Tenth National Conference on Artificial Intelligence.

Choffrut, Christian. 1977. Une caractérisation des fonctions séquentielles et des fonctions sous-séquentielles en tant que relations rationnelles. Theoretical Computer Science, 5:325–338.

Choffrut, Christian. 1978. Contribution à l'étude de quelques familles remarquables de fonctions rationnelles. Ph.D. thesis, Université Paris VII (Thèse d'État).

Chomsky, N. 1964. Syntactic Structures. Mouton and Co., The Hague.

Church, Kenneth Ward. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing, Austin, Texas.

Clemenceau, David. 1993. Structuration du Lexique et Reconnaissance de Mots Dérivés. Ph.D. thesis, Université Paris 7.

Cutting, Doug, Julian Kupiec, Jan Pederson, and Penelope Sibun. 1992. A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing, pages 133–140, Trento, Italy.

DeRose, S.J. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31–39.

Eilenberg, Samuel. 1974. Automata, Languages, and Machines. Academic Press, New York.

Elgot, C. C. and J. E. Mezei. 1965. On relations defined by generalized finite automata. IBM Journal of Research and Development, 9:47–65, January.

Francis, W. Nelson and Henry Kučera. 1982. Frequency Analysis of English Usage. Houghton Mifflin, Boston.

Kaplan, Ronald M. and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Karttunen, Lauri, Ronald M. Kaplan, and Annie Zaenen. 1992. Two-level morphology with composition. In Proceedings of the 14th International Conference on Computational Linguistics (COLING'92).

Kupiec, J. M. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.

Laporte, Eric. 1993. Phonétique et transducteurs. Technical report, Université Paris 7, June.

Merialdo, Bernard. 1990. Tagging text with a probabilistic model. Technical Report RC 15972, IBM Research Division.

Mohri, Mehryar. 1994a. Minimisation of sequential transducers. In Proceedings of the Conference on Computational Pattern Matching 1994.

Mohri, Mehryar. 1994b. On some applications of finite-state automata theory to natural language processing. Technical report, Institut Gaspard Monge.

Pereira, Fernando C. N., Michael Riley, and Richard W. Sproat. 1994. Weighted rational transductions and their application to human language processing. In ARPA Workshop on Human Language Technology. Morgan Kaufmann.

Revuz, Dominique. 1991. Dictionnaires et Lexiques, Méthodes et Algorithmes. Ph.D. thesis, Université Paris 7.

Roche, Emmanuel. 1993. Analyse Syntaxique Transformationnelle du Français par Transducteurs et Lexique-Grammaire. Ph.D. thesis, Université Paris 7, January.

Schützenberger, Marcel Paul. 1977. Sur une variante des fonctions séquentielles. Theoretical Computer Science, 4:47–57.

Silberztein, Max. 1993. Dictionnaires Electroniques et Analyse Lexicale du Français: Le Système INTEX. Masson.

Tapanainen, Pasi and Atro Voutilainen. 1993. Ambiguity resolution in a reductionistic parser. In Sixth Conference of the European Chapter of the ACL, Proceedings of the Conference, Utrecht, April.

Tarjan, Robert Endre and Andrew Chi-Chih Yao. 1979. Storing a sparse table. Communications of the ACM, 22(11):606–611, November.

Weischedel, Ralph, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):359–382, June.