A simple method for citation metadata extraction using hidden Markov models


A simple method for citation metadata extraction using hidden Markov models. Erik Hetzner (California Digital Library), JCDL 2008. Advantages of our method: good performance on homogeneous citations; reasonable performance on heterogeneous citations.


slide-1
SLIDE 1

A simple method for citation metadata extraction using hidden Markov models

Erik Hetzner (California Digital Library) JCDL 2008

slide-2
SLIDE 2

Advantages of our method

Good performance on homogeneous citations.

Reasonable performance on heterogeneous citations.

Extractor can be implemented in a few pages of code.

slide-3
SLIDE 3

Improving HMM performance

Reduce the size of the alphabet by mapping words to a smaller set of symbols.

Use two states for each label: first & rest.

Use ‘separator states’, one for each possible transition between labels.
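
The second and third points amount to a recipe for building the HMM's state set. The following is a minimal sketch, not the author's code: it assumes the `a:f` / `a:r` / `a|t` naming that appears in the later slides, and the function name `build_states` and the three-label set are hypothetical.

```python
from itertools import product


def build_states(labels):
    """Return state names for a label set (naming is an assumption of this sketch)."""
    states = []
    for label in labels:
        states.append(f"{label}:f")  # state for the first token of a field
        states.append(f"{label}:r")  # state for the remaining tokens of the field
    for src, dst in product(labels, repeat=2):
        states.append(f"{src}|{dst}")  # separator state between src and dst fields
    return states


labels = ["a", "t", "d"]  # hypothetical labels, e.g. author, title, date
states = build_states(labels)
print(len(states))
```

With n labels this gives 2n label states plus n² separator states, so the state set grows quadratically with the number of labels.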

slide-4
SLIDE 4

Hidden Markov models

[Figure: state diagram of a hidden Markov model with symbols a and b; the probabilities shown are .25, .5, .75 and 1.]
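
The slides do not say which decoding algorithm the extractor uses; Viterbi decoding is the standard way to recover the most probable state sequence from an HMM, so it is sketched here under that assumption. The two states `s1` and `s2` and all probabilities below are hypothetical values chosen for illustration, not read off the figure.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a sequence of observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model over the symbols "a" and "b".
states = ["s1", "s2"]
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.25, "s2": 0.75}}
emit_p = {"s1": {"a": 0.75, "b": 0.25}, "s2": {"a": 0.25, "b": 0.75}}

path = viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p)
print(path)
```

For long citations a practical implementation would work with log probabilities to avoid underflow; plain products are kept here for readability.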

slide-5
SLIDE 5

Alphabet of symbols: words?

exorcised throed deposed roil vaporized rattletrap mocking prohibit sleetier effectual tweeter decremented atrophied nearby captor earn oboe ticked inoculate algorithmic extremist inherited burping silenced harassment doctrinaire emptiest tarting freewheeled parqueting gentlewoman optimal dashboard taskmaster acceptance mucky prototyping virtual recapture perpetrate junking rewrote goody cooperated mottling yahoo gridiron successfully bumper siphoned witchcraft jettison capering grouchier disallowed eyeballing medic sullen certitude tearier parlor becoming morphological cognomen saddening apprenticed signpost lignite wishing boldface postage audibility jingoistic lousy reacted rivulet arboreal primping eddy belatedly necessity ordinance retrogressed perverting sponging neutralizer deadlier inferential easel aptly trapeze circumlocution descanted caressing redeemable entice thunderstruck lectured postmarking twanged bellowing rainier grouching cozier flimsiest grizzly decorously jawboning tinier crookeder liberation sleeting heehawed puffin paisley daunt screenwriter …

slide-6
SLIDE 6

Alphabet of symbols: keywords

wAND wAPPEAR wCOMMUNICATIONS wCONFERENCE wDE wDISSERTATION wEDITOR wIN wINC wJOURNAL wNOTICES wNUMBER wPAGES wPHD wPRESS wPROCEEDINGS wREPORT wSUBMITTED wTECHNICAL wTHESIS wTRANSACTIONS wUNIVERSITY wVAN wVOLUME

slide-7
SLIDE 7

Alphabet of symbols: punctuation

pPERIOD pCOMMA pLEFTPAREN pRIGHTPAREN pLEFTBRACKET pRIGHTBRACKET pHYPEN pCOLON pSEMICOLON pQUESTIONMARK pMISC pAPOSTROPHE pDOUBLEQUOTE pSINGLEQUOTE

slide-8
SLIDE 8

Alphabet of symbols: word classes

wMONTH wSEASON

slide-9
SLIDE 9

Alphabet of symbols: features

fINITIAL fTC fUPPER fLOWER fNUMERAL4 fNUMERAL fMIXED

slide-10
SLIDE 10

Tokens → symbols

1  ^[aA][nN][dD]$    → wAND
2  ^[Jj]an(uary)?$   → cMONTH
3  ^\.$              → pPERIOD
4  ^,$               → pCOMMA
5  ^[A-Z]$           → fINITIAL
6  ^[A-Z][A-Z]+$     → fUPPER
…
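
The rule list on this slide can be read as an ordered table of regular expressions, where the first matching pattern decides the symbol. A minimal sketch under that assumption follows; only the six rules shown on the slide are included, and the `UNKNOWN` fallback and the function name `to_symbol` are hypothetical, since the remaining rules are elided.

```python
import re

# Ordered (pattern, symbol) rules copied from the slide; order matters,
# e.g. "AND" must hit wAND before the more general fUPPER rule.
RULES = [
    (re.compile(r"^[aA][nN][dD]$"), "wAND"),
    (re.compile(r"^[Jj]an(uary)?$"), "cMONTH"),
    (re.compile(r"^\.$"), "pPERIOD"),
    (re.compile(r"^,$"), "pCOMMA"),
    (re.compile(r"^[A-Z]$"), "fINITIAL"),
    (re.compile(r"^[A-Z][A-Z]+$"), "fUPPER"),
]


def to_symbol(token, default="UNKNOWN"):
    """Return the symbol of the first rule matching the token."""
    for pattern, symbol in RULES:
        if pattern.match(token):
            return symbol
    return default  # hypothetical fallback; the slide elides the remaining rules


symbols = [to_symbol(t) for t in ["P", ".", ",", "and", "MIT", "Jan"]]
print(symbols)
```

Putting keyword rules ahead of the generic feature rules keeps specific words from being absorbed by the broader capitalization classes.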

slide-11
SLIDE 11

Tokens → symbols

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

slide-12
SLIDE 12

Tokens → symbols

fTC, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

slide-13
SLIDE 13

Tokens → symbols

fTCpCOMMA Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

slide-14
SLIDE 14

Tokens → symbols

fTCpCOMMA fTC fINITIALpPERIODpCOMMA wAND fTC fTCpPERIOD wTHE fTC fTCpPERIOD fMIXED wEDITIONpPERIOD fTCpCOMMA fTCpPERIODpCOLON wTHE fUPPER fTCpCOMMA fNUMERAL4pPERIOD
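
Before tokens can be mapped to symbols, the citation string has to be split so that words and punctuation marks become separate tokens. The slides do not show how this is done; the sketch below assumes a simple regular-expression split, and the function name `tokenize` is hypothetical.

```python
import re


def tokenize(citation):
    """Split a citation into word tokens and single punctuation tokens."""
    # \w+ grabs runs of letters and digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", citation)


citation = (
    "Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. "
    "4th Edition. Cambridge, Mass.: The MIT Press, 1995."
)
tokens = tokenize(citation)
print(tokens[:7])
```

On the example citation this split yields one token per symbol in the sequence shown on this slide.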

slide-15
SLIDE 15

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f fTC

slide-16
SLIDE 16

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r pCOMMA

slide-17
SLIDE 17

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r fTC

slide-18
SLIDE 18

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r fINITIAL

slide-19
SLIDE 19

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r pPERIOD

slide-20
SLIDE 20

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r a|a pCOMMA

slide-21
SLIDE 21

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r a|a wAND

slide-22
SLIDE 22

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r fTC a|a

slide-23
SLIDE 23

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r a|a fTC

slide-24
SLIDE 24

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

a:f a:r a|a a|t pPERIOD
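
Once each token has been assigned a state, the first/rest and separator states make it straightforward to cut the token stream into fields. The sketch below is one possible reading of the walkthrough in the preceding slides, not the author's code: it assumes `x:f` opens a field with label `x`, `x:r` continues it, and tokens in `x|y` separator states belong to no field. The state path is transcribed from the slides as interpreted here, and `group_fields` is a hypothetical name.

```python
def group_fields(tokens, states):
    """Group tokens into (label, tokens) fields from a decoded state path."""
    fields = []
    for token, state in zip(tokens, states):
        if "|" in state:
            continue  # separator state: the token belongs to no field
        label, position = state.split(":")
        if position == "f" or not fields:
            fields.append((label, [token]))  # a 'first' state opens a new field
        else:
            fields[-1][1].append(token)  # a 'rest' state extends the open field
    return fields


tokens = ["Friedman", ",", "Daniel", "P", ".", ",", "and", "Matthias", "Felleisen", "."]
states = ["a:f", "a:r", "a:r", "a:r", "a:r", "a|a", "a|a", "a:f", "a:r", "a|t"]
fields = group_fields(tokens, states)
print(fields)
```

Because a separate first state marks each field's start, two adjacent fields with the same label (here, two authors) stay distinct instead of merging into one.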

slide-25
SLIDE 25

Results on the Cora dataset

token            .944
field            .892
whole instance   .613
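
The slide does not define the three measures. A common reading, assumed here, is: token = fraction of tokens labeled correctly; field = fraction of reference fields recovered with exactly the right boundaries and label; whole instance = fraction of citations with every token correct. The sketch below computes these on made-up label sequences; all function names and data are hypothetical and unrelated to the Cora figures.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the reference label."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def spans(labels):
    """Return the set of (start, end, label) runs of identical labels."""
    result, start = set(), 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            result.add((start, i, labels[start]))
            start = i
    return result


def field_accuracy(gold, pred):
    """Fraction of reference fields reproduced exactly (boundaries and label)."""
    total = correct = 0
    for gs, ps in zip(gold, pred):
        gold_spans = spans(gs)
        total += len(gold_spans)
        correct += len(gold_spans & spans(ps))
    return correct / total


def instance_accuracy(gold, pred):
    """Fraction of citations in which every token is labeled correctly."""
    return sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)


# Made-up example: two citations, the second with one mislabeled token.
gold = [["a", "a", "t", "t", "d"], ["a", "t", "t"]]
pred = [["a", "a", "t", "t", "d"], ["a", "a", "t"]]
print(token_accuracy(gold, pred), field_accuracy(gold, pred), instance_accuracy(gold, pred))
```

A single wrong token can break two fields and the whole citation at once, which is why the stricter measures fall off faster than the token-level one.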

slide-26
SLIDE 26

Improving HMM performance

Reduce the size of the alphabet by mapping words to a smaller set of symbols.

Use two states for each label: first & rest.

Use ‘separator states’, one for each possible transition between labels.

slide-27
SLIDE 27

Erik Hetzner

erik.hetzner@ucop.edu

http://purl.net/net/egh/hmm cite parser/