A new automatic spelling correction model aimed at improving parsability



SLIDE 1

A new automatic spelling correction model aimed at improving parsability

Rob van der Goot and Gertjan van Noord

SLIDE 2

Old approach

  • IV/OOV
  • Generate candidates
  • Rank candidates
SLIDE 3

New approach

  • IV/OOV
  • Generate candidates
  • Rank candidates
SLIDE 4

Data

  • LexNorm v1.2
  • 549 tweets / 10,576 tokens
  • 2,140 OOV tokens
  • 1,184 tokens corrected
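For a sense of scale, the proportions implied by these counts can be worked out directly. The snippet below is only arithmetic on the figures from the slide; the variable names are illustrative and not part of the original deck.

```python
# Counts as reported on the slide for LexNorm v1.2.
tweets = 549
tokens = 10_576
oov_tokens = 2_140
corrected_tokens = 1_184

# Share of all tokens that are out-of-vocabulary (OOV).
oov_share = oov_tokens / tokens
# Share of all tokens that carry a correction in the annotation.
corrected_share = corrected_tokens / tokens
# Average tweet length in tokens.
tokens_per_tweet = tokens / tweets

print(f"OOV share:        {oov_share:.1%}")
print(f"Corrected share:  {corrected_share:.1%}")
print(f"Tokens per tweet: {tokens_per_tweet:.1f}")
```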
SLIDE 5

Token     Label   Normalized
17
only      IV      only
3mths     OOV     3mths
left      IV      left
in        IV      in
school    IV      school
.         NO      .
i         IV      i
wil       OOV     will
always    IV      always
mis       OOV     miss
my        IV      my
skull     IV      skull
,         NO      ,
frnds     OOV     friends
and       IV      and
my        IV      my
teachrs   OOV     teachers

4
new       IV      new
pix       OOV     pictures
comming   OOV     coming
tomoroe   OOV     tomorrow

SLIDE 6

IV/OOV

  • Aspell dictionary
  • IV tokens skipped
  • 90% of the errors (Bo Han, 2013)
  • Example:

    – I am tiret
    – I am tire
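The example can be made concrete with a minimal sketch of dictionary-based detection. The tiny `LEXICON` set is a hypothetical stand-in for the Aspell dictionary, and the function names are illustrative. It shows why "tiret" is flagged while "tire", itself a valid word, is skipped.

```python
# Hypothetical mini-lexicon standing in for the Aspell dictionary.
LEXICON = {"i", "am", "tire", "tired", "in", "school", "my", "skull"}


def label_token(token: str, lexicon: set[str] = LEXICON) -> str:
    """Return 'IV' if the token is in the lexicon, else 'OOV'."""
    return "IV" if token.lower() in lexicon else "OOV"


def tokens_to_correct(sentence: str) -> list[str]:
    """Old approach: only OOV tokens are passed on for correction."""
    return [tok for tok in sentence.split() if label_token(tok) == "OOV"]


print(tokens_to_correct("I am tiret"))  # ['tiret']
print(tokens_to_correct("I am tire"))   # []
```

Under this filter the second sentence never reaches the correction step, which is the kind of IV error the new approach is meant to include.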

SLIDE 7

IV/OOV

SLIDE 8

IV/OOV

SLIDE 9

Generate candidates

  • Edit distances (Modified Aspell)
  • N-grams
  • Original token
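As a rough illustration of the edit-distance source, the sketch below enumerates strings within two edits of a token and keeps those found in a lexicon, always adding the original token. It is a simple brute-force stand-in and not the modified Aspell used in the slides. The `LEXICON` contents and function names are assumptions, and the n-gram source is left out.

```python
import string

# Hypothetical mini-lexicon; the slides use a (modified) Aspell dictionary.
LEXICON = {"friends", "fronds", "finds", "teachers", "will", "wild", "miss"}


def edits1(word: str) -> set[str]:
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def generate_candidates(token: str, max_distance: int = 2) -> set[str]:
    """Lexicon words within max_distance edits, plus the original token."""
    frontier = {token}
    reachable = {token}
    for _ in range(max_distance):
        frontier = {e for w in frontier for e in edits1(w)}
        reachable |= frontier
    return (reachable & LEXICON) | {token}


print(sorted(generate_candidates("frnds")))
print(sorted(generate_candidates("wil")))
```

Keeping the original token in the set lets the ranker decide that no change is needed, which matters once IV tokens are no longer skipped.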
SLIDE 10

Generate candidates

SLIDE 11

Generate candidates

SLIDE 12

Rank candidates

  • N-grams
  • Edit distance
  • Occurrence in dictionaries
  • Parse probability
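A sketch of how some of these features might be assembled per candidate follows. The counts, the dictionary, and the `Features` layout are hypothetical. A unigram probability stands in for the n-gram feature, and parse probability is omitted because it would require a parser.

```python
from dataclasses import dataclass

# Hypothetical resources; real n-gram counts and dictionaries are much larger.
UNIGRAM_COUNTS = {"friends": 120, "finds": 15, "fronds": 1}
DICTIONARY = {"friends", "finds", "fronds"}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


@dataclass
class Features:
    candidate: str
    unigram_prob: float
    edit_distance: int
    in_dictionary: bool


def extract_features(token: str, candidate: str) -> Features:
    """Build one feature row for a (token, candidate) pair."""
    total = sum(UNIGRAM_COUNTS.values())
    return Features(
        candidate=candidate,
        unigram_prob=UNIGRAM_COUNTS.get(candidate, 0) / total,
        edit_distance=levenshtein(token, candidate),
        in_dictionary=candidate in DICTIONARY,
    )


for cand in ["friends", "finds", "fronds", "frnds"]:
    print(extract_features("frnds", cand))
```

Rows of this shape are what a learning-to-rank model would consume, one row per candidate and one group per token.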
SLIDE 13

Rank candidates

  • 1. Random Forest
  • 2. Coordinate Ascent
  • 3. MART
  • 4. RankBoost
  • 5. RankNet
  • 6. AdaRank
  • 7. LambdaMART
  • 8. ListNet
SLIDE 14

Rank candidates

  • Average 222 candidates

top   Accuracy
1     0.32
5     0.62
10    0.72
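Top-k accuracy of this kind is usually read as the fraction of tokens whose gold correction appears among the first k ranked candidates. The sketch below assumes that definition; the data and function name are hypothetical, and the slides do not specify the exact evaluation script.

```python
def top_k_accuracy(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of tokens whose gold correction is among the first k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(g in cands[:k] for cands, g in zip(ranked, gold))
    return hits / len(gold)


# Hypothetical ranker output for three tokens, best candidate first.
ranked = [
    ["friends", "finds", "fronds"],          # gold at rank 1
    ["wild", "will", "wil"],                 # gold at rank 2
    ["tomorrow's", "tomorrows", "tomato"],   # gold not among the candidates
]
gold = ["friends", "will", "tomorrow"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, gold, k))
```

With an average of 222 candidates per token, the gap between top-1 and top-10 indicates how often the right answer is generated but not ranked first.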

SLIDE 15

(Dis-) Advantages

  • Includes IV errors
  • More general
  • Adaptation
  • Less efficient
  • Training data
SLIDE 16

Future work

  • Rank on sentence level
  • Generate different token orders
  • Generate multi-word solutions
  • New corpus (parses)