Attention Shifting for Parsing Speech - Keith Hall and Mark Johnson - PowerPoint PPT Presentation
SLIDE 1

Attention Shifting for Parsing Speech

Keith Hall and Mark Johnson

Brown University

ACL 2004

July 22, 2004

SLIDE 2

Attention Shifting

  • Iterative best-first word-lattice parsing algorithm
  • Posits a complete syntactic analysis for each path of a word-lattice
  • Goals of Attention Shifting

    – Improve accuracy of best-first parsing on word-lattices
      (Oracle Word Error Rate)

    – Improve efficiency of word-lattice parsing
      (number of parser operations)

    – Improve syntactic language modeling based on multi-stage parsing
      (Word Error Rate)

  • Inspired by edge demeriting for efficient parsing
    (Blaheta & Charniak demeriting, ACL99)

7/22/2004 ACL04: Attention Shifting for Parsing Speech 1

SLIDE 3

Outline

  • Syntactic language modeling
  • Word-lattice parsing
  • Multi-stage best-first parsing

SLIDE 4

Noisy Channel

[Figure: Language Source → Noisy Channel → Language Output]

P(A, W) = P(A|W) P(W), where P(A|W) is the noise model and P(W) is the language model

  • Speech recognition: Noise model = Acoustic model

arg max_W P(W|A) = arg max_W P(A, W)
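As a toy illustration of this argmax (the candidate strings and scores below are invented for illustration, not taken from the paper), one can combine acoustic and language-model log probabilities and pick the word string that maximizes the joint score:

```python
# Hypothetical candidate transcriptions with illustrative log-probabilities
# from the acoustic model log P(A|W) and the language model log P(W).
candidates = {
    "the man is surely early": {"acoustic": -12.0, "language": -9.5},
    "duh man is surly early":  {"acoustic": -11.0, "language": -15.2},
}

def joint_log_prob(scores):
    # log P(A, W) = log P(A|W) + log P(W)
    return scores["acoustic"] + scores["language"]

# arg max_W P(W|A) = arg max_W P(A, W)
best = max(candidates, key=lambda w: joint_log_prob(candidates[w]))
```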

SLIDE 5

Syntactic Language Modeling

[Figure: the language model maps acoustic observations o1, ..., on to candidate word strings w1, ..., wn1 through w1, ..., wnm; an n-best list extractor selects paths from the word-lattice]

  • Adding syntactic information to context (conditioning information)

P(W) = ∏_{i=1}^{k} P(w_i | π(w_{i−1}, ..., w_1))

  • n-best reranking

    – Select n-best strings using some model (trigram)
    – Process each string independently
    – Select string with highest P(A, W)

  • Charniak (ACL01), Chelba & Jelinek (CS&L00,ACL02), Roark (CL01)
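The n-best reranking recipe above can be sketched in a few lines (the scorers and the 50-best cutoff below are placeholders standing in for a real trigram model and the syntactic model's P(A, W); the toy scores are illustrative):

```python
def rerank_nbest(hypotheses, cheap_score, full_score, n=50):
    """Select the n best strings under a cheap model (e.g. a trigram),
    rescore each string independently with the expensive model, and
    return the highest-scoring string."""
    shortlist = sorted(hypotheses, key=cheap_score, reverse=True)[:n]
    return max(shortlist, key=full_score)

# Toy log-probability scores (illustrative, not real model output)
trigram = {"the man is surly": -5.0, "the man is surely": -4.0, "duh man is surly": -9.0}
syntax  = {"the man is surly": -6.0, "the man is surely": -3.5, "duh man is surly": -8.0}
best = rerank_nbest(list(trigram), trigram.get, syntax.get)
```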

SLIDE 6

Parsing word-lattices

[Figure: compressed word-lattice for "the man is surely/surly early", with arcs labeled by words and negative log scores, and chart categories DT, NN, JJ, VB, VBZ, VP]

  • Compress lattice with weighted FSM determinization and minimization
    (Mohri, Pereira & Riley, CS&L02)

  • Use compressed word-lattice graph as the parse chart
  • Structure sharing due to compressed lattice

    VP → NN VB covers the string "man is"
    VP → VBZ covers the string "mans"
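A minimal sketch of the structure sharing this enables (toy arcs and a toy lexicon, not the paper's lattice or grammar): chart edges are keyed by lattice node spans, so a category built over a shared arc, here the "man" arc, is constructed once and reused by every path through it.

```python
# Toy lattice arcs: (start_node, end_node, word). Two paths ("the"/"duh")
# converge on the shared "man" arc.
arcs = [(1, 2, "the"), (1, 2, "duh"), (2, 3, "man"), (3, 4, "is")]
lexicon = {"the": "DT", "duh": "DT", "man": "NN", "is": "VBZ"}

chart = {}  # (start, end, category) -> number of arcs producing that edge
for start, end, word in arcs:
    key = (start, end, lexicon[word])
    chart[key] = chart.get(key, 0) + 1

# The NN edge over "man" exists once in the chart even though both the
# "the" path and the "duh" path pass through it.
```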

SLIDE 7

Word-lattice example

  • I WOULD NOT SUGGEST ANYONE MAKE A DECISION ON WHO TO VOTE FOR BASED ON A STUDY LIKE THIS (160 arcs, 72 nodes)

[Figure: full acoustic word-lattice for the utterance above, with arcs labeled by words and negative log acoustic scores]
  • compressed NIST ’93 HUB-1 lattices

    – average of 800 arcs/lattice (max 15,000 arcs)
    – average of 100 nodes/lattice (max 500 nodes)

SLIDE 8

Best-first Word-lattice Parsing

[Figure: best-first parsing over the compressed word-lattice; an agenda (priority queue) holds candidate edges such as NP(0,4) with figure-of-merit 0.567 and NN(3,9) with 0.550, scored using the grammar]

  • Bottom-up best-first PCFG parser
  • Stack-based search technique based on figure-of-merit
  • Attempts to work on “likely” parts of the chart
  • Ideal figure-of-merit: P(edge) = inside(edge) × outside(edge)

details in (Hall & Johnson ASRU03)
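The agenda-driven loop can be sketched as a generic best-first chart parser skeleton (this is not the authors' implementation; `expand` and `fom` below stand in for the grammar rules and the figure-of-merit):

```python
import heapq

def best_first_parse(initial_edges, expand, fom, max_pops=10000):
    """Repeatedly pop the highest figure-of-merit edge from the agenda,
    add it to the chart, and push any new edges it licenses.
    heapq is a min-heap, so FOM scores are negated."""
    agenda = [(-fom(e), e) for e in initial_edges]
    heapq.heapify(agenda)
    chart = set()
    for _ in range(max_pops):
        if not agenda:
            break
        _, edge = heapq.heappop(agenda)
        if edge in chart:
            continue
        chart.add(edge)
        for new_edge in expand(edge, chart):
            if new_edge not in chart:
                heapq.heappush(agenda, (-fom(new_edge), new_edge))
    return chart
```

With a one-rule toy grammar (NP → DT NN) as `expand`, popping the DT and NN edges produces the NP edge spanning both.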

SLIDE 9

Word-lattice Parsing

[Figure: two-stage architecture. First stage: word-lattice → compressed lattice → outside HMM → best-first PCFG parser → syntactic category lattice → inside-outside pruning → local trees. Second stage: lexicalized syntactic language model (Charniak parser) → word string from optimal parse]

  • First stage: best-first bottom-up PCFG parser
  • Second stage: Charniak parser language model (Charniak ACL01)

  • Parsing from lattice allows structure sharing
  • Combines search for candidate lattice paths with search for candidate parses

SLIDE 10

Multi-stage Deficiency

  • First-stage PCFG parser selects parses for a subset of word-lattice paths
  • Lexicalized syntactic analysis not performed on all of the word-lattice
  • Covering entire word-lattice requires excessive over-parsing

    – 100x over-parsing produces forests too large for the lexicalized parser
    – additional pruning required, resulting in loss of lattice paths

  • Attention shifting algorithm addresses the coverage problem

SLIDE 11

Attention Shifting

[Figure: flowchart: PCFG word-lattice parser → identify unused words → clear agenda and add edges for unused words → if agenda is not empty, repeat; otherwise continue multi-stage parsing]

  • Iterative reparsing:
    1. Perform best-first PCFG parsing (over-parse as with normal best-first parsing)
    2. Identify words not covered by a complete parse (an unused word has 0 outside probability)
    3. Reset the parse agenda to contain the unused words
    4. If the agenda is non-empty, repeat
  • Prune chart using inside/outside pruning
  • At most |A| iterations (|A| = number of arcs)
  • Forces coverage of word-lattice

SLIDE 12

Experimental Setup

  • PCFG Parser trained on Penn WSJ Treebank f2-21,24

(speech-normalization via Roark’s normalization)

– Generated at most 30k local-trees for second-stage parser

  • Lexicalized parser: Charniak’s Language Model Parser

(Charniak ACL01)

    – trained on parsed BLLIP99 corpus (30 million words of WSJ)
    – BLLIP99 parsed using the Charniak string parser trained on Penn WSJ

SLIDE 13

Evaluation

  • Evaluation set: NIST ’93 HUB-1

    – 213 utterances
    – professional readers reading WSJ text

  • Word-lattices evaluated on:

    – n-best word-lattices using Chelba A* decoder (50-best paths)
    – compressed acoustic word-lattices

  • Metrics

    – Word-lattice accuracy (first-stage parser): Oracle Word Error Rate
    – Word-string accuracy (multi-stage parser): Word Error Rate
    – Efficiency: number of parser agenda operations
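Word Error Rate is the word-level edit distance between hypothesis and reference, divided by the reference length; Oracle WER is the minimum WER over all paths in the lattice. A standard WER computation (a generic implementation, not the paper's evaluation scripts):

```python
def wer(reference, hypothesis):
    """Word error rate: minimum substitutions + insertions + deletions
    to turn the hypothesis into the reference, over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def oracle_wer(reference, lattice_paths):
    # best achievable WER over the candidate paths in the lattice
    return min(wer(reference, path) for path in lattice_paths)
```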

SLIDE 14

Results: n-best word-lattices

  • Charniak parser run on each of the n-best strings (reranking)

(4X over-parsing)

  • n-best word-lattice: pruned acoustic word-lattices containing only the n-best word-strings

  • Oracle WER of n-best lattices: 7.75

    Model               # edge pops   Oracle WER   WER
    n-best (Charniak)   2.5 million   7.75         11.8
    100x LatParse       3.4 million   8.18         12.0
    10x AttShift        564,895       7.78         11.9

SLIDE 15

Results: Acoustic word-lattices

  • Compressed acoustic lattices

    Model           # edge pops   Oracle WER   WER
    acoustic lats   N/A           3.26         N/A
    100x LatParse   3.4 million   5.45         13.1
    10x AttShift    1.6 million   4.17         13.1

SLIDE 16

Conclusion

  • Attention shifting

    – Improves parsing efficiency
    – Increases first-stage accuracy (correcting for best-first search errors)
    – Does not improve multi-stage accuracy

  • Pruning for second-stage parser constrains number of edges
  • Useful for best-first word-lattice parsing
