Recent Advances in Automatic Speech Summarization, Sadaoki Furui (PowerPoint PPT Presentation)




SLIDE 1

Recent Advances in Automatic Speech Summarization

Sadaoki Furui Department of Computer Science Tokyo Institute of Technology

SLIDE 2

Outline

  • Introduction
  • Speech-to-text & speech-to-speech summarization
  • Summarization methods

– Sentence extraction-based methods
– Sentence compaction-based methods
– Combination of sentence extraction and sentence compaction
– Sentence segmentation

  • Evaluation schemes

– Extrinsic and intrinsic evaluations
– SumACCY
– ROUGE
– Experimental results

  • Conclusions
SLIDE 3

Major speech recognition applications

  • Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)
  • Systems for transcribing, understanding and extracting information from ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations and voicemail) → Spoken Document Retrieval (SDR)

SLIDE 4

Figure: spoken document retrieval system at Univ. Colorado Boulder.

Enrollment (spoken document transcriber): audio entry → audio fetching & transcoder → speech recognition & audio tagging, audio segmentation & clustering → rich transcription (transcription, segmentation/cluster info, metadata) → audio archive & metadata construction → retrieval index.

Query & retrieval: user query → web server → audio clip requests → audio clips.

SLIDE 5

Figure: levels of analysis applied to the ASR transcription (multiple languages) in spoken document retrieval (SDR):
  • Word level: named entity detection
  • Entity level: people, locations, organizations
  • Building-block level: segmentation & diarization (style chunks, speaker turns, paragraphs)
  • Concept level: information extraction (titles, key concepts, relationships)
  • Topic level: document summarization (concise abstract of desired length)
  • Structure level: analysis & organization (information retrieval & browsing)
Machine translation bridges the multiple languages.

(J. Hansen, 2005)

SLIDE 6

Speech transcription and summarization for spoken document retrieval (SDR)

  • Although speech is the most natural and effective method of communication between human beings, it is not easy to quickly review, retrieve and reuse speech documents if they are simply recorded as audio signals.
  • Therefore, transcribing speech is expected to become a crucial capability for the coming IT era.
  • Speech summarization, which extracts important information and removes redundant and incorrect information, is necessary for transcribing spontaneous speech.
  • Efficient speech summarization saves time for reviewing speech documents and improves the efficiency of document retrieval.
  • Summarization results can be presented as either text or speech.
SLIDE 7

Classification of speech summarization methods

Audience

  • Generic summarization
  • User-focused summarization
  • Query-focused summarization
  • Topic-focused summarization

Function

  • Indicative summarization
  • Informative summarization

Extracts vs. abstracts

  • Extract: consists wholly of portions from the source
  • Abstract: contains material which is not present in the source

Output modality

  • Speech-to-text summarization
  • Speech-to-speech summarization

Single vs. multiple documents

SLIDE 8

Indicative vs. informative summarization

Figure: indicative summarization extracts topics and sentence(s) that point the user to the target raw utterance(s); informative summarization applies information extraction and speech understanding to produce an abstract and summarized utterance(s). (Presentation summarization.)

SLIDE 9

Fundamental problems with speech summarization

  • Disfluencies, repetitions, word fragments, etc.
  • Difficulties of sentence segmentation
  • More spontaneous parts of speech (e.g. interviews in broadcast news) are less amenable to standard text summarization

  • Speech recognition errors
SLIDE 10

Speech-to-text/speech summarization

Speech-to-text summarization:
a) The documents can be easily looked through
b) The parts of the documents that are interesting for users can be easily extracted
c) Information extraction and retrieval techniques can be easily applied to the documents

Speech-to-speech summarization:
a) Wrong information due to speech recognition errors can be avoided
b) Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be presented

SLIDE 11

Speech-to-speech summarization

  • Simply presenting concatenated speech segments that are extracted from the original speech, or
  • Synthesizing summarized text using a speech synthesizer.

– Since state-of-the-art speech synthesizers still cannot produce completely natural speech, the former method can more easily produce better-quality summaries, and it does not have the problem of synthesizing wrong messages due to speech recognition errors.
– The major problem is how to avoid the unnatural, noisy sound caused by concatenation.

SLIDE 12

Speech-to-text summarization methods

  • Sentence extraction-based methods
    – LSA-based methods
    – MMR-based methods
    – Feature-based methods
  • Sentence compaction-based methods
  • Combination of sentence extraction and sentence compaction

SLIDE 13

Speech-to-text summarization methods

  • Sentence extraction-based methods
    – LSA-based methods
    – MMR-based methods
    – Feature-based methods
  • Sentence compaction-based methods
  • Combination of sentence extraction and sentence compaction

SLIDE 14

Sentence clustering using SVD

Deriving a latent semantic structure from a presentation speech represented by the matrix A (M content words x N sentences):

    A = U Σ V^T,  Σ = diag(σ_1, σ_2, ..., σ_N)

U: left singular vector matrix (information of each word)
V: right singular vector matrix (information of each sentence)
Σ: singular value matrix

Element a_mn of the matrix A:

    a_mn = f_mn · log(F_A / F_m)

f_mn: number of occurrences of content word m in sentence n
F_m: number of occurrences of content word m in a large corpus

SVD semantically clusters content words and sentences.
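As a concrete sketch of the matrix construction and decomposition above: the toy sentences and the corpus counts below are illustrative stand-ins (not data from the talk), and numpy's `svd` supplies U, Σ and V^T.

```python
import numpy as np

# Toy term-sentence matrix with the weighting a_mn = f_mn * log(F/F_m).
# The sentences and the corpus counts F, F_m are made-up illustration values.
sentences = [
    ["speech", "summarization", "extracts", "information"],
    ["speech", "recognition", "errors", "hurt", "summarization"],
    ["sentence", "extraction", "selects", "important", "information"],
]
vocab = sorted({w for s in sentences for w in s})      # M content words
F = 1_000_000                                          # hypothetical corpus size
F_m = {w: 1000 for w in vocab}                         # hypothetical corpus counts

A = np.zeros((len(vocab), len(sentences)))             # M x N
for n, sent in enumerate(sentences):
    for w in sent:
        A[vocab.index(w), n] += 1                      # f_mn
for m, w in enumerate(vocab):
    A[m] *= np.log(F / F_m[w])                         # a_mn = f_mn * log(F/F_m)

# A = U diag(s) Vt: columns of U cluster content words, rows of Vt cluster sentences.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```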

SLIDE 15

LSA-based sentence extraction - 1

One of the summarization techniques using the SVD (Gong et al., 2001).

The rows of V^T are the right singular vectors v_k = (v_k1, v_k2, ..., v_kN); each singular vector represents a salient topic, and the singular vector with the largest corresponding singular value represents the topic that is most salient in the presentation speech.

For the k-th singular vector, choose the sentence having the largest component within that vector: this sentence best describes the topic represented by the singular vector.

The extracted sentences best describe the topics represented by the singular vectors and are semantically different from each other.
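A minimal sketch of this selection rule (not Gong et al.'s exact implementation): for each leading right singular vector, take the as-yet-unchosen sentence with the largest component. The random matrix stands in for a real term-sentence matrix.

```python
import numpy as np

def lsa_extract(A: np.ndarray, num_sentences: int) -> list[int]:
    """For each leading right singular vector (one salient topic),
    pick the sentence with the largest component in that vector."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    chosen: list[int] = []
    for k in range(num_sentences):
        for n in np.argsort(-np.abs(Vt[k])):   # sentences ranked for topic k
            if int(n) not in chosen:           # keep extracted sentences distinct
                chosen.append(int(n))
                break
    return chosen

rng = np.random.default_rng(0)
A = rng.random((8, 5))             # toy: 8 content words x 5 sentences
summary_idx = lsa_extract(A, 2)    # indices of 2 extracted sentences
```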

SLIDE 16

Drawbacks to the LSA-based method - 1

  • Dimensionality is tied to summary length, and good sentence candidates may not be chosen if they do not “win” in any dimension.
  • When singular vectors are selected incrementally, as the number of selected vectors increases, the chance that non-relevant topics get included in the summary also increases.

→ LSA-based method-2

SLIDE 17

LSA-based sentence extraction -2

Dimension reduction by SVD: each sentence is represented by a weighted singular-value vector. To evaluate each sentence, the score of sentence i is calculated as the norm in the K-dimensional reduced space:

    ψ(i) = sqrt( Σ_{k=1}^{K} (σ_k v_ik)^2 )

A_i = (a_1i, a_2i, a_3i, ..., a_Mi)^T  →(SVD)→  Â_i = (σ_1 v_i1, σ_2 v_i2, ..., σ_N v_iN)^T  →(dimension reduction)→  ψ_i = (σ_1 v_i1, ..., σ_K v_iK)^T

A fixed number of sentences having relatively large sentence scores in the reduced dimensional space are selected.
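The score ψ(i) above can be computed directly from the SVD; this is a hedged illustration, with K and the toy matrix chosen arbitrarily.

```python
import numpy as np

def lsa_scores(A: np.ndarray, K: int) -> np.ndarray:
    """Score each sentence i by the norm of its weighted singular-value
    vector in the K-dimensional reduced space:
    psi(i) = sqrt( sum_{k=1..K} (sigma_k * v_ik)^2 )."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sqrt(((s[:K, None] * Vt[:K]) ** 2).sum(axis=0))

rng = np.random.default_rng(1)
A = rng.random((10, 6))                 # toy: 10 content words x 6 sentences
scores = lsa_scores(A, K=3)
top_two = np.argsort(-scores)[:2]       # a fixed number of high-scoring sentences
```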

SLIDE 18

Sentence extraction from introduction and conclusion parts

Hypothesis: a presentation speech consists of introduction, main-subject and conclusion parts.

Under the condition of a 10% summarization ratio, human subjects tend to extract sentences from the introduction and conclusion parts → extract sentences from these parts.

The boundaries of the introduction and conclusion parts are detected via cohesiveness, measured by the cosine value between content-word frequency vectors consisting of a fixed number of content words.

(Figure: cohesiveness plotted against content-word position, 1-250.)

SLIDE 19

Subjective evaluation results represented by the normalized score

  • 180 automatic summaries (30 presentations x 6 summarization methods) were evaluated by 12 human subjects in terms of ease of understanding and appropriateness as summaries on five levels.
  • Scores were converted into factor scores to normalize subjective differences.
  • The IC method significantly improves summarization performance.
  • The difference between SIG+IC and DIM+IC is not significant.

Summarization methods:
SIG: sentence extraction by a significance score (amount of information)
LSA: LSA-based method-1
DIM: LSA-based method-2 (dimension reduction)
IC: beginning and ending period weighting

SLIDE 20

MMR-based method

  • Vector-space model of text retrieval
  • Particularly applicable to query-based and multi-document summarization
  • Chooses sentences via a weighted combination of their relevance to a query (or, for generic summaries, their general relevance) and their redundancy with sentences that have already been extracted, both derived using cosine similarity
  • MMR score for a given sentence S_i in the document:

    Sc_MMR(i) = λ Sim(S_i, D) − (1 − λ) Sim(S_i, Summ)

D: average document vector
Summ: average vector of the set of sentences already selected
λ: trade-off between relevance and redundancy (annealed)
Sim: cosine similarity between documents
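A minimal greedy loop implementing the score above; the sentence vectors are random stand-ins, D is taken as their mean, and λ is fixed here rather than annealed.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(S: np.ndarray, D: np.ndarray, num: int, lam: float = 0.7) -> list[int]:
    """Greedily pick sentences maximizing
    Sc_MMR(i) = lam * Sim(S_i, D) - (1 - lam) * Sim(S_i, Summ)."""
    chosen: list[int] = []
    while len(chosen) < num:
        summ = np.mean([S[j] for j in chosen], axis=0) if chosen else None
        best, best_sc = -1, -np.inf
        for i in range(len(S)):
            if i in chosen:
                continue
            redundancy = cos(S[i], summ) if chosen else 0.0
            sc = lam * cos(S[i], D) - (1 - lam) * redundancy
            if sc > best_sc:
                best, best_sc = i, sc
        chosen.append(best)
    return chosen

rng = np.random.default_rng(2)
S = rng.random((6, 4))      # toy sentence vectors
D = S.mean(axis=0)          # average document vector
picked = mmr_select(S, D, num=3)
```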

SLIDE 21

Feature-based method

  • Textual features
    – Named entities (person, organization and place names)
    – Mean and maximum TF-IDF scores
    – LSA sentence score
    – Topic significance scores and term entropy obtained through PLSA
    – Confidence score
  • Structural and discourse features
    – Structural features (sentence position, speaker type, etc.)
    – Discourse features (number of new nouns in each sentence, etc.)
  • Prosodic features
    – F0, energy, and duration (mean, standard deviation, minimum, maximum, range, slope, etc.)
    – Speaking rate

SLIDE 22

An example of a feature-based important-sentence extraction method.

For a sentence with N words, W = w_1, w_2, ..., w_N, the sentence extraction score is

    S(W) = (1/N) Σ_{i=1}^{N} [ L(w_i) + λ_I I(w_i) + λ_C C(w_i) ]

L(w_i): linguistic score (linguistic correctness: bigram/trigram)
I(w_i): significance (topic) score (important information extraction: amount of information)
C(w_i): confidence score (recognition error exclusion: acoustic & linguistic reliability)
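The per-word combination above reduces to a short function; the L, I and C tables here are made-up stand-ins for a trigram language model, topic-significance scores and ASR confidence scores.

```python
def sentence_score(words, L, I, C, lam_I=1.0, lam_C=1.0):
    """S(W) = (1/N) * sum_i [ L(w_i) + lam_I * I(w_i) + lam_C * C(w_i) ]"""
    return sum(L[w] + lam_I * I[w] + lam_C * C[w] for w in words) / len(words)

# Hypothetical per-word scores (illustration only).
L = {"speech": -1.2, "summarization": -2.0}   # linguistic (LM log-prob)
I = {"speech": 0.8, "summarization": 1.5}     # significance (topic)
C = {"speech": 0.9, "summarization": 0.7}     # recognizer confidence
score = sentence_score(["speech", "summarization"], L, I, C)
```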

SLIDE 23

Speech-to-text summarization methods

  • Sentence extraction-based methods
    – LSA-based methods
    – MMR-based methods
    – Feature-based methods
  • Sentence compaction-based methods
  • Combination of sentence extraction and sentence compaction

SLIDE 24

Sentence compaction

Figure: a set of words is extracted from each transcribed utterance at a specified ratio to form the summarized (compressed) sentence (e.g. extracting 7 words from 10 words: 70%).

SLIDE 25

Word extraction score

For a summarized sentence with M words, V = v_1, v_2, ..., v_M, the word extraction score is

    S(V) = Σ_{m=1}^{M} [ L(v_m | ... v_{m-1}) + λ_I I(v_m) + λ_C C(v_m) + λ_T Tr(v_m) ]

L(v_m | ... v_{m-1}): linguistic score (linguistic correctness: bigram/trigram)
I(v_m): significance (topic) score (important information extraction: amount of information)
C(v_m): confidence score (recognition error exclusion: acoustic & linguistic reliability)
Tr(v_m): word concatenation score (semantic correctness: word dependency probability)
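The additive score lends itself to dynamic programming over which words to keep; the sketch below only evaluates S(V) for one candidate word sequence, with hypothetical lookup tables standing in for the bigram LM, significance, confidence and dependency scores.

```python
import math

def compaction_score(seq, bigram_lp, sig, conf, dep,
                     lam_I=1.0, lam_C=1.0, lam_T=1.0):
    """S(V) = sum_m [ L(v_m|v_{m-1}) + lam_I*I(v_m) + lam_C*C(v_m) + lam_T*Tr(v_m) ]
    All four tables are hypothetical illustration values."""
    total, prev = 0.0, "<s>"
    for w in seq:
        total += (bigram_lp.get((prev, w), math.log(1e-6))   # L: bigram log-prob
                  + lam_I * sig.get(w, 0.0)                  # I: significance
                  + lam_C * conf.get(w, 0.0)                 # C: confidence
                  + lam_T * dep.get((prev, w), 0.0))         # Tr: dependency
        prev = w
    return total

bigram_lp = {("<s>", "cherry"): -1.0, ("cherry", "blossoms"): -0.5}
sig = {"cherry": 1.0, "blossoms": 1.2}
conf = {"cherry": 0.9, "blossoms": 0.8}
dep = {("cherry", "blossoms"): 0.7}
score = compaction_score(["cherry", "blossoms"], bigram_lp, sig, conf, dep)
```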

SLIDE 26

Word concatenation score

Example: in “the beautiful cherry blossoms in Japan”, dependencies within “the beautiful cherry blossoms” (phrase 1) and “in Japan” (phrase 2) are intra-phrase; the dependency between the phrases is inter-phrase. Compacting the sentence to “the beautiful Japan” is grammatically correct but incorrect as a summary.

→ A penalty is applied to word concatenations with no dependency in the original sentence.

SLIDE 27

Word concatenation score based on SDCFG

Word dependency probability (SDCFG: stochastic DCFG):
  • If the dependency structure between words is deterministic: 0 or 1.
  • If the dependency structure is ambiguous: the dependency probability between w_m and w_l, d(w_m, w_l, i, k, j), is calculated using inside-outside probability based on the SDCFG.

    Tr(w_m, w_n) = log Σ_{i=1}^{m} Σ_{k=m}^{n-1} Σ_{j=n}^{L} Σ_{l=n}^{j} d(w_m, w_l, i, k, j)

Word string w_1 ... w_i ... w_m ... w_k w_{k+1} ... w_n ... w_l ... w_j ... w_L, with inside and outside probabilities over the spans; S: initial symbol, α, β: non-terminal symbols, w: word.

SLIDE 28

Integration of ASR and sentence compaction by WFST (Hori et al., NTT)

  • Speech recognition with compaction
    – Transcribe the speech signal & extract important phrases, excluding recognition errors
  • Paraphrasing
    – Translate spoken language into written language

Pipeline: speech input → speech recognition with compaction → stochastic paraphrasing (hypotheses & scores) → summarization result, implemented with weighted finite-state transducers.

SLIDE 29

Integration of speech recognition and paraphrasing

O: feature vector sequence; W: source word sequence; T: target word sequence.

Speech recognition:

    Ŵ = argmax_W P(O|W) P(W)

Paraphrasing:

    T̂ = argmax_T P(Ŵ|T) P(T)

Speech summarization, by composition & optimization of the two steps:

    T̂ = argmax_T max_W P(O|W) P(W|T) P(T)

SLIDE 30

Extended Lexicon WFST for sentence compaction

Figure: the original word lexicon (phone-to-word arcs such as d:desu e:ε s:ε u:ε, a:aka k:ε a:ε, a:ao) is extended with a wildcard word (arcs ε:<sp>/λ, ε:ε).

Wildcard word:
  • Accepts an arbitrary phone sequence, weighted by phone 2-grams
  • Outputs an inter-phrase symbol (<sp>)
  • Controls the summarization ratio by the penetration weight λ

SLIDE 31

WFST for paraphrasing

Translation of spoken language into written language (W: source word sequence; T: target word sequence):

    T̂ = argmax_T P(W|T) P(T) ≈ argmax_T δ(W, T) P(T)

δ(W, T): word substitution model

The paraphrasing transducer is built by composition & optimization of the word substitution model and a 3-gram language model over T.

SLIDE 32

WFST for paraphrasing

  • Ex. OH, NO WAY. IT’S A PIECE OF CAKE.
    ⇒ I CANNOT BELIEVE IT. IT IS AN EASY TASK.

Figure: transducer with filler-deletion arcs (OH:ε, UM:ε) and substitution arcs (NO:I, WAY:CANNOT, ε:BELIEVE, ε:IT, IT’S:IT, ε:IS, A:AN, PIECE:EASY, OF:TASK, CAKE:ε).

*Any word can be replaced by itself with a cost ω (arc w:w/ω).

SLIDE 33

Speech summarization using WFST

Figure: a WFST decoder composes H (triphone HMMs), C (context dependency), L (extended lexicon), G (spoken-style N-gram), S (paraphrasing rules trained on a parallel corpus of transcriptions), and D (written-style N-gram trained on a written text corpus), mapping speech input O to the summarized result T̂.

SLIDE 34

A multi-stage compaction approach to broadcast news summarization (by Kolluru)

Figure: broadcast news → automatic speech recognizer (acoustic model, language model; handles speech disfluencies, outputs MLP confidence scores, prosodic cues, and sentence & story boundaries) → POS tagger → partial parser (up to 3 levels of parsed output) → named entity tagger and co-reference module → MLP chunker using lexical cues, confidence score, mean tf.idf, sum tf.idf, named entity frequency and chunk length → 100-word summary.
SLIDE 35

Combination of sentence extraction and compaction

Pipeline: spontaneous speech → speech recognition (acoustic model, language model) → sentence segmentation (recognition results) → sentence extraction → sentence compaction (word dependency probability from a manually parsed corpus; word posterior probability) → summary: records, minutes, captions, indexes.

Resources: speech corpus; large-scale text corpus (language model, word frequency); summary corpus (summarization language model).

SLIDE 36

2-stage dynamic programming for summarizing multiple sentences

Recognition result (per sentence): <s> w_1 w_2 ... </s>, for sentences 1, 2, 3, ...
Summarization hypothesis: <s> v_1,1 v_1,2 ... </s> <s> v_2,1 v_2,2 ... </s> <s> v_3,1 v_3,2 ... </s>
The overall summarization ratio (0%-100%) is distributed over the sentences.

* Initial and terminal symbols cannot be skipped.
* The word concatenation score is not applied at sentence boundaries.

SLIDE 37

Sentence segmentation

  • Speech recognition results have no punctuation or proper segmentation.
  • Readability and usability of transcripts can be significantly improved by segmenting text into logical units such as sentences.
  • Segmentation has a significant effect on further processing of speech, such as information extraction, topic detection and summarization.
  • Prosodic and N-gram features have been employed.
  • Due to poor grammatical structure, an unclear definition of sentences, disfluencies, and incorrectly recognized words, sentence segmentation of speech is still difficult.

SLIDE 38

Evaluation schemes

  • The quality of a summary depends on how it is used, how readable an individual finds it, and what information an individual thinks should be included.
  • Extrinsic evaluation: assessed in a task-based setting, e.g. an information browsing and access interface (ideal, but time-consuming and expensive).
  • Intrinsic evaluation: assessed in a task-independent setting (normally employed).
  • Subjective evaluation: too costly.
  • Objective evaluation: essential (using manual summaries, which vary according to human subjects, as targets).

SLIDE 39

Objective evaluation methods

  • Summarization accuracy using a network merging manual summaries (SumACCY) (Hori et al., 2001)
  • Summarization accuracy weighted by the majority of manual summaries (WSumACCY) (Hori et al., 2003)
  • Summarization accuracy using individual manual summaries (SumACCY-E) (Hirohata et al., 2004)
  • N-gram precision (Hori et al., 2000)
  • Number of overlapping n-grams (ROUGE-N) (Lin et al., 2003)
  • Sentence recall/precision (Kitade et al., 2004)
SLIDE 40

Summarization accuracy

Variations of manual summarization results are merged into a word network. The word network is considered to approximately express all possible correct summarizations, covering subjective variations. The word accuracy of an automatic summarization, calculated against the word network, is the summarization accuracy (SumACCY).

At a 10% summarization ratio the variations are too large compared to 50% → inappropriate summaries. In SumACCY-E, the word accuracy of an automatic summarization is instead calculated against the manual summaries individually (not using a network):
  • SumACCY-E/max: largest score of the word accuracy
  • SumACCY-E/ave: average score of the word accuracy

SLIDE 41
Summarization accuracy (SumACCY)

  • Human summaries are merged into a single word network.
  • The network approximately covers all possible correct summaries, including subjective variations.
  • SumACCY is defined as word accuracy based on a word string, extracted from the word network, that is closest to the automatic summarization result.

    SumACCY = {Len − (Sub + Ins + Del)} / Len × 100 [%]

Len: number of words in the most similar word string in the network
Sub: number of substitution errors
Ins: number of insertion errors
Del: number of deletion errors

(Example network: variations of “the beautiful cherry blossoms bloom in spring in Japan”.)
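Once the closest word string is found, SumACCY reduces to word accuracy. A minimal sketch using Levenshtein alignment against a single manual summary (a real SumACCY aligns against the merged network; the example sentences are illustrative):

```python
def word_accuracy(ref: list[str], hyp: list[str]) -> float:
    """SumACCY-style score: {Len - (Sub+Ins+Del)} / Len * 100, with the
    error counts taken from a Levenshtein alignment of the word strings."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return (n - d[n][m]) / n * 100.0

ref = "beautiful cherry blossoms bloom in spring".split()
hyp = "cherry blossoms bloom in spring".split()   # one deletion
acc = word_accuracy(ref, hyp)                     # (6 - 1) / 6 * 100
```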

SLIDE 42

ROUGE-N

ROUGE-N: N-gram recall between an automatic summary and a set of manual summaries (N-grams: 1-grams, 2-grams and 3-grams). ROUGE-N is computed as follows:

    ROUGE-N = Σ_{S ∈ S_H} Σ_{g_n ∈ S} C_m(g_n) / Σ_{S ∈ S_H} Σ_{g_n ∈ S} C(g_n)

S_H: the set of manual summaries
S: an individual summary
g_n: an N-gram
C(g_n): number of occurrences of g_n in the manual summary
C_m(g_n): number of co-occurrences of g_n in the manual summary and the automatic summary
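The formula above is a clipped n-gram recall over the set of manual summaries; a minimal sketch with toy summaries:

```python
from collections import Counter

def rouge_n(auto: list[str], manuals: list[list[str]], n: int = 2) -> float:
    """ROUGE-N = sum over manual summaries of co-occurring n-gram counts
    C_m(g_n), divided by the total n-gram count C(g_n) in the manuals."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    auto_counts = ngrams(auto)
    match = total = 0
    for manual in manuals:
        ref = ngrams(manual)
        total += sum(ref.values())                                    # C(g_n)
        match += sum(min(c, auto_counts[g]) for g, c in ref.items())  # C_m(g_n)
    return match / total if total else 0.0

manuals = [["the", "cat", "sat", "on", "the", "mat"],
           ["a", "cat", "sat", "on", "a", "mat"]]
auto = ["the", "cat", "sat", "on", "the", "mat"]
r1 = rouge_n(auto, manuals, n=1)    # 10 matched / 12 reference unigrams
```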

SLIDE 43

Correlation between subjective and objective evaluation scores (averaged over presentations)

In the subjective evaluation, the summaries were evaluated in terms of ease of understanding and appropriateness as summaries on five levels.

SLIDE 44

Correlation between subjective and objective evaluation scores (each presentation)

SLIDE 45

Conclusions

  • Although various automatic speech summarization techniques have been proposed and tested, their performance is still much worse than that of manual summarization.
  • In order to build truly useful speech summarization systems applicable to real applications, we definitely need more efficient and speech-focused techniques, including sentence (utterance) segmentation methods.
  • Whether or not the proposed objective evaluation measures correlate well with human judgments remains to be determined through further experiments by researchers using various corpora. There is still large room for improvement in the objective measures.
  • It is also necessary to evaluate summaries extrinsically within the context of applications, instead of only using intrinsic evaluation methods.