Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts


SLIDE 1

Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts

Gerald Friedland, Luke Gottlieb, Adam Janin
International Computer Science Institute (ICSI)
Presented by: Katya Gonina

SLIDE 2

What?

Novel method to generate indexing information for the navigation of TV content

SLIDE 3

Why?

 Lots of different ways to watch videos
   DVD, Blu-ray
   On-demand
   Internet
 Lots of videos out there!
 Need better ways to navigate content
   Show a particular scene
   Show where a favorite actor talks
 Support random seek into videos

SLIDE 4

Example: Sitcoms

 Specifically “Seinfeld”
 Strict set of rules
   Every scene transition is marked by music
   Every punchline is marked by artificial laughter
 Video: http://www.youtube.com/watch?v=PaPxSsK6ZQA

SLIDE 5

Outline

1. Original Joke-O-Mat (2009)
   • System setup
   • Evaluation
   • Limitations
2. Enhanced version (2010)
   • System setup
   • Evaluation
3. Future Work

SLIDE 6

Outline

1. Original Joke-O-Mat (2009)
   • System setup
   • Evaluation
   • Limitations
2. Enhanced version (2010)
   • System setup
   • Evaluation
3. Future Work

SLIDE 7

Joke-O-Mat

 Original system (2008-2009)
 Ability to navigate basic narrative elements:
   Scenes
   Punchlines
   Dialog segments
 Per-actor filter
 Ability to skip certain parts
 “Surf” the episode

“Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes,” G. Friedland, L. Gottlieb, and A. Janin, Proceedings of the 11th IEEE International Symposium on Multimedia (ISM2009), San Diego, California, pp. 511-516.

SLIDE 8

Joke-O-Mat

 Two main elements:
1. Pre-processing step
2. Online video browser

SLIDE 9

Joke-O-Mat

 Two main elements:
1. Pre-processing and analysis step

SLIDE 10

Acoustic Event & Speaker Identification

 Goal: Train GMMs for different audio events
   Jerry, Kramer, Elaine, George
   Male & female supporting actors
   Laughter
   Music
   Non-speech (i.e., other noises)
 Use a 1-minute audio sample per class
 Compute 19-dimensional MFCCs
 Train 20-component GMMs
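The training step above can be sketched as follows; this is a minimal illustration using scikit-learn, with random vectors standing in for real 19-dimensional MFCC features (in practice these would be extracted from the labeled 1-minute audio samples, e.g. with a feature extractor such as librosa; class names and data here are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for 19-dim MFCC features, one labeled
# 1-minute sample per acoustic class.
classes = ["jerry", "kramer", "elaine", "george", "laughter", "music", "nonspeech"]
train_feats = {c: rng.normal(loc=i, scale=1.0, size=(600, 19))
               for i, c in enumerate(classes)}

# One 20-component GMM per acoustic event class, as on the slide.
models = {c: GaussianMixture(n_components=20, covariance_type="diag",
                             random_state=0).fit(X)
          for c, X in train_feats.items()}
```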

SLIDE 11

Audio Segmentation

 Given the trained GMMs, classify 2.5-sec windows (2.5 sec * 10 ms frames = 250 frames)
 Compute the likelihood of each frame's features under each GMM
 Use a majority vote to classify the window as one of the speakers or as laughter/music/non-speech
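The per-window majority vote could look like the following sketch; two toy models stand in for the full set of trained class GMMs (names and data are illustrative only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two toy GMMs standing in for the trained class models.
models = {
    "jerry":    GaussianMixture(2, random_state=0).fit(rng.normal(0, 1, (200, 19))),
    "laughter": GaussianMixture(2, random_state=0).fit(rng.normal(4, 1, (200, 19))),
}

def classify_window(frames, models):
    """Majority vote over per-frame log-likelihoods (250 frames = 2.5 s)."""
    names = list(models)
    # score_samples returns the log-likelihood of each frame under a GMM.
    ll = np.stack([models[n].score_samples(frames) for n in names])  # (C, T)
    votes = ll.argmax(axis=0)                   # best class per frame
    counts = np.bincount(votes, minlength=len(names))
    return names[counts.argmax()]

window = rng.normal(4, 1, (250, 19))            # frames resembling "laughter"
print(classify_window(window, models))          # -> laughter
```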

SLIDE 12

Narrative Theme Analysis

 Transforms the acoustic event segmentation and speaker detection into narrative theme segments
 Rule-based system:
   Dialog = single contiguous speech segment
   Punchline = dialog + laughter
   Top-5 punchlines = the 5 punchlines followed by the longest laughter
   Scene = segment of at least 10 sec between two music events
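The rules above can be sketched directly on a labeled segment list; the timings and labels below are invented for illustration:

```python
# (label, start, end) in seconds -- toy output of the audio segmentation step
segs = [("music", 0, 3), ("jerry", 3, 10), ("laughter", 10, 13),
        ("george", 13, 20), ("music", 20, 24), ("elaine", 24, 40),
        ("laughter", 40, 46), ("music", 46, 50)]

speakers = {"jerry", "george", "elaine"}

# Dialog = contiguous speech segment; punchline = dialog followed by laughter.
dialogs, punchlines = [], []
for i, (lab, s, e) in enumerate(segs):
    if lab in speakers:
        dialogs.append((s, e))
        if i + 1 < len(segs) and segs[i + 1][0] == "laughter":
            # laughter duration ranks the top-5 punchlines
            punchlines.append((s, e, segs[i + 1][2] - segs[i + 1][1]))

# Scene = stretch of at least 10 s between two music events.
music_times = [(s, e) for lab, s, e in segs if lab == "music"]
scenes = [(e1, s2) for (_, e1), (s2, _) in zip(music_times, music_times[1:])
          if s2 - e1 >= 10]

top5 = sorted(punchlines, key=lambda p: -p[2])[:5]
```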

SLIDE 13

Narrative Theme Analysis

 Creates icons for the GUI
 Sitcom rule: an actor has to be shown once a certain speaking time is exceeded
   Use the median frame of the longest speech segment for each actor
   A visual approach could be used here
 Use the median frame for the other events (scenes, punchlines, dialogs)

SLIDE 14

Online Video Browser

 Shows video
 Allows play/pause and seeking to random positions
 Navigational panel allows browsing directly to:
   Scene
   Punchline
   Top-5 punchlines
   Dialog element
 Select/deselect actors
 http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html

SLIDE 15

Evaluation

Phase                     Performance     Time (25-min episode)
Training                  30% real-time   2.7 min
Classification            10% real-time   2.5 min
Narrative Theme Analysis  10% real-time   2.5 min
Total                                     7.7 min

 Diarization Error Rate (DER) = 46%, 5% per class
 Winner of the ACM Multimedia Grand Challenge 2009

SLIDE 16

Limitations of the original Joke-O-Mat

 Requires manual training of speaker models
   Requires 60 seconds of training data for each speaker
   Cannot support actors with minor roles
 Does not take into account what was said

SLIDE 17

Outline

1. Original Joke-O-Mat (2009)
   • System setup
   • Evaluation
   • Limitations
2. Enhanced version (2010)
   • System setup
   • Evaluation
3. Future Work

SLIDE 18

Extended System

 Enhanced Joke-O-Mat (2010)
   + Speech recognition
   + Keyword search
 Automatic alignment of speaker ID and ASR output with:
   Fan-generated scripts
   Closed captions
 Significantly reduces manual intervention

SLIDE 19

New Joke-O-Mat System

SLIDE 20

New Joke-O-Mat System

SLIDE 21

Context-Augmentation

 Producing transcripts can be costly
 Luckily, we have the Internet!
   Scripts and closed captions produced by fans

SLIDE 22

Fan-generated data

 Fan-sourced scripts
   Tend to be very accurate
   However, don't contain any time information
 Closed captions
   Contain time information
   However, do not contain speaker attribution
   Less accurate, often intentionally altered
 Normalize and merge them together…

SLIDE 23

Fan-generated data

 Normalize the scripts and the closed captions
 Then use minimum edit distance to align the two sources
 Start & end words in the script = start & end words in the caption
 Use timing from the closed caption, speaker from the script
 If one speaker → single-speaker segment
 If multiple speakers → multi-speaker segment (37.3%)
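The merge step can be sketched with Python's `difflib.SequenceMatcher`, which computes an edit-distance-style alignment of two word sequences; the normalized word streams and timings below are invented for illustration:

```python
import difflib

# Hypothetical normalized word streams (lowercased, punctuation stripped).
# Script: (speaker, word); caption: (time span, word).
script  = [("JERRY", "these"), ("JERRY", "pretzels"), ("JERRY", "are"),
           ("JERRY", "making"), ("JERRY", "me"), ("JERRY", "thirsty")]
caption = [("12.0-12.8", "pretzels"), ("12.8-13.4", "are"),
           ("13.4-14.1", "making"), ("14.1-14.6", "me"),
           ("14.6-15.2", "thirsty")]

# Align the two word sequences (equivalent in spirit to minimum edit distance).
sm = difflib.SequenceMatcher(a=[w for _, w in script],
                             b=[w for _, w in caption])
aligned = []
for a0, b0, n in sm.get_matching_blocks()[:-1]:   # last block is a sentinel
    for k in range(n):
        spk = script[a0 + k][0]         # speaker comes from the script
        t   = caption[b0 + k][0]        # timing comes from the closed caption
        aligned.append((t, spk, caption[b0 + k][1]))

print(aligned[0])  # ('12.0-12.8', 'JERRY', 'pretzels')
```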

SLIDE 24

Forced Alignment

 Generate detailed timing information for each word
 Perform all the steps of a speech recognizer on the audio
 But instead of using a language model, use only the transcript's sequence of words
 Also does speaker adaptation over segments
   Will be more accurate on speaker-homogeneous segments

Audio + Transcript = Alignment

SLIDE 25

Forced Alignment

 Run forced alignment on each segment
 For the 10 episodes tested, 90% of the segments aligned at the first step
   Start & end time of each word
   Speaker attribution

SLIDE 26

Forced Alignment

 Pool the segments for each speaker and train speaker models
 Also train a garbage model
   On the audio that falls between the segments
   Assumed to contain only laughter, music, and other non-speech

SLIDE 27

Forced Alignment

 For the failed single-speaker segments:
   Still use the segment start and end times
   Don't have a way to index the exact temporal location of each word
 For each failed multi-speaker segment:
   Generate an HMM alternating:
     Speaker states
     Garbage states

SLIDE 28

Forced Alignment

 For each time step, advance an arc and collect probability
   Ex: if moving across the “Patrice” arc, invoke the “Patrice” speaker model at that time step
 Segmentation = most probable path through the HMM
 Garbage model allows for arbitrary noise between speakers
 Minimum duration for each speaker
   In practice, the system was not sensitive to the duration
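The most-probable-path search above can be sketched as a small Viterbi pass over a left-to-right HMM of alternating speaker and garbage states; the per-frame likelihoods below are toy numbers, not real model scores, and the minimum-duration constraint is omitted for brevity:

```python
import numpy as np

# Toy per-frame log-likelihoods under each state's model (rows: states).
states = ["jerry", "garbage", "elaine"]
ll = np.log(np.array([
    [.8, .7, .1, .1, .1, .1],   # jerry model
    [.1, .2, .8, .1, .1, .1],   # garbage model
    [.1, .1, .1, .8, .8, .8],   # elaine model
]))

# Left-to-right transitions: stay in a state or advance to the next one.
T, S = ll.shape[1], len(states)
NEG = -1e9
dp = np.full((S, T), NEG)
back = np.zeros((S, T), int)
dp[0, 0] = ll[0, 0]
for t in range(1, T):
    for s in range(S):
        stay = dp[s, t - 1]
        adv = dp[s - 1, t - 1] if s > 0 else NEG
        if stay >= adv:
            dp[s, t], back[s, t] = stay + ll[s, t], s
        else:
            dp[s, t], back[s, t] = adv + ll[s, t], s - 1

# Backtrace the most probable path (it must end in the last state).
path, s = [], S - 1
for t in range(T - 1, -1, -1):
    path.append(states[s])
    s = back[s, t]
path.reverse()
print(path)  # -> ['jerry', 'jerry', 'garbage', 'elaine', 'elaine', 'elaine']
```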

SLIDE 29

Forced Alignment

 Multi-speaker segments → many single-speaker segments
 Run the forced alignment with ASR again

SLIDE 30

Music & Laughter Segmentation

 Laughter is decoded using the SHoUT speech/non-speech decoder
 Music models are trained separately (same as in the original Joke-O-Mat)

SLIDE 31

Putting it all together

http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html

SLIDE 32

Evaluation

 Compare to expert-annotated ground truth
1. DER
   False alarms: closed captions spanning multiple dialog segments
   Missed speech: truncation of words in forced alignment

SLIDE 33

Evaluation

 Compare to expert-annotated ground truth
2. User study
   25 participants
   Randomly shown expert- and fan-annotated episodes
   Asked to state a preference

SLIDE 34

Outline

1. Original Joke-O-Mat (2009)
   • System setup
   • Evaluation
   • Limitations
2. Enhanced version (2010)
   • System setup
   • Evaluation
3. Future Work

SLIDE 35

Limitations & Future Work

 Laughter and scene-transition music are still manually trained
 Requires scripts and closed captions
   Available from show producers
 Failed single-speaker segments: how to handle them?
   Retrain speaker models
   HMM for the whole episode
 Look at other genres (dramas, soap operas, lectures?)
   New rules
 Add visual data

SLIDE 36

Thanks!