SLIDE 1
Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts
Gerald Friedland, Luke Gottlieb, Adam Janin
International Computer Science Institute (ICSI)
Presented by: Katya Gonina

What? A novel method to generate indexing information
SLIDE 2
SLIDE 3
Why?
Lots of different ways to watch videos: DVD, Blu-ray, on-demand, Internet
Lots of videos out there! Need better ways to navigate content:
- Show a particular scene
- Show where a favorite actor talks
- Support random seek into videos
SLIDE 4
Example: Sitcoms
Specifically "Seinfeld", which follows a strict set of rules:
- Every scene transition is marked by music
- Every punchline is marked by artificial laughter
Video: http://www.youtube.com/watch?v=PaPxSsK6ZQA
SLIDE 5
Outline
1. Original Joke-O-Mat (2009)
   - System setup
   - Evaluation
   - Limitations
2. Enhanced version (2010)
   - System setup
   - Evaluation
3. Future Work
SLIDE 6
Outline
1. Original Joke-O-Mat (2009)
   - System setup
   - Evaluation
   - Limitations
2. Enhanced version (2010)
   - System setup
   - Evaluation
3. Future Work
SLIDE 7
Joke-O-Mat
Original system (2008-2009)
Ability to navigate basic narrative elements:
- Scenes
- Punchlines
- Dialog segments
Per-actor filter
Ability to skip certain parts and "surf" the episode

"Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes", G. Friedland, L. Gottlieb, and A. Janin, Proceedings of the 11th IEEE International Symposium on Multimedia (ISM2009), San Diego, California, pp. 511-516
SLIDE 8
Joke-O-Mat
Two main elements:
1. Pre-processing step
2. Online video browser
SLIDE 9
Joke-O-Mat
Two main elements:
1. Pre-processing and analysis step
SLIDE 10
Acoustic Event & Speaker Identification
Goal: Train GMMs for different audio events:
- Jerry, Kramer, Elaine, George
- Male & female supporting actors
- Laughter
- Music
- Non-speech (i.e., other noises)
Use a 1-minute audio sample per class; compute 19-dimensional MFCCs; train 20-component GMMs
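The per-class training step can be sketched as follows. This is a minimal sketch, not the authors' code: scikit-learn's `GaussianMixture` stands in for whatever GMM trainer was used, and random arrays stand in for the 19-dimensional MFCCs extracted from each 1-minute sample (60 s at a 10 ms frame step, roughly 6000 frames per class).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for 19-dim MFCC features, ~6000 frames per 1-minute sample.
# The class names and feature values are illustrative, not real data.
features_per_class = {
    "jerry": rng.normal(0.0, 1.0, size=(6000, 19)),
    "laughter": rng.normal(3.0, 1.0, size=(6000, 19)),
}

# One 20-component GMM per acoustic class, as described on the slide.
models = {
    name: GaussianMixture(n_components=20, covariance_type="diag",
                          random_state=0).fit(feats)
    for name, feats in features_per_class.items()
}
```

In the full system there would be one such model per main actor plus the supporting-actor, laughter, music, and non-speech classes.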
SLIDE 11
Audio Segmentation
Given the trained GMMs, segment the audio into 2.5-second windows (2.5 s at a 10 ms frame step = 250 frames). Compute the likelihood of each window's features under each GMM, then use a majority vote over frames to classify the window as one of the speakers or as laughter/music/non-speech.
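The majority-vote classification of one window might look like the sketch below. The models, feature dimensions, and synthetic test window are all hypothetical stand-ins; the real system scores real MFCC frames against the trained class GMMs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical pre-trained class GMMs (here fit on synthetic data).
models = {
    "jerry": GaussianMixture(2, random_state=0).fit(rng.normal(0, 1, (500, 19))),
    "laughter": GaussianMixture(2, random_state=0).fit(rng.normal(3, 1, (500, 19))),
}

def classify_window(frames, models):
    """Majority vote over per-frame log-likelihoods.

    frames: (250, 19) array -- one 2.5 s window of 10 ms MFCC frames.
    """
    # score_samples gives the per-frame log-likelihood under each GMM.
    scores = np.stack([m.score_samples(frames) for m in models.values()])
    winners = np.argmax(scores, axis=0)          # best class per frame
    counts = np.bincount(winners, minlength=len(models))
    return list(models)[int(np.argmax(counts))]  # majority class

window = rng.normal(3, 1, (250, 19))             # laughter-like window
print(classify_window(window, models))           # -> "laughter"
```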
SLIDE 12
Narrative Theme Analysis
Transforms the acoustic event segmentation and speaker detection into narrative theme segments.
Rule-based system:
- Dialog = single contiguous speech segment
- Punchline = dialog + laughter
- Top-5 punchlines = the 5 punchlines followed by the longest laughter
- Scene = segment of at least 10 sec between two music events
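The rules above can be sketched in a few lines of Python. The (label, start, end) segments below are a hypothetical output of the audio segmenter, invented purely for illustration.

```python
# Hypothetical segmenter output: (label, start_s, end_s).
segments = [
    ("music", 0.0, 4.0),
    ("jerry", 4.0, 9.0),       # dialog
    ("laughter", 9.0, 12.0),   # makes the previous dialog a punchline
    ("george", 12.0, 16.0),    # dialog not followed by laughter
    ("music", 16.0, 20.0),
]
speech = {"jerry", "george", "elaine", "kramer"}

# Dialog = single contiguous speech segment.
dialogs = [s for s in segments if s[0] in speech]

# Punchline = dialog immediately followed by laughter.
punchlines = [a for a, b in zip(segments, segments[1:])
              if a[0] in speech and b[0] == "laughter"]

# Top-5 punchlines = the punchlines followed by the longest laughter.
def laugh_len(p):
    nxt = segments[segments.index(p) + 1]
    return nxt[2] - nxt[1]
top5 = sorted(punchlines, key=laugh_len, reverse=True)[:5]

# Scene = a stretch of at least 10 s between two music events.
music = [(s[1], s[2]) for s in segments if s[0] == "music"]
scenes = [(end_a, start_b)
          for (_, end_a), (start_b, _) in zip(music, music[1:])
          if start_b - end_a >= 10.0]
```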
SLIDE 13
Narrative Theme Analysis
Creates icons for the GUI.
Sitcom rule: an actor has to be shown once a certain speaking time is exceeded.
Use the median frame of the longest speech segment for each actor (a visual approach could be used here instead).
Use the median frame for other events (scene, punchlines, dialog).
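The icon-selection rule fits in a few lines; the segment times and the 25 fps frame rate below are assumed values for illustration only.

```python
# Median frame of each actor's longest speech segment, used as that
# actor's icon. Segment times and the frame rate are hypothetical.
segments = {  # actor -> list of (start_s, end_s) speech segments
    "jerry": [(4.0, 9.0), (30.0, 42.0)],
    "elaine": [(12.0, 15.0)],
}
FPS = 25  # assumed video frame rate

def icon_frame(actor_segments, fps=FPS):
    start, end = max(actor_segments, key=lambda s: s[1] - s[0])
    return int(round((start + end) / 2 * fps))  # median frame index

print(icon_frame(segments["jerry"]))  # -> 900 (midpoint of the 30-42 s segment)
```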
SLIDE 14
Online Video Browser
Shows the video; allows play/pause and seeking to random positions.
Navigation panel allows browsing directly to:
- Scene
- Punchline
- Top-5 punchlines
- Dialog element
Select/deselect actors
http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html
SLIDE 15
Evaluation
Phase                    | Performance   | For 25-min episode
Training                 | 30% real-time | 2.7 min
Classification           | 10% real-time | 2.5 min
Narrative Theme Analysis | 10% real-time | 2.5 min
Total                    |               | 7.7 min
Diarization Error Rate (DER) = 46% (~5% per class)
Winner of the ACM Multimedia Grand Challenge 2009
SLIDE 16
Limitations of the original Joke-O-Mat
Requires manual training of speaker models (60 seconds of training data for each speaker)
Cannot support actors with minor roles
Does not take into account what was said
SLIDE 17
Outline
1. Original Joke-O-Mat (2009)
   - System setup
   - Evaluation
   - Limitations
2. Enhanced version (2010)
   - System setup
   - Evaluation
3. Future Work
SLIDE 18
Extended System
Enhanced Joke-O-Mat (2010):
+ Speech recognition
+ Keyword search
+ Automatic alignment of speaker ID and ASR with:
  - Fan-generated scripts
  - Closed captions
Significantly reduces manual intervention
SLIDE 19
New Joke-O-Mat System
SLIDE 20
New Joke-O-Mat System
SLIDE 21
Context-Augmentation
Producing transcripts can be costly. Luckily, we have the Internet!
Scripts and closed captions are produced by fans.
SLIDE 22
Fan-generated data
Fan-sourced scripts:
- Tend to be very accurate
- However, don't contain any time information
Closed captions:
- Contain time information
- However, do not contain speaker attribution
- Less accurate, often intentionally altered
Normalize and merge them together…
SLIDE 23
Fan-generated data
Normalize the scripts and the closed captions.
Then use minimum edit distance to align the two sources (start & end words in the script = start & end words in the caption).
Use the timing from the closed captions and the speaker labels from the script:
- If one speaker → single-speaker segment
- If multiple speakers → multi-speaker segment (37.3%)
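The word-level minimum-edit-distance alignment can be sketched with standard Levenshtein dynamic programming. The backtrace below keeps only exact word matches, which is where caption timing can be transferred onto script words; the example words are invented, and this is a simplification of whatever alignment the authors actually used.

```python
# Align normalized script words against caption words by minimum
# edit distance; return (script_index, caption_index) match pairs.
def align(script, caption):
    n, m = len(script), len(caption)
    # dist[i][j] = edit distance between script[:i] and caption[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if script[i - 1] == caption[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # match/substitute
    # Backtrace, keeping only exact word matches.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if script[i - 1] == caption[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

script = "no soup for you".split()
caption = "no soup 4 you".split()   # captions are often altered
print(align(script, caption))       # -> [(0, 0), (1, 1), (3, 3)]
```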
SLIDE 24
Forced Alignment
Generate detailed timing information for each word.
Perform all steps of a speech recognizer on the audio, but instead of a language model, use only the transcript's sequence of words.
Also performs speaker adaptation over segments; this is more accurate on speaker-homogeneous segments.
Audio + Transcript → Alignment
SLIDE 25
Forced Alignment
Run forced alignment on each segment.
For the 10 episodes tested, 90% of the segments aligned at the first step, yielding:
- Start time & end time of each word
- Speaker attribution
SLIDE 26
Forced Alignment
Pool the segments for each speaker and train speaker models.
Also train a garbage model on the audio that falls between the segments, assuming it contains only laughter, music, and other non-speech.
SLIDE 27
Forced Alignment
For the failed single-speaker segments:
- Still use the segment start and end time
- No way to index the exact temporal location of each word
For each failed multi-speaker segment:
- Generate an HMM alternating speaker states and garbage states
SLIDE 28
Forced Alignment
For each time step, advance an arc and collect its probability (e.g., moving across the "Patrice" arc invokes the "Patrice" speaker model at that time step).
The segmentation is the most probable path through the HMM; the garbage model allows for arbitrary noise between speakers.
A minimum duration is enforced for each speaker, although in practice the system was not sensitive to the duration.
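The most-probable-path search can be sketched as a left-to-right Viterbi pass over the alternating speaker/garbage state sequence. The state order and the per-frame log-likelihoods below are hypothetical; in the real system they would come from the trained speaker and garbage GMMs.

```python
import math

# States for one failed multi-speaker segment: the known speaker
# order with a garbage state allowed between speakers.
states = ["jerry", "garbage", "george"]

# Hypothetical per-frame log-likelihoods (6 frames).
loglik = {
    "jerry":   [-1, -1, -5, -6, -6, -6],
    "garbage": [-4, -4, -1, -4, -4, -4],
    "george":  [-6, -6, -5, -1, -1, -1],
}
T = len(loglik["jerry"])
NEG = -math.inf

# Left-to-right Viterbi: at each frame, either stay in the current
# state or advance from the previous state in the sequence.
viterbi = [[NEG] * T for _ in states]
viterbi[0][0] = loglik[states[0]][0]
for t in range(1, T):
    for s, name in enumerate(states):
        stay = viterbi[s][t - 1]
        advance = viterbi[s - 1][t - 1] if s > 0 else NEG
        viterbi[s][t] = max(stay, advance) + loglik[name][t]

# Backtrace from the final state (the path must end in it).
path = [len(states) - 1]
for t in range(T - 1, 0, -1):
    s = path[-1]
    path.append(s - 1 if s > 0 and viterbi[s - 1][t - 1] > viterbi[s][t - 1] else s)
path.reverse()
print([states[s] for s in path])
# -> ['jerry', 'jerry', 'garbage', 'george', 'george', 'george']
```

Minimum-duration constraints, which the slides note the system was not sensitive to, are omitted here to keep the sketch short.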
SLIDE 29
Forced Alignment
Multi-speaker segments → many single-speaker segments.
Run the forced alignment with ASR again.
SLIDE 30
Music & Laughter Segmentation
Laughter is decoded using the SHoUT speech/non-speech decoder.
Music models are trained separately (same as in the original Joke-O-Mat).
SLIDE 31
Putting it all together
http://www.icsi.berkeley.edu/jokeomat/HD/auto/index.html
SLIDE 32
Evaluation
Compare to expert-annotated ground truth.
1. DER
- False alarms: closed captions spanning multiple dialog segments
- Missed speech: truncation of words in forced alignment
SLIDE 33
Evaluation
Compare to expert-annotated ground truth.
2. User study
- 25 participants
- Randomly shown expert- and fan-annotated episodes
- Asked to state a preference
SLIDE 34
Outline
1. Original Joke-O-Mat (2009)
   - System setup
   - Evaluation
   - Limitations
2. Enhanced version (2010)
   - System setup
   - Evaluation
3. Future Work
SLIDE 35
Limitations & Future Work
Laughter and scene-transition music models are still manually trained.
Requires scripts and closed captions (available from show producers).
Failed single-speaker segments – how to handle them? Retrain the speaker models? An HMM over the whole episode?
Look at other genres (dramas, soap operas, lectures?) with new rules.
Add visual data.
SLIDE 36