Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos

Alexander Haubold
Department of Computer Science
Columbia University

Thesis Proposal

Abstract

Videos are rich in multimedia content and semantics, which should be used by video browsers to better present the audio-visual information to the viewer. Ubiquitous video players allow for content to be scanned linearly, but rarely provide summaries or methods for searching. Through analysis of audio and video tracks, it is possible to extract text transcripts from audio, displayed text from video, and higher-level semantics through speaker identification and scene analysis. External data sources, when available, can be used to cross-reference the video content and impose a structure for organization. Various research tools have addressed video summarization and browsing using one or more of these modalities; however, most of them assume edited videos as input. We focus our research on genres in personal interaction videos and collections of such videos in their unedited form. We present and verify formal models for their structure, and develop methods for their automatic analysis, summarization, and indexing.

We specify the characteristic semantic components of three related genres of candidly captured videos: formal instructions or lectures, student team project presentations, and discussions. For each genre, we design and validate a separate multi-modal approach to the segmentation and structuring of their content. We develop novel user interfaces to support browsing and searching the multi-modal video information, and introduce the tool in a classroom environment with ≈160 students per semester. UI elements are designed according to the underlying video structure to address video browsing in a structured multi-modal space. These user interfaces include image/video browsers, audio/video segmentation browsers, and text/filtered ASR transcript browsers. Through several user studies, we evaluate and refine our indexing methods, browser interface, and the tool's usefulness in the classroom.

We propose a core/module methodology for the analysis, structuring, and visualization of personal interaction videos. Analysis, structure, and visualization techniques in the core are common to all genres. Modular features are characteristic to video genres, and are applied selectively. Structure of interactions in each video is derived from the combination of the resulting audio, visual, and textual features. We expect that the framework can be applied to genres not covered here with the addition or replacement of a few characteristic modules.


Contents

1 Introduction
  1.1 Motivation
  1.2 Background
2 Research Approaches
  2.1 Three genres
  2.2 Common Tools
    2.2.1 Analysis and Structure
    2.2.2 Visualization Techniques
  2.3 Genre-specific Tools
    2.3.1 Lectures
    2.3.2 Presentations
    2.3.3 Discussions
3 Research Progress
  3.1 Structuring Lecture Videos using visual contents
    3.1.1 Classification by Media Type
    3.1.2 Topological Segmentation
  3.2 Structuring Lecture Videos using textual contents
    3.2.1 Data Acquisition
    3.2.2 Analysis
    3.2.3 Results
  3.3 Segmentation and Augmentation for Classroom Presentation Videos
    3.3.1 Audio Segmentation
    3.3.2 Visual Segmentation
    3.3.3 Combined Audio-Visual Segments
    3.3.4 Text Augmentation
    3.3.5 Interface
    3.3.6 User Study
  3.4 Research on Accommodating Sample Size Effects in symmetric KL in Speaker Clustering
    3.4.1 Empirical Solution
  3.5 Summary of research progress
4 Proposed Work
  4.1 High-level Structure Detection
  4.2 Video Structure Comparison
  4.3 Speaker Table of Contents
  4.4 Analysis of Discussion Video and Application of Common Approaches
  4.5 Text Indexing
  4.6 User Interfaces and Tools
  4.7 User Studies
  4.8 Feedback Annotations for Videos (Optional)
5 Conclusion
  5.1 Schedule


1 Introduction

1.1 Motivation

With the advent of ubiquitous high-performance computers, high-speed networks, and inexpensive recording equipment, the use of video as a medium for communication and information dissemination is increasing significantly. What used to be a laborious task of recording with bulky video equipment, transferring footage to computers with expensive hardware, and disseminating video material by means of portable media is now reduced to simple plug-and-play procedures and easy-to-use software for compression and distribution. Due to its simplicity, video has started to play an important and integral role for many organizations – presentations, discussions, lectures, and other events are readily captured, shared, and archived. This trend is particularly strong in the university environment, where lectures are recorded for distance learning programs, student presentations and discussions are documented by instructors for providing feedback, and guest talks are captured for archiving purposes.

One of the drawbacks of this intensive use of video is the large accumulation of raw video footage. While it is easy to transfer video to a computer, there still exists a need for editing, organization, and effective dissemination. Without manual intervention, the raw footage is merely a serial stream of video data; however, the content is rich in information that should be indexed and searched similar to a textbook. Indices and tables of contents for videos require the detection of structure, from which hierarchies of contextual units can be derived. Structure is determined from segmentation of the various media used in video (imagery, audio, text, etc.), and different emphasis is placed on each of the tracks depending on the type of video content. Research problems include analysis of the types of video and their features, segmentation and clustering of audio by speakers, transcript extraction and filtering from audio, segmentation and clustering of visual content, interactive visualization approaches for structure in video, and effective distribution of content to viewers.

This work contributes new methods and approaches of multi-modal video analysis for personal interaction videos. One of the genres considered (lecture video) has been explored in many prior works, while two additional genres (presentation and discussion video) are newly introduced, compared, and contrasted. The findings and implementation of this work find widespread application, in particular in environments where vigorous personal interaction plays an important role, for example in the team-oriented university classroom.

1.2 Background

Personal interaction videos are rich in content but typically lack frequent action events, which are commonly found in the genres of news, sports, and film. They are for the most part unedited and often contain long sequences during which a single topic is covered. Determining the structure of their contents relies on approaches of content analysis for determining contextually coherent units and their temporally recurring instances.

This area of content-based video indexing and retrieval (CBVIR) builds on analysis of multi-modal sources, including imagery, motion in video, audio, text from speech, text from image, and text from other sources, to name a few. Substantial research has been carried out in these fields, for the most part isolated to unique problems in one medium. Bashir and Khokhar [1] provide a hierarchical overview of CBVIR (see Figure 1), in which analysis of a medium falls into one of three levels: low-level features from signal processing, semantic representations from computer vision methodologies, and high-level intelligent reasoning from AI and psychology/philosophy understanding. While their figure focuses solely on imagery in video, other media such as audio and text share a similar hierarchy. An approach to a complete analysis would draw techniques from all three levels: segmentation of video and audio signals using low-level features, determination of content similarity at the semantic level, and presentation of the summary at the high level.

Figure 1: Classification of Content Modeling Techniques by Bashir and Khokhar [1]. Level I: Modeling of Raw Video Data; Level II: Representation of derived or logical features; Level III: Semantic level abstractions.

Segmentation of video into shots tends to be based on low-level features, such as histogram changes, MPEG motion vectors, Gabor energies, textures, etc. The Cornell Lecture Browser [2] uses histograms to detect presentation slide changes; Smith [3] and Yang [4] use them to detect cuts in news and other non-presentation videos, and Haubold [5] uses them for presentation and significant speaker pose changes. Feature vectors from such low-level features are also used for statistical approaches to segmentation, and for machine learning methods of shot classification. Souvannavong [6] applies Latent Semantic Indexing (LSI), a well-known method in text analysis discussed by Landauer [7], to clustering of shots in news videos. Dorai [8] uses low-level feature vectors to train classifiers for video shot types, such as blackboard/whiteboard, narrator, and slide text, among others.

Haubold [9] uses a decision tree for a similar classification of video shot types appearing in Columbia Video Network lecture videos.

Segmentation and classification of audio are approached similarly to video. Low-level features include volume, zero-cross rate, frequency centroids, Fourier analysis, and the widely used Mel Frequency Cepstral Coefficients (MFCC), among others. Fujii [10] uses the audio signal of a lecture video to detect pauses in speech, which are then used to model text-from-speech topics. Low-level audio features are also the basis of video classification into sports, news, and commercial in work by Liu [11]. The Cornell Lecture Browser system [2] artificially introduces audio signals as cues for synchronization between several cameras. Chen [12] introduces a robust speaker segmentation algorithm using low-level MFCC features and the Bayesian Information Criterion, which Haubold [5] applies as an important step to structuring presentation videos.

The most useful derivation from the audio stream is the text transcript, captured through Automatic Speech Recognition (ASR). It has been applied ubiquitously in video analysis, since text query and search is a better-understood problem than its visual analog. Fujii [10] applies ASR to generate text transcripts for lecture videos, Haubold [5] for presentation videos, and Waibel [13] for meeting videos, to serve as searchable and browseable indices. Smith [3] and Yang [4] apply transcripts from news videos as indices for searching and querying. In some works, transcriptions serve specifically as text indices to audio archives, for example in Young's video mail [14] or Whittaker's news streams [15]. While ASR algorithms use machine learning to improve accuracy given speaker and language models, their results are not perfect. When no models are customized, accuracy drops dramatically, as evaluated by Witbrock [16]. In the case of lecture videos, analysis of content and training of language models can lead to better accuracy, as shown by Glass [17]. Haubold [18] shows that external indices can help significantly in filtering highly inaccurate ASR transcripts.

Text is also used as a separate medium in video analysis for contextual segmentation and clustering purposes. Typically, statistical methods are applied to select interesting and useful words and phrases, while discarding topologically redundant ones, as demonstrated by Yang [19]. Common approaches are the application of term frequency / inverse document frequency (TF-IDF) and LSI. Lin [20] and Ponceleon [21] show how words from lecture video transcripts are used as low-level features to detect topic changes, and thereby segment the video. Lin [20] and Yang [4] analyze word classes to rank and form better comparisons. Yang [4] also introduces external corpora, such as WordNet [22] and news web pages, to expand queries.

Several researchers have investigated videos with respect to extraction of structure. They are addressing the need to build hierarchies around serial content, so that shots are no longer self-contained contextual entities, but can be linked and made relevant with other content. Conceptualization of content, discussed by Natsev [23] and Kender [24], is one such approach, in which video shots in the news domain are annotated with trained concepts descriptive of the scene. The resulting concepts for shots can be used to track news episodes over time, or can be searched with additional query expansion given dictionaries such as WordNet, demonstrated by Haubold [25] on TREC 2005 Video data. Hauptmann [26] takes a similar approach for various genres of video, including promotional videos and documentaries. Sources of annotation include Optical Character Recognition (OCR), ASR, face detection, and image comparisons for querying. Sundaram [27] investigates structure and hierarchy of scenes in film derived from domain-specific features, such as dialogue and cinematographic rules.

Imagery, audio, and text are the most commonly identified media in video. Their analysis depends mostly on video-internal content that can be extracted with relative ease.

Research by Waibel [13] suggests that environmental cues, such as phone rings or door knocks in a meeting environment, can be incorporated into structuring video content. However, event detection is beyond the scope of video content analysis as defined here.

The work of segmentation, indexing, and structuring of video is eventually presented in user interfaces, whose primary goal is to allow parallel data exploration of an inherently serial medium. Currently available media players still rely for the most part on a time slider and fast-forward/rewind functions. An efficient interface attempts to display audio, video, and textual information as compactly as possible, allowing a viewer to advance to any part of the video while playing back as little as possible. Literature provides many examples of prototypes for visual and audio summarization interfaces and browsers. Lee [28] presents an overview of video browsing issues and a comparison of 15 prototypes from research. Some of the most important features mentioned are abstraction of information, and the Shneiderman [29] Visual Information-Seeking Mantra: overview first, zoom and filter, then details-on-demand. Li [30] evaluates features of video browsers for individual genres of lecture, presentation, entertainment, and other videos. Specifically for lecture videos, users preferred using tables of contents to skimming the videos.

Most interfaces use keyframes as means of visual summaries, such as the keyframe-based UI for digital video by Girgensohn [31] (Figure 2), where a space-optimizing mural of differently sized keyframes is used to present a video segment. Size of keyframes is proportional to importance as determined during segmentation and indexing. Earlier work on lecture videos used a popular web page framing approach, where video playback features are separated from electronic slides and text indices. The Cornell Lecture Browser [2] and Classroom 2000 [32] (Figure 3) are a few such examples. Some more recent interfaces by Worring [33] and Tang [34] cluster keyframes based on feature vectors (Figure 4), allowing for zooming on dense clusters for detailed exploration. Altman [35] proposes a semantic lecture browser (Figure 5), which places emphasis on pedagogical events, such as in-class discussion, theorem, equation, diagram, example, etc. The interface is based on a hyperbolic graph of events from a lecture, and allows navigation to related events. However, the interface remains a suggestion, as no related work on segmentation and video content understanding exists.

Figure 2: Keyframe-based UI for Digital Video by Girgensohn [31].
Figure 3: Classroom 2000 by Abowd [32].
Figure 4: Video Archive Access by Worring [33].
Figure 5: Semantic Exploration of Lectures by Altman [35].
Figure 6: SCAN: Retrieval from Speech Archives by Whittaker [36].
Figure 7: Video Mail Retrieval by Young [37].

While video summaries are based on their visual video source, browsing audio archives requires a multi-modal shift: it is possible to skim text, but not serial audio streams. ASR is used to create that shift into the visual mode of text. SCAN [36] for news audio (Figure 6) and the Video Mail Browser [37] (Figure 7) show user interface examples for skimming and querying of audio streams using ASR-transcribed text as a medium.

Video summarization and indexing approaches have been studied on a variety of genres, including lecture and presentation videos. Analysis and domain-specific methods of content extraction have led to customized user interfaces that emphasize features in those domains. Structure in videos, however, has been sparsely addressed, in particular in a multi-modal space covering video, audio, and text. Tables of contents and indices for books are based on the structure of their content. We propose that a similar framework should be applied to video content.

2 Research Approaches

In this section, I will discuss our approach to analysis, structuring, and visualization of personal interaction videos. For a given genre of video, we analyze characteristic semantic components, such as speakers from audio, visual cues from video, text from ASR transcripts, and data from external sources, such as electronic slides and textbooks. Most of the semantic components are shared among the different genres of interaction video, and thus our tools for content structuring can be used on multiple genres. The various modalities of a video are first segmented and clustered where viable. Text transcripts are then generated using ASR software and are filtered with matching external corpora of text. The collection of filtered phrases is clustered as an independent medium similar to audio and video. It further serves to augment videos with searchable text. In a final step, structure is derived from patterns of recurrence between identified content units, whether audio, video, or text. A custom-designed user interface combines the original video and the identified contextual structure for browsing and comparing videos.
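The flow just described can be summarized in a short Python sketch. The individual analyzers are passed in as callables, mirroring the core/module framework; all of these names are hypothetical placeholders illustrating the data flow, not an actual released API.

    # Hypothetical orchestration of the analysis chain described above; every
    # helper is a stand-in for a component of the framework (cf. Figure 8).
    def analyze_video(video, segment_speakers, segment_scenes, run_asr,
                      filter_phrases, cluster_phrases, derive_structure,
                      external_corpus):
        speaker_segments = segment_speakers(video)     # audio: BIC over MFCC
        scene_segments = segment_scenes(video)         # video: histogram cues
        transcript = run_asr(video)                    # text: ASR transcript
        phrases = filter_phrases(transcript, external_corpus)
        phrase_clusters = cluster_phrases(phrases)     # temporal phrase clusters
        # Structure is derived from recurrence patterns across all modalities.
        return derive_structure(speaker_segments, scene_segments, phrase_clusters)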

2.1 Three genres

We focus on three types of personal interaction videos: formal instructions or lectures, student team project presentations, and discussions. These genres lie as separate points in a high-dimensional space: number of speakers, number of audiences, structure of content, external structuring materials, quality and placement of cameras and microphones, and degree of sophistication of capture and editing (Table 1). Each genre drives, in a nearly algorithmic way, the form of the GUI and the nature of the user studies validating them.

Analysis of a genre leads to the selection of the tools used to segment and cluster video content. Lecture videos are conducted by an instructor who dictates the structure of the material. Content is presented at a slow pace using blackboard/whiteboard notes, projected handwriting, and/or electronic slides. In a separate process, a trained camera operator decides what physical setting to record, depending on whether the instructor is addressing the class, engaging in a discussion, or writing notes. Structure is determined from the recurrence pattern of physical scenes and content from presented notes. With the exception of questions from the class, only the instructor speaks throughout a lecture video. The lack of dialogue makes speaker segmentation and clustering less applicable for determining structure. Because of the slow-paced content progression, textual cues can be used to determine long-term structure across several lectures (typically 15-35 per semester).

Classroom presentations by students are in many ways an inverse to lectures. A team of novice students presents content to an expert instructor. The format of a presentation typically includes the use of electronic slides, which are displayed at a fast pace while several students discuss points of interest. Contrary to lectures, written content cannot be used as the primary cue for structure. Instead, speaker segments and recurrences describe the presentation's interactions, whether between presenters, or between presenters and audience during Q&A.

Discussion videos constitute the most unstructured type of interaction videos. Characteristically, they do not adhere to fixed scene settings or even predictable contents. Footage is shot in various indoor settings, as well as during outdoor site visits. We expect that cues can only be derived from global visual segmentation and speaker segmentation.

Type | Speaker | # Speakers | Audience | # Audience | Structure | External Material | Recording Operator | Camera/Microphone
Lecture | expert | 1 | novices | 1 | explicit | written content | trained | fix/fix
Classroom Presentation | novices | n | expert | 1 | implicit | structured outline | untrained/none | fix/mobile
Discussion | novices | n | novices/expert(s) | n | none | none | untrained/nervous | mobile/mobile

Table 1: Three types of personal interaction videos and their distinguishing characteristics.

2.2 Common Tools

The common core is shown in Figure 8, and is separated into 4 main parts: Audio, encompassing Speaker Segmentation, Speaker Clustering, and Speaker Topology; Video, containing Keyframe Extraction, Video Player, and Keyframe Player; Text, comprising Text Phrase Ranking, Clustering, and Querying; and Structure, referring to Comparison of Video Structure. Solid rectangles denote completed components, and dotted lines denote proposed work.

2.2.1 Analysis and Structure

Depending on the genre of video, segmentation and clustering of video and audio are performed using different cues. Visual data is analyzed with increasing precision from discussions to presentations to lectures. Audio segmentation by speaker is treated in an almost exact inverse relationship. Lectures generally do not require it; in presentations, more emphasis is placed on speakers, as fewer cues are available from imagery; in discussions without predictive scene settings, almost no visual cues can be used, and segmentation and clustering depend largely on speakers and possibly sound effects.

Speaker segmentation is used in parallel with visual segmentation to capture human interactions in video. It is important to structure insofar as it complements visual segmentation in some cases and replaces it in others. We use known approaches of the Bayesian Information Criterion (BIC) applied to Mel Frequency Cepstral Coefficients (MFCC) to find speaker segments. After determining appropriate values for MFCC sampling size and frequency, this method proves very stable and accurate. However, popular approaches to clustering speaker data (Speaker Clustering) using the symmetric KL distance have proved to be inconsistent with our data. We have identified shortcomings when speaker segments used for comparison have different lengths, and present results from an empirical solution. A second problem, related to inaccurate modeling of lower-order MFCC channels, was determined, and we hope to produce a solution in our on-going work. With more accurate speaker clustering, a hierarchical structure similar to that of visual lecture contents can be built.

Analysis of the visual modality is based on determining shots of similar content. Depending on the video genre, similarities can be found in written content, scene setting, or gestural events. Keyframe Extraction is common to all segmentation approaches, its result serving as a visual index into the video contents.

Figure 8: Framework of common tools and user interface components. These components are part of the Core, and are shared by all three genres of video analyzers. Working components are marked with solid lines, and proposed items are dotted.

Written and spoken text is not a video-intrinsic medium, and must therefore be derived using OCR and ASR. We focus our research on analyzing text from speech, as the most common element shared by interaction videos. While lecture videos and most presentation videos contain written text, there is no guarantee that all interaction videos contain written material. ASR transcripts from videos exhibit high word error rates (up to 75%). While the error rate is high, most errors are due to recall, not precision. We take advantage of this observation and use external corpora to filter out irrelevant terms, and to ensure that the remaining terms are in fact true parts of the video transcript. Text terms are mapped between transcripts and external corpora by finding the longest possible matches in number of words. Stop word removal is not performed before this mapping takes place. This approach gives us the advantage of finding common phrases in speech beyond pure noun phrases, which tend to be considered most descriptive. We have observed that the length of a phrase is proportional to its semantic significance. We therefore apply Text Phrase Ranking to all filtered phrases, and include phrase length as a parameter to compute a weight of usefulness. Temporal repetition of phrases is considered separately, and is used to cluster phrases over a desired temporal period (Text Phrase Clustering).
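A minimal sketch of the longest-match filtering just described follows; the exact matching and weighting functions of our tools are not reproduced here, so the usefulness weight is an illustrative assumption.

    # Sketch of longest-match transcript filtering against an external corpus.
    # Stop words are deliberately retained, as described above.
    def filter_transcript(words, corpus_phrases):
        """words: ASR transcript tokens; corpus_phrases: set of phrase strings."""
        matches, i = [], 0
        while i < len(words):
            found = None
            for j in range(len(words), i, -1):      # try the longest span first
                if " ".join(words[i:j]) in corpus_phrases:
                    found = (i, j, " ".join(words[i:j]))
                    break
            if found:
                matches.append(found)
                i = found[1]                        # continue after the match
            else:
                i += 1
        return matches

    def usefulness(phrase, occurrences):
        # Illustrative weight: longer and more frequent phrases rank higher.
        return len(phrase.split()) * occurrences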

2.2.2 Visualization Techniques

Ubiquitous video browsers have adopted little from research efforts, and rely mostly on timelines with which a user can skip to any portion of a video. Some interfaces feature keyframes as indices, but offer no other means of browsing content or structure. Due to the limitations of existing browsers, new developments in content and structure browsing and comparison require custom GUIs. Our approach to building an interface is based on the different content modalities, all of which we consider media tracks. Each media track is visualized using its own representative interface: video is summarized using keyframes, audio using speaker-identifying features, and filtered transcriptions using ranked and clustered phrases. (An example of the media track methodology is discussed later in Figure 11.) An abstracted Speaker Topology graph, common to all three video genres, shows occurrences, recurrences, and temporal dependencies of speakers. Underlying this visualization are Speaker Segmentation and Speaker Clustering. Visual information is presented in a standard Video Player and a Keyframe Player aimed at fast browsing by displaying keyframes in a highly compressed temporal order. Keyframes selected during the analysis step can be viewed as a slideshow at constant speed or temporally scaled to the corresponding video segment's duration. Ranked Phrases text graphs summarize textual information in a two-dimensional view, where the horizontal dimension corresponds to time and the vertical dimension denotes confidence of usefulness. Phrases can be clustered temporally by means of interactive sliders, allowing for visual exploration of a phrase's recurrence in a given video. Text Querying is performed over all videos with a ubiquitous keyword search interface.

Structure of video is derived from the available visual and audio modalities and can be compared between different videos (Structure Comparison). (For example, Figure 9 shows an example of visual topology in a lecture video, which can be used in such a comparison.) One of the goals of comparing high-level structure between videos is to derive interesting differences between videos, for example teaching style differences in lecture videos.

2.3 Genre-specific Tools

Analysis and Visualization features characteristic to each genre are added as modules to the common core. Lecture videos mainly benefit from more elaborate analysis in the visual domain, including Scene Segmentation, Pixel Content Clustering, Environmental Clustering, a Video Content Index, and a Video Scene Topology. Filtering of text is performed using Textbook Indices, and Video Clustering is carried out over a semester's series of lectures. Presentations contain more visual material that develops at a fast pace, and therefore lend themselves to Scene Segmentation, but less to content analysis. Interesting visual information is extracted from Speaker Face Detection and a corresponding Speaker Index. Text filtering takes place using Electronic Slides and Topic Lists. Discussion videos have the least visually comparable information other than what can be compared using Environmental Clustering of Segmented Scenes and presented in a Video Content Index. Since discussions focus mostly on the interaction between people, Speaker Face Detection and the corresponding Speaker Index are expected to result in the most meaningful indicators of structure. Text filters are derived from a Meeting Agenda.

2.3.1 Lectures

Visual segmentation for lecture videos is based on changes in content (Scenes: Inactivity Segments). Since the camera tends to capture static scenes of hand and body motion of an instructor writing or gesturing, cues for scene changes are taken from significant changes in camera position. The proprietary software from Columbia University's Video Network performing this operation for lecture videos introduces additional segments at some frequency of inactivity. Our algorithm labels the resulting keyframes by their corresponding type in the Environment Clustering step. Keyframes fall into one of several categories which are common in classrooms, e.g. Blackboard, Sheet of Paper (if the instructor prefers paper over the blackboard), Podium (when the instructor addresses the class), etc. Analysis of written content in keyframes of type Blackboard or Sheet of Paper, and clustering based on their similarities, establishes the absolute segmentation of teaching units.

Methods of comparison include filtering of desired content pixels in a given keyframe, and matching of their spatial patterns to other keyframes (Pixel Content Clustering).

We find and demonstrate that high transcription error rates can be compensated with appropriate filtering. We apply an external corpus that is contextually related to the material discussed in a given course (Text Filter: Textbook Index). For the most part, this includes textbook indices, but we have also used typed lecture notes and presentation slides for courses in which no textbook is used. We use text phrases to cluster a series of lectures and highlight the topics which are discussed across several videos (Video Clustering). We also use text phrases for a given lecture video to determine with high accuracy which chapter in the course textbook best corresponds to the lecture.

Using commercial ASR software, we extract text transcripts from lecture videos. No language or speech model adaptation is carried out, for several reasons. Firstly, lecture videos are not recorded with special DSP microphone hardware, which is required to ensure optimal speech signal processing. Also, compressed videos generally lack the audio-visual quality necessary for this type of processing.

2.3.2 Presentations

Presentation videos do not require a detailed analysis of content. Focus is shifted from detailed writing to a presenter's staging of the contents. Important cues for structure are derived from a speaker's identity, occurrence, recurrence, and duration. Speaker segmentation and clustering are supplemented by Speaker Face Detection, which aims to capture the current speaker's visual identity. The results of this analysis are presented in a visual Speaker Index. Changes in visual aids and unexpected speaker motion are considered for visual segmentation. Local and global comparisons of histograms over neighboring windows are generally sufficient to detect visual changes and some gestural differences (Scenes: Difference in Content, Difference in Motion).

Text transcripts are generated using ASR software without language or speech model adaptation. It is infeasible to perform language and model adaptations for the large group of people appearing in such videos, with an average of 20 students presenting on different material in a 75-minute period.

We use presentation slide contents and expected presentation topic terms as external corpora for filtering inaccurate transcripts (Text Filter: E-Slides & Topic List). We denote terms from slides "content phrases", and expected terms "topic phrases". Topic phrases hint at the only available external structure in presentations, as imposed by the course instructor. They include topics that students are expected to discuss. In our visualization, text phrases are included as separate media tracks for summarizing video content. Length and recurrence of phrases is considered proportional to their relevance, with longer phrases being more expressive. In our text summary, relevance is depicted by position and color of text blobs. In a user study, we verify that this approach significantly decreases time and increases accuracy of answers related to summarizing video content.

2.3.3 Discussions

Discussion videos lack most visual structure, whether recorded in a fixed or mobile setting. Prior work on meetings by Lee [39] suggests visual segmentation of people using a fixed omni-directional camera, background modeling, and object modeling. An approach for meeting video segmentation based on salient text transcript breaks, as opposed to speaker changes, is presented by Gruenstein [40]. However, our discussion videos do not guarantee fixed camera poses, and close-to-perfect manual transcripts are not available for text segmentation.

Similar to presentation videos, discussion videos generally focus on personal interaction rather than written material. Speaker Face Detection is applied to capture visual speaker identities, which are then presented in the video browser in a Speaker Index. Due to the lack of contextual visual cues, such as writing or visual aids, we anticipate that a visual segmentation based on difference in motion (Scenes: Difference in Motion) can be applied to detect coarse scene changes. This segmentation is complemented by Environment Clustering to detect global differences, such as indoor settings versus outdoor settings. Cluster groups are included in the visualization in the form of a Video Content Index, which represents each cluster by a representative keyframe.

ASR transcripts are again used for text analysis and indexing. However, we expect audio quality to be significantly worse than for the semi-controlled environments of the other two genres, which will impact the accuracy of transcribed text. Moreover, there are fewer context-rich external indices that can be used to filter the text. We intend to rely on a meeting agenda (Text Filter: Agenda) where available.

3 Research Progress

In this section, I will present details of my completed research. We focus on two of the three genres of personal interaction video, namely lecture videos and classroom presentations. Approaches for determining and visualizing structure in lecture videos include analysis and segmentation of visual information from keyframes, presented at ICME 2003 [9], and textual information from automatic speech recognition transcripts, presented at MCBAR 2004 [18]. For classroom presentation videos, structure is taken from visual, speaker, and textual information. This work was presented at ACM MM 2005 [5]. We find and characterize previously unreported problems with a popular method for speaker clustering, the results of which are in preparation for ACM MM 2006 [45].

3.1 Structuring Lecture Videos using visual contents

Our approach to segmentation of instructional videos builds on existing video recordings. The environment in which lectures are videotaped provides traditional and modern tools for teaching: blackboard, whiteboard, paper + pen, computer (electronic slides, web browser, telnet, etc.). We first focus our analysis on detection of the media used in a given keyframe of the video. We identify six types: board, class, computer, podium, illustration, and sheet. In a second step, keyframes belonging to two of the six media types are further grouped by their visual content into topic clusters. The resulting structured visual contents are presented in a user interface geared towards browsing lecture videos by teaching units (Figure 9, right), as opposed to linear browsing using long lists of keyframe images (Figure 9, left). The interface allows the user to search and browse for specific parts of an instructional video, and it also highlights potentially important portions of a lecture. The Topological View automatically lays out interrupted topics in a visually non-interfering planar array; this graphically captures those semantically dense points in the lecture with interaction between different topics. The Key Frame View enables the user to retrieve full-sized key frames in a separate window by sliding the mouse over the thumbnailed images. First (last) key frames for a given topic can be determined by finding the start (end) of the topic using icons in the Topological View; clicking on these icons then highlights all the related key frames in the thumbnail strip below, regardless of their temporal separation (Figure 9).

Our dataset includes 17 videos of 75-150 minute long lectures. Keyframes are selected from the video by proprietary software. With an average of one keyframe every 20-25 seconds, a lecture is represented by 200-350 keyframes.

3.1.1 Classification by Media Type

The first step in the segmentation process is to assign each key frame to a media type. This classification uses a decision tree of static image feature filters, such as color information in certain spatial arrangements, color patterns, and features such as edge information. Media type classification is rather robust, as most media types were correctly detected between 97 and 100% of the time. The only exception is detection of type illustration.
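The exact filters of the decision tree are not given here; the following sketch only illustrates its shape, with hypothetical features and thresholds.

    # Illustrative decision tree over static keyframe features; the features
    # and thresholds are hypothetical stand-ins for the filters described above.
    # Only four of the six media types are shown; illustration and the
    # remaining types would require additional filters.
    import numpy as np

    def classify_media_type(frame):
        """frame: HxWx3 uint8 RGB keyframe; returns a media type label."""
        gray = frame.mean(axis=2)
        dark_ratio = (gray < 80).mean()      # blackboards are mostly dark
        bright_ratio = (gray > 200).mean()   # paper and slides are mostly bright
        edge_density = np.abs(np.diff(gray, axis=1)).mean()  # crude edge measure
        if dark_ratio > 0.6:
            return "board"
        if bright_ratio > 0.6:
            return "sheet" if edge_density > 8 else "computer"
        if edge_density < 3:
            return "podium"                  # low-detail shot of the instructor
        return "class"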
3.1.2 Topological Segmentation

The proprietary software that selects key frames does so in a way that is mostly sensitive to instructor motion, concentrating its captures during periods of relative visual calm. Consequently, the key frames are highly redundant. Therefore, we cluster similar key frames into topics based on visual content whenever the more recent frame elaborates on the visual information found in the more distant one. This clustering reduces a set of key frames to a set of clusters (topics) of similar key frames that is 5 to 20% of that size. Steps of filtering and clustering are performed on keyframes in order to extract written contents and to find other closely related keyframes.

Figure 9: (Left) Existing interface for viewing a lecture video, and browsing by keyframes. Keyframes are listed by their temporal appearance, but the interface lacks any means for identifying those frames for faster browsing. (Right) Top level of new user interface: topological index with key frame summary. (A) above: Each key frame media type is assigned a distinguishable color as well as a descriptive icon. (B) below: Vertical key frame summaries are aligned with media type icons; horizontal topological groupings capture topic commonalities. Icons and key frames are clickable; they select topics, magnify the thumbnails, and pop up the video at the appropriate frame.

We have collected data over 17 extended videos measuring 40 total hours that suggest several properties of the underlying processes. Media type classification is robust, as reported in Section 3.1.1, and topological segmentation performed equally well, at a success rate of more than 96%.

3.2 Structuring Lecture Videos using textual contents

Textual lecture transcripts are a rich source of information, and lend themselves particularly well to comparisons with other lectures. Text searching and clustering are substantially better defined than their counterparts in the purely visual medium. With the availability of Automatic Speech Recognition (ASR) engines, audio tracks from lecture videos can be quickly transformed into useful text, albeit with significant error rates.

We investigate the possibility of extending a lecture browser's ability to include cross-lecture indexing and referencing, in particular within a full university course with 10 to 30 lectures. We present the methods used in capturing transcripts and discuss the common difficulties encountered in the process. We then provide details of the analysis stage and tie in the results with several experimental interactive visualization schemes.

3.2.1 Data Acquisition

For our purposes, we are using course videos from the Columbia Video Network and the commercial Automatic Speech Recognizer IBM ViaVoice to extract transcripts. So far, we have analyzed 7 courses related to Computer Science with 183 lectures (230 hours of video); 4 of these courses have been analyzed with different instructors' voice trainings for an additional 90 transcripts. Most transcripts contain between 5,000 and 14,000 words with minimal punctuation marks. When applying IBM ViaVoice to the extracted audio track, the Word Error Rate is at approximately 75%.

3.2.2 Analysis

For the purpose of indexing, summarization, and cross-referencing, meaningful text needs to be extracted from the transcripts. Ideally, such contents would include "theme" and "topic phrases" that describe the topics covered in a given lecture. The term "theme phrase" is loosely defined as a phrase shared among several transcripts, i.e. a phrase that appears in at least ¼ of all transcripts. A "topic phrase" denotes the opposite, i.e. a phrase shared by less than ¼ of all transcripts. The reduction of the index to smaller phrases is also performed with respect to stop words before and after content words. Lastly, a Porter stemmer [44, p. 534] is applied to all words. As an alternative to finding index phrases in transcripts, we have explored using word pairs.
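A sketch of the theme/topic split by document frequency follows, using the ¼-of-transcripts threshold defined above; the use of NLTK's PorterStemmer is an assumption, since the proposal cites the Porter stemmer [44] without naming a library.

    # Sketch: split index phrases into theme and topic phrases by the fraction
    # of transcripts they appear in; stemming via NLTK's PorterStemmer.
    from nltk.stem import PorterStemmer

    def split_phrases(phrase_docs, num_transcripts):
        """phrase_docs: dict mapping phrase -> set of transcript ids."""
        stem = PorterStemmer().stem
        themes, topics = [], []
        for phrase, docs in phrase_docs.items():
            normalized = " ".join(stem(w) for w in phrase.lower().split())
            if len(docs) >= num_transcripts / 4:  # shared by >= 1/4 of transcripts
                themes.append(normalized)
            else:
                topics.append(normalized)
        return themes, topics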

3.2.3 Results

We have investigated several interactive visualization techniques that present the results from text analysis to the student in a meaningful fashion, and have devised three visualizations. Common to all are three parameters that are roughly analogous to a camera's settings. A "zoom" feature, derived from the occurrence of a phrase across transcripts, allows for setting the specificity of the displayed phrases, ranging from topic-specific to entirely thematic. The "focus" setting denotes the frequency with which a phrase occurs, which is derived from the occurrence of a phrase within a given transcript. The third common setting, "contrast", controls the length of the phrases considered for display.

Transcript Index Map

The Transcript Index Map (Figure 10) is a graph in which index phrases are mapped to the transcripts they appear in. Its primary purpose is to provide the equivalent of a textbook index for each transcript, except that the index terms are not ordered alphabetically, but rather in order of occurrence.
Transcripts appear temporally increasing along the horizontal direction, and index phrases drop vertically below each transcript in decreasing order of occurrence. To further distinguish the frequency with which an index phrase occurs, each item is colored in a spectrum from red to yellow, denoting high to low occurrences, respectively.

Textbook Chapter to Transcript Match

In this second visualization (not illustrated due to lack of space) we attempt to match a given transcript to a textbook chapter based on the set of identified index phrases. While not every lecture must have a corresponding chapter in the textbook, and while some lectures cover more than one chapter, this interface highlights those chapters that have a relatively high probability of corresponding to the given lectures. The tabular interface is divided into individual chapters from the textbook in columns, and lecture transcripts in rows. Each cell represents a numeric value that ranks the relative score for each chapter-transcript pairing.

Lecture Transcript Similarity

For the third visualization (not illustrated) of lecture contents for a full course, we have created a graph that visually clusters similar lectures based on a set of selected phrases. The purpose of this tool is to allow a student to explore a course by dynamically grouping lectures that have similar contents based only on a small set of index phrases. Multidimensional Scaling (MDS) is used to collapse the higher-dimensional space of N lecture transcripts down to 2 dimensions.

While we have not performed explicit user studies on the Transcript Index Map (Figure 10), the visualization techniques have been incorporated in work on presentation videos. In user studies over 2 semesters, we have observed significant improvements in accuracy and speed of summarization of video content based on this visualization technique.

Figure 10: Transcript Index Map for the course "Analysis of Algorithms": Zoom is set to 13, i.e. half the number of transcripts for this course. Displayed are topic and theme phrases, with theme phrases appearing in larger blobs. Phrases are color-coded using a red to yellow gradient denoting higher to lower occurrences.
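The MDS step of the Lecture Transcript Similarity view could be sketched as follows; the phrase-overlap distance and the use of scikit-learn are assumptions, since the proposal specifies neither.

    # Sketch: embed N lecture transcripts in 2-D via MDS over pairwise
    # phrase-overlap (Jaccard) distances; scikit-learn's MDS is assumed.
    import numpy as np
    from sklearn.manifold import MDS

    def embed_lectures(phrase_sets):
        """phrase_sets: one set of index phrases per lecture."""
        n = len(phrase_sets)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                union = phrase_sets[i] | phrase_sets[j]
                inter = phrase_sets[i] & phrase_sets[j]
                dist[i, j] = 1.0 - (len(inter) / len(union) if union else 1.0)
        return MDS(n_components=2, dissimilarity="precomputed").fit_transform(dist)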

3.3 Segmentation and Augmentation for Classroom Presentation Videos

Classroom presentation videos differ from lecture videos in most characteristics. Their primary purpose is to record student teams presenting to the class, and to give instructors an audio-visual medium for feedback on presentation performance. While lectures have structured content, presentations only have a structured outline. Whereas lectures are accompanied by external and in-class textual materials, student presentations are limited to audio transcripts and sometimes condensed content from electronic slides. Lecture and presentation videos also differ in the quality of their recordings. While the former are taped in a semi-professional environment with steady cameras and trained camera operators, presentation videos are captured in classrooms with less ideal recording conditions and untrained camera operators. Lighting and audibility vary significantly between and during presentation videos, as does the focus of the camera on the current presenter.

Our analysis takes advantage of several important characteristics of such video. External structure of presentation sections and electronic slides, if available, are used as indices to filter ASR transcripts. Frequent speaker changes captured from audio, and scene changes from video, are used to build structure in the otherwise serial multimedia stream.

3.3.1 Audio Segmentation

In general, audio segmentation for presentation videos lends itself to segmentation by speaker. We employ the method of detecting speaker changes via the Bayesian Information Criterion introduced by Chen [12]. The audio track is sampled at regular intervals, and vectors of 13 Mel Frequency Cepstral Coefficients are determined for each set of audio samples. Using a two-window approach, the BIC is computed for each partition of this interval.
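A sketch of the two-window BIC change detection follows, after Chen's formulation [12]; the window margin and the penalty weight λ are illustrative choices.

    # Sketch of BIC-based speaker change detection over MFCC vectors [12].
    # X holds one 13-dimensional MFCC vector per analysis frame.
    import numpy as np

    def delta_bic(X, t, lam=1.0):
        """Delta BIC for a change at frame t of window X (N, d); > 0 => change."""
        N, d = X.shape
        def logdet(segment):
            cov = np.cov(segment.T) + 1e-6 * np.eye(d)  # regularized covariance
            return np.linalg.slogdet(cov)[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
        return 0.5 * (N * logdet(X) - t * logdet(X[:t])
                      - (N - t) * logdet(X[t:])) - lam * penalty

    def find_change(X, margin=20):
        # Hypothesize a change at the t maximizing delta BIC, if positive.
        scores = [delta_bic(X, t) for t in range(margin, len(X) - margin)]
        best = int(np.argmax(scores))
        return best + margin if scores[best] > 0 else None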

3.3.2 Visual Segmentation

In this stage of segmentation, visual contents from a video are analyzed for shot boundaries. We apply methods of computing histogram changes between consecutive frames and detecting long-term changes by comparing the degree of change over time. We have found this method to be robust in detecting changes in presentation slides. More interestingly, this method also detects speaker changes by differentiating the characteristic movement patterns of two speakers. Common errors during visual segmentation include accidental movement of the camera, increased movement of students in front of the camera, especially during setup of presentations, and poor lighting (too much or too little) during the presentation.

3.3.3 Combined Audio-Visual Segments

The nature of presentation video leads to the definition of a "presentation unit" for this genre, one in which boundaries are not identified solely by visual or audio scene changes, but by a combination of the two. In some instances, using only one of the two methods would result in an unfavorable series of presentation units: 1. speaker segmentation by itself produces no or too few segments for a long presentation led by only one student; and 2. visual segmentation on its own may miss dialogue changes, especially when two speakers use the same slide. The integration of both segmentations takes into account significant audio as well as visual changes, including speaker, gesture, and visual aid changes. We have included a graphical representation that combines raw audio and video segmentations in the user interface. This timeline marks all relevant breaks in the video and labels them with timestamps (Figure 11).
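A sketch of the two steps just described follows: histogram-based visual boundary detection (Section 3.3.2) and the audio-visual boundary merge (Section 3.3.3). The threshold and the merge tolerance are illustrative assumptions.

    # Sketch: visual shot boundaries from frame histogram differences, then a
    # merge with audio (speaker) boundaries into combined presentation units.
    import numpy as np

    def visual_boundaries(frames, bins=64, thresh=0.4):
        """frames: iterable of grayscale arrays; returns frame indices of cuts."""
        cuts, prev = [], None
        for i, frame in enumerate(frames):
            hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
            hist = hist / hist.sum()          # normalize to a distribution
            if prev is not None and 0.5 * np.abs(hist - prev).sum() > thresh:
                cuts.append(i)
            prev = hist
        return cuts

    def merge_boundaries(audio_cuts, visual_cuts, tol=2.0):
        """Union of audio and visual cut times (seconds), collapsing near-duplicates."""
        merged = []
        for t in sorted(list(audio_cuts) + list(visual_cuts)):
            if not merged or t - merged[-1] > tol:
                merged.append(t)
        return merged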

3.3.4 Text Augmentation

Parallel to visually summarizing video clips with thumbnails, we use text to summarize audio clips. However, transcripts are not readily available for the presentations, and we cannot make the assumption that every presentation is accompanied by electronic slides. We thus generate transcripts using the IBM ViaVoice ASR. The resulting transcripts are highly imperfect, with large Word Error Rates (≈75%), due to several factors. We have manually generated a list of 30 frequently used words and phrases from the presentation slide titles, and we use them to filter the transcript. The resulting "theme phrases" are included in the user interface and provide the equivalent of a table of contents for each presentation (Figure 11, row 5). While the task of extracting title words can be automated, we have chosen to do so manually for this prototype. The list of theme phrases is considered static with respect to the course in which the tool is used. For domains outside of the videos we have used, this list of theme phrases could be compiled by cross-referencing frequently used headers or by listing all titles from presentation slides. We address a dynamic and scalable search method in the topical text interface.

Besides identifying theme phrases, we also apply text filtering to all of the phrases found in the source data of the electronic presentation slides, if available. To this end, each line of text in the slides is used as a phrase.

Figure 11: Complete timeline including thumbnails for sufficiently long video segments (row 1), a timeline with time markers combining video and audio segmentations (row 2), visual segmentation with activity graph (row 3: red), audio segmentation with activity graph (row 4: green), index phrases (row 5: yellow), and text phrases (row 6: yellow).

The resulting "topic phrases" are included as an additional index in the user interface and give clues about specific items discussed in the presentation, including names, locations, numbers, etc.

3.3.5 Interface

The interactive user interface is modeled as a linear timeline (Figure 11), horizontally spanning the screen. This provides an overview of the presentation video's structure and contents. For further exploration of the video, a zoom feature has been implemented that can be used to stretch the graph from 100 video frames/pixel to 1 frame/pixel (Figure 12). The interface has been modeled on informal observations of instructors and students accessing videos with more standard tools. The subjects in our classroom tend to have some familiarity with video editing, leading to the design of a row-media layout. We intend this view to be especially helpful for the viewing of video clips, while the text augmentation rows serve as search indices.

Figure 12: Text and thumbnail distribution effect upon zooming. (a) The video is zoomed at 34 frames/pixel. The thumbnail row displays overlapping images and the text row displays all corresponding phrases in these 9 minutes, which creates a very busy and difficult-to-read visual. (b) The video is zoomed at 6 frames/pixel, with obvious improvements in the thumbnail and text rows for these 90 seconds. While the words do not form sentences, they can be used to understand the material discussed here.

3.3.6 User Study

We have evaluated our methods of visual summarization and indexing in the preliminary interface with 176 students of varying knowledge about the contents of the videos. Overall, we find strong evidence that our tool and methods of summarization lead to more efficient searching, while at the same time retaining the accuracy of results of traditional linear search methods. While accuracy is at a similar level, the time required to answer a question was reduced by 20% on average when participants were unable to use the video (Figure 13). The main difference between the two groups lies in the usage of video versus keyframes.

We have observed that when video is available, students tend to spend more time watching video than necessary to complete a task of searching. Our approaches of combining audio and video segmentations show that the two media are neither inclusive nor exclusive, but complementary.

3.4 Research on Accommodating Sample Size Effects in symmetric KL in Speaker Clustering

Speaker segmentation and clustering are core requirements for finding structure in video. A popular approach to clustering uses the symmetric KL measure to compute similarities between MFCC audio features. We determined through simulation and empirical evidence that the symmetric KL distance between short audio segments (< 5 seconds) exhibits a large degradation in performance. We have also observed that a similar effect exists for comparisons between differently-sized feature segments. Accuracy decreases the larger the difference in length of two audio segments. We demonstrate an empirical correction of this sample-size effect that increases clustering accuracy.

3.4.1 Empirical Solution

The symmetric KL distance measure is derived under two critical simplifying assumptions [43]: first, that the MFCC vectors are distributed as a d-dimensional Gaussian, and, second, that the sample means and covariances are perfect estimators of the population means and covariances. The first assumption is necessary to allow a closed-form evaluation of a d-dimensional integral. The second assumption eliminates the need to model the effects of the two samples' lengths on the standard errors of their statistics: the samples are assumed to be infinitely long with zero standard error. In practice, the first assumption is often violated; in particular, the speech signal is usually the sum of the speaker's signal plus background noise. A Gaussian mixture model would be more appropriate, but deriving a closed-form symmetric KL for it presents formidable analytic difficulties. Similarly, in practice, the estimated statistics show increasing error as sample length diminishes. However, the analytic modeling of the impact of standard errors is also challenging, even under the assumption of a single d-dimensional Gaussian. The symmetric KL measure, then, has likely been used without an understanding of these substantial limitations.
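For reference, under these two assumptions the symmetric KL distance between segments A and B, modeled as d-dimensional Gaussians N(μ_A, Σ_A) and N(μ_B, Σ_B), takes the standard closed form (quoted here from the general Gaussian result, not reproduced from the proposal itself):

    KL2(A,B) = \tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma_A^{-1}\Sigma_B + \Sigma_B^{-1}\Sigma_A - 2I\right) + \tfrac{1}{2}\,(\mu_A - \mu_B)^{T}\!\left(\Sigma_A^{-1} + \Sigma_B^{-1}\right)(\mu_A - \mu_B)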

Time spent on questions

50 100 150 200 250 300 350 400 450 500

A C D B C A B C A B A B C A B C D Find video X (easy, 1st video, beginning) Find video X (difficult: 5th video, middle) Find your team's presentation Find yourself during the presentation Find team's discussion

  • n topic X

Summarize using keywords

Time (sec)

with Video without Video

Figure 13: Statistics collected from 176 user study participants in 4 groups (A, B, C, D), summarizing the overall improvements of the tool. We compare students having used the tool with and without access to the video stream. The speed at which questions were answered improved in almost all cases when using only the video summaries.


The symmetric KL measure, then, has likely been used without an understanding of these substantial limitations. Unfortunately, length effects do impact the clustering of shorter speech segments, which are nevertheless of considerable practical interest. In lieu of an analytic closed-form computation parameterized by the incoming lengths, we investigated one possible empirical solution, which adjusts the symmetric KL distance based on the response of the simulated model to segments taken from known identical distributions:

$$ KL_2'(A,B) \;=\; \frac{KL_2(A,B)}{KL_2^{sim}(A,B)} $$

where KL2' is the adjusted symmetric KL distance for the speech segment pair (A,B) with lengths (|A|,|B|), and KL2sim is the simulated same-distribution response at those lengths. With the exception of a few outliers, KL2' for short feature sets is more comparable to that of long ones for data from one speaker. We have tested this solution for clustering on data sets containing 5 and 20 speakers with significantly different-sized speech segments (5-252 seconds). Results for this empirical solution are favorable: dendrograms show that clustering improves, usually by uniting small spurious outlying clusters with the true cluster to which they belong.
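A minimal sketch of this correction, not the thesis implementation: single Gaussians with diagonal covariances stand in for the full models, and kl2_sim estimates the same-distribution response by resampling pooled frames; the function names, the trial count, and the diagonal-covariance simplification are all assumptions of this example.

```python
# Sketch: symmetric KL between two MFCC segments under diagonal-Gaussian
# models, plus the empirical length-based correction KL2' = KL2 / KL2_sim.
import numpy as np

def kl2(a, b, eps=1e-6):
    """Symmetric KL between Gaussian fits of feature matrices (frames x dims)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0) + eps, b.var(axis=0) + eps   # diagonal covariances
    dmu = mu_a - mu_b
    return 0.5 * np.sum(va / vb + vb / va - 2.0 + dmu * dmu * (1.0 / va + 1.0 / vb))

def kl2_sim(features, len_a, len_b, trials=50, rng=None):
    """Average KL2 of two segments of lengths (len_a, len_b) drawn from one
    pooled distribution: the sample-size response for identical speakers."""
    rng = rng or np.random.default_rng(0)
    vals = []
    for _ in range(trials):
        idx = rng.permutation(len(features))
        vals.append(kl2(features[idx[:len_a]], features[idx[len_a:len_a + len_b]]))
    return float(np.mean(vals))

def kl2_adjusted(a, b):
    """KL2'(A,B): raw distance normalized by the simulated same-speaker response."""
    pooled = np.vstack([a, b])
    return kl2(a, b) / kl2_sim(pooled, len(a), len(b))
```

Because kl2_sim grows as the segments shrink, the division deflates the spuriously large distances that short segments produce, making short-pair and long-pair distances comparable, as described above.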

3.5 Summary of research progress

We have completed analysis of lecture and student presentation videos, including their visual segmentation and clustering and text transcript mapping approaches. Work on speaker segmentation has also been completed, while problems in speaker clustering are still being addressed. Two preliminary user interfaces, one for structure representation and one for content browsing, have been developed and evaluated in user studies. Their completion and integration depend on the development of content- and structure-based visualization methods, as well as on user study evaluations. We intend to begin work on discussion videos in the near future, building on previously designed segmentation and clustering approaches. Work on analysis of video content structure and on comparison tools will commence during the final stages of this research.

4 Proposed Work

The goal of this research is to construct multimedia indices and tables of contents for personal interaction videos for the purpose of structured browsing. Several methods for segmentation, and approaches for visualization and interaction, have been explored. However, a complete analysis, implementation, and evaluation require further consideration. Once completed, the individual approaches for determining structure in interaction videos can be selectively applied to other genres of video for similar content analysis and dissemination. The framework of tools for analysis and visualization is presented in Figure 8 and its variants.

4.1 High-level Structure Detection

We have demonstrated structure detection for lecture videos in terms of teaching units and their temporal discontinuities. In this context, we have used written and diagrammatic content to cluster keyframes, and have successfully determined distinct topics in the video. This analysis of structure resulted in the identification of topical centers in a lecture, where several topics were discussed in the same time period. We are working on extracting similar structure from student presentation videos, and will be applying the same method to discussion videos. We expect that discussion videos exhibit a form of structure in terms of dialogue between persons, and topical centers when a particular topic is discussed vibrantly among several partakers.
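To make the keyframe-clustering step concrete, here is a minimal sketch, not the system's actual features or thresholds: each keyframe is reduced to the set of words written on it, and keyframes are greedily grouped by Jaccard overlap. The function names and the 0.3 threshold are illustrative assumptions.

```python
# Sketch: greedy agglomerative grouping of keyframes by the overlap of the
# written words detected on them; repeated labels over time reveal topical
# centers where several topics recur within the same period.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_keyframes(keyframe_words, threshold=0.3):
    """keyframe_words: word sets, one per keyframe, in temporal order.
    Returns one topic label per keyframe."""
    clusters, labels = [], []          # one representative word set per topic
    for words in keyframe_words:
        best, best_sim = None, threshold
        for i, rep in enumerate(clusters):
            sim = jaccard(words, rep)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            clusters.append(set(words))
            labels.append(len(clusters) - 1)
        else:
            clusters[best] |= words    # grow the topic's vocabulary
            labels.append(best)
    return labels
```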


Student presentation videos do not contain a strong occurrence of dialogue or speaker recurrence during the presentation itself, but we expect that question/answer sessions between audience and presenters exhibit other structural tendencies. An example of a hierarchical structure of video contents is presented in Table 2, which draws parallels to the structure of a book.

Table 2: Potential parallels between books and interaction videos. A book is structured into chapters and sections; some books have sub-sections, others may impose a hierarchy of groups of chapters. Video contents can be cast into a book-like hierarchy. Shown here is one possibility for such a structure.

• Lecture Video: Book = all lectures for a semester; Chapter = one lecture; Section = teaching units (topics of similar written content); Paragraph = one keyframe.
• Student Presentation Video: Book = all presentations for a class; Chapter = one presentation; Section = material discussed by one speaker, or a Q&A session; Paragraph = one electronic slide (if available).
• Discussion Video: Book = all discussions for a team; Chapter = one discussion; Section = dialogue between N partakers on a topic; Paragraph = one speaker.

4.2 Video Structure Comparison

Once a structure from audio, video, and text has been found for a set of videos, differences and similarities between videos can be determined. Structural comparisons can be used to find commonalities between styles of interactions in a video, as well as outliers that may be of particular contextual interest. We have performed informal comparisons between lecture videos from two instructors, given the visual scene topology, and found interesting differences in teaching style. While one instructor preferred to cover class material in a linear fashion, the second instructor engaged in an exchange of 2-4 subtopics at one time. This difference was made apparent by repeatedly interleaving clusters of keyframes belonging to those topics. Interestingly, this aggregation of teaching content occurred in the middle of almost every class in a series of lectures by the same instructor. We expect that structure comparisons of this kind can also be used for clustering of videos, and for automatic selection of recommended visualization tools for a given video.
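As a toy illustration of such a comparison, not the proposed tool itself, the linear-versus-interleaved distinction can be read directly off a per-keyframe topic-label sequence, such as one produced by the clustering sketch in Section 4.1; the scoring function here is an assumption for this example.

```python
# Sketch: score how often a lecture's topic sequence returns to an earlier
# topic. Linear teaching yields 0; interleaving 2-4 subtopics yields a high score.
def interleaving_score(topics):
    """Fraction of topic transitions that revisit a previously seen topic."""
    seen, revisits, transitions, prev = set(), 0, 0, None
    for t in topics:
        if prev is not None and t != prev:
            transitions += 1
            if t in seen:
                revisits += 1
        seen.add(t)
        prev = t
    return revisits / transitions if transitions else 0.0

print(interleaving_score(["A", "A", "B", "B", "C", "C", "D"]))  # 0.0 (linear)
print(interleaving_score(["A", "B", "A", "C", "B", "A", "C"]))  # ~0.67 (interleaved)
```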


4.3 Speaker Table of Contents

Cues for structure and indices to contents in videos can be derived from speaker information; this is particularly applicable to student presentations and discussions. Structural cues include dialogues and the order and progression of speakers, while speaker indices and tables of contents are ideally produced from the speakers' names. We have already completed an implementation of speaker segmentation, and are working on improving speaker clustering based on a number of problems we have identified with conventional methods. We propose to build a visual interface for speaker indexing that uses face detection in videos to extract the relevant speaker from a given video clip. Using such visual indices, contents in videos can be browsed by speaker, and synchronized transcript text can be identified on a per-speaker basis. We do not propose using face recognition to build speaker name indices, because this approach requires a database of faces; especially for a class of 160 students per semester, it poses difficulties in timely preparation. We intend to build on prior work on face detection in video, such as an MPEG-specific approach [41], or a method developed specifically for talking heads in video [42].
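A minimal sketch of the thumbnail-extraction step, using a generic off-the-shelf detector (OpenCV's stock Haar cascade) rather than the cited MPEG-specific [41] or talking-heads [42] methods; file paths and parameters are illustrative.

```python
# Sketch: grab a frame at a speaker segment's midpoint and crop the largest
# detected face to serve as that speaker's index thumbnail.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def speaker_thumbnail(video_path, t_seconds):
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, t_seconds * 1000)  # seek to the midpoint
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return None
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    return frame[y:y + h, x:x + w]
```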

4.4 Analysis of Discussion Video and Application of Common Approaches

We have determined some methods for segmentation, structure determination, and visualization for lecture videos, and partly for student presentation videos. Discussion videos share several features with those two genres, including segmentation by speaker, and detection and clustering of different types of media, here being the environment. Since a discussion video can be recorded in different locations, e.g. inside versus outside, detection of similarities between environments is an important part of the structure of the video. We intend to apply previously identified methods of segmentation and structure visualization to this third genre of video as a verification and evaluation of those methods. We expect that an abstract topological graph can be built for this, as well as for other genres of video.

4.5 Text Indexing

Text transcripts are not difficult to create for videos. The literature shows, however, that without adequate language or speaker model adaptation, transcript accuracy decreases dramatically. In most cases it is not feasible to generate custom adaptations, because the time and cost of such customization would exceed the utility of the tool that produces video indices. We have successfully shown that simple external filters can be used to separate correctly versus incorrectly identified words and phrases. In our text graphs for presentation videos, filtered words and phrases are presented to the viewer for gaining an overall sense of the content of a video.

We need to perform further analysis that would make these isolated and seemingly random words and phrases more relevant in context.
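A minimal sketch of such an external filter; the vocabulary source and word lists are placeholder assumptions, not the filters actually used. Tokens absent from an externally derived vocabulary, which is where ASR misrecognitions tend to land, are dropped.

```python
# Sketch: keep only transcript tokens that appear in an external domain
# vocabulary (e.g., harvested from slides or a textbook index) and are not
# stopwords; misrecognized filler falls out.
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "that", "it"}

def filter_transcript(tokens, domain_vocabulary):
    return [t for t in tokens
            if t.lower() in domain_vocabulary and t.lower() not in STOPWORDS]

asr = "the eigen valve decomposition of a matrix is".split()   # "eigenvalue" misheard
vocab = {"eigenvalue", "decomposition", "matrix"}
print(filter_transcript(asr, vocab))  # ['decomposition', 'matrix']
```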

4.6 User Interfaces and Tools

We are building an integrated segmentation tool, and are designing and refining user interfaces to facilitate multi-modal structural browsing.

The system implementation requires ease of transferring recorded material from cameras to streaming video and to a database of segmented data. The user interface must integrate seamlessly with the database and streaming video for reasonably fast browsing. We have established the framework in which videos are segmented in an off-line process and the data is transferred to a database; the video browser uses that database to present the information. However, full streaming video capabilities, as well as unsupervised (non-user-study) use of the video browser, have yet to be considered. With progress on other research issues, we will update and improve our tools for segmentation and dissemination.
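A minimal sketch of what the segment database behind the browser could look like; the table and column names are assumptions, not the actual system schema.

```python
# Sketch: off-line segmentation writes rows like these; the browser resolves
# text queries to (video, offset) pairs for playback.
import sqlite3

conn = sqlite3.connect("segments.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS video   (id INTEGER PRIMARY KEY, title TEXT, url TEXT);
CREATE TABLE IF NOT EXISTS segment (id INTEGER PRIMARY KEY,
                                    video_id INTEGER REFERENCES video(id),
                                    start_sec REAL, end_sec REAL,
                                    speaker TEXT,      -- from audio clustering
                                    keyframe TEXT,     -- path to thumbnail image
                                    transcript TEXT);  -- filtered ASR phrases
""")
conn.commit()

rows = conn.execute(
    "SELECT video_id, start_sec FROM segment WHERE transcript LIKE ?",
    ("%matrix%",)).fetchall()
```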


4.7 User Studies

For the past two semesters, we have been collecting results from extensive user studies in the classroom environment, each involving more than 160 students. The user studies follow the incremental research and the changes made to the segmentation and browsing tools. We will continue to collect this data to evaluate the changes and potential improvements offered by our approaches.

As the tools mature and are integrated into the classroom environment, we expect our user studies to advance beyond supervised experiments. At that point, we will be collecting student-specific statistics. A similar approach has been used for evaluation purposes in eClass (previously Classroom 2000) [38]. IRB approval for future studies is expected in the near future.

4.8 Feedback Annotations for Videos (Optional)

Student presentations and discussions are recorded primarily for archiving. The methods and tools discussed in this work address the needs of the two audiences involved in the presentations: instructors and students. Instructors use the presentations to evaluate and grade the performance of teams and individual students.

Students would like to review the footage to evaluate their own performance. The review and feedback process between instructors and students can be linked by introducing annotation tools that instructors use to highlight certain areas in the video. Annotations are made in the form of searchable text, which is added as a separate media track alongside the existing audio, visual, textual, and structural tracks. Students can use the feedback on specific "good" or "bad" examples of presentation items to improve their own style. We intend to implement and evaluate this method of feedback as an alternative to text summaries and traditional grading techniques.
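A minimal sketch of such an annotation track as a searchable record; the field names are illustrative assumptions.

```python
# Sketch: instructor annotations stored alongside the audio, visual, textual,
# and structural tracks, searchable by students reviewing their footage.
from dataclasses import dataclass

@dataclass
class Annotation:
    video_id: int
    start_sec: float
    end_sec: float
    author: str     # instructor
    label: str      # e.g. "good eye contact", "unclear slide"
    comment: str    # free-form, searchable feedback text

def search_annotations(annotations, query):
    q = query.lower()
    return [a for a in annotations
            if q in a.label.lower() or q in a.comment.lower()]
```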

5 Conclusion

Content- and structure-based analysis of video genres is a necessary step toward video summarization. Without sufficient understanding of visual, audio, and other video components, summarization and indexing cannot take advantage of structure that is implicitly understood by humans. Research in this field is particularly applicable to content-rich videos, such as lecture, presentation, and discussion videos. Videos of these genres tend to be uncut and characteristically long, rendering them difficult to browse with conventional tools. We hope that our investigations in this field will lead to insights and useful tools for video browsing and structure comparison.


5.1 Schedule

Audio Structure (Spring 2006)
• Speaker clustering, modified GMM
• Speaker segmentation at higher temporal precision
• Visual speaker indices (face detection)

Text Indexing (Summer 2006)
• Segmentation by key words/phrases
• ASR transcript / audio synchronization
• Text querying / search

User Interface (Summer 2006)
• Streaming video
• Keyframe player

Annotations (Summer 2006, optional)
• Text annotation of videos for feedback and comments

Discussion Videos (Fall 2006; suspended from course for academic year 2005-06)
• Visual segmentation
• Audio segmentation
• Transcript analysis
• (Application of approaches from the two other genres)

Lecture Videos (Spring 2007)
• Merge visual segmentation with transcript clustering
• Implement into video browser

User Studies (Spring 2006 – Fall 2007)
• Unsupervised usage logging
• Supervised experiments (IRB)

Structure (Fall 2007)
• Tool and methods for video comparison using structural cues
• Highlight structural differences between videos of the same genre
• Cluster using structure (video similarity)


References

[1] F.I. Bashir, A.A. Khokhar, "Content based video indexing and retrieval: Current techniques and future directions", International Workshop on Frontiers of Information Technology, December 2003.
[2] S. Mukhopadhyay, B. Smith, "Passive capture and structuring of lectures", Proceedings of the 7th ACM International Conference on Multimedia, Orlando, FL, 1999, pp. 477-487.
[3] M.A. Smith, T. Kanade, "Video skimming and characterization through the combination of image and language understanding", Proceedings of the 1998 IEEE International Workshop on Content-Based Access of Image and Video Databases, Bombay, India, 1998, pp. 61-70.
[4] H. Yang, L. Chaisorn, Y. Zhao, S. Neo, T. Chua, "VideoQA: question answering on news video", Proceedings of the 11th ACM International Multimedia Conference and Exhibition (MM 03), Berkeley, California, 2003, pp. 632-641.
[5] A. Haubold, J.R. Kender, "Augmented Segmentation and Visualization for Presentation Videos", ACM Multimedia Conference (MM 2005), November 2005, pp. 51-60.
[6] F. Souvannavong, B. Merialdo, B. Huet, "Latent semantic indexing for video content modeling and analysis", Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, New York, NY, 2004, pp. 243-250.
[7] T.K. Landauer, P.W. Foltz, D. Laham, "An introduction to latent semantic analysis", Discourse Processes, Volume 25, 1998, pp. 259-284.
[8] C. Dorai, V. Oria, V. Neelavalli, "Structuralizing educational videos based on presentation content", Proceedings of the 10th International Conference on Image Processing, Barcelona, Spain, 2003.
[9] A. Haubold, J.R. Kender, "Analysis and Interface for Instructional Video", IEEE Conference on Multimedia and Expo (ICME 2003), July 2003, pp. 705-708.
[10] A. Fujii, K. Itou, T. Akiba, T. Ishikawa, "A cross-media retrieval system for lecture videos", Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003), Geneva, Switzerland, 2003, pp. 1149-1152.
[11] Z. Liu, Y. Wang, T. Chen, "Audio feature extraction and analysis for scene segmentation and classification", Journal of VLSI Signal Processing Systems, Volume 20, Issue 1-2, October 1998, pp. 61-79.


[12] S.S. Chen, P.S. Gopalakrishnan, "Speaker, environment and channel detection and clustering via the Bayesian information criterion", Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, VA, 1998, pp. 127-132.
[13] A. Waibel, T. Schultz, M. Bett, M. Denecke, R. Malkin, I. Rogina, R. Stiefelhagen, J. Yang, "SMaRT: the smart meeting room task at ISL", Proceedings of the 2003 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, China, 2003, pp. IV 752-IV 755.
[14] S.J. Young, M.G. Brown, J.T. Foote, G.J.F. Jones, K.S. Jones, "Acoustic indexing for multimedia retrieval and browsing", Proceedings of the 22nd International Conference on Acoustics, Speech and Signal Processing (ICASSP 97), Munich, Germany, 1997, pp. 1:199-202.
[15] S. Whittaker, J. Hirschberg, J. Choi, D. Hindle, F. Pereira, A. Singhal, "SCAN: Designing and evaluating user interfaces to support retrieval from speech archives", Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR '99), Berkeley, CA, 1999, pp. 26-33.
[16] M. Witbrock, A.G. Hauptmann, "Speech recognition and information retrieval: experiments in retrieving spoken documents", Proceedings of the 1997 DARPA Speech Recognition Workshop, Chantilly, VA, February 2-5, 1997, pp. 160-164.
[17] J. Glass, T.J. Hazen, L. Hetherington, C. Wang, "Analysis and processing of lecture audio data: preliminary investigations", Proceedings of the HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, Boston, MA, 2004, pp. 9-12.
[18] A. Haubold, J.R. Kender, "Analysis and Visualization from Audio Transcripts of Instructional Videos", IEEE International Workshop on Multimedia Content-based Analysis and Retrieval (MCBAR 2004), December 2004, pp. 570-573.
[19] Y. Yang, J. Wilbur, "Using corpus statistics to remove redundant words in text categorization", Journal of the American Society for Information Science, Volume 47, Issue 5, May 1996, pp. 357-369.
[20] M. Lin, J.F. Nunamaker, M. Chau, H. Chen, "Segmentation of lecture videos based on text: a method combining multiple linguistic features", Proceedings of the 37th Hawaii International Conference on System Sciences, Big Island, Hawaii, 2004, pp. 3-11.
[21] D. Ponceleon, S. Srinivasan, "Automatic discovery of salient segments in imperfect speech transcripts", Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM '01), Atlanta, GA, 2001, pp. 490-497.
[22] C. Fellbaum, et al., "WordNet – An Electronic Lexical Database", MIT Press, May 1998.


[23] A. Natsev, M.R. Naphade, J.R. Smith, "Semantic representation: search and mining of multimedia", Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD '04), Seattle, WA, 2004, pp. 641-646.
[24] J.R. Kender, M.R. Naphade, "Visual Concepts for News Story Tracking: Analyzing and Exploiting the NIST TRECVID Video Annotation Experiment", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), San Diego, CA, 2005, pp. 1174-1181.
[25] A. Haubold, A. Natsev, M.R. Naphade, "Semantic Multimedia Retrieval Using Lexical Query Expansion and Model-based Reranking", IEEE Conference on Multimedia and Expo (ICME 2006), July 2006.
[26] A.G. Hauptmann, R. Jin, T.D. Ng, "Video retrieval using speech and image information", Proceedings of the 15th Electronic Imaging Conference, Santa Clara, CA, 2003.
[27] H. Sundaram, S.-F. Chang, "Determining computable scenes in films and their structures using audio-visual memory models", Proceedings of the 8th ACM International Conference on Multimedia, Los Angeles, CA, 2000, pp. 95-104.
[28] H. Lee, A.F. Smeaton, J. Furner, "User interface issues for browsing digital video", Proceedings of the 21st Colloquium on Information Retrieval (IRSG '99), Glasgow, UK, 1999.
[29] B. Shneiderman, "The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations", Proceedings of the IEEE Symposium on Visual Languages, College Park, MD, 1996, pp. 336-343.
[30] F.C. Li, A. Gupta, E. Sanocki, L. He, Y. Rui, "Browsing digital video", Proceedings of the Conference on Human Factors in Computing Systems (CHI '00), The Hague, Netherlands, 2000, pp. 169-176.
[31] A. Girgensohn, J. Boreczky, L. Wilcox, "Keyframe-based user interfaces for digital video", IEEE Computer, Volume 34, Number 9, September 2001, pp. 61-67.
[32] G.D. Abowd, C.G. Atkeson, A. Feinstein, C. Hmelo, R. Kooper, S. Long, N. Sawhney, M. Tani, "Teaching and learning as multimedia authoring: The Classroom 2000 project", Proceedings of the 4th ACM International Multimedia Conference and Exhibition (MULTIMEDIA 96), Boston, Massachusetts, 1996, pp. 187-198.
[33] M. Worring, G.P. Nguyen, L. Hollink, J.C. van Gemert, D.C. Koelma, "Accessing video archives using interactive search", Proceedings of the 2004 International Conference on Multimedia and Expo, Taipei, Taiwan, 2004.
[34] L. Tang, J.R. Kender, "Designing an Intelligent User Interface for Instructional Video Indexing and Browsing", Proceedings of the 2006 International Conference on Intelligent User Interfaces, Sydney, Australia, 2006, pp. 318-320.


[35] E. Altman, Y. Chen, W.C. Low, "Semantic exploration of lecture videos", Proceedings of the 10th ACM International Conference on Multimedia, Juan-les-Pins, France, 2002, pp. 416-417.
[36] S. Whittaker, J. Hirschberg, J. Choi, D. Hindle, F. Pereira, A. Singhal, "SCAN: Designing and evaluating user interfaces to support retrieval from speech archives", Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR '99), Berkeley, CA, 1999, pp. 26-33.
[37] S.J. Young, M.G. Brown, J.T. Foote, G.J.F. Jones, K.S. Jones, "Acoustic indexing for multimedia retrieval and browsing", Proceedings of the 22nd International Conference on Acoustics, Speech and Signal Processing (ICASSP 97), Munich, Germany, 1997, pp. 1:199-202.
[38] J.A. Brotherton, G.D. Abowd, "Lessons learned from eClass: assessing automated capture and access in the classroom", ACM Transactions on Computer-Human Interaction, Vol. 11, No. 2, June 2004, pp. 121-155.
[39] D.-S. Lee, B. Erol, J.J. Hull, "Segmenting People in Meeting Videos Using Mixture Background and Object Models", Proceedings of the 3rd IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, Hsinchu, Taiwan, 2002, pp. 791-798.
[40] A. Gruenstein, J. Niekrasz, M. Purver, "Meeting Structure Annotation: Data and Tools", Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon, Portugal, 2005, pp. 117-127.
[41] H.L. Wang, S.F. Chang, "A Highly Efficient System for Automatic Face Region Detection in MPEG Video", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 4, August 1997.
[42] R. Cutler, L.S. Davis, "Look Who's Talking: Speaker Detection using Video and Audio Correlation", IEEE International Conference on Multimedia and Expo, New York, NY, 2000, pp. 1589-1592.
[43] J.P. Campbell, Jr., "Speaker Recognition: A Tutorial", Proceedings of the IEEE, Vol. 85, No. 9, September 1997, pp. 1437-1462.
[44] C.D. Manning, H. Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, Cambridge, MA, 1999.
[45] A. Haubold, J. Landers, J.R. Kender, "Segmentation, Identification, and Indexing of Presentations and Speakers in Unedited Videos", ACM Multimedia Conference (MM 2006), in preparation.