Hierarchical Segmentation of Presentation Videos through Visual and Text Analysis

Conference Paper · September 2006

DOI: 10.1109/ISSPIT.2006.270818 · Source: IEEE Xplore


Hierarchical Segmentation of Presentation Videos through Visual and Text Analysis

Honglin Li and Aijuan Dong

Department of Computer Science, North Dakota State University, Fargo, ND 58105
{honglin.li, aijuan.dong}@ndsu.edu

Abstract - Presentation videos play an important role in information sharing and exchange. To effectively utilize these video assets, one of the important steps is to segment a long video stream into smaller, semantic units. In this paper, we investigate hierarchical segmentation of presentation videos by combining visual and text analysis. Slide-level segmentation employs visual information and computes a sequence of slide-level video segments so that the projected slide image of each such segment does not change. Topic-level segmentation makes use of extracted slide text and generates a sequence of topic-level video segments so that the topic of each such video segment does not change. The proposed segmentation procedure has been tested against various presentation videos, and the experimental results are presented and discussed.

Keywords - hierarchical video segmentation, presentation video, visual information, text analysis, and topic words.

1. INTRODUCTION

With recent advances in multimedia processing and automatic presentation recording, a large number of presentation videos are produced from conferences, lectures, meetings, and corporate trainings. These presentation videos cover a wide spectrum of topics and play an important role in information sharing and exchange. However, due to the unstructured and linear nature of video, people often find it difficult to locate a specific piece of information in a presentation video. To ensure effective exploitation of these video assets, efficient and flexible access mechanisms must be provided. Research has found that multimedia users strongly prefer hierarchical video access. With a hierarchical presentation, video content is organized at different granularity levels, which allows a user to flexibly access the video segments of his/her particular interest. In a search scenario, instead of returning a whole video that contains a lot of irrelevant information, the most relevant video segment can be returned, thus increasing the relevancy of video retrieval.

To provide hierarchical video access, the first and most important step is to hierarchically segment a long video stream into smaller, semantic units. A variety of techniques have been proposed to segment presentation videos. Earlier work from the Cornell Lecture Browser [1] uses a feature-based algorithm to segment a slide video stream: frames are clipped, filtered and adaptively thresholded to produce binary images, and feature differences between the binary images are then calculated and used to segment the stream. Later, Yamamoto et al. [2] propose topic segmentation of lecture videos by associating lecture speech with the lecture textbook; the association is performed by computing the similarity between topic vectors obtained from the textbook and a sequence of lecture vectors obtained from lecture speech through spontaneous speech recognition. In another paper, a content density function is proposed to segment instructional videos [3]; it draws guidance from the observation that topic boundaries coincide with the ebb and flow of the "density" of the content shown in videos. Recently, Lin et al. [4] investigate a linguistics-based approach to lecture video segmentation, in which multiple linguistic segmentation features from lecture speech, such as noun phrases and cue phrases, are extracted and explored.

In spite of these successes, most of the approaches described above focus on linearly segmenting video streams into smaller units. In our study, we noticed that a presentation usually consists of many topics, and each topic covers several slides (Figure 1). This structure enables hierarchical segmentation, indexing and access. This paper focuses on hierarchical segmentation of presentation videos. Specifically, two-level video segmentation is investigated in our work: topic-level and slide-level. As in most video segmentation tasks, visual information alone cannot reliably detect topic changes, so segmentation at the topic level is usually based on analysis of related text. In this paper, we study segmentation of presentation videos at the topic level through analysis of extracted slide text, while segmentation at the slide level employs visual information. To map the segmentation results from slide text analysis back to video segmentation and thus achieve hierarchical segmentation, matching between extracted key frames and converted slide images is performed through image edge analysis.

The rest of the paper is organized as follows. We give an overview of the approach in Section 2, then discuss slide-level segmentation and topic-level segmentation in detail in Sections 3 and 4 respectively. Experimental results are given in Section 5. Section 6 concludes the paper and points out some future research.

[Figure 1. Hierarchical view of presentations: a presentation is divided into topics (Topic 1 ... Topic n), and each topic covers one or more slides (Slide 1 ... Slide m).]



2. OVERVIEW

Hierarchical segmentation of presentation videos as discussed here (Figure 2) employs two types of data: slide video streams captured by a stationary camera and PowerPoint slide files. Slide-level segmentation operates on slide video streams, while topic-level segmentation makes use of extracted slide text. In the end, slide-level segmentation creates a sequence of slide-level video segments; within each such segment, the projected slide image does not change. Topic-level segmentation generates a sequence of topic-level video segments, each of which discusses one or more slides; within each such segment, the topic does not change.

As Figure 2 shows, the first step in topic-level segmentation is text-based segmentation through Topic Words Introduction (TWI), which generates a sequence of slide blocks, each of which discusses one topic. To associate each slide block with its corresponding topic-level video segments, the temporal relationship between the slide video stream and the slides must be established. This is accomplished by matching slide images converted from PowerPoint slides with key frames extracted from the slide-level video segments. Based on the timing information of each slide, slide blocks can then be mapped to topic-level video segments, thus achieving hierarchical video segmentation. In the following sections, we discuss slide-level segmentation and topic-level segmentation in detail.

3. SLIDE-LEVEL SEGMENTATION

Slide-level segmentation divides a continuous slide video stream into video segments, each of which matches one slide. More formally, given a presentation video stream $v$ and a set of $n$ slides, compute a set of video segments $VS = \{vs_1, \ldots, vs_m\}$ such that the projected slide image of each video segment $vs_i$ $(1 \le i \le m)$ does not change. Notice that this definition only requires that each video segment $vs_i$ displays the same slide; it does not impose that two adjacent segments display different slides. Thus, extra segments (false positives) are acceptable. If the matching process detects that the same slide is shown in two consecutive video segments, these segments are combined. By allowing extra segments, it is less likely that slide transitions go undetected.

Slide-level segmentation as discussed here employs local color histogram difference. We compare the local color histograms of successive frames; when the difference is large, a slide-level boundary is declared. This approach is simple, but it works well for presentation videos since these videos do not have special effects such as fades, dissolves and wipes, and most slide transitions are abrupt cuts.
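As a rough illustration of this step, the sketch below (assuming OpenCV and NumPy are installed) declares a slide-level boundary whenever the summed difference of per-block color histograms between successive frames is large; the grid size, number of bins and threshold are illustrative choices, not values reported in the paper.

```python
# Minimal sketch: slide-level boundaries from local color histogram difference.
# Grid size, bin count and threshold are illustrative, not the paper's values.
import cv2
import numpy as np

def local_hist(frame, grid=4, bins=8):
    """Concatenated, normalized per-block color histograms of one frame."""
    h, w, _ = frame.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            block = frame[gy * h // grid:(gy + 1) * h // grid,
                          gx * w // grid:(gx + 1) * w // grid]
            hist = cv2.calcHist([block], [0, 1, 2], None,
                                [bins] * 3, [0, 256] * 3)
            feats.append(hist.flatten() / (hist.sum() + 1e-9))
    return np.concatenate(feats)

def slide_level_boundaries(video_path, threshold=0.5):
    """Frame indices where the local histogram difference exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        feat = local_hist(frame)
        if prev is not None and np.abs(feat - prev).sum() > threshold:
            boundaries.append(idx)  # declare a slide-level boundary here
        prev, idx = feat, idx + 1
    cap.release()
    return boundaries
```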

4. TOPIC-LEVEL SEGMENTATION

In our study, we observed that most presentations tend to follow a basic structure in spite of differences in content and format. A typical presentation, especially a conference presentation, starts with a title slide, then an outline/overview slide, followed by a number of content slides (Figure 3). The outline/overview slide of a presentation summarizes the major topics that will be covered in the content slides. Based on this observation of presentation structure, we propose a text segmentation algorithm, Topic Words Introduction.

As discussed in the overview (Section 2), to map the segmentation results of TWI back to video segmentation, image matching between extracted key frames and converted slide images is required. Therefore, in the following subsections, we discuss topic-level segmentation as a two-step process: Topic Words Introduction and image matching.

[Figure 2. Hierarchical segmentation of presentation videos. The slide video stream is digitized and decompressed into frames for slide-level segmenting (producing slide-level boundaries) and key-frame extraction; the PowerPoint slides are converted to slide images and slide text; text-based segmenting (TWI) produces presentation slide blocks; image matching between key frames and slide images yields time stamps of slides, and the mapping process combines slide blocks, time stamps and slide-level boundaries into two-level video segments.]


4.1. Topic Words Introduction

Topic Words Introduction (TWI) segments a presentation into topically coherent slide blocks. More formally, given a presentation $p$ and a set of $n$ content slides, compute a set of slide blocks $SB = \{sb_1, \ldots, sb_k\}$ such that the topic of each $sb_i$ $(1 \le i \le k)$ does not change.

The Topic Words Introduction algorithm works on slide text that is automatically extracted. Specifically, for each presentation slide file, we extract slide content from its outline/overview slide and slide titles from its content slides. With the extracted text, the Topic Words Introduction algorithm consists of three main phases: morphological analysis, lexical score determination and boundary identification.

Morphological analysis. The purpose of this phase is to determine the terms to be used in the later phases. With a simple regular expression pattern matcher and a stopword list, punctuation and uninformative words are removed. The remaining slide text is converted to streams of tokens including words, numbers and symbols. A stemming algorithm [5] is then applied to these tokens to obtain word stems. These stems are the registered terms of a presentation. An example of the extracted text is illustrated in Figures 4(a) and 4(b).
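A minimal sketch of this phase is shown below, assuming NLTK's implementation of the Porter stemmer [5]; the regular expression and the stopword list are illustrative stand-ins rather than the ones used by the authors.

```python
# Morphological analysis sketch: tokenize, drop stopwords, stem (Porter [5]).
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}
TOKEN_RE = re.compile(r"[a-z0-9]+")
stemmer = PorterStemmer()

def register_terms(line):
    """Return the registered terms (word stems) for one line of slide text."""
    tokens = TOKEN_RE.findall(line.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

# For example (a hypothetical outline line), register_terms("Lessons learned")
# yields the stems ['lesson', 'learn'], as in Figure 4(a).
```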

Lexical score determination. The purpose of this phase is to measure the similarity between a topic and a slide. Since most presenters summarize their major topics in outline/overview slides, analyzing the text extracted from the outline/overview slide can identify the topics of a presentation. In our study, for each presentation, we take each line of text from its outline/overview slide (Figure 4(a)) as one topic. For example, "lesson learn" is one identified topic. If there is more than one level in the outline/overview slide, then the content of the first level is used. A dictionary of word stem frequencies is constructed for each line of text and is represented by a vector of frequency counts. These vectors are called topic vectors in our discussion. Content slides are represented by their titles; each line in Figure 4(b) represents one slide title. Similarly, a dictionary of word stem frequencies is constructed for each slide title and is again represented as a vector of frequency counts. These vectors are called content vectors in our discussion. To segment presentations at the topic level, we calculate lexical scores between topic vectors and content vectors. A lexical score measures the lexical similarity between two vectors and is computed with the cosine similarity measure (Formula (1)) [6].

$$ score(i, j) = \frac{\sum_t w_{t,t_i} \, w_{t,c_j}}{\sqrt{\sum_t w_{t,t_i}^2 \sum_t w_{t,c_j}^2}} \qquad (1) $$

where $t_i$ is a topic vector, $c_j$ a content vector, $t$ ranges over all the registered terms of $t_i$ and $c_j$, $w_{t,t_i}$ is the weight assigned to term $t$ in topic vector $t_i$, and $w_{t,c_j}$ is the weight assigned to term $t$ in content vector $c_j$. Here, the term weights are simply their frequency counts. For a presentation with $k$ topics and $n$ content slides, each topic has $n$ lexical scores and the total number of lexical score calculations is $kn$.

Boundary identification. Based on lexical cohesion theory, the more words two vectors share, the more strongly they are semantically related. Thus, the lexical score between a topic vector and a content vector measures how strongly the two are related and is used here to determine topic boundaries: the larger the score, the more likely a boundary occurs at that content slide. The steps for boundary identification are stated in Figure 5. For each topic $i$, if there exists a lexical score greater than zero (line 2), then its boundary is set where the first maximum lexical score occurs (line 3). Otherwise, the algorithm locates the previous and subsequent boundaries, calculates the lexical scores of adjacent content vectors within these boundaries, and sets a boundary where the lexical score is greater than a threshold $T_1$ (lines 4-7).
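The following sketch computes these lexical scores with Formula (1); the topic lines and slide titles are taken from the Figure 4 example, but the variable names and the use of Python's Counter are illustrative.

```python
# Lexical score determination sketch: cosine similarity (Formula (1)) between
# topic vectors (outline/overview lines) and content vectors (slide titles).
from collections import Counter
import math

def lexical_score(topic_vec, content_vec):
    """Cosine similarity between two word-stem frequency vectors."""
    num = sum(topic_vec[t] * content_vec[t] for t in topic_vec if t in content_vec)
    den = math.sqrt(sum(w * w for w in topic_vec.values()) *
                    sum(w * w for w in content_vec.values()))
    return num / den if den else 0.0

topics = ["background", "barrier", "experi gridblast keck center",
          "lesson learn", "acknowledg"]                      # Figure 4(a)
titles = ["barrier", "keck center",
          "origin nongridawar configure keck center"]        # Figure 4(b), excerpt

topic_vectors = [Counter(line.split()) for line in topics]
content_vectors = [Counter(title.split()) for title in titles]

# k topics x n content slides lexical scores, kn calculations in total.
scores = [[lexical_score(t, c) for c in content_vectors] for t in topic_vectors]
```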

Figure 4(a). Slide content from the outline/overview slide:
1. background
2. barrier
3. experi gridblast keck center
4. lesson learn
5. acknowledg
......

Figure 4(b). Slide titles from content slides:
......
4. barrier
5. keck center
6. origin nongridawar configure keck center
7. scal large compare genom
8. schemat view web portal
9. current gridawar configure
10. compon gridblast
......

[Figure 3. A typical presentation structure: a title slide, followed by an outline/overview slide, followed by a number of content slides.]


Instead of comparing to zero (line 2), a threshold may be used. Due to the limited number of terms in both topic vectors and content vectors, we found zero to be a reasonable threshold here; this is demonstrated in the experiment section. As for the threshold $T_1$, it can be varied to achieve correspondingly varying precision/recall trade-offs: higher recall but lower precision can be obtained by setting $T_1$ to a lower value.

4.2. Image matching

The purpose of image matching is to associate slide blocks with topic-level video segments. Most of the extracted key frames have borders and/or an overlaid presenter image at one corner (Figure 6). Unlike slide-level segmentation, which works on frames that are all captured with the same stationary camera, image matching with local color histogram difference cannot give satisfying results here. Image matching is therefore performed by slide edge analysis.

The first step in image matching is to align extracted key frames with converted slide images. To accomplish this alignment, we first crop the key frames and slide images (Figure 6). Since all frames are captured with the same stationary camera, the cropping factors only need to be determined once per presentation. Each slide image is then resized to the same size as the cropped key frames; bilinear interpolation is applied in this process. Next, we extract edge information from both the resized key frames and the slide images by applying a Sobel filter to both. Based on the work in [1], the difference between a filtered key frame and a filtered slide image is then computed as follows. Given a Sobel-filtered key frame $f_1$ and a Sobel-filtered slide image $s_1$, let $b_1$ be the number of black pixels in $f_1$, $d_1$ the number of black pixels in $f_1$ whose corresponding pixel in $s_1$ is not black, $b_2$ the number of black pixels in $s_1$, and $d_2$ the number of black pixels in $s_1$ whose corresponding pixel in $f_1$ is not black. The difference $\Delta$ is then defined as

$$ \Delta = \frac{d_1 + d_2}{b_1 + b_2} \qquad (2) $$

The pair with the smallest $\Delta$ is considered a matching pair. When multiple key frames extracted from adjacent video segments match the same slide image, their corresponding segments are combined. Image matching adds timing information to each slide. Based on this timing information, the set of slide blocks can be associated with topic-level presentation video segments.
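A hedged sketch of this comparison is given below, assuming OpenCV; it treats the "black pixels" of the filtered images as pixels whose Sobel gradient magnitude exceeds a threshold, and the crop box and threshold values are illustrative assumptions rather than parameters from the paper.

```python
# Edge-based matching difference (Formula (2)) between a key frame and a slide
# image; the crop box and edge threshold are illustrative assumptions.
import cv2
import numpy as np

def edge_map(image_bgr, thresh=64):
    """Sobel-filter an image and binarize the gradient magnitude."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    return cv2.magnitude(gx, gy) > thresh

def match_difference(key_frame, slide_image, crop_box):
    """Delta = (d1 + d2) / (b1 + b2) for one key frame / slide image pair."""
    x, y, w, h = crop_box                      # determined once per presentation
    f1 = edge_map(key_frame[y:y + h, x:x + w])
    s1 = edge_map(cv2.resize(slide_image, (w, h),
                             interpolation=cv2.INTER_LINEAR))  # bilinear resize
    b1, b2 = f1.sum(), s1.sum()                # edge pixels in f1 and in s1
    d1 = (f1 & ~s1).sum()                      # f1 edge pixels not matched in s1
    d2 = (s1 & ~f1).sum()                      # s1 edge pixels not matched in f1
    return (d1 + d2) / (b1 + b2 + 1e-9)

# The slide image with the smallest difference is taken as the match, e.g.:
# best_j = min(range(len(slides)), key=lambda j: match_difference(kf, slides[j], box))
```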

[Figure 6. Image matching: a key frame is cropped and Sobel-filtered; the corresponding slide image is cropped, resized and Sobel-filtered; the two filtered images are then compared.]

Figure 5. Boundary identification:

1. For each topic i (1 ≤ i ≤ k)
2.   If there is a lexical score greater than zero
3.     Set the boundary i where the first maximum lexical score occurs
4.   Else
5.     Locate the boundary i - 1 and the boundary i + 1
6.     Calculate lexical scores of adjacent content vectors within these two boundaries
7.     Set a boundary where the lexical score is greater than the threshold T1
8.   End if
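As a rough translation of Figure 5 into code, the sketch below assumes a precomputed k-by-n matrix of lexical scores (topics by content slides) and a list of lexical scores between adjacent content slides; how the previous and subsequent boundaries are looked up when a row is all zeros is our own reading of lines 5-7, and the names are illustrative.

```python
# Boundary identification sketch following Figure 5; scores[i][j] is the
# lexical score of topic i against content slide j, and adjacent_scores[s]
# the lexical score between content slides s-1 and s (both precomputed).
def identify_boundaries(scores, adjacent_scores, t1):
    k, n = len(scores), len(scores[0])
    boundaries = [None] * k
    for i in range(k):
        row = scores[i]
        if max(row) > 0:
            boundaries[i] = row.index(max(row))   # line 3: first maximum score
        else:
            # Lines 5-7: search between the nearest known boundaries.
            prev_b = next((boundaries[j] for j in range(i - 1, -1, -1)
                           if boundaries[j] is not None), 0)
            next_b = next((boundaries[j] for j in range(i + 1, k)
                           if boundaries[j] is not None), n - 1)
            for s in range(prev_b + 1, next_b + 1):
                if adjacent_scores[s] > t1:
                    boundaries[i] = s             # line 7: score above T1
                    break
    return boundaries
```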

5. EXPERIMENTAL RESULTS

In this section, we present our experimental results for topic-level segmentation and slide-level segmentation. The F-score is adopted for performance evaluation. It is defined as $F = 2 \cdot P \cdot R / (P + R)$, where $P$ is precision and $R$ is recall. In text-based segmentation with TWI and in slide-level segmentation, $P$ = (number of correctly detected segments) / (number of detected segments) and $R$ = (number of correctly detected segments) / (number of true segments). In image matching, $P$ = (number of correctly matched key frames) / (number of extracted key frames) and $R$ = (number of correctly matched key frames with combining) / (number of slide images). "With combining" here means that multiple key frames are treated as one if they are extracted from adjacent video segments and match the same slide image. The higher the F-score, the better the performance.

In slide-level segmentation, three presentation videos from the 3rd Virtual Conference on Genomics and Bioinformatics are used. We intentionally set a low threshold for the local color histogram difference; thus the experiment (Table 1) has low precision (avg. = 0.86) and high recall (avg. = 0.97). The low precision reflects extra segments (false positives), but these extra segments reduce the chance of slide changes going undetected.
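As a quick arithmetic check of this definition against the first row of Table 1: with $P = 0.73$ and $R = 1.00$, $F = 2 \cdot 0.73 \cdot 1.00 / (0.73 + 1.00) \approx 0.84$, which matches the reported value.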

Table 1. Experimental results for slide-level segmentation

No.   Video length      No. of true seg.   No. of detected seg.   No. of correctly detected seg.   P      R      F
1     11 min. 49 sec.   11                 15                     11                               0.73   1.00   0.84
2     52 min. 6 sec.    47                 50                     46                               0.92   0.98   0.94
3     55 min. 49 sec.   19                 19                     18                               0.94   0.94   0.94
Ave.                                                                                               0.86   0.97   0.91

In topic-level segmentation, we use nine conference presentations. Of the nine presentations, three are from the 3rd Virtual Conference on Genomics and Bioinformatics, one from SPIE AeroSense 2001, three from the 9th CAA Conference 2005, and the remaining two from other conferences. In spite of differences in subjects and formats, all the presentations have the same basic structure as described in Section 4. In our study, we found that most presenters have a strong tendency to clearly restate a topic before starting it, using terms that are the same as or very similar to those in the outline/overview slide. Thus topic-level segmentation achieves an average F-score of 0.97 (Table 2). In this experiment, the threshold $T_1$ (Section 4.1) is set to $m - d$, where $m$ is the mean of the lexical scores of adjacent content slides and $d$ is the corresponding standard deviation. Use of acronyms affects segmentation performance if an acronym is not properly introduced, for example, using "Support Vector Machine" only in the outline/overview slide and "SVM" only later in the content slides.
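A brief sketch of this choice of threshold is shown below; the adjacent-slide scores are made-up illustrative numbers, not data from the experiments, and the paper does not state whether the sample or the population standard deviation was used.

```python
# T1 = m - d, with m and d the mean and standard deviation of the lexical
# scores between adjacent content slides (values here are illustrative).
import statistics

adjacent_scores = [0.00, 0.41, 0.00, 0.58, 0.71, 0.33]
m = statistics.mean(adjacent_scores)
d = statistics.pstdev(adjacent_scores)   # population std; sample std is stdev()
t1 = m - d
```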

Table 2. Experimental results for topic-level segmentation

No.   No. of slides   No. of true seg.   No. of detected seg.   P      R      F
1     76              3                  3                      1.00   1.00   1.00
2     48              3                  3                      1.00   1.00   1.00
3     28              5                  5                      1.00   1.00   1.00
4     17              7                  9                      0.78   1.00   0.88
5     21              4                  4                      1.00   1.00   1.00
6     36              4                  4                      0.75   1.00   0.85
7     37              6                  6                      1.00   1.00   1.00
8     32              9                  9                      1.00   1.00   1.00
9     32              6                  6                      1.00   1.00   1.00
Ave.                                                                          0.97

In image matching, key frames are extracted from the video segments obtained from slide-level segmentation, and slide images are converted from the corresponding PowerPoint slide file. As discussed in Section 4.2, image matching is based on edge comparison; in effect, this matching method compares the shapes of image components including text, graphics, tables, and so on. Therefore, inaccurate cropping factors affect the performance of image matching (Table 3). In addition, if a slide is skipped during the presentation, then there is no matching video content and no matching key frame exists; however, the current matching method still returns the most closely matched key frame, which also affects the image matching performance. To address this problem, a threshold method should be investigated in later work.


6. CONCLUSION AND FUTURE WORK

Presentation videos play an important role in information sharing and exchange. In this paper, we investigated hierarchical segmentation of presentation videos through visual and text analysis. Specifically, two-level video segmentation is studied in our work: topic-level and slide-level. We introduced Topic Words Introduction (TWI) for text-based segmentation. Experimental results show that TWI can effectively segment a presentation into topically coherent slide blocks, with an average F-score of 0.97. Slide-level segmentation is based on local color histogram difference analysis. To map text-based segmentation back to presentation video segmentation, image matching between converted slide images and extracted key frames is performed based on image edge analysis. With our data set, the F-scores for slide-level segmentation and image matching are 0.91 and 0.94 respectively.

In this paper, we focus on presentations with the typical structure described in Section 4. In the future, we will work on presentations that do not have this structure. We envision that intelligent text analysis integrating advanced techniques in machine learning and artificial intelligence will provide a viable solution to this problem.

REFERENCES

[1] S. Mukhopadhyay and B. Smith, "Passive capture and structuring of lectures", Proceedings of the 7th ACM International Conference on Multimedia, Orlando, Florida, USA, October 1999, pp. 477-487.
[2] N. Yamamoto, J. Ogata and Y. Ariki, "Topic segmentation and retrieval systems for lecture videos based on spontaneous speech recognition", Proceedings of EUROSPEECH 2003, Geneva, Switzerland, September 1-4, 2003, pp. 961-964.
[3] D. Phung, S. Venkatesh and C. Dorai, "High level segmentation of instructional videos based on content density", Proceedings of ACM Multimedia '02, Juan-les-Pins, France, December 2002, pp. 295-298.
[4] M. Lin, et al., "Segmentation of lecture videos based on text: a method for combining multiple linguistic features", Proceedings of the 37th Hawaii International Conference on System Sciences, 2004.
[5] M. Porter, "An algorithm for suffix stripping", Program, 14(3):130-137, July 1980.
[6] M. Hearst, "TextTiling: segmenting text into multi-paragraph subtopic passages", Computational Linguistics, 23(1), 1997.

Table 3. Experimental results for image matching

No.   No. of slides   No. of key frames   No. of correct matches w/o combining   No. of correct matches w/ combining   P      R      F
1     11              15                  14                                     10                                    0.93   0.90   0.91
2     47              50                  47                                     44                                    0.94   0.93   0.93
3     19              19                  19                                     18                                    1.00   0.94   0.97
Ave.                                                                                                                                  0.94
