Discovery and Fusion of Salient Multi-modal Features towards News - PowerPoint PPT Presentation

Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation - @ TRECVID 2003 Workshop Winston Hsu 1 , Shih-Fu Chang 1 , Lyndon Kennedy 1 , Chih-wei Huang 1 , Ching-Yung Lin 2 , and Giridharan Iyengar 3 1 Dept. of Electrical Engineering, Columbia University, New York, NY 2 IBM T. J. Watson Research Center, Hawthorne, NY 3 IBM T. J. Watson Research Center, Yorktown Heights, NY 11/17/2003 digital video | multimedia lab - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Story Segmentation � Story definition (from LDC) � A N ews story is defined as a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses. � M isc. segments like commercials, reporter chitchat, station identifications, public service, long musical (>9 sec), interludes, etc… M N N N M N N M N N M N digital video | multimedia lab -2- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Challenging problems due to diverse syntax sports >> samples : visual anchors music/animation : story * Visual anchors alone account for 51% and 67% of stories only on ABC/CNN Modalities Set P R F1 Anchor Face ABC 0.67 0.67 0.67 CNN 0.80 0.38 0.51 digital video | multimedia lab -3- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Our Goal � A robust statistical framework to fuse diverse features from different modalities � An unified framework that can be adapted to different new video sources � Automatically generate customized models (parameters) for CNN and ABC channels with the same framework � An efficient mechanism for inducing dominant features for any specific domain � Allow us to handle large pools of features smoothly � Allow us to incorporate computational noisy feature detectors ::More information, " Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation ," invited talk, Jan. 18-22, San Jose, SPIE/Electronic Imaging 2004. digital video | multimedia lab -4- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Need for Multi-modal Fusion t � Issue: a story boundary at the candidate point ? k � Use the perceptual multi-modal features computed from surrounding windows to infer decisions with observation x k to estimate posterior probability q ( b|x k ) B c B p B n t k t k+1 t k-1 a anchor face? motion energy changes? a commercial starts in 15 sec. change from music to speech? significant pause occurs? just starts a speech segment? {cue phrase} i appears {cue phrase} j appears digital video | multimedia lab -5- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Our Proposed Framework – Exponential Model w/ Perceptual Binary Features training samples f i b 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 Raw features: 2 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 3 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1 0 � Face 4 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 5 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 � Motion 6 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 7 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 � Significant Pause 8 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 9 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 � Speech segment … 0 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 ∑ � Commercial λ ⋅ f ( , ) x b 1 i i = ∈ q b x ( | ) e , f x b b ( , ), {0,1} i λ � Text segmentation score i Z ( ) x λ � … * Use supervised learning to find optimal exponential λ i weight for binary feature i digital video | multimedia lab -6- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Parameter Estimation = T {( x b , )} q b x ( | ) Estimate from training data by minimizing � λ k k Kullback-Leibler divergence, defined as � q � p p b x ( | ) ∑∑ = � � D p q ( || ) p b x ( , )log λ � D p q ( || ) q ( | ) b x λ x b ∑∑ = − + � � p x b ( , )log q ( | ) b x constant( ) p λ x b Also maximize the log-likelihood (with max extreme “0”) � ≡ ∑∑ � estimated model L ( q ) p x b ( , )log q b x ( | ) � λ λ p x b empirical distribution λ Iteratively find � i ∑   � p x b f x b ( , ) ( , ) 1 log ′ λ = λ + ∆ λ   i ∆ λ = x b , ∑ i i i   i � M p x q b x f x b ( ) ( | ) ( , )   λ i x b , •Because of the convexity of objective function, the iterative process is guaranteed to the global optima. •Our Matlab implementations show efficient convergence in ~30 mins when using 30 features and 11,705 training samples digital video | multimedia lab -7- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Feature Selection � Input: collection of candidate features, training samples, and the desired model size � Output: selected features and their corresponding exponential weights q α h Current model augmented with feature with weight , � α h x b ( , ) e q b x ( | ) = q ( | ) b x α , h Z ( ) x α q Select the candidate which improves current model the most, � at each iteration; { } { } = � − � * h arg max sup D p q ( || ) D p q ( || ) reduce divergence α , h ∈ α h C { } { } = − arg max sup L ( q ) L ( ) q � α � p , h p ∈ α h C increase log-likelihood digital video | multimedia lab -8- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Examples of Raw Features � Features exist at different time Modality Raw Features Time Index Value scales, asynchronous points motion segment real � Need an unified wrapper to convert shot boundary point boolean Video them to consistent representation & face segment real imitate human perceptions commercial segment boolean pause point real text seg. score pitch jump point real Audio music significant pause point real comm. musc./spch. disc. segment boolean spch seg./rapidity segment real sigpas ASR cue terms point boolean face Text V-OCR cue terms point boolean shot text seg. score point real combinatorial point boolean motion Misc. sports segment boolean candidate point digital video | multimedia lab -9- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Feature Wrapper { } { } r f ( ) t g q b ⋅ i ( | ) j Maximum Feature Library { } Feature Wrapper Entropy ⋅ r { } r f ( ) h λ F ( f , t , dt v B , , ) ; i w i k k k dt � delta interval: r f B � observation windows: i v dt : delta operation � binarization levels : r f � raw features: i t � candidate point: k ∆ r f i v : binarization thresholds B1 B1 B1 1 1 1 B2 B2 1 1 B3 0 digital video | multimedia lab -10- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Selected Features (from CNN) * The first 10 “A+V” features automatically discovered for the CNN channel λ interpretation no raw feature set gain An anchor face segment just starts after the boundary point >> 1 Anchor Face 0.3879 0.4771 A significant pause within the non-commercial section appears in 2 Significant pause & 0.0160 0.7471 >> surrounding observation window. non-commercial An audio pause with the duration larger than 2.0 second appears 3 Pause 0.0058 0.2434 after the boundary point. The surrounding observation window has a significant pause with 4 Significant pause 0.0024 0.7947 the pitch jump intensity larger than the normalized pitch threshold 1.0 and the pause duration larger than 0.5 second. A speech segment before the candidate point 5 Speech segment 0.0019 -0.3566 A speech segment starts in the surrounding observation window 6 Speech segment 0.0015 0.3734 A commercial starts in 15 to 20 seconds after the candidate point. 7 Commercial 0.0015 1.0782 >> A speech segment ends after the candidate point 8 Speech segment 0.0022 -0.4127 An anchor face segment occupies at least 10% of next window 9 Anchor face 0.0016 0.7251 The surrounding observation window has a pause with the 10 Pause 0.0008 0.0939 duration larger than 0.25 second. digital video | multimedia lab -11- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Significant Pause Significant Pause Uniform ε Set P R F1 P R F1 ABC 5.0 0.20 0.38 0.26 0.10 0.22 0.14 2.5 0.16 0.34 0.22 0.10 0.22 0.14 CNN 5.0 0.40 0.45 0.42 0.20 0.24 0.22 2.5 0.37 0.43 0.39 0.20 0.24 0.22 ::sigpas_seg_0202cnn digital video | multimedia lab -12- - Winston H.-M. Hsu - COL UMBIA UNIVE RSIT Y IN THE CITY OF NEW YORK

Discovery and Fusion of Salient Multi-modal Features towards News - PowerPoint PPT Presentation

Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation - @ TRECVID 2003 Workshop Winston Hsu 1 , Shih-Fu Chang 1 , Lyndon Kennedy 1 , Chih-wei Huang 1 , Ching-Yung Lin 2 , and Giridharan Iyengar 3 1 Dept. of

Probabilistic and Model Fusion: . . . Model Fusion: . . . Interval Uncertainty Model Fusion:

The Expressive Power of Backround Modal Dependence Logic Modal logic Team semantics Modal

Fusion Nothing But The Truth Fusion Orbotech s True Commitment To The PCB Industry Overall

High resolution image fusion via fusion frames Shidong Li San Francisco State University

A Review on Salient Object Detection Feng Lin Salient Object Detection Target Detect and

Multi-modal Face Recognition Hu Han hanhu@ict.ac.cn http: / / vipl.ict.ac.cn/ members/ hhan

Update on the Fusion Update on the Fusion Energy Sciences Program Energy Sciences Program Ed

October 2016 October 2016 WHAT IS FUSION? TWO FUSION TYPES NEUTRONIC ANEUTRONIC TWO

Modeling with MOSEK Fusion Ulf Worse INFORMS Minneapolis October 5 2013 http://www.mosek.com

Modal logic Benzm uller/Rojas, 2014 Artificial Intelligence 2 What is Modal Logic?

W HAT IS EHD? Introduction EHD without cross-flow Modal Dielectric fluid Non-modal EHD with

Why is modal logic decidable Petros Potikas NTUA 9/5/2017 Petros Potikas (NTUA) Modal logic

UNESCO Discovery Centre reference image of education space UNESCO Discovery Centre Discovery

Reunert Year-end Results 12 months ended 30 September 2009 1 Salient features Strong balance

COMPANY PROFILE WATER FEATURES 1 WATER FEATURES 2 WATER FEATURES 3 WATER FEATURES 4 WATER

Fusing Generic Objectness and Visual Saliency for Salient Object Detection Yasin KAVAK

Scaling and Time Warping in Time Series Querying Ada Wai-chee Fu 1 Eamonn Keogh 2 Leo Yung Hang

Programming Up-to-Congruence, Again Stephanie Weirich University of Pennsylvania August 12, 2014

Well-Posedness and Adiabatic Limit for Quantum Zakharov System Yung-Fu Fang (joint work with

Detecting Overlapping and Correlated Communities without Pure Nodes: Identifiability and

AIRS Level 3 and Visualization AIRS Science Team Meeting May 3-6, 2005 Stephanie Granger

AIRS Project Status AIRS Science Team Meeting October 14, 2008 Tom Pagano California Institute

F I R S T S C I E N C E W I T H Seen through the lens of: why you should care about Lyman-alpha

Journal of Facilities Management Positioning the facilities manager s role throughout the