 
Visual Language Perception from Videos
Mohit Gupta
Advisor: Amitabha Mukerjee
Introduction and Motivation
- Humans process and store what they perceive in a highly abstracted, condensed format
  - E.g. …
- Computers, on the other hand, are much less efficient at this
- Possibilities if computers could condense perception
  - Significant reduction in information size (lower memory requirement)
  - "Show me who the villain is in this movie and when he enters" would become a valid question for a computer
- Novel problem; to the author's knowledge, no similar work has been done
Methodology
- Scene Segmentation
  - Using the change-in-histogram method
  - Heuristic for the start or end of a speech-silence boundary
  - Strong heuristic for a change in speaker
- Sound Segmentation
  - Classifying voice, silence and miscellaneous (music, audience laughter, etc.)
  - Thresholding signal energy and zero-crossing rate; pitch detection by the YIN algorithm
  - Diarization of voices (separating the voices of different speakers)
  - Voice features such as MFCCs are the most significant for speaker recognition
- Associating faces with speech
  - Detect faces in frames containing speech using Haar-based features
  - Tag each face with a speaker's speech stream using a majority-first approach
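The change-in-histogram idea for scene segmentation can be sketched as below. This is a minimal illustration, not the implementation used in this work: it assumes each frame is already available as a flat sequence of 0-255 grey values, and the names `gray_histogram` and `scene_boundaries`, the bin count, and the threshold are all hypothetical choices.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalised intensity histogram of a flat sequence of 0-255 grey values."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def scene_boundaries(frames: Sequence[Sequence[int]],
                     bins: int = 16,
                     threshold: float = 0.5) -> List[int]:
    """Return indices i where frame i differs strongly from frame i - 1.

    The difference is the L1 distance between normalised histograms,
    which lies in [0, 2]. The threshold is an assumed value to be tuned.
    """
    boundaries = []
    previous = None
    for index, frame in enumerate(frames):
        current = gray_histogram(frame, bins)
        if previous is not None:
            distance = sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                boundaries.append(index)
        previous = current
    return boundaries


# Toy frames: two dark frames, two bright frames, one dark frame.
dark = [10] * 64
bright = [240] * 64
print(scene_boundaries([dark, dark, bright, bright, dark]))  # [2, 4]
```

In practice the frames would come from a video decoder and the histogram would likely be computed per colour channel; the thresholding logic stays the same.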
Methodology
- Sound Segmentation
  - Classifying voice, silence and miscellaneous (music, audience laughter, etc.)
  - Thresholding signal energy and zero-crossing rate; pitch detection
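A rough sketch of how thresholding energy and zero-crossing rate could label an audio frame follows. The decision rule and both thresholds (`energy_floor`, `zcr_ceiling`) are assumptions made for illustration only; pitch detection, which the slides also use, is left out here.

```python
import math
from typing import List, Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame: Sequence[float],
                   energy_floor: float = 1e-3,
                   zcr_ceiling: float = 0.15) -> str:
    """Crude three-way label; both thresholds are assumed, untuned values."""
    if short_time_energy(frame) < energy_floor:
        return "silence"
    if zero_crossing_rate(frame) <= zcr_ceiling:
        return "voice"  # low ZCR is typical of voiced speech
    return "misc"       # energetic but noisy or high-frequency content


def sine(freq_hz: float, n: int = 400, rate: int = 8000, amp: float = 0.5) -> List[float]:
    """Synthetic test tone standing in for real audio."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


print(classify_frame([0.0] * 400))  # silence
print(classify_frame(sine(200)))    # voice
print(classify_frame(sine(3000)))   # misc
```

A single low-frequency tone is only a stand-in for voiced speech; real recordings would need thresholds estimated from labelled data and smoothing across neighbouring frames.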
Methodology
- Associating faces with speech
  - Detect faces in frames containing speech
  - Use the acquired speech boundaries and detect faces in each segment
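One possible reading of tagging faces to speakers by majority is sketched below. It assumes face detection and diarization have already produced, for each speech segment, a speaker label and a list of face identities seen in that segment. The function name `tag_faces`, the identifiers, and the greedy "strongest majority first" rule are assumptions, not the method confirmed by the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tag_faces(segments: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Assign one face to each speaker by co-occurrence counts.

    segments: (speaker_id, [face_id, ...]) pairs, one per speech segment.
    Pairs with the highest co-occurrence count are fixed first, and each
    face is given to at most one speaker.
    """
    votes = defaultdict(Counter)
    for speaker, faces in segments:
        votes[speaker].update(faces)

    # All (count, speaker, face) candidates, largest count first.
    candidates = sorted(
        ((count, speaker, face)
         for speaker, counter in votes.items()
         for face, count in counter.items()),
        key=lambda item: (-item[0], item[1], item[2]),
    )

    assignment: Dict[str, str] = {}
    used_faces = set()
    for _count, speaker, face in candidates:
        if speaker in assignment or face in used_faces:
            continue
        assignment[speaker] = face
        used_faces.add(face)
    return assignment


segments = [
    ("spk_A", ["face_1", "face_1", "face_2"]),
    ("spk_B", ["face_1", "face_2", "face_2"]),
    ("spk_A", ["face_1"]),
    ("spk_B", ["face_2", "face_1"]),
]
print(tag_faces(segments))  # {'spk_A': 'face_1', 'spk_B': 'face_2'}
```

The greedy rule tolerates frames where the listener is on screen instead of the speaker, as long as the true speaker's face appears most often across that speaker's segments.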
Subtitles and speech
- The pitch plot also separates words, with high recall but low precision
- Subtitle alignment in the small-error domain is achieved by maximizing the number of common pitch-subtitle boundaries
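Maximizing the common pitch-subtitle boundaries can be illustrated with a small offset search. This sketch assumes the subtitles are off by a single constant shift, that boundaries are given as integer milliseconds, and that a boundary counts as common when it lies within a tolerance of a pitch boundary. The names `count_matches` and `best_offset` and all parameter values are hypothetical.

```python
from typing import Sequence


def count_matches(pitch_ms: Sequence[int], subtitle_ms: Sequence[int],
                  offset_ms: int, tolerance_ms: int) -> int:
    """Number of shifted subtitle boundaries that land near a pitch boundary."""
    return sum(
        1 for s in subtitle_ms
        if any(abs(s + offset_ms - p) <= tolerance_ms for p in pitch_ms)
    )


def best_offset(pitch_ms: Sequence[int], subtitle_ms: Sequence[int],
                max_shift_ms: int = 1000, step_ms: int = 50,
                tolerance_ms: int = 40) -> int:
    """Search small shifts and keep the one with the most common boundaries.

    Offsets are tried in order of increasing magnitude, so ties are
    resolved in favour of the smallest correction.
    """
    offsets = sorted(range(-max_shift_ms, max_shift_ms + 1, step_ms), key=abs)
    best_shift, best_score = 0, -1
    for offset in offsets:
        score = count_matches(pitch_ms, subtitle_ms, offset, tolerance_ms)
        if score > best_score:
            best_shift, best_score = offset, score
    return best_shift


# Pitch boundaries are plentiful (high recall, low precision);
# the subtitle boundaries are assumed to be 300 ms early.
pitch = [1000, 1400, 2500, 3100, 4000, 5200, 6000]
subtitles = [700, 2200, 3700, 5700]
print(best_offset(pitch, subtitles))  # 300
```

Because the pitch boundaries over-segment the audio, counting matches rather than requiring a one-to-one pairing keeps the spurious boundaries from hurting the search.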
Applications
- Surround Sound Effects
  - Using the knowledge of who is speaking in a frame and the location of the speaker's face
  - Background sounds separated from speech and attenuated to emphasize the vocals
- Information abstraction and retrieval
  - Efficiency in memory usage
  - Model voice, face and scene; use text to produce speech and video on the fly
  - Asking the computer to seek the video to the instant the villain is first seen