
GENERATING COMMENTS SYNCHRONIZED WITH MUSICAL AUDIO SIGNALS BY A JOINT PROBABILISTIC MODEL OF ACOUSTIC AND TEXTUAL FEATURES
Kazuyoshi Yoshii and Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST)


1. MUSIC COMMENTATOR: GENERATING COMMENTS SYNCHRONIZED WITH MUSICAL AUDIO SIGNALS BY A JOINT PROBABILISTIC MODEL OF ACOUSTIC AND TEXTUAL FEATURES
   Kazuyoshi Yoshii and Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST)

2. BACKGROUND
   - Snapshot from Nico Nico Douga (an influential video-sharing service in Japan)
     - Short comments associated with temporal positions (e.g., "Good arrangement", "Pretty cool!", "I am impressed")
     - Free-form tags given to the entire clip
   - Users can feel as if they enjoy the clip together, although they gave their comments at different times in the real world
   - Importance of expressing music in language for human communication
     - Language is an understandable common medium

3. EMERGING PHENOMENA IN JAPAN
   - Commenting itself becomes entertainment
     - Temporal comments and "barrage"
     - Sophisticated ASCII art
   - Commenting is an advanced form of collaboration
     - Users add effects to the video by giving comments
     - Users collaborate to create something
   - Commenting is a casual way of exhibiting creativity
   - Temporal comments strengthen a sense of togetherness
     - Users can feel as if they enjoy all together at the same time
     - Called pseudo-synchronized communication

4. MOTIVATION
   - Facilitate human communication by developing a computer that can express music in language
     - Mediated by human-machine interaction
   - Hypothesis: linguistic expression is based on learning
     - Linguistic expressions of various musical properties are learned through communication using language
     - Humans acquire a sense of what temporal events could be annotated in music clips
   [Figure: unseen musical audio signal → linguistic expression (giving comments)]

5. APPROACH
   - Propose a computational model of commenting that associates music and language
     - Give comments based on machine learning techniques
   - Train a model from many musical audio signals that have been given comments by many users
   - Generate suitable comments at appropriate temporal positions of an unseen audio signal
   [Figure: unseen musical audio signal → linguistic expression (giving comments)]

6. KEY FEATURES
   - Deal with temporally allocated comments
     - Conventional studies: provide tags for an entire clip
       - Impression-word tags
       - Genre tags
     - Our study: give comments to appropriate temporal positions in a target music clip
   - Generate comments as sentences
     - Conventional studies: only select words in a vocabulary
       - Word orders are not taken into account
       - Slots of template sentences are filled with words
     - Our study: concatenate an appropriate number of words in an appropriate order
   - Example: Conv. "This is a song and has an energetic rock mood." vs. Ours "I am impressed with the cool playing!"

7. APPLICATIONS TO ENTERTAINMENT
   - Semantic clustering & segmentation of music
     - The performance could be improved by using features of both music and comments
     - Users can selectively enjoy their favorite segments (e.g., "Quiet intro", "Nice guitar", "Interlude", "Beautiful voice")
   - Segment-based retrieval & recommendation
     - Retrieval & recommendation results could be explained by using language
   - Linguistic interfaces for manipulating music
     - Music could be manipulated by using language

8. PROBLEM STATEMENT
   - Learning phase
     - Input: audio signals of music clips; attached user comments
     - Output: commenting model
   - Commenting phase
     - Input: audio signal of a target clip; attached user comments; commenting model
     - Output: comments that have suitable lengths and contents and are allocated at appropriate temporal positions

9. FEATURE EXTRACTION
   - Extract features from each frame
   - Acoustic features (frame length: 256 ms)
     - Timbre feature: 28 dim.
       - Mel-frequency cepstrum coefficients (MFCCs): 13 dim.
       - Energy: 1 dim.
       - Dynamic property: 13+1 dim.
   - Textual features (frame length: 3000 ms)
     - Comment content: 2000 dim. (average bag-of-words per comment)
     - Comment density: 1 dim. (number of user comments)
     - Comment length: 1 dim. (average number of words per comment)
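To make the acoustic side of this feature set concrete, here is a minimal sketch of a 28-dimensional frame-level feature (13 MFCCs plus energy, and their deltas) using librosa. The library, the sampling rate, and the exact analysis parameters are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import librosa

def acoustic_features(path, frame_ms=256):
    # One feature vector every 256 ms, matching the acoustic frame length on slide 9.
    y, sr = librosa.load(path, sr=16000)                                # sampling rate is an assumption
    hop = int(sr * frame_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)  # timbre: 13 dim.
    energy = librosa.feature.rms(y=y, hop_length=hop)                   # energy: 1 dim.
    static = np.vstack([mfcc, energy])                                  # 14 dim. per frame
    delta = librosa.feature.delta(static)                               # dynamic property: 13+1 dim.
    return np.vstack([static, delta]).T                                 # shape: (frames, 28)
```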

10. BAG-OF-WORDS FEATURE
   Example comment: "He played the guitar (^_^)"
   1. Morphological analysis: identify the part-of-speech and basic form of each word ("He + played + the + guitar + (^_^)")
   2. Screening: remove auxiliary words (particles, auxiliary verbs, conjunctions, interjections, symbols / ASCII art) → "He played guitar"
   3. Assimilation: do not distinguish words that have the same part-of-speech and basic form (e.g., "take" = "took" = "taken") → "he play guitar"
   4. Counting: count the number of each word → he:1, play:1, guitar:1
   - The dimension of the bag-of-words feature is equal to the vocabulary size
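A toy sketch of the four-step pipeline follows. The tokenizer, stop-list, and lemma table are illustrative stand-ins, not the authors' tools: the paper targets Japanese comments, which would be processed with a proper morphological analyzer.

```python
from collections import Counter
import re

AUX = {"the", "a", "an", "is", "are", "was", "were"}         # assumed stop-list: particles / auxiliaries
LEMMA = {"played": "play", "took": "take", "taken": "take"}  # assumed basic-form table ("take" = "took" = "taken")

def bag_of_words(comment, vocabulary):
    # 1. crude stand-in for morphological analysis: tokenize; symbols / ASCII art fall out
    tokens = re.findall(r"[A-Za-z]+", comment.lower())
    # 2. screening: remove auxiliary words
    tokens = [w for w in tokens if w not in AUX]
    # 3. assimilation: collapse words with the same basic form
    tokens = [LEMMA.get(w, w) for w in tokens]
    # 4. counting: one dimension per vocabulary entry
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# "He played the guitar (^_^)"  ->  he:1, play:1, guitar:1
print(bag_of_words("He played the guitar (^_^)", ["he", "play", "guitar", "sing"]))  # [1, 1, 1, 0]
```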

11. COMMENTING MODEL
   - Three requirements
     - All features can be simultaneously modeled
     - Temporal sequences of features can be modeled
     - All features share a common dynamical behavior
   - → Extend the Hidden Markov Model (HMM)
     - Acoustic features: MFCCs and energy modeled with a Gaussian mixture model (GMM); dynamic property with a Gaussian
     - Textual features: bag-of-words feature modeled with a multinomial; comment density and comment length with Gaussians
   [Figure: state sequence $z_{t-1}^{(n)}, z_t^{(n)}, z_{t+1}^{(n)}$ in a music clip, each state emitting acoustic features $a_t^{(n)}$, bag-of-words $w_t^{(n)}$, comment density $d_t^{(n)}$, and comment length $l_t^{(n)}$]
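The "simultaneously modeled" requirement amounts to each hidden state emitting every feature stream independently, so the per-frame log-likelihoods simply add. A minimal sketch of that emission term, with assumed parameter names and a single Gaussian where the slide uses a GMM for the acoustic features:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def log_emission(a_t, w_t, d_t, l_t, phi_k):
    """Log-likelihood of one frame's features under state k (parameter names are assumed)."""
    log_p  = multivariate_normal.logpdf(a_t, phi_k["a_mean"], phi_k["a_cov"])  # acoustic (single Gaussian here; the slide uses a GMM)
    log_p += float(np.dot(w_t, np.log(phi_k["w_prob"])))                       # bag-of-words, multinomial (up to a constant)
    log_p += norm.logpdf(d_t, phi_k["d_mean"], phi_k["d_std"])                 # comment density
    log_p += norm.logpdf(l_t, phi_k["l_mean"], phi_k["l_std"])                 # comment length
    return log_p
```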

12. MUSIC COMMENTATOR
   - Comment generation based on machine learning
     - Consistent with the maximum likelihood (ML) principle
   - Learning phase
     - Input: audio signals of music clips and user comments associated with temporal positions
     - Feature extraction → joint probabilistic model of acoustic and textual features
     - Morphological analysis → general language model (uni-gram, bi-gram, tri-gram)
   - Commenting phase (target music clip: audio signal and existing user comments)
     - ① Outlining: determine contents & positions (e.g., "beautiful" and "this" are likely to jointly occur; "cool" and "voice" are likely to occur)
     - ② Assembling
     - ③ Generate sentences (e.g., "This is a beautiful performance", "Cool voice")

13. LEARNING PHASE
   - ML estimation of the HMM parameters (K = 200 states)
   - Three kinds of parameters
     - Initial-state probability $\pi = \{\pi_1, \ldots, \pi_K\}$
     - Transition probability $\{A_{jk} \mid 1 \le j, k \le K\}$
     - Output probability $\phi = \{\phi_1, \ldots, \phi_K\}$
   - E-step: calculate the posterior probabilities of the latent states
   - M-step: independently update the output probabilities of the four feature streams (timbre, content, density, length), $o_t = \{a_t, w_t, d_t, l_t\}$
   - Complete likelihood:
     $p(O, Z \mid \theta) = p(z_1 \mid \pi) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(o_t \mid z_t)$
   - Objective (Q function):
     $Q(\theta; \theta_{\mathrm{old}}) = \sum_{Z} p(Z \mid O, \theta_{\mathrm{old}}) \log p(O, Z \mid \theta)$
     $= \sum_{k=1}^{K} \gamma(z_{1,k}) \log \pi_k + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{t-1,j}, z_{t,k}) \log A_{jk} + \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma(z_{t,k}) \log p(o_t \mid \phi_k)$
     where $\log p(o_t \mid \phi_k) = \log p(a_t \mid \phi_{a,k}) + \log p(w_t \mid \phi_{w,k}) + \log p(d_t \mid \phi_{d,k}) + \log p(l_t \mid \phi_{l,k})$
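For orientation, a schematic sketch of the M-step implied by the Q function, assuming the posteriors $\gamma(z_{t,k})$ and $\xi(z_{t-1,j}, z_{t,k})$ have already been computed by the forward-backward algorithm. The array names and the single-Gaussian acoustic update are illustrative, not the authors' code.

```python
import numpy as np

def m_step(gamma, xi, acoustic):
    """gamma: (T, K) state posteriors; xi: (T-1, K, K) pairwise posteriors; acoustic: (T, D) features."""
    pi = gamma[0] / gamma[0].sum()                      # initial-state probability pi_k
    A = xi.sum(axis=0)                                  # expected transition counts over t = 2..T
    A = A / A.sum(axis=1, keepdims=True)                # normalize each row to get A_jk
    # Output probabilities are updated independently per feature stream; e.g. the
    # mean of a per-state Gaussian for the acoustic stream (other streams analogous):
    weights = gamma.sum(axis=0)                         # (K,)
    a_means = (gamma.T @ acoustic) / weights[:, None]   # (K, D) posterior-weighted means
    return pi, A, a_means
```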

14. COMMENTING PHASE
   - ML estimation of comment sentences
   - Assume a generative model of word sequences:
     $\{\hat{c}, \hat{l}\} = \arg\max_{\{c, l\}} p(c, l) = \arg\max_{\{c, l\}} p(c \mid l)\, p(l)$
     - $p(l)$: probability that the length is $l$
     - $p(c \mid l)$: probability that the word sequence is $c$ when the length is $l$
     $p(c \mid l) = \Bigl( p(w_1 \mid \mathrm{SilB}) \prod_{i=2}^{l} p(w_i \mid w_{i-2}, w_{i-1}) \; p(\mathrm{SilE} \mid w_{l-1}, w_l) \Bigr)^{1/l}$
   - Computed by the Viterbi algorithm using bi- and tri-grams
   [Figure: word lattice between the sentence-boundary symbols SilB and SilE; ○ = word]
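A small sketch of the length-normalized sentence score above; `bigram(w, h)` and `trigram(w, h2, h1)` are assumed callables returning the corresponding n-gram probabilities, and SilB / SilE are the sentence-boundary symbols from the slide.

```python
import math

def sentence_log_score(words, bigram, trigram):
    """Per-word (1/l-normalized) log-probability of a candidate word sequence."""
    l = len(words)
    logp = math.log(bigram(words[0], "SilB"))                  # p(w_1 | SilB)
    for i in range(1, l):
        h2 = words[i - 2] if i >= 2 else "SilB"
        logp += math.log(trigram(words[i], h2, words[i - 1]))  # p(w_i | w_{i-2}, w_{i-1})
    h2 = words[l - 2] if l >= 2 else "SilB"
    logp += math.log(trigram("SilE", h2, words[l - 1]))        # p(SilE | w_{l-1}, w_l)
    return logp / l                                            # the 1/l exponent as a geometric mean
```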

15. OUTLINING STAGE
   - Determine the contents and positions of comments
   - Input acoustic and textual features
     - Input only acoustic features if there are no existing user comments in the target clip
   - Estimate an ML state sequence
     - Use the Viterbi algorithm
   - Calculate ML textual features at each frame
     - Problem: the bag-of-words feature alone cannot generate sentences!
   [Figure: same graphical model as slide 11 ($z_{t-1}^{(n)}, z_t^{(n)}, z_{t+1}^{(n)}$ emitting $a_t^{(n)}, w_t^{(n)}, d_t^{(n)}, l_t^{(n)}$), applied to the target clip]
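A minimal log-domain Viterbi sketch for this stage: given the initial and transition probabilities and per-frame, per-state emission log-likelihoods (acoustic-only when the target clip has no user comments), it returns the ML state sequence from which the textual features are then read off. Array shapes and names are assumptions.

```python
import numpy as np

def viterbi(log_pi, log_A, log_emit):
    """log_pi: (K,), log_A: (K, K), log_emit: (T, K) -> ML state sequence of length T."""
    T, K = log_emit.shape
    delta = log_pi + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[j, k]: best path ending in j, then j -> k
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # backtrack from the last frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```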

16. PROBLEMS AND SOLUTIONS
   - Problem: no probabilities of all the words required for composing sentences
     - Bag-of-words feature = a reduced uni-gram
     - Auxiliary words are removed (e.g., "a", "and", "was")
     - Verb conjugations are not taken into account (e.g., "performance be")
   - Problem: no probabilities of word concatenations
     - Bi- and tri-grams are not taken into account
     - Which is more suitable: "This is a good performance" or "This performance is good"?
   - Solution: use general bi- and tri-grams learned from all user comments
     - All words contained in sentences are covered
