
GENERATING COMMENTS SYNCHRONIZED WITH MUSICAL AUDIO SIGNALS BY A JOINT PROBABILISTIC MODEL OF ACOUSTIC AND TEXTUAL FEATURES
Kazuyoshi Yoshii, Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST)
MUSICCOMMENTATOR

BACKGROUND
• Importance of expressing music in language
  • Language is an understandable common medium for human communication
• Snapshot from Nico Nico Douga (an influential video-sharing service in Japan)
  • Comments: short comments associated with temporal positions within the clip
  • Free-form tags given to the entire clip
  • Users can feel as if they enjoy together, although they gave comments at different times in the real world
[Figure: comments scrolling along the time axis, e.g. "Pretty cool!", "I am impressed", "Good arrangement"]

EMERGING PHENOMENON IN JAPAN
• Commenting itself becomes entertainment
  • Commenting is an advanced form of collaboration
  • Users add effects to the video by giving comments
  • Commenting is a casual way of exhibiting creativity
• Temporal comments strengthen a sense of togetherness
  • Users can feel as if they enjoy all together and collaborate to create something at the same time
  • Called pseudo-synchronized communication
[Figures: temporal comments and barrage; sophisticated ASCII art]

MOTIVATION
• Facilitate human communication by developing a computer that can express music in language
  • Mediated by human-machine interaction
• Hypothesis: Linguistic expression is based on learning
  • Linguistic expressions of various musical properties are learned through communication using language
  • Humans acquire a sense of what temporal events could be annotated in music clips
[Figure: unseen musical audio signal → linguistic expression (giving comments)]

APPROACH
• Propose a computational model of commenting that associates music and language
  • Give comments based on machine learning techniques
  • Train a model from many musical audio signals that have been given comments by many users
  • Generate suitable comments at appropriate temporal positions of an unseen audio signal
[Figure: unseen musical audio signal → linguistic expression (giving comments)]

KEY FEATURES
• Deal with temporally allocated comments
  • Conventional studies: Provide tags for an entire clip
    • Impression-word tags
    • Genre tags
  • Our study: Give comments to appropriate temporal positions in a target music clip
• Generate comments as sentences
  • Conventional studies: Only select words in a vocabulary
    • Word orders are not taken into account
    • Slots of template sentences are filled with words
  • Our study: Concatenate an appropriate number of words in an appropriate order
[Figure: conventional output is a template sentence ("This is a ... song and has a ... mood.") whose slots are filled with tags such as "energetic" and "rock"; ours is a free sentence such as "I am impressed with the cool playing!"]

APPLICATIONS TO ENTERTAINMENT
• Semantic clustering & segmentation of music
  • The performance could be improved by using features of both music and comments
  • Users can selectively enjoy their favorite segments
• Linguistic interfaces for manipulating music
  • Segment-based retrieval & recommendation could be manipulated by using language
  • Retrieval & recommendation results could be explained by using language
[Figure: a clip segmented and labeled "Quiet intro", "Beautiful voice", "Interlude", "Nice guitar"]

PROBLEM STATEMENT
• Learning phase
  • Input
    • Audio signals of music clips
    • Attached user comments
  • Output
    • Commenting model
• Commenting phase
  • Input
    • Audio signal of a target clip
    • Attached user comments
    • Commenting model
  • Output
    • Comments that have suitable lengths and contents and are allocated at appropriate temporal positions

FEATURE EXTRACTION
• Extract features from each frame
• Acoustic features
  • Timbre feature: 28 dim.
    • Mel-frequency cepstrum coefficients (MFCCs): 13 dim.
    • Energy: 1 dim.
    • Dynamic property: 13+1 dim.
• Textual features
  • Comment content: 2000 dim.
    • Average bag-of-words per comment
  • Comment density: 1 dim.
    • Number of user comments
  • Comment length: 1 dim.
    • Average number of words per comment
[Figure: acoustic features on a 256 ms time grid; comment features on a 3000 ms time grid]
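
The three textual features can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 3000 ms label in the figure is a non-overlapping frame length, that comments arrive as (time in ms, list of words) pairs, and that the vocabulary is given in advance. The function name `textual_features` and the dictionary keys are hypothetical.

```python
from collections import Counter

def textual_features(comments, vocab, clip_ms, frame_ms=3000):
    """Per-frame comment content, density, and length (illustrative sketch).

    comments: list of (time_ms, words) pairs; vocab: list of known words.
    Assumes non-overlapping frames of frame_ms milliseconds.
    """
    index = {word: i for i, word in enumerate(vocab)}
    n_frames = (clip_ms + frame_ms - 1) // frame_ms
    frames = [[] for _ in range(n_frames)]
    for time_ms, words in comments:
        if 0 <= time_ms < clip_ms:
            frames[time_ms // frame_ms].append(words)

    features = []
    for group in frames:
        density = len(group)                    # number of user comments
        content = [0.0] * len(vocab)            # average bag-of-words per comment
        length = 0.0                            # average number of words per comment
        if density:
            for words in group:
                for word, count in Counter(words).items():
                    if word in index:
                        content[index[word]] += count / density
            length = sum(len(words) for words in group) / density
        features.append({"content": content, "density": density, "length": length})
    return features
```

With the slide's dimensions, `vocab` would hold 2000 words, so each frame yields a 2000 + 1 + 1 dimensional textual observation.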

BAG-OF-WORDS FEATURE
1. Morphological analysis
  • Identify part-of-speech and basic form
2. Screening: Remove auxiliary words
  • Symbols / ASCII art
  • Conjunctions, interjections, particles, auxiliary verbs
3. Assimilation: Assimilate same-content words
  • Do not distinguish words that have the same part-of-speech and basic form
  • Example: "take" = "took" = "taken"
4. Counting: Count number of each word
  • The dimension of bag-of-words features is equal to vocabulary size
Example: "He played the guitar (^_^)"
  1. Morph. analysis: He + played + the + guitar + (^_^)
  2. Screening: He, played, guitar
  3. Assimilation: he, play, guitar
  4. Counting: he:1 play:1 guitar:1
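
The four steps can be sketched on the slide's own example. A real system would call a morphological analyzer; here a hand-written toy lexicon stands in for it, so `LEXICON`, `REMOVED_POS`, and `bag_of_words` are hypothetical names, and tagging "the" as a removable article is an assumption made only to reproduce the slide's screening result.

```python
from collections import Counter

# Hypothetical toy lexicon: surface form -> (part of speech, basic form).
LEXICON = {
    "He": ("pronoun", "he"),
    "played": ("verb", "play"),
    "the": ("article", "the"),
    "guitar": ("noun", "guitar"),
    "(^_^)": ("symbol", "(^_^)"),
    "take": ("verb", "take"),
    "took": ("verb", "take"),
    "taken": ("verb", "take"),
}

# Word classes dropped in the screening step. "article" is an assumption:
# the slide's English example drops "the", so it is treated like a particle.
REMOVED_POS = {"symbol", "conjunction", "interjection",
               "particle", "auxiliary verb", "article"}

def bag_of_words(comment):
    # 1. Morphological analysis (toy: whitespace split plus lexicon lookup;
    #    unknown tokens are assumed to be nouns in lower case).
    analysed = [LEXICON.get(tok, ("noun", tok.lower())) for tok in comment.split()]
    # 2. Screening: remove auxiliary words.
    kept = [(pos, base) for pos, base in analysed if pos not in REMOVED_POS]
    # 3. Assimilation and 4. Counting: words sharing part of speech and
    #    basic form fall on the same key.
    return Counter(kept)

counts = bag_of_words("He played the guitar (^_^)")
print({base: n for (pos, base), n in counts.items()})
```

Mapping each (part of speech, basic form) key to a fixed index would then give a vector whose dimension equals the vocabulary size.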

COMMENTING MODEL
• Three requirements
  • All features can be simultaneously modeled
  • Temporal sequences of features can be modeled
  • All features share a common dynamical behavior
→ Extend Hidden Markov Model (HMM)
[Figure: state sequence in a music clip, $z_{t-1}^{(n)} \to z_t^{(n)} \to z_{t+1}^{(n)}$; each state $z_t^{(n)}$ emits
  • acoustic features $a_t^{(n)}$ (MFCCs and energy, dynamic property): Gaussian Mixture Model (GMM)
  • textual features:
    • bag-of-words feature $w_t^{(n)}$: Multinomial
    • comment density $d_t^{(n)}$: Gaussian
    • comment length $l_t^{(n)}$: Gaussian]
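
A minimal sketch of how one state could score a joint observation, assuming the four feature streams are conditionally independent given the state and the GMM has diagonal covariances (both are assumptions of this sketch). The parameter layout in `state` and all function names are hypothetical, and the multinomial coefficient is omitted because it does not depend on the state.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(a, weights, means, variances):
    """Log density of a diagonal-covariance Gaussian mixture (log-sum-exp)."""
    logs = [
        math.log(w) + sum(log_gaussian(x, m, v) for x, m, v in zip(a, mu, var))
        for w, mu, var in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def log_multinomial(w, probs):
    """Multinomial log-likelihood of (possibly averaged) word counts,
    without the state-independent multinomial coefficient."""
    return sum(c * math.log(p) for c, p in zip(w, probs) if c > 0)

def log_emission(obs, state):
    """log p(o_t | state) for o_t = (a_t, w_t, d_t, l_t)."""
    a, w, d, l = obs
    return (log_gmm(a, *state["gmm"])
            + log_multinomial(w, state["word_probs"])
            + log_gaussian(d, *state["density"])
            + log_gaussian(l, *state["length"]))
```

Because the terms add in the log domain, a missing stream (for example, no existing comments) can be handled by leaving its term out.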

MUSICCOMMENTATOR
• Comment generation based on machine learning
  • Consistent with a maximum likelihood (ML) principle
• Learning phase
  • Input: music clips (audio signals, user comments associated with temporal positions), e.g. "Cool performance", "She has a beautiful voice"
  • Feature extraction → joint probabilistic model of acoustic and textual features
  • Morphological analysis → general language model (uni-gram, bi-gram, tri-gram)
• Commenting phase
  • Input: target music clip (audio signal, existing user comments)
  • Outlining: Determine contents & positions, e.g. "Beautiful" and "this", or "cool" and "voice", are likely to jointly occur
  • Assembling: Generate sentences, e.g. "This is a beautiful performance", "Cool voice"

LEARNING PHASE
• ML estimation of HMM parameters
  • Three kinds of parameters (number of states K = 200)
    • Initial-state probability: $\{\pi_1, \ldots, \pi_K\}$
    • Transition probability: $\{A_{jk} \mid 1 \le j, k \le K\}$
    • Output probability: $\{\phi_1, \ldots, \phi_K\}$
  • E-step: Calculate posterior probabilities of latent states
  • M-step: Independently update output probabilities

Observation at frame t (timbre, content, density, length): $o_t = \{a_t, w_t, d_t, l_t\}$

Complete likelihood:
$$p(O, Z \mid \theta) = p(z_1 \mid \pi) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(o_t \mid z_t)$$

Objective (Q function), where $\gamma$ and $\xi$ are posteriors:
$$Q(\theta; \theta_{old}) = \sum_{Z} p(Z \mid O, \theta_{old}) \log p(O, Z \mid \theta)$$
$$= \sum_{k=1}^{K} \gamma(z_{1,k}) \log \pi_k + \sum_{t=2}^{T} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{t-1,j}, z_{t,k}) \log A_{jk} + \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma(z_{t,k}) \log p(o_t \mid \phi_k)$$

Output term:
$$\log p(o_t \mid \phi_k) = \log p(a_t \mid \phi_{a,k}) + \log p(w_t \mid \phi_{w,k}) + \log p(d_t \mid \phi_{d,k}) + \log p(l_t \mid \phi_{l,k})$$
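
The E-step posteriors $\gamma$ and $\xi$ in the Q function can be computed with the standard scaled forward-backward recursion. The sketch below is a generic textbook version rather than the authors' code. It assumes the per-frame output probabilities `B[t][k]` = $p(o_t \mid \phi_k)$ have already been evaluated, and the name `e_step` is hypothetical.

```python
def e_step(pi, A, B):
    """Scaled forward-backward pass for an HMM (illustrative sketch).

    pi[k]: initial-state probability, A[j][k]: transition probability,
    B[t][k]: output probability p(o_t | state k).
    Returns gamma[t][k] and xi[t-1][j][k] for t = 1..T-1 (0-based frames).
    """
    T, K = len(B), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass with per-frame normalization to avoid underflow.
    alpha[0] = [pi[k] * B[0][k] for k in range(K)]
    scale[0] = sum(alpha[0])
    alpha[0] = [v / scale[0] for v in alpha[0]]
    for t in range(1, T):
        alpha[t] = [B[t][k] * sum(alpha[t - 1][j] * A[j][k] for j in range(K))
                    for k in range(K)]
        scale[t] = sum(alpha[t])
        alpha[t] = [v / scale[t] for v in alpha[t]]

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[j][k] * B[t + 1][k] * beta[t + 1][k] for k in range(K))
                   / scale[t + 1] for j in range(K)]

    # Posterior of each state, and of each pair of consecutive states.
    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    xi = [[[alpha[t - 1][j] * A[j][k] * B[t][k] * beta[t][k] / scale[t]
            for k in range(K)] for j in range(K)] for t in range(1, T)]
    return gamma, xi
```

In the M-step, $\pi_k$ and $A_{jk}$ follow from normalized $\gamma$ and $\xi$, and each output distribution is re-estimated separately with $\gamma$ as frame weights, which matches the slide's "independently update".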

COMMENTING PHASE
• ML estimation of comment sentences
  • Assume a generative model of word sequences
$$\{\hat{c}, \hat{l}\} = \arg\max_{\{c, l\}} p(c, l) = \arg\max_{\{c, l\}} p(c \mid l)\, p(l)$$
  • $p(l)$: Probability that length is $l$ ← Gaussian
  • $p(c \mid l)$: Probability that sequence is $c$ when length is $l$ ← ???
$$p(c \mid l) = \left( p(w_1 \mid SilB) \prod_{i=2}^{l} p(w_i \mid w_{i-2}, w_{i-1}) \; p(SilE \mid w_{l-1}, w_l) \right)^{1/l}$$
  • Computed by the Viterbi algorithm using bi- and tri-grams
[Figure: word lattice from SilB to SilE (○: Word), with columns for $w_{i-2}$, $w_{i-1}$, $w_i$, ..., $w_{l-1}$, $w_l$]
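
The objective $p(c \mid l)\,p(l)$ can be illustrated by scoring a handful of candidate sentences directly. This sketch replaces the Viterbi search on the slide with a brute-force comparison over given candidates, and it assumes $w_0$ = SilB in the tri-gram term for $i = 2$. The n-gram tables, their key layout, and the function names are hypothetical.

```python
import math

def log_p_sentence(words, bigram, trigram):
    """log p(c | l): length-normalized n-gram log-probability.

    bigram[(prev, word)] and trigram[(prev2, prev1, word)] are probability
    tables; "SilB" and "SilE" mark the sentence beginning and end.
    """
    l = len(words)
    padded = ["SilB"] + list(words)           # padded[i] is w_i, padded[0] is SilB
    total = math.log(bigram[("SilB", padded[1])])
    for i in range(2, l + 1):
        total += math.log(trigram[(padded[i - 2], padded[i - 1], padded[i])])
    total += math.log(trigram[(padded[l - 1], padded[l], "SilE")])
    return total / l                          # the exponent 1/l in the log domain

def log_p_length(l, mean, var):
    """log p(l): Gaussian over the comment length."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (l - mean) ** 2 / var)

def best_comment(candidates, bigram, trigram, mean, var):
    """Pick the candidate maximizing log p(c | l) + log p(l)."""
    return max(
        candidates,
        key=lambda c: log_p_sentence(c, bigram, trigram)
        + log_p_length(len(c), mean, var),
    )
```

The $1/l$ exponent keeps longer sentences from being penalized merely for containing more factors, so the length preference comes from $p(l)$ alone.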

OUTLINING STAGE
• Determine content and positions of comments
  • Input acoustic and textual features
    • Input only acoustic features if there are no existing user comments in a target clip
  • Estimate a ML state sequence
    • Use the Viterbi algorithm
  • Calculate ML textual features at each frame
• The ML textual features alone cannot generate sentences!
[Figure: state sequence in a target clip, $z_{t-1}^{(n)} \to z_t^{(n)} \to z_{t+1}^{(n)}$; acoustic features $a_t^{(n)}$ (MFCCs & energy, dynamic property; GMM) are given, and ML textual features are read off: bag-of-words $w_t^{(n)}$ (Multinomial), density $d_t^{(n)}$ (Gaussian), length $l_t^{(n)}$ (Gaussian)]
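
The ML state sequence can be found with the standard log-domain Viterbi recursion, sketched below. The inputs are assumed to be precomputed log-probabilities, with `log_B[t][k]` built from the acoustic term only when the target clip has no comments. Reading the ML comment density off the decoded state's Gaussian mean is one plausible interpretation, and `density_means` holds made-up values.

```python
import math

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence of an HMM (log domain).

    log_pi[k]: initial log-probability, log_A[j][k]: transition
    log-probability, log_B[t][k]: output log-probability at frame t.
    """
    T, K = len(log_B), len(log_pi)
    delta = [log_pi[k] + log_B[0][k] for k in range(K)]
    back = []                                  # back[t-1][k]: best predecessor of k
    for t in range(1, T):
        prev, delta, pointers = delta, [], []
        for k in range(K):
            j_best = max(range(K), key=lambda j: prev[j] + log_A[j][k])
            pointers.append(j_best)
            delta.append(prev[j_best] + log_A[j_best][k] + log_B[t][k])
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(K), key=lambda k: delta[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy two-state example with hypothetical numbers.
log_pi = [math.log(p) for p in (0.6, 0.4)]
log_A = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
log_B = [[math.log(p) for p in row] for row in ((0.9, 0.2), (0.1, 0.8), (0.5, 0.5))]
path = viterbi(log_pi, log_A, log_B)

# Assumed readout: the ML comment density per frame is the mean of the
# decoded state's density Gaussian.
density_means = [0.5, 4.0]
ml_density = [density_means[k] for k in path]
```

Frames whose decoded state has a high density mean are natural positions for new comments, and the state's word distribution and length mean outline what to say there.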

PROBLEMS AND SOLUTIONS
• No probabilities of words required for sentences
  • Bag-of-words feature = Reduced uni-gram
    • Verb conjugations are not taken into account
    • Auxiliary words are removed
  → All words required for composing sentences are contained (e.g. "a", "and", "was", "be", "performance")
• No probabilities of word concatenations
  • Bi- and tri-grams are not taken into account
  • Which is more suitable? "This is a good performance" or "This performance is good"
  → Use general bi- and tri-grams learned from all user comments
