

  1. Joint Emotion Analysis via Multi-task Gaussian Processes Daniel Beck, Trevor Cohn, Lucia Specia October 28, 2014

  2. Outline: 1. Introduction; 2. Multi-task Gaussian Process Regression; 3. Experiments and Discussion; 4. Conclusions and Future Work 2 / 23

  3. Outline: 1. Introduction; 2. Multi-task Gaussian Process Regression; 3. Experiments and Discussion; 4. Conclusions and Future Work 3 / 23

  4. Emotion Analysis Goal Automatically detect emotions in a text [Strapparava and Mihalcea, 2008]; 4 / 23

  5. Emotion Analysis Goal Automatically detect emotions in a text [Strapparava and Mihalcea, 2008]. Example headline annotations:
  Headline                                       Fear  Joy  Sadness
  Storms kill, knock out power, cancel flights     82    0       60
  Panda cub makes her debut                         0   59        0
  4 / 23
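To make the annotation format concrete, here is a minimal Python sketch (variable names are illustrative, values taken from the slide) pairing each headline with its vector of emotion scores:

```python
# Each headline is annotated with a score in [0, 100] per emotion.
# Only the three emotions shown on the slide are included here.
emotions = ["Fear", "Joy", "Sadness"]

headlines = {
    "Storms kill, knock out power, cancel flights": [82, 0, 60],
    "Panda cub makes her debut": [0, 59, 0],
}

for text, scores in headlines.items():
    print(text, "->", dict(zip(emotions, scores)))
```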

  6. Why Multi-task? Learn a model that shows sound and interpretable correlations between emotions. 5 / 23

  7. Why Multi-task? Learn a model that shows sound and interpretable correlations between emotions. Datasets are scarce and small → Multi-task models are able to learn from all emotions jointly; 5 / 23

  8. Why Multi-task? Learn a model that shows sound and interpretable correlations between emotions. Datasets are scarce and small → Multi-task models are able to learn from all emotions jointly; Annotation scheme is subjective and fine-grained → Prone to bias and noise; 5 / 23

  9. Why Multi-task? Learn a model that shows sound and interpretable correlations between emotions. Datasets are scarce and small → Multi-task models are able to learn from all emotions jointly; Annotation scheme is subjective and fine-grained → Prone to bias and noise; Disclaimer: this work is not about features (at the moment...) 5 / 23

  10. Multi-task learning and Anti-correlations Most multi-task models used in NLP assume some degree of correlation between tasks: 6 / 23

  11. Multi-task learning and Anti-correlations Most multi-task models used in NLP assume some degree of correlation between tasks: Domain Adaptation: assumes the existence of “general”, domain-independent knowledge in the data. 6 / 23

  12. Multi-task learning and Anti-correlations Most multi-task models used in NLP assume some degree of correlation between tasks: Domain Adaptation: assumes the existence of “general”, domain-independent knowledge in the data. Annotation Noise Modelling: assumes that annotations are noisy deviations from a “ground truth”. 6 / 23

  13. Multi-task learning and Anti-correlations Most multi-task models used in NLP assume some degree of correlation between tasks: Domain Adaptation: assumes the existence of “general”, domain-independent knowledge in the data. Annotation Noise Modelling: assumes that annotations are noisy deviations from a “ground truth”. For Emotion Analysis, we need a multi-task model that is able to take into account possible anti-correlations, avoiding negative transfer.
  Headline                                       Fear  Joy  Sadness
  Storms kill, knock out power, cancel flights     82    0       60
  Panda cub makes her debut                         0   59        0

  14. Outline: 1. Introduction; 2. Multi-task Gaussian Process Regression; 3. Experiments and Discussion; 4. Conclusions and Future Work 7 / 23

  15.–24. Gaussian Processes Let (X, y) be the training data and f(x) the latent function that models that data:

      f(x) ∼ GP( µ(x), k(x, x′) )

  where µ(x) is the mean function and k(x, x′) is the kernel function. The posterior over f combines the prior p(f) with the likelihood p(y | X, f), normalised by the marginal likelihood p(y | X):

      p(f | X, y) = p(y | X, f) p(f) / p(y | X)

  Predictions for a new input x∗ come from the predictive distribution, which integrates the test likelihood over this posterior:

      p(y∗ | x∗, X, y) = ∫ p(y∗ | x∗, f, X, y) p(f | X, y) df
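Under a Gaussian likelihood (the setting adopted on the next slide), this predictive integral has a closed form. The following is the standard GP regression result, stated here for reference rather than taken from the slides:

```latex
% Standard closed-form GP regression predictive, assuming
% y = f(x) + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma_n^2).
\begin{aligned}
p(y_* \mid x_*, X, y) &= \mathcal{N}(\mu_*, \sigma_*^2), \\
\mu_*      &= k_*^\top (K + \sigma_n^2 I)^{-1} y, \\
\sigma_*^2 &= k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_* + \sigma_n^2,
\end{aligned}
```

where K is the kernel matrix over the training inputs and k_* holds the kernel values between the test input x_* and each training input.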

  25. GP Regression Likelihood: In a regression setting, we usually consider a Gaussian likelihood, which allows us to obtain a closed-form solution for the test posterior; 9 / 23

  26. GP Regression Likelihood: In a regression setting, we usually consider a Gaussian likelihood, which allows us to obtain a closed-form solution for the test posterior; Kernel: Many options available. In this work we use the Radial Basis Function (RBF) kernel¹:

      k(x, x′) = α_f² × exp( −(1/2) Σ_{i=1}^{F} (x_i − x_i′)² / l_i )

  ¹ AKA Squared Exponential, Gaussian or Exponential Quadratic kernel. 9 / 23
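A minimal NumPy sketch of the two pieces just described: the ARD RBF kernel from the slide and the closed-form predictive under a Gaussian likelihood. This is a sketch, not the authors' implementation; the function names and default hyperparameter values are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, alpha_f=1.0, lengthscales=None):
    """ARD RBF kernel from the slide:
    k(x, x') = alpha_f^2 * exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    A = X1 / np.sqrt(lengthscales)           # scale each feature by its lengthscale
    B = X2 / np.sqrt(lengthscales)
    sqdist = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return alpha_f**2 * np.exp(-0.5 * sqdist)

def gp_predict(X, y, X_star, noise_var=1.0):
    """Closed-form GP regression predictive mean and variance."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_star = rbf_kernel(X, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise)^-1 y
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = np.diag(rbf_kernel(X_star, X_star)) - (v**2).sum(0) + noise_var
    return mean, var
```

In practice the kernel variance, lengthscales and noise variance are hyperparameters optimised against the marginal likelihood rather than fixed as in this sketch.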

  27.–30. The Intrinsic Coregionalisation Model Coregionalisation models extend GPs to vector-valued outputs [Álvarez et al., 2012]. Here we use the Intrinsic Coregionalisation Model (ICM):

      k( (x, d), (x′, d′) ) = k_data(x, x′) × B_{d,d′}

  where k_data is a kernel on data points (like RBF, for instance) and B is the coregionalisation matrix, which encodes task covariances. B can be parameterised and learned by optimizing the model marginal likelihood. 10 / 23
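A sketch of how the ICM kernel can be evaluated, reusing the rbf_kernel function from the earlier sketch as k_data; the helper name and the task-index representation are illustrative. Note that negative off-diagonal entries of B would encode anti-correlated emotions, which is the flexibility motivated earlier.

```python
import numpy as np

def icm_kernel(X1, tasks1, X2, tasks2, B):
    """ICM kernel: k((x, d), (x', d')) = k_data(x, x') * B[d, d'].
    tasks1/tasks2 hold the emotion index d for each row of X1/X2."""
    K_data = rbf_kernel(X1, X2)                 # kernel on data points (RBF here),
                                                # as defined in the previous sketch
    return K_data * B[np.ix_(tasks1, tasks2)]   # scale by the task covariances
```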

  31. PPCA model [Bonilla et al., 2008] decomposes B using PPCA: B = UΛUᵀ + diag(α) 11 / 23

  32. PPCA model [Bonilla et al., 2008] decomposes B using PPCA: B = UΛUᵀ + diag(α). To ensure numerical stability, we employ the incomplete-Cholesky decomposition over UΛUᵀ: B = L̃ L̃ᵀ + diag(α) 11 / 23
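A small sketch of this construction with placeholder values; in the model, the entries of L̃ and α are hyperparameters learned from data, not sampled as here.

```python
import numpy as np

D, rank = 6, 1                              # 6 emotions, rank-1 decomposition
rng = np.random.default_rng(0)
L_tilde = rng.normal(size=(D, rank))        # low-rank factor
alpha = rng.uniform(0.1, 1.0, size=D)       # per-task diagonal term

B = L_tilde @ L_tilde.T + np.diag(alpha)    # coregionalisation matrix
assert np.all(np.linalg.eigvalsh(B) > 0)    # positive definite by construction
```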

  33.–39. PPCA model (illustration) B is assembled from a low-rank matrix L̃ and a per-task diagonal: B = L̃ × L̃ᵀ + diag(α). With 6 emotions, a rank-1 L̃ (a single column L₁₁ … L₆₁) together with α₁ … α₆ gives 12 hyperparameters; adding a second column of L̃ gives 18 hyperparameters; a third column gives 24 hyperparameters.
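The counts above follow directly from the decomposition: for D tasks and a rank-R L̃, B contributes D·R entries from L̃ plus D entries from α.

```latex
% Number of hyperparameters in B = \tilde{L}\tilde{L}^\top + \mathrm{diag}(\alpha)
\#\mathrm{params}(B) = D \cdot R + D,
\qquad D = 6:\; R = 1 \Rightarrow 12,\quad R = 2 \Rightarrow 18,\quad R = 3 \Rightarrow 24.
```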

  40. Outline: 1. Introduction; 2. Multi-task Gaussian Process Regression; 3. Experiments and Discussion; 4. Conclusions and Future Work 15 / 23

  41. Experimental Setup Dataset: SemEval-2007 “Affective Text” [Strapparava and Mihalcea, 2007]; 16 / 23

  42. Experimental Setup Dataset: SemEval-2007 “Affective Text” [Strapparava and Mihalcea, 2007]; 1000 news headlines, each one annotated with 6 scores in [0–100], one per emotion; 16 / 23
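One possible way to arrange such a dataset for the ICM kernel sketched earlier: replicate each headline's feature vector once per emotion and keep the emotion index alongside it. Shapes and names here are illustrative, not taken from the paper.

```python
import numpy as np

n, F, D = 1000, 100, 6        # headlines, feature dimensions (illustrative), emotions
X = np.zeros((n, F))          # one feature vector per headline
Y = np.zeros((n, D))          # one score in [0, 100] per (headline, emotion)

# Stack the data for the multi-task GP: each (headline, emotion) pair
# becomes one training point with its task index.
X_stacked = np.repeat(X, D, axis=0)
tasks = np.tile(np.arange(D), n)
y_stacked = Y.reshape(-1)     # row-major flatten matches the repeat/tile order
```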
