

SLIDE 1

Joint Emotion Analysis via Multi-task Gaussian Processes

Daniel Beck, Trevor Cohn, Lucia Specia
October 28, 2014

SLIDE 2

Outline

1. Introduction
2. Multi-task Gaussian Process Regression
3. Experiments and Discussion
4. Conclusions and Future Work


SLIDES 4-5

Emotion Analysis

Goal: automatically detect emotions in a text [Strapparava and Mihalcea, 2008].

Headline                                        Fear   Joy   Sadness
Storms kill, knock out power, cancel flights      82     -        60
Panda cub makes her debut                          -    59         -

SLIDES 6-9

Why Multi-task?

- Learn a model that shows sound and interpretable correlations between emotions.
- Datasets are scarce and small → multi-task models are able to learn from all emotions jointly.
- The annotation scheme is subjective and fine-grained → prone to bias and noise.
- Disclaimer: this work is not about features (at the moment...).

SLIDES 10-13

Multi-task Learning and Anti-correlations

Most multi-task models used in NLP assume some degree of correlation between tasks:

- Domain Adaptation: assumes the existence of "general", domain-independent knowledge in the data.
- Annotation Noise Modelling: assumes that annotations are noisy deviations from a "ground truth".

For Emotion Analysis, we need a multi-task model that can take possible anti-correlations into account, avoiding negative transfer (recall the Fear/Joy/Sadness headline scores above).


SLIDES 15-24

Gaussian Processes

Let (X, y) be the training data and f(x) the latent function that models that data:

    f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}'))

where \mu(\mathbf{x}) is the mean function and k(\mathbf{x}, \mathbf{x}') is the kernel function.

Posterior over the latent function:

    p(f \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, f)\, p(f)}{p(\mathbf{y} \mid X)}

where p(f) is the prior, p(\mathbf{y} \mid X, f) the likelihood, and p(\mathbf{y} \mid X) the marginal likelihood.

Predictive distribution for a test input \mathbf{x}_*:

    p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int_f p(y_* \mid \mathbf{x}_*, f, X, \mathbf{y})\, p(f \mid X, \mathbf{y})\, df

where p(y_* \mid \mathbf{x}_*, f, X, \mathbf{y}) is the test likelihood.
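The slides leave the predictive integral abstract; with the Gaussian likelihood introduced on the next slide it has a standard closed form, reproduced here for reference (not on the original slides; a zero-mean prior and noise variance \sigma^2 are assumed):

    p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}\!\left(y_*;\; \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{y},\; k_{**} - \mathbf{k}_*^\top (K + \sigma^2 I)^{-1} \mathbf{k}_* + \sigma^2\right)

where K = k(X, X), \mathbf{k}_* = k(X, \mathbf{x}_*), and k_{**} = k(\mathbf{x}_*, \mathbf{x}_*).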

SLIDES 25-26

GP Regression

Likelihood: in a regression setting we usually adopt a Gaussian likelihood, which allows us to obtain a closed-form solution for the test posterior.

Kernel: many options are available. In this work we use the Radial Basis Function (RBF) kernel¹:

    k(\mathbf{x}, \mathbf{x}') = \alpha_f^2 \exp\left(-\frac{1}{2} \sum_{i=1}^{F} \frac{(x_i - x'_i)^2}{l_i}\right)

¹ AKA the Squared Exponential, Gaussian or Exponentiated Quadratic kernel.
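To make the closed form concrete, here is a minimal NumPy sketch of GP regression with this ARD RBF kernel (illustrative only, not the authors' code; names such as rbf_ard, gp_predict, alpha_f and noise_var are ours):

```python
import numpy as np

def rbf_ard(X1, X2, alpha_f=1.0, lengthscales=None):
    """ARD RBF kernel from the slide: alpha_f^2 * exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    # Pairwise squared differences, scaled per feature dimension
    sq = (X1[:, None, :] - X2[None, :, :]) ** 2 / lengthscales
    return alpha_f ** 2 * np.exp(-0.5 * sq.sum(axis=-1))

def gp_predict(X, y, X_star, noise_var=0.1, **kern_args):
    """Closed-form GP regression posterior (zero-mean prior, Gaussian noise):
    mean = k_*^T (K + s^2 I)^{-1} y,  cov = k_** - k_*^T (K + s^2 I)^{-1} k_*."""
    K = rbf_ard(X, X, **kern_args) + noise_var * np.eye(len(X))
    K_star = rbf_ard(X, X_star, **kern_args)
    K_ss = rbf_ard(X_star, X_star, **kern_args)
    L = np.linalg.cholesky(K)                      # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_star)
    return K_star.T @ alpha, K_ss - v.T @ v        # predictive mean and covariance

# Toy usage: 5 training points with 2 features each, 3 test points
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 2)), rng.normal(size=5)
mu, cov = gp_predict(X, y, rng.normal(size=(3, 2)),
                     lengthscales=np.array([1.0, 2.0]))
```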

SLIDES 27-30

The Intrinsic Coregionalisation Model

Coregionalisation models extend GPs to vector-valued outputs [Álvarez et al., 2012]. Here we use the Intrinsic Coregionalisation Model (ICM):

    k((\mathbf{x}, d), (\mathbf{x}', d')) = k_{\mathrm{data}}(\mathbf{x}, \mathbf{x}') \times B_{d,d'}

where k_data is a kernel on data points (such as the RBF) and B is the coregionalisation matrix, which encodes task covariances. B can be parameterised and learned by optimising the model's marginal likelihood.
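A minimal sketch of the ICM kernel, reusing the rbf_ard function above (again illustrative, not the authors' implementation):

```python
import numpy as np

def icm_kernel(X1, tasks1, X2, tasks2, B, data_kernel):
    """ICM kernel: k((x, d), (x', d')) = k_data(x, x') * B[d, d'].
    tasks1/tasks2 hold an integer task (emotion) index per row of X1/X2."""
    K_data = data_kernel(X1, X2)               # kernel on inputs, e.g. the RBF
    return K_data * B[np.ix_(tasks1, tasks2)]  # scale each entry by the task covariance

# Toy usage with 6 emotion tasks: negative entries in B capture
# anti-correlated emotions (e.g. Joy vs. Sadness).
rng = np.random.default_rng(1)
L_tilde = rng.normal(size=(6, 1))                    # rank-1 factor
B = L_tilde @ L_tilde.T + np.diag(np.full(6, 0.5))   # B = L L^T + diag(alpha)
X = rng.normal(size=(4, 3))
tasks = np.array([0, 0, 3, 5])                       # task index per data point
K = icm_kernel(X, tasks, X, tasks, B, rbf_ard)       # full multi-task Gram matrix
```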

SLIDES 31-32

PPCA Model

[Bonilla et al., 2008] decompose B using PPCA:

    B = U \Lambda U^\top + \mathrm{diag}(\boldsymbol{\alpha})

To ensure numerical stability, we employ the incomplete Cholesky decomposition over U \Lambda U^\top:

    B = \tilde{L} \tilde{L}^\top + \mathrm{diag}(\boldsymbol{\alpha})

SLIDES 33-39

PPCA Model (continued)

With six emotion tasks, a rank-1 \tilde{L} is a single column [L11, ..., L61]^\top, so B = \tilde{L}\tilde{L}^\top + \mathrm{diag}(\alpha_1, ..., \alpha_6) has 12 hyperparameters. Each additional column of \tilde{L} increases the rank by one and adds 6 hyperparameters: rank 2 gives 18, rank 3 gives 24.
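The rank/hyperparameter trade-off is easy to verify in code (a short illustrative sketch; the function names are ours):

```python
import numpy as np

def build_B(L_tilde, alpha):
    """Coregionalisation matrix B = L L^T + diag(alpha)."""
    return L_tilde @ L_tilde.T + np.diag(alpha)

def n_hyperparams(n_tasks, rank):
    """Entries of L_tilde (n_tasks * rank) plus the diagonal alpha (n_tasks)."""
    return n_tasks * rank + n_tasks

for rank in (1, 2, 3):
    print(rank, n_hyperparams(6, rank))   # -> 12, 18, 24 for the 6 emotions
```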


SLIDES 41-44

Experimental Setup

- Dataset: SemEval-2007 "Affective Text" [Strapparava and Mihalcea, 2007];
- 1000 news headlines, each annotated with six scores in [0-100], one per emotion;
- Bag-of-words representation as features;
- Pearson's correlation as the evaluation metric.
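For reference (not shown on the slides), Pearson's correlation between gold and predicted emotion scores can be computed with scipy; the numbers below are made up:

```python
import numpy as np
from scipy.stats import pearsonr

gold = np.array([82.0, 0.0, 60.0, 59.0])   # hypothetical gold scores for one emotion
pred = np.array([70.0, 5.0, 55.0, 48.0])   # hypothetical model predictions
r, p_value = pearsonr(gold, pred)          # r in [-1, 1]; higher is better
```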

SLIDE 45

Learned Task Covariances

[Figure: learned task covariance matrix, model trained on 100 sentences.]

SLIDES 46-48

Prediction Results

Split: 100/900

[Figures: prediction results.]

SLIDE 49

Training Set Size Influence

Split: 100+/100


SLIDES 51-57

Conclusions and Future Work

Conclusions:
- The proposed model is able to learn sensible correlations and anti-correlations;
- For small datasets, it also outperforms single-task baselines.

Future Work:
- Modelling the label distribution (different priors, different likelihoods);
- Multiple multi-task levels (for example, MTurk data [Snow et al., 2008]);
- Other multi-task GP models [Álvarez et al., 2012, Hensman et al., 2013].


SLIDE 59

Error Analysis

SLIDES 60-61

References

Álvarez, M. A., Rosasco, L., and Lawrence, N. D. (2012). Kernels for Vector-Valued Functions: A Review. Foundations and Trends in Machine Learning, pages 1-37.

Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2008). Multi-task Gaussian Process Prediction. In Advances in Neural Information Processing Systems.

Cohn, T. and Specia, L. (2013). Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation. In Proceedings of ACL.

Hensman, J., Lawrence, N. D., and Rattray, M. (2013). Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14:252.

Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of EMNLP.

Strapparava, C. and Mihalcea, R. (2007). SemEval-2007 Task 14: Affective Text. In Proceedings of SemEval.

Strapparava, C. and Mihalcea, R. (2008). Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing.