SLIDE 1

TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring

SLIDE 2

Outline

  • Background
  • Method
  • Experiments
  • Conclusions
SLIDE 3

What is Automated Essay Scoring (AES)?

  • A computer produces a summative assessment for evaluation
  • Aim: reduce human workload
  • AES has been in practical use at ETS since 1999
SLIDE 4

Prompt-specific and Prompt-independent AES

  • Most existing AES approaches are prompt-specific

– Require human labels for each prompt to train
– Can achieve satisfactory human-machine agreement

  • Quadratic weighted kappa (QWK) > 0.75 [Taghipour & Ng, EMNLP 2016]
  • Inter-human agreement: QWK=0.754
  • Prompt-independent AES remains a challenge

– Only non-target human labels are available

SLIDE 5

Challenges in Prompt-independent AES

[Figure: a model is learned on essays from source prompts (Prompt 1: Winter Olympics, Prompt 2: Rugby World Cup, Prompt 3: Australian Open) and predicts ratings for the target prompt (World Cup 2018)]

SLIDE 6

Challenges in Prompt-independent AES


Unavailability of rated essays written for the target prompt

SLIDE 7

Challenges in Prompt-independent AES

  • Previous approaches learn on source prompts

– Domain adaptation [Phandi et al., EMNLP 2015]
– Cross-domain learning [Dong & Zhang, EMNLP 2016]
– Achieved Avg. QWK = 0.6395 at best, with up to 100 labeled target essays


SLIDE 8

Challenges in Prompt-independent AES


Off-topic: essays written for source prompts are mostly irrelevant

SLIDE 9

Outline

  • Background
  • Method
  • Experiments
  • Conclusions
SLIDE 10

TDNN: A Two-stage Deep Neural Network for Prompt-independent AES

  • Based on the idea of transductive transfer learning
  • Learns on the target essays themselves
  • Utilizes the content of the target essays when rating them
SLIDE 11

The Two-stage Architecture

  • Prompt-independent stage: train a shallow model to create pseudo labels on the target prompt

SLIDE 12

The Two-stage Architecture

  • Prompt-dependent stage: learn an end-to-end model to predict essay ratings for the target prompt

SLIDE 13

Prompt-independent stage

  • Train a robust prompt-independent AES model
  • Using essays from non-target prompts
  • Learning algorithm: RankSVM for AES
  • Pre-defined prompt-independent features
  • Select confident essays written for the target prompt
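RankSVM learns from pairwise order rather than absolute scores, which is what makes it usable across prompts with different rating scales. A minimal sketch of the pairwise idea, with a toy two-dimensional feature vector standing in for the paper's handcrafted prompt-independent features (all names and data here are illustrative, not the paper's):

```python
import numpy as np

def pairwise_data(X, y):
    """Expand scored essays into pairwise difference examples:
    whenever y[i] > y[j], the ranker should assign the difference
    vector X[i] - X[j] a positive score."""
    diffs = [X[i] - X[j]
             for i in range(len(y)) for j in range(len(y))
             if y[i] > y[j]]
    return np.array(diffs)

def train_linear_ranker(X, y, epochs=200, lr=0.01):
    """Hinge-loss SGD over difference vectors -- the core idea of a
    linear RankSVM (the regularization term is omitted for brevity)."""
    D = pairwise_data(X, y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for d in D:
            if np.dot(w, d) < 1.0:   # pairwise margin violated
                w += lr * d
    return w

# Toy data: feature 0 (e.g. essay length) correlates with the score.
X = np.array([[1.0, 0.2], [2.0, 0.1], [3.0, 0.3], [4.0, 0.2]])
y = np.array([2, 4, 6, 8])           # human scores on source prompts
w = train_linear_ranker(X, y)
scores = X @ w                       # induced ranking scores
```

In practice a regularized solver (e.g. SVMrank, or a linear SVM trained on the difference vectors) would replace this bare SGD loop.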
SLIDE 14

Prompt-independent stage

[Figure: predicted scores for the target essays on a 0–10 scale]

SLIDE 15

Prompt-independent stage


Predicted ratings in [0, 4] as negative examples

SLIDE 16

Prompt-independent stage

Predicted ratings in [8, 10] as positive examples

SLIDE 17

Prompt-independent stage

Converted to 0/1 labels

SLIDE 18

Prompt-independent stage

  • Common sense: a score ≥ 8 is good, below 5 is bad
  • Enlarge the sample size
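Under the thresholds on the slides (scores on a 0–10 scale, ≥ 8 positive, ≤ 4 negative), the selection step can be sketched directly; the function name and data layout are mine, not the paper's:

```python
def select_pseudo_labeled(essays, predicted_scores, pos_min=8, neg_max=4):
    """Keep only confidently rated target essays and assign 0/1
    pseudo labels: >= pos_min -> positive, <= neg_max -> negative.
    Essays in the uncertain middle band are discarded."""
    labeled = []
    for essay, score in zip(essays, predicted_scores):
        if score >= pos_min:
            labeled.append((essay, 1))
        elif score <= neg_max:
            labeled.append((essay, 0))
    return labeled

# "e2" (score 6.0) falls in the uncertain band and is dropped.
examples = select_pseudo_labeled(["e1", "e2", "e3", "e4"],
                                 [9.1, 6.0, 2.5, 8.0])
```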

SLIDE 19

Prompt-dependent stage

  • Train a hybrid deep model for prompt-dependent assessment
  • An end-to-end neural network with three input parts:
  • Word semantic embeddings
  • Part-of-speech (POS) tags
  • Syntactic tags
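One way to picture the three inputs is as per-token vectors consumed side by side. The sketch below simply concatenates them for illustration; the actual TDNN processes the semantic, POS, and syntactic representations in parallel parts of the network, and the tiny lookup tables here are placeholders, not real GloVe vectors or full tag inventories:

```python
import numpy as np

# Placeholder lookups (the paper uses pre-trained GloVe embeddings
# and standard POS / syntactic tag sets; these are illustrative only).
WORD_VECS = {"the": np.array([0.1, 0.3]), "cat": np.array([0.7, 0.2])}
POS_TAGS = ["DT", "NN", "VB"]
SYN_TAGS = ["NP", "VP"]

def one_hot(tag, inventory):
    v = np.zeros(len(inventory))
    v[inventory.index(tag)] = 1.0
    return v

def token_input(word, pos, syn):
    """Concatenate word embedding, POS one-hot, and syntactic
    one-hot into a single per-token input vector."""
    return np.concatenate([WORD_VECS[word],
                           one_hot(pos, POS_TAGS),
                           one_hot(syn, SYN_TAGS)])

x = token_input("cat", "NN", "NP")   # 2 + 3 + 2 = 7 dimensions
```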
SLIDE 20

Architecture of the hybrid deep model

Multi-layer structure: Words → (Phrases) → Sentences → Essay

SLIDE 21

Architecture of the hybrid deep model

GloVe word embeddings

SLIDE 22

Architecture of the hybrid deep model

Part-of-speech (POS) tags

SLIDE 23

Architecture of the hybrid deep model

Syntactic tags


SLIDE 25

Architecture of the hybrid deep model

SLIDE 26

Model Training

  • Training loss: MSE on the 0/1 pseudo labels
  • Validation metric: kappa on 30% of the non-target essays
– Select the model that rates best on validation

SLIDE 27

Outline

  • Background
  • Method
  • Experiments
  • Conclusions
SLIDE 28

Dataset & Metrics

  • We use the standard ASAP corpus
– 8 prompts with >10K essays in total
  • Prompt-independent AES: 7 prompts are used for training, 1 for testing
  • Report on common human-machine agreement metrics
– Pearson's correlation coefficient (PCC)
– Spearman's correlation coefficient (SCC)
– Quadratic weighted kappa (QWK)
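Of the three metrics, QWK is the least standard to compute by hand, so a minimal implementation for integer ratings may help (scikit-learn's `cohen_kappa_score` with `weights="quadratic"` should agree on integer labels):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_r, max_r):
    """QWK between two integer rating vectors on [min_r, max_r]."""
    n = max_r - min_r + 1
    O = np.zeros((n, n))                   # observed rating co-occurrences
    for x, y in zip(a, b):
        O[x - min_r, y - min_r] += 1
    hist_a = O.sum(axis=1)
    hist_b = O.sum(axis=0)
    E = np.outer(hist_a, hist_b) / len(a)  # expected matrix under chance
    i, j = np.indices((n, n))
    W = ((i - j) ** 2) / ((n - 1) ** 2)    # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

human = [1, 2, 3, 4, 3]
machine = [1, 2, 3, 4, 3]
qwk = quadratic_weighted_kappa(human, machine, 1, 4)  # identical ratings -> 1.0
```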

SLIDE 29

Baselines

  • RankSVM based on prompt-independent handcrafted features
  • Also used in the prompt-independent stage of TDNN
  • 2L-LSTM [Alikaniotis et al., ACL 2016]
  • Two LSTM layers + a linear layer
  • CNN-LSTM [Taghipour & Ng, EMNLP 2016]
  • CNN + LSTM + a linear layer
  • CNN-LSTM-ATT [Dong et al., CoNLL 2017]
  • CNN-LSTM + attention
SLIDE 30
  • High variance in the DNN baselines' performance across all 8 prompts
  • Possibly caused by learning only on non-target prompts
  • RankSVM appears to be the most stable baseline
  • This justifies using RankSVM in the first stage of TDNN

RankSVM is the most robust baseline

SLIDE 31
  • TDNN outperforms the best baseline on 7 out of 8 prompts
  • Performance improvements are gained by learning on the target prompt

Comparison to the best baseline

SLIDE 32

Average performance on 8 prompts

Method           QWK     PCC     SCC
Baselines:
  RankSVM        .5462   .6072   .5976
  2L-LSTM        .4687   .6548   .6214
  CNN-LSTM       .5362   .6569   .6139
  CNN-LSTM-ATT   .5057   .6535   .6368
TDNN:
  TDNN(Sem)      .5875   .6779   .6795
  TDNN(Sem+POS)  .6582   .7103   .7130
  TDNN(Sem+Synt) .6856   .7244   .7365
  TDNN(POS+Synt) .6784   .7189   .7322
  TDNN(ALL)      .6682   .7176   .7258


SLIDE 35

Sanity Check: Relative Precision

How does the quality of the pseudo examples affect the performance of TDNN?

➢ The sanity of the selected essays: the number of positive (negative) essays that are better (worse) than all negative (positive) essays
➢ Such relative precision is at least 80%, and mostly beyond 90%, across different prompts
➢ TDNN can thus learn from largely correct 0/1 labels
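Under my reading of this definition (a pseudo-positive counts as correct when its true score exceeds every pseudo-negative's true score, and symmetrically for negatives), the check can be sketched as:

```python
def relative_precision(pos_true_scores, neg_true_scores):
    """Fraction of pseudo-labeled essays whose true score is on the
    correct side of *all* essays with the opposite pseudo label."""
    worst_pos = min(pos_true_scores)
    best_neg = max(neg_true_scores)
    good_pos = sum(1 for s in pos_true_scores if s > best_neg)
    good_neg = sum(1 for s in neg_true_scores if s < worst_pos)
    total = len(pos_true_scores) + len(neg_true_scores)
    return (good_pos + good_neg) / total

# One pseudo-positive (true score 5) and one pseudo-negative (6) overlap,
# so 4 of the 6 selected essays are on the correct side.
rp = relative_precision([9, 8, 5], [3, 6, 2])
```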

SLIDE 36

Conclusions

  • It is beneficial to learn an AES model on the target prompt
  • Syntactic features are a useful addition to the widely used Word2Vec embeddings
  • Sanity check: small overlap between positive and negative examples
  • Prompt-independent AES remains an open problem
– ETS requires kappa > 0.70
– TDNN achieves 0.68 at best

SLIDE 37

Thank you!