

SLIDE 1

Document Embedding Enhanced Event Detection with Hierarchical and Supervised Attention

Yue Zhao, Xiaolong Jin, Yuanzhuo Wang, Xueqi Cheng

University of Chinese Academy of Sciences CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences

SLIDE 2

Introduction Motivation Model Experiments Summary

1

Content

SLIDE 3

Introduction

  • Event Detection
  • a subtask of event extraction
  • given a document, extract event triggers from individual sentences and further identify the (pre-defined) types of the corresponding events

  • Event Trigger
  • the words in a sentence that most clearly express the occurrence of an event

… They have been married for three years. …

The event trigger is "married", which represents a Marry event.

SLIDE 4

Motivation

… I knew it was time to leave. …
→ End-Position event or Transport event?

… I knew it was time to leave. Is not that a great argument for term limits? …
→ End-Position event

A single sentence alone may be ambiguous; the contextual information of the surrounding document offers more confidence for classifying events.

SLIDE 5

Motivation

Some shortcomings of existing works

  • Manually designed document-level features
    (Ji and Grishman, ACL 2008; Liao and Grishman, ACL 2010; Huang and Riloff, AAAI 2012)

  • Document embeddings learned without supervision cannot specifically capture event-related information
    (Duan et al., IJCNLP 2017)

SLIDE 6

DEEB-RNN: The Proposed Model

  • ED-Oriented Document Embedding Learning
  • Document-level Enhanced Event Detector

SLIDE 7

Model - ED Oriented Document Embedding Learning

Word-level embeddings

  • Word encoder: $h_{it} = \text{BiGRU}_w([w_{it}, e_{it}])$
  • Word attention: $u_{it} = \tanh(W_w h_{it})$, $\alpha_{it} = \frac{\exp(u_{it}^\top c_w)}{\sum_{t} \exp(u_{it}^\top c_w)}$
  • Sentence representation: $s_i = \sum_{t=1}^{T} \alpha_{it} h_{it}$
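The word-level attention step can be sketched in plain Python (the Bi-GRU encoder itself is omitted; the helper name `word_attention` and all toy matrices and inputs below are illustrative assumptions, not the authors' implementation):

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

def word_attention(h, W_w, c_w):
    """Attention pooling over the Bi-GRU outputs of one sentence.

    h:   list of T word encodings (each a list of d floats)
    W_w: d x d projection matrix, c_w: context vector of length d
    Returns the sentence vector s and the attention weights alpha.
    """
    # u_it = tanh(W_w h_it)
    u = [[math.tanh(sum(W_w[r][k] * h_t[k] for k in range(len(h_t))))
          for r in range(len(W_w))] for h_t in h]
    # alpha_it proportional to exp(u_it . c_w)
    alpha = softmax([sum(u_t[k] * c_w[k] for k in range(len(c_w))) for u_t in u])
    # s_i = sum_t alpha_it h_it
    d = len(h[0])
    s = [sum(alpha[t] * h[t][k] for t in range(len(h))) for k in range(d)]
    return s, alpha

# toy sentence: 3 words with 2-dim encodings (values are made up)
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_w = [[1.0, 0.0], [0.0, 1.0]]   # identity projection for simplicity
c_w = [1.0, -1.0]
s, alpha = word_attention(h, W_w, c_w)
```

With the identity projection, the scores reduce to $\tanh(h_{t,0}) - \tanh(h_{t,1})$, so the first word gets the largest weight; the weights always sum to 1 by construction.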

SLIDE 8
Model - ED Oriented Document Embedding Learning

  • Gold word-level attention signal: trigger words (e.g., "indicated") are set to 1 and all other words to 0.
  • Loss function: the squared error between the gold and learned attentions at the word level supervises the learning process:

$$E_w(\alpha^*, \alpha) = \sum_{i=1}^{L} \sum_{t=1}^{T} (\alpha^*_{it} - \alpha_{it})^2$$
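The gold signal and its squared-error loss can be sketched as follows (the trigger position and the toy attention values are hypothetical; following the slide, the gold signal is taken as a plain 0/1 vector):

```python
def gold_word_attention(trigger_positions, length):
    """Gold attention signal: 1 on trigger words, 0 elsewhere (per the slide)."""
    return [1.0 if t in trigger_positions else 0.0 for t in range(length)]

def word_attention_loss(alpha_star, alpha):
    # E_w = sum_t (alpha*_t - alpha_t)^2 for one sentence;
    # the full loss sums this over every sentence in the document
    return sum((a_s - a) ** 2 for a_s, a in zip(alpha_star, alpha))

# "... indicated ..." with the trigger at position 2 (toy values)
alpha_star = gold_word_attention({2}, 5)
alpha = [0.1, 0.1, 0.6, 0.1, 0.1]   # learned attention weights
loss = word_attention_loss(alpha_star, alpha)
```

Here the loss is $4 \times 0.1^2 + 0.4^2 = 0.2$; pushing attention mass onto the trigger word drives it toward 0.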

SLIDE 9

Model - ED Oriented Document Embedding Learning

Sentence-level embeddings

  • Sentence encoder: $q_i = \text{BiGRU}_s(s_i)$
  • Sentence attention: $t_i = \tanh(W_s q_i)$, $\beta_i = \frac{\exp(t_i^\top c_s)}{\sum_{i} \exp(t_i^\top c_s)}$
  • Document representation: $d = \sum_{i=1}^{L} \beta_i s_i$

SLIDE 10

Model - ED Oriented Document Embedding Learning

  • Gold sentence-level attention signal: sentences containing event triggers (e.g., $S_1$, $S_3$ and $S_L$) are set to 1 and all other sentences to 0.
  • Loss function: the squared error between the gold and learned attentions at the sentence level supervises the learning process:

$$E_s(\beta^*, \beta) = \sum_{i=1}^{L} (\beta^*_i - \beta_i)^2$$
SLIDE 11

Model - Document-level Enhanced Event Detector

  • Event detector: each word is re-encoded together with the document embedding, $f_{jt} = \text{BiGRU}_e([d, w_{jt}, e_{jt}])$, and a softmax output layer yields the predicted probability $o_{jt}$ of each event type for each word.
  • Loss function: cross-entropy error

$$J(y, o) = -\sum_{j=1}^{L} \sum_{t=1}^{T} \sum_{k=1}^{K} \mathrm{I}(y_{jt} = k) \log o_{jt}^{(k)}$$
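A minimal sketch of this cross-entropy, assuming `o` already holds softmax probabilities (the toy labels and probability values below are made up for illustration):

```python
import math

def detection_loss(y, o):
    """Cross-entropy J(y, o) = -sum_j sum_t sum_k I(y_jt = k) log o_jt[k].

    The indicator sum simply picks out the gold class k = y_jt.
    y: gold event-type id for each word of each sentence
    o: per-word probability distributions from the softmax layer
    """
    total = 0.0
    for y_j, o_j in zip(y, o):
        for y_jt, o_jt in zip(y_j, o_j):
            total -= math.log(o_jt[y_jt])
    return total

# 1 sentence, 2 words, 3 event types (type 0 = "not a trigger"); toy values
y = [[0, 2]]
o = [[[0.8, 0.1, 0.1],
      [0.2, 0.2, 0.6]]]
loss = detection_loss(y, o)
```

The loss here is $-(\log 0.8 + \log 0.6)$: only the probability assigned to the gold type of each word contributes.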

SLIDE 12

Model - Joint Training

Joint loss function:

$$J(\Theta) = \sum_{D} \big( J(y, o) + \mu\, E_w(\alpha^*, \alpha) + \nu\, E_s(\beta^*, \beta) \big)$$

  • $\Theta$ denotes all parameters used in DEEB-RNN
  • $D$ is the training document set
  • $\mu$ and $\nu$ are hyper-parameters for striking a balance among the three losses
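Per document, the joint objective is just a weighted sum of the three losses; a trivial sketch with made-up per-document loss values:

```python
def joint_loss(J_det, E_w, E_s, mu, nu):
    """Joint objective for one document: detection loss plus the two
    supervised-attention losses, weighted by hyper-parameters mu and nu.
    Summing this over the training set gives J(Theta)."""
    return J_det + mu * E_w + nu * E_s

# hypothetical loss values for a single document
total = joint_loss(J_det=1.5, E_w=0.2, E_s=0.1, mu=1.0, nu=1.0)
```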

SLIDE 13

Experiments

ACE 2005 Corpus

  • 33 categories
  • 6 sources
  • 599 documents
  • 5349 labeled events


SLIDE 14

Experiments - Configuration

Parameter settings:

  • GRU_w, GRU_s, GRU_e hidden sizes: 300, 200, 300
  • W_w, W_s dimensions: 600, 400
  • entity type embeddings: 50 (randomly initialized)
  • word embeddings: 300 (Google pre-trained)
  • dropout rate: 0.5
  • training: SGD

Partitions (#documents):

  • Training set: 529
  • Validation set: 30
  • Test set: 40

SLIDE 15

Experiments – Model analysis

Model Variants:

  • DEEB-RNN computes attentions without supervision
  • DEEB-RNN1 uses only the gold word-level attention signal
  • DEEB-RNN2 uses only the gold sentence-level attention signal
  • DEEB-RNN3 employs the gold attention signals at both word and sentence levels

The model with both gold attention signals at word and sentence levels performs best. Models with document embeddings outperform the pure Bi-GRU method.


SLIDE 16

Experiments - Baselines

  • Feature-based methods without document-level information:
    Sentence-level (2011), Joint Local (2013)
  • Representation-based methods without document-level information:
    JRNN (2016), Skip-CNN (2016), ANN-S2 (2017)
  • Feature-based methods using document-level information:
    Cross-event (2010), PSL (2016)
  • Representation-based methods using document-level information:
    DLRNN (2017)


SLIDE 17

Experiments – Main Results

[Results table: traditional event detection models (feature-based and representation-based, without and with document-level information) vs. the DEEB models]

Our models consistently outperform the existing state-of-the-art methods in terms of both recall and F1-measure.
SLIDE 18

Summary

Conclusions

  • We proposed a hierarchical and supervised attention based, document embedding enhanced Bi-RNN method.
  • We explored different strategies to construct gold word- and sentence-level attentions to focus on event information.
  • We also showed that this method achieves the best performance in terms of both recall and F1-measure.

Future work

  • Automatically determine the weights of sentence and document embeddings.
  • Apply the architecture to other text-related tasks.

SLIDE 19

Thank you for your attention!

Q&A

Name: Yue Zhao
Email: zhaoyue@software.ict.ac.cn