SLIDE 1

How Does Selective Mechanism Improve Self-Attention Networks?

Xinwei Geng1, Longyue Wang2, Xing Wang2, Bing Qin1, Ting Liu1, Zhaopeng Tu2

1 Research Center for Social Computing and Information Retrieval, HIT
2 NLP Center, Tencent AI Lab

SLIDE 2

Conventional Self-Attention Networks (SANs)

  • Calculate the attentive output by glimpsing the entire sequence
  • In most cases, only a subset of the input elements is important
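As an illustration, a minimal sketch of this full-sequence attention in the standard scaled dot-product form (PyTorch; illustrative, not the authors' code):

```python
import torch

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position glimpses
    the entire sequence, important or not."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d = 8
x = torch.randn(1, 5, d)                        # a 5-token sequence
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)             # shape (1, 5, 8)
```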
SLIDE 3

Selective Self-Attention Networks (SSANs)

  • A universal and flexible implementation of the selective mechanism
  • Select a subset of input words, on top of which self-attention is conducted
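A sketch of one plausible realization, with the selection applied as a hard mask over the attention scores; the gates here are hypothetical placeholders for the selector output described on the next slide:

```python
import torch

def selective_self_attention(x, Wq, Wk, Wv, gates):
    """Self-attention restricted to the selected subset: positions
    whose gate is ~0 (DISCARD) are masked out of the softmax."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(gates.unsqueeze(1) < 0.5, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

d = 8
x = torch.randn(1, 5, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
gates = torch.tensor([[1.0, 1.0, 0.0, 1.0, 0.0]])  # SELECT words 0, 1, 3
out = selective_self_attention(x, Wq, Wk, Wv, gates)
```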

SLIDE 4

Selector

  • Parameterize the selection action a ∈ {SELECT, DISCARD} with an auxiliary policy network

– SELECT (1) indicates that the element is selected
– DISCARD (0) indicates that the element is abandoned

  • Reinforcement learning is utilized to train the policy network

– employ Gumbel-Sigmoid to approximate the discrete sampling
– G’ and G’’ are Gumbel noises
– τ is the temperature parameter
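A minimal sketch of the Gumbel-Sigmoid relaxation, assuming the standard form sigmoid((x + G’ − G’’) / τ); a sketch of the idea, not the paper's implementation:

```python
import torch

def gumbel_noise(shape, eps=1e-10):
    """Standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)."""
    u = torch.rand(shape)
    return -torch.log(-torch.log(u + eps) + eps)

def gumbel_sigmoid(logits, tau=0.5):
    """Differentiable approximation of sampling a in {SELECT, DISCARD}.
    G' and G'' are independent Gumbel noises; tau is the temperature,
    and a smaller tau pushes the output toward a hard 0/1 choice."""
    g1 = gumbel_noise(logits.shape)   # G'
    g2 = gumbel_noise(logits.shape)   # G''
    return torch.sigmoid((logits + g1 - g2) / tau)

logits = torch.randn(1, 6)       # policy-network outputs for 6 words
gates = gumbel_sigmoid(logits)   # soft SELECT(~1) / DISCARD(~0) gates
```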

SLIDE 5

Experiments

SLIDE 6

Evaluation of Word Order Encoding

  • Employ the bigram order shift detection and word reordering detection tasks to investigate the ability to capture both local and global word order

  • Bigram order shift detection (Conneau et al., 2018)

– two random adjacent words are inverted
– e.g. what are you doing out there? => what you are doing out there?

  • Word reordering detection (Yang et al., 2019)

– a random word is popped and inserted into another position
– e.g. Bush held a talk with Sharon. => Bush a talk held with Sharon.
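Both perturbations are simple to reproduce; a sketch with illustrative function names (not the released probing code):

```python
import random

def bigram_shift(tokens):
    """Invert two random adjacent words (Conneau et al., 2018)."""
    tokens = list(tokens)
    i = random.randrange(len(tokens) - 1)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def word_reorder(tokens):
    """Pop a random word and insert it elsewhere (Yang et al., 2019)."""
    tokens = list(tokens)
    i = random.randrange(len(tokens))
    word = tokens.pop(i)
    j = random.choice([p for p in range(len(tokens) + 1) if p != i])
    tokens.insert(j, word)
    return tokens

print(" ".join(bigram_shift("what are you doing out there ?".split())))
print(" ".join(word_reorder("Bush held a talk with Sharon .".split())))
```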

SLIDE 7

Detection of Local Word Reordering

SLIDE 8

Detection of Global Word Reordering

SLIDE 9

Evaluation of Structural Modeling

  • Leverage the tree depth and top constituent tasks to assess the syntactic information embedded in the encoder representations

  • Tree Depth (Conneau et al., 2018)

– Check whether the examined model can group sentences by the depth of the longest path from the root to any leaf

  • Top Constituent (Conneau et al., 2018)

– Classify the sentence in terms of the sequence of top constituents immediately below the root node
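Both are standard probing setups: a simple classifier is trained on frozen encoder representations, and its accuracy indicates how much syntactic information they encode. A hypothetical sketch, with random placeholder features standing in for real representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))    # frozen encoder sentence representations
y = rng.integers(5, 13, size=1000)  # e.g. tree-depth labels (depths 5-12)

# linear probe: high accuracy => the property is linearly recoverable
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("probe accuracy:", probe.score(X[800:], y[800:]))
```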

SLIDE 10

Structures Embedded in Representations

SSANs are more robust to the depth of the sentences
SSANs significantly improve the prediction F1 score as the complexity of the sentences increases

SLIDE 11

Structures Modeled by Attention

  • Constructing constituency trees from the attention distributions

– the attention distribution within a phrase is stronger than across phrase boundaries (Marecek and Rosa, 2018)
– when splitting a phrase with span (i, j), the target is to look for a position k maximizing the scores of the two resulting phrases
– utilize the Stanford CoreNLP toolkit to annotate English sentences as golden constituency trees
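A sketch of the resulting top-down procedure: score a span by its internal attention mass and recursively split (i, j) at the k maximizing the two sub-span scores; the exact span scoring is an assumption, not the paper's code:

```python
import numpy as np

def span_score(attn, i, j):
    """Average attention weight inside span [i, j] (inclusive)."""
    return attn[i:j + 1, i:j + 1].mean()

def split(attn, i, j, tree):
    """Recursively split span (i, j) at the best position k."""
    if j <= i:
        return
    k = max(range(i, j),
            key=lambda k: span_score(attn, i, k) + span_score(attn, k + 1, j))
    tree.append(((i, k), (k + 1, j)))
    split(attn, i, k, tree)
    split(attn, k + 1, j, tree)

attn = np.random.rand(6, 6)
attn /= attn.sum(axis=-1, keepdims=True)   # row-normalized like softmax
tree = []
split(attn, 0, 5, tree)
print(tree)   # binary constituency brackets over the 6 tokens
```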

SLIDE 12

Conclusion

  • We adopt a universal and flexible implementation of the selective mechanism, demonstrating its effectiveness across three NLP tasks

  • SSANs can identify the improper word orders in both local and global ranges by learning to attend to the expected words

  • SSANs produce more syntactic representations with better modeling of structure by selective attention

SLIDE 13

Thanks & QA