SLIDE 1

Entropy minimization in emergent languages

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni

SLIDE 2

Setup: signalling game (Lewis, 1969)

  • Two deterministic neural agents, Sender and Receiver, solving a task collaboratively
  • Each has its own individual input
  • Sender sends a discrete (one- or multi-symbol) message to Receiver
  • Based on its own input and the message, Receiver performs an action

[Diagram: Sender receives Sender’s input and emits a message; Receiver combines the message with Receiver’s input to produce Receiver’s output]
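For concreteness, a minimal sketch of such a pair of agents, assuming a single-symbol message and a PyTorch implementation (the sizes, names, and vocabulary are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 16  # assumed size of the discrete vocabulary

class Sender(nn.Module):
    """Maps Sender's input to logits over discrete symbols (the message)."""
    def __init__(self, n_inputs, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_inputs, hidden), nn.ReLU(),
                                 nn.Linear(hidden, VOCAB_SIZE))

    def forward(self, sender_input):
        # Training turns these logits into a discrete symbol, e.g. by sampling
        # (REINFORCE) or via a Gumbel-Softmax relaxation (see later slides).
        return self.net(sender_input)

class Receiver(nn.Module):
    """Combines the received symbol with Receiver's own input to pick an action."""
    def __init__(self, n_inputs, n_actions, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, hidden)
        self.net = nn.Sequential(nn.Linear(hidden + n_inputs, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, symbol, receiver_input):
        joint = torch.cat([self.embed(symbol), receiver_input], dim=-1)
        return self.net(joint)  # logits over Receiver's possible actions
```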

SLIDE 3

Setup: signalling game (Lewis, 1969)

  • The goal is for Receiver to perform some task
  • Both agents get the same reward, which depends on Receiver’s action
  • No supervision on the emergent protocol

Motivated by:

  • developing agents that are able to communicate with humans (Mikolov et al., 2016)
  • better understanding natural language itself (Hurford, 2014)


SLIDE 4

Setup

Suppose Receiver has only a part of the information required to perform a task, while Sender has all available information. There are two opposite scenarios of successful communication:

  • Sender tries to transmit all the information in its message
    • a “complex” protocol, encoding a lot of information
  • Sender only sends what Receiver lacks
    • a “simple” protocol, encoding the required minimum

We measure the complexity of the protocol by its entropy.
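As a concrete illustration (a sketch, not the authors' measurement code), the complexity can be estimated as the plug-in entropy of the empirical distribution of observed messages:

```python
import math
from collections import Counter

def protocol_entropy(messages):
    """Plug-in estimate of the message entropy H(m), in bits."""
    counts = Counter(messages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A protocol that effectively uses only two messages carries at most 1 bit:
print(protocol_entropy(["blop", "blip", "blop", "blop"]))  # ~0.81 bits
```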


SLIDE 5

Data processing inequalities (discrete inputs)

  • Processing its input, Sender cannot increase entropy
  • Conditioning does not increase entropy
  • Again, applying a function does not increase entropy
  • When the task is solved, outputs o are (almost) equal to the ground-truth l

Hence the entropy of the messages is bounded between the amount of information that Receiver needs to solve the task and the entropy of Sender’s inputs.
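A compact reconstruction of this chain of inequalities, with assumed notation: $i_s$ and $i_r$ are Sender's and Receiver's inputs, $m$ the message, $o$ Receiver's output, and $l$ the ground truth:

```latex
\begin{align}
  m = \mathrm{Sender}(i_s)            &\;\Rightarrow\; H(m) \le H(i_s) \\
  o = \mathrm{Receiver}(m, i_r)       &\;\Rightarrow\; H(o \mid i_r) \le H(m \mid i_r) \le H(m) \\
  o \approx l \ \text{(task solved)}  &\;\Rightarrow\; H(l \mid i_r) \lesssim H(m) \le H(i_s)
\end{align}
```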


SLIDE 6

Q: How complex will the communication protocol be?


SLIDE 7

Why is this question interesting?

Efficiency pressures are frequently observed in language and other biological communication systems (Ferrer i Cancho et al., 2013; Gibson et al., 2019)

  • Color naming: for a given accuracy, lexicon complexity is minimized (Zaslavsky et al., 2018, 2019)


SLIDE 8

Why is this question interesting?

Would something similar happen when two agents are communicating with each other?

  • Can it be a general property of discrete communication systems?
  • Can it have some beneficial properties?


SLIDE 9

Methodology

  • We build two games that allow us to vary the amount of information Receiver needs to perform a task
  • We achieve that in two ways:
    • by controlling the amount of information Receiver has as its own input
    • by controlling the complexity of the task itself, via changing the entropy of the ground-truth outputs


SLIDE 10

Game 1: Guess Number

  • Sender gets an 8-dim binary vector as input
    • its components are i.i.d. Bernoulli variables
  • Receiver gets the same vector, but only k (0 … 8) dimensions are not masked
  • The goal is to recover the original vector

[Figure: example with Sender’s input 1 0 0 1 1 0 1 1, message “BLOP”, and 5 dimensions masked in Receiver’s copy; Receiver outputs the full vector]
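A sketch of how such inputs could be generated (the function name, masking convention, and defaults are assumptions for illustration, not the authors' code):

```python
import torch

def guess_number_batch(batch_size=32, n_bits=8, k_visible=3):
    """Sender sees all n_bits; Receiver sees only k_visible of them."""
    # i.i.d. Bernoulli(0.5) components
    full = torch.randint(0, 2, (batch_size, n_bits)).float()
    mask = torch.zeros(batch_size, n_bits)
    mask[:, :k_visible] = 1.0    # assume the first k dimensions stay visible
    receiver_view = full * mask  # masked dimensions zeroed out
    # (A real setup might also pass the mask, so that a masked 0 is
    #  distinguishable from a visible 0.)
    return full, receiver_view   # the target is to recover `full`
```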

SLIDE 11

Game 1: Guess Number

  • Sender gets an 8-dim binary vector as input
    • its components are i.i.d. Bernoulli variables
  • Receiver gets the same vector, but only k (0 … 8) dimensions are not masked
  • The goal is to recover the original vector

[Figure: example with Sender’s input 1 0 0 1 1 0 1 1, message “BLIP”, and 1 dimension masked in Receiver’s copy]

SLIDE 12

Game 2: Image Classification

  • Sender gets two concatenated MNIST images, representing a two-digit number (00 … 99), uniformly sampled from the MNIST train data
  • Numbers are split into 2, 4, 10, 20, 25, 50, or 100 equally sized classes
  • Receiver has no side input
  • The agents’ goal is for Receiver to output the class

[Figure: example with message “BEEP”, 100 classes, target class 96]
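A sketch of how such inputs could be constructed, assuming torchvision's MNIST dataset (the helper name and class-binning rule are illustrative assumptions, not the authors' code):

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

def two_digit_sample(n_classes=100):
    """Concatenate two random MNIST digits into a 'two-digit number' image."""
    i, j = torch.randint(len(mnist), (2,)).tolist()
    (img1, d1), (img2, d2) = mnist[i], mnist[j]
    image = torch.cat([img1, img2], dim=-1)   # 1 x 28 x 56 tensor: Sender's input
    number = 10 * d1 + d2                     # the represented number, 00 … 99
    label = number * n_classes // 100         # bin into equally sized classes
    return image, label
```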

SLIDE 13

Game 2: Image Classification

  • Sender gets two concatenated MNIST images, representing a two-digit number (00 … 99), uniformly sampled from the MNIST train data
  • Numbers are split into 2, 4, 10, 20, 25, 50, or 100 equally sized classes
  • Receiver has no side input
  • The agents’ goal is for Receiver to output the class

[Figure: example with message “TADA”, 4 classes, target class 0]

SLIDE 14

Methodology

We experiment with:

  • Different agent architectures,
  • Different message lengths & vocabulary sizes,
  • Different approaches for learning with the discrete channel:
    • Gumbel-Softmax relaxation (Maddison et al., 2016; Jang et al., 2016),
    • REINFORCE for training both agents (Williams, 1992),
    • SCG: REINFORCE for Sender + standard backpropagation for Receiver (Stochastic Computation Graph) (Schulman et al., 2015); see the sketch after this list
  • We vary hyperparameters/seeds and select the game instances where the agents are successful in solving the task
    • Game success rate: 20% for REINFORCE, 50-75% for Gumbel-Softmax and SCG
  • We measure the entropy of the discrete protocol
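As an illustration of the SCG option, a hedged sketch (not the authors' training code) in which Sender's non-differentiable symbol choice gets a REINFORCE gradient while Receiver is trained by ordinary backpropagation; baselines and entropy bonuses are omitted, and the Sender/Receiver interfaces follow the earlier sketch:

```python
import torch
import torch.nn.functional as F

def scg_loss(sender, receiver, sender_input, receiver_input, target):
    """One loss evaluation: REINFORCE for Sender, plain backprop for Receiver."""
    logits = sender(sender_input)                     # (batch, vocab) symbol logits
    dist = torch.distributions.Categorical(logits=logits)
    symbol = dist.sample()                            # non-differentiable discrete choice
    action_logits = receiver(symbol, receiver_input)  # differentiable in Receiver's params
    per_example = F.cross_entropy(action_logits, target, reduction="none")
    receiver_loss = per_example.mean()                # trains Receiver by backprop
    # REINFORCE surrogate: reward = negative Receiver loss, treated as a constant
    sender_loss = (dist.log_prob(symbol) * per_example.detach()).mean()
    return receiver_loss + sender_loss
```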


SLIDE 15

Gumbel-Softmax relaxation

  • Approximates discrete messages more closely as the temperature gets lower
  • Allows us to “interpolate” between discrete and continuous communication
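A minimal sketch of such a relaxed channel using PyTorch's gumbel_softmax (the embedding matrix and function name are assumptions):

```python
import torch.nn.functional as F

def relaxed_channel(sender_logits, symbol_embeddings, temperature=1.0, hard=False):
    """Differentiable 'discrete' message: a Gumbel-Softmax sample over symbols.

    Low temperature (or hard=True, straight-through) approaches a one-hot,
    i.e. truly discrete, message; high temperature gives a soft, continuous mixture.
    """
    # (batch, vocab) relaxed one-hot sample, differentiable w.r.t. the logits
    y = F.gumbel_softmax(sender_logits, tau=temperature, hard=hard)
    # Receiver consumes a convex combination of the symbol embeddings
    return y @ symbol_embeddings  # (batch, embed_dim)
```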


SLIDE 16

Results: Guess Number

[Plot: entropy of the messages vs. how much information Receiver needs to perform the task; annotations mark the lower bound on the information required for solving the task, the degenerate case of non-communication, and the upper bound on the entropy: 8 bits]

SLIDE 17

Results: Guess Number


SLIDE 18

Results: Image Classification

[Plot: upper bound on the entropy: 10 bits]

SLIDE 19

Entropy Minimization

  • The agents only develop protocols with higher entropy when this is necessary
  • The entropy approaches the lower bound
  • Does the discrete channel have other desirable properties?
    • Robustness to overfitting


SLIDE 20

Results: Robustness

  • Image Classification (10 classes): shuffle labels for random ½ of the digit images
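One way to implement this corruption (a hypothetical helper; “shuffle” is interpreted here as permuting the labels within a randomly chosen half of the examples):

```python
import torch

def shuffle_half_labels(labels, seed=0):
    """Permute the labels of a random half of the examples; keep the rest intact."""
    g = torch.Generator().manual_seed(seed)
    corrupted = labels.clone()
    half = torch.randperm(len(labels), generator=g)[: len(labels) // 2]
    corrupted[half] = labels[half[torch.randperm(len(half), generator=g)]]
    return corrupted
```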


SLIDE 21

Our findings

The entropy of the protocol consistently approaches the lower bound that still allows the task to be solved

  • In other words, the agents develop the simplest protocol they can get away with, while still solving the task

The level of discreteness of this protocol affects how tight this approximation is. The discrete channel has useful properties:

  • Robustness to overfitting random labels
  • Robustness against adversarial attacks (see the paper)


SLIDE 22

Why is it interesting?

Efficiency pressures arise in artificial discrete communication systems

  • A common cause: the hardness of discrete communication?

Discrete protocols have useful properties

  • Good reasons for agents to communicate in a discrete language
  • Is that why (human) language is discrete in the first place?


SLIDE 23

Why is it interesting?

Agents wouldn’t develop complex languages (protocols) unless that is necessary

  • Echoes earlier findings in the literature (Bouchacourt & Baroni, 2018)
  • If we want agents to develop complex languages, we should make sure that is absolutely required


SLIDE 24

Thank you!