 
Kernel Methods for Cooperative Contextual Bandits
Abhimanyu Dubey and Alex Pentland
Media Lab and Institute for Data Systems and Society (IDSS), Massachusetts Institute of Technology
dubeya@mit.edu
ICML 2020
Outline: Introduction, Motivation, UCB Algorithms, Basic Cooperation, Summary of Contributions, Our Method, Contextual Bandits, Our Parameterization, Algorithm, Regret Guarantees, Conclusion
Motivation
◮ Distributed learning is an increasingly popular paradigm in ML: multiple parties collaborate to train a stronger joint model by sharing data.
◮ An alternative is to let data remain in a distributed setup, and have one ML algorithm (agent) for each data center, i.e., federated learning.
◮ Each agent can communicate with other agents to (securely) share relevant information, e.g., over a network.
◮ The group of all agents therefore collectively cooperate to solve their own learning problems.
Multi-Armed Bandits
Figure: Multi-armed bandit (courtesy lilianweng.github.io).
The Upper Confidence Bound (UCB) Algorithm
◮ “Optimism in the face of uncertainty” strategy, i.e., to be optimistic about an arm when we are uncertain of its utility.
◮ For each of $K$ arms, we compute
$$Q_k(t) = \underbrace{\frac{\sum_{i=1}^{n_k(t-1)} r_k^i}{n_k(t-1)}}_{\text{empirical mean}} + \underbrace{\sqrt{\frac{2\ln(t-1)}{n_k(t-1)}}}_{\text{``uncertainty''}}.$$
◮ Choose arm with largest $Q_k(t)$.
◮ This general family of algorithms has strong guarantees as well.
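The index above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes Bernoulli rewards, and the names `ucb_index` and `run_ucb`, the seed, and the rule of playing every unplayed arm first are all choices made here for the example.

```python
import math
import random

def ucb_index(reward_sum, pulls, t):
    """Q_k(t): empirical mean plus the sqrt(2 ln(t-1) / n_k(t-1)) bonus."""
    if pulls == 0:
        return math.inf  # assumption: unplayed arms are tried first
    return reward_sum / pulls + math.sqrt(2.0 * math.log(t - 1) / pulls)

def run_ucb(means, horizon, seed=0):
    """Play a Bernoulli bandit (assumed reward model) and return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k   # running reward sum per arm
    pulls = [0] * k    # n_k(t-1) per arm
    for t in range(1, horizon + 1):
        # Choose the arm with the largest Q_k(t).
        arm = max(range(k), key=lambda a: ucb_index(sums[a], pulls[a], t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        sums[arm] += reward
        pulls[arm] += 1
    return pulls

if __name__ == "__main__":
    print(run_ucb([0.8, 0.2], horizon=2000))
```

With a large gap between the arms, most pulls should go to the better arm, since the bonus term shrinks as an arm is played more often.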
Basic Cooperation: UCB with Naive Averaging
◮ Basic Idea: Use observations from neighbors naively to construct $Q$-values.
◮ Assume 2 agents ($A$ and $B$):
$$Q_k^A(t) = \underbrace{\frac{\sum_{i=1}^{n_k^A(t-1)} r_{k,A}^i + \sum_{i=1}^{n_k^B(t-1)} r_{k,B}^i}{n_k^A(t-1) + n_k^B(t-1)}}_{\text{mean of both agents' observations}} + \underbrace{\sqrt{\frac{2\ln(t-1)}{n_k^A(t-1) + n_k^B(t-1)}}}_{\text{smaller ``uncertainty''}}.$$
◮ Works well when each agent faces the same bandit problem.
◮ Can trivially be extended to other algorithms (e.g., Thompson Sampling).
Naive combinations aren’t always useful
◮ Consider two agents $A$ and $B$, each solving a 2-armed bandit problem.
◮ For agent $A$, let the arms have mean payouts $(0.8, 0.2)$.
◮ For agent $B$, let the arms have mean payouts $(0.2, 0.8)$.
◮ If each agent naively incorporates the other agent's observations, both will estimate the arm means as $\approx (0.5, 0.5)$. Neither agent can then tell its better arm from its worse one, leading to $O(T)$ regret.
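The arithmetic behind this counterexample can be checked directly. The sketch below is illustrative only: it assumes both agents have pulled each arm equally often, and it models an agent that cannot separate the arms as choosing uniformly at random. The function names are hypothetical.

```python
def pooled_means(means_a, means_b, pulls_a, pulls_b):
    """Expected pooled empirical mean of each arm across agents A and B."""
    return [
        (n_a * mu_a + n_b * mu_b) / (n_a + n_b)
        for mu_a, mu_b, n_a, n_b in zip(means_a, means_b, pulls_a, pulls_b)
    ]

def regret_if_indifferent(means, horizon):
    """Expected regret when arms are chosen uniformly at random (assumption)."""
    best = max(means)
    per_trial = sum(best - mu for mu in means) / len(means)
    return per_trial * horizon

means_a = (0.8, 0.2)
means_b = (0.2, 0.8)
# Assumption: equal pull counts for every arm and agent.
pooled = pooled_means(means_a, means_b, pulls_a=(100, 100), pulls_b=(100, 100))

if __name__ == "__main__":
    print(pooled)                                # both entries close to 0.5
    print(regret_if_indifferent(means_a, 1000))  # grows linearly in the horizon
```

Under these assumptions the per-trial regret is a constant (about 0.3 for agent A), so the cumulative regret scales linearly with the horizon.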
Summary
◮ It is clear that instead of naively combining observations from neighbors, agents must intelligently “weigh” external behavior.
◮ Intuitively, this weighing factor would be a function of how “similar” the agents’ problems are.
◮ For each agent, when rewards are drawn from arbitrary distributions, it is unclear how “similarity” can be measured between the distributions.
◮ Summary. In this work, we propose a framework based on Reproducing Kernel Hilbert Spaces (RKHS) to measure similarity between agent rewards, and several near-optimal algorithms for the cooperative contextual bandit problem using this framework.
The Contextual Bandit Problem
◮ At any trial $t = 1, 2, \ldots$, each agent $v \in V$ is supplied a decision set $D_{v,t}$.
◮ They select an action $x_{v,t} \in D_{v,t}$ and obtain a reward $y_{v,t}$:
$$y_{v,t} = f_v(x_{v,t}) + \varepsilon_{v,t}.$$
◮ The objective of the problem is to minimize the group regret:
$$R_G(T) = \sum_{t=1}^{T} \sum_{v \in V} \left( f_v(x^*_{v,t}) - f_v(x_{v,t}) \right),$$
where $x^*_{v,t} = \arg\max_{x \in D_{v,t}} f_v(x)$.
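To make the group regret concrete, the following sketch evaluates $R_G(T)$ for a toy instance. The reward functions, decision sets, and chosen actions are made up for illustration, and the data layout (one dict per trial, keyed by agent) is an assumption of this example.

```python
def group_regret(reward_fns, decision_sets, actions):
    """R_G(T) = sum_t sum_v [ max_{x in D_{v,t}} f_v(x) - f_v(x_{v,t}) ]."""
    total = 0.0
    for t, actions_t in enumerate(actions):
        for v, x in actions_t.items():
            f_v = reward_fns[v]
            best = max(f_v(c) for c in decision_sets[t][v])  # f_v(x*_{v,t})
            total += best - f_v(x)
    return total

# Hypothetical two-agent instance with opposite preferences.
reward_fns = {"A": lambda x: x, "B": lambda x: 1.0 - x}
candidates = [0.0, 0.5, 1.0]
decision_sets = [{"A": candidates, "B": candidates} for _ in range(2)]
actions = [
    {"A": 1.0, "B": 1.0},  # trial 1: A is optimal, B loses 1.0
    {"A": 0.5, "B": 0.0},  # trial 2: A loses 0.5, B is optimal
]
regret = group_regret(reward_fns, decision_sets, actions)

if __name__ == "__main__":
    print(regret)
```

Note that the regret is computed from the noiseless $f_v$, not from the observed rewards $y_{v,t}$.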
The Cooperative Contextual Bandit
◮ We assume the $|V|$ agents communicate via an undirected, connected graph $G = (V, E)$, where $(i, j) \in E$ if agents $i$ and $j$ can communicate.
◮ Messages from any agent $v$ are available to agent $v'$ after $d(v, v') - 1$ trials of the bandit, where $d$ is the distance between the agents in $G$.
◮ Every trial, every agent sends the following message $m_{v,t}$ to all its neighbors in $G$:
$$m_{v,t} = \langle t, v, x_{v,t}, y_{v,t} \rangle$$
◮ This message is forwarded from agent to agent $\gamma$ times (taking one trial of the bandit problem each between forwards), after which it is dropped.
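The delay rule can be sketched with a breadth-first search over $G$. This is a rough illustration under stated assumptions: a message is taken to become usable $d(v, v') - 1$ trials after it is sent, and `max_hops` is a hypothetical parameter for how far a message travels before being dropped. How `max_hops` relates to $\gamma$ depends on whether the initial send counts as a forward, which the slide leaves open.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Breadth-first search distances from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def arrival_trial(adjacency, sender, receiver, sent_at, max_hops):
    """Trial at which `receiver` can use the message, or None if it never arrives."""
    d = hop_distances(adjacency, sender).get(receiver)
    if d is None or d == 0 or d > max_hops:
        return None  # unreachable, self-message, or dropped before arrival
    return sent_at + d - 1  # assumed reading of the d(v, v') - 1 delay

# Hypothetical line graph 0 - 1 - 2 - 3.
line_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

if __name__ == "__main__":
    for receiver in (1, 2, 3):
        print(receiver, arrival_trial(line_graph, 0, receiver, sent_at=5, max_hops=2))
```

In this example, agent 3 never sees agent 0's message because it lies beyond the assumed hop limit.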
Parametric Network Contexts
◮ We assume that each agent $v$ has an underlying network context, denoted by $z_v$, and the reward function $f_v(\cdot)$ is parameterized by $z_v$, i.e., for some unknown but fixed function $F$,
$$f_v(x) = F(x, z_v) \quad \forall x \in \mathcal{X},\ z_v \in \mathcal{Z}.$$
◮ We denote $\tilde{x} = (x, z_v)$. Furthermore, we assume that $F$ has a bounded norm in some RKHS $\mathcal{H}$ with kernel $\tilde{K}$ and feature $\phi(\cdot)$.
◮ For a given $\phi : (\mathcal{X} \times \mathcal{Z}) \to \mathbb{R}^d$ and unknown (but fixed) vector $\theta \in \mathbb{R}^d$,
$$F(\tilde{x}) = \phi(\tilde{x})^\top \theta.$$
◮ This implies that in some higher-order feature space (kernel space), $F$ is a linear function, and $\phi$ can be thought of as a “feature extractor”.
Kernel Assumption
◮ Now, the kernel function $\tilde{K}(\tilde{x}_1, \tilde{x}_2) = \phi(\tilde{x}_1)^\top \phi(\tilde{x}_2)$. We assume that this kernel is a composition of two separate kernels (where $\tilde{x}_i = (x_i, z_i)$):
$$\tilde{K}(\tilde{x}_1, \tilde{x}_2) = \underbrace{K_z(z_1, z_2)}_{\text{network kernel}} \cdot \underbrace{K_x(x_1, x_2)}_{\text{action kernel}}.$$
◮ $K_z$ provides us with a generic framework to measure similarity between agent functions, and can be learnt online when it is unknown.
◮ $K_x$ can be any PSD kernel, e.g., Gaussian (RBF), Linear, deep neural network features, etc. $K_z$ can be derived from geographical or demographic constraints (e.g., social networks).
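A small sketch of the product kernel follows. The specific choices are assumptions for illustration: a linear action kernel $K_x$, an RBF network kernel $K_z$ over made-up 2-D agent coordinates, and a bandwidth of 1. None of these are prescribed by the slides.

```python
import math

def linear_kernel(x1, x2):
    """Action kernel K_x: plain dot product."""
    return sum(a * b for a, b in zip(x1, x2))

def rbf_kernel(z1, z2, bandwidth=1.0):
    """Network kernel K_z: Gaussian similarity between network contexts."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def composite_kernel(aug1, aug2, k_x=linear_kernel, k_z=rbf_kernel):
    """K~((x1, z1), (x2, z2)) = K_z(z1, z2) * K_x(x1, x2)."""
    (x1, z1), (x2, z2) = aug1, aug2
    return k_z(z1, z2) * k_x(x1, x2)

x = (1.0, 2.0)  # same action for all three comparisons
same_agent = composite_kernel((x, (0.0, 0.0)), (x, (0.0, 0.0)))
near_agent = composite_kernel((x, (0.0, 0.0)), (x, (0.5, 0.0)))
far_agent = composite_kernel((x, (0.0, 0.0)), (x, (5.0, 0.0)))

if __name__ == "__main__":
    print(same_agent, near_agent, far_agent)
```

For a fixed action, the kernel value decays as the network contexts move apart, which is how observations from dissimilar agents end up with less weight.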
Algorithm
◮ We face three challenges in algorithm design:
  ◮ Non-identical reward functions $f_v$.
  ◮ Communication delays between agents.
  ◮ Heterogeneity (agents possess different information at all times).
◮ We modify the Kernel-UCB [Valko13] algorithm as follows:
  ◮ Non-identical rewards: We augment contexts $x$ with network contexts $z$, and use the augmented kernel $\tilde{K}$ to create the UCB.
  ◮ Delays: We use a subsampling technique similar to [Weinberger02], i.e., each agent runs $\gamma$ UCB instances in parallel.
  ◮ Heterogeneity: We partition $G$ carefully in terms of cliques to bound heterogeneity, i.e., each agent only accepts messages from a subset of nodes.
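The first modification can be illustrated with a generic kernelized UCB index built on the augmented kernel. This is a sketch under assumptions, not the algorithm from the talk: it uses a kernel ridge regression mean plus a scaled kernel width, with RBF kernels for both $K_x$ and $K_z$. The regularizer `lam` and exploration weight `beta` are hypothetical parameters, and the subsampling and clique-partition steps are omitted.

```python
import math

def rbf(u, v, bandwidth=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def augmented_kernel(p, q):
    """K~ on augmented contexts (x, z): K_z(z, z') * K_x(x, x'), both RBF here."""
    (x1, z1), (x2, z2) = p, q
    return rbf(z1, z2) * rbf(x1, x2)

def solve(matrix, rhs):
    """Solve matrix @ sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def kernel_ucb_index(query, history, rewards, lam=1.0, beta=1.0):
    """Kernel ridge mean plus beta times a kernel width (lam, beta assumed)."""
    if not history:
        return math.inf  # no data yet: maximally optimistic
    gram = [[augmented_kernel(a, b) + (lam if i == j else 0.0)
             for j, b in enumerate(history)] for i, a in enumerate(history)]
    k_vec = [augmented_kernel(query, a) for a in history]
    mean = sum(k * w for k, w in zip(k_vec, solve(gram, rewards)))
    quad = sum(k * w for k, w in zip(k_vec, solve(gram, k_vec)))
    width = math.sqrt(max(augmented_kernel(query, query) - quad, 0.0))
    return mean + beta * width

if __name__ == "__main__":
    seen = [((0.0,), (0.0,))]  # one observation: action 0.0 at network context 0.0
    print(kernel_ucb_index(((0.0,), (0.0,)), seen, [1.0]))   # similar agent
    print(kernel_ucb_index(((0.0,), (10.0,)), seen, [1.0]))  # dissimilar agent
```

In the usage lines, the same action is scored for two network contexts. For the distant context the observation contributes almost nothing to the mean and the width stays near its prior value, so the agent effectively ignores data from a dissimilar peer.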