Neural Combinatorial Optimization With Reinforcement Learning
Paper by Bello, I., Pham, H., Le, Q. V., Norouzi, M., & Bengio, S. (2016) Presented by Yan Shi
CS885 Reinforcement Learning
Outline
1. Introduction
2. Background
3. Algorithms
PRESENTATION TITLE PAGE 2
Travelling Salesman Problem
▪ Combinatorial optimization is a fundamental class of problems in computer science.
▪ The Travelling Salesman Problem (TSP) is a typical such problem and is NP-hard: given a graph, one must search the space of permutations for an ordering of the nodes with minimal total edge weight (tour length).
▪ In 2D Euclidean space, nodes are 2D points and edge weights are Euclidean distances between pairs of points.
Target & Solution
▪ This paper uses reinforcement learning and neural networks to tackle combinatorial optimization problems, especially the TSP.
▪ We want to train a recurrent neural network such that, given a set of city coordinates, it predicts a distribution over permutations of the cities.
▪ The recurrent neural network encodes a policy and is optimized by policy gradient, where the reward signal is the negative tour length.
▪ We propose two main approaches: RL Pretraining and Active Search.
▪ The Travelling Salesman Problem is a well-studied combinatorial optimization problem, and many exact or approximate algorithms have been proposed, e.g. Christofides, Concorde, and Google's vehicle routing problem solver.
▪ The real challenge is applying existing search heuristics to newly encountered problems. Researchers have used "hyper-heuristics" to generalize their optimization systems, but some amount of human-crafted heuristic is still needed.
▪ The earliest machine learning approach to the TSP was Hopfield networks (Hopfield & Tank, 1985), but they are sensitive to hyperparameters and parameter initialization.
▪ Later research applied the Elastic Net (Durbin, 1987) and Self-Organizing Maps (Fort, 1988) to the TSP.
▪ Most other work analyzed and modified the above methods, and showed that neural networks were beaten by algorithmic solutions.
▪ With the advent of sequence-to-sequence learning, neural networks are again the subject of study for optimization in various domains.
▪ In particular, the TSP is revisited in the introduction of the Pointer Network (Vinyals et al., 2015b), where a recurrent neural network is trained in a supervised way to predict the sequence of visited cities.
Construction
▪ We focus on the 2D Euclidean TSP. Let the input be a sequence of cities (points) s = {x_i}_{i=1}^n, where each x_i ∈ ℝ².
▪ The target is to find a permutation π of these points, termed a tour, that visits each city once and has minimum length.
▪ Define the length of a tour π as:
L(π | s) = ‖x_{π(n)} − x_{π(1)}‖₂ + Σ_{i=1}^{n−1} ‖x_{π(i+1)} − x_{π(i)}‖₂
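The tour-length definition above can be sketched in plain NumPy (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def tour_length(points, perm):
    """Length of the tour visiting `points` in the order given by `perm`,
    returning to the starting city (Euclidean edge weights)."""
    ordered = points[perm]                          # cities in visiting order
    diffs = np.roll(ordered, -1, axis=0) - ordered  # segment vectors, incl. return leg
    return float(np.linalg.norm(diffs, axis=1).sum())

# Unit-square example: visiting the 4 corners in order traces the perimeter.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
print(tour_length(pts, [0, 1, 2, 3]))  # -> 4.0
```

Note that rolling the ordered points by one position conveniently includes the closing edge x_{π(n)} → x_{π(1)} without a special case.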
Construction
▪ Construct a model-free, policy-based algorithm.
▪ The goal is to learn the parameters of the stochastic policy
p(π | s) = Π_{i=1}^{n} p(π(i) | π(<i), s)
▪ This stochastic policy needs to:
i. Be sequence to sequence
ii. Generalize to different graph sizes
Pointer network
Encoder: reads the input sequence s, one city at a time, and transforms it into a sequence of latent memory states {enc_i}_{i=1}^n, where each enc_i ∈ ℝ^d.
Decoder: uses a pointing mechanism to produce a distribution over the next city to visit in the tour:
u_i = v^T tanh(W_enc · enc_i + W_dec · dec_j)   if i ≠ π(k) for all k < j
u_i = −∞   otherwise
A(enc, dec_j; W_enc, W_dec, v) ≝ softmax(u)
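The mask-then-softmax step of the pointing mechanism can be sketched in NumPy. The weights below are random placeholders standing in for trained parameters, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                          # number of cities, hidden dimension
enc = rng.normal(size=(n, d))        # encoder states enc_1 .. enc_n
dec = rng.normal(size=d)             # current decoder state dec_j
W_enc = rng.normal(size=(d, d))
W_dec = rng.normal(size=(d, d))
v = rng.normal(size=d)

visited = [0, 3]                     # cities already placed in the tour

u = np.tanh(enc @ W_enc.T + dec @ W_dec.T) @ v   # attention logit u_i per city
u[visited] = -np.inf                 # mask cities visited so far
probs = np.exp(u - u.max())          # stable softmax over the remaining cities
probs /= probs.sum()
print(probs)                         # visited cities get exactly zero probability
```

Setting masked logits to −∞ before the softmax is what guarantees each city appears in the tour exactly once.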
Optimization
▪ Target (loss) function:
J(θ | s) = E_{π∼p_θ(·|s)} [L(π | s)]
▪ Policy gradient with a baseline b(s):
∇_θ J(θ | s) = E_{π∼p_θ(·|s)} [(L(π | s) − b(s)) ∇_θ log p_θ(π | s)]
▪ Using a sample of size B to approximate the expectation:
∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} (L(π_i | s_i) − b(s_i)) ∇_θ log p_θ(π_i | s_i)
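The Monte Carlo estimator can be sketched numerically; the model-specific pieces (sampled tour lengths, baselines, and per-tour log-probability gradients) are stubbed out as plain arrays with made-up values:

```python
import numpy as np

def reinforce_gradient(lengths, baselines, logp_grads):
    """(1/B) * sum_i (L(pi_i|s_i) - b(s_i)) * grad log p(pi_i|s_i).
    lengths, baselines: shape (B,); logp_grads: shape (B, n_params)."""
    advantages = lengths - baselines                  # positive -> worse than baseline
    return (advantages[:, None] * logp_grads).mean(axis=0)

# Two sampled tours, a toy 3-parameter policy (numbers are illustrative):
L = np.array([4.0, 6.0])          # observed tour lengths
b = np.array([5.0, 5.0])          # baseline predictions
g = np.array([[1.0, 0.0, 2.0],    # grad log p for tour 1
              [0.0, 1.0, 2.0]])   # grad log p for tour 2
print(reinforce_gradient(L, b, g))  # gradient estimate: [-0.5, 0.5, 0.0]
```

The baseline shifts the advantage without biasing the estimator: tours shorter than b(s) push their log-probability up, longer tours push it down.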
Actor Critic
▪ Let b(s) (the baseline) be the expected tour length E_{π∼p_θ(·|s)} [L(π | s)].
▪ Introduce another network, called the critic and parameterized by θ_v, to encode b_{θ_v}(s).
▪ The critic network is trained along with the policy network, minimizing
ℒ(θ_v) = (1/B) Σ_{i=1}^{B} ‖b_{θ_v}(s_i) − L(π_i | s_i)‖₂²
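The critic's objective is a plain mean squared error between its predicted baselines and the realized tour lengths; a minimal sketch (`b_pred` stands in for the critic's outputs on a batch):

```python
import numpy as np

def critic_loss(b_pred, tour_lengths):
    """Mean squared error between predicted baselines and observed tour lengths."""
    return float(np.mean((b_pred - tour_lengths) ** 2))

# Batch of two instances: predictions of 5.0 vs. observed lengths 4.0 and 6.0.
print(critic_loss(np.array([5.0, 5.0]), np.array([4.0, 6.0])))  # -> 1.0
```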
Critic’s Architecture
I. One LSTM encoder, similar to the pointer network's, which encodes the input sequence.
II. One LSTM process block, which takes the hidden state h as input, processes it P times, then passes it to the decoder.
III. One decoder, which turns the final hidden state into a baseline prediction.
Search Strategy
▪ In Algorithm 1, we used greedy decoding at each step to select cities, but we can also sample different tours and select the shortest one, introducing a temperature hyperparameter T into the pointing mechanism:
A(ref, q, T; W_ref, W_q, v) ≝ softmax(u/T)
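Dividing the logits by T before the softmax flattens the distribution for T > 1 and sharpens it for T < 1; a minimal sketch:

```python
import numpy as np

def softmax_with_temperature(u, T=1.0):
    """Softmax over logits u with temperature T: softmax(u / T)."""
    z = u / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

u = np.array([1.0, 2.0, 3.0])
p_sharp = softmax_with_temperature(u, T=1.0)   # peaked on the largest logit
p_flat = softmax_with_temperature(u, T=10.0)   # close to uniform
print(p_sharp, p_flat)
```

Larger T makes sampled tours more diverse, which is what makes sampling several candidates and keeping the shortest one worthwhile.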
▪ What about a search strategy that is not pre-trained, and instead refines the policy on the test instance itself? This is Active Search.
Active Search
▪ Sample n solutions and select the shortest one
▪ Same policy gradient as before
▪ No critic network; an exponential moving average baseline is used instead
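The exponential moving average baseline can be sketched as follows (the decay rate `alpha` is an assumed value for illustration, not taken from the paper):

```python
class EMABaseline:
    """Exponential moving average of observed tour lengths, used in place of a critic."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.value = None

    def update(self, tour_length):
        if self.value is None:
            self.value = tour_length      # initialize with the first observation
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * tour_length
        return self.value

b = EMABaseline(alpha=0.9)
for L in [10.0, 8.0, 6.0]:
    b.update(L)
print(b.value)   # EMA of the observed lengths, trailing behind the latest value
```

This replaces the learned critic with a cheap running statistic, which suits Active Search since only a single test instance is being optimized.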
▪ We consider three benchmark tasks, Euclidean TSP20, TSP50 and TSP100, for which we generate a test set of 1000 graphs. Points are drawn uniformly at random in the unit square [0, 1]².
▪ Four target algorithms:
i. RL pretraining (Actor Critic) with greedy decoding
ii. RL pretraining (Actor Critic) with sampling
iii. RL pretraining (Actor Critic) with Active Search
iv. Active Search without pretraining
▪ We use 3 algorithmic solutions as baselines:
i. Christofides
ii. the vehicle routing solver from OR-Tools
iii. Optimality
▪ For comparison, we also trained pointer networks with the same architecture by supervised learning (providing the true labels).
Average tour length
Running time
Reinforcement Learning methods
Generalization: Knapsack example
Given a set of n items i = 1, …, n, each with weight w_i and value v_i, and a maximum weight capacity W, the 0-1 Knapsack problem consists of maximizing the sum of the values of the items placed in the knapsack so that the sum of their weights is at most the capacity:
max_{S ⊆ {1,2,…,n}} Σ_{i∈S} v_i   subject to   Σ_{i∈S} w_i ≤ W
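For reference, with integer weights the 0-1 Knapsack admits an exact dynamic program (the standard textbook algorithm, not the neural approach studied here):

```python
def knapsack(weights, values, capacity):
    """Maximum total value with total weight <= capacity (0-1 knapsack, int weights)."""
    best = [0] * (capacity + 1)               # best[c] = best value within capacity c
    for w, v in zip(weights, values):
        for c in range(capacity, w - 1, -1):  # descending so each item is used once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([2, 3, 4], [3, 4, 5], 5))  # -> 7  (take the items of weight 2 and 3)
```

Such exact solvers provide the optimality reference that the learned policy's solutions are measured against.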
Generalization: Knapsack example
▪ This paper constructs Neural Combinatorial Optimization, a framework to tackle combinatorial optimization with reinforcement learning and neural networks.
▪ We focus on the travelling salesman problem (TSP) and present a set of results for each variation of the framework.
▪ The experiments show that Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes.
▪ Reinforcement learning and neural networks are successful tools for solving combinatorial optimization problems if properly constructed.
▪ The above framework works very well when the problems are of sequence-to-sequence type
▪ Try to solve other kinds of combinatorial optimization problems using reinforcement learning
THANK YOU!