C. Bayan Bruss
Anish Khazane
A Massively Scalable Architecture for Learning Representations from Heterogeneous Graphs
NVIDIA GPU Technology Conference 2019 - San Jose, CA
TODAY'S TALK
1. Overview & Background
2. Our Approach: How to handle heterogeneity in training large graph embedding models
Who we are
Bayan Bruss Anish Khazane
A quick background on graph embeddings & some of the issues related to scaling them
People can be disproportionately attracted to content that is sensational or provocative.
Machine learning systems that learn how to serve content are prone to amplifying these types of content.
Some common problems and solutions
1. If this is a problem with content (spam, violent, racist, homophobic, etc.)
2. If this is a problem with users (fake accounts, malicious actors)
Basic mechanics of a neural network recommender
(Diagram, built up across several slides: a User node and an Article node. An observed "Clicks On" edge from User to Article provides the training signal; the trained model produces new "Recommended to" edges from Article to User.)
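A minimal sketch of this recommender idea, assuming embedding tables for users and articles and a dot-product score (all names, sizes, and the sigmoid link are illustrative assumptions, not the talk's exact model):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
user_emb = rng.normal(scale=0.1, size=(1000, D))     # one row per user
article_emb = rng.normal(scale=0.1, size=(5000, D))  # one row per article

def click_probability(user_id, article_id):
    """Score a (user, article) pair; training pushes observed clicks toward 1."""
    score = user_emb[user_id] @ article_emb[article_id]
    return 1.0 / (1.0 + np.exp(-score))

# Recommend by ranking all articles for one user by predicted click score.
scores = article_emb @ user_emb[7]
top_articles = np.argsort(-scores)[:10]
print(click_probability(7, int(top_articles[0])))
```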
How can we add more fidelity to these models?
1. Treat heterogeneous graphs as containing distinct element types
2. Model interactions differently depending on what type of entity is involved
A brief history of graph embeddings
Most common objective: embed each node so that the embedding captures topological features about the neighborhood of that node
Early efforts focused on explicit matrix factorization
Meanwhile over in the language modeling world
Word2Vec blows things open
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. "A neural probabilistic language model." Journal of Machine Learning Research, 2003.
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems, 2013.
Quickly ported to graph embeddings
Walks on a graph can be likened to sentences in a document.
(Diagram, animated across several slides: a small graph with nodes A-F; random walks are traced edge by edge, yielding the "sentences" below.)
["D", "B", "A", "F"] ["F", "C", "F", "E"]
Walks on graphs can be treated as sentences
[“D”, “B”, “A”, “F”] [“F”, “C”, “F”, “E”]
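A minimal sketch of generating such walk "sentences"; the A-F edge list below is an illustrative assumption, since the slide diagram does not survive extraction:

```python
import random

def random_walks(adj, walk_length=4, walks_per_node=2, seed=0):
    """Generate uniform random walks; each walk reads like a sentence of node names."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adj[walk[-1]]))  # hop to a random neighbor
            walks.append(walk)
    return walks

# Toy graph with nodes A-F (edges assumed for illustration).
adj = {"A": ["B", "F"], "B": ["A", "D"], "C": ["F"], "D": ["B"],
       "E": ["F"], "F": ["A", "C", "E"]}
print(random_walks(adj)[:2])
```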
Graphs are different from language
Graphs can be heterogeneous
(Diagram: a heterogeneous graph of NBA teams (Heat, Cavs, Lakers, Thunder, Rockets, Warriors) and players (LeBron James, Steph Curry, Kevin Durant, James Harden, JaVale McGee, Dion Waiters) linked to their teams.)
Homogeneous graphs are difficult
Dimensionality: millions or even billions of nodes
Sparsity: each node only interacts with a small subset of other nodes
Quickly hit limits on all resources
1) An embedding space is an N × D matrix where each row corresponds to a node
2) D is typically 100-200 (an arbitrary hyperparameter)
3) A 500M-node graph would take 200-400 GB
4) Cannot be held in GPU memory
5) Quickly exceeds the limits of a single worker
6) Lots of small vector multiplications, ideal for GPUs
7) Because of the graph's connectedness, sharding the matrix is challenging
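A back-of-the-envelope check of point 3, assuming float32 storage:

```python
# 500M nodes x D float32 dimensions (4 bytes each), in gigabytes.
n_nodes = 500_000_000
for d in (100, 200):
    print(f"D={d}: {n_nodes * d * 4 / 1e9:.0f} GB")
# D=100: 200 GB
# D=200: 400 GB
```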
Heterogeneous graphs are even harder
Have to keep K separate embedding spaces, one per node type, each with up to N nodes
Have to have an architecture that routes each node to the right embedding space
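A minimal sketch of the routing requirement, assuming a separate table per node type (table names and sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128
# One embedding table per node type; sizes are toy assumptions.
tables = {"user": rng.normal(size=(1_000, D)).astype(np.float32),
          "article": rng.normal(size=(5_000, D)).astype(np.float32)}

def lookup(node_type, node_index):
    """Route a (type, index) pair to the embedding space for that type."""
    return tables[node_type][node_index]

vec = lookup("user", 42)  # a 128-dim vector from the "user" space
```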
It's also hard from an algorithmic perspective
We're working on this too, but it is not the focus of today's talk; see the interesting articles on this topic.
Applied Research: An architecture for handling heterogeneity at scale
Quick Primer on Negative Sampling
Original SkipGram model: need to compute a softmax over the entire vocabulary for each input. VERY EXPENSIVE!
Softmax can be approximated by binary classification task
Original SkipGram model becomes a binary discriminator:
w(t-2) vs. negative samples
w(t-1) vs. negative samples
w(t+1) vs. negative samples
w(t+2) vs. negative samples
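The resulting objective, sketched in NumPy (vector dimensions are illustrative; this is the standard negative-sampling loss, not code from the talk):

```python
import numpy as np

def neg_sampling_loss(center, context, negatives):
    """Binary classification: push the true (center, context) pair toward 1
    and each (center, negative) pair toward 0."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -np.log(sigmoid(center @ context))
    neg = -np.log(sigmoid(-(negatives @ center))).sum()
    return pos + neg

rng = np.random.default_rng(0)
D = 8
print(neg_sampling_loss(rng.normal(size=D), rng.normal(size=D),
                        rng.normal(size=(5, D))))
```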
Use non-edges to generate negative samples
(Diagram: the A-F graph again. The walks ["D", "B", "A", "F"] and ["F", "C", "F", "E"] supply the context for B; nodes drawn from non-edges of B, e.g. ["F", "C", "F"], supply the negatives for B.)
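A sketch of drawing negatives from non-edges; uniform sampling over non-neighbors is an assumption, since the slides do not specify the sampling distribution:

```python
import random

def sample_negatives(adj, node, k, rng=random.Random(0)):
    """Draw k negatives for `node` from nodes it shares no edge with."""
    candidates = [v for v in adj if v != node and v not in adj[node]]
    return [rng.choice(candidates) for _ in range(k)]

# Same toy A-F graph; edges are illustrative assumptions.
adj = {"A": {"B", "F"}, "B": {"A", "D"}, "C": {"F"}, "D": {"B"},
       "E": {"F"}, "F": {"A", "C", "E"}}
print(sample_negatives(adj, "B", 3))  # e.g. ['F', 'C', 'E']
```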
Walking on a heterogeneous graph
(Diagram: the NBA teams-and-players graph again, with walks traced across both node types.)
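A sketch of such a walk, assuming each node carries a type tag so later stages can route it to the right embedding space (the mini graph below is an illustrative assumption):

```python
import random

# Node -> type, plus an adjacency list (edges assumed for illustration).
node_type = {"Warriors": "team", "Cavs": "team",
             "Steph Curry": "player", "Kevin Durant": "player",
             "LeBron James": "player"}
adj = {"Warriors": ["Steph Curry", "Kevin Durant"],
       "Cavs": ["LeBron James"],
       "Steph Curry": ["Warriors"], "Kevin Durant": ["Warriors"],
       "LeBron James": ["Cavs"]}

def typed_walk(start, length, rng=random.Random(0)):
    """Uniform random walk that keeps each node's type alongside its name."""
    walk = [(start, node_type[start])]
    for _ in range(length - 1):
        nxt = rng.choice(adj[walk[-1][0]])
        walk.append((nxt, node_type[nxt]))
    return walk

print(typed_walk("Warriors", 4))
# e.g. [('Warriors', 'team'), ('Steph Curry', 'player'), ...]
```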
How to distribute (parallelize) training
1. Split the training set across a number of workers that execute in parallel, asynchronously and unaware of each other's existence.
2. Create some form of centralized parameter repository that allows learning to be shared across all the workers.
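A minimal sketch of this two-part setup in the TensorFlow 1.x parameter-server style of the talk's era (addresses, job names, and sizes are placeholder assumptions):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Placeholder cluster: one parameter server, two asynchronous workers.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the parameter server; ops run on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    embeddings = tf.get_variable("embeddings", shape=[1000, 128])
```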
Parameter server partitioning
The parameter server holds the embedding vectors corresponding to each node in the graph: an N × M matrix, where N is the number of nodes in the graph and M is a hyperparameter that denotes the number of embedding dimensions.
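One way to express this partitioning with the TensorFlow 1.x API (shard count and matrix sizes are illustrative assumptions, not the talk's configuration):

```python
import tensorflow as tf  # TensorFlow 1.x API

N, M = 1_000_000, 128  # nodes x embedding dimensions (illustrative sizes)
# Shard the N x M embedding matrix row-wise across parameter-server tasks.
embeddings = tf.get_variable(
    "node_embeddings", shape=[N, M],
    partitioner=tf.fixed_size_partitioner(num_shards=8))

ids = tf.placeholder(tf.int64, shape=[None])
vectors = tf.nn.embedding_lookup(embeddings, ids)  # lookup spans the shards
```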
Variable TensorFlow Computational Graphs
Capital One Heterogeneous Data
Node Type A: 18,856,021
Node Type B: 32,107,404
Total Nodes: 50,963,425
Edges: 280,422,628
Train time: 3 days on 28 workers
Friendster Graph
Publicly available dataset
68,349,466 vertices (users); 2,586,147,869 edges (friendships)
Sampled 80 positive and 5 × 80 negative edges per node as training data
The data was shuffled, split into chunks, and distributed across workers
Friendster Graph
(Plots: experimental results on the Friendster graph.)
Implications
Scalability:
Convergence:
Limitations and Future Directions
Limitations: the way each training batch is constructed could be optimized
Future Directions: graph-aware partitioning, where each worker gets a component of the graph and only has to go to the parameter server for a small subset of nodes in other components