SLIDE 1
A Massively Scalable Architecture for Learning Representations from Heterogeneous Graphs

NVIDIA GPU Technology Conference 2019 - San Jose, CA

C. Bayan Bruss
Anish Khazane

SLIDE 2

TODAY’S TALK

How to handle heterogeneity in training large graph embedding models

  • 1. Overview & Background
  • 2. Our Approach
  • 3. Results

SLIDE 3

Who we are

Bayan Bruss, Anish Khazane

SLIDE 4

SECTION ONE: OVERVIEW

A quick background on graph embeddings & some of the issues related to scaling them

SLIDE 5

People can be disproportionately attracted to content that is sensational or provocative.

SLIDE 6

Machine learning systems that learn how to serve content are prone to optimizing towards these types of content.

SLIDE 7

Some common problems and solutions

1. If this is a problem with content (spam, violent, racist, homophobic, etc.)
   > Flag & demote content that is deemed objectionable
2. If this is a problem with users (fake accounts, malicious actors)
   > Eliminate fraudulent accounts
SLIDE 8

What’s missing?

SLIDE 9–15

Basic mechanics of a neural network recommender

(Figure built up across slides 9–15: a User node and an Article node, connected by a "Clicks On" edge from the user to the article and a "Recommended to" edge back to the user.)
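A minimal sketch of this idea in NumPy, under stated assumptions: each user and each article gets a learned embedding vector, and the model scores a (user, article) pair by a dot product. All names and sizes here are illustrative, not the talk's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_articles, dim = 1000, 500, 64

# one embedding vector per user and per article, learned from click events
user_emb = rng.normal(scale=0.1, size=(n_users, dim))
article_emb = rng.normal(scale=0.1, size=(n_articles, dim))

def score(user_id, article_id):
    """Affinity of a user for an article: dot product of their embeddings."""
    return user_emb[user_id] @ article_emb[article_id]

def recommend(user_id, k=5):
    """Rank all articles for one user and return the top-k."""
    scores = article_emb @ user_emb[user_id]
    return np.argsort(-scores)[:k]
```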

SLIDE 16

How can we add more fidelity to these models?

1. Treat heterogeneous graphs as containing distinct element types
2. Model interactions depending on what type of entity is involved

SLIDE 17

A brief history of graph embeddings

Most Common Objective:

  • Learn a continuous vector for each node in a graph that preserves some local or global topological features about the neighborhood of that node

Early Efforts Focused on Explicit Matrix Factorization:

  • Not very scalable
  • Highly tuned to specific topological attributes
SLIDE 18

Meanwhile, over in the language modeling world

Word2Vec blows things open

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, 2003.
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems, 2013.

SLIDE 19–26

Quickly ported to graph embeddings

Walks on a graph can be likened to sentences in a document

(Figure built up across slides 19–26: a graph with nodes A–F, with random walks traced over the edges and emitted as the sequences below.)

[“D”, “B”, “A”, “F”] [“F”, “C”, “F”, “E”]
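A minimal sketch of how such walk "sentences" can be generated, DeepWalk-style, with uniform random walks. The toy edge list below is an assumption chosen to match the slide's node labels A–F:

```python
import random

# toy graph matching the slide's nodes A–F (edges assumed for illustration)
graph = {
    "A": ["B", "F"], "B": ["A", "D"], "C": ["F"],
    "D": ["B"], "E": ["F"], "F": ["A", "C", "E"],
}

def random_walk(graph, start, length):
    """Uniform random walk, as in DeepWalk; each walk becomes a 'sentence'."""
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# build a corpus of walk-sentences, 10 walks of length 4 per start node
corpus = [random_walk(graph, n, 4) for n in graph for _ in range(10)]
```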

SLIDE 27

Walks on graphs can be treated as sentences

[“D”, “B”, “A”, “F”] [“F”, “C”, “F”, “E”]

SLIDE 28–30

Graphs are different from language

SLIDE 31

Graphs can be heterogeneous

(Example heterogeneous graph: team nodes (Heat, Cavs, Lakers, Thunder, Rockets, Warriors) connected to player nodes (LeBron James, Steph Curry, Kevin Durant, James Harden, JaVale McGee, Dion Waiters).)

SLIDE 32

All this makes scale an even bigger challenge

SLIDE 33

Homogeneous graphs are difficult

Dimensionality: millions or even billions of nodes
Sparsity: each node only interacts with a small subset of other nodes

SLIDE 34

Quickly hit limits on all resources

1) An embedding space is an N × D matrix where each row corresponds to a node.
2) D is typically 100–200 (an arbitrary hyperparameter).
3) A 500M-node graph would be 200–400 GB (500M nodes × 100–200 dimensions × 4 bytes per float32).
4) Cannot hold it in GPU memory.
5) Quickly exceeds the limits of a single worker.
6) Lots of little vector multiplications, ideal for GPUs.
7) Sharding: because of the graph's connectedness, sharding the matrix is challenging.

SLIDE 35

Heterogeneous graphs are even harder

  • Have to keep K possible embedding spaces, with N nodes each
  • Have to have an architecture that routes each lookup to the right embedding space (a sketch follows below)
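A minimal sketch of the routing idea, assuming one embedding space per node type and a (type, id) pair identifying every node; all names and sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# K embedding spaces, one per node type, each with its own node count
embedding_spaces = {
    "type_a": rng.normal(scale=0.1, size=(1_000_000, dim)),
    "type_b": rng.normal(scale=0.1, size=(250_000, dim)),
}

def lookup(node_type, node_id):
    """Route the lookup to the embedding space for this node's type."""
    return embedding_spaces[node_type][node_id]
```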

SLIDE 36

It’s also hard from an algorithmic perspective

We’re working on this too, but it is not the focus of today’s talk. See these interesting papers:

  • Metapath2Vec: Scalable Representation Learning for Heterogeneous Networks
  • CARL: Content-Aware Representation Learning for Heterogeneous Graphs

SLIDE 37

SECTION TWO: OUR APPROACH

Applied Research: An architecture for handling heterogeneity at scale

SLIDE 38–39

Quick Primer on Negative Sampling

Original SkipGram model: need to compute a softmax over the entire vocabulary for each input. VERY EXPENSIVE!

SLIDE 40

Softmax can be approximated by a binary classification task

Original SkipGram Model → Binary Discriminator

  • w(t-2) vs negative samples
  • w(t-1) vs negative samples
  • w(t+1) vs negative samples
  • w(t+2) vs negative samples
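A minimal NumPy sketch of this binary-classification objective: the standard skip-gram-with-negative-sampling (SGNS) loss, assuming the center, context, and negative vectors have already been looked up. This is the textbook formulation, not the talk's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Binary discriminator: push the true (center, context) pair toward
    label 1 and each (center, negative) pair toward label 0, instead of
    computing a softmax over the whole vocabulary."""
    pos = -np.log(sigmoid(context @ center))              # true pair
    neg = -np.log(sigmoid(-(negatives @ center))).sum()   # sampled negatives
    return pos + neg

# example: 64-dim vectors, 5 negative samples
rng = np.random.default_rng(0)
loss = sgns_loss(rng.normal(size=64), rng.normal(size=64),
                 rng.normal(size=(5, 64)))
```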

SLIDE 41

Use non-edges to generate negative samples

(Figure: the graph with nodes A–F and the sampled walks ["D", "B", "A", "F"], ["F", "C", "F", "E"], ["F", "C", "F"]. The nodes that co-occur with B in walks form its context; negatives for B are drawn from its non-edges.)
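A minimal sketch of drawing negatives from non-edges, reusing the toy graph from the earlier walk example; uniform sampling over non-neighbors is an illustrative simplification:

```python
import random

graph = {
    "A": ["B", "F"], "B": ["A", "D"], "C": ["F"],
    "D": ["B"], "E": ["F"], "F": ["A", "C", "E"],
}

def sample_negatives(graph, node, k):
    """Negatives for `node` come from its non-edges: nodes it shares
    no edge with (and is not itself)."""
    candidates = [n for n in graph if n != node and n not in graph[node]]
    return random.sample(candidates, k)

# e.g. negatives for "B" are drawn from {"C", "E", "F"}
print(sample_negatives(graph, "B", 2))
```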

SLIDE 42

Walking on a heterogeneous graph

(The same heterogeneous graph of team and player nodes from Slide 31.)

SLIDE 43

How to distribute (parallelize) training

1. Split the training set across a number of workers that execute in parallel, asynchronously and unaware of each other's existence.
2. Create some form of centralized parameter repository that allows learning to be shared across all the workers (a conceptual sketch follows this list).
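A conceptual sketch of these two ingredients using Python threads and a shared in-process parameter store; the real system uses a distributed parameter server, so everything below (names, the lock, the fake gradients) is an illustrative stand-in:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
params = {"emb": rng.normal(scale=0.1, size=(10_000, 64))}  # central repository
lock = threading.Lock()

def worker(shard):
    """Each worker trains on its own shard, unaware of the others,
    pushing updates to the shared parameter repository."""
    for node_id, grad in shard:
        with lock:  # a real parameter server shards this; a lock suffices here
            params["emb"][node_id] -= 0.01 * grad

# fake training shards: (node_id, gradient) pairs, one shard per worker
shards = [[(i, np.ones(64)) for i in range(w, 10_000, 4)] for w in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
```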

SLIDE 44

Parameter server partitioning

  • A parameter server can hold the embeddings table, which contains the vectors corresponding to each node in the graph.
  • The embeddings table is an N x M table, where N is the number of nodes in the graph and M is a hyperparameter that denotes the number of embedding dimensions.

SLIDE 45

Variable TensorFlow Computational Graphs
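A minimal TF 1.x-style sketch of what this could look like: one variable (embedding table) per node type on the parameter server, with the lookup op for each batch routed to the right table. Table names and sizes are illustrative assumptions, not the talk's actual code:

```python
import tensorflow as tf  # TF 1.x-style API, contemporary with this 2019 talk

dim = 128
node_type_sizes = {"type_a": 1_000_000, "type_b": 2_000_000}  # illustrative

# one embedding table per node type, living on the parameter server
tables = {
    t: tf.get_variable("emb_" + t, shape=[n, dim])
    for t, n in node_type_sizes.items()
}

def lookup(node_type, ids):
    """Build the lookup op for one batch, routed to the table for its type."""
    return tf.nn.embedding_lookup(tables[node_type], ids)
```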

SLIDE 46

SLIDE 47

SECTION THREE: RESULTS

SLIDE 48

Capital One Heterogeneous Data

Node Type A: 18,856,021
Node Type B: 32,107,404
Total Nodes: 50,963,425
Edges: 280,422,628

Train Time: 3 days on 28 workers

SLIDE 49

Friendster Graph

  • Publicly available dataset
  • 68,349,466 vertices (users), 2,586,147,869 edges (friendships)
  • Sampled 80 positive and 5 × 80 negative edges per node as training data
  • The data was shuffled, split into chunks, and distributed across workers (a sketch of this pipeline follows)
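A minimal sketch of that sampling-and-chunking pipeline; the function names and the rejection-sampling scheme for negatives are illustrative assumptions, not the talk's actual pipeline:

```python
import random

def build_training_data(graph, pos_per_node=80, neg_ratio=5):
    """Per node: sample positive edges from its neighbors and
    neg_ratio times as many negatives from non-neighbors."""
    examples = []
    nodes = list(graph)
    for u in nodes:
        pos = random.choices(graph[u], k=pos_per_node)  # with replacement
        examples += [(u, v, 1) for v in pos]
        for _ in range(neg_ratio * pos_per_node):
            v = random.choice(nodes)
            while v == u or v in graph[u]:  # reject self and true edges
                v = random.choice(nodes)
            examples.append((u, v, 0))
    random.shuffle(examples)
    return examples

def chunks(examples, n_workers):
    """Split the shuffled examples into one chunk per worker."""
    k = (len(examples) + n_workers - 1) // n_workers
    return [examples[i:i + k] for i in range(0, len(examples), k)]
```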

SLIDE 50–51

Friendster Graph
SLIDE 52

Implications

Scalability:

  • More nodes per entity type
  • More entity types

Convergence:

  • Faster as the number of workers increases
SLIDE 53

Limitations and Future Directions

Limitations

  • Python performance
  • Not partitioning the embedding space
  • Recomputing the computational graph for each batch could be optimized

Future Directions

  • Evaluate a C++ variant of the architecture
  • Intelligent partitioning of the graph, so that each worker gets a component of the graph and only has to go to the server for a small subset of nodes in other components

SLIDE 54

THANK YOU