

SLIDE 1

15 Nov 2019 · Zehuan Wang (王泽寰)

HUGECTR – GPU-Accelerated Recommender System Training

SLIDE 2

AGENDA

Click-Through Rate Prediction Challenges in CTR Training HugeCTR Introduction

SLIDE 3

CLICK-THROUGH RATE PREDICTION

SLIDE 4

WHAT IS CTR

Wikipedia: “Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total users who view a page, email, or advertisement.” Related fields: Data Mining, Learning to Rank, NLP, CV

SLIDE 5

APPLICATIONS

Search Advertising

Recommendation based on the input query, candidate ads, and user information

SLIDE 6

APPLICATIONS

Recommended Ads

Recommendation based on candidate ads and user information

SLIDE 7

APPLICATIONS

Content Recommendation: UGC (user-generated content)

SLIDE 8

APPLICATIONS

Content Recommendation: PGC (professionally generated content)

SLIDE 9

SEARCH ADVERTISING DISTRIBUTION SYSTEM

Serving path: Search Sys + Ad bank → Ad List → Feature extraction → Ranking Model → Show / Click
Training path: Show Log + Click Log → Label Matching → Preprocessing → Feature extraction → Training
Source: https://www.cnblogs.com/futurehau/p/6181008.html

SLIDE 10

SEARCH ADVERTISING DISTRIBUTION SYSTEM

(Same distribution-system diagram as Slide 9.)

SLIDE 11

TWO-STAGE RANKING

Query → Stage 1: Matching / Recall → Query + Top-k → Stage 2: Ranking → Result

  • Collaborative Filtering: user/item based
  • Topic Model: LSA / LDA …
  • Content Model
  • CTR
  • RDTM
  • PCR
SLIDE 12

CTR INFERENCE WORKFLOW

query → Pull Features (Personas, Item Features) → Feature to key → Get Values (Embedding) → Model Inference

SLIDE 13

CTR TRAINING WORKFLOW

Parameter Server Based

Worker: DataStream → Feature Extraction → Model Training; each worker pulls parameters from, and pushes updates to, the Parameter Server (Embedding + Model)

SLIDE 14

MODEL

Without DNN: Logistic Regression / Factorization Machine. With DNN: Embedding + MLP / Wide & Deep Learning / DeepFM / DCN / DIN / DIEN

SLIDE 15

CHALLENGES IN CTR TRAINING

SLIDE 16

EMBEDDING + MLP

Large embedding table: E_MEM = GBs to TBs
Small FC layers: FC_MEM = #Layers × 100s × 100s (e.g. 5 × 500 × 500 × 4 B = 5 MB)

Standard Network

Input → Embedding → FC + bias → Activation → FC + bias → Activation → FC + bias → Loss
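The FC-layer memory estimate above can be checked with quick arithmetic (a sketch using the slide's example of five 500×500 FP32 layers):

```python
# Back-of-envelope check of the slide's FC-layer memory estimate.
num_layers = 5          # example figure from the slide
width = 500             # 500 x 500 weight matrix per layer
bytes_per_param = 4     # FP32

fc_mem_bytes = num_layers * width * width * bytes_per_param
print(fc_mem_bytes / 1e6, "MB")  # 5.0 MB
```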

SLIDE 17

CTR SOLUTION

100 nodes connected with Ethernet (1.25–1.8 GB/s)
Each forward/backward pass exchanges the whole dense model, ~10 MB per node: 5.6 ms*
Compute time = ~2 ms (BS = 2000)
Overall time = compute + data exchange = 7.6 ms

CPU

* Suppose 1.8GB/s Ethernet and CPU with 6TFlops per node
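The timing above is simple arithmetic on the footnote's assumptions; a sketch:

```python
# Reproduce the slide's CPU-cluster numbers: exchanging a ~10 MB dense
# model over 1.8 GB/s Ethernet, plus ~2 ms of compute per step (BS = 2000).
model_bytes = 10e6          # ~10 MB dense model per node
bandwidth = 1.8e9           # bytes/s, Ethernet figure from the footnote
compute_ms = 2.0            # compute time per step from the slide

exchange_ms = model_bytes / bandwidth * 1e3
total_ms = compute_ms + exchange_ms
print(round(exchange_ms, 1), round(total_ms, 1))  # 5.6 7.6
```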

SLIDE 18

CTR SOLUTION

100 nodes connected with Ethernet (1.25–1.8 GB/s)
Each forward/backward pass exchanges the whole dense model, ~10 MB per node: 5.6 ms
Compute time = ~2 ms (BS = 2000)
Overall time = compute + data exchange = 7.6 ms

CPU

Bottleneck is the Network

SLIDE 19

CTR SOLUTION

Single Node

Within a GPU server: model exchange is >83× faster (0.067 ms)
Compute time: 6 ms (batch size = 2×10^5)
Total time = 6 ms (1.26× faster than 100 CPU nodes)

Single GPU Node

SLIDE 20

CTR SOLUTION

Single Node

Within a GPU server: model exchange is >83× faster (0.067 ms)
Compute time: 6 ms (batch size = 2×10^5)
Total time = 6 ms (1.26× faster than 100 CPU nodes)

Single GPU Node

Bottleneck is Compute

SLIDE 21

CTR SOLUTION

Multi Node

Within a GPU server: model exchange is 27.8× faster than on CPU
Compute time: 6 ms / #Nodes (batch size = 2×10^5 / #Nodes)
Total time = 6 ms / #Nodes + 0.2 ms (near-linear scaling if #Nodes < 10)

Multi GPU Nodes
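The scaling claim above can be sketched as a simple cost model (an illustration of the slide's numbers, not a benchmark):

```python
# Slide's multi-node GPU cost model: compute divides across nodes,
# plus a fixed ~0.2 ms inter-node exchange overhead.
def total_time_ms(num_nodes, compute_ms=6.0, exchange_ms=0.2):
    return compute_ms / num_nodes + exchange_ms

# Near-linear scaling while compute dominates the exchange term.
for n in (1, 2, 4, 8):
    print(n, round(total_time_ms(n), 2))
```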

SLIDE 22

CHALLENGES FOR GPU SOLUTION

Streaming training: dynamic hashtable insertion
Very big hashtable (GBs to TBs)
Large data I/O for data reading
Very shallow networks (3–20 layers)
→ Not a typical DNN training workload that can be handled by current frameworks such as PyTorch or TensorFlow

SLIDE 23

CHALLENGES FOR GPU SOLUTION

Challenges:
  • Streaming training: dynamic hashtable insertion
  • Very big hashtable (GBs to TBs)
  • Large data I/O for data reading
  • Very shallow networks (3–20 layers)
HugeCTR:
  • Flexible GPU hashtable
  • Multi-node training
  • Efficient three-stage pipeline

SLIDE 24

HUGECTR INTRODUCTION

SLIDE 25

WHAT IS HUGECTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training. Key features in 2.0:

  • GPU Hashtable and dynamic insertion
  • Multi-node training and enabling very large embedding
  • Mixed precision training
SLIDE 26

HOW HUGECTR HELPS

1. Prototype: showing performance and possibility on GPUs (v1.0)
2. Reference Design: developers and NVIDIA work together to modify HugeCTR according to specific requirements (v2.0, current stage)
3. Framework: developers can train their models easily on HugeCTR (v3.0)

SLIDE 27

NETWORK SUPPORTED

Multi-slot embedding: Sum / Mean
Layers: Concat / Fully Connected / ReLU / BatchNorm / ELU
Optimizers: Adam / Momentum SGD / Nesterov
Losses: CrossEntropy / BinaryCrossEntropy

* Supporting multiple labels and each label will have a unique weight

Embedding + MLP

SLIDE 28

NETWORK SUPPORTED

Supported reduce ‘+’: sum / mean
Empty hashtable initialization
Dynamic insertion

Sparse Model

Diagram: per-slot key lists (e.g. 5, 48, 90, 21 and 6, 24, 52) are looked up, reduced with ‘+’, and concatenated; {0} is used if a feature has no value in a slot
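The per-slot combine sketched in the diagram can be written out in a few lines (illustrative pure-Python code, not HugeCTR's actual GPU kernels):

```python
# Multi-slot embedding combine: each slot's looked-up vectors are reduced
# ('+': sum or mean), a zero vector stands in for an empty slot, and the
# per-slot results are concatenated into one dense input.

def combine_slots(slot_vectors, dim, mode="sum"):
    out = []
    for vecs in slot_vectors:              # one list of embedding vectors per slot
        if not vecs:                       # {0} if no value in this feature slot
            reduced = [0.0] * dim
        else:
            reduced = [sum(col) for col in zip(*vecs)]
            if mode == "mean":
                reduced = [x / len(vecs) for x in reduced]
        out.append(reduced)
    return [x for vec in out for x in vec]  # concat

slots = [[[1.0] * 4, [2.0] * 4],           # slot with two looked-up vectors
         []]                               # empty slot -> zeros
print(combine_slots(slots, dim=4))  # [3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]
```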

SLIDE 29

PERFORMANCE

NCCL 2.0. Three-stage pipeline:

  • reading from file
  • host-to-device data transfer

(inter / intra nodes)

  • GPU training

Good Scalability

*MLP Layers: 12 / MLP Output: 1024 / Embedding Vector: 64 / Table Number: 1
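The three stages listed above can be overlapped so each stage works on a different batch; a minimal sketch using Python threads and bounded queues (stage contents are placeholders, not HugeCTR's implementation):

```python
# Three-stage pipeline sketch: read -> host-to-device copy -> GPU train,
# connected by small queues so the stages run concurrently.
import queue
import threading

def stage(in_q, out_q, work):
    while True:
        item = in_q.get()
        if item is None:                 # shutdown signal: propagate and stop
            if out_q is not None:
                out_q.put(None)
            break
        result = work(item)
        if out_q is not None:
            out_q.put(result)

read_q, h2d_q = queue.Queue(2), queue.Queue(2)   # small queues bound memory
trained = []

t1 = threading.Thread(target=stage, args=(read_q, h2d_q, lambda b: b + "->h2d"))
t2 = threading.Thread(target=stage, args=(h2d_q, None, lambda b: trained.append(b + "->train")))
t1.start(); t2.start()
for batch in ("batch0", "batch1", "batch2"):     # "reading from file" stage
    read_q.put(batch)
read_q.put(None)
t1.join(); t2.join()
print(trained)  # ['batch0->h2d->train', 'batch1->h2d->train', 'batch2->h2d->train']
```

The bounded queues are what give the overlap its back-pressure: a fast reader cannot run arbitrarily far ahead of the training stage.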

SLIDE 30

PERFORMANCE

44× speedup over TensorFlow on CPU, with the same loss curve

TensorFlow

Embedding Vector: 64/ Layers: 4 / MLP Output: 200 / Table Number: 1

SLIDE 31

PERFORMANCE

PyTorch DLRM

Embedding Vector: 64 / Layers: 4 / MLP Output: 512 / Table number: 64

SLIDE 32

SYSTEM

GPU0 | GPU1 | GPU2 | GPU3
Data parallel (one per GPU): DataReader (CSR) → Network (Dense Model)
Model parallel (spans all GPUs): Embedding (Sparse Model), coordinated by the Session
(The slide shows both a model view and a class view of this layout.)

SLIDE 33

HOW TO USE

Weight initialization: generate a file with initialized weights according to the names in the config file
$ huge_ctr --init config.json
Training:
$ huge_ctr --train config.json
The network, solver, and dataset are all configured in config.json

A Simplified Framework For Ranking or Retrieval

SLIDE 34

HOW TO USE

The configuration file is in JSON format and has four parts: Solver, Optimizer, Data, Network

Config.json
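For illustration, a config.json with those four sections might look roughly like this; the field names below are guesses for the sketch, not HugeCTR's exact schema (see the GitHub repository for real examples):

```json
{
  "solver":    { "batchsize": 512, "max_iter": 10000, "gpu": [0, 1] },
  "optimizer": { "type": "Adam", "learning_rate": 0.001 },
  "data":      { "source": "./file_list.txt", "label_dim": 1 },
  "network":   [ { "type": "Embedding" },
                 { "type": "FullyConnected" },
                 { "type": "ReLU" },
                 { "type": "BinaryCrossEntropy" } ]
}
```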

SLIDE 35

SLIDE 36

HOW TO USE

The dataset consists of two kinds of files:
1. File list: the number of files plus the file-name list, in text format. A file name can be either a relative or an absolute path; names are separated by ‘\n’.
2. Data files: a set of files in binary format.

Dataset
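Building such a file list is a one-liner; a sketch (the count-on-the-first-line layout is my reading of the slide, not a documented spec):

```python
# Write a dataset file list: file count on the first line, then one
# file name (relative or absolute path) per line, separated by '\n'.
data_files = ["./data/file0.bin", "./data/file1.bin", "/abs/path/file2.bin"]

with open("file_list.txt", "w") as f:
    f.write(f"{len(data_files)}\n")
    f.write("\n".join(data_files))
```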

SLIDE 37

HOW TO USE

Data File

Training set format (RAW data with header). After the header, each sample is laid out as:

Int label1, Int label2, Int label3, …
then for each slot: Int slot_nnz followed by slot_nnz × I64 key (slot1, slot2, …)
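The per-sample layout above can be sketched with Python's struct module. The field order here is my reading of the slide (32-bit int labels, then per slot a 32-bit nnz count followed by that many 64-bit keys); treat it as an assumption, not a spec:

```python
import struct

def pack_sample(labels, slots):
    """labels: list of int labels; slots: list of key lists, one per slot."""
    buf = struct.pack(f"<{len(labels)}i", *labels)        # Int label1, label2, ...
    for keys in slots:
        buf += struct.pack("<i", len(keys))               # Int slot_nnz
        buf += struct.pack(f"<{len(keys)}q", *keys)       # slot_nnz x I64 key
    return buf

def unpack_sample(buf, num_labels, num_slots):
    labels = list(struct.unpack_from(f"<{num_labels}i", buf, 0))
    off = num_labels * 4
    slots = []
    for _ in range(num_slots):
        (nnz,) = struct.unpack_from("<i", buf, off); off += 4
        slots.append(list(struct.unpack_from(f"<{nnz}q", buf, off))); off += 8 * nnz
    return labels, slots

sample = pack_sample([1, 0], [[5, 48, 90, 21], [6, 24, 52]])
print(unpack_sample(sample, 2, 2))  # ([1, 0], [[5, 48, 90, 21], [6, 24, 52]])
```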

SLIDE 38

ROADMAP

1.0
  • Early 2019
  • RAW-buffer Embedder
  • Embedder + MLP

2.0
  • September 2019
  • HashTable Embedding
  • Multi-node training
  • Mixed precision
  • More layers

3.0
  • TensorFlow inference
  • Optimized slot reduction
  • Dense input
  • Inference support
  • Input data check
  • WDL / DeepFM / DCN

SLIDE 39

RESOURCES

Source code: https://github.com/NVIDIA/HugeCTR
WeChat article: https://mp.weixin.qq.com/s/Oieuhvt2vzFEfKklTHiuOg

SLIDE 40

KEY CONTRIBUTORS

Fan Yu: Hashtable
Xiaoying Jia: Mixed Precision
Yong Wang: Algorithm Advisor
Minseok Lee: Multi-Node
Ryan Jeng: Competitive Study
David Wu: Embedding
Joey Wang: Project Management
Gems Guo: TensorFlow
