REVOLUTIONIZING RETAIL WITH ARTIFICIAL INTELLIGENCE (PowerPoint presentation by Scott Brubaker, Paul Hendricks & Alex Sabatier)



SLIDE 1

REVOLUTIONIZING RETAIL WITH ARTIFICIAL INTELLIGENCE

Scott Brubaker, Paul Hendricks & Alex Sabatier

SLIDE 2

VISION-BASED APPLICATIONS

INCEPTION PARTNERS & RETAIL ECOSYSTEM (Physical / In-store and Online): DIGITAL SIGNAGE • ROBOTS, DRONES • VISUAL SEARCH, TAGGING • MARKETING, ANALYTICS • CONVERSATIONAL COMMERCE • OTHER

SLIDE 3

AI FOR RETAIL: SUPPLY CHAIN • SHOPPING EXPERIENCE • CORPORATE HEADQUARTERS

SLIDE 4

SHOPPING EXPERIENCE: STORE (IVA)

FRICTIONLESS COMMERCE • LOSS PREVENTION • SHOPPER TRACKING • INVENTORY ANALYSIS

SLIDE 5

TOP RETAIL IVA USE CASES

LOSS PREVENTION

Ticket Switching • Mis-scanning • Sweethearting • Security

STORE ANALYTICS

Heat Mapping • Demographic Analysis • Shopper/Employee Tracking • Stock Out • Customer Engagement • Price Matching • Pick-up

AUTONOMOUS SHOPPING

Autonomous Checkout • Nano Stores • Smart Cabinets

$50B in annual shrinkage in the US alone. 50% of top retailers will implement IVA for store analytics. Autonomous checkout locations to increase 4x annually for the next 3 years.
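Heat mapping, the first store-analytics use case above, reduces to binning tracked shopper positions into a grid and counting visits per cell. A minimal plain-Python sketch; the coordinates below are hypothetical, and a real IVA pipeline would get them from camera-based tracking:

```python
from collections import Counter

def heat_map(positions, cell=1.0):
    """Bin (x, y) shopper positions into square grid cells and count visits per cell."""
    counts = Counter()
    for x, y in positions:
        counts[(int(x // cell), int(y // cell))] += 1
    return counts

# Hypothetical tracked positions (metres) from in-store cameras.
tracks = [(0.2, 0.3), (0.8, 0.1), (2.5, 3.1), (2.9, 3.7)]
hm = heat_map(tracks)
print(hm[(0, 0)], hm[(2, 3)])  # 2 2
```

The resulting counts are what gets rendered as the familiar colored heat-map overlay on the store floor plan.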

SLIDE 6

SHOPPING EXPERIENCE: ONLINE

AR/VR CONSUMER INTERACTION • RECOMMENDATION ENGINE • IMAGE-BASED SEARCH

SLIDE 7

RECOMMENDATION ENGINES ON GPU CLOUD

VIDEO RECOMMENDATIONS • SONG RECOMMENDATIONS • TARGETED RECOMMENDATIONS
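Whatever the content type, these recommendation engines typically rank items by similarity between preference vectors, which is exactly the dense arithmetic GPUs parallelize at scale. A toy cosine-similarity recommender in plain Python; the item names and vectors are hypothetical:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_vec, item_vecs, k=2):
    """Return the ids of the k items most similar to the user's taste vector."""
    ranked = sorted(item_vecs, key=lambda item: cosine(user_vec, item_vecs[item]),
                    reverse=True)
    return ranked[:k]

# Hypothetical genre-preference vectors for three songs.
items = {
    "song_a": [1.0, 0.0, 0.5],
    "song_b": [0.9, 0.1, 0.4],
    "song_c": [0.0, 1.0, 0.0],
}
print(recommend([1.0, 0.0, 0.5], items))  # ['song_a', 'song_b']
```

A production engine replaces the toy vectors with learned embeddings and the sort with an approximate nearest-neighbor search over millions of items.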

SLIDE 8

AI IN SUPPLY CHAIN

DYNAMIC SUPPLY CHAIN • REAL-TIME RE-ROUTING • WAREHOUSE OPTIMIZATION • FORECASTING AND REPLENISHMENT

SLIDE 9

AI AT CORPORATE HQ

SINGLE VIEW OF CONSUMER • DEMAND SIGNAL ANALYSIS • AD SPEND OPTIMIZATION • PREDICTIVE ANALYTICS

SLIDE 10

DATA SCIENCE IN RETAIL

Supply Chain Replenishment • Inventory Management • Price Simulation & Management • Promotion Prioritization & Ad Targeting • Marketing Optimization • Personalized Recommendations • Truck Routing • Online Delivery

GPU POWERED MACHINE LEARNING

SLIDE 11

THE STORE OF THE FUTURE

Future-Proofed IVA Infrastructure

DL-BASED IVA EDGE USE CASES: Loss Prevention • Stock Out Reduction • Store Analytics • Security

Architecture: in-store cameras and sensors feed a back-of-store server (6 x T4 GPUs) and Jetson AGX Xavier / Nano edge devices

SLIDE 12

NVIDIA VALUE

Comprehensive Platform for Retail IVA

NVIDIA DELIVERS:

  • IVA inference with the NVIDIA T4 GPU: 27x* speedup vs. CPU, 4,400 images/second (1080p)
  • Metropolis platform optimized for IVA: DeepStream inference SDK, TensorRT
  • 70+ GPU-accelerated IVA software partners
  • Deep learning education: developer blogs + IVA DLI

* Based on ResNet-50. GPU hardware accelerator engines for video decoding and encoding support faster-than-real-time video processing.

SLIDE 13

ART OF THE POSSIBLE

Paul Hendricks Solutions Architect phendricks@nvidia.com

The State of AI in Retail

SLIDE 14

  • Paul Hendricks is a Solutions Architect at NVIDIA, helping enterprise customers with their deep learning and AI initiatives
  • Paul's background is primarily in retail; he has spent the past 5 years working with many Fortune 500 retail companies to implement data science and AI solutions
  • Prior to joining NVIDIA, Paul worked at Victoria's Secret as a Data Scientist, building models to understand customer propensity to purchase and how to optimize assortment in stores
  • Currently, Paul's research at NVIDIA focuses on intelligent video analytics, machine learning, recommendation systems, GANs, and reinforcement learning

INTRODUCTION


SLIDE 16

INTELLIGENT VIDEO ANALYTICS

SLIDE 17

Image Classification

  • Input Data: Images, Videos
  • Goal: Given an input, identify the class that input belongs to

Problem Background
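Concretely, a classifier's final layer emits one raw score per class; softmax turns the scores into probabilities and argmax picks the class. A minimal plain-Python sketch of that last step (the class labels are hypothetical):

```python
from math import exp

def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(logits, labels):
    """Return the label with the highest score and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["shelf", "shopper", "cart"]  # hypothetical classes
print(classify([0.5, 2.1, -1.0], labels))  # highest score wins: 'shopper'
```

In a real network the logits come from convolutional feature extraction; only this final scoring step is shown here.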

SLIDE 18

Object Detection

  • Input Data: Images, Videos
  • Goal: Given an input, identify objects and output bounding boxes around the objects and their classes

Problem Background
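A core primitive when training and evaluating detectors is intersection-over-union (IoU), which measures how well a predicted bounding box matches a ground-truth box. A minimal sketch, with boxes as (x1, y1, x2, y2) corner tuples:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: overlap 1, union 7
```

Detection benchmarks such as COCO count a prediction as correct when its IoU with a ground-truth box exceeds a threshold (commonly 0.5 or a sweep up to 0.95).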

SLIDE 19

Object Segmentation (Semantic Segmentation)

  • Input Data: Images, Videos
  • Goal: Given an input, identify objects and output a mapping of pixels to their respective classes

Problem Background

SLIDE 20

LOSS PREVENTION, STORE ANALYTICS, AND FRICTIONLESS CHECKOUT

https://www.standardcognition.com/

SLIDE 21

LOCALIZING ALGORITHMS

SLIDE 28

LOCALIZING ALGORITHMS

Single Stage Detectors

  • These algorithms regress the bounding boxes and classify the object within each bounding box in a single pass
  • Computationally efficient and can be very fast during inference
  • Examples: YOLOv3, SSD, RetinaNet, RetinaMask

Two Stage Detectors

  • These algorithms generate a number of region proposals, which are then passed to a CNN and classified
  • Slower during inference, since regions must first be proposed and then evaluated (often redundantly when proposals overlap)
  • Often more accurate than single-stage detectors, especially when trained on semantic segmentations
  • Examples: Faster RCNN, Mask RCNN
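Both families emit many overlapping candidate boxes for the same object, so detectors typically finish with non-maximum suppression (NMS). A minimal greedy NMS sketch in plain Python, with detections as (box, score) pairs; the 0.5 threshold is a common but arbitrary choice:

```python
def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(detections, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop any remaining
    box that overlaps it by more than `thresh`, then repeat."""
    keep = []
    rest = sorted(detections, key=lambda d: d[1], reverse=True)
    while rest:
        best = rest.pop(0)
        keep.append(best)
        rest = [d for d in rest if iou(best[0], d[0]) <= thresh]
    return keep

# Two near-duplicate boxes on one object plus one distinct box.
dets = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8), ((20, 20, 30, 30), 0.7)]
print([s for _, s in nms(dets)])  # [0.9, 0.7]
```

Production frameworks run a batched, GPU-accelerated version of the same idea.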
SLIDE 29

GETTING STARTED

DLI Courses

  • Introduction to Object Detection with TensorFlow – https://courses.nvidia.com/courses/course-v1:DLI+L-AV-04+V1

Papers

  • YOLOv3 – https://pjreddie.com/publications/
  • Faster RCNN – https://arxiv.org/pdf/1506.01497
  • Mask RCNN – https://arxiv.org/abs/1703.06870
  • RetinaNet – https://arxiv.org/abs/1708.02002
  • RetinaMask – https://arxiv.org/abs/1901.03353

Libraries

  • DarkNet – https://github.com/pjreddie/darknet
  • TensorFlow’s Object Detection API – https://github.com/tensorflow/models/tree/master/research/object_detection
  • Facebook’s Mask RCNN Benchmark – https://github.com/facebookresearch/maskrcnn-benchmark

Datasets

  • ImageNet – https://www.kaggle.com/c/imagenet-object-detection-challenge
  • Pascal VOC – http://host.robots.ox.ac.uk/pascal/VOC/
  • COCO – http://cocodataset.org/
  • Open Images – https://storage.googleapis.com/openimages/web/index.html
SLIDE 30

MACHINE LEARNING

SLIDE 31

DATA SCIENCE IN RETAIL

Supply Chain Replenishment • Inventory Management • Price Management / Markdown Optimization • Promotion Prioritization and Ad Targeting • Marketing Optimization • Personalized Recommendations • Truck Routing • Online Delivery

SLIDE 32

ML WORKFLOW STIFLES INNOVATION

Typical pipeline: Data Sources → Data Lake → ETL → Wrangle Data → Train → Evaluate Predictions (stages: Data Preparation, Train, Deploy)

A time-consuming, inefficient workflow that wastes data science productivity.

SLIDE 33

DATA SCIENCE WORKFLOW WITH RAPIDS

Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

DATA → DATA PREPARATION → PREDICTIONS

  • GPU-accelerated compute for in-memory data preparation
  • Simplified implementation using familiar data science tools
  • Python drop-in Pandas replacement built on CUDA C++
  • GPU-accelerated Spark (in development)

SLIDE 34

DATA SCIENCE WORKFLOW WITH RAPIDS

Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

MODEL TRAINING (DATA → PREDICTIONS)

GPU acceleration of today's most popular ML algorithms: XGBoost, Random Forest, Linear Regression, PCA, K-means, k-NN, DBSCAN, tSVD, and more
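As a CPU-side illustration of one algorithm from this list, here is a plain-Python K-means; cuML exposes the same algorithm behind a scikit-learn-style API but runs it on the GPU. The data points below are hypothetical:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain-Python Lloyd's K-means: assign points to nearest center,
    then move each center to its cluster mean, repeated `iters` times."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

# Two well-separated blobs (hypothetical 2-D features).
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(pts, 2)
print(sorted(centers))  # one center near each blob mean
```

The GPU version changes no math, only where the distance computations and reductions run.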

SLIDE 35

DATA SCIENCE WORKFLOW WITH RAPIDS

Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

VISUALIZATION (DATA → PREDICTIONS)

  • Effortless exploration of datasets: billions of records in milliseconds
  • Dynamic interaction with data = faster ML model development
  • Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS

SLIDE 36

RAPIDS — OPEN GPU DATA SCIENCE

Software Stack

Data Preparation, Model Training, and Visualization built on: RAPIDS (cuDF, cuML, cuGraph) | Deep Learning Frameworks (cuDNN) | Dask | Apache Arrow | Python | CUDA

SLIDE 37

GETTING STARTED

DLI Courses

  • Accelerating Data Science Workflows with RAPIDS – https://courses.nvidia.com/courses/course-v1:DLI+L-DS-01+V1

Resources

  • RAPIDS GitHub – https://github.com/rapidsai

  • cuDF – https://github.com/rapidsai/cudf
  • cuML – https://github.com/rapidsai/cuml
  • cuGraph – https://github.com/rapidsai/cugraph
  • XGBoost – https://github.com/rapidsai/xgboost
  • Dask cuDF – https://github.com/rapidsai/dask-cudf
  • Dask cuML – https://github.com/rapidsai/dask-cuml
  • Dask XGBoost – https://github.com/rapidsai/dask-xgboost
  • Notebooks – https://github.com/rapidsai/notebooks
  • Notebooks Extended – https://github.com/rapidsai/notebooks-extended
SLIDE 38

NVIDIA HARDWARE

SLIDE 39

TESLA V100 TENSOR CORE GPU

World’s Most Advanced Data Center GPU

5,120 CUDA cores 640 Tensor cores 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS 20MB SM RF | 16MB Cache 32 GB HBM2 @ 900GB/s | 300GB/s NVLink

SLIDE 40

TENSOR CORE BUILT FOR AI

Delivering 125 TFLOPS of DL Performance TENSOR CORE

ALL MAJOR FRAMEWORKS → VOLTA-OPTIMIZED cuDNN → TENSOR CORE

  • MATRIX DATA OPTIMIZATION: dense matrices of tensor compute
  • TENSOR-OP CONVERSION: FP32 to tensor-op data for frameworks

VOLTA TENSOR CORE: a 4x4 matrix processing array computing D[FP32] = A[FP16] * B[FP16] + C[FP32], optimized for deep learning
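The tile operation above can be emulated in plain Python to show exactly what each tensor core computes per step; the FP16-input/FP32-accumulate mixed precision is a hardware property not modeled here:

```python
def tile_fma(A, B, C):
    """One tensor-core-style tile operation: D = A @ B + C on 4x4 matrices.
    (On Volta, A and B are FP16 while C and D accumulate in FP32.)"""
    n = 4
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j] for j in range(n)]
        for i in range(n)
    ]

# Identity times identity, plus a matrix of ones: diagonal 2s, off-diagonal 1s.
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]
D = tile_fma(I, I, C)
print(D[0])  # [2.0, 1.0, 1.0, 1.0]
```

A full GEMM is built by sweeping this fused multiply-accumulate over tiles of the larger matrices, which is how the quoted 125 Tensor TFLOPS are reached.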

SLIDE 41

NVIDIA DGX

AI Supercomputer-in-a-Box

1000 TFLOPS | 8x Tesla V100 32GB | NVLink Hybrid Cube Mesh 2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W

SLIDE 42

NVIDIA DGX-2

THE WORLD’S MOST POWERFUL DEEP LEARNING SYSTEM FOR THE MOST COMPLEX DEEP LEARNING CHALLENGES

  • First 2 PFLOPS System
  • 16 V100 32GB GPUs Fully Interconnected
  • NVSwitch: 2.4 TB/s bisection bandwidth
  • 24X GPU-GPU Bandwidth
  • 0.5 TB of Unified GPU Memory
  • 10X Deep Learning Performance


SLIDE 43

TESLA T4

WORLD’S MOST ADVANCED SCALE-OUT GPU

2,560 CUDA cores | 320 Turing Tensor Cores | 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS | 16GB @ 320GB/s | 70 W

SLIDE 44

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE & ENTRY LEVEL TRAINING 65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

SLIDE 45

Up To 36X Faster Than CPUs | Accelerates All AI Workloads

WORLD’S MOST PERFORMANT INFERENCE PLATFORM

Speedups vs. a CPU server (dual-socket Xeon Gold 6140 @ 3.6GHz with a single GPU as shown; 18.11-py3, TensorRT 5.0; CPU FP32, P4 & T4 INT8, batch size 128):

  • Natural language processing inference (GNMT): Tesla P4 10x, Tesla T4 36x faster
  • Video inference (ResNet-50, 7ms latency limit): Tesla P4 10x, Tesla T4 27x faster
  • Speech inference (DeepSpeech 2): Tesla P4 4x, Tesla T4 21x faster

Peak performance (TFLOPS / TOPS): P4 5.5 float / 22 INT8; T4 65 float / 130 INT8 / 260 INT4

SLIDE 46

WORLD’S FASTEST INFERENCE PERFORMANCE

NVIDIA GPUs Set New Performance Records (ResNet-50 / GoogleNet)

  • Throughput (images/second): NVIDIA T4 4,018 / 6,359; NVIDIA V100 5,760 / 12,300
  • Energy efficiency (images/second/watt): NVIDIA T4 60 / 85; NVIDIA V100 27 / 53
  • Latency (milliseconds): NVIDIA T4 1.10 / 0.65; NVIDIA V100 1.00 / 0.60

SLIDE 47

JETSON TX1: 7–15W, 1 TFLOPS (FP16), 50mm x 87mm
JETSON TX2: 7–15W, 1.3 TFLOPS (FP16), 50mm x 87mm
JETSON AGX XAVIER: 10–30W, 10 TFLOPS (FP16) | 32 TOPS (INT8), 100mm x 87mm

THE JETSON FAMILY

Multiple devices • Unified software

AI at the edge: fully autonomous machines • UAVs • AI subsystems • AI cameras • factory automation • logistics • delivery robots

SLIDE 48

NVIDIA SOFTWARE

SLIDE 49

CHALLENGES WITH DEEP LEARNING

  • Current DIY deep learning environments are complex and time-consuming to build, test, and maintain
  • A high level of expertise is required to manage driver, library, and framework dependencies
  • Framework development by the community moves very fast

Stack: Open Source Frameworks on NVIDIA Libraries, NVIDIA Docker, the NVIDIA Driver, and the NVIDIA GPU

SLIDE 50

NVIDIA GPU CLOUD

  • Innovate in minutes, not weeks: removes all the DIY complexity of deep learning software integration
  • Always up to date: monthly updates by NVIDIA to ensure maximum performance
  • Deep learning across platforms: containers run locally on DGX systems and TITAN PCs, or on cloud service provider GPU instances

Deep Learning Everywhere, For Everyone

NVIDIA GPU Cloud integrates GPU-optimized deep learning frameworks, runtimes, libraries, and OS into a ready-to-run container, available at no charge

SLIDE 51

COMMON SOFTWARE STACK ACROSS DGX FAMILY

Cloud Service Provider

  • Single, unified stack for deep learning frameworks
  • Predictable execution across platforms
  • Pervasive reach

DGX Station DGX-1

NVIDIA GPU Cloud

DGX-2


SLIDE 52

TENSORRT

SLIDE 53

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize the trained model. Import the trained neural network into the TensorRT Optimizer and serialize the engine, producing optimized plans (Plan 1, Plan 2, Plan 3).

Step 2: Deploy the optimized plans with the TensorRT Runtime Engine. De-serialize the engine and deploy the runtime to embedded, automotive, or data center targets.

SLIDE 54

NVIDIA TENSORRT

From Every Framework, Optimized For Each Target Platform

Frameworks → TensorRT → Platforms: TESLA V100, DRIVE PX 2, TESLA P4/T4, JETSON TX2, NVIDIA DLA

SLIDE 55

TensorRT 5 & TensorRT Inference Server

Turing Support ● Optimizations & APIs ● Inference Server

  • World’s most advanced inference accelerator: up to 40x faster performance on Turing Tensor Cores
  • New optimizations and flexible INT8 APIs; new INT8 workflows, Windows and CentOS support
  • TensorRT Inference Server: maximize GPU utilization, run multiple models on a node

Free download for members of the NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
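The INT8 workflows mentioned above rest on linear quantization: derive a scale from calibration data, then map FP32 values onto 8-bit integers. A minimal symmetric (max-abs calibration) sketch in plain Python; TensorRT's actual calibrators are considerably more sophisticated, and the weights below are hypothetical:

```python
def quantize(xs):
    """Symmetric linear INT8 quantization: scale = max|x| / 127,
    then round and clamp each value to [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32."""
    return [v * scale for v in q]

weights = [0.4, -1.0, 0.3, 0.0]   # hypothetical FP32 weights
q, s = quantize(weights)
print(q)  # [51, -127, 38, 0]
```

Each dequantized value lands within half a quantization step of the original, which is why well-calibrated INT8 inference loses little accuracy while quadrupling arithmetic throughput versus FP32.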

SLIDE 56

TensorRT Inference Server

Containerized microservice for data center inference

  • Multiple models scalable across GPUs
  • Supports all popular AI frameworks
  • Seamless integration into DevOps deployments leveraging Docker and Kubernetes
  • Ready-to-run container, free from the NGC container registry

SLIDE 57

TRANSFER LEARNING TOOLKIT

SLIDE 58

End to End NVIDIA Deep Learning Workflow

Accelerate time to market and save on compute resources!

  • Pre-trained model access from NGC
  • Training & adaptation
  • Applications ready to integrate with DeepStream

SLIDE 59

DEEPSTREAM

SLIDE 60

Accelerate building and deploying heterogeneous applications for IVA use cases with TLT & DeepStream 3.0

NVIDIA DEEPSTREAM

Zero Memory Copies

SLIDE 61

SLIDE 62