GRADUATE FELLOW FAST FORWARD - Bill Dally, Chief Scientist and SVP Research, NVIDIA



SLIDE 1

Bill Dally, Chief Scientist and SVP Research, NVIDIA Thursday, May 11, 2017

GRADUATE FELLOW FAST FORWARD

SLIDE 2

GRADUATE FELLOWSHIP PROGRAM

Funding for Ph.D. students revolutionizing disciplines with the GPU

Engage:

  • Build mindshare
  • Facilitate recruiting

Learn:

  • Keep a finger on the pulse of leading academic research
  • Keep up with all the applications that are powered by GPUs

Leverage:

  • Track relevant research
  • Help to guide researchers working on relevant problems

SLIDE 3

GRADUATE FELLOWSHIP PROGRAM

Eligibility/Application Process:

  • Ph.D. candidates in at least their 2nd year
  • Nomination(s) by Professor(s)/Advisor
  • 1-2 page research proposal

Selection Process:

  • A committee of NVIDIA scientists and engineers reviews applications
  • Applications evaluated for originality, potential, and relevance

144 Graduate Fellowships ($3.8M) awarded since the program's inception in 2002

SLIDE 4

CURRENT 2016-2017 GRAD FELLOWS

  • Saman Ashkiani, UC Davis
  • Yong He, CMU
  • Yatish Turakhia, Stanford
  • Minjie Wang, NYU
  • Jiajun Wu, MIT
  • Gang Wu, Univ of Sussex (NVIDIA Foundation Fellow)

SLIDE 5

CURRENT 2016-2017 GRAD FELLOW FINALISTS

  • Ahmed Elkholy, University of Illinois at Urbana-Champaign
  • Achuta Kadambi, Massachusetts Institute of Technology
  • Caroline Trippel, Princeton
  • Yu-Hang Tang, Brown University
  • Ling-Qi Yan, University of California at Berkeley

SLIDE 6

AGENDA

6 talks, 3 minutes each

SLIDE 7

JIAJUN WU, MIT

SLIDE 8

Jiajun Wu, May 11, 2017

SINGLE IMAGE 3D INTERPRETER NETWORK

SLIDE 9

GOAL

SLIDE 10

3D INTERPRETER NETWORK (3D-INN)

Three-step training paradigm: 2D keypoint estimation, a 3D interpreter, and a 3D-to-2D projection layer supervised with 2D keypoint labels.

SLIDE 11

3D INTERPRETER NETWORK (3D-INN)

Three-step training paradigm. Step I: 2D keypoint estimation.

SLIDE 12

3D INTERPRETER NETWORK (3D-INN)

Three-step training paradigm. Step I: 2D keypoint estimation. Step II: 3D interpreter.

SLIDE 13

3D INTERPRETER NETWORK (3D-INN)

Three-step training paradigm. Step I: 2D keypoint estimation. Step II: 3D interpreter. Step III: end-to-end finetuning through the 3D-to-2D projection layer, supervised with 2D keypoint labels.
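
The projection layer is what allows end-to-end training from 2D labels alone: predicted 3D keypoints are projected back to 2D and compared against the annotations. Below is a minimal numpy sketch of such a pinhole 3D-to-2D projection; the function name and camera parameterization are illustrative, not the paper's exact layer.

```python
import numpy as np

def project_keypoints(points_3d, R, t, f):
    """Project Nx3 keypoints to Nx2 image coordinates (pinhole camera)."""
    cam = points_3d @ R.T + t              # world frame -> camera frame
    return f * cam[:, :2] / cam[:, 2:3]    # perspective divide

# A keypoint straight ahead of the camera projects to the image center.
R = np.eye(3)                              # no rotation
t = np.array([0.0, 0.0, 5.0])              # object 5 units in front of camera
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(project_keypoints(pts, R, t, f=2.0))
```

Because every operation here is differentiable, gradients from a 2D keypoint loss can flow back into the 3D predictions.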

SLIDE 14

3D ESTIMATION: QUALITATIVE RESULTS

Training: our Keypoint-5 dataset, 2K images per category

SLIDE 15

3D ESTIMATION: QUALITATIVE RESULTS

Training: our Keypoint-5 dataset, 2K images per category

SLIDE 16

3D ESTIMATION: QUALITATIVE RESULTS

Training: our Keypoint-5 dataset, 2K images per category
IKEA Dataset [Lim et al, '13]

SLIDE 17

3D ESTIMATION: QUALITATIVE RESULTS

SUN Database [Xiao et al, '11]: input images and results after finetuning (FT).

Training: our Keypoint-5 dataset, 2K images per category

SLIDE 18

3D ESTIMATION: QUALITATIVE RESULTS

Training: our Keypoint-5 dataset, 2K images per category
SUN Database [Xiao et al, '11]

SLIDE 19

CHAIR EMBEDDING

Manifold of chairs based on their inferred viewpoint

SLIDE 20

CONTRIBUTIONS OF 3D-INN

  • Single-image 3D perception
  • Real 2D labels + synthetic 3D models, connected via keypoints
  • A 3D-to-2D projection layer for end-to-end training

SLIDE 21

SLIDE 22

YATISH TURAKHIA, STANFORD

SLIDE 23

Yatish Turakhia, 05/11/2017

DARWIN: A GENOMICS CO-PROCESSOR

SLIDE 24

GENOME ANALYSIS PIPELINE

  • Long reads (>10Kbp) offer better resolution of the mutation spectrum but have a high error rate (15-40%)
  • >1,300 CPU hours for reference-guided assembly of noisy long reads
  • >15,600 CPU hours for de novo assembly of noisy long reads

Pipeline: (1) a DNA sequencer turns the patient's genome (3 billion base pairs) into reads (e.g. ATGTCGAT, CGATACGA, GAGTCATC, ACTGACGT); (2) the reads are assembled via sequence alignment; (3) the disease-causing mutation is found by comparison against the reference:

REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
PATIENT:   --ATGTCAATGAT-CAGAGGATATTAGGATAT-

SLIDE 25

DARWIN: SEQUENCE ALIGNMENT FRAMEWORK

Darwin pairs D-SOFT (seed) with GACT (extend), each exposed to software through its own API (D-SOFT API, GACT API), and is implemented as a 40nm ASIC (300mm2, 9W). It combines high speed and programmability:

1. D-SOFT: tunable speed/sensitivity to match different error profiles
2. GACT: the first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware, well suited to long reads
3. The first framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware
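
GACT's O(1)-memory property comes from working in fixed-size tiles, so the dynamic-programming state never grows with read length. Here is a toy Python sketch of the tiling idea (scores only, no trace-back, fixed diagonal, and illustrative parameters; Darwin's hardware algorithm is more involved):

```python
def nw_tile(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for one tile, using only two DP rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def tiled_score(ref, query, tile=32):
    """Chain fixed-size tiles along the diagonal: memory stays O(tile)
    however long the sequences are -- the key GACT property."""
    total, i, n = 0, 0, min(len(ref), len(query))
    while i < n:
        total += nw_tile(ref[i:i + tile], query[i:i + tile])
        i += tile
    return total
```

The real GACT also keeps per-tile trace-back pointers on-chip and starts each new tile where the previous tile's trace-back ended.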

SLIDE 26

DARWIN: REFERENCE-GUIDED ASSEMBLY

D-SOFT scans each read (~10Kbp) against the reference genome (~3Gbp) and turns seed hits into candidate alignment start locations (~10^6 candidates). GACT then extends from the 1st GACT tile (seeded by D-SOFT) through successive tiles with trace-back; a true alignment scores high (e.g. Score=7500) while a spurious candidate scores low (e.g. Score=60).
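
The seeding step can be pictured as counting k-mer hits per diagonal band of the reference and keeping bands that clear a threshold; the threshold is the speed/sensitivity knob. A simplified Python sketch (the parameters, binning, and function name are illustrative; D-SOFT's actual counting scheme is more refined):

```python
from collections import defaultdict

def seed_candidates(ref, query, k=4, bin_size=16, threshold=2):
    """Seed-and-filter: count k-mer seed hits per diagonal band; bands whose
    hit count reaches the threshold become candidate alignment locations."""
    index = defaultdict(list)                  # k-mer -> positions in ref
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    hits = defaultdict(int)
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            hits[(i - j) // bin_size] += 1     # bin hits by diagonal i - j
    return sorted(b * bin_size for b, c in hits.items() if c >= threshold)

# The query occurs at offset 4 in the reference, so band 0 collects 5 hits.
print(seed_candidates("TTTTACGTTGCAGGGG", "ACGTTGCA"))
```

Raising the threshold trades sensitivity for speed, which is how the framework is tuned to different read error profiles.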

SLIDE 27

DARWIN: DE NOVO ASSEMBLY

For de novo assembly, D-SOFT matches reads against each other instead of a reference: seed hits become candidate overlap locations, and GACT extends from the 1st tile with trace-back (e.g. Score=2500) so that read overlaps can be inferred.

D-SOFT (40-100X speedup):
  1. Sequential accesses to multiple DRAM channels
  2. Random accesses using large on-chip memory (64MB)

GACT (6000X speedup):
  1. 512 Processing Elements (PEs) solving 3 dynamic programming equations every cycle
  2. Trace-back pointers maintained in on-chip memory (2KB/PE)

SLIDE 28

DARWIN PERFORMANCE

Reference-guided assembly:

READ TYPE            ERROR RATE  SENSITIVITY (SOFTWARE)  SENSITIVITY (DARWIN)  DARWIN SPEEDUP
Pacific Biosciences  15%         95.95%                  99.91%                4,110X
Oxford Nanopore 2D   30%         98.11%                  98.40%                4,080X
Oxford Nanopore 1D   40%         97.10%                  97.40%                128X

De novo assembly:

READ TYPE            ERROR RATE  SENSITIVITY (SOFTWARE)  SENSITIVITY (DARWIN)  DARWIN SPEEDUP
Pacific Biosciences  15%         99.80%                  99.89%                250X

SLIDE 29

THANK YOU!

SLIDE 30

SAMAN ASHKIANI, UC DAVIS

SLIDE 31

Saman Ashkiani, 05/11/2017

DYNAMIC DATA STRUCTURES FOR THE GPU

SLIDE 32

DYNAMIC DATA STRUCTURES FOR THE GPU

  • Objective: a general-purpose data structure that can be updated at runtime
  • Supports updates (insert/deletion): batched or individual
  • Efficient queries (lookup, count, range, etc.): batched or individual
  • Motivation: more types of GPU data structures in programmer’s toolbox
  • It is a challenging task, because
  • GPUs have thousands of parallel threads: need an efficient non-blocking data structure
  • Most classic non-blocking ideas are hard to be efficiently implemented in SIMD fashion
  • Efficient dynamic memory allocation is hard on GPUs
  • Safe memory reclamation: no dynamic memory management on GPUs

SLIDE 33

OUR IMPLEMENTATIONS

CONCURRENT HASH TABLE

Hash table with chaining:

  • Updates: concurrent insertion/deletion (497 M updates/s on K40)
  • Queries: lookup (860 M queries/s on K40)
  • Each bucket: a warp-friendly linked list
  • Warp-synchronous programming to better fit the SIMD model
  • Our own dynamic memory allocator for nodes
  • Memory reclamation: safe removal of deleted nodes for future reuse

GPU LSM

Dictionary data structure: multiple sorted arrays of different sizes:

  • Updates: batch insertion/deletion (average: 225 M updates/s on K40)
  • Queries: lookup, count, and range (133, 60, and 30 M queries/s)
  • Based on radix sort, merge, and binary search
  • Paper draft: http://ece.ucdavis.edu/~ashkiani/gpu_lsm.pdf
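
The LSM ("log-structured merge") layout can be sketched on the CPU in a few lines: the dictionary is a set of sorted runs of doubling size, a batch insert cascades merges, and a lookup binary-searches the runs from newest to oldest. This is a minimal single-threaded sketch with assumed names; the GPU version replaces the sorts and merges with radix sort and parallel merge and also handles deletions, counts, and range queries.

```python
import bisect

class TinyLSM:
    """CPU sketch of the GPU LSM idea: sorted runs of doubling sizes."""
    def __init__(self):
        self.levels = []                        # level i: sorted run or None

    def insert_batch(self, pairs):
        run = sorted(pairs)                     # on the GPU: radix sort
        i = 0
        while i < len(self.levels) and self.levels[i] is not None:
            run = sorted(run + self.levels[i])  # on the GPU: parallel merge
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = run

    def lookup(self, key):
        for run in self.levels:                 # newest (smallest) level first
            if run:
                j = bisect.bisect_left(run, (key,))
                if j < len(run) and run[j][0] == key:
                    return run[j][1]
        return None
```

Batching is what makes this GPU-friendly: each insert batch is one sort plus a few bulk merges, all highly parallel primitives.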

SLIDE 34

SLIDE 35

GANG WU, UNIVERSITY OF SUSSEX

SLIDE 36

Gang Wu, 11th May 2017

HIGH-SPEED FLUORESCENCE LIFETIME IMAGING BASED ON ANN AND GPU

SLIDE 37

CONTENTS

  • What is FLIM
  • Project Aims
  • FLIM Theories
  • ANN-GPU-FLIM
  • Results

SLIDE 38

WHAT IS FLIM

Fluorescence-lifetime imaging microscopy (FLIM) is a technique for producing an image based on the differences in the exponential decay rate of the fluorescence from a fluorescent sample.

Applications:

  • Surgery guidance
  • Disease diagnosis
  • Disease therapies (gold nanorods)

SLIDE 39

PROJECT AIMS

Current systems: CPU-based traditional FLIM analysis is very slow (tens of minutes for one image).

Aims: high-speed FLIM analysis via a fast algorithm (an artificial neural network) on highly parallel hardware (the GPU).

SLIDE 40

FLIM THEORIES

System overview: a laser excites the sample, a detector feeds TCSPC electronics, and lifetime analysis (this work) runs on the GPU, with a GUI on the CPU.

SLIDE 41

FLIM THEORIES

Fluorescence decay histogram: photon counts per time bin (ns); the lifetime parameters are estimated from this exponential decay curve.
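
For a single-exponential decay, the classic analysis reduces to fitting counts = A * exp(-t / tau), and a least-squares line through the log-counts recovers the lifetime tau. A small numpy sketch on synthetic, noiseless data (real FLIM analysis must also handle noise, background, and multi-exponential decays, which is part of why it is slow):

```python
import numpy as np

def fit_lifetime(t, counts):
    """Lifetime estimate via a least-squares line fit to log(counts),
    assuming a single-exponential decay counts = A * exp(-t / tau)."""
    slope, _ = np.polyfit(t, np.log(counts), 1)
    return -1.0 / slope

t = np.linspace(0.0, 10.0, 50)            # time bins, ns (synthetic)
counts = 1000.0 * np.exp(-t / 2.5)        # noiseless decay, tau = 2.5 ns
print(round(fit_lifetime(t, counts), 3))  # prints 2.5
```

The ANN approach replaces this per-pixel iterative/fitting step with a single trained forward pass, which is what makes the GPU speedup possible.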

SLIDE 42

ANN-GPU-FLIM

ANN-GPU-FLIM principle: the per-pixel photon-count histograms of the FLIM data are fed to an artificial neural network (the network is trained once), which outputs the FLIM images.

SLIDE 43

RESULTS

Accuracy performance: different optimized areas, comparable performance.

Time performance (256×256 image):

ALGORITHM  TIME-CPU (S)  TIME-GPU (S)  SPEEDUP (GPU VS CPU)
ANN        0.89          0.1           8.9
LSM        41.5          3.8           10.8

Overall speedup (ANN on GPU vs. LSM on CPU): 415

SLIDE 44

SLIDE 45

YONG HE, CMU

SLIDE 46

Yong He, May 2017

EVOLVING SHADER COMPILATION FOR PERFORMANCE AND MAINTAINABILITY

SLIDE 47

EVOLVING SHADER COMPILERS

Meeting Performance Goals with Productivity Constraints

Modern games feature increasingly realistic graphics, and a game's shader library has grown 100x more complex. Yet the shading language is still the same as ten years ago, lacking functionality for achieving high performance without compromising code modularity and extensibility.

SLIDE 48

AUTOMATIC APPROXIMATE SHADER COMPILATION

Balances performance (fast shader code) and productivity (fast compilation).

A System for Rapid, Automatic Shader Level-of-Detail. Yong He, Tim Foley, Natalya Tatarchuk, Kayvon Fatahalian. SIGGRAPH Asia 2015

SLIDE 49

SPIRE SHADING LANGUAGE

An IR for cross-stage shader code transformations

Balances performance (fast shader code) and productivity (fast compilation, explorable optimizations).

SLIDE 50

SHADER COMPONENTS

A building block for flexible and extensible shading systems that maps efficiently to the new graphics APIs.

Shader Components: Modular and High Performance Shader Development. Yong He, Tim Foley, Teguh Hofstee, Haomin Long, Kayvon Fatahalian. SIGGRAPH 2017 (to appear)

Balances performance (fast shader code, fast CPU logic) and productivity (fast compilation, explorable optimizations, flexibility & extensibility).

SLIDE 51

SUMMARY

Shader compiler and shading language design for performance and productivity goals. Future work: formally define the language semantics, and unify each piece of work in a consistent system.

SLIDE 52

SLIDE 53

MINJIE WANG, NYU

SLIDE 54

Minjie Wang, May 11, 2017

BUILDING FLEXIBLE AND EFFICIENT DISTRIBUTED MACHINE LEARNING SYSTEMS

SLIDE 55

ABOUT ME

Third-year Ph.D. student at NYU
Advisor: Jinyang Li
Research field: distributed systems for machine learning
Projects:

SLIDE 56

RESEARCH WORK: TOFU

Data parallelism suffers from a batch-size dilemma: it requires a large batch size to scale, but a large batch size harms model accuracy.

(Benchmarks: a 5-layer MLP with hidden size 8192 and batch size 512; an Inception network on the CIFAR-10 dataset.)

SLIDE 57

RESEARCH WORK: TOFU

Other parallelism schemes exist (model parallelism, hybrid parallelism, combined parallelism) but are hard to program.

SLIDE 58

TOFU AUTOMATICALLY PARALLELIZES DEEP LEARNING ALGORITHM

  • Unifies data and model parallelism as different distributed strategies for tensor operators
  • An algorithm finds the distributed strategy with the least communication cost
  • Results show much better scalability when the batch size is small
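
The unifying observation can be shown on a single layer: data parallelism partitions the batch dimension of a matrix multiply, model parallelism partitions the weight dimension, and both are just different tilings of the same tensor operator. A numpy illustration (the worker count and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))   # batch of 8 inputs
W = rng.standard_normal((4, 6))   # layer weights

# Data parallelism: split the batch across 2 workers; each holds a full copy of W.
data_par = np.vstack([x @ W for x in np.split(X, 2, axis=0)])

# Model parallelism: split W's columns across 2 workers; each sees the full batch.
model_par = np.hstack([X @ w for w in np.split(W, 2, axis=1)])

# Both tilings reproduce the undistributed result.
assert np.allclose(data_par, X @ W) and np.allclose(model_par, X @ W)
```

What differs between the two is what must be communicated (gradients of W vs. activations), which is exactly the cost a strategy-search algorithm can minimize per operator.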

SLIDE 59

OPEN SOURCE WORK: MINPY

Native NumPy Support

  • One-line code change
  • Transparent fallback to NumPy

Dynamic Autograd

  • One call to compute gradients
  • Supports data-dependent branches
  • Supports Python's native if/while statements

Just-In-Time Optimization

  • Optimizes the recorded graph (e.g. kernel fusion)
  • Reuses the optimized graph if the next iteration has the same computation
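
Dynamic autograd of this kind is typically tape-based: gradients are computed for whatever operations actually executed, so native Python if/while needs no special support. A minimal scalar sketch (illustrative only; a real system such as MinPy records array operations and optimizes the recorded graph):

```python
class Var:
    """Minimal tape-based reverse-mode autograd on scalars."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # Record local partial derivatives: d(a*b)/da = b, d(a*b)/db = a.
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, g=1.0):
        self.grad += g
        for parent, local in self.parents:
            parent.backward(g * local)

def f(x):
    # Data-dependent control flow: the loop count depends on runtime values,
    # yet the tape simply records whatever multiplications actually ran.
    y = Var(1.0)
    while y.value < 10.0:
        y = y * x
    return y

x = Var(2.0)
out = f(x)                # 1 * 2 * 2 * 2 * 2 = 16
out.backward()            # one call to compute the gradient
print(out.value, x.grad)  # 16.0 32.0  (d(x^4)/dx at x=2 is 4*2^3)
```

Because the tape is rebuilt each run, the recorded graph can also be optimized and reused when the next iteration performs the same computation.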

SLIDE 60

SLIDE 61

Announcing: The New 2017-2018 Grad Fellows And Finalists

SLIDE 62

NEW 2017-2018 GRAD FELLOWS

  • Awni Hannun, Stanford
  • Deepak Pathak, UC Berkeley
  • Caroline Trippel, Princeton
  • Fereshteh Sadeghi, Univ. Washington
  • Ling-Qi Yan, UC Berkeley
  • Abigail See, Stanford

SLIDE 63

NEW 2017-2018 GRAD FELLOWS

  • Robin Betz, Stanford
  • Xiaolong Wang, CMU
  • Adams Wei Yu, CMU
  • Anna Shcherbina, Stanford (NVIDIA Foundation Fellow)
  • Robert Konrad, Stanford

SLIDE 64

NVIDIA FOUNDATION

  • Compute The Cure: an initiative to support research in cancer biology and treatment
  • Focus on Bioinformatics, Genomics, Proteomics
  • Six $200K Research Grants to nonprofit & academic labs since 2013
  • Annual PhD Fellowships to promising researchers in related fields: http://www.nvidia.com/object/compute-the-cure.html

Foundation Fellows:

  • 2016-2017: Gang Wu, AI for Fluorescence Lifetime Imaging
  • 2017-2018: Anna Shcherbina, DL for Epigenetic Regulatory Mechanisms

SLIDE 65

NEW 2017-2018 GRAD FELLOW FINALISTS

  • Daniel Thuerck, Technische Universität Darmstadt
  • Leyuan Wang, University of California at Davis
  • Mohammad Babaeizadeh, University of Illinois at Urbana-Champaign
  • Philippe Tillet, Harvard University
  • Dingzeyu Li, Columbia University

SLIDE 66

THANK YOU