FAST FORWARD
Bill Dally, Chief Scientist and SVP Research, NVIDIA
Thursday, May 11, 2017
GRADUATE FELLOWSHIP PROGRAM
Funding for Ph.D. students revolutionizing disciplines with the GPU
Engage:
- Build mindshare
- Facilitate recruiting
Learn:
- Keep a finger on the pulse of leading academic research
- Keep up with all the applications that are powered by GPUs
Leverage:
- Track relevant research
- Help to guide researchers working on relevant problems
GRADUATE FELLOWSHIP PROGRAM
Eligibility/Application Process:
- Ph.D. candidates in at least their 2nd year
- Nomination(s) by Professor(s)/Advisor
- 1-2 page research proposal
Selection Process:
- A committee of NVIDIA scientists and engineers reviews applications
- Applications evaluated for originality, potential, and relevance
144 Graduate Fellowships ($3.8M) awarded since the program's inception in 2002
CURRENT 2016-2017 GRAD FELLOWS
- Saman Ashkiani, UC Davis
- Yong He, CMU
- Yatish Turakhia, Stanford
- Minjie Wang, NYU
- Jiajun Wu, MIT
- Gang Wu, Univ of Sussex (NVIDIA Foundation Fellow)
CURRENT 2016-2017 GRAD FELLOW FINALISTS
- Ahmed Elkholy, University of Illinois at Urbana-Champaign
- Achuta Kadambi, Massachusetts Institute of Technology
- Caroline Trippel, Princeton
- Yu-Hang Tang, Brown University
- Ling-Qi Yan, University of California at Berkeley
AGENDA
6 talks, 3 minutes each
JIAJUN WU, MIT
SINGLE IMAGE 3D INTERPRETER NETWORK
GOAL
3D INTERPRETER NETWORK (3D-INN)
Three-step training paradigm:
I. 2D keypoint estimation
II. 3D interpreter
III. End-to-end fine-tuning, using a 3D-to-2D projection layer supervised by 2D keypoint labels
3D ESTIMATION: QUALITATIVE RESULTS
Training: our Keypoint-5 dataset, 2K images per category
Evaluation: the Keypoint-5 dataset, the IKEA dataset [Lim et al., '13], and the SUN database [Xiao et al., '11] (input vs. after fine-tuning)
CHAIR EMBEDDING
Manifold of chairs based on their inferred viewpoint
CONTRIBUTIONS OF 3D-INN
- Single-image 3D perception
- Real 2D labels + synthetic 3D models, connected via keypoints
- A 3D-to-2D projection layer for end-to-end training
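The projection layer is what lets 2D keypoint labels supervise 3D structure: predicted 3D keypoints are rendered back to 2D and compared against annotations. A minimal perspective-projection sketch (illustrative only; the actual 3D-INN layer also models rotation, translation, and camera parameters learned end to end):

```python
import numpy as np

def project_keypoints(points_3d, f=1.0, cx=0.0, cy=0.0):
    """Perspective-project Nx3 camera-frame keypoints to Nx2 image points.

    x = f * X / Z + cx,  y = f * Y / Z + cy
    (f, cx, cy are illustrative camera intrinsics.)
    """
    points_3d = np.asarray(points_3d, dtype=float)
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * X / Z + cx, f * Y / Z + cy], axis=1)
```

Because the projection is differentiable in its 3D inputs, gradients from a 2D keypoint loss can flow back through it to the 3D parameters.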
YATISH TURAKHIA, STANFORD
DARWIN: A GENOMICS CO-PROCESSOR
GENOME ANALYSIS PIPELINE
- Long reads (>10 Kbp) offer a better resolution of the mutation spectrum but have a high error rate (15-40%)
- >1,300 CPU hours for reference-guided assembly of noisy long reads
- >15,600 CPU hours for de novo assembly of noisy long reads
Pipeline: (1) patient DNA is sequenced into reads (e.g., ATGTCGAT, CGATACGA, ...); (2) reads are assembled against the genome (3 billion base pairs) via sequence alignment; (3) mutations are identified to find the disease-causing one.
Example alignment:
REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
PATIENT:   --ATGTCAATGAT-CAGAGGATATTAGGATAT-
DARWIN: SEQUENCE ALIGNMENT FRAMEWORK
Darwin: a hardware/software framework combining high speed and programmability. Software drives the co-processor (a 40 nm ASIC, 300 mm², 9 W) through two APIs: D-SOFT (seeding) and GACT (extension), each operating on a reference R and a query Q.
1. D-SOFT: tunable speed/sensitivity to match different error profiles
2. GACT: first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware; well-suited to long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware
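As an illustration of the seeding idea, here is a toy CPU sketch in the spirit of D-SOFT: hash reference k-mers, count query seed hits per diagonal band, and keep bands whose counts exceed a threshold. The function name and the parameters k, band, and threshold are hypothetical; the real D-SOFT counts matched bases in hardware and tunes these knobs for speed vs. sensitivity.

```python
from collections import defaultdict

def dsoft_candidates(ref, query, k=4, band=8, threshold=2):
    """Toy seed-and-filter: return diagonal bands with enough seed hits."""
    # Index every k-mer of the reference by position.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    # Count seed hits per diagonal band (diagonal = ref_pos - query_pos).
    counts = defaultdict(int)
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            counts[(i - j) // band] += 1
    # Bands passing the threshold become candidate alignment regions.
    return {b for b, c in counts.items() if c >= threshold}
```

Raising `threshold` (or `k`) trades sensitivity for speed, which is the tunability the slide refers to.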
DARWIN: REFERENCE-GUIDED ASSEMBLY
D-SOFT finds candidate alignment start locations (seed hits) for each ~10 Kbp read against the ~3 Gbp reference genome; GACT then extends from the first tile through successive tiles with trace-back, separating true alignments from spurious seeds by score (e.g., 7,500 vs. 60).
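Within each GACT tile the work is ordinary dynamic-programming alignment; the novelty is that the tile size, not the read length, bounds the memory. A minimal single-tile Smith-Waterman sketch (scoring values are illustrative, not Darwin's):

```python
def align_tile(ref, query, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman within one tile.

    Memory is O(len(ref) * len(query)), bounded by the tile size
    rather than by the full read length. Returns (best score, i, j).
    """
    n, m = len(ref), len(query)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best
```

Darwin chains such tiles, carrying each tile's trace-back endpoint into the next, so total memory stays constant in the read length.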
DARWIN: DE NOVO ASSEMBLY
The same D-SOFT seeding and GACT tiled extension with trace-back align reads against each other: high-scoring alignments (e.g., score 2,500) yield the inferred overlaps between reads.
40-100X speedup:
1. Sequential accesses to multiple DRAM channels
2. Random accesses using large on-chip memory (64 MB)
6,000X speedup:
1. 512 Processing Elements (PEs) solving 3 dynamic programming equations every cycle
2. Trace-back pointers maintained in on-chip memory (2 KB/PE)
DARWIN PERFORMANCE

Reference-guided assembly:
READ TYPE            ERROR RATE   SENSITIVITY (SOFTWARE / DARWIN)   DARWIN SPEEDUP
Pacific Biosciences  15%          95.95% / 99.91%                   4,110X
Oxford Nanopore 2D   30%          98.11% / 98.40%                   4,080X
Oxford Nanopore 1D   40%          97.10% / 97.40%                   128X

De novo assembly:
READ TYPE            ERROR RATE   SENSITIVITY (SOFTWARE / DARWIN)   DARWIN SPEEDUP
Pacific Biosciences  15%          99.80% / 99.89%                   250X
THANK YOU!
SAMAN ASHKIANI, UC DAVIS
DYNAMIC DATA STRUCTURES FOR THE GPU
DYNAMIC DATA STRUCTURES FOR THE GPU
- Objective: a general-purpose data structure that can be updated at runtime
- Supports updates (insert/deletion): batched or individual
- Efficient queries (lookup, count, range, etc.): batched or individual
- Motivation: more types of GPU data structures in programmer’s toolbox
- This is a challenging task, because
- GPUs have thousands of parallel threads: an efficient non-blocking data structure is needed
- Most classic non-blocking ideas are hard to implement efficiently in a SIMD fashion
- Efficient dynamic memory allocation is hard on GPUs
- Safe memory reclamation: no built-in dynamic memory management on GPUs
OUR IMPLEMENTATIONS

CONCURRENT HASH TABLE (hash table with chaining)
- Updates: concurrent insertion/deletion (497 M updates/s on K40)
- Queries: lookup (860 M queries/s on K40)
- Each bucket: a warp-friendly linked list
- Warp-synchronous programming to better fit the SIMD model
- Our own dynamic memory allocator for nodes
- Memory reclamation: safe removal of deleted nodes for future reuse

GPU LSM (dictionary data structure: multiple sorted arrays of different sizes)
- Updates: batch insertion/deletion (average: 225 M updates/s on K40)
- Queries: lookup, count, and range (133, 60, and 30 M queries/s)
- Based on radix sort, merge, and binary search
- Paper draft: http://ece.ucdavis.edu/~ashkiani/gpu_lsm.pdf
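The GPU LSM's "multiple sorted arrays" organization can be illustrated with a small CPU sketch (the class name and sizes are hypothetical; the GPU version performs the sort and merges with data-parallel radix-sort and merge primitives):

```python
import bisect

class TinyLSM:
    """CPU sketch of the LSM idea: a dictionary kept as sorted arrays of
    geometrically growing size; a batch insert sorts the batch, then
    cascading merges keep at most one array per level."""

    def __init__(self, batch=4):
        self.batch = batch   # fixed insert-batch size (level-0 array size)
        self.levels = []     # levels[i]: sorted array of size batch * 2**i, or None

    def insert_batch(self, keys):
        assert len(keys) == self.batch
        arr = sorted(keys)                 # radix sort on the GPU
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(arr)
                return
            if self.levels[i] is None:
                self.levels[i] = arr
                return
            # Two full arrays at level i merge into one at level i+1.
            arr = sorted(self.levels[i] + arr)   # parallel merge on the GPU
            self.levels[i] = None
            i += 1

    def lookup(self, key):
        for lvl in self.levels:            # binary search in each level
            if lvl:
                k = bisect.bisect_left(lvl, key)
                if k < len(lvl) and lvl[k] == key:
                    return True
        return False
```

Each level is searched with binary search, so a lookup costs O(log²n) comparisons overall while inserts stay batch-amortized, which is the trade-off behind the reported update vs. query rates.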
GANG WU, UNIVERSITY OF SUSSEX
HIGH-SPEED FLUORESCENCE LIFETIME IMAGING BASED ON ANN AND GPU
CONTENTS
- What is FLIM
- Project Aims
- FLIM Theories
- ANN-GPU-FLIM
- Results
WHAT IS FLIM
FLIM (fluorescence-lifetime imaging microscopy) is a technique for producing an image based on the differences in the exponential decay rate of the fluorescence from a fluorescent sample.
Applications:
- Surgery guidance
- Disease diagnosis
- Disease therapies (e.g., with gold nanorods)
PROJECT AIMS
Current systems: traditional CPU-based FLIM analysis is very slow (tens of minutes for one image).
Aims: high-speed FLIM analysis via
- a fast algorithm (artificial neural network)
- highly parallel hardware (GPU)
FLIM THEORIES
System: a laser excites the sample; the detector and TCSPC electronics record photon arrivals; lifetime analysis (this work) runs on the GPU, with a CPU-hosted GUI.
Figure: fluorescence decay histogram (photon counts per time bin, nanosecond-scale decay).
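For a mono-exponential decay N(t) = A·exp(-t/τ), the lifetime can be recovered from such a histogram by a linear fit to the log counts. A minimal sketch of this classical baseline (illustrative only, not the ANN method of this work; it ignores noise and the instrument response):

```python
import math

def mono_exp_lifetime(counts, bin_width):
    """Estimate a single-exponential lifetime from a decay histogram.

    Least-squares fit on log counts: log N(t) = log A - t / tau,
    so tau = -1 / slope.
    """
    ts = [i * bin_width for i in range(len(counts))]
    ys = [math.log(c) for c in counts]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return -1.0 / slope
```

Iterative per-pixel fits like this (or full least-squares methods such as LSM) are what make CPU analysis slow; the ANN replaces the fit with a single forward pass.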
ANN-GPU-FLIM
Principle: per-pixel photon-count histograms (166 time bins) from the FLIM data are fed to an artificial neural network which, after one-time training, outputs FLIM images directly.
RESULTS
Accuracy: different optimized areas, comparable performance (ANN vs. LSM).
Time performance (256×256 image):

ALGORITHM   TIME-CPU (S)   TIME-GPU (S)   SPEEDUP (GPU VS CPU)
ANN         0.89           0.1            8.9
LSM         41.5           3.8            10.8

Overall speedup (ANN on GPU vs. LSM on CPU): 415X
YONG HE, CMU
EVOLVING SHADER COMPILATION FOR PERFORMANCE AND MAINTAINABILITY
EVOLVING SHADER COMPILERS
Meeting Performance Goals with Productivity Constraints
Modern games feature increasingly realistic graphics, and a game's shader library has grown 100x more complex, but shading languages are still the same as ten years ago: they lack functionality for achieving high performance without compromising code modularity and extensibility.
AUTOMATIC APPROXIMATE SHADER COMPILATION
Goals: fast shader code, fast compilation.
A System for Rapid, Automatic Shader Level-of-Detail. Yong He, Tim Foley, Natalya Tatarchuk, Kayvon Fatahalian. SIGGRAPH Asia 2015.

SPIRE SHADING LANGUAGE
An IR for cross-stage shader code transformations; adds explorable optimizations.

SHADER COMPONENTS
A building block for flexible and extensible shading systems that maps efficiently to the new graphics APIs; adds fast CPU logic, flexibility, and extensibility.
Shader Components: Modular and High Performance Shader Development. Yong He, Tim Foley, Teguh Hofstee, Haomin Long, Kayvon Fatahalian. SIGGRAPH 2017 (to appear).
SUMMARY
Shader compiler and shading language design for performance and productivity goals. Future work: formally define the language semantics and unify each piece of work in a consistent system.
MINJIE WANG, NYU
BUILDING FLEXIBLE AND EFFICIENT DISTRIBUTED MACHINE LEARNING SYSTEMS
ABOUT ME
Third-year Ph.D. student at NYU
Advisor: Jinyang Li
Research field: distributed systems for machine learning
Projects:
RESEARCH WORK: TOFU
Benchmarks: a 5-layer MLP (hidden size 8192, batch size 512) and an Inception network on the CIFAR-10 dataset.
Data parallelism suffers from the batch-size dilemma: it requires a large batch size to scale, but a large batch size harms model accuracy.
RESEARCH WORK: TOFU
Other parallelisms (model, hybrid, and combined parallelism) exist but are hard to program.
TOFU AUTOMATICALLY PARALLELIZES DEEP LEARNING ALGORITHMS
Tofu unifies data and model parallelism as different distributed strategies for tensor operators, and uses an algorithm that finds the distributed strategy with the least communication cost. Results show much better scalability when the batch size is small.
OPEN SOURCE WORK: MINPY
Native NumPy support
- One-line code change
- Transparent fallback to NumPy
Dynamic autograd
- One call to compute gradients
- Supports data-dependent branches
- Supports Python's native if/while statements
Just-in-time optimization
- Optimizes the recorded graph (kernel fusion)
- Reuses the optimized graph if the next iteration has the same computation
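The "dynamic autograd" idea, recording operations as they execute so that native if/while control flow just works, can be shown with a minimal scalar reverse-mode sketch (class and function names here are hypothetical, not MinPy's API):

```python
class Var:
    """A scalar that records the operations applied to it."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent Var, local gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, g=1.0):
        # Accumulate and propagate gradients along the recorded tape.
        self.grad += g
        for parent, local in self.parents:
            parent.backward(g * local)

def grad(f, x):
    """One call to get df/dx at x; f may use native if/while statements."""
    v = Var(x)
    f(v).backward()
    return v.grad
```

Because the tape is rebuilt on every call, `f` can take a different branch for different inputs, e.g. `f = lambda v: v * v if v.value > 0 else v * 3`, and the gradient is correct in both branches.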
Announcing: The New 2017-2018 Grad Fellows And Finalists
NEW 2017-2018 GRAD FELLOWS
- Awni Hannun, Stanford
- Deepak Pathak, UC Berkeley
- Caroline Trippel, Princeton
- Fereshteh Sadeghi, Univ. Washington
- Ling-Qi Yan, UC Berkeley
- Abigail See, Stanford
NEW 2017-2018 GRAD FELLOWS
- Robin Betz, Stanford
- Xiaolong Wang, CMU
- Adams Wei Yu, CMU
- Anna Shcherbina, Stanford (NVIDIA Foundation Fellow)
- Robert Konrad, Stanford
NVIDIA FOUNDATION
- Compute The Cure Initiative to support research in cancer biology and treatment
- Focus on Bioinformatics, Genomics, Proteomics
- Six $200K Research Grants to nonprofit & academic labs since 2013
- Annual PhD Fellowships to promising researchers in related fields:
- http://www.nvidia.com/object/compute-the-cure.html
NVIDIA Foundation Fellows:
- 2016-2017: Gang Wu, AI for Fluorescence Lifetime Imaging
- 2017-2018: Anna Shcherbina, DL for Epigenetic Regulatory Mechanisms
NEW 2017-2018 GRAD FELLOW FINALISTS
- Daniel Thuerck, Technische Universität Darmstadt
- Leyuan Wang, University of California at Davis
- Mohammad Babaeizadeh, University of Illinois at Urbana-Champaign
- Philippe Tillet, Harvard University
- Dingzeyu Li, Columbia University