FAST FORWARD
Bill Dally, Chief Scientist and SVP Research, NVIDIA
Thursday, May 11, 2017
GRADUATE FELLOWSHIP PROGRAM
Funding for Ph.D. students revolutionizing disciplines with the GPU
Engage:
- Build mindshare
- Facilitate recruiting
Learn:
- Keep a finger on the pulse of leading academic research
- Keep up with all the applications that are powered by GPUs
Leverage:
- Track relevant research
- Help to guide researchers working on relevant problems
GRADUATE FELLOWSHIP PROGRAM
Eligibility/Application Process:
- Ph.D. candidates in at least their 2nd year
- Nomination(s) by Professor(s)/Advisor
- 1-2 page research proposal
Selection Process:
- A committee of NVIDIA scientists and engineers reviews applications
- Applications evaluated for originality, potential, and relevance
144 Graduate Fellowships ($3.8M) awarded since the program's inception in 2002
CURRENT 2016-2017 GRAD FELLOWS
- Saman Ashkiani, UC Davis
- Yong He, CMU
- Yatish Turakhia, Stanford
- Minjie Wang, NYU
- Jiajun Wu, MIT
- Gang Wu, Univ of Sussex (NVIDIA Foundation Fellow)
CURRENT 2016-2017 GRAD FELLOW FINALISTS
- Ahmed Elkholy, University of Illinois at Urbana-Champaign
- Achuta Kadambi, Massachusetts Institute of Technology
- Caroline Trippel, Princeton
- Yu-Hang Tang, Brown University
- Ling-Qi Yan, University of California at Berkeley
AGENDA
6 talks, 3 minutes each
JIAJUN WU, MIT
SINGLE IMAGE 3D INTERPRETER NETWORK
GOAL
3D INTERPRETER NETWORK (3D-INN)
Three-step training paradigm:
I. 2D keypoint estimation
II. 3D interpreter
III. End-to-end fine-tuning, using a 3D-to-2D projection layer supervised by 2D keypoint labels
3D ESTIMATION: QUALITATIVE RESULTS
Training: our Keypoint-5 dataset, 2K images per category
Evaluation: the Keypoint-5 dataset, the IKEA dataset [Lim et al., '13], and the SUN database [Xiao et al., '11] (input vs. after fine-tuning)
CHAIR EMBEDDING
Manifold of chairs based on their inferred viewpoint
CONTRIBUTIONS OF 3D-INN
- Single-image 3D perception
- Real 2D labels + synthetic 3D models, connected via keypoints
- A 3D-to-2D projection layer for end-to-end training
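The projection layer is what lets 2D keypoint labels supervise 3D structure: predicted 3D keypoints are rendered back to 2D and compared against annotations. A minimal perspective-projection sketch (illustrative only; the actual 3D-INN layer also models rotation, translation, and camera parameters learned end to end):

```python
import numpy as np

def project_keypoints(points_3d, f=1.0, cx=0.0, cy=0.0):
    """Perspective-project Nx3 camera-frame keypoints to Nx2 image points.

    x = f * X / Z + cx,  y = f * Y / Z + cy
    (f, cx, cy are illustrative camera intrinsics.)
    """
    points_3d = np.asarray(points_3d, dtype=float)
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * X / Z + cx, f * Y / Z + cy], axis=1)
```

Because the projection is differentiable in its 3D inputs, gradients from a 2D keypoint loss can flow back through it to the 3D parameters.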
YATISH TURAKHIA, STANFORD
DARWIN: A GENOMICS CO-PROCESSOR
GENOME ANALYSIS PIPELINE
- Long reads (>10 Kbp) offer a better resolution of the mutation spectrum but have a high error rate (15-40%)
- >1,300 CPU hours for reference-guided assembly of noisy long reads
- >15,600 CPU hours for de novo assembly of noisy long reads
Pipeline: (1) patient DNA is sequenced into reads (e.g., ATGTCGAT, CGATACGA, ...); (2) reads are assembled against the genome (3 billion base pairs) via sequence alignment; (3) mutations are identified to find the disease-causing one.
Example alignment:
REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
PATIENT:   --ATGTCAATGAT-CAGAGGATATTAGGATAT-
DARWIN: SEQUENCE ALIGNMENT FRAMEWORK
Darwin: a hardware/software framework combining high speed and programmability. Software drives the co-processor (a 40 nm ASIC, 300 mm², 9 W) through two APIs: D-SOFT (seeding) and GACT (extension), each operating on a reference R and a query Q.
1. D-SOFT: tunable speed/sensitivity to match different error profiles
2. GACT: first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware; well-suited to long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware
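As an illustration of the seeding idea, here is a toy CPU sketch in the spirit of D-SOFT: hash reference k-mers, count query seed hits per diagonal band, and keep bands whose counts exceed a threshold. The function name and the parameters k, band, and threshold are hypothetical; the real D-SOFT counts matched bases in hardware and tunes these knobs for speed vs. sensitivity.

```python
from collections import defaultdict

def dsoft_candidates(ref, query, k=4, band=8, threshold=2):
    """Toy seed-and-filter: return diagonal bands with enough seed hits."""
    # Index every k-mer of the reference by position.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    # Count seed hits per diagonal band (diagonal = ref_pos - query_pos).
    counts = defaultdict(int)
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            counts[(i - j) // band] += 1
    # Bands passing the threshold become candidate alignment regions.
    return {b for b, c in counts.items() if c >= threshold}
```

Raising `threshold` (or `k`) trades sensitivity for speed, which is the tunability the slide refers to.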
DARWIN: REFERENCE-GUIDED ASSEMBLY
D-SOFT finds candidate alignment start locations (seed hits) for each ~10 Kbp read against the ~3 Gbp reference genome; GACT then extends from the first tile through successive tiles with trace-back, separating true alignments from spurious seeds by score (e.g., 7,500 vs. 60).
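Within each GACT tile the work is ordinary dynamic-programming alignment; the novelty is that the tile size, not the read length, bounds the memory. A minimal single-tile Smith-Waterman sketch (scoring values are illustrative, not Darwin's):

```python
def align_tile(ref, query, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman within one tile.

    Memory is O(len(ref) * len(query)), bounded by the tile size
    rather than by the full read length. Returns (best score, i, j).
    """
    n, m = len(ref), len(query)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best
```

Darwin chains such tiles, carrying each tile's trace-back endpoint into the next, so total memory stays constant in the read length.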
DARWIN: DE NOVO ASSEMBLY
The same D-SOFT seeding and GACT tiled extension with trace-back align reads against each other: high-scoring alignments (e.g., score 2,500) yield the inferred overlaps between reads.
40-100X speedup:
1. Sequential accesses to multiple DRAM channels
2. Random accesses using large on-chip memory (64 MB)
6,000X speedup:
1. 512 Processing Elements (PEs) solving 3 dynamic programming equations every cycle
2. Trace-back pointers maintained in on-chip memory (2 KB/PE)
DARWIN PERFORMANCE

Reference-guided assembly:
READ TYPE            ERROR RATE   SENSITIVITY (SOFTWARE / DARWIN)   DARWIN SPEEDUP
Pacific Biosciences  15%          95.95% / 99.91%                   4,110X
Oxford Nanopore 2D   30%          98.11% / 98.40%                   4,080X
Oxford Nanopore 1D   40%          97.10% / 97.40%                   128X

De novo assembly:
READ TYPE            ERROR RATE   SENSITIVITY (SOFTWARE / DARWIN)   DARWIN SPEEDUP
Pacific Biosciences  15%          99.80% / 99.89%                   250X
THANK YOU!
SAMAN ASHKIANI, UC DAVIS
DYNAMIC DATA STRUCTURES FOR THE GPU
DYNAMIC DATA STRUCTURES FOR THE GPU
- Objective: a general-purpose data structure that can be updated at runtime
- Supports updates (insert/deletion): batched or individual
- Efficient queries (lookup, count, range, etc.): batched or individual
- Motivation: more types of GPU data structures in programmer’s toolbox
- This is a challenging task, because
- GPUs have thousands of parallel threads: an efficient non-blocking data structure is needed
- Most classic non-blocking ideas are hard to implement efficiently in a SIMD fashion
- Efficient dynamic memory allocation is hard on GPUs
- Safe memory reclamation: no built-in dynamic memory management on GPUs
OUR IMPLEMENTATIONS

CONCURRENT HASH TABLE (hash table with chaining)
- Updates: concurrent insertion/deletion (497 M updates/s on K40)
- Queries: lookup (860 M queries/s on K40)
- Each bucket: a warp-friendly linked list
- Warp-synchronous programming to better fit the SIMD model
- Our own dynamic memory allocator for nodes
- Memory reclamation: safe removal of deleted nodes for future reuse

GPU LSM (dictionary data structure: multiple sorted arrays of different sizes)
- Updates: batch insertion/deletion (average: 225 M updates/s on K40)
- Queries: lookup, count, and range (133, 60, and 30 M queries/s)
- Based on radix sort, merge, and binary search
- Paper draft: http://ece.ucdavis.edu/~ashkiani/gpu_lsm.pdf
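The GPU LSM's "multiple sorted arrays" organization can be illustrated with a small CPU sketch (the class name and sizes are hypothetical; the GPU version performs the sort and merges with data-parallel radix-sort and merge primitives):

```python
import bisect

class TinyLSM:
    """CPU sketch of the LSM idea: a dictionary kept as sorted arrays of
    geometrically growing size; a batch insert sorts the batch, then
    cascading merges keep at most one array per level."""

    def __init__(self, batch=4):
        self.batch = batch   # fixed insert-batch size (level-0 array size)
        self.levels = []     # levels[i]: sorted array of size batch * 2**i, or None

    def insert_batch(self, keys):
        assert len(keys) == self.batch
        arr = sorted(keys)                 # radix sort on the GPU
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(arr)
                return
            if self.levels[i] is None:
                self.levels[i] = arr
                return
            # Two full arrays at level i merge into one at level i+1.
            arr = sorted(self.levels[i] + arr)   # parallel merge on the GPU
            self.levels[i] = None
            i += 1

    def lookup(self, key):
        for lvl in self.levels:            # binary search in each level
            if lvl:
                k = bisect.bisect_left(lvl, key)
                if k < len(lvl) and lvl[k] == key:
                    return True
        return False
```

Each level is searched with binary search, so a lookup costs O(log²n) comparisons overall while inserts stay batch-amortized, which is the trade-off behind the reported update vs. query rates.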
GANG WU, UNIVERSITY OF SUSSEX
HIGH-SPEED FLUORESCENCE LIFETIME IMAGING BASED ON ANN AND GPU
CONTENTS
- What is FLIM
- Project Aims
- FLIM Theories
- ANN-GPU-FLIM
- Results
WHAT IS FLIM
FLIM (fluorescence-lifetime imaging microscopy) is a technique for producing an image based on the differences in the exponential decay rate of the fluorescence from a fluorescent sample.
Applications:
- Surgery guidance
- Disease diagnosis
- Disease therapies (e.g., with gold nanorods)
PROJECT AIMS
Current systems: traditional CPU-based FLIM analysis is very slow (tens of minutes for one image).
Aims: high-speed FLIM analysis via
- a fast algorithm (artificial neural network)
- highly parallel hardware (GPU)
FLIM THEORIES
System: a laser excites the sample; the detector and TCSPC electronics record photon arrivals; lifetime analysis (this work) runs on the GPU, with a CPU-hosted GUI.
Figure: fluorescence decay histogram (photon counts per time bin, nanosecond-scale decay).
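For a mono-exponential decay N(t) = A·exp(-t/τ), the lifetime can be recovered from such a histogram by a linear fit to the log counts. A minimal sketch of this classical baseline (illustrative only, not the ANN method of this work; it ignores noise and the instrument response):

```python
import math

def mono_exp_lifetime(counts, bin_width):
    """Estimate a single-exponential lifetime from a decay histogram.

    Least-squares fit on log counts: log N(t) = log A - t / tau,
    so tau = -1 / slope.
    """
    ts = [i * bin_width for i in range(len(counts))]
    ys = [math.log(c) for c in counts]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return -1.0 / slope
```

Iterative per-pixel fits like this (or full least-squares methods such as LSM) are what make CPU analysis slow; the ANN replaces the fit with a single forward pass.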
ANN-GPU-FLIM
Principle: per-pixel photon-count histograms (166 time bins) from the FLIM data are fed to an artificial neural network which, after one-time training, outputs FLIM images directly.
RESULTS
Accuracy: different optimized areas, comparable performance (ANN vs. LSM).
Time performance (256×256 image):

ALGORITHM   TIME-CPU (S)   TIME-GPU (S)   SPEEDUP (GPU VS CPU)
ANN         0.89           0.1            8.9
LSM         41.5           3.8            10.8

Overall speedup (ANN on GPU vs. LSM on CPU): 415X
YONG HE, CMU
EVOLVING SHADER COMPILATION FOR PERFORMANCE AND MAINTAINABILITY
EVOLVING SHADER COMPILERS
Meeting Performance Goals with Productivity Constraints
Modern games feature increasingly realistic graphics, and a game's shader library has grown 100x more complex, but shading languages are still the same as ten years ago: they lack functionality for achieving high performance without compromising code modularity and extensibility.
AUTOMATIC APPROXIMATE SHADER COMPILATION
Goals: fast shader code, fast compilation.
A System for Rapid, Automatic Shader Level-of-Detail. Yong He, Tim Foley, Natalya Tatarchuk, Kayvon Fatahalian. SIGGRAPH Asia 2015.

SPIRE SHADING LANGUAGE
An IR for cross-stage shader code transformations; adds explorable optimizations.

SHADER COMPONENTS
A building block for flexible and extensible shading systems that maps efficiently to the new graphics APIs; adds fast CPU logic, flexibility, and extensibility.
Shader Components: Modular and High Performance Shader Development. Yong He, Tim Foley, Teguh Hofstee, Haomin Long, Kayvon Fatahalian. SIGGRAPH 2017 (to appear).
SUMMARY
Shader compiler and shading language design for performance and productivity goals. Future work: formally define the language semantics and unify each piece of work in a consistent system.
MINJIE WANG, NYU
BUILDING FLEXIBLE AND EFFICIENT DISTRIBUTED MACHINE LEARNING SYSTEMS
ABOUT ME
Third-year Ph.D. student at NYU
Advisor: Jinyang Li
Research field: distributed systems for machine learning
Projects:
RESEARCH WORK: TOFU
Benchmarks: a 5-layer MLP (hidden size 8192, batch size 512) and an Inception network on the CIFAR-10 dataset.
Data parallelism suffers from the batch-size dilemma: it requires a large batch size to scale, but a large batch size harms model accuracy.
RESEARCH WORK: TOFU
Other parallelisms (model, hybrid, and combined parallelism) exist but are hard to program.
TOFU AUTOMATICALLY PARALLELIZES DEEP LEARNING ALGORITHMS
Tofu unifies data and model parallelism as different distributed strategies for tensor operators, and uses an algorithm that finds the distributed strategy with the least communication cost. Results show much better scalability when the batch size is small.
OPEN SOURCE WORK: MINPY
Native NumPy support
- One-line code change
- Transparent fallback to NumPy
Dynamic autograd
- One call to compute gradients
- Supports data-dependent branches
- Supports Python's native if/while statements
Just-in-time optimization
- Optimizes the recorded graph (kernel fusion)
- Reuses the optimized graph if the next iteration has the same computation
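The "dynamic autograd" idea, recording operations as they execute so that native if/while control flow just works, can be shown with a minimal scalar reverse-mode sketch (class and function names here are hypothetical, not MinPy's API):

```python
class Var:
    """A scalar that records the operations applied to it."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent Var, local gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, g=1.0):
        # Accumulate and propagate gradients along the recorded tape.
        self.grad += g
        for parent, local in self.parents:
            parent.backward(g * local)

def grad(f, x):
    """One call to get df/dx at x; f may use native if/while statements."""
    v = Var(x)
    f(v).backward()
    return v.grad
```

Because the tape is rebuilt on every call, `f` can take a different branch for different inputs, e.g. `f = lambda v: v * v if v.value > 0 else v * 3`, and the gradient is correct in both branches.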
Announcing: The New 2017-2018 Grad Fellows And Finalists
NEW 2017-2018 GRAD FELLOWS
- Awni Hannun, Stanford
- Deepak Pathak, UC Berkeley
- Caroline Trippel, Princeton
- Fereshteh Sadeghi, Univ. Washington
- Ling-Qi Yan, UC Berkeley
- Abigail See, Stanford
NEW 2017-2018 GRAD FELLOWS
- Robin Betz, Stanford
- Xiaolong Wang, CMU
- Adams Wei Yu, CMU
- Anna Shcherbina, Stanford (NVIDIA Foundation Fellow)
- Robert Konrad, Stanford
NVIDIA FOUNDATION
- Compute The Cure Initiative to support research in cancer biology and treatment
- Focus on Bioinformatics, Genomics, Proteomics
- Six $200K Research Grants to nonprofit & academic labs since 2013
- Annual PhD Fellowships to promising researchers in related fields:
- http://www.nvidia.com/object/compute-the-cure.html
NVIDIA Foundation Fellows:
- 2016-2017: Gang Wu, AI for Fluorescence Lifetime Imaging
- 2017-2018: Anna Shcherbina, DL for Epigenetic Regulatory Mechanisms
NEW 2017-2018 GRAD FELLOW FINALISTS
- Daniel Thuerck, Technische Universität Darmstadt
- Leyuan Wang, University of California at Davis
- Mohammad Babaeizadeh, University of Illinois at Urbana-Champaign
- Philippe Tillet, Harvard University
- Dingzeyu Li, Columbia University