SLIDE 1

CS 744: TPU

Shivaram Venkataraman Fall 2020

Good morning!

SLIDE 2

Administrivia

Midterm 2, Dec 3rd
– Papers from SCOPE to TPU
– Similar format etc.
Presentations Dec 8, 10
– Sign up sheet
– Presentation template
– Trial run?

Next Tue: Fairness in ML, Course summary
Thu: Midterm 2 (Piazza)
Week after: project presentations
  • 4 min talks
  • 3 to 4 slides
  • Problem statement
  • Approach
  • Initial results
  • In-progress / final report: Dec 17th

SLIDE 3

MOTIVATION

Capacity demands on datacenters
New workloads
Metrics
– Power/operation
– Performance/operation
– Total cost of ownership
Goal: Improve cost-performance by 10x over GPUs

e.g., voice search: an ML model to convert speech to text
Latency (or tput?) → these workloads are latency sensitive
Total cost of ownership: cost to buy (build) + cost to operate

SLIDE 4

WORKLOAD

DNN: RankBrain
LSTM: subset of GNM Translate
CNNs: Inception, DeepMind AlphaGo

CNNs are only 5% of the workload; MLPs are 61%
Number of weights is not correlated with batch size and ops
CNNs have very high ops/byte
MLPs & LSTMs share the same ops/byte and batch size

SLIDE 5

WORKLOAD: ML INFERENCE

Quantization → lower precision, lower energy use
8-bit integer multiplies (unlike training): 6X less energy and 6X less area
Need for predictable latency, not throughput
e.g., 7ms at 99th percentile

Convert model weights from 32-bit float to 8-bit integer
Focus on inference: caches and branch prediction improve the average-case scenario only!
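
As a rough illustration of the 32-bit float to 8-bit integer conversion noted above, here is a minimal sketch of symmetric linear quantization; the scale choice and rounding scheme are assumptions, not the paper's exact recipe:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of fp32 weights to int8 (a sketch)."""
    scale = np.abs(weights).max() / 127.0       # map largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp32 values, to inspect the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```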

SLIDE 6

TPU DESIGN: CONTROL

PCIe interconnect for compatibility
↳ PCIe has limited bandwidth & latency
Instructions are issued / queued from the host
↳ instruction buffer
Simple: single thread!

SLIDE 7

COMPUTE

8-bit (values 0-255) and 16-bit operands
Matrix Multiply Unit
↳ fully connected layers, convolutions
24% of the chip area
MACs: Multiply + Accumulate → 8-bit integer or 16-bit range
Separate compute unit for Activation & Normalize/Pool
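
A minimal numpy sketch of what one pass through the MAC array computes: 8-bit integer inputs accumulated at higher precision. The 256 × 256 tile size and int32 accumulators are assumptions for this sketch (the annotation above only mentions 8-bit and 16-bit operands):

```python
import numpy as np

# 8-bit integer operands, as in the Matrix Multiply Unit's MACs.
a = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
w = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)

# Each MAC multiplies 8-bit values (16-bit products) and accumulates at
# wider precision so partial sums don't overflow; int32 is an assumption.
acc = a.astype(np.int32) @ w.astype(np.int32)
print(acc.shape, acc.dtype)   # (256, 256) int32
```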

SLIDE 8

DATA

A X

=

B

8GB

in size

Models

can

=

=

fit here

.
  • OO

C)

Models (or weights )

are

stored

in

  • ff
  • chip
  • €7

DRAM arts)

  • "

a::m÷::'

DO

unified

buffer

  • I

t

d

(3)

pipeline fetching

  • f

weights

with

matrix

multiply

\

Intermediate

results

accumulated

and

then

stored

in

Unified Buffer
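
A toy sketch of point (3): double buffering so the next weight tile is "fetched" while the current tile is being multiplied. The tiling and function shape are invented purely to illustrate the overlap; real hardware overlaps a DRAM transfer with the matrix unit:

```python
import numpy as np

def matmul_pipelined(x_tiles, w_tiles):
    """Double-buffered matmul over tiles: prefetch the next weight tile,
    compute with the current one, then swap. A software sketch of the
    fetch/compute overlap, not a hardware model."""
    current = w_tiles[0]                      # prefetch the first tile
    results = []
    for i, x in enumerate(x_tiles):
        nxt = w_tiles[i + 1] if i + 1 < len(w_tiles) else None  # fetch ahead
        results.append(x @ current)           # multiply with current tile
        if nxt is not None:
            current = nxt                     # swap buffers
    return results

x_tiles = [np.random.randn(4, 8) for _ in range(3)]
w_tiles = [np.random.randn(8, 8) for _ in range(3)]
print([r.shape for r in matmul_pipelined(x_tiles, w_tiles)])
```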

SLIDE 9

INSTRUCTIONS

CISC format (why?)
1. Read_Host_Memory
2. Read_Weights
3. MatrixMultiply/Convolve
4. Activate
5. Write_Host_Memory

→ Specialized instruction set
CISC instructions encode operations that take many cycles to run
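
A hypothetical host-side sketch of how these five CISC instructions might be issued for one inference; `tpu` and every method on it are invented names for illustration, not a real driver API:

```python
def run_inference(tpu, inputs, weights):
    """Issue the five TPU CISC instructions for a single inference pass.

    Each call below maps to one instruction from the slide; the object and
    method names are hypothetical stand-ins for a host-side driver.
    """
    tpu.read_host_memory(inputs)     # 1. copy activations host -> Unified Buffer
    tpu.read_weights(weights)        # 2. stream weights DRAM -> matrix unit
    tpu.matrix_multiply()            # 3. many-cycle matmul / convolution
    tpu.activate()                   # 4. apply nonlinearity (ReLU etc.)
    return tpu.write_host_memory()   # 5. copy results back to the host
```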

SLIDE 10

SYSTOLIC EXECUTION

Problem: Reading a large SRAM uses much more power than arithmetic!

Typical CPU: caches (L1, L2, etc.) or registers hold the inputs to compute units
TPU: wave-like propagation of data through the array
Data reuse for every element
→ predictable execution, predictable performance
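
To make "wave-like propagation" concrete, here is a toy cycle-by-cycle simulation of a weight-stationary systolic array computing y = x @ W. The dataflow follows the standard textbook picture (activations skewed in from the left, partial sums flowing down); it is not a cycle-accurate model of the real MXU:

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic array computing y = x @ W.

    PE(i, j) pins W[i, j]; activations enter row i at cycle i and move
    right, partial sums move down, and column j's result drains out of
    the bottom at cycle n - 1 + j.
    """
    n = W.shape[0]
    a = np.zeros((n, n))       # activation register in each PE (flows right)
    p = np.zeros((n, n))       # partial-sum register in each PE (flows down)
    y = np.zeros(n)
    for t in range(2 * n):
        new_a = np.zeros((n, n))
        new_p = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # Skewed injection: row i receives x[i] at cycle i.
                act = (x[i] if t == i else 0.0) if j == 0 else a[i, j - 1]
                psum = 0.0 if i == 0 else p[i - 1, j]
                new_a[i, j] = act                     # pass activation right
                new_p[i, j] = psum + act * W[i, j]    # multiply-accumulate
        for j in range(n):
            if t == n - 1 + j:                        # column j finishes now
                y[j] = new_p[n - 1, j]
        a, p = new_a, new_p
    return y

W = np.arange(9, dtype=float).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, x))   # [24. 30. 36.], matches x @ W
```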

SLIDE 11

ROOFLINE MODEL

Axes: Operational Intensity (MAC Ops/weight byte) vs. TeraOps/sec

x-axis: operational intensity, i.e., the amount of compute per byte of data read
y-axis: operations / second
The blue roofline comes from the hardware spec: a slope part (memory bandwidth) and a flat part (peak compute)
Slope part: memory bound; flat part: compute bound
CNNs are compute intensive
LSTMs & MLPs are close to the peak perf of the hardware
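
The roofline itself is just the minimum of two bounds. A minimal sketch follows; the peak and bandwidth constants are roughly the TPU v1 figures from the paper (92 TeraOps/s, 34 GB/s), but treat them as illustrative placeholders:

```python
def roofline(intensity, peak_ops_per_s, mem_bw_bytes_per_s):
    """Attainable ops/s at a given operational intensity (ops per byte):
    the memory roof (bw * intensity) until it hits the compute roof."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)

PEAK = 92e12   # ops/s, flat part of the roofline (approx. TPU v1 peak)
BW = 34e9      # bytes/s, slope part (approx. TPU v1 DRAM bandwidth)

for oi in [1, 10, 100, 1000, 10000]:
    print(f"OI={oi:>5}: {roofline(oi, PEAK, BW) / 1e12:6.2f} TeraOps/s")
```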

SLIDE 12

HASWELL ROOFLINE

Axes: TeraOps/sec vs. Operational Intensity (MAC Ops/weight byte)

The CPU roofline's slope ends at ~10 ops/weight byte
The measured points sit far away from the roofline
Much lower TeraOps/second

SLIDE 13

COMPARISON WITH CPU, GPU

Off-chip memory accesses (vs. caches L1, L2, L3) draw much higher power
GPUs bring down power (~2x lower); the configured CPU and RAM are still used when idle
TPUs: not so much

SLIDE 14

SELECTED LESSONS

  • Latency more important than throughput for inference
  • LSTMs and MLPs are more common than CNNs
  • Performance counters are helpful
  • Remember architecture history

↳ performance counters also help improve compilers for DNN models

SLIDE 15

SUMMARY

New workloads → new hardware requirements
Domain specific design (understand workloads!)
– No features to improve the average case
– No caches, branch prediction, out-of-order execution etc.
– Simple design with MACs, Unified Buffer gives efficiency
Drawbacks
– No sparse support, no training support (TPU v2, v3)
– Vendor specific?

SLIDE 16

DISCUSSION

https://forms.gle/tss99VSCMeMjZx7P6

SLIDE 17

① Larger batches have higher tput but also higher tail latency
[plot: inferences/sec vs. batch size]
② TPU tputs are much higher
③ The TPU can run at a higher batch size while meeting the 7ms latency target
The GPU has higher IPS at the same batch size compared to the CPU (relates to avg. latency)
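
A toy model of point ①: one fixed cost per batch plus a per-item cost makes throughput rise with batch size while latency grows. The constants are invented for illustration, not measured TPU numbers:

```python
def batch_stats(batch_size, fixed_overhead_ms=2.0, per_item_ms=0.05):
    """Toy latency/throughput model: a fixed cost per batch plus a
    per-item cost. Constants are made up, not measured."""
    latency_ms = fixed_overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)   # inferences/sec
    return latency_ms, throughput

for b in [1, 8, 64, 256]:
    lat, tput = batch_stats(b)
    print(f"batch={b:4d}  latency={lat:6.2f} ms  tput={tput:9.0f} inf/s")
# With these constants, batches beyond ~100 blow the 7ms latency target
# even though throughput keeps rising.
```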

SLIDE 18

How would TPUs impact serving frameworks like Clipper? Discuss what specific effects it could have on distributed serving systems architecture

TPUs have 8 GB, enough to share across many models
↳ but this might break containers in Clipper
Stragglers are less frequent
Batching (auto-batching) can be very helpful

SLIDE 19

NEXT STEPS

No class Thursday! Happy Thanksgiving!
Next week schedule:
Tue: Fairness in ML, Summary
Thu: Midterm 2

SLIDE 20

ENERGY PROPORTIONALITY