
CS 744: TPU. Shivaram Venkataraman, Fall 2020 (lecture slides)



  1. Good morning! CS 744: TPU. Shivaram Venkataraman, Fall 2020

  2. Administrivia
     Next Tue: Fairness in ML, course summary. Next Thu: Midterm 2.
     Midterm 2, Dec 3rd
     – Papers from SCOPE to TPU
     – Similar format etc. (see Piazza)
     Course project presentations the week after: Dec 8, 10
     – 4 min talks; sign-up sheet; 3 to 4 slides
     – Presentation template: problem statement, approach, initial results. Trial run?
     Dec 17: in-progress / final report

  3. MOTIVATION
     Capacity demands on datacenters: new workloads, e.g., ML models to convert speech to text (Voice Search).
     Metrics: workloads are latency sensitive → latency (or throughput), performance/operation, power/operation, total cost of ownership (buy/build + operate).
     Goal: Improve cost-performance by 10x over GPUs

  4. WORKLOAD
     Not only CNNs: MLPs are 61% of the inference workload, CNNs only 5%.
     Number of weights and ops varies widely across models; batch size and ops/byte matter: CNNs have very high ops/byte, while MLPs & LSTMs have low (and similar) ops per weight byte.
     MLP: RankBrain; LSTM: subset of GNM Translate; CNNs: Inception, DeepMind AlphaGo

  5. WORKLOAD: ML INFERENCE
     Quantization: convert model weights from 32-bit float to 8-bit integer → lower precision, lower energy use.
     8-bit integer multiplies (unlike training): 6X less energy and 6X less area.
     Need for predictable latency, not throughput, e.g., 7ms at the 99th percentile.
     Focus on inference: caches and branch prediction only improve the average case!
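A minimal sketch of the quantization step above, assuming a simple symmetric linear scheme (the slide does not specify the exact scheme the TPU uses):

```python
import numpy as np

def quantize_int8(w):
    # Map float32 weights onto [-127, 127] with one scale per tensor.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, s))))  # small quantization error
```

Multiplies then run on the int8 values, which is where the 6X energy and area savings come from.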

  6. TPU DESIGN: CONTROL
     Connects to the host over PCIe for compatibility; PCIe has limited bandwidth & latency.
     Instructions are issued and queued from the host into an instruction buffer.
     Simple: single threaded!

  7. COMPUTE
     Matrix Multiply Unit: a 256x256 array of 8-bit MACs (Multiply-Accumulate units), about 24% of the chip area.
     Handles fully connected layers and convolutions.
     8-bit (or 16-bit) integer compute.
     Separate unit for Activation & Normalize/Pool.
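To make the MAC term concrete, here is a toy multiply-accumulate in numpy: 8-bit operands with a wider accumulator so partial sums do not overflow (sizes are illustrative, not the real 256x256 unit):

```python
import numpy as np

a = np.random.randint(-128, 128, size=(4,), dtype=np.int8)  # activations
w = np.random.randint(-128, 128, size=(4,), dtype=np.int8)  # weights

acc = np.int32(0)
for x, y in zip(a, w):
    acc += np.int32(x) * np.int32(y)  # one multiply-accumulate step
print(acc, a.astype(np.int32) @ w.astype(np.int32))  # same result
```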

  8. DATA
     A x B = C matrix multiply; models up to 8GB in size can fit.
     Models (or weights) are stored in off-chip DRAM.
     Fetching of weights is pipelined with the matrix multiply.
     Intermediate results are accumulated and then stored in the Unified Buffer.
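A sketch of this dataflow under hypothetical names: weight tiles stream in one at a time (standing in for the pipelined DRAM fetch) while partial products accumulate into a unified_buffer array:

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    n, k = A.shape
    _, m = B.shape
    unified_buffer = np.zeros((n, m), dtype=np.int32)
    for t in range(0, k, tile):
        w_tile = B[t:t+tile, :].astype(np.int32)   # fetch next weight tile
        unified_buffer += A[:, t:t+tile].astype(np.int32) @ w_tile
    return unified_buffer

A = np.random.randint(-128, 128, (4, 8), dtype=np.int8)
B = np.random.randint(-128, 128, (8, 4), dtype=np.int8)
assert (tiled_matmul(A, B) == A.astype(np.int32) @ B.astype(np.int32)).all()
```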

  9. INSTRUCTIONS
     Specialized CISC instruction set (why?):
     1. Read_Host_Memory
     2. Read_Weights
     3. MatrixMultiply/Convolve
     4. Activate
     5. Write_Host_Memory
     CISC → instructions encode operations that take many cycles to run.
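The five instruction names come from the slide; the host driver loop below is a hypothetical sketch of the order they might be issued for one inference:

```python
from enum import Enum

class TPUOp(Enum):
    READ_HOST_MEMORY = 1   # copy inputs from the host into the Unified Buffer
    READ_WEIGHTS = 2       # stream weights from DRAM into the matrix unit
    MATRIX_MULTIPLY = 3    # MatrixMultiply/Convolve: many cycles per instruction
    ACTIVATE = 4           # apply the nonlinearity
    WRITE_HOST_MEMORY = 5  # copy results back to the host

def one_inference():
    return [TPUOp.READ_HOST_MEMORY, TPUOp.READ_WEIGHTS,
            TPUOp.MATRIX_MULTIPLY, TPUOp.ACTIVATE, TPUOp.WRITE_HOST_MEMORY]

print([op.name for op in one_inference()])
```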

  10. SYSTOLIC EXECUTION
     Problem: Reading a large SRAM uses much more power than arithmetic!
     Typical CPU: compute units have registers and caches (L1, L2 etc.) as inputs.
     TPU: wave-like propagation of data → data reuse for every element → predictable execution, predictable performance.
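A toy cycle-level simulation of the wave-like dataflow, assuming an output-stationary array for simplicity (the real TPU matrix unit is weight-stationary, so treat this as a sketch of the general technique):

```python
import numpy as np

def systolic_matmul(A, B):
    """Each PE(i, j) multiplies the operand arriving from the left by the
    operand arriving from above, adds the product to a local accumulator,
    and passes both operands on: no large SRAM reads inside the array."""
    n, k = A.shape
    _, m = B.shape
    acc = np.zeros((n, m), dtype=np.int64)  # one accumulator per PE
    h = np.zeros((n, m), dtype=np.int64)    # operands flowing rightward
    v = np.zeros((n, m), dtype=np.int64)    # operands flowing downward
    for t in range(n + m + k):              # enough cycles to drain the array
        h = np.roll(h, 1, axis=1)           # pass A operands to the right neighbor
        v = np.roll(v, 1, axis=0)           # pass B operands to the neighbor below
        for i in range(n):                  # inject row i of A, skewed by i cycles
            h[i, 0] = A[i, t - i] if 0 <= t - i < k else 0
        for j in range(m):                  # inject column j of B, skewed by j cycles
            v[0, j] = B[t - j, j] if 0 <= t - j < k else 0
        acc += h * v                        # every PE does one MAC per cycle
    return acc

A = np.random.randint(-10, 10, (3, 5))
B = np.random.randint(-10, 10, (5, 4))
assert (systolic_matmul(A, B) == A @ B).all()  # matches a plain matmul
```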

  11. ROOFLINE MODEL
     x-axis: operational intensity (MAC Ops/weight byte): the amount of compute per byte of data read.
     y-axis: TeraOps/second.
     The blue line comes from the hardware spec: the sloped part is memory-bandwidth bound, the flat part is compute bound.
     CNNs are compute bound and close to the hardware's peak performance; MLPs & LSTMs are memory bound.
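The roofline itself is just min(peak compute, bandwidth x intensity). A small sketch; the two constants are assumptions for illustration, roughly the TPU paper's 92 TeraOps/s peak and 34 GB/s weight-memory bandwidth:

```python
PEAK_TOPS = 92.0   # assumed compute roof, TeraOps/sec
BW_GBS = 34.0      # assumed weight memory bandwidth, GB/sec

def attainable_tops(ops_per_weight_byte):
    memory_roof = BW_GBS * ops_per_weight_byte / 1000.0  # GOps/s -> TeraOps/s
    return min(PEAK_TOPS, memory_roof)

for oi in [10, 100, 1000, 10000]:
    print(oi, attainable_tops(oi))
# Low intensity (MLP/LSTM-like) lands on the slope: memory bound.
# High intensity (CNN-like) hits the flat roof: compute bound.
```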

  12. HASWELL ROOFLINE
     The CPU roofline's sloped part ends at a low ops/weight byte.
     A number of points sit away from (below) the roofline.
     Much lower TeraOps/second than the TPU.

  13. COMPARISON WITH CPU, GPU
     CPUs and GPUs spend area and power on caches (L1, L2, L3) and off-chip memory traffic, leading to higher power per operation.
     GPUs bring power down ~2x when idle; CPUs, as configured, not so much compared to the TPU.

  14. SELECTED LESSONS
     • Latency more important than throughput for inference
     • LSTMs and MLPs are more common than CNNs
     • Performance counters are helpful → also to improve DNN models and compilers
     • Remember architecture history

  15. SUMMARY
     New workloads → new hardware requirements.
     Domain specific design (understand workloads!)
     No features to improve the average case: no caches, branch prediction, out-of-order execution etc.
     Simple design with MACs and a Unified Buffer gives efficiency.
     Drawbacks: no sparse support, no training support (addressed in TPU v2, v3). Vendor specific?

  16. DISCUSSION https://forms.gle/tss99VSCMeMjZx7P6

  17. ① Larger batches have higher throughput (inferences/sec) but also higher tail latency.
     ② TPU achieves much higher IPS while meeting the target latency of 7ms.
     ③ TPU can run at a higher batch size compared to the CPU while meeting 7ms.
     ④ GPU has higher IPS at the same batch size but higher latency → relates to average, not tail, latency.
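A toy model of the tradeoff in point ①; both cost constants are made up for illustration:

```python
FIXED_OVERHEAD_MS = 2.0   # hypothetical per-batch launch cost
PER_ITEM_MS = 0.05        # hypothetical marginal cost per example

def latency_ms(batch):
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * batch

def throughput_ips(batch):
    return batch / (latency_ms(batch) / 1000.0)

for b in [1, 16, 64, 256]:
    print(b, round(latency_ms(b), 2), round(throughput_ips(b)))
# Larger batches raise inferences/sec but push latency past a 7ms target.
```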

  18. How would TPUs impact serving frameworks like Clipper? Discuss what specific effects it could have on distributed serving systems architecture.
     ① Many models have to share a TPU's 8GB memory; this might break Clipper's models-in-containers design.
     ② Stragglers are less frequent.
     ③ (Auto-)batching can be very helpful.

  19. NEXT STEPS No class Thursday! Happy Thanksgiving! Next week schedule: Tue: Fairness in ML, Summary Thu: Midterm 2

  20. ENERGY PROPORTIONALITY
