Professor Media IC & System Lab Graduate Institute of - PowerPoint PPT Presentation

Shao-Yi Chien (簡韶逸), Professor, Media IC & System Lab, Graduate Institute of Electronics Engineering, National Taiwan University. Outline: AI edge: distributed intelligence; tensor transform for memory-efficient operations


  1. Shao-Yi Chien (簡韶逸) Professor Media IC & System Lab Graduate Institute of Electronics Engineering National Taiwan University

  2. Outline — AI edge: distributed intelligence — Tensor transform for memory-efficient operations — Implementation results — Conclusion

  3. Internet-of-AI-Things (diagram labels: AI, Big Data, IoT)

  4. Where Should Computing Be Located? — Data from the Internet: big data — Data from IoT: ultra-big data! — AI on the cloud? — AI on the edge? (diagram labels: Cloud Servers, Aggregator, Smart Devices)

  5. Distributed Intelligence (table residue; recoverable labels) AI Edge: Sensor, Aggregator/Gateway, Cloud. Data from each sensor: Large, Small. Process: Data Filtering, Context Inferring. Semantic level: High, Low. Learning/recognition engine: Light-Weight; Processors with HSA, NPU, DSP; Cloud Servers with CPU/GPU/FPGA; Neural Processors.

  6. Deep Learning Ecosystem: memory efficiency is the most important target for optimization

  7. Unroll: Fast and Simple

  8. Formulation of Unrolling
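The slide's own formulation did not survive extraction. As a minimal sketch of what unrolling usually means in this context, assuming a 1-D signal and a sliding window: row `r` of the unrolled matrix holds the window that starts at `r * stride`, and a convolution-style layer becomes an inner product of each row with the kernel. The names `unroll_1d` and `conv_by_unrolling` are illustrative and not taken from the talk.

```python
def unroll_1d(signal, window, stride=1):
    """Build the unrolled matrix: row r is signal[r*stride : r*stride + window]."""
    n_rows = (len(signal) - window) // stride + 1
    return [[signal[r * stride + c] for c in range(window)]
            for r in range(n_rows)]


def conv_by_unrolling(signal, kernel, stride=1):
    """CNN-style convolution (no kernel flip) as one inner product per unrolled row."""
    rows = unroll_1d(signal, len(kernel), stride)
    return [sum(x * w for x, w in zip(row, kernel)) for row in rows]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
unrolled = unroll_1d(signal, window=3)
result = conv_by_unrolling(signal, kernel)
print(unrolled)  # each input element appears in up to `window` rows
print(result)
```

Note that the unrolled matrix repeats input elements across rows, which is what makes the later slides about memory cost relevant.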

  9. Unroll: More than Conv.
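The examples on this slide were lost in extraction. As a hedged illustration of the title's point, the sketch below reuses one unroll pattern for two operations that are not convolutions: max pooling and a median filter. The choice of these two operations and the helper `apply_rowwise` are assumptions made for illustration.

```python
from statistics import median


def unroll_1d(signal, window, stride=1):
    """Row r of the unrolled matrix is signal[r*stride : r*stride + window]."""
    n_rows = (len(signal) - window) // stride + 1
    return [[signal[r * stride + c] for c in range(window)]
            for r in range(n_rows)]


def apply_rowwise(signal, window, stride, row_fn):
    """Unroll once, then reduce every row with an arbitrary function."""
    return [row_fn(row) for row in unroll_1d(signal, window, stride)]


signal = [4, 1, 7, 3, 9, 2]
max_pooled = apply_rowwise(signal, window=2, stride=2, row_fn=max)
median_filtered = apply_rowwise(signal, window=3, stride=1, row_fn=median)
print(max_pooled)
print(median_filtered)
```

Only the per-row function changes between the two calls; the data movement is identical.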

  10. Unrolling: Where and Who? — Where is the unrolling operation employed? — Everywhere in optimized parallel computing systems! — CPU, GPU, DSP, VPU, ASIC — Who executes unrolling in a system? — General-purpose processors: software developers need to handle it — VPU and ASIC: it is embedded in the hardware for specific applications

  11. Problem of Unrolling (diagram label: Main memory)
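The slide body is not recoverable, but a commonly cited cost of unrolling is that the unrolled matrix is much larger than the input and has to travel through main memory. Assuming that is the problem meant here, the arithmetic can be sketched as follows; the function name and the 224x224x3 example are illustrative assumptions.

```python
def unrolled_elements(height, width, channels, kernel, stride=1):
    """Element count of the unrolled matrix for a square kernel, no padding."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    rows = out_h * out_w                 # one row per output position
    cols = kernel * kernel * channels    # one column per kernel tap
    return rows * cols


# Illustrative example: a 224x224 RGB image with a 3x3 kernel, stride 1.
input_elements = 224 * 224 * 3
elements = unrolled_elements(224, 224, 3, kernel=3)
ratio = elements / input_elements
print(input_elements, elements, round(ratio, 2))
```

For a stride-1 window the expansion approaches the number of kernel positions (here close to 9x), so naive unrolling multiplies main-memory traffic by roughly that factor.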

  12. Unroll is a Fast Blackbox (diagram labels: Main memory, Unroll blackbox, Processors)

  13. Efficient Blackbox: Unroll as Late as Possible

  14. Naïve Unrolling

  15. Unroll at Shared Memory

  16. Unroll Upon Computation
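One way to read "unroll upon computation" is that the unrolled matrix is never stored: each unrolled element is addressed in the original buffer at the moment it is consumed. The sketch below shows that idea on a 1-D signal in plain Python. It is an assumption about the slide's intent, not the GPU implementation from the talk.

```python
def conv_unroll_on_the_fly(signal, kernel, stride=1):
    """Compute row-wise inner products without materializing the unrolled matrix."""
    window = len(kernel)
    n_rows = (len(signal) - window) // stride + 1
    out = []
    for r in range(n_rows):
        acc = 0
        for c in range(window):
            # The unrolled element (r, c) is read straight from the input.
            acc += signal[r * stride + c] * kernel[c]
        out.append(acc)
    return out


signal = [3, 1, 4, 1, 5, 9]
kernel = [1, 2, 1]
on_the_fly = conv_unroll_on_the_fly(signal, kernel)
strided = conv_unroll_on_the_fly(signal, kernel, stride=2)
print(on_the_fly)
print(strided)
```

The extra storage is a single accumulator per output, whereas materializing the unrolled matrix would need `n_rows * window` elements.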

  17. Useful Unrolling Framework Requires — Formulation of unrolling — Build algorithms by unrolling — DNN — CV, ML — … — Memory efficient unrolling — GPUs — ASICs

  18. UMI (Unrolled Memory Inner-Products) Operator — You simply write code for — Describing the unroll pattern and — Defining what to do for each row. — An efficient blackbox makes your code fast.
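The two responsibilities listed above (describe the unroll pattern, define the per-row work) can be sketched as a small higher-order function. This is a hypothetical Python stand-in written for illustration: `umi_like`, its parameters, and the offset-list pattern format are assumptions and do not reproduce the published UMI interface.

```python
def umi_like(image, offsets, out_h, out_w, row_fn, stride=1):
    """Hypothetical UMI-style helper (not the real UMI API).

    offsets: the unroll pattern, as (dy, dx) pairs relative to each output position.
    row_fn:  what to do with each unrolled row.
    """
    result = []
    for y in range(out_h):
        out_row = []
        for x in range(out_w):
            row = [image[y * stride + dy][x * stride + dx] for dy, dx in offsets]
            out_row.append(row_fn(row))
        result.append(out_row)
    return result


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# Unroll pattern: a 2x2 window anchored at each output position.
offsets = [(dy, dx) for dy in range(2) for dx in range(2)]
box_sum = umi_like(image, offsets, out_h=2, out_w=2, row_fn=sum)
box_max = umi_like(image, offsets, out_h=2, out_w=2, row_fn=max)
print(box_sum)
print(box_max)
```

The caller never writes the loops or the indexing, which is the separation of concerns the slide describes; the efficient part would live inside the blackbox.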

  19. Memory Efficient Unrolling — Smooth dataflow must consider: 1. DRAM reuse 2. Bank conflict — Both can be analyzed by a formula (shown on the slide)
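The slide's formula is not preserved in this transcript. As a generic illustration of the bank-conflict concern only, the sketch below uses the textbook shared-memory model in which consecutive words map to consecutive banks, and counts how many distinct words a group of threads requests from the same bank. The 32-thread, 32-bank defaults and the function name are assumptions, and this is not the analysis from the talk.

```python
from collections import defaultdict


def bank_conflict_degree(stride_words, n_threads=32, n_banks=32):
    """Worst-case number of distinct words requested from a single bank."""
    banks = defaultdict(set)
    for t in range(n_threads):
        word = t * stride_words          # word index accessed by thread t
        banks[word % n_banks].add(word)  # consecutive words sit in consecutive banks
    return max(len(words) for words in banks.values())


degrees = {s: bank_conflict_degree(s) for s in (1, 2, 32, 33)}
print(degrees)
```

Under this model a stride that shares a factor with the bank count serializes accesses, while an odd stride such as 33 spreads them across all banks.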

  20. UMI: Experimental Results — UMI blackbox; Baseline: OpenCV, Parboil and Caffe — CUDA version is available on GitHub — Code reduction 2--4x — Speed-up 1.4--26x — Hardware implementation is coming soon — Ref: Y. S. Lin, W. C. Chen and S. Y. Chien, "Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations," ICCV 2017.

  21. ASIC Design — TAU: 32-core parallel processor — Scaled up linearly

  22. Conclusion — AI edge: distributed intelligence — Memory access optimization is the key to efficient CNN computing — Unrolling plays an important role in memory optimization, which can also benefit other operations — An unrolling framework, tensor transform for memory-efficient operations, is developed to decouple unrolling operations — Implementation results: code reduction 2--4x; speed-up 1.4--26x

  23. Using UMI Operator is…
