
ARM big.LITTLE Technology - PowerPoint PPT Presentation

Advanced Seminar Computer Engineering, Philipp Gsching, 08.12.2015


  1. ARM big.LITTLE Technology. Advanced Seminar Computer Engineering, Philipp Gsching, 08.12.2015

  2. Outline:
     1. Introduction
     2. ARM Architecture
        2.1 Instruction Set
        2.2 Microarchitecture
        2.3 CPUs
     3. big.LITTLE
        3.1 Cache Coherency
        3.2 Distributed Virtual Memory
        3.3 Performance
     4. Conclusion

  3. Smartphone/tablet use cases:
     1. Idle most of the time → low-power CPU
     2. High-performance requirements → high-performance CPU
     → Difficult to achieve with one CPU

  4. Idea: ARM big.LITTLE. Fusing a low-power and a high-performance CPU in one chip.
     • big (Cortex A57, cores 1 to 4, own L2 cache): gaming, HD videos, rich web, Internet, …
     • LITTLE (Cortex A53, cores 1 to 4, own L2 cache): OS, UI, services, e-mail, …
     • Both clusters are connected through a Cache Coherent Interconnect.

  5. Basics

  6. ARM (Advanced RISC Machines):
     • Founded: 1990 by Acorn, Apple and VLSI
     • Origin: microcontrollers / embedded systems
     • Business model: design and licensing of intellectual property (IP)
     • Revenue: 1.2 billion USD (Intel: 55.8 billion USD)
     • Employees: 3,300 (Intel: 106,700)
     • Market share: > 90% (2014, smartphone/tablet)

  7. ARM Instruction Set: RISC (Reduced Instruction Set Computing)

  8. RISC (ARM) vs. CISC (IA-32), the same memory increment in both instruction sets:
     RISC (ARM):
        MOV r2, #8
        MUL r1, r1, r2
        ADD r0, r0, r1
        ADD r0, r0, #4
        LDR r3, [r0]
        ADD r3, r3, #1
        STR r3, [r0]
     CISC (IA-32):
        ADD $1, 4(%eax, %ebx, 8)

  9. ARM Instruction Set: RISC (Reduced Instruction Set Computing)
     • 16 general purpose registers + 2 status registers
     • 32-bit fixed-size instructions
     • Condition codes for (almost) all instructions
     • Barrel shifter for ALU
     • 16-bit fixed-size THUMB instructions
     • Digital Signal Processing (DSP) instructions
     • Cryptography extension instructions
     → Not strictly RISC

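  10. RISC (ARM) vs. CISC (IA-32), continued. With the barrel shifter (LSL #3) and pre-indexed addressing with write-back ([r0, #4]!), the seven-instruction ARM sequence shrinks to four instructions:
        ADD r0, r0, r1, LSL #3
        LDR r3, [r0, #4]!
        ADD r3, #1
        STR r3, [r0]
      On the CISC side, the single instruction ADD $1, 4(%eax, %ebx, 8) is executed via microcode.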
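The equivalence of the two ARM sequences on these slides can be checked with a small emulation. This is only a sketch: registers are plain Python integers, memory is a dictionary, 32-bit wrap-around is ignored, and the function names `long_sequence` and `short_sequence` are made up for illustration.

```python
def long_sequence(r0, r1, mem):
    """Seven-instruction version: explicit multiply and offset add."""
    r2 = 8            # MOV r2, #8
    r1 = r1 * r2      # MUL r1, r1, r2
    r0 = r0 + r1      # ADD r0, r0, r1
    r0 = r0 + 4       # ADD r0, r0, #4
    r3 = mem[r0]      # LDR r3, [r0]
    r3 = r3 + 1       # ADD r3, r3, #1
    mem[r0] = r3      # STR r3, [r0]
    return r0


def short_sequence(r0, r1, mem):
    """Four-instruction version: shifted operand and pre-indexed load."""
    r0 = r0 + (r1 << 3)   # ADD r0, r0, r1, LSL #3
    r0 = r0 + 4           # LDR r3, [r0, #4]!  (write-back updates r0 first)
    r3 = mem[r0]          #                    (then the load itself)
    r3 = r3 + 1           # ADD r3, #1
    mem[r0] = r3          # STR r3, [r0]
    return r0


# Hypothetical inputs: base address 0x1000, element index 3.
mem_a = {0x1000 + 3 * 8 + 4: 41}
mem_b = dict(mem_a)
addr_a = long_sequence(0x1000, 3, mem_a)
addr_b = short_sequence(0x1000, 3, mem_b)
```

Both functions end with the same address in `r0` and the same incremented word in memory, which is also what the single IA-32 instruction does for `%eax` as base and `%ebx` as index.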

  11. Instruction Set Architecture (ISA) has no significant impact on performance and power consumption.
      [Figure: average power (normalized) for A8 (ARM, 0.6 GHz, 65 nm, iPhone 4), A15 (ARM, 1.66 GHz, 32 nm, Galaxy S4), Atom (x86, 1.66 GHz, 45 nm, netbook) and i7 (x86, 3.4 GHz, 32 nm, desktop). Technology-independent: scaled to 1 GHz and a 45 nm process, normalized to A8.]

  12. ARM Instruction Set; Microarchitecture:
      • Technology node and feature size
      • Voltage and frequency scaling
      • Power domains
      • Clock gating
      • Power modes
      • Pipelining
      • Caches
      • SoC (System-on-a-Chip) design

  13. Technology node and feature size: reducing capacitance.

  14. Voltage and frequency scaling: dynamically adjusting supply voltage and clock speed according to need.
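Why capacitance, voltage and clock speed are the levers can be illustrated with the usual first-order model of dynamic CMOS power, P ≈ C · V² · f. The slides do not state this formula; it is a textbook approximation, and the scaling factors below are hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order dynamic power model: P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


# Nominal operating point, normalized to 1.0 for every parameter.
nominal = dynamic_power(1.0, 1.0, 1.0)

# Hypothetical DVFS step: lower both supply voltage and clock by 20 %.
scaled = dynamic_power(1.0, 0.8, 0.8)

# Power relative to nominal; the voltage enters quadratically.
relative = scaled / nominal
```

In this model, lowering voltage and frequency together by 20 % cuts dynamic power roughly in half, which is why voltage scaling pays off more than frequency scaling alone.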

  15. Power domains: the power supply for different sections of the core can be turned on/off independently.

  16. Clock gating: the clock for different sections of the core can be turned on/off independently.

  17. Power modes: predefined low-power modes utilizing the above-mentioned features.

  18. Pipelining: reducing idle time of different parts of the core.

  19. Caches: reducing time- and power-intensive accesses to main memory.

  20. SoC (System-on-a-Chip) design: adjusting all components of a processor to one another.

  21. ARM's emphasis is on power consumption and size → momentum for the mobile market.

  22. [Figure: block diagram of a cluster inside the SoC. Labeled blocks: Core 1 with L1 cache (instr./data), MMU with µTLB (instr./data) and TLB, arbiter and bus; Snoop Controller Unit; shared L2 cache.]

  23. Same block diagram, annotated with the two cores:
      • Cortex A53: 8-stage (integer) pipeline, in-order
      • Cortex A57: 15-stage (integer) pipeline, out-of-order

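  24. Comparison of the two CPUs:

      |                          | LITTLE         | big            |
      |--------------------------|----------------|----------------|
      | CPU                      | Cortex A53     | Cortex A57     |
      | 64-bit                   | Yes            | Yes            |
      | Cores                    | 1 – 4          | 1 – 4          |
      | Frequency*               | 1.3 GHz        | 1.9 GHz        |
      | L1 Cache                 | 8 – 64 kB      | 48/32 kB       |
      | L2 Cache                 | 128 – 2,048 kB | 512 – 2,048 kB |
      | Pipeline: integer depth  | 8              | 15             |
      | Pipeline: out-of-order   | No             | Yes            |
      | Performance              | 2.3 DMIPS/MHz  | 4.1 DMIPS/MHz  |
      | Technology node*         | 20 nm          | 20 nm          |
      | Core size*               | 0.70 mm²       | 2.05 mm²       |
      | Cluster size*            | 4.58 mm²       | 15.10 mm²      |

      * Values for SoC Samsung Exynos 5433 (Galaxy Note 4)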
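A few derived figures make the trade-off in this table easier to see. The sketch below only multiplies and divides values taken from the table; the dictionary layout and the helper `peak_dmips` are illustrative, and DMIPS/MHz times clock frequency is a rough single-core estimate, not a measured result.

```python
# Per-core values copied from the table (Exynos 5433 figures where starred).
CORES = {
    "Cortex-A53": {"dmips_per_mhz": 2.3, "freq_mhz": 1300, "core_mm2": 0.70},
    "Cortex-A57": {"dmips_per_mhz": 4.1, "freq_mhz": 1900, "core_mm2": 2.05},
}


def peak_dmips(core):
    """Rough single-core peak: DMIPS/MHz multiplied by the clock in MHz."""
    return core["dmips_per_mhz"] * core["freq_mhz"]


little = CORES["Cortex-A53"]
big = CORES["Cortex-A57"]

# How much faster and how much larger one big core is than one LITTLE core.
speed_ratio = peak_dmips(big) / peak_dmips(little)
area_ratio = big["core_mm2"] / little["core_mm2"]

for name, core in CORES.items():
    print(f"{name}: {peak_dmips(core):.0f} DMIPS, "
          f"{peak_dmips(core) / core['core_mm2']:.0f} DMIPS/mm^2")
print(f"big/LITTLE speed ratio: {speed_ratio:.2f}, area ratio: {area_ratio:.2f}")
```

Under these assumptions a big core delivers about 2.6 times the peak throughput of a LITTLE core while occupying about 2.9 times the area, so the LITTLE core is the more area-efficient of the two.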

  25. Cortex-A Power Consumption. [Figure: power consumption (W, 0 to 8) over frequency (MHz, 400 to 1900) for A53 (1 core), A53 (4 cores), A57 (1 core) and A57 (4 cores). SoC: Samsung Exynos 5433 (Galaxy Note 4).]

  26. Heterogeneous multi-processing

  27. Connecting two heterogeneous clusters… [Figure: big cluster (Cortex A57, cores 1 to 4, L2 cache) next to LITTLE cluster (Cortex A53, cores 1 to 4, L2 cache).] → Binary compatible

  28. Each cluster's L2 cache is attached via AXI. AXI = Advanced eXtensible Interface

  29. AXI channels on each cluster: Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack

  30. AXI plus ACE on each cluster: Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack (AXI) and C_Address, C_Data, C_Response (ACE)

  31. ACE = AXI Coherency Extension, adding the channels C_Address, C_Data and C_Response
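The channel names on these slides are simplified labels. A possible mapping to the channel abbreviations used in the AMBA AXI and ACE specification is sketched below; the pairing of slide labels with specification names is an interpretation, not something the slides state.

```python
# Slide label -> channel abbreviation in the AMBA specification (interpretation).
AXI_CHANNELS = {
    "Read_Address": "AR",
    "Read_Data": "R",
    "Write_Address": "AW",
    "Write_Data": "W",
    "Write_Ack": "B",      # write response channel
}

# Channels that ACE adds on top of AXI for snooping.
ACE_SNOOP_CHANNELS = {
    "C_Address": "AC",     # snoop address
    "C_Response": "CR",    # snoop response
    "C_Data": "CD",        # snoop data
}


def channels(coherent):
    """Return the channel map of a plain AXI port or of an ACE port."""
    result = dict(AXI_CHANNELS)
    if coherent:
        result.update(ACE_SNOOP_CHANNELS)
    return result


ace_port = channels(coherent=True)
axi_port = channels(coherent=False)
```

An ACE port is therefore a plain AXI port (five channels) plus three snoop channels, which is what lets the interconnect query a cluster's caches.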

  32. Both clusters connect through AXI/ACE to the Cache Coherent Interconnect.

  33. [Figure: SoC overview showing the two clusters with their L2 caches, the Cache Coherent Interconnect, memory controller, GPU, display, bus and periphery.]

  34. Coherency states. A cache line is either valid or invalid; a valid line is unique or shared, and dirty or clean:
      • Unique Dirty
      • Shared Dirty
      • Unique Clean
      • Shared Clean
      • Invalid
      Analogous to the MOESI protocol: Modified, Owned, Exclusive, Shared, Invalid
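The correspondence between the five states on this slide and MOESI can be written down as a small lookup table. The slide only says the two are analogous; the concrete pairing below is the commonly used one, and the enum and helper names are made up for this sketch.

```python
from enum import Enum


class AceState(Enum):
    UNIQUE_DIRTY = "UniqueDirty"
    SHARED_DIRTY = "SharedDirty"
    UNIQUE_CLEAN = "UniqueClean"
    SHARED_CLEAN = "SharedClean"
    INVALID = "Invalid"


# Commonly used pairing of ACE states with MOESI states.
ACE_TO_MOESI = {
    AceState.UNIQUE_DIRTY: "Modified",
    AceState.SHARED_DIRTY: "Owned",
    AceState.UNIQUE_CLEAN: "Exclusive",
    AceState.SHARED_CLEAN: "Shared",
    AceState.INVALID: "Invalid",
}


def is_unique(state):
    """True if no other cache may hold a copy of the line."""
    return state in (AceState.UNIQUE_DIRTY, AceState.UNIQUE_CLEAN)


def is_dirty(state):
    """True if this cache is responsible for writing the line back."""
    return state in (AceState.UNIQUE_DIRTY, AceState.SHARED_DIRTY)
```

For example, `ACE_TO_MOESI[AceState.SHARED_DIRTY]` yields `"Owned"`: other caches may hold copies, but this cache must eventually update main memory.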

  35. [Figure: big cluster (Cortex A57) and LITTLE cluster (Cortex A53), each with its cache, connected through AXI/ACE to the Cache Coherent Interconnect. The following slides step through a load on this diagram.]

  36. Step 1: LITTLE → load(A)

  37. Step 2: CCI → snoop(A)

  38. Step 3: big → resp(miss)

  39. Step 4: CCI → load_mem(A), the request goes to main memory

  40. Step 5: CCI → return(A)

  41. Result after steps 1 to 5: A is now held in the LITTLE cluster's cache, marked "u" (unique).
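The five-step load from the preceding slides can be replayed with a toy model. This is a sketch under loud assumptions: the classes `Cluster` and `CacheCoherentInterconnect` are invented for illustration, caches are dictionaries, and the snoop-hit branch is not shown on these slides and is only a simplified guess.

```python
class Cluster:
    """A CPU cluster with a toy cache: address -> (value, state)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}


class CacheCoherentInterconnect:
    """Toy CCI that snoops the other cluster before going to memory."""

    def __init__(self, clusters, memory):
        self.clusters = clusters
        self.memory = memory
        self.log = []

    def load(self, requester, address):
        self.log.append(f"{requester.name} -> load({address})")
        for other in self.clusters:
            if other is requester:
                continue
            self.log.append(f"CCI -> snoop({address})")
            if address in other.cache:
                # Assumption: hit handling is NOT on the slides; both copies
                # simply become shared here, dirty tracking is ignored.
                value, _ = other.cache[address]
                self.log.append(f"{other.name} -> resp(hit)")
                other.cache[address] = (value, "shared")
                requester.cache[address] = (value, "shared")
                self.log.append(f"CCI -> return({address})")
                return value
            self.log.append(f"{other.name} -> resp(miss)")
        # Nobody else holds the line: fetch it from main memory.
        self.log.append(f"CCI -> load_mem({address})")
        value = self.memory[address]
        requester.cache[address] = (value, "unique")
        self.log.append(f"CCI -> return({address})")
        return value


big = Cluster("big")
little = Cluster("LITTLE")
cci = CacheCoherentInterconnect([big, little], memory={"A": 42})
value = cci.load(little, "A")
```

After the call, `cci.log` contains the same five steps as the slides, and `little.cache["A"]` holds the line in the unique state, matching the "A u" marker on slide 41.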
