ARM big.LITTLE Technology

Advanced Seminar – Computer Engineering
Philipp Gsching
08.12.2015

1. Introduction
2. ARM Architecture
   1. Instruction Set
   2. Microarchitecture
   3. CPUs
3. big.LITTLE
   1. Cache Coherency
   2. Distributed Virtual Memory
   3. Performance
4. Conclusion

• Smartphone/Tablet use cases:
  1. Idle most of the time → low-power CPU
  2. High-performance requirements → high-performance CPU
• Difficult to achieve with one CPU

• Idea: ARM big.LITTLE
  Fusing a low-power and a high-performance CPU in one chip

[Diagram: two clusters of four cores each, every cluster with its own L2 cache, joined by a Cache Coherent Interconnect]
  big (Cortex A57): Gaming, HD videos, Rich Web, Internet, …
  LITTLE (Cortex A53): OS, UI, Services, E-Mail, …

Basics

• Advanced RISC Machines
• Founded: 1990 by Acorn, Apple and VLSI
• Origin: Microcontrollers / Embedded Systems
• Business model: design and licensing of Intellectual Property (IP)
• Revenue: 1.2 billion USD (Intel: 55.8 billion USD)
• Employees: 3,300 (Intel: 106,700)
• Market share: > 90% (2014, smartphone/tablet)

• ARM Instruction Set:
  • RISC (Reduced Instruction Set Computing)

RISC (ARM):
    MOV r2, #8
    MUL r1, r1, r2
    ADD r0, r0, r1
    ADD r0, r0, #4
    LDR r3, [r0]
    ADD r3, r3, #1
    STR r3, [r0]

CISC (IA-32):
    ADD $1, 4(%eax, %ebx, 8)

• ARM Instruction Set:
  • RISC (Reduced Instruction Set Computing)
  • 16 general-purpose registers + 2 status registers
  • 32-bit fixed-size instructions
  • Condition codes for (almost) all instructions
  • Barrel shifter for ALU
  • 16-bit fixed-size THUMB instructions
  • Digital Signal Processing (DSP) instructions
  • Cryptography Extension instructions
  → Not strictly RISC

RISC (ARM):
    MOV r2, #8
    MUL r1, r1, r2
    ADD r0, r0, r1
    ADD r0, r0, #4
    LDR r3, [r0]
    ADD r3, r3, #1
    STR r3, [r0]

RISC (ARM), using the barrel shifter and pre-indexed addressing:
    ADD r0, r0, r1, LSL #3
    LDR r3, [r0, #4]!
    ADD r3, #1
    STR r3, [r0]

CISC (IA-32), executed via microcode:
    ADD $1, 4(%eax, %ebx, 8)
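To make the comparison concrete, the following is a minimal Python sketch of what the three instruction sequences above compute: all of them increment the word at address base + index * 8 + 4. The function names and the dictionary used as "memory" are illustrative assumptions, not part of the slides, and the model ignores word sizes, flags and timing.

```python
def cisc_add(mem, eax, ebx):
    """IA-32: ADD $1, 4(%eax, %ebx, 8) as a single read-modify-write."""
    mem[eax + ebx * 8 + 4] += 1


def arm_long(mem, r0, r1):
    """ARM, seven-instruction version (r0 = base, r1 = index)."""
    r2 = 8            # MOV r2, #8
    r1 = r1 * r2      # MUL r1, r1, r2
    r0 = r0 + r1      # ADD r0, r0, r1
    r0 = r0 + 4       # ADD r0, r0, #4
    r3 = mem[r0]      # LDR r3, [r0]
    r3 = r3 + 1       # ADD r3, r3, #1
    mem[r0] = r3      # STR r3, [r0]


def arm_short(mem, r0, r1):
    """ARM, four-instruction version."""
    r0 = r0 + (r1 << 3)  # ADD r0, r0, r1, LSL #3  (shift done by the barrel shifter)
    r0 = r0 + 4          # LDR r3, [r0, #4]!       (pre-indexed: r0 is updated first,
    r3 = mem[r0]         #                          then the load uses the new r0)
    r3 = r3 + 1          # ADD r3, #1
    mem[r0] = r3         # STR r3, [r0]


if __name__ == "__main__":
    # Hypothetical base address 0x1000 and index 3, so the target is 0x101C.
    for step in (cisc_add, arm_long, arm_short):
        memory = {0x101C: 41}
        step(memory, 0x1000, 3)
        print(step.__name__, memory[0x101C])
```

All three functions leave the same value in memory; only the number of architectural steps differs.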

The Instruction Set Architecture (ISA) has no significant impact on performance and power consumption.

[Figure: Average Power (normalized, scale 0 to 4) for four processors:
  A8 (ARM, 0.6 GHz, 65 nm, iPhone 4)
  A15 (ARM, 1.66 GHz, 32 nm, Galaxy S4)
  Atom (x86, 1.66 GHz, 45 nm, Netbook)
  i7 (x86, 3.4 GHz, 32 nm, Desktop)
Technology-independent: scaled to 1 GHz and a 45 nm process, normalized to A8]

• ARM Instruction Set
• Microarchitecture:
  • Technology node and feature size: reducing capacitance
  • Voltage and frequency scaling: dynamically adjusting supply voltage and clock speed according to need
  • Power domains: the power supply for different sections of the core can be turned on/off independently
  • Clock gating: the clock for different sections of the core can be turned on/off independently
  • Power modes: predefined low-power modes utilizing the above-mentioned features
  • Pipelining: reducing idle time of different parts of the core
  • Caches: reducing time- and power-intensive accesses to main memory
  • SoC (System-on-a-Chip) design: adjusting all components of a processor to one another
• ARM's emphasis is on power consumption and size → momentum for the mobile market
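Why capacitance, voltage and frequency matter can be illustrated with the usual first-order model for dynamic CMOS power, P = a * C * V^2 * f. This formula is a textbook approximation brought in here as an assumption; it does not appear on the slides, and the scaling factors below are illustrative numbers, not measurements.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order dynamic power: activity * C * V^2 * f (leakage is ignored)."""
    return activity * capacitance * voltage ** 2 * frequency


def relative_power(v_scale, f_scale, c_scale=1.0):
    """Dynamic power after scaling V, f and C, relative to the unscaled design."""
    return c_scale * v_scale ** 2 * f_scale


if __name__ == "__main__":
    # Hypothetical operating point: halve the clock and lower the voltage by 20 %.
    print(relative_power(v_scale=0.8, f_scale=0.5))
    # Hypothetical smaller technology node: 30 % less switched capacitance.
    print(relative_power(v_scale=1.0, f_scale=1.0, c_scale=0.7))
```

Because voltage enters quadratically, lowering voltage together with frequency saves more power than lowering frequency alone, which is the motivation for scaling both.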

[Diagram: structure of a cluster within the SoC. Core 1 contains an MMU (instruction and data µTLBs, TLB) and an L1 cache (instruction, data). Further blocks: Arbiter, BUS, Snoop Controller Unit, L2 Cache (shared).]

Cortex A53: 8-stage (integer), in-order
Cortex A57: 15-stage (integer), out-of-order

[Diagram: structure of a cluster within the SoC. Core 1 contains an MMU (instruction and data µTLBs, TLB) and an L1 cache (instruction, data). Further blocks: Arbiter, BUS, Snoop Controller Unit, L2 Cache (shared).]

                            LITTLE            big
CPU                         Cortex A53        Cortex A57
64-bit                      Yes               Yes
Cores                       1 – 4             1 – 4
Frequency*                  1.3 GHz           1.9 GHz
L1 Cache                    8 – 64 kB         48/32 kB
L2 Cache                    128 – 2,048 kB    512 – 2,048 kB
Pipeline: integer depth     8                 15
Pipeline: out-of-order      No                Yes
Performance                 2.3 DMIPS/MHz     4.1 DMIPS/MHz
Technology node*            20 nm             20 nm
Core size*                  0.70 mm²          2.05 mm²
Cluster size*               4.58 mm²          15.10 mm²

* Values for SoC Samsung Exynos 5433 (Galaxy Note 4)
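The table values can be combined into two derived figures: peak per-core throughput (DMIPS/MHz times clock frequency) and throughput per unit of core area. The sketch below uses only numbers from the table; the dictionary layout and function names are illustrative assumptions.

```python
# Per-core figures copied from the table (Exynos 5433 values where starred).
CORES = {
    "Cortex A53": {"dmips_per_mhz": 2.3, "freq_mhz": 1300, "core_mm2": 0.70},
    "Cortex A57": {"dmips_per_mhz": 4.1, "freq_mhz": 1900, "core_mm2": 2.05},
}


def peak_dmips(name):
    """Peak single-core throughput in DMIPS at the listed clock frequency."""
    core = CORES[name]
    return core["dmips_per_mhz"] * core["freq_mhz"]


def dmips_per_mm2(name):
    """Peak single-core throughput divided by the core area."""
    return peak_dmips(name) / CORES[name]["core_mm2"]


if __name__ == "__main__":
    for name in CORES:
        print(f"{name}: {peak_dmips(name):.0f} DMIPS, "
              f"{dmips_per_mm2(name):.0f} DMIPS/mm^2")
```

With these inputs the A57 core delivers roughly 2.6 times the peak throughput of the A53 core (about 7790 versus 2990 DMIPS), while the A53 delivers more throughput per mm² of core area.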

[Figure: Cortex-A Power Consumption. Power consumption (W, 0 to 8) over frequency (MHz, 400 to 1900) for four configurations: A53 (1 core), A53 (4 cores), A57 (1 core), A57 (4 cores). SoC: Samsung Exynos 5433 (Galaxy Note 4)]

Heterogeneous multi-processing

Connecting two heterogeneous clusters…
• big (Cortex A57, cores 1-4) and LITTLE (Cortex A53, cores 1-4), each with its own L2 cache
• → Binary compatible
• Each cluster is attached via AXI (Advanced eXtensible Interface), with the channels:
    Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack
• ACE (AXI Coherency Extension) adds the channels:
    C_Address, C_Data, C_Response
• The AXI ACE ports of both clusters connect to the Cache Coherent Interconnect

[Diagram: SoC overview. The two four-core clusters, each with its L2 cache, attach to the Cache Coherent Interconnect. Further blocks: Memory Controller, Periphery, BUS, Display, GPU.]

Coherency States

                 Valid                         Invalid
           Unique          Shared
  Dirty    Unique Dirty    Shared Dirty        Invalid
  Clean    Unique Clean    Shared Clean

Analogous to the MOESI protocol: Modified, Owned, Exclusive, Shared, Invalid
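The correspondence between the five ACE states and MOESI can be written down as a small lookup. The sketch below is illustrative only: the class, dictionary and function names are assumptions, and the pairing follows the usual reading of the two state sets (a dirty line held by one cache is Modified, a dirty line that may also be in other caches is Owned, and so on).

```python
from enum import Enum


class AceState(Enum):
    UNIQUE_DIRTY = "Unique Dirty"
    SHARED_DIRTY = "Shared Dirty"
    UNIQUE_CLEAN = "Unique Clean"
    SHARED_CLEAN = "Shared Clean"
    INVALID = "Invalid"


# ACE state -> analogous MOESI state.
ACE_TO_MOESI = {
    AceState.UNIQUE_DIRTY: "Modified",
    AceState.SHARED_DIRTY: "Owned",
    AceState.UNIQUE_CLEAN: "Exclusive",
    AceState.SHARED_CLEAN: "Shared",
    AceState.INVALID: "Invalid",
}


def ace_state(valid, unique, dirty):
    """Classify a cache line from the three attributes shown in the state table."""
    if not valid:
        return AceState.INVALID
    if unique:
        return AceState.UNIQUE_DIRTY if dirty else AceState.UNIQUE_CLEAN
    return AceState.SHARED_DIRTY if dirty else AceState.SHARED_CLEAN


if __name__ == "__main__":
    line = ace_state(valid=True, unique=False, dirty=True)
    print(line.value, "is analogous to", ACE_TO_MOESI[line])
```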

Example: a load on the LITTLE cluster
(both clusters, big Cortex A57 and LITTLE Cortex A53, are attached to the Cache Coherent Interconnect (CCI) via AXI ACE)

1. LITTLE → load(A)
2. CCI → snoop(A)
3. big → resp(miss)
4. CCI → load_mem(A) (to main memory)
5. CCI → return(A)

Result: A is now held in the LITTLE cluster (requesting core and cluster cache), marked "u" (unique).
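The five steps can be replayed with a toy model. This is a minimal sketch under loud assumptions: the `Cluster` and `Interconnect` classes are hypothetical, each cluster is reduced to a single cache, dirty data and write-backs are ignored, and the snoop-hit branch (both copies become shared) is an illustrative simplification that the slides do not show.

```python
class Cluster:
    """A cluster reduced to one cache: address -> (value, state)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def snoop(self, addr):
        """Answer a snoop: the cached value, or None on a miss."""
        entry = self.cache.get(addr)
        return None if entry is None else entry[0]


class Interconnect:
    """Toy cache coherent interconnect (CCI) between clusters and main memory."""

    def __init__(self, clusters, memory):
        self.clusters = clusters
        self.memory = memory
        self.log = []

    def load(self, requester, addr):
        self.log.append(f"{requester.name} -> load({addr})")
        for other in self.clusters:
            if other is requester:
                continue
            self.log.append(f"CCI -> snoop({addr})")
            value = other.snoop(addr)
            if value is not None:
                # Assumed hit handling: both clusters keep a shared copy.
                self.log.append(f"{other.name} -> resp(hit)")
                other.cache[addr] = (value, "shared")
                requester.cache[addr] = (value, "shared")
                self.log.append(f"CCI -> return({addr})")
                return value
            self.log.append(f"{other.name} -> resp(miss)")
        # No other cache holds the line: fetch it from main memory.
        self.log.append(f"CCI -> load_mem({addr})")
        value = self.memory[addr]
        requester.cache[addr] = (value, "unique")
        self.log.append(f"CCI -> return({addr})")
        return value


if __name__ == "__main__":
    big, little = Cluster("big"), Cluster("LITTLE")
    cci = Interconnect([big, little], memory={"A": 7})
    cci.load(little, "A")
    print("\n".join(cci.log))
    print(little.cache)
```

Running the example reproduces the sequence from the slides, and the requesting cluster ends up with the line marked unique because no other cache reported a copy.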
