

slide-1
SLIDE 1

ARM big.LITTLE Technology

Advanced Seminar – Computer Engineering Philipp Gsching 08.12.2015

1

slide-2
SLIDE 2
  • 1. Introduction
  • 2. ARM Architecture
      1. Instruction Set
      2. Microarchitecture
      3. CPUs
  • 3. big.LITTLE
      1. Cache Coherency
      2. Distributed Virtual Memory
      3. Performance
  • 4. Conclusion

2

slide-3
SLIDE 3

Smartphone/tablet use cases:

  • 1. Idle most of the time → low-power CPU
  • 2. High-performance requirements → high-performance CPU

Difficult to achieve with one CPU

3

slide-4
SLIDE 4

Idea: ARM big.LITTLE

Fusing a low-power and a high-performance CPU in one chip

LITTLE

  • OS
  • UI
  • Internet
  • E-Mail

big

  • Gaming
  • HD videos
  • Rich Web Services

[Diagram: LITTLE cluster (Cortex A53, cores 1–4, L2 cache) and big cluster (Cortex A57, cores 1–4, L2 cache), connected by the Cache Coherent Interconnect]

4

slide-5
SLIDE 5

Basics

5

slide-6
SLIDE 6

Advanced RISC Machines

  • Founded: 1990 by Acorn, Apple and VLSI
  • Origin: microcontrollers / embedded systems
  • Business model: design and licensing of Intellectual Property (IP)
  • Revenue: 1.2 billion USD (Intel: 55.8 billion USD)
  • Employees: 3,300 (Intel: 106,700)
  • Market share: > 90% (2014, smartphone/tablet)

6

slide-7
SLIDE 7

 ARM Instruction Set:

  • RISC (Reduced Instruction Set Computing)

7

slide-8
SLIDE 8

RISC (ARM)

    MOV r2, #8
    MUL r1, r1, r2
    ADD r0, r0, r1
    ADD r0, r0, #4
    LDR r3, [r0]
    ADD r3, r3, #1
    STR r3, [r0]

CISC (IA-32)

ADD $1, 4(%eax, %ebx, 8)

8

slide-9
SLIDE 9

Not strictly RISC

 ARM Instruction Set:

  • RISC (Reduced Instruction Set Computing)
  • 16 general purpose registers + 2 status registers
  • 32-bit fixed-size instructions
  • Condition Codes for (almost) all instructions
  • Barrel Shifter for ALU
  • 16-bit fixed-size THUMB instructions
  • Digital Signal Processing (DSP) instructions
  • Cryptography Extension Instructions

9

slide-10
SLIDE 10

RISC (ARM)

    MOV r2, #8
    MUL r1, r1, r2
    ADD r0, r0, r1
    ADD r0, r0, #4
    LDR r3, [r0]
    ADD r3, r3, #1
    STR r3, [r0]

Equivalent, using the barrel shifter (LSL #3) and pre-indexed addressing with write-back:

    ADD r0, r0, r1, LSL #3
    LDR r3, [r0, #4]!
    ADD r3, #1
    STR r3, [r0]

CISC (IA-32)

    ADD $1, 4(%eax, %ebx, 8)

→ Microcode
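Both instruction sequences perform the same read-modify-write on a scaled, offset address. A minimal Python sketch of that address arithmetic follows; the function name, the dictionary used as memory and the example values are hypothetical and only serve as illustration.

```python
def increment_scaled(memory, base, index, scale=8, offset=4):
    """Add 1 to the word stored at base + index * scale + offset."""
    # Address computation: MOV/MUL/ADD/ADD on the RISC side,
    # the addressing mode 4(%eax, %ebx, 8) on the CISC side.
    address = base + index * scale + offset
    value = memory[address]        # LDR r3, [r0]
    memory[address] = value + 1    # ADD r3, r3, #1 ; STR r3, [r0]
    return address


# Hypothetical memory contents: one word at the computed address.
memory = {0x1000 + 3 * 8 + 4: 41}
address = increment_scaled(memory, base=0x1000, index=3)
```

The sketch makes visible that the CISC instruction folds the address calculation, the load, the add and the store into one instruction, while the RISC code spells out each step.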

10

slide-11
SLIDE 11

Instruction Set Architecture (ISA) has no significant impact on performance and power consumption

[Figure: average power (normalized to A8), technology-independent, scaled to 1 GHz and a 45 nm process]

  • A8 (ARM, 0.6 GHz, 65 nm, iPhone 4)
  • A15 (ARM, 1.66 GHz, 32 nm, Galaxy S4)
  • Atom (x86, 1.66 GHz, 45 nm, Netbook)
  • i7 (x86, 3.4 GHz, 32 nm, Desktop)

11

slide-12
SLIDE 12

Microarchitecture:

  • Technology node and feature size: reducing capacitance
  • Voltage and frequency scaling: dynamically adjusting supply voltage and clock speed according to need
  • Power domains: the power supply for different sections of the core can be turned on/off independently
  • Clock gating: the clock for different sections of the core can be turned on/off independently
  • Power modes: predefined low-power modes utilizing the features above
  • Pipelining: reducing idle time of different parts of the core
  • Caches: reducing time- and power-intensive accesses to main memory
  • SoC (System-on-a-Chip) design: adjusting all components of a processor to one another

ARM's emphasis is on power consumption and size → momentum for the mobile market
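Why capacitance, supply voltage and clock frequency matter can be illustrated with the standard CMOS dynamic power model, P = a · C · V² · f. This model is not stated on the slides; it is brought in here as an assumption, and the scaling values below are hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Dynamic power of switching CMOS logic: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


# Normalized baseline operating point.
baseline = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Hypothetical DVFS operating point: 80% of the voltage, 50% of the frequency.
scaled = dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.5)

ratio = scaled / baseline
```

Because voltage enters quadratically, lowering voltage together with frequency saves more power than lowering frequency alone, which is the motivation behind voltage and frequency scaling.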

21

slide-22
SLIDE 22

[Diagram: SoC containing a cluster with Snoop Control Unit (SCU), shared L2 cache, MMU, TLB and bus arbiter; each core (Core 1 shown) has a µTLB and an L1 cache, both split into instruction and data]

  • Cortex A53: 8-stage (integer) pipeline, in-order
  • Cortex A57: 15-stage (integer) pipeline, out-of-order

23

slide-24
SLIDE 24

                          LITTLE            big
CPU                       Cortex A53        Cortex A57
64-bit                    Yes               Yes
Cores                     1 – 4             1 – 4
Frequency*                1.3 GHz           1.9 GHz
L1 Cache                  8 – 64 kB         48/32 kB
L2 Cache                  128 – 2,048 kB    512 – 2,048 kB
Pipeline: integer depth   8                 15
Pipeline: out-of-order    No                Yes
Performance               2.3 DMIPS/MHz     4.1 DMIPS/MHz
Technology node*          20 nm             20 nm
Core size*                0.70 mm²          2.05 mm²
Cluster size*             4.58 mm²          15.10 mm²

* Values for SoC Samsung Exynos 5433 (Galaxy Note 4)
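The per-MHz performance figures can be turned into peak per-core throughput and throughput per core area. The following Python sketch does this with the table's values; the dictionary layout and function names are hypothetical, and treating DMIPS/MHz × MHz as peak DMIPS is a simplifying assumption.

```python
# Values taken from the comparison table (Exynos 5433 figures).
CORES = {
    "Cortex A53": {"dmips_per_mhz": 2.3, "mhz": 1300, "core_mm2": 0.70},
    "Cortex A57": {"dmips_per_mhz": 4.1, "mhz": 1900, "core_mm2": 2.05},
}


def peak_dmips(core):
    """Peak single-core throughput at the listed frequency."""
    return core["dmips_per_mhz"] * core["mhz"]


def dmips_per_mm2(core):
    """Peak throughput per unit of core area."""
    return peak_dmips(core) / core["core_mm2"]


summary = {
    name: (round(peak_dmips(core)), round(dmips_per_mm2(core)))
    for name, core in CORES.items()
}
```

Under these assumptions the big core delivers more absolute throughput, while the LITTLE core delivers more throughput per mm².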

24

slide-25
SLIDE 25

[Figure: Cortex-A power consumption (W, 1 to 8) versus frequency (MHz, 400 to 1900) for A53 (1 core), A53 (4 cores), A57 (1 core) and A57 (4 cores); SoC: Samsung Exynos 5433 (Galaxy Note 4)]

25

slide-26
SLIDE 26

Heterogeneous multi-processing

26

slide-27
SLIDE 27

[Diagram: LITTLE cluster (Cortex A53, cores 1–4, L2 cache) and big cluster (Cortex A57, cores 1–4, L2 cache)]

Connecting two heterogeneous clusters…

  • Binary compatible
  • AXI = Advanced eXtensible Interface; channels: Read_Address, Read_Data, Write_Address, Write_Data, Write_Ack
  • ACE = AXI Coherency Extension; channels: C_Address, C_Data, C_Response
  • Both clusters attach via AXI + ACE to the Cache Coherent Interconnect

32

slide-33
SLIDE 33

[Diagram: SoC in which the Cache Coherent Interconnect links both CPU clusters (cores 1–4 and an L2 cache each), the GPU, the bus, the memory controller, the display and the periphery]

33

slide-34
SLIDE 34

[Diagram: LITTLE cluster (Cortex A53) and big cluster (Cortex A57), each with an L2 cache, attached via AXI + ACE to the Cache Coherent Interconnect]

Coherency states (analogous to the MOESI protocol: Modified, Owned, Exclusive, Shared, Invalid):

                    Unique          Shared
  Valid, Dirty      Unique Dirty    Shared Dirty
  Valid, Clean      Unique Clean    Shared Clean
  Invalid           Invalid
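The correspondence between the five ACE coherency states and MOESI can be written down as a small lookup. This is a sketch in Python; the class and attribute names are hypothetical, and the pairing follows the usual reading of the two protocols (dirty + unique = Modified, dirty + shared = Owned, and so on).

```python
from enum import Enum


class AceState(Enum):
    # Each member carries (valid, unique, dirty, MOESI counterpart).
    UNIQUE_DIRTY = (True, True, True, "Modified")
    SHARED_DIRTY = (True, False, True, "Owned")
    UNIQUE_CLEAN = (True, True, False, "Exclusive")
    SHARED_CLEAN = (True, False, False, "Shared")
    INVALID = (False, False, False, "Invalid")

    def __init__(self, valid, unique, dirty, moesi):
        self.valid = valid
        self.unique = unique
        self.dirty = dirty
        self.moesi = moesi


# Initials of the MOESI counterparts, in declaration order.
initials = "".join(state.moesi[0] for state in AceState)
```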

34

slide-35
SLIDE 35

[Diagram: LITTLE cluster (Cortex A53, cores 1–4) and big cluster (Cortex A57, cores 1–4), each with its cache, attached via AXI + ACE to the Cache Coherent Interconnect (CCI)]

Example: coherent accesses to a cache line A (Au = unique, As = shared, A′ = modified value, --- = no copy)

  • 1. LITTLE → load(A)
  • 2. CCI → snoop(A)
  • 3. big → resp(miss)
  • 4. CCI → load_mem(A) (to main memory)
  • 5. CCI → return(A) [LITTLE: Au]
  • 6. big → load(A)
  • 7. CCI → snoop(A)
  • 8. LITTLE → resp(hit) → return(A) [LITTLE: As]
  • 9. CCI → return(A) [big: As]
  • 10. LITTLE → makeUnique(A)
  • 11. CCI → invalidate(A)
  • 12. big → invalidated → resp(ack) [big: ---]
  • 13. CCI → resp(isUnique) [LITTLE: Au]
  • 14. LITTLE → store(A) [LITTLE: A′u]
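The 14 steps can be replayed with a small model. The following Python sketch is a simplification under stated assumptions: the class and method names are hypothetical, each cluster is reduced to a single cache, only the unique and shared states are tracked, and dirty data and write-back to memory are not modelled.

```python
class Cluster:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> [state, value]

    def snoop(self, address):
        """Answer a snoop: on a hit, downgrade to shared and return the data."""
        line = self.lines.get(address)
        if line is None:
            return None          # resp(miss)
        line[0] = "shared"
        return line[1]           # resp(hit) with data

    def invalidate(self, address):
        self.lines.pop(address, None)  # invalidated, resp(ack)


class Interconnect:
    def __init__(self, memory, clusters):
        self.memory = memory
        self.clusters = clusters
        self.log = []

    def load(self, requester, address):
        if address in requester.lines:
            return requester.lines[address][1]
        for other in self.clusters:
            if other is not requester:
                self.log.append(f"snoop({other.name})")
                data = other.snoop(address)
                if data is not None:
                    requester.lines[address] = ["shared", data]
                    return data
        self.log.append("load_mem")  # nobody holds the line
        data = self.memory[address]
        requester.lines[address] = ["unique", data]
        return data

    def store(self, requester, address, value):
        for other in self.clusters:  # makeUnique: invalidate all other copies
            if other is not requester:
                other.invalidate(address)
        requester.lines[address] = ["unique", value]


little, big = Cluster("LITTLE"), Cluster("big")
cci = Interconnect({"A": 1}, [little, big])
cci.load(little, "A")      # steps 1 to 5
cci.load(big, "A")         # steps 6 to 9
cci.store(little, "A", 2)  # steps 10 to 14
```

After the three calls, only the LITTLE cluster holds A, in the unique state and with the new value, which matches the end of the walkthrough.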

52

slide-53
SLIDE 53

[Diagram: core A in the LITTLE cluster (Cortex A53) and core B in the big cluster (Cortex A57), each cluster with a TLB and an L2 cache, attached via AXI + ACE to the Cache Coherent Interconnect]

Distributed Virtual Memory (DVM):

  • Threads on different cores share the same virtual memory
  • Core A causes a change in the page table → core B's TLB entry is out of date
  • Core A issues an invalidation message → CCI broadcasts the TLB entry invalidation → core B invalidates its TLB entry
  • TLBs are read-only → DVM messages can only invalidate entries (they can't fetch entries)

53
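The DVM invalidation flow can be sketched in a few lines of Python. All names here are hypothetical, the page table is reduced to a dictionary, and the broadcast is modelled as a plain loop over all cores (including the initiating one) rather than as real DVM messages.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.tlb = {}  # virtual page -> physical frame (cached translations)

    def translate(self, page_table, vpage):
        if vpage not in self.tlb:          # TLB miss: walk the page table
            self.tlb[vpage] = page_table[vpage]
        return self.tlb[vpage]

    def on_dvm_invalidate(self, vpage):
        # A DVM message can only drop the entry; it never supplies a new one.
        self.tlb.pop(vpage, None)


def remap(page_table, cores, vpage, new_frame):
    """Change a mapping, then broadcast the TLB invalidation to every core."""
    page_table[vpage] = new_frame
    for core in cores:
        core.on_dvm_invalidate(vpage)


page_table = {0x10: 0xA}
core_a, core_b = Core("A"), Core("B")
before = core_b.translate(page_table, 0x10)   # core B caches the mapping
remap(page_table, [core_a, core_b], 0x10, 0xB)
after = core_b.translate(page_table, 0x10)    # refetched after invalidation
```

Without the invalidation, core B would keep returning the stale frame from its TLB; with it, the next access misses and picks up the new mapping from the page table.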

slide-54
SLIDE 54

ACE Performance:

  • ACE clock can be an integer fraction of the CPU clock (including 1:1)
  • 16 simultaneous write commands per cluster
  • 8 simultaneous read commands per core

Snoop Performance:

  • SCU is clocked with the CPU clock
  • 8 simultaneous snoops per cluster
  • Snoop response after: 13 cycles (L2 hit), 16 cycles (L1 hit), 6 cycles (cache miss)

54


slide-56
SLIDE 56

  • Queue of 8 transactions, 16-cycle wait
  • Each transaction returns one cache line
  • 64-byte cache line width
  • 128-bit data channel width (16 bytes) → 4 cycles per cache line
  • 32 kB L1 transfer: 32 kBytes / 16 Bytes + 16 ≈ 2,013 cycles
  • 2 MB L2 transfer: 2 MBytes / 16 Bytes + 13 ≈ 131,088 cycles
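The transfer estimate is the data size divided by the 16-byte channel width, plus the initial snoop latency. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and the byte counts (32 kB taken as 32,000 bytes, 2 MB as 2 × 1024 × 1024 bytes) are assumptions, so the results land within a few cycles of the slide's figures rather than matching them exactly. The 1.3 GHz clock is the A53 frequency given for the Exynos 5433.

```python
def transfer_cycles(size_bytes, bytes_per_cycle=16, latency_cycles=0):
    """Cycles to stream size_bytes over the data channel after an initial latency."""
    return size_bytes // bytes_per_cycle + latency_cycles


def cycles_to_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


# Assumed byte counts; latencies are the snoop response times for L1 and L2 hits.
l1_cycles = transfer_cycles(32 * 1000, latency_cycles=16)
l2_cycles = transfer_cycles(2 * 1024 * 1024, latency_cycles=13)

l1_us = cycles_to_us(l1_cycles, 1.3e9)
l2_us = cycles_to_us(l2_cycles, 1.3e9)
```

At 1.3 GHz this gives roughly 1.5 µs for the L1 contents and roughly 100 µs for the L2 contents; the initial latency is negligible next to the streaming time.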

56

slide-57
SLIDE 57

Transfer time:

              A53 @ 1.3 GHz    E5520 @ 2.26 GHz
  32 kB       ~ 1.5 µs         ~ 6.5 µs
  2 MB        ~ 100 µs         ~ 30 µs

57

slide-58
SLIDE 58

[Figure: power consumption in W (0.01 to 10) versus performance in DMIPS (920 to 41,480) for big.LITTLE, Cortex A53 and Cortex A57 on the Samsung Exynos 5433 (Galaxy Note 4)]

Annotations on the figure:

  • Power advantage: ~ 70% and ~ 55%
  • Performance advantage: ~ 25%
  • Thermal barrier: ~ 7 W → temperature > 40–50 °C
  • Idle/low-power advantage

slide-63
SLIDE 63

  • Smaller performance advantage (up to 25%) for high-performance applications → TDP usually too high for 8 cores at maximum frequency
  • Significant power advantages (up to 70%) for high-efficiency applications → better performance/power for the entire system
  • Low-power idle (and background apps) not available to high-performance CPUs

63

slide-64
SLIDE 64

  • Heterogeneous CPUs are possible
  • big.medIUM.LITTLE: Helio X20 SoC

64

slide-65
SLIDE 65

Sources and References:

Papers

  • E. Blem, J. Menon, T. Vijayaraghavan, K. Sankaralingam. (2015, March). ISA Wars: Understanding the Relevance of ISA being RISC or CISC to Performance, Power, and Energy on Modern Architectures. ACM Transactions on Computer Systems. Vol. 33, No. 1, Article 3. Available: http://tocs.acm.org/
  • T. Mitra. (2014). Energy-Efficient Computing with Heterogeneous Multi-Cores. Presented at International Symposium on Integrated Circuits (ISIC).
  • V. Villebonnet, G. Da Costa, L. Lefevre, J.-M. Pierson, P. Stolf. (2014). Towards Generalizing "Big.Little" for Energy Proportional HPC and Cloud Infrastructures. Presented at IEEE Fourth International Conference on Big Data and Cloud Computing.
  • S. Yoo, Y. Shim, S. Lee, S.-A. Lee, J. Kim. (2015, October). A case for bad big.LITTLE switching: How to scale power-performance in SI-HMP. Presented at Hotpower’15, Monterey, CA, USA.

ARM Technical Reference Manuals and publications

  • ARMv7 Architecture Reference Manual, ARM Ltd., Cherry Hinton, Cambridge, 2014.
  • ARMv8 Architecture Reference Manual, ARM Ltd., Cherry Hinton, Cambridge, 2015.
  • CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual, ARM Ltd., Cherry Hinton, Cambridge, 2012.
  • AMBA AXI and ACE Protocol Specification, ARM Ltd., Cherry Hinton, Cambridge, 2013.
  • Introduction to AMBA 4 ACE and big.LITTLE Processing Technology, ARM Ltd., Cherry Hinton, Cambridge, 2013.
  • ARM Cortex-A53 MPCore Processor Technical Reference Manual, ARM Ltd., Cherry Hinton, Cambridge, 2014.
  • ARM Cortex-A57 MPCore Processor Technical Reference Manual, ARM Ltd., Cherry Hinton, Cambridge, 2014.
  • big.LITTLE Technology: The Future of Mobile, ARM Ltd., Cherry Hinton, Cambridge, 2013.

Internet

  • A. Frumusanu, R. Smith. (2015, February). ARM A53/A57/T760 investigated - Samsung Galaxy Note 4 Exynos Review. Available: http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/
  • B. Sigoure. (2010, November). How long does it take to make a context switch? Available: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

Other

  • H.-D. Cho, K. Chung, T. Kim. (2012, February). Benefits of the big.LITTLE Architecture. Samsung Electronics, Seoul.
  • K. Yu. (2012). big.LITTLE Switchers – Evaluation on Exynos.bl Processor. Presented at 2012 Korea Linux Forum. Available: http://events.linuxfoundation.org/images/stories/pdf/klf2012_yu.pdf

65

slide-66
SLIDE 66

[Diagram: in the instruction word, bits 31–28 hold the condition code; in the status register, bits 31–27 hold the flags N, Z, C, V, Q. The condition code is checked against the flags.]

Condition codes, e.g.:

  • 0000 := Zero flag (Z) is set
  • 0001 := Zero flag (Z) is clear

    CMP   r4, r5        ; (r4 – r5) == 0 ?
    ADDEQ r1, r2, r3    ; if equal: r1 := r2 + r3
    ADDNE r1, r2, r4    ; else:     r1 := r2 + r4

A not-executed instruction takes up 1 cycle.
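Conditional execution can be mimicked in Python as a lookup from the 4-bit condition field to a predicate on the status flags. This is a sketch with hypothetical names; only the Z flag and the EQ, NE and AL (always) conditions are modelled.

```python
# Condition field (instruction bits 31-28) -> (mnemonic suffix, predicate on flags).
CONDITIONS = {
    0b0000: ("EQ", lambda flags: flags["Z"]),       # Zero flag set
    0b0001: ("NE", lambda flags: not flags["Z"]),   # Zero flag clear
    0b1110: ("AL", lambda flags: True),             # always execute
}


def cmp_flags(a, b):
    """Flags after CMP a, b; only Z is modelled in this sketch."""
    return {"Z": (a - b) == 0}


def select_add(r2, r3, r4, r5):
    """Emulate CMP r4, r5 followed by ADDEQ r1, r2, r3 and ADDNE r1, r2, r4."""
    flags = cmp_flags(r4, r5)
    r1 = None
    if CONDITIONS[0b0000][1](flags):   # ADDEQ executes only if Z is set
        r1 = r2 + r3
    if CONDITIONS[0b0001][1](flags):   # ADDNE executes only if Z is clear
        r1 = r2 + r4
    return r1


equal_case = select_add(r2=10, r3=1, r4=5, r5=5)
unequal_case = select_add(r2=10, r3=1, r4=5, r5=7)
```

Both conditional instructions pass through the pipeline in either case; exactly one of them writes r1, which replaces a branch around the addition.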

66

slide-67
SLIDE 67

[Diagram: registers r0 – r15 supply operand A directly and operand B through the barrel shifter to the ALU; the result is written back to the registers]

    MOV r4, #2              ; binary: 0010
    ADD r5, r4, r4, LSL #1  ; r5 := r4 + (r4 << 1)
                            ; r5 := 0010 + 0100
                            ; r5 := 2 + 4

Shift operation: +1 cycle
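The shifted-operand addition can be checked with a short Python sketch. The helper names are hypothetical, and only the logical shift left (LSL) on 32-bit registers is modelled.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def add_shifted(operand_a, operand_b, shift):
    """ADD rd, ra, rb, LSL #shift: operand B passes through the shifter first."""
    return (operand_a + lsl(operand_b, shift)) & MASK32


r4 = 2
r5 = add_shifted(r4, r4, 1)  # r5 := r4 + (r4 << 1)
```

Adding a register to a shifted copy of itself is a common way to multiply by small constants (here by 3) without a multiply instruction.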

67

slide-68
SLIDE 68

68

E5520:

  • Clock: 2.26 GHz (Turbo: 2.53 GHz)
  • Cores: 4 (capable of hyperthreading)
  • Cache: 8 MB (L1 64 kB per core, L2 256 kB per core, 8 MB shared)
  • TDP: 80 W
  • Node: 45 nm
  • Year: 2009

slide-69
SLIDE 69

69

from CCI-400 Cache Coherent Interconnect Reference Manual