Memory driven architecture: flipping the inequality computing vs. memory (PowerPoint PPT Presentation)


SLIDE 1

Uri Weiser

Professor of Engineering, Technion

Memory driven architecture: flipping the inequality computing vs. memory


The talk covers research done by: Prof. Y. Etsion, Dr. Z. Guz, Prof. I. Keidar, Prof. A. Kolodny, S. Kvatinsky, Prof. I. Keslassy, T. Zidenberg, Prof. A. Mendelson, Y. Nacson, Prof. E. Friedman, Prof. U. Weiser

SLIDE 2

“The large energy consumption associated with the ever increasing internet use and the lack of efficient renewable energy sources to support it”

This conference's message:

  • Energy problems in data-com systems
  • Energy problems in computers: from systems to the chip level
  • Advanced solar energy harvesting

A scent of solutions?

SLIDE 3

The Trend Our Customers Expect


SLIDE 4

The Trend Our Customers Expect


SLIDE 5

Outline

  • The trends
  • The implications
  • The opportunities

  • Heterogeneous systems: some thoughts
  • Memristor → Memory Intensive Architecture (MIA)

  • Energy: optimal resource allocation in a heterogeneous system
  • How to start to think about Memory Intensive Architecture


SLIDE 6

The Trends


SLIDE 7

Process Technology: Minimum Feature Size

Source: Intel, SIA Technology Roadmap

SIA: Semiconductor Industry Association

[Chart: feature size (microns), log scale from 10 down to 0.01, vs. year ('68 to '14), with Intel and SIA data points; process nodes 180nm, 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm]

SLIDE 8

Putting It All Together


SLIDE 9

The Trend

Where are we going? The power wall


SLIDE 10

Microarchitecture

  • VLSI microarchitecture has been influenced by concepts that have been around for a long time
  • We have hit a power wall
  • Solutions:
      • Top down: improve performance/power or throughput/power → Heterogeneous Architecture
      • Bottom up: new devices? Memory-resistive devices?


SLIDE 11

Hetero vs. Memory Intensive

Heterogeneous Architecture

  • For a while, no major breakthrough in CPU technology
  • But the main reason is the POWER wall and energy/task
  • Accelerators to the rescue

Memory Intensive Architecture

  • Either a huge amount of memory cells close to logic, or
  • Logic cells close to lots of memory
  • Does it imply symmetric processing?


SLIDE 12

Flying machines - are they all the same?

Heterogeneous Systems


SLIDE 13

Heterogeneous Computing: Application Specific Accelerators

Continue the performance trend using heterogeneous computing to bypass current technological hurdles.

[Chart: performance/power vs. application range, with accelerators above the general-purpose curve]


SLIDE 14

Heterogeneous Computing

[Chart: performance/power of a general-purpose core vs. an accelerator]

SLIDE 15

Heterogeneous Systems’ Environment

An environment with limited resources: we need to optimize the system's targets within resource constraints. Resources may be:

  • Power, energy, area, space, $

System's targets may be:

  • Performance, power, energy, area, space, $


SLIDE 16

Heterogeneous Computing

Heterogeneous system design under a resource constraint: how to divide resources (e.g. area, power, energy) to achieve maximum system output (e.g. performance, throughput).

Accelerator target (an example): minimize execution time under an area constraint.

The application is divided into sections with execution times t1, t2, …, tn; the total budget is B = Σj bj.

ti = execution time of an application's section (run on a reference computing system)


SLIDE 17

MultiAmdahl:

T = t1·F1(a1) + t2·F2(a2) + … + tn·Fn(an)

A = a1 + a2 + a3 + … + an

Fi(ai) = the i-th accelerator's time-scaling function given area ai

Target: minimize T under a constraint A


SLIDE 18

MultiAmdahl:

Optimization using Lagrange multipliers

Minimize execution time (T = Σi ti·Fi(ai)) under an area constraint (A = Σi ai).

At the optimum, the Lagrange condition gives, for every pair i, j:

tj·F'j(aj) = ti·F'i(ai)

F' = derivative of the accelerator function; ai = area of the i-th accelerator; ti = execution time on the reference computer
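The marginal-utility condition can be turned into a tiny numerical solver. The sketch below is illustrative, not from the talk: it assumes a simple diminishing-returns accelerator model Fi(a) = 1/(1 + a) and finds the allocation satisfying the Lagrange condition by bisecting on the multiplier.

```python
import math

def multiamdahl_alloc(t, A):
    """Split a total area budget A across accelerators to minimize
    T = sum(t_i * F_i(a_i)), under the assumed model F_i(a) = 1/(1 + a).
    The Lagrange condition t_i*F'_i(a_i) = -lam then gives
    a_i = sqrt(t_i/lam) - 1 (clipped at 0); bisect on lam until areas sum to A."""
    def areas(lam):
        return [max(0.0, math.sqrt(ti / lam) - 1.0) for ti in t]
    lo, hi = 1e-12, max(t)           # sum(areas) is decreasing in lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(areas(mid)) > A:
            lo = mid                  # too much area handed out: raise lam
        else:
            hi = mid
    a = areas(0.5 * (lo + hi))
    T = sum(ti / (1.0 + ai) for ti, ai in zip(t, a))
    return a, T
```

As expected from the condition, sections with larger reference times ti receive more area, and the result always beats a naive equal split.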

SLIDE 19

MultiAmdahl Framework: applying known techniques* to new environments. Can be used during system definition and/or dynamically to tune the system.

* Gossen’s second law (1854), Marginal utility, Marginal rate of substitution (Finance)


SLIDE 20

Example: CPU vs. Accelerators

Future GP CPU size vs. transistor budget growth

Test case: 4 accelerators and a GP (big) CPU. Applications: an evenly distributed benchmark mix with 10% sequential code.

Heterogeneous insight: in an environment with an increasing transistor budget, the importance of the General Purpose (big) CPU will grow.


SLIDE 21

Example: CPU vs. Accelerators

GP CPU size vs. power budget

Test case: 4 accelerators and a GP (big) CPU. Applications: an evenly distributed benchmark mix with 10% sequential code.


Heterogeneous insight: in an environment with a decreasing power budget, the importance of accelerators will grow.

SLIDE 22

Environment Changes

Is it time for a change in implementation?

  • Throughput became an essential microprocessor target
  • Data footprints became bigger
  • Multi-core systems are everywhere → more performance = more memory usage
  • Memory pressure is increasing
  • Significant CPU die power (>30%) is consumed by IO (access to out-of-die memory)


SLIDE 23

Bottom up approach: New device - Memristor?


SLIDE 24

What is a Memristor?

  • 2-terminal resistive nonvolatile device
  • The device's resistivity depends on past electrical current
  • The device is constructed of 2 metal layers with an oxide in between (e.g. TiO2)
  • Can be implemented in multi-(physical-)layer memory

[I-V curve: pinched hysteresis loop switching between RON and ROFF; voltage [V] vs. current [mA]]


Jul 30, 2013 Panasonic Starts World's First Mass Production of ReRAM Mounted Microcomputers

[1] ReRAM (Resistive Random Access Memory) A type of non-volatile memory which records "0" and "1" digital information by generating large resistance changes with a pulsed voltage applied to a thin-film metal oxide. The simple structure of the metal oxide sandwiched by electrodes makes the manufacturing process easier and provides excellent low power-consumption and high-speed rewriting characteristics.
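The "resistivity depends on past current" behavior above can be sketched with the linear ion-drift model published by HP Labs (Strukov et al., Nature 2008). The parameter values below are illustrative assumptions, not figures from the slides.

```python
# Minimal sketch of the HP Labs linear ion-drift memristor model.
# RON/ROFF, thickness D, and mobility MU are assumed illustrative values.

RON, ROFF = 100.0, 16e3   # low/high resistance states, ohms (assumed)
D = 10e-9                 # device thickness, m (assumed)
MU = 1e-14                # dopant mobility, m^2/(s*V) (assumed)

def simulate(voltage, dt, x0=0.1):
    """Integrate the internal state x = w/D under an applied voltage waveform.
    Memristance: M(x) = RON*x + ROFF*(1-x); drift: dx/dt = MU*RON/D^2 * i."""
    x, out = x0, []
    for v in voltage:
        m = RON * x + ROFF * (1.0 - x)   # resistance depends on state
        i = v / m
        x += MU * RON / D**2 * i * dt    # state driven by current history
        x = min(max(x, 0.0), 1.0)        # dopant front stays inside device
        out.append(m)
    return out
```

Two properties from the slide fall out directly: a sustained positive bias drives the resistance from ROFF toward RON, and with zero applied voltage the state (and hence the stored bit) does not change, i.e. the device is nonvolatile.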
SLIDE 25
Memristor

  • Theoretical idea by Chua in 1971
  • Implemented today by Hewlett-Packard, SK Hynix, HRL Labs
  • Memory products by Panasonic

50nm Array of 17 oxygen-depleted titanium dioxide memristors (HP Labs)


SLIDE 26

Memristor Microarchitecture “Vision”

Layers of memory cells above logic

Does this new structure open the possibility for new Microarchitecture?


SLIDE 27

Memristors to the Rescue?

  • Huge amount of memory cells
  • Very close to logic
  • Non-volatile: no need for power to keep alive
  • ~ transistor size
  • Fast
  • No leakage


SLIDE 28

Sea of Memory Cells Impact

  • Conventional vs. Out of the box
  • Enhance Multithreading architecture (Graphics like)
  • Increase on-die prediction structures
  • Instruction queues
  • Back to LUT (look-up-tables) implementations
  • New caches (e.g. NAHALAL, MC vs. MT, Cache specific content)
  • Non-Register Architecture (memory-to-memory operations)
  • Continuous Flow Multithreading (improved SoE MT)
  • Instruction reuse (memoization)
  • Computation at the memory level*

* Ref.: Dr. Avidan Akerib, General Manager, NeoMagic

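To illustrate one item on the list above: instruction reuse (memoization) caches the result of a previously executed operation so that repeating it becomes a table lookup instead of a recomputation. A toy sketch, with illustrative names:

```python
# Instruction reuse (memoization): a reuse buffer maps operand tuples to
# results; a hit skips the computation entirely. Names are illustrative.

class MemoizedUnit:
    def __init__(self):
        self.table = {}    # (operand, operand) -> result: the reuse buffer
        self.hits = 0
        self.misses = 0

    def mul(self, a, b):
        key = (a, b)
        if key in self.table:          # reuse hit: return the cached result
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = a * b                 # stand-in for a multi-cycle operation
        self.table[key] = result
        return result
```

A sea of cheap on-die memory cells makes such reuse tables far larger than today's; in software, Python's `functools.lru_cache` provides the same mechanism for functions.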

SLIDE 29

Memory Intensive Architectures

  • Bandwidth demons: traversing on a constant-throughput line?
  • → increase on-die memory (e.g. cache, new ideas)
  • Bandwidth to out-of-chip devices → energy waste

[Chart: throughput vs. bandwidth trend for several throughput engines (TP1–TP4) at the chip boundary]*

* Influenced by the ISCA 1995 paper "Performance Evaluation of the PowerPC 620 Microarchitecture" (graph: frequency vs. performance/frequency)

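The energy argument can be put in rough numbers: an off-chip access costs far more energy than an on-die hit, so raising the on-die hit rate cuts total access energy. The per-access energies below are illustrative assumptions, not figures from the talk.

```python
# Rough model of on-die vs. off-chip access energy. Values are assumed
# illustrative costs in picojoules, not measurements.

E_ON_DIE = 10.0     # pJ per on-die memory access (assumed)
E_OFF_CHIP = 640.0  # pJ per off-chip access, including IO (assumed)

def memory_energy_pj(accesses, hit_rate):
    """Total access energy for a given on-die hit rate."""
    hits = accesses * hit_rate
    misses = accesses - hits
    return hits * E_ON_DIE + misses * E_OFF_CHIP
```

With these numbers, moving from a 90% to a 99% on-die hit rate over a million accesses cuts access energy by roughly 4.5x, which is the motivation for piling memory cells onto the die.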

SLIDE 30

Switch on Event Multithreading

Example: processor pipeline

[Diagram: pipeline stages (Fetch, Execute, Write back) running Threads A, B, C; a cache miss triggers a thread switch]

SLIDE 31

Continuous Flow MT (CFMT)

Example – processor’s pipeline

SoE deficiencies:

  • Instructions beyond the "event instruction" are flushed
  • → waste of energy and performance degradation

Can we use memristors to reduce the thread-switch penalty (bubbles)? → Yes: do not flush; store the thread pipe state in memristors (Multistate Pipeline Register)


SLIDE 32

Continuous Flow MT (CFMT)

Example: processor's pipeline

[Diagram: pipeline stages (Fetch, Execute, Write back) with a Multistate Pipeline Register (MPR, read/write ports) attached at each pipeline register; Threads A and B each keep their own saved pipeline state]

SLIDE 33

Continuous Flow MT (CFMT)

Example: processor's pipeline

[Diagram: as above, with an MPR at every stage; on a cache miss, Thread A's pipeline state is parked in the MPRs while Threads B and C continue. MPR = Multistate Pipeline Register]
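The saving can be sketched with a back-of-the-envelope cost model: SoE pays a full pipeline refill per event, while CFMT pays only a small MPR save/restore cost. Pipeline depth and per-switch costs below are illustrative assumptions.

```python
# Rough thread-switch cost model: Switch-on-Event (SoE) flushes and refills
# the pipe; CFMT parks state in Multistate Pipeline Registers (MPRs).
# Depth and save/restore cost are assumed illustrative values.

def switch_cycles_soe(depth):
    """Cycles lost per event when the in-flight instructions are flushed."""
    return depth - 1

def switch_cycles_cfmt(save_restore=1):
    """Cycles lost per event when state is saved to MPRs instead of flushed."""
    return save_restore

def total_cycles(n_instr, n_switches, per_switch):
    """Ideal 1-IPC execution plus the accumulated switch overhead."""
    return n_instr + n_switches * per_switch
```

For example, with an assumed 7-stage pipe, a million instructions and ten thousand events, SoE spends about 1.06M cycles against roughly 1.01M for CFMT, which is the "no bubbles" effect the slides aim at.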

SLIDE 34

CFMT Initial Simulation (preliminary)

(ARM-like microarchitecture V7; lbm from SPEC CPU 2006)

[Chart: IPC (performance) vs. number of threads, for SoE (no CFMT), CFMT (mem only), and CFMT (mem and MCE)]

CFMT for multiple-cycle events? Not sure yet…

SLIDE 35

Memory Intensive Architecture

Looking Forward

  • Large on-die memory may save energy and change the way we architect our computational machines
  • Reduction in data transfer
  • Opportunity for dramatic improvement in performance/power or throughput/power
  • Performance improvement (@ same power) => energy reduction
  • Reduction of static/leakage power
  • Energy saving in reactive systems (0 memory energy when no operation)
  • NEW!!!


SLIDE 36

Summary

Saving energy via an optimal heterogeneous system. The introduction of huge on-die memory should alter the way we design computational machines for low energy consumption.


SLIDE 37

Thank You
