Memory driven architecture: flipping the inequality computing vs. memory (PowerPoint PPT Presentation)


SLIDE 1

Uri Weiser

Professor of Engineering, Technion

Memory driven architecture: flipping the inequality computing vs. memory


The talk covers research done by: Prof. Y. Etsion, Dr. Z. Guz, Prof. I. Keidar, Prof. A. Kolodny, S. Kvatinsky, Prof. I. Keslassy, T. Zidenberg, Prof. A. Mendelson, Y. Nacson, Prof. E. Friedman, Prof. U. Weiser

SLIDE 2

“The large energy consumption associated with the ever increasing internet use and the lack of efficient renewable energy sources to support it”

This conference's message:

  • Energy problems in data-com systems
  • Energy problems in computers: from systems to the chip level
  • Advanced solar energy harvesting

A scent of solutions?

SLIDE 3

The Trend Our Customers Expect


SLIDE 4

The Trend Our Customers Expect


SLIDE 5

Outline

  • The trends
  • The implications
  • The opportunities

  • Heterogeneous systems: some thoughts
  • Memristor → Memory Intensive Architecture (MIA)

  • Energy: optimal resource allocation in a heterogeneous system
  • How to start to think about Memory Intensive Architecture


SLIDE 6

The Trends


SLIDE 7

Process Technology: Minimum Feature Size

Source: Intel, SIA Technology Roadmap

SIA: Semiconductor Industry Association

[Chart: feature size (microns), log scale from 10 down to 0.01, vs. year ('68 to '14), with Intel and SIA data points; process nodes 180nm, 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm]

SLIDE 8

Putting It All Together


SLIDE 9

The Trend

Where are we going? The power wall


SLIDE 10

Microarchitecture

  • VLSI microarchitecture has been influenced by concepts that have been around for a long time
  • We have hit a power wall
  • Solutions:
      • Top down: improve performance/power or throughput/power → Heterogeneous Architecture
      • Bottom up: new devices? Memory-resistive devices?


SLIDE 11

Hetero vs. Memory Intensive

Heterogeneous Architecture

  • For a while, no major breakthrough in CPU technology
  • But the main reason is the POWER wall and energy/task
  • Accelerators to the rescue

Memory Intensive Architecture

  • Either a huge amount of memory cells close to logic, or
  • Logic cells close to lots of memory
  • Does it imply symmetric processing?


SLIDE 12

Flying machines - are they all the same?

Heterogeneous Systems


SLIDE 13

Heterogeneous Computing: Application Specific Accelerators

Continue the performance trend using heterogeneous computing to bypass current technological hurdles.

[Chart: performance/power vs. application range, with accelerators above the general-purpose curve]


SLIDE 14

Heterogeneous Computing

[Chart: performance/power of a general-purpose core vs. an accelerator]

SLIDE 15

Heterogeneous Systems’ Environment

An environment with limited resources: we need to optimize the system's targets within resource constraints. Resources may be:

  • Power, energy, area, space, $

System's targets may be:

  • Performance, power, energy, area, space, $


SLIDE 16

Heterogeneous Computing

Heterogeneous system design under a resource constraint: how to divide resources (e.g. area, power, energy) to achieve maximum system output (e.g. performance, throughput).

Accelerator target (an example): minimize execution time under an area constraint.

The application is divided into sections with execution times t1, t2, …, tn; the total budget is B = Σj bj.

ti = execution time of an application's section (run on a reference computing system)


SLIDE 17

MultiAmdahl:

T = t1·F1(a1) + t2·F2(a2) + … + tn·Fn(an)

A = a1 + a2 + a3 + … + an

Fi(ai) = the i-th accelerator's time-scaling function given area ai

Target: minimize T under a constraint A


SLIDE 18

MultiAmdahl:

Optimization using Lagrange multipliers

Minimize execution time (T = Σi ti·Fi(ai)) under an area constraint (A = Σi ai).

At the optimum, the Lagrange condition gives, for every pair i, j:

tj·F'j(aj) = ti·F'i(ai)

F' = derivative of the accelerator function; ai = area of the i-th accelerator; ti = execution time on the reference computer
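The marginal-utility condition can be turned into a tiny numerical solver. The sketch below is illustrative, not from the talk: it assumes a simple diminishing-returns accelerator model Fi(a) = 1/(1 + a) and finds the allocation satisfying the Lagrange condition by bisecting on the multiplier.

```python
import math

def multiamdahl_alloc(t, A):
    """Split a total area budget A across accelerators to minimize
    T = sum(t_i * F_i(a_i)), under the assumed model F_i(a) = 1/(1 + a).
    The Lagrange condition t_i*F'_i(a_i) = -lam then gives
    a_i = sqrt(t_i/lam) - 1 (clipped at 0); bisect on lam until areas sum to A."""
    def areas(lam):
        return [max(0.0, math.sqrt(ti / lam) - 1.0) for ti in t]
    lo, hi = 1e-12, max(t)           # sum(areas) is decreasing in lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(areas(mid)) > A:
            lo = mid                  # too much area handed out: raise lam
        else:
            hi = mid
    a = areas(0.5 * (lo + hi))
    T = sum(ti / (1.0 + ai) for ti, ai in zip(t, a))
    return a, T
```

As expected from the condition, sections with larger reference times ti receive more area, and the result always beats a naive equal split.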

SLIDE 19

MultiAmdahl Framework: applying known techniques* to new environments. Can be used during system definition and/or dynamically to tune the system.

* Gossen’s second law (1854), Marginal utility, Marginal rate of substitution (Finance)


SLIDE 20

Example: CPU vs. Accelerators

Future GP CPU size vs. transistor budget growth

Test case: 4 accelerators and a GP (big) CPU. Applications: an evenly distributed benchmark mix with 10% sequential code.

Heterogeneous insight: in an environment with an increasing transistor budget, the importance of the General Purpose (big) CPU will grow.


SLIDE 21

Example: CPU vs. Accelerators

GP CPU size vs. power budget

Test case: 4 accelerators and a GP (big) CPU. Applications: an evenly distributed benchmark mix with 10% sequential code.


Heterogeneous insight: in an environment with a decreasing power budget, the importance of accelerators will grow.

SLIDE 22

Environment Changes

Is it time for a change in implementation?

  • Throughput became an essential microprocessor target
  • Data footprints became bigger
  • Multi-core systems are everywhere → more performance = more memory usage
  • Memory pressure is increasing
  • Significant CPU die power (>30%) is consumed by IO (access to out-of-die memory)


SLIDE 23

Bottom up approach: New device - Memristor?


SLIDE 24

What is a Memristor?

  • 2-terminal resistive nonvolatile device
  • The device's resistivity depends on past electrical current
  • The device is constructed of 2 metal layers with an oxide in between (e.g. TiO2)
  • Can be implemented in multi-(physical-)layer memory

[I-V curve: pinched hysteresis loop switching between RON and ROFF; voltage [V] vs. current [mA]]


Jul 30, 2013 Panasonic Starts World's First Mass Production of ReRAM Mounted Microcomputers

[1] ReRAM (Resistive Random Access Memory) A type of non-volatile memory which records "0" and "1" digital information by generating large resistance changes with a pulsed voltage applied to a thin-film metal oxide. The simple structure of the metal oxide sandwiched by electrodes makes the manufacturing process easier and provides excellent low power-consumption and high-speed rewriting characteristics.
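The "resistivity depends on past current" behavior above can be sketched with the linear ion-drift model published by HP Labs (Strukov et al., Nature 2008). The parameter values below are illustrative assumptions, not figures from the slides.

```python
# Minimal sketch of the HP Labs linear ion-drift memristor model.
# RON/ROFF, thickness D, and mobility MU are assumed illustrative values.

RON, ROFF = 100.0, 16e3   # low/high resistance states, ohms (assumed)
D = 10e-9                 # device thickness, m (assumed)
MU = 1e-14                # dopant mobility, m^2/(s*V) (assumed)

def simulate(voltage, dt, x0=0.1):
    """Integrate the internal state x = w/D under an applied voltage waveform.
    Memristance: M(x) = RON*x + ROFF*(1-x); drift: dx/dt = MU*RON/D^2 * i."""
    x, out = x0, []
    for v in voltage:
        m = RON * x + ROFF * (1.0 - x)   # resistance depends on state
        i = v / m
        x += MU * RON / D**2 * i * dt    # state driven by current history
        x = min(max(x, 0.0), 1.0)        # dopant front stays inside device
        out.append(m)
    return out
```

Two properties from the slide fall out directly: a sustained positive bias drives the resistance from ROFF toward RON, and with zero applied voltage the state (and hence the stored bit) does not change, i.e. the device is nonvolatile.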
SLIDE 25
Memristor

  • Theoretical idea by Chua in 1971
  • Implemented today by Hewlett-Packard, SK Hynix, HRL Labs
  • Memory products by Panasonic

50nm Array of 17 oxygen-depleted titanium dioxide memristors (HP Labs)


SLIDE 26

Memristor Microarchitecture “Vision”

Layers of memory cells above logic

Does this new structure open the possibility for new Microarchitecture?


SLIDE 27

Memristors to the Rescue?

  • Huge amount of memory cells
  • Very close to logic
  • Non-volatile: no need for power to keep alive
  • ~ transistor size
  • Fast
  • No leakage


SLIDE 28

Sea of Memory Cells Impact

  • Conventional vs. Out of the box
  • Enhance Multithreading architecture (Graphics like)
  • Increase on-die prediction structures
  • Instruction queues
  • Back to LUT (look-up-tables) implementations
  • New caches (e.g. NAHALAL, MC vs. MT, Cache specific content)
  • Non-Register Architecture (memory-to-memory operations)
  • Continuous Flow Multithreading (improved SoE MT)
  • Instruction reuse (memoization)
  • Computation at the memory level*

* Ref.: Dr. Avidan Akerib, General Manager, NeoMagic

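To illustrate one item on the list above: instruction reuse (memoization) caches the result of a previously executed operation so that repeating it becomes a table lookup instead of a recomputation. A toy sketch, with illustrative names:

```python
# Instruction reuse (memoization): a reuse buffer maps operand tuples to
# results; a hit skips the computation entirely. Names are illustrative.

class MemoizedUnit:
    def __init__(self):
        self.table = {}    # (operand, operand) -> result: the reuse buffer
        self.hits = 0
        self.misses = 0

    def mul(self, a, b):
        key = (a, b)
        if key in self.table:          # reuse hit: return the cached result
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = a * b                 # stand-in for a multi-cycle operation
        self.table[key] = result
        return result
```

A sea of cheap on-die memory cells makes such reuse tables far larger than today's; in software, Python's `functools.lru_cache` provides the same mechanism for functions.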

SLIDE 29

Memory Intensive Architectures

  • Bandwidth demons: traversing on a constant-throughput line?
  • → increase on-die memory (e.g. cache, new ideas)
  • Bandwidth to out-of-chip devices → energy waste

[Chart: throughput vs. bandwidth trend for several throughput engines (TP1–TP4) at the chip boundary]*

* Influenced by the ISCA 1995 paper "Performance Evaluation of the PowerPC 620 Microarchitecture" (graph: frequency vs. performance/frequency)

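The energy argument can be put in rough numbers: an off-chip access costs far more energy than an on-die hit, so raising the on-die hit rate cuts total access energy. The per-access energies below are illustrative assumptions, not figures from the talk.

```python
# Rough model of on-die vs. off-chip access energy. Values are assumed
# illustrative costs in picojoules, not measurements.

E_ON_DIE = 10.0     # pJ per on-die memory access (assumed)
E_OFF_CHIP = 640.0  # pJ per off-chip access, including IO (assumed)

def memory_energy_pj(accesses, hit_rate):
    """Total access energy for a given on-die hit rate."""
    hits = accesses * hit_rate
    misses = accesses - hits
    return hits * E_ON_DIE + misses * E_OFF_CHIP
```

With these numbers, moving from a 90% to a 99% on-die hit rate over a million accesses cuts access energy by roughly 4.5x, which is the motivation for piling memory cells onto the die.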

SLIDE 30

Switch on Event Multithreading

Example: processor pipeline

[Diagram: pipeline stages (Fetch, Execute, Write back) running Threads A, B, C; a cache miss triggers a thread switch]

SLIDE 31

Continuous Flow MT (CFMT)

Example – processor’s pipeline

SoE deficiencies:

  • Instructions beyond the "event instruction" are flushed
  • → waste of energy and performance degradation

Can we use memristors to reduce the thread-switch penalty (bubbles)? → Yes: do not flush; store the thread pipe state in memristors (Multistate Pipeline Register)


SLIDE 32

Continuous Flow MT (CFMT)

Example: processor's pipeline

[Diagram: pipeline stages (Fetch, Execute, Write back) with a Multistate Pipeline Register (MPR, read/write ports) attached at each pipeline register; Threads A and B each keep their own saved pipeline state]

SLIDE 33

Continuous Flow MT (CFMT)

Example: processor's pipeline

[Diagram: as above, with an MPR at every stage; on a cache miss, Thread A's pipeline state is parked in the MPRs while Threads B and C continue. MPR = Multistate Pipeline Register]
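The saving can be sketched with a back-of-the-envelope cost model: SoE pays a full pipeline refill per event, while CFMT pays only a small MPR save/restore cost. Pipeline depth and per-switch costs below are illustrative assumptions.

```python
# Rough thread-switch cost model: Switch-on-Event (SoE) flushes and refills
# the pipe; CFMT parks state in Multistate Pipeline Registers (MPRs).
# Depth and save/restore cost are assumed illustrative values.

def switch_cycles_soe(depth):
    """Cycles lost per event when the in-flight instructions are flushed."""
    return depth - 1

def switch_cycles_cfmt(save_restore=1):
    """Cycles lost per event when state is saved to MPRs instead of flushed."""
    return save_restore

def total_cycles(n_instr, n_switches, per_switch):
    """Ideal 1-IPC execution plus the accumulated switch overhead."""
    return n_instr + n_switches * per_switch
```

For example, with an assumed 7-stage pipe, a million instructions and ten thousand events, SoE spends about 1.06M cycles against roughly 1.01M for CFMT, which is the "no bubbles" effect the slides aim at.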

SLIDE 34

CFMT Initial Simulation (preliminary)

(ARM-like microarchitecture V7; lbm from SPEC CPU 2006)

[Chart: IPC (performance) vs. number of threads, for SoE (no CFMT), CFMT (mem only), and CFMT (mem and MCE)]

CFMT for multiple-cycle events? Not sure yet…

SLIDE 35

Memory Intensive Architecture

Looking Forward

  • Large on-die memory may save energy and change the way we architect our computational machines
  • Reduction in data transfer
  • Opportunity for dramatic improvement in performance/power or throughput/power
  • Performance improvement (@ same power) => energy reduction
  • Reduction of static/leakage power
  • Energy saving in reactive systems (0 memory energy when no operation)
  • NEW!!!


SLIDE 36

Summary

Saving energy via an optimal heterogeneous system. The introduction of huge on-die memory should alter the way we design computational machines for low energy consumption.


SLIDE 37

Thank You
