INTERCONNECTION NETWORKS
Mahdi Nazm Bojnordi
Assistant Professor
School of Computing, University of Utah
CS/ECE 7810: Advanced Computer Architecture
Overview
¨ Upcoming deadline
¤ Feb. 3rd: project group formation
¤ No groups have sent me emails!
¨ This lecture
¤ Cache interconnects
¤ Basics of interconnection networks
¤ Network topologies
¤ Flow control
Where Are Interconnects Used?
¨ About 60% of the dynamic power in modern
microprocessors is dissipated in on-chip interconnects
[Magen’04]
[Figure: Intel Core i7 die – six processor cores, 8MB last-level cache]
Cache Interconnect Optimizations
Large Cache Organization
¨ Fewer subarrays give increased area efficiency,
but larger delay due to longer wordlines/bitlines
[Aniruddha’09]
[Figure: large cache organization with NDWL = 4, NDBL = 4 – subarrays connected by an H-tree; cores and cache banks linked by the interconnect]
Large Cache Energy Consumption
¨ The H-tree is clearly the dominant component of
energy consumption
[Aniruddha’09]
[Figure: cache energy breakdown – H-tree (~90%), decoder, wordlines, bitline mux & drivers, senseamp mux & drivers, bitlines, sense amplifiers, sub-array output drivers]
¨ Global wire management at the microarchitecture level
¨ A heterogeneous interconnect comprised of wires with varying latency, bandwidth, and energy characteristics
Heterogeneous Interconnects
[Balasubramonian’05]
¨ Better energy-efficiency for a dynamically scheduled
partitioned architecture
¤ ED² is reduced by 11%
¨ A low-latency, low-bandwidth network can be effectively used to hide wire latencies and improve performance
¨ A high-bandwidth low-energy network and an instruction
assignment heuristic are effective at reducing contention cycles and total processor energy.
Heterogeneous Interconnects
[Balasubramonian’05]
Non-Uniform Cache Architecture
¨ NUCA optimizes energy and time based on the
proximity of the cache blocks to the cache controller.
2MB @ 130nm: bank access time = 3 cycles, interconnect delay = 8 cycles
16MB @ 50nm: bank access time = 3 cycles, interconnect delay = 44 cycles
[Kim’04]
Non-Uniform Cache Architecture
¨ S-NUCA-1
¤ Uses a private channel per bank
¤ Each bank has its own distinct access latency
¤ Data location is decided statically for a given address
¤ Average access latency = 34.2 cycles
¤ Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 organization – tag array, address and data buses, banks and sub-banks, predecoders, sense amplifiers, wordline drivers and decoders]
[Kim’04]
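As a rough illustration of this static mapping, the sketch below picks a bank from fixed address bits and looks up a per-bank latency. All parameters (`NUM_BANKS`, `BLOCK_BITS`, the `BANK_LATENCY` table) are assumed for illustration, not the [Kim’04] configuration.

```python
# Illustrative sketch of S-NUCA-style static bank mapping
# (hypothetical parameters, not the exact [Kim'04] design).
NUM_BANKS = 32
BLOCK_BITS = 6  # 64-byte cache blocks

# Each bank has a fixed, distance-dependent access latency (cycles);
# here latency is simply assumed to grow with bank index.
BANK_LATENCY = [3 + 2 * (bank // 4) for bank in range(NUM_BANKS)]

def snuca_bank(address: int) -> int:
    """Statically select a bank from the block address bits."""
    return (address >> BLOCK_BITS) % NUM_BANKS

def snuca_latency(address: int) -> int:
    """Total access latency is determined solely by the mapped bank."""
    return BANK_LATENCY[snuca_bank(address)]

print(snuca_bank(0x1FC0), snuca_latency(0x1FC0))  # 31 17
```

Because the mapping is purely a function of the address, a block near the controller and a block in the farthest bank keep their latencies forever; this is exactly what D-NUCA's migration later relaxes.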
Non-Uniform Cache Architecture
¨ S-NUCA-2
¤ Uses a 2D switched network to alleviate wire area overhead
¤ Average access latency = 24.2 cycles
¤ Wire overhead = 5.9%
[Kim’04]
[Figure: S-NUCA-2 organization – banks connected through data buses and switches]
Non-Uniform Cache Architecture
¨ Dynamic NUCA
¤ Data can dynamically migrate
¤ Frequently used cache lines move closer to the CPU
[Kim’04]
[Figure: D-NUCA simple mapping – 8 bank sets, ways 0–3; one set spans a column of banks]
Non-Uniform Cache Architecture
¨ Fair mapping
¤ Average access time across all bank sets is equal
[Figure: D-NUCA fair mapping – 8 bank sets, ways 0–3; one set spans banks at varying distances]
Non-Uniform Cache Architecture
¨ Shared mapping
¤ The closest banks are shared among the farther bank sets
[Figure: D-NUCA shared mapping – 8 bank sets, ways 0–3]
Encoding Based Optimizations
¨ Bus invert coding transfers either the data or its complement to
minimize the number of bit flips on the bus.
[Table: bus-invert coding example – old vs. new data patterns, transferred with and without inversion]
[Stan’95]
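The bus-invert rule can be sketched as follows; this is a minimal illustration assuming an 8-bit bus, and `bus_invert` is a hypothetical helper, not the [Stan’95] implementation.

```python
# Sketch of bus-invert coding [Stan'95]: drive the complement of the
# new word (and assert an extra invert line) whenever more than half
# of the bus bits would otherwise toggle.
BUS_WIDTH = 8
MASK = (1 << BUS_WIDTH) - 1

def bus_invert(prev_on_bus: int, new_word: int):
    """Return (word_to_drive, invert_bit) minimizing bit flips."""
    flips = bin((prev_on_bus ^ new_word) & MASK).count("1")
    if flips > BUS_WIDTH // 2:
        # Complementing flips fewer than half the wires.
        return new_word ^ MASK, 1
    return new_word, 0

word, inv = bus_invert(0b00000000, 0b11111110)  # 7 flips > 4
print(format(word, "08b"), inv)  # 00000001 1
```

The extra invert wire caps the worst case at BUS_WIDTH/2 + 1 transitions per transfer, directly lowering the activity factor in the switching-power equation.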
P_switching = α · C · V_DD² · f
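A quick worked example of the switching-power equation P = α · C · V_DD² · f, with assumed, purely illustrative values for the activity factor, capacitance, voltage, and frequency:

```python
# Worked example of dynamic switching power P = alpha * C * V_DD^2 * f
# (all values are assumed for illustration).
alpha = 0.15      # activity factor: fraction of wires toggling per cycle
C = 1e-12         # switched capacitance: 1 pF
V_DD = 1.0        # supply voltage: 1 V
f = 2e9           # clock frequency: 2 GHz

P = alpha * C * V_DD**2 * f
print(P)  # about 3e-4 W, i.e. 0.3 mW
```

Encoding schemes like bus-invert coding attack the α term; voltage scaling attacks the quadratic V_DD term.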
Time-Based Data Transfer
¨ The percentage of processor energy expended on
an 8MB cache when running a set of parallel applications on a Sun Niagara-like multicore processor
[Bojnordi’13]
[Figure: relative CPU energy breakdown]
Time-Based Data Transfer
¨ Communication over the long, capacitive H-tree
interconnect is the dominant source of energy consumption (80% on average) in the L2 cache
[Bojnordi’13]
[Figure: relative L2 cache energy breakdown]
¨ Key idea: represent information by the number of clock cycles between two consecutive pulses to reduce the interconnect activity factor.
Time-Based Data Transfer
[Figure: transmitting the value 5 over a 5-cycle time axis – parallel data transfer (fixed transfer time), serial data transfer, and time-based data transfer (fixed dynamic energy)] [Bojnordi’13]
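A minimal sketch of the key idea, assuming a single wire and ideal timing; `time_encode`/`time_decode` are illustrative names, not the actual DESC implementation.

```python
# Sketch of time-based encoding (in the spirit of [Bojnordi'13]):
# a value v is conveyed as v idle cycles between two pulses, so every
# transfer costs a fixed two wire transitions regardless of the value.
def time_encode(value: int) -> list:
    """Per-cycle wire levels: pulse, `value` idle cycles, pulse."""
    return [1] + [0] * value + [1]

def time_decode(signal: list) -> int:
    """Count the cycles between the two pulses."""
    first = signal.index(1)
    second = signal.index(1, first + 1)
    return second - first - 1

sig = time_encode(5)
print(sig, time_decode(sig))  # [1, 0, 0, 0, 0, 0, 1] 5
```

The trade-off is visible in the figure above: dynamic energy becomes fixed (two transitions), while transfer time grows with the transmitted value.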
Time-Based Data Transfer
¨ Cache blocks are partitioned into small, contiguous chunks.
[Bojnordi’13]
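A hypothetical sketch of this chunking step, assuming 4-bit chunks; the actual DESC chunk size and ordering are simplified here.

```python
# Hypothetical sketch: split a cache block into small contiguous 4-bit
# chunks so each chunk's time-encoded value (and hence its transfer
# time) stays bounded. Details simplified from [Bojnordi'13].
CHUNK_BITS = 4

def to_chunks(block: int, nbits: int) -> list:
    """Return the block as low-to-high 4-bit chunk values."""
    mask = (1 << CHUNK_BITS) - 1
    return [(block >> i) & mask for i in range(0, nbits, CHUNK_BITS)]

chunks = to_chunks(0xBEEF, 16)
print([hex(c) for c in chunks])  # ['0xf', '0xe', '0xe', '0xb']
```

Bounding each chunk to 4 bits caps the worst-case inter-pulse gap at 15 cycles per chunk, keeping the fixed-energy/variable-time trade-off manageable.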
Time-Based Data Transfer
[Bojnordi’13]
Time-Based Data Transfer
¨ L2 cache energy is reduced by 1.8x at the cost of
less than 2% increase in the execution time.
[Figure: L2 cache energy vs. execution time, both normalized to binary encoding – DESC compared with Dynamic Zero Compression and Bus-Invert Coding] [Bojnordi’13]
Interconnection Networks
¨ Goal: transfer the maximum amount of information in the minimum time and with minimum power
¨ Connects processors, memories, caches, and I/O
devices
Interconnection Network CPU Mem CPU Mem CPU Mem CPU Mem CPU Mem CPU Mem
Types of Interconnection Networks
¨ Four domains based on number and proximity of
devices
¤ On-chip networks (OCN or NOC)
n Microarchitectural elements: cores, caches, reg. files, etc.
¤ System/storage area networks (SAN)
n Computer subsystems: storage, processor, IO device, etc.
¤ Local area networks (LAN)
n Autonomous computer systems: desktop computers etc.
¤ Wide area networks (WAN)
n Interconnected computers distributed across the globe
Basics of Interconnection Networks
¨ Network topology
¤ How to wire switches and nodes in the network
¨ Routing algorithm
¤ How to transfer a message from source to destination
¨ Flow control
¤ How to control the flow of messages within the network