PHI: ARCHITECTURAL SUPPORT FOR SYNCHRONIZATION- AND BANDWIDTH-EFFICIENT COMMUTATIVE SCATTER UPDATES (PowerPoint Presentation)
SLIDE 1

PHI: ARCHITECTURAL SUPPORT FOR SYNCHRONIZATION- AND BANDWIDTH-EFFICIENT COMMUTATIVE SCATTER UPDATES

Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
MICRO 2019

SLIDE 2

Scatter updates are common but inefficient

- Scatter updates are common in sparse algorithms
  - e.g., in push graph algorithms, vertices scatter updates to outgoing neighbors
- Current memory hierarchies are optimized for reads
  - Scatter updates suffer from high synchronization overhead and high memory bandwidth usage
- Key insight: Many scatter updates are commutative and can be reordered for performance
- PHI extends the cache hierarchy to exploit the temporal and spatial locality of commutative scatter updates

[Figure: a conventional hierarchy fetches lines from memory into the cache and writes them back (Fetch/WB); PHI instead pushes updates down the hierarchy and merges them (Push/Merge)]
SLIDE 10

PHI gives large benefits

- PageRank algorithm on the UK web graph
- 16-core processor with 32MB cache, 4 memory controllers

[Charts: speedup over Push and normalized memory traffic for Push, Pull, UB, and PHI; PHI reduces memory traffic by 3.5x]

SLIDE 14

Agenda

- Background
- PHI Design
- Evaluation

SLIDE 15

Scatter updates are important

- Sparse algorithms perform push- or pull-based indirect accesses
- Push mode: indirect accesses are scatter updates
- Pull mode: indirect accesses are gather reads
- Important to support scatter updates efficiently
  - Push mode performs less work when few vertices are active
  - Some algorithms do not admit a pull implementation

Push:

    for src in vertices:
        for dst in outNeighbors(src):
            vertex(dst) += vertex(src)

Pull:

    for dst in vertices:
        for src in inNeighbors(dst):
            vertex(dst) += vertex(src)

[Figure: an example four-vertex graph (vertices 1-4) traversed edge-by-edge in push order and in pull order]
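The two loop nests above can be run side by side on a toy graph to see why they are interchangeable: because `+` is commutative and associative, scattering in push order and gathering in pull order produce the same vertex values. This is a minimal sketch; the graph, vertex IDs, and values are illustrative, not from the talk.

```python
# Toy graph: adjacency lists, vertex id -> list of out-neighbors (illustrative).
out_neighbors = {0: [1, 2], 1: [2], 2: [3], 3: [0]}

# Build the reverse adjacency (in-neighbors) for the pull version.
in_neighbors = {v: [] for v in out_neighbors}
for src, dsts in out_neighbors.items():
    for dst in dsts:
        in_neighbors[dst].append(src)

values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

# Push mode: each source scatters its value to its out-neighbors (scatter updates).
push_result = dict(values)
for src in out_neighbors:
    for dst in out_neighbors[src]:
        push_result[dst] += values[src]

# Pull mode: each destination gathers from its in-neighbors (gather reads).
pull_result = dict(values)
for dst in in_neighbors:
    for src in in_neighbors[dst]:
        pull_result[dst] += values[src]

# Because addition commutes, both traversal orders agree.
assert push_result == pull_result
```

The push version writes to arbitrary destinations (the access pattern PHI targets), while the pull version only reads indirectly and writes each destination once.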

SLIDE 28

Scatter updates are inefficient on conventional hierarchies

- Poor temporal and spatial locality when inputs do not fit in cache
  - Wasteful data transfers from main memory
- Multiple threads update the same vertex
  - Cache-line ping-ponging

[Figure: two cores updating vertex 2 bounce its cache line between their private caches through the shared cache]

[Chart: memory requests per edge for Push PageRank on the uk-2005 graph, broken down into updates, destination vertex, source vertex, and CSR traffic; 93% of the traffic is due to scatter updates, 10x more traffic than compulsory]

SLIDE 39

Prior hardware support for scatter updates

- Remote memory operations (RMOs) send and perform update operations at a fixed location (e.g., shared cache banks)
  - Avoids cache-line ping-ponging
- COUP [MICRO'15] modifies the coherence protocol to perform commutative operations in a distributed fashion
- Neither RMOs nor COUP improves locality
  - Bottlenecked by memory traffic with large inputs

SLIDE 45

PHI builds on Update Batching (UB)

Propagation Blocking [IPDPS'17], MILK [PACT'16]

- Maximizes the spatial locality of memory transfers using two-phase execution
- Binning phase: logs updates to memory, dividing them into cache-fitting slices (bins) of vertices
- Accumulation phase: reads and applies the logged updates bin-by-bin

[Figure: source vertices A-D scatter updates with destination IDs 11, 5, 9, 0, 7, 4, 6, 8, 3, 12; with a cache-fitting slice of 8 destination vertices, updates to destinations 0-7 are logged as (destination, value) records in Bin 0 and updates to destinations 8-15 in Bin 1, both in main memory; the accumulation phase then drains each bin in turn]
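The two phases above can be sketched in software. This is a minimal illustration of the Propagation Blocking idea, not the hardware mechanism: the number of slices, the update list, and the array sizes are assumptions for the example.

```python
from collections import defaultdict

def update_batching(num_vertices, updates, num_slices=2):
    """Apply (dst, delta) updates via binning + accumulation."""
    values = [0] * num_vertices
    slice_size = (num_vertices + num_slices - 1) // num_slices

    # 1. Binning phase: log (dst, delta) records into per-slice bins
    #    instead of touching vertex state directly. Each bin is written
    #    sequentially, giving perfect spatial locality.
    bins = defaultdict(list)
    for dst, delta in updates:
        bins[dst // slice_size].append((dst, delta))

    # 2. Accumulation phase: apply each bin's updates while only the
    #    corresponding cache-fitting slice of vertices is hot.
    for b in sorted(bins):
        for dst, delta in bins[b]:
            values[dst] += delta
    return values

vals = update_batching(8, [(5, 1), (0, 2), (7, 3), (5, 4)])
```

Note that correctness relies on the updates being commutative: reordering them by destination slice does not change the final values.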

SLIDE 57

Update Batching tradeoffs

- Perfect spatial locality for all main-memory transfers
  - Compulsory memory traffic for all data structures
- Binning phase ignores temporal locality
  - Generates a large stream of updates even with structured inputs

[Charts: memory requests per edge (updates, destination vertex, source vertex, CSR) for Push, UB, and PHI on an unstructured and a structured input; Push PageRank on the uk-2005 graph]

SLIDE 64

Agenda

- Background
- PHI Design
- Evaluation

SLIDE 65

Key techniques of PHI

- In-cache update buffering and coalescing
  - Exploits temporal locality
- Selective update batching
  - Achieves high spatial locality
- Hierarchical buffering and coalescing
  - Enables update parallelism
  - Eliminates synchronization overheads

The first two techniques make PHI bandwidth-efficient; the third makes it synchronization-efficient.

SLIDE 70

In-cache buffering and coalescing

- Buffer updates in cache without ever accessing main memory
- Treat the cache as a large coalescing buffer for updates
- A reduction ALU in each cache bank performs the coalescing

[Figure: the core issues UPDATE 0xF00, +4; the cache allocates line 0xF00 holding just the delta +4 without fetching the line, whose value in memory is still 10; a later UPDATE 0xF00, +2 coalesces in place, leaving a buffered delta of +6]
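The buffering behavior above can be modeled with a few lines of Python. This is a software sketch of the idea, not the hardware design: a cache line is reduced to a single buffered delta, and the class name and counter are illustrative.

```python
class CoalescingCache:
    """Model of a cache used as a coalescing buffer for commutative updates."""

    def __init__(self):
        self.deltas = {}          # line address -> buffered delta
        self.memory_fetches = 0   # would count main-memory reads; stays 0 here

    def update(self, addr, delta):
        # The reduction ALU in the cache bank merges the incoming delta
        # with any delta already buffered for this address. The line is
        # never fetched from memory on an update.
        self.deltas[addr] = self.deltas.get(addr, 0) + delta

cache = CoalescingCache()
cache.update(0xF00, 4)   # buffered without fetching 0xF00 from memory
cache.update(0xF00, 2)   # coalesced: buffered delta becomes +6
```

Memory's copy of the line (10 in the slide's example) is only reconciled with the buffered delta later, when the line is evicted.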

SLIDE 80

Handling cache evictions

- PHI adapts to the amount of spatial locality in the evicted line
- The cache controller performs update batching selectively
  - Achieves good spatial locality in all cases
- Key insight: update batching is a good tradeoff only when the evicted line has poor spatial locality
slide-84
SLIDE 84

Case 1: Evicted line has few updates

14

slide-85
SLIDE 85

Case 1: Evicted line has few updates

14

¨ Log updates to temporary buffers (stored in cache)

slide-86
SLIDE 86

Case 1: Evicted line has few updates

14

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

slide-87
SLIDE 87

Case 1: Evicted line has few updates

14

Cache Memory

7

0xA4:

3

0xF8:

F00 4

0x10:

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-88
SLIDE 88

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

7

0xA4:

3

0xF8:

F00 4

0x10:

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-89
SLIDE 89

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

3

0xF8:

F00 4

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-90
SLIDE 90

Case 1: Evicted line has few updates

14 Evict 0xA4

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-91
SLIDE 91

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-92
SLIDE 92

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-93
SLIDE 93

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-94
SLIDE 94

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

slide-95
SLIDE 95

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8:

F00 4 A48 7

0x10: INV

Evict 0xF8 Evict 0x10

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

F00 4 A48 7

0x10:

slide-96
SLIDE 96

Case 1: Evicted line has few updates

14

Cache Memory

3

0xF8: INV

Evict 0xF8

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

INV

F00 4 A48 7

0x10:

slide-97
SLIDE 97

Case 1: Evicted line has few updates

14

Cache Memory

INV INV

¨ Log updates to temporary buffers (stored in cache) ¨ These buffers are later evicted to memory when full

INV

4

0xF0:

Invalid line Buffered-updates line

F00 4 A48 7

0x10:

Line with batched updates

F84 3

0x11:

F00 4 A48 7

0x10:

slide-98
SLIDE 98

Case 1: Evicted line has many valid updates

15

___

slide-99
SLIDE 99

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

___

slide-100
SLIDE 100

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-101
SLIDE 101

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-102
SLIDE 102

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-103
SLIDE 103

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

slide-104
SLIDE 104

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

MERGE

slide-105
SLIDE 105

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

1 2 1 7

0xF0:

MERGE

slide-106
SLIDE 106

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

4 6 3

0xF0:

Evict 0xF0

___ 5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

5 8 4 7

0xF0:

MERGE

slide-107
SLIDE 107

Case 1: Evicted line has many valid updates

15

¨ Fetch line from main memory and merge updates

Cache Memory

Evict 0xF0

___

INV

5 6 1 8

0xBC:

7 9 2

0xDF: INV

4

0xF0:

Invalid line Buffered-updates line

5 8 4 7

0xF0:

MERGE
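The two eviction cases can be combined into one decision rule: count the valid deltas in the evicted line, and either log them (few updates, poor spatial locality) or fetch-and-merge (many updates, good spatial locality). The sketch below is illustrative; the threshold, the word-granular memory model, and the function name are assumptions, not the paper's exact policy.

```python
def evict_line(addr, deltas, memory, update_log, threshold=2):
    """Handle eviction of a line of buffered deltas.

    deltas: per-word buffered deltas (None = no update for that word).
    memory: word address -> value (a simplification of line-granular DRAM).
    """
    valid = [(i, d) for i, d in enumerate(deltas) if d is not None]
    if len(valid) <= threshold:
        # Case 1: few updates. Log (word address, delta) records for
        # later batched accumulation instead of fetching the line.
        for i, d in valid:
            update_log.append((addr + i, d))
    else:
        # Case 2: many updates. Fetch the line and merge the buffered
        # deltas into it, amortizing the transfer over many updates.
        for i, d in valid:
            memory[addr + i] += d

memory = {0xF0: 1, 0xF1: 2, 0xF2: 1, 0xF3: 7}
log = []
evict_line(0xF0, [4, 6, 3, None], memory, log)   # many updates: merge
evict_line(0xA4, [7, None, None, None], memory, log)  # few updates: log
```

The first eviction reproduces the slide's merge (1, 2, 1, 7 becomes 5, 8, 4, 7); the second logs a single record rather than paying a full line fetch for one update.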

SLIDE 108

PHI avoids synchronization costs

- Private caches buffer and coalesce updates locally and push them to the shared cache on evictions
  - No need for a coherence protocol

[Figure: Core 0 issues Update 0xF04, +4, buffered in its private cache; Core 1 issues Update 0xF08, +3, buffered in its own private cache; neither update triggers coherence traffic; when a private cache evicts line 0xF0, its buffered deltas are pushed down to the shared cache and merged there; a later Update 0xB5C, +2 is buffered the same way]
slide-120
SLIDE 120

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 0 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Memory

slide-121
SLIDE 121

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 0 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Evict 0xF0 Memory

slide-122
SLIDE 122

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

4

0xF0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Evict 0xF0 Memory

slide-123
SLIDE 123

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

0 4 3

0xF0:

0 0 0 2

0xB5:

Update 0xA00, +1 Memory

slide-124
SLIDE 124

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

1 0 0

0xA0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Memory

slide-125
SLIDE 125

PHI avoids synchronization costs

16

¨ Private caches buffer and coalesce

updates locally, push them to shared cache on evictions

¤No need for a coherence protocol

¨ Private caches do not perform update

batching

¤Simply evict buffered-update lines to

shared cache

Core 1 Private Cache 0 Shared Cache Core 0 Private Cache 1

1 0 0

0xA0:

0 4 3

0xF0:

0 0 0 2

0xB5:

Memory
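The eviction-driven flow on this slide can be modeled in software. This is a hedged sketch under assumed names (`PrivateCache`, `update`, `evict_all`), not the paper's interface; the point is that a commutative reduction makes eviction order irrelevant, so no coherence protocol is needed.

```python
# Software model of PHI's private-cache buffering and coalescing.
WORDS_PER_LINE = 4

class PrivateCache:
    def __init__(self):
        self.lines = {}  # line address -> buffered partial sums

    def update(self, line_addr, word, delta):
        """Coalesce a scatter update into a locally buffered line."""
        line = self.lines.setdefault(line_addr, [0] * WORDS_PER_LINE)
        line[word] += delta

    def evict_all(self, shared):
        """Push every buffered-updates line down to the shared cache."""
        for addr, partial in self.lines.items():
            line = shared.setdefault(addr, [0] * WORDS_PER_LINE)
            for i, v in enumerate(partial):
                line[i] += v
        self.lines.clear()

# Replaying the slide's updates: both cores touch line 0xF0 independently,
# and the shared cache sees the merged result regardless of eviction order.
shared = {}
c0, c1 = PrivateCache(), PrivateCache()
c0.update(0xF0, 1, +4)   # Core 0: Update 0xF04, +4
c1.update(0xF0, 2, +3)   # Core 1: Update 0xF08, +3
c1.evict_all(shared)
c0.evict_all(shared)
print(shared[0xF0])      # [0, 4, 3, 0]
```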

slide-126
SLIDE 126

PHI has minimal hardware costs

¨ Per-line buffered-updates bit

¤0.17% additional storage with 64-byte lines

¨ Reduction unit for each cache bank

¤Supports 64-bit floating-point and integer additions, and logical operations
¤0.06% of chip area in a 16-core system (0.09 mm² in 45 nm)
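As a sanity check on the 0.17% figure (the tag/metadata size below is an assumption, not from the talk):

```python
# One buffered-updates bit per cache line. Relative to a 64-byte line's
# 512 data bits plus an assumed ~64 bits of tag/state metadata, that is
# roughly 1/576 of the cache's storage.
data_bits = 64 * 8               # 64-byte line
tag_and_state_bits = 64          # assumption for this estimate
overhead = 1 / (data_bits + tag_and_state_bits)
print(f"{overhead:.2%}")         # 0.17%
```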

slide-129
SLIDE 129

Agenda

¨ Background
¨ PHI Design
¨ Evaluation

slide-130
SLIDE 130

Evaluation methodology

¨ Event-driven simulation using ZSim
¨ 16-core processor

¤Haswell-like OOO cores
¤32 MB L3 cache
¤4 memory controllers

¨ Graph applications

¤PageRank, PageRank Delta, Connected Components, Radii Estimation

¨ Degree Counting (No Pull)
¨ SpMV
¨ Large real-world inputs

¤Up to 100 million vertices
¤Up to 1 billion edges

[Diagram: Core 0 … Core 15, each with a private cache, sharing a cache connected to memory.]
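The scatter-update pattern these benchmarks share can be illustrated with degree counting (a hedged sketch; function and graph names are illustrative): each edge pushes a commutative `+1` to its destination vertex, which is exactly the update stream PHI buffers and merges.

```python
# Push-style scatter updates, as in the Degree Counting benchmark.
def degree_count(num_vertices, edges):
    in_degree = [0] * num_vertices
    for src, dst in edges:
        in_degree[dst] += 1   # commutative scatter update: any order works
    return in_degree

edges = [(0, 1), (2, 1), (1, 3), (0, 3)]
print(degree_count(4, edges))  # [0, 2, 0, 2]
```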

slide-135
SLIDE 135

PHI improves performance significantly

¨ Pull and UB show mixed results
¨ Push-RMO improves performance by avoiding synchronization costs
¨ PHI consistently outperforms the other schemes

[Chart: performance of Push, Pull, UB, Push-RMO, and PHI across benchmarks.]

slide-141
SLIDE 141

PHI reduces memory traffic

¨ Pull incurs higher memory traffic for non-all-active algorithms (CC, RE)
¨ UB increases memory traffic when the input has good locality
¨ PHI reduces memory traffic over UB by exploiting temporal locality

[Chart: memory traffic of Push, Pull, UB, and PHI across benchmarks.]

slide-147
SLIDE 147

Conclusion

¨ Scatter updates are inefficient on conventional memory hierarchies
¨ PHI extends the cache hierarchy to make commutative scatter updates efficient
¨ Exploits both temporal and spatial locality
¨ Incurs low memory traffic and minimal synchronization

Thanks For Your Attention! Questions Are Welcome!