Caching, Parallelism, Fault Tolerance Marco Serafini COMPSCI 532 - PowerPoint PPT Presentation

Caching, Parallelism, Fault Tolerance Marco Serafini COMPSCI 532 Lectures 2-3

Memory Hierarchy

Multi-Core Processors Processor (chip) Processor (chip) Processor (chip) core core core core core core … core core core core core core Socket Socket Socket (to motherboard) Main Memory

Questions • Q1: Bottlenecks in the previous figure? • Q2: Which is larger? • The CPU data processing speed? • The memory bus speed? • Q3: Solutions? 4

General Pattern Cache hit / miss Fast memory (small, expensive) Fetch and cache Evict inactive data recently read data Slow memory (large, cheap) 5

Memory Hierarchy in a Processor PROCESSOR PROCESSOR CORE CORE CORE CPU Registers (L0) … L1 cache (~64KB*2 SRAM) Split: Instructions | Data L2 cache (~512KB SRAM) Unified: Instructions + Data L3 cache aka Last Level Cache (~4MB SRAM) shared across cores RAM (~4GB to 1+TB DRAM) shared across processors 6

Architectural Variants • Non-Uniform Memory Access (NUMA) • Popular in modern multi-socket architectures • Each socket has local RAM • Other sockets can access it (typically via a point-to-point bus) • Remote memory access is slower than local • Fundamental principles remain • Hierarchy • Locality 7

Latency Numbers (2012 & approximate!) 8

Analogy (1 ns = 1 hour) L1 cache access: 0.5 h Main memory reference: 4.17 days Watching an episode of a TV-serie A long camping trip L2 cache access: 7h Disk seek: 1141 years Almost a working day ~ time passed since Charlemagne crowned Emperor 9

Why Caching? • Temporal locality • Recently accessed data is likely to be used again soon • Spatial locality • Data close to recently accessed data is likely to be used soon 10

Example • Top-k integers in unsorted array • Maintain heap to store top-k elements • Scan the array to update heap • Q: Does locality help in this application? Min-heap check min if elem larger Array delete min min 15 insert elem 111 … 15 212 307 556 343 212 111 Scan 11

Some Answers • Temporal locality • Helps for heap management • Does not help for scanning the array • Estimating latency access • Consider the Top-100 example • Array elements are 4 bytes integers • Question: What is the expected latency to fetch a heap element? • Spatial locality? • Assume cache line is 64 bytes 12

Cache Coherency (i.e. Consistency) • Caches may have different replicas of the same data • Replication always creates consistency issues • Programs assume that they access a single shared memory • Keeping caches coherent is expensive! fetch addr. A cache Coherency protocol (HW) Main memory cache fetch addr. A 13

MESI Protocol • A cache line can be in four states • Modified: Not shared, dirty (i.e., inconsistent with main memory) • Exclusive: Not shared, clean (i.e., consistent with main memory) • Shared: Shared, clean • Invalid: Cannot be used • Only clean data is shared • Cache line transitions to Modified à All its copies become Invalid • Invalid data needs to be fetched again • Writes are detected by hw snooping on the bus • Q: Implications for programmers? 14

Write Back vs. Write Through • How to react when cache has dirty data • Write through: update lower-level caches & main memory immediately • Write back: delay that update 15

False Sharing • A core updates variable x and never reads y • Another core reads y and never reads x • Q: Can cache coherence kick in? • A: Yes, if x and y are stored in the same cache line struct foo { int x; int y; } 16

Keeping the CPU busy

Modern CPU Architectures • Many optimizations to deal with stagnant clock speed • Pipelining • Execute multiple instructions in a pipeline, not one at a time • Pre-load and enqueue instructions to be executed next • Out-of-order execution • Execute instructions whose input data is available out-of-order • Q: When do these not work? • A: Branches • Speculation • Processor predicts which branch is taken and pipelines • Speculative work thrown away in case of branch misprediction 18

Example (Again) • Top-k integers in unsorted array • Maintain heap to store top-k elements • Scan the array to update heap • Q: Can we leverage speculation and prefetching? Min-heap check min if elem larger Array delete min min 15 insert elem 111 … 15 212 307 556 343 212 111 Scan 19

Micro-Architectural Analysis q3 L1 LLC branch 50 150 cycles IPC instr. miss miss miss 40 100 30 Q1 Typer 34 2.0 68 0.6 0.57 0.01 Q3 Typer 25 0.8 21 0.5 0.16 0.27 20 50 Q3 TW 24 1.8 42 0.9 0.16 0.08 10 0 Typer TW • CPU counters ( perf tool) – normalized per row scanned • IPC: Instructions per cycle • L1, LLC: Cache misses • Branch miss: branch mispredictions • Q: Which system performs better? Source: T. Kersten et al., “Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask”, VLDB’19

Memory Stalls Memory Stall Cycles Other Cycles Typer Tectorwise 0 Cycles / Tuple 60 40 40 q3 20 20 0 0 100 1 3 10 30 100 1 3 10 30 100 50 Data Size (TPC-H Scale Factor) • CPU cycles wasted waiting for data • Speculation, prefetching = Lower cost for cache misses

External I/O

Direct Memory Access (DMA) • Hardware subsystem to transfers to/from memory • CPU is offloaded • CPU sends request and does something else • It gets notification when transfer is done • Used for • Disk, Network, GPU I/O • Memory copying • RDMA: Remote DMA • Fast data transfer across nodes in a distributed system

Disk I/O • Write/read blocks of bits • Sequential writes and reads are more efficient • There is a fixed cost for each I/O call • Calls to operating system functions are expensive • Q: How to amortize this cost? • A: Batching: read/write larger blocks • Same concept applies to network I/O and request processing • Latency vs. throughput tradeoff

Processes vs. Threads 25

Processes & Threads • We have discussed that multicores is the future • How to make use of parallelism? • OS/PL support for parallel programming • Processes • Threads

Processes vs. Threads • Process: separate memory space • Thread: shared memory space (except stack) Processes Threads Heap not shared shared Global variables not shared shared Local variables (Stack) not shared not shared Code shared shared File handles not shared shared

Parallel Programming • Shared memory • Threads • Access same memory locations (in heap & global variables) • Message-Passing • Processes • Explicit communication: message-passing

Threads/Shared Memory

Shared Memory Example void main (){ x = 12; // assume that x is a global variable t = new ThreadX(); t.start(); // starts thread t y = 12/x; This is “pseudo-Java” System.out.println(y); t.join(); // wait until t completes in C++: } pthread_create pthread_join class ThreadX extends Thread{ void run (){ x = 0; } } • Question: What is printed as output?

Desired: Atomicity and Isolation void foo (){ Thread a Thread b Atomic: All or nothing … x = 0; … Isolation: Run as if there foo() x = 1; foo() was no concurrency … y = 1/x; … } POSSIBLE DESIRED Thread a Thread b Thread a Thread b x = 0 x = 0 x = 1 x = 1 y = 1 happens- x = 0 before x = 0 y = 1/0 x = 1 changes become y = 1 visible time

Race Conditions • Non-deterministic access to shared variables • Correctness requires specific sequence of accesses • But we cannot rely on it because of non-determinism! • Solutions • Enforce a specific order using synchronization • Enforce a sequence of happen-before relationships • Locks, mutexes, semaphores: threads block each other • Lock-free algorithms: threads do not wait for each other • Hard to implement correctly! Typical programmer uses locks • Java has optimized data structures with thread-safety, e.g., ConcurrentHashMap

Locks We use a lock variable l void foo (){ Thread a Thread b and use it to synchronize x = 0; … … x ++; l.lock() l.lock() Equivalent: declare y = 1/x; foo() foo() void synchronized l.unlock() l.unlock() } foo() Possible Impossible now Thread a Thread b Thread a Thread b l.lock() x = 0 l.lock() - waits foo() x = 1 l.unlock() x = 0 l.lock() - acquires foo() l.unlock() time

Inter-Thread Communication Thread a Thread b … … synchronized(o){ synchronized(o){ o.wait(); foo(); foo(); o.notify(); } } Thread a Thread b Notify on an object sends a signal that activates o.wait() other threads waiting on that object … Thread a waits Useful for controlling order of actions: foo() … Thread b executes foo before Thread a o.notify() o.wait() Example: Producer/Consumer pairs . foo() Consumer can avoid busy waiting .

Caching, Parallelism, Fault Tolerance Marco Serafini COMPSCI 532 - PowerPoint PPT Presentation

Caching, Parallelism, Fault Tolerance Marco Serafini COMPSCI 532 Lectures 2-3 Memory Hierarchy Multi-Core Processors Processor (chip) Processor (chip) Processor (chip) core core core core core core core core core core core

Agenda Caching Caching Gitlab Demo Caching Demos Mirroring Caching Limitations Manual

Web Proxy Web Proxy Caching Caching Caching Web Proxy Web Proxy Caching By Miquel Company

Distributed Systems 5. Fault Tolerant Systems Fault-Tolerance - 1 Lszl Bszrmnyi

Lecture 10: Fault Tolerance Fault Tolerant Concurrent Computing The main principles of fault

Adaptability and Fault Tolerance Adaptability and Fault Tolerance Rog rio rio de Lemos de

General Principles of Fault- Tolerance Daniel Gottesman Perimeter Institute Whats Left For

Roadmap for Section 10.1 The Notion of Fault-Tolerance Fault-Tolerance Support in NTFS Volume

Cooperative Web Caching Cooperative Web Caching Cooperative Caching Cooperative Caching

Web Caching and Content Delivery Web Caching and Content Delivery Caching for a Better Web

Hardware Parallelism vs. Software Parallelism USENIX Workshop on Hot Topics in Parallelism March

Challenging Malicious Inputs with Fault Tolerance Techniques Bruno Luiz Agenda Threats

Fault Tolerance at Speed Todd L. Montgomery @toddlmontgomery About me What type of Fault

Rigorous fault-tolerance thresholds Ben Reichardt UC Berkeley N gate circuit 0/1 N gate

Fault Tolerance and Robustness in Concurrent Systems Faults, errors, failures, and fault

Web Caching based on: Web Caching , Geoff Huston Web Caching and Zipf-like Distributions:

CSci 5105 Introduction to Distributed Systems Fault Tolerance Last Time Replication and

1 Malicious Usage Computer Arithmetic /* Kernel memory region holding user-accessible data */

COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides Q. How

2-Level Page Tables Virtual Address (VA): 32 bits Virtual Address Space: 2 32 bytes Page Size: 2

Parallel Models Different ways to exploit parallelism Funding Partners bioexcel.eu Reusing

CPSC 121: Models of Computation able to: Specify the overall architecture of a (Von Neumann)

Last Class: Introduction to Operating Systems User apps Virtual machine interface OS physical

virtual memory 5 / devices 1 last time page replacement metrics optimizing hit rate really

Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray