SLIDE 1

Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2019/20

SLIDE 2

Part V Execution on Multiple Cores

SLIDE 3

Example: Star Joins

Task: run parallel instances of the query (↗ introduction):

SELECT SUM(lo_revenue)
FROM   part, lineorder
WHERE  p_partkey = lo_partkey
  AND  p_category <= 5

(part is the dimension table, lineorder the fact table.)

[Plan sketch: a join of σ(part) and lineorder.]

To implement the join, use either a hash join or an index nested loops join.

SLIDE 4

Execution on “Independent” CPU Cores

Co-run independent instances on different CPU cores.

[Bar chart: performance degradation (0 %–60 %) for HJ alone, HJ + HJ, HJ + INLJ, INLJ alone, INLJ + HJ, and INLJ + INLJ.]

Concurrent queries may seriously affect each other's performance.

SLIDE 5

Shared Caches

In Intel Core 2 Quad systems, two cores share an L2 cache:

[Diagram: four CPU cores, each with a private L1 cache; each pair of cores shares an L2 cache that connects to main memory.]

What we saw was cache pollution.
→ How can we avoid this cache pollution?

SLIDE 6

Cache Sensitivity

Dependence on cache sizes for some TPC-H queries: some queries are more sensitive to cache sizes than others.

cache sensitive: hash joins
cache insensitive: index nested loops joins; hash joins with a very small or very large hash table

SLIDE 7

Locality Strength

This behavior is related to the locality strength of execution plans:

Strong locality: small data structure, reused very frequently (e.g., a small hash table).
Moderate locality: frequently reused data structure, roughly of cache size (e.g., a moderate-sized hash table).
Weak locality: data not reused frequently, or data structure ≫ cache size (e.g., a large hash table; index lookups).

SLIDE 8

Execution Plan Characteristics

Locality affects how caches are used:

                          strong    moderate    weak
  amount of cache used    small     large       large
  amount of cache needed  small     large       small

Plans with weak locality have the most severe impact on co-running queries: they occupy much cache that they do not actually need (cache pollution).

Impact of a co-runner (columns) on a query (rows):

              strong     moderate    weak
  strong      low        moderate    high
  moderate    moderate   high        high
  weak        low        low         low

SLIDE 9

Experiments: Locality Strength

[Chart: performance degradation (0 %–60 %) as a function of hash table size (0.4–18.6 MB) for the pairings Index Join to Index Join, Index Join to Hash Join, Hash Join to Index Join, Hash Join to Hash Join, and Index Join to Index Join (bitmap scan).]

Source: Lee et al. MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases. VLDB 2009.

SLIDE 10

Locality-Aware Scheduling

An optimizer could use knowledge about localities to schedule queries.

Estimate locality during query analysis:
  Index nested loops join → weak locality
  Hash join: hash table ≪ cache size → strong locality
             hash table ≈ cache size → moderate locality
             hash table ≫ cache size → weak locality

Co-schedule queries to minimize (the impact of) cache pollution. Which queries should be co-scheduled, which ones not?
  Only run weak-locality queries next to weak-locality queries.
  → They cause high pollution, but are not affected by pollution.
  Try to co-schedule queries with small hash tables.
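A minimal sketch, in C, of how an optimizer might encode this classification and pairing rule. The threshold constants (¼ and 4× the cache size) and all function names are illustrative assumptions, not part of MCC-DB:

```c
#include <stddef.h>

enum locality { STRONG, MODERATE, WEAK };

/* Classify a plan's locality from the optimizer's size estimates.
 * The 1/4 and 4x cut-offs are made-up values for illustration.     */
enum locality classify_plan(int is_index_nested_loops,
                            size_t est_hash_table_bytes,
                            size_t cache_bytes)
{
    if (is_index_nested_loops)
        return WEAK;                          /* index lookups: weak locality */
    if (est_hash_table_bytes < cache_bytes / 4)
        return STRONG;                        /* hash table << cache size     */
    if (est_hash_table_bytes > 4 * cache_bytes)
        return WEAK;                          /* hash table >> cache size     */
    return MODERATE;                          /* hash table ~ cache size      */
}

/* Pairing rule from the slide: weak-locality plans may only share a
 * cache with other weak-locality plans; otherwise prefer partners
 * with small (strong-locality) hash tables.                         */
int may_corun(enum locality a, enum locality b)
{
    if (a == WEAK || b == WEAK)
        return a == WEAK && b == WEAK;
    return a == STRONG || b == STRONG;
}
```

With the impact matrix from the previous slide in mind, may_corun() rejects pairs in which only one side has weak locality and otherwise favors pairs that include a small hash table.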

SLIDE 11

Experiments: Locality-Aware Scheduling

PostgreSQL; 4 queries (different p_category values); for each query: 2 × hash join plan, 2 × INLJ plan; impact reported for the hash joins:

[Bar chart: performance impact (0 %–50 %) per hash table size (0.78 MB, 2.26 MB, 4.10 MB, 8.92 MB), comparing default scheduling with locality-aware scheduling.]

Source: Lee et al. VLDB 2009.

SLIDE 12

Cache Pollution

Weak-locality plans cause cache pollution because they use much cache space even though they do not strictly need it.

By partitioning the cache we could reduce pollution with little impact on the weak-locality plan:

[Diagram: a shared cache split into a region for the moderate-locality plan and a region for the weak-locality plan.]

But: cache allocation is controlled by the hardware.

SLIDE 13

Cache Organization

Remember how caches are organized: the physical address of a memory block determines the cache set into which it can be loaded.

[Figure: byte address split into the block address (tag, set index) and the offset.]

Thus, we can influence hardware behavior by the choice of physical memory allocation.

SLIDE 14

Page Coloring

The address ↔ cache set relationship inspired the idea of page colors. Each memory page is assigned a color.⁵ Pages that map to the same cache sets get the same color.

[Figure: memory pages mapped to cache sets; pages of the same color map to the same region of the cache.]

How many colors are there in a typical system?
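The number of colors follows from the cache geometry: colors = (number of sets × line size) / page size. A small C sketch with assumed parameters (4 MB, 16-way, 64-byte lines, 4 kB pages), which yields 64 colors; the MCC-DB setup on a later slide uses 32, i.e., a different configuration:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry (example values only). */
#define CACHE_SIZE    (4u * 1024 * 1024)   /* 4 MB L2                 */
#define LINE_SIZE     64u                  /* 64-byte cache lines     */
#define ASSOCIATIVITY 16u                  /* 16-way set associative  */
#define PAGE_SIZE     4096u                /* 4 kB pages              */

#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * ASSOCIATIVITY))
#define NUM_COLORS ((NUM_SETS * LINE_SIZE) / PAGE_SIZE)

/* The color of a page: which group of cache sets its lines map to.
 * Consecutive physical pages cycle through all colors.              */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("%u sets, %u colors, %u kB of cache per color\n",
           NUM_SETS, NUM_COLORS, CACHE_SIZE / NUM_COLORS / 1024);
    printf("color of the page at physical address 0x12345000: %u\n",
           page_color(0x12345000ull));
    return 0;
}
```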

⁵ Memory is organized in pages. A typical page size is 4 kB.

SLIDE 15

Page Coloring

By using memory of only certain colors, we can effectively restrict the cache region that a query plan uses.

Note that applications (usually) have no control over physical memory. Memory allocation and the virtual ↔ physical mapping are handled by the operating system. We need OS support to achieve the desired cache partitioning.

SLIDE 16

MCC-DB: Kernel-Assisted Cache Sharing

MCC-DB ("Minimizing Cache Conflicts"):

Modified Linux 2.6.20 kernel
  Support for 32 page colors (4 MB L2 cache: 128 kB per color)
  Color specification file for each process (may be modified by the application at any time)

Modified instance of PostgreSQL
  Four colors for the regular buffer pool
  Implications on buffer pool size (16 GB main memory)?
  For strong- and moderate-locality queries, allocate colors as needed (i.e., as estimated by the query optimizer)

SLIDE 17

Experiments

Moderate-locality hash join and weak-locality co-runner (INLJ):

[Chart: L2 cache miss rate (0 %–50 %) as the number of colors given to the weak-locality plan shrinks (32, 24, 16, 8, 4); curves for the weak-locality INLJ and the moderate-locality HJ, with single-threaded execution as baselines.]

Source: Lee et al. VLDB 2009.

SLIDE 18

Experiments

Moderate-locality hash join and weak-locality co-runner (INLJ):

[Chart: execution time (10–70 sec) as the number of colors given to the weak-locality plan shrinks (32, 24, 16, 8, 4); curves for the weak-locality INLJ and the moderate-locality HJ, with single-threaded execution as baselines.]

Source: Lee et al. VLDB 2009.

SLIDE 19

Experiments: MCC-DB

PostgreSQL; 4 queries (different p_category values); for each query: 2 × hash join plan, 2 × INLJ plan; impact reported for the hash joins:

[Bar chart: performance impact (0 %–50 %) per hash table size (0.78 MB, 2.26 MB, 4.10 MB, 8.92 MB), comparing default scheduling, locality-aware scheduling, and page coloring.]

Source: Lee et al. VLDB 2009.

SLIDE 20

Building a Shared-Memory Multiprocessor

What the programmer likes to think of...

[Diagram: four CPU cores directly attached to a shared main memory.]

Scalability? Moore's Law?

SLIDE 21

Centralized Shared-Memory Multiprocessor

Caches help mitigate the bandwidth bottleneck(s).

[Diagram: four CPU cores, each with a private cache, behind a shared cache that connects to shared main memory.]

A shared bus connects CPU cores and memory.
→ The "shared bus" may or may not be shared physically.

The Intel Core architecture, e.g., implemented this design.

SLIDE 22

Centralized Shared-Memory Multiprocessor

The shared bus design with caches makes sense:

+ symmetric design; uniform access time for every memory item from every processor
+ private data gets cached locally → behavior identical to that of a uniprocessor
? shared data will be replicated to private caches
  → Okay for parallel reads.
  → But what about writes to the replicated data?
  → In fact, we'll want to use memory as a mechanism to communicate between processors.

The approach does have limitations, too:
– For large core counts, the shared bus may still be a (bandwidth) bottleneck.

SLIDE 23

Caches and Shared Memory

Caching/replicating shared data can cause problems:

[Diagram: two CPUs with private caches over a shared main memory holding x = 4. Both CPUs read x and cache the value 4; one CPU then writes x := 42 into its cache, but the other CPU's subsequent read still returns the stale value 4 from its own cache.]

Challenges:
  Need well-defined semantics for such scenarios.
  Must efficiently implement that semantics.

SLIDE 24

Cache Coherence

The desired property (semantics) is cache coherence. Most importantly:⁶

Writes to the same location are serialized; two writes to the same location (by any two processors) are seen in the same order by all processors.

Note: We did not specify which order will be seen by the processors. → Why?

⁶ We also demand that a read by processor P returns P's most recent write, provided that no other processor has written to the same location in the meantime. Also, every write must become visible to other processors after "some time."

SLIDE 25

Cache Coherence Protocol

Multiprocessor (or multicore) systems maintain coherence through a cache coherence protocol.

Idea: Know which cache/memory holds the current value of an item. Other replicas might be stale.

Two alternatives:

1 Snooping-Based Coherence

→ All processors communicate to agree on item states.

2 Directory-Based Coherence

→ A centralized directory holds information about state/whereabouts of data items.

SLIDE 26

Snooping-Based Cache Coherence

Rationale:
  All processors have access to a shared bus.
  They can "snoop" on the bus to track other processors' activities.
  Use this to track the sharing state of each cached item.

[Figure: cache block layout with (sharing) state, tag, and data.]

Metadata for each cache block:
  (sharing) state
  block identification (tag)

Ignoring multiprocessors for a moment, which "state" information might make sense to keep?

SLIDE 27

Strategy 1: Write Update Protocol

Idea: On every write, propagate the write to every copy.
→ Use the bus to broadcast writes.⁷

Pros/cons of this strategy?

⁷ The protocol is thus also called a write broadcast protocol.

SLIDE 28

Strategy 2: Write Invalidate Protocol

Idea: Before writing an item, invalidate all other copies.

Activity      Bus                Cache A          Cache B    Memory
(initially)                                                  x = 4
A reads x     cache miss for x   x = 4                       x = 4
B reads x     cache miss for x   x = 4            x = 4      x = 4
A reads x     – (cache hit)      x = 4            x = 4      x = 4
B writes x    invalidate x       (invalidated)    x = 42     x = 4⁸
A reads x     cache miss for x   x = 42           x = 42     x = 42

→ Caches will re-fetch invalidated items automatically.
Since the bus is shared, other caches may answer "cache miss" messages (necessary for write-back caches).

⁸ With write-through caches, memory will be updated immediately.

SLIDE 29

Write Invalidate—Realization

Realization: To invalidate, broadcast the address on the bus. All processors continuously snoop on the bus:

invalidate message for an address held in own cache
  → Invalidate own copy.

miss message for an address held in own cache
  → Reply with own copy (for write-back caches).
  → Memory will see this and abort its own read.

What if two processors try to write at the same time?

SLIDE 30

Write Invalidate—Tracking Sharing States

Through snooping, we can monitor all bus activities by all processors. → Track the sharing state.

Idea:
  Sending an invalidate will make the local copy the only valid one.
  → Mark the local cache line as modified (≈ exclusive).
  If a local cache line is already modified, writes need not be announced on the bus (no invalidate message).
  Upon a read request by another processor:
  → If the local cache line is in state modified, answer the request by sending the local version.
  → Change the local cache state to shared.
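A compact C sketch of these per-cache-line transitions (the MSI state machine shown on the following slides); the event names and the bus helper functions are assumptions made for illustration:

```c
#include <stdio.h>

/* MSI sharing states, tracked per cache line. */
enum msi_state { INVALID, SHARED, MODIFIED };

/* Events as seen by one local cache: requests from its own CPU and
 * messages snooped from the shared bus.                             */
enum msi_event {
    CPU_READ, CPU_WRITE,
    BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE
};

/* Stand-ins for real bus/memory actions. */
static void put_read_miss_on_bus(void)  { puts("bus: read miss");   }
static void put_write_miss_on_bus(void) { puts("bus: write miss");  }
static void put_invalidate_on_bus(void) { puts("bus: invalidate");  }
static void write_back_block(void)      { puts("write back block"); }

static enum msi_state msi_next(enum msi_state s, enum msi_event e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  { put_read_miss_on_bus();  return SHARED;   }
        if (e == CPU_WRITE) { put_write_miss_on_bus(); return MODIFIED; }
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE) { put_invalidate_on_bus(); return MODIFIED; }
        if (e == BUS_WRITE_MISS || e == BUS_INVALIDATE) return INVALID;
        return SHARED;                /* reads keep the line shared      */
    case MODIFIED:
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED;  }
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return MODIFIED;              /* local hits cause no bus traffic */
    }
    return s;
}

int main(void)
{
    enum msi_state s = INVALID;
    s = msi_next(s, CPU_READ);        /* invalid  -> shared              */
    s = msi_next(s, CPU_WRITE);       /* shared   -> modified            */
    s = msi_next(s, BUS_READ_MISS);   /* modified -> shared + write back */
    return s == SHARED ? 0 : 1;
}
```

The sketch omits evictions caused by conflicting blocks; the diagrams on the next slides include those "CPU read/write miss" transitions as well.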

SLIDE 31

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram: states invalid, shared ("clean"), and modified ("dirty"); transitions labeled with CPU events, e.g., "CPU read miss: put read miss on bus", "CPU write miss: put write miss on bus", and misses from modified that additionally write back the data.]

Uniprocessor view → track "dirty".

SLIDE 32

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram as on the previous slide, extended with the transition "CPU write hit: put invalidate on bus" (shared → modified).]

Uniprocessor → track "dirty"; multiprocessor → also send invalidate.

SLIDE 33

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram as on the previous slides, now also reacting to snooped bus events: an invalidate or write miss for a held block moves it to invalid (writing the block back if it was modified); a read miss for a modified block writes the data back and moves it to shared.]

Uniprocessor → track "dirty"; multiprocessor (cont.) → also react to bus events.

SLIDE 34

Write Invalidate—Notes

Notes:

Because of the three states modified, shared, and invalid, the protocol on the previous slides is also called the MSI protocol.

The write invalidate protocol ensures that any valid cache block is either in the shared state in one or more caches or in the modified state in exactly one cache.

(Any transition to the modified state invalidates all other copies of the block; whenever another cache fetches a copy of the block, the modified state is left.)

The MSI protocol also ensures that every shared item has also been written back to memory.

SLIDE 35

MSI Protocol—Extensions

Actual systems often use extensions to the MSI protocol, e.g.:

MESI ("E" for exclusive)
  Distinguishes between exclusive (but clean) and modified (which implies that the copy is exclusive).
  Optimizes the (common) case where an item is first read (→ exclusive) and then modified (→ modified).

MESIF ("F" for forward)
  In M(E)SI, if shared items are served by caches (not only by memory), all caches might answer miss requests.
  MESIF extends the protocol so that at most one shared copy of an item is marked as forward. Only this cache will respond to misses on the bus.
  The Intel i7 employs the MESIF protocol.

SLIDE 36

MSI Protocol—Extensions

MOESI ("O" for owned)
  Owned marks an item that might be outdated in memory; the owning cache is "responsible" for the item.
  The owner must respond to data requests (since main memory might be outdated).
  MOESI allows moving dirty data around between caches.
  The AMD Opteron uses the MOESI protocol.
  MOESI avoids the need to write every shared cache block back to memory.

SLIDE 37

Limitations of a Shared Bus

Limitations of a shared bus:
  Large core counts → high bandwidth demand.
  Shared buses cannot satisfy the bandwidth demands of modern multiprocessor systems.

Therefore:
  Distribute the memory.
  Communicate through an interconnection network.

Consequence: non-uniform memory access (NUMA) characteristics.

SLIDE 38

Bandwidth Demand

E.g., Intel Xeon E7-8880 v3:
  2.3 GHz clock rate
  18 cores per chip (36 threads)
  up to 8 processors per system

Back-of-the-envelope calculation: 1 byte per cycle per core
→ 2.3 × 10⁹ cycles/s × 18 cores × 8 processors × 1 B ≈ 331 GB/s.
Data-intensive applications might demand much more!

Shared memory bus? Modern bus standards deliver at most a few tens of GB/s, and switching very high bandwidths is a challenge.

SLIDE 39

Distributed Shared Memory

Idea: Distribute the memory → attach it to individual compute nodes.

[Diagram: four CPUs, each with its own local memory, connected by an interconnect.]

SLIDE 40

Example: 8-Way Intel Nehalem-EX

[Diagram: eight CPUs, each with local memory (and I/O attached to four of them), connected by point-to-point links.]

Interconnect: "Intel QuickPath Interconnect (QPI)"⁹

Memory may be local, one hop away, or two hops away.
→ Non-uniform memory access (NUMA)

⁹ The AMD counterpart is "HyperTransport".

SLIDE 41

Distributed Memory and Snooping

Idea: Extend “snooping” to distributed memory. Broadcast coherence traffic, send data point-to-point. Problem solved?

SLIDE 42

Snooping-Based Cache Coherency: Scalability

[Charts: miss rate (split into coherence and capacity misses) as a function of processor count (1, 2, 4, 8, 16) for the scientific applications FFT, Ocean, LU, and Barnes.]

Example: scientific applications (↗ Hennessy & Patterson, Sect. I.5).
→ The AMD Opteron is a system that still uses this approach.

SLIDE 43

Directory-Based Cache Coherence

To avoid an all-broadcast coherence protocol:
  Use a directory to keep track of which item is replicated where.
  Direct coherence messages only to those nodes that actually need them.

Directory:
  Either keep a global directory (→ scalability?).
  Or define a home node for each memory address.
  → The home node holds the directory for that item.
  → Typically: distribute the directory along with the memory.

The protocol now involves the directory/-ies (at the items' home nodes) and the individual caches (local to the processors). Parties communicate point-to-point (no broadcasts).

SLIDE 44

Directory-Based Cache Coherence

Messages sent by individual nodes:

Message type      Source           Destination      Contents   Function of this message
Read miss         Local cache      Home directory   P, A       Node P has a read miss at address A; request data and make P a read sharer.
Write miss        Local cache      Home directory   P, A       Node P has a write miss at address A; request data and make P the exclusive owner.
Invalidate        Local cache      Home directory   A          Request to send invalidates to all remote caches that are caching the block at address A.
Invalidate        Home directory   Remote cache     A          Invalidate a shared copy of data at address A.
Fetch             Home directory   Remote cache     A          Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared.
Fetch/invalidate  Home directory   Remote cache     A          Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  Home directory   Local cache      D          Return a data value from the home memory.
Data write-back   Remote cache     Home directory   A, D       Write back a data value for address A.

↗ Hennessy & Patterson, Computer Architecture, 5th edition, page 381.

SLIDE 45

Directory-Based Coherence—State Machine

Individual caches use a state machine similar to the snooping-based MSI state machine shown earlier (slides 31–33).

[State diagram: states invalid, shared, and modified; CPU read/write misses send read/write miss messages to the home directory, a write hit on a shared line sends an invalidate message; messages from the home directory (invalidate, fetch, write miss) invalidate the block or force a write-back.]

SLIDE 46

Directory-Based Coherence—State Machine

The directory has its own state machine.

[State diagram: directory states uncached, shared, and exclusive; a read miss replies with the data value and adds P to the sharers, a write miss replies with the data, invalidates other sharers if necessary, and sets Sharers = {P}, a write-back empties the sharer set; for an exclusive block, a read miss first fetches the block from the owner and then Sharers ∪= {P}.]

SLIDE 47

Cache Coherence Cost

Experiment: several threads randomly increment elements of an integer array (Zipfian probability distribution, no synchronization¹⁰).

[Bar chart: nanoseconds per iteration (up to ~100 ns); roughly 6.6 ns for 1 thread, 13.2 ns for 2 threads on the same core, 19.6 ns for 2 threads on the same chip, and 80.7 ns for 2 threads off-chip, plus results for 1–8 threads on the same chip. Intel Nehalem EX; 1.87 GHz; 2 CPUs, 8 cores/CPU.]

¹⁰ In general, this will yield incorrect counter values.

SLIDE 48

Cache Coherence Cost

Two types of coherence misses:

true sharing miss
  → Data is actually shared among processors.
  → Often-used mechanism to communicate between threads.
  → These misses are unavoidable.

false sharing miss
  → Processors use different data items, but the items reside in the same cache line.
  → Items get invalidated/migrated, even though no data is actually shared.

How can false sharing misses be avoided?
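The usual remedy is to pad or align per-thread data so that items written by different threads never share a cache line. A minimal sketch, assuming a 64-byte line size; the counter layout and sizes are illustrative:

```c
#include <stdalign.h>
#include <pthread.h>

#define CACHE_LINE 64                   /* assumed cache line size */
#define NTHREADS   4

/* Naive layout: all counters packed into one cache line.
 * Every increment then invalidates the line in the other cores' caches. */
struct counters_packed {
    long c[NTHREADS];
};

/* Padded layout: one cache line per counter -> no false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].value++;           /* touches only this thread's line */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```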

SLIDE 49

NUMA—Non-Uniform Memory Access

Distribution makes memory access locality-sensitive. → Non-Uniform Memory Access (NUMA)

[Diagram: four sockets (0–3), each with local memory; numbered access paths 1–4 of increasing distance.]

path    bandwidth    latency
 1      24.7 GB/s    150 ns
 2      10.9 GB/s    185 ns
 3      10.9 GB/s    230 ns
 3/4¹¹   5.3 GB/s    235 ns

↗ Li et al. NUMA-Aware Algorithms: The Case of Data Shuffling. CIDR 2013.

¹¹ Path 3 with cross traffic along path 4.

SLIDE 50

Sorting and NUMA

[Diagram: the input relation is split across NUMA regions; each thread sorts its part locally ("local sort"), and the sorted runs are then merged.]

SLIDE 51

Resulting Throughput

[Chart: throughput [M tuples/sec] (50–300) vs. number of threads (1–64); scaling flattens out at a memory bottleneck.]

SLIDE 52

NUMA and Bandwidth

Problem: Merging is bandwidth-bound. → Merge multiple runs (from NUMA regions) at once

(Two-way merging would be more CPU-efficient because of SIMD.)

→ Might need more instructions, but brings bandwidth and compute into balance.

[Diagram: one thread merges runs from NUMA regions 0–3 through small, cache-resident buffers.]
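A minimal sketch of such a multi-way merge: a plain linear scan over the k run heads (one run per NUMA region). A real implementation would add the cache-resident buffers from the figure and SIMD-friendly merge kernels; the function and its signature are illustrative assumptions:

```c
#include <stdlib.h>

/* Merge k sorted runs into out[]; runs[i] points to run i and len[i]
 * is its length. A linear scan over the k run heads is fine here,
 * because k stays small (one run per NUMA region).                  */
static void multiway_merge(const int **runs, const size_t *len,
                           size_t k, int *out)
{
    size_t *pos = calloc(k, sizeof(size_t));    /* read position per run */
    for (;;) {
        size_t best = k;                        /* run with smallest head */
        for (size_t i = 0; i < k; i++)
            if (pos[i] < len[i] &&
                (best == k || runs[i][pos[i]] < runs[best][pos[best]]))
                best = i;
        if (best == k)                          /* all runs exhausted */
            break;
        *out++ = runs[best][pos[best]++];
    }
    free(pos);
}
```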

SLIDE 53

Throughput With Multi-Way Merging

[Chart: throughput [M tuples/sec] (50–300) vs. number of threads (1–64), now with multi-way merging.]

SLIDE 54

NUMA Effects in Detail

Bandwidth: single links have lower bandwidth than the memory controllers.

[Diagrams: Intel Nehalem EX — four CPUs with 25.6 GB/s per local memory controller and 12.8 GB/s (bidirectional) per link; Intel Sandy Bridge EP — four CPUs with 51.2 GB/s per local memory controller and 16 GB/s (bidirectional) per link.]

SLIDE 55

Parallelize Database Workloads

To leverage the hardware potential, databases must use parallelism.

→ Inter-query parallelism?
  Requires a sufficient number of co-running queries.
  May work well for OLTP workloads (they tend to consist of many, many queries, each of which is very simple).
  Data analytics/OLAP workloads often don't fulfill this requirement.
  Won't help an individual query.

Therefore: intra-query parallelism is a must. It should still allow (a few) co-running queries.

SLIDE 56

Parallelization Strategies

Parallelization strategies for intra-query parallelism:

Pipeline parallelism?
Data partitioning / parallel operator implementations?

SLIDE 57

Data Partitioning / Parallel Operator Implementations

E.g., parallel hash joins (radix joins):

[Chart: time [billion cycles] (2–8) per thread id (1–8), broken down into the phases (1–5) of the radix join.]

↗ Balkesen et al. Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware. ICDE 2013.

SLIDE 58

Data Partitioning / Parallel Operator Implementations

E.g., TPC-H Query 10 on MonetDB:

[Profiler trace: TPC-H Query 10 at scale factor 10 on MonetDB; 672 MAL instructions executed at 62.1 % parallelism usage, with most time spent in algebra.join (16.49 sec), aggr.sum (12.18 sec), and language.dataflow (9.61 sec).]

↗ Mrunal Gawade. Multi-Core Parallelism in a Column-Store. PhD Thesis, Universiteit van Amsterdam. 2017.

SLIDE 59

Data Partitioning

Lessons learned:
→ Use fine-granular partitioning. The increased scheduling overhead seems bearable.
→ Assign partitions/tasks dynamically to processors. This makes load balancing easier.

E.g., morsel-driven parallelism (as implemented in HyPer):
  Break operator inputs into chunks of ≈ 100,000 tuples ("morsels").
  Fixed number of operator threads.
  Morsels are dispatched to threads dynamically (task queue).

↗ Leis et al. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. SIGMOD 2014.
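A minimal sketch of morsel-wise dispatch: a shared atomic cursor hands out fixed-size morsels to a fixed pool of worker threads. The 100,000-tuple morsel size follows the slide; everything else (names, structure) is an illustrative assumption, not HyPer code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MORSEL_SIZE 100000              /* ~100,000 tuples per morsel */
#define NWORKERS    8

struct dispatcher {
    _Atomic size_t next;                /* next unprocessed tuple index */
    size_t         total;               /* total number of input tuples */
};

/* Grab the next morsel; returns 0 when the input is exhausted. */
static int next_morsel(struct dispatcher *d, size_t *begin, size_t *end)
{
    size_t b = atomic_fetch_add(&d->next, MORSEL_SIZE);
    if (b >= d->total)
        return 0;
    *begin = b;
    *end   = b + MORSEL_SIZE < d->total ? b + MORSEL_SIZE : d->total;
    return 1;
}

static void process_morsel(size_t begin, size_t end)
{
    /* e.g., probe every tuple in [begin, end) against the hash tables */
    (void)begin; (void)end;
}

static struct dispatcher disp = { 0, 10000000 };

static void *worker(void *arg)
{
    (void)arg;
    size_t b, e;
    while (next_morsel(&disp, &b, &e))  /* threads pull work dynamically */
        process_morsel(b, e);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Because a slow thread simply grabs fewer morsels, load balancing falls out of the dynamic assignment by itself.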

SLIDE 60

Morsel-Driven Parallelism: Idea

[Figure: a dispatcher hands out morsels of R to threads, which probe into the hash tables HT(S) and HT(T) and store the results.]

Probe phase of the join query R ⋈ S ⋈ T: R is broken up into segments; threads grab segments and, for each tuple, probe into the hash tables HT(S) and HT(T).

SLIDE 61

Morsel-Driven Parallelism: Query Pipelines

HyPer compiles plan segments between pipeline breakers into machine code. E.g., three pipelines:

1 Scan and filter T, build HT(T).
2 Scan and filter S, build HT(S).
3 Scan and filter R, probe into both hash tables.

After compilation, each pipeline becomes one "operator".

[Figure: plan for R ⋈ S ⋈ T with the three pipelines and the global hash tables HT(S) and HT(T).]

Data dependencies:
→ Pipelines 1 and 2 must complete before pipeline 3 begins.
→ But pipelines 1 and 2 can run in parallel.

SLIDE 62

Avoid Synchronization / Increase Locality

HyPer, in fact, breaks up hash table builds into two phases:
  Phase 1: process the input morsel-wise; threads apply the filter (σ) first and store qualifying tuples in a NUMA-local storage area.
  Phase 2: scan the NUMA-local storage areas and insert pointers to the tuples into the global hash table.
Advantages?

[Figure: per-core storage areas; in phase 2 each core scans its own area and inserts pointers into the global hash table HT(T).]
SLIDE 63

Morsel-Driven Parallelism: NUMA Awareness

NUMA awareness:
  Annotate morsels with the NUMA region where their data resides.
  When dispatching to cores, favor NUMA-local assignments.

Elasticity:
  There's only one task pool for the entire system.
  Multiple queries share the same worker threads.
  → Parallelism across and within queries.

[Figure: lock-free dispatcher data structures — a list of pending pipeline jobs (possibly belonging to different queries) and per-socket lists of morsels; dispatch(core) returns a (pipeline job, morsel) pair, preferring morsels located on the requesting core's socket. Example NUMA multi-core server with 4 sockets and 32 cores.]

SLIDE 64

Joins Over Data Streams:

[Figure: sliding window wR over stream R and sliding window wS over stream S, joined by predicate p.]

Task: Find all (r, s) ∈ wR × wS that satisfy p(r, s).

SLIDE 65

Implementation [Kang et al., ICDE 2003]

[Figure: each arriving tuple is probed against the opposite window.]

For every new tuple (say, r ∈ R): 1. scan the window of S for matches, 2. insert the new tuple into the window of R, 3. invalidate old (expired) tuples.
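A minimal sketch of this three-step procedure for one side (tuples arriving on R); the data structures, the fixed-size window, and the equality predicate are illustrative assumptions rather than the paper's implementation:

```c
#include <stddef.h>

struct stream_tuple { long key; long ts; };

struct window {
    struct stream_tuple buf[1024];
    size_t n;
};

/* Called for every new tuple r arriving on stream R
 * (the symmetric code runs for tuples arriving on S). */
static void process_r(struct stream_tuple r,
                      struct window *win_r, struct window *win_s,
                      long window_len,
                      void (*emit)(struct stream_tuple, struct stream_tuple))
{
    /* 1. scan the opposite window for join partners */
    for (size_t i = 0; i < win_s->n; i++)
        if (win_s->buf[i].key == r.key)          /* predicate p(r, s) */
            emit(r, win_s->buf[i]);

    /* 2. insert the new tuple into R's window */
    if (win_r->n < sizeof win_r->buf / sizeof win_r->buf[0])
        win_r->buf[win_r->n++] = r;

    /* 3. invalidate expired tuples (older than the window length) */
    size_t keep = 0;
    for (size_t i = 0; i < win_r->n; i++)
        if (r.ts - win_r->buf[i].ts <= window_len)
            win_r->buf[keep++] = win_r->buf[i];
    win_r->n = keep;
}
```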

NUMA-Aware Execution?

SLIDE 66

CellJoin [Gedik et al., VLDBJ 2009]

[Figure: one input stream is replicated to all cores, the other is partitioned across them; problem spots 1–3 are marked.]

1 bandwidth bottlenecks
2 long-distance communication
3 centralized coordination and memory

→ Parallel, but not NUMA-aware.

SLIDE 67

SLIDE 68

Handshake Join Idea

Handshake Join:

[Figure: the windows for R and S pass by each other; comparisons happen where tuples of the two streams meet.]

Streams flow by in opposite directions.
Compare tuples when they meet.

SLIDE 69

Handshake Join on Many Cores

Data flow representation → parallelization:

[Figure: the two windows are segmented across cores 1–5; tuples of R and S flow through neighboring cores in opposite directions.]

1 No bandwidth bottleneck.
2 Communication/synchronization stays local.

SLIDE 70

Synchronization

Coordination can now be done autonomously:

[Figure: cores 1–5 pass R and S tuples only to their direct neighbors.]

3 No more centralized coordination.

Autonomous load balancing.
Lock-free message queues between neighbors.
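Such a neighbor-to-neighbor queue can be a single-producer/single-consumer ring buffer, which needs no locks at all. A minimal C11 sketch; the queue size and element type are assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSIZE 1024                       /* must be a power of two */

/* Single-producer / single-consumer ring buffer: the upstream core
 * enqueues tuples, its neighbor dequeues them. head is written only
 * by the producer, tail only by the consumer.                       */
struct spsc_queue {
    uint64_t       items[QSIZE];
    _Atomic size_t head;                 /* next slot to write */
    _Atomic size_t tail;                 /* next slot to read  */
};

static bool enqueue(struct spsc_queue *q, uint64_t item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QSIZE)
        return false;                    /* queue full */
    q->items[head & (QSIZE - 1)] = item;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

static bool dequeue(struct spsc_queue *q, uint64_t *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false;                    /* queue empty */
    *item = q->items[tail & (QSIZE - 1)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```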

SLIDE 71

Example: AMD “Magny Cours” (48 cores)

[Figure: topology of the 48-core AMD "Magny Cours" system, with positions 1–7 marked along the chain of cores.]

SLIDE 72

Experiments (AMD Magny Cours, 2.2 GHz)

[Chart: throughput per stream (tuples/sec, 1000–4000) vs. number of processing cores n (4–44) for window sizes of 10 min and 15 min, compared against CellJoin.]

SLIDE 73

Beyond 48 Cores. . . (FPGA-based simulation)

[Chart: achievable clock frequency (MHz, 50–250) vs. number of join cores n (50–200) in an FPGA-based simulation; 96 % chip utilization.]

SLIDE 74

Highly Concurrent Workloads

Databases are often faced with highly concurrent workloads.

Good news: they can exploit the parallelism offered by hardware (increasing number of cores).
Bad news: this increases the relevance of synchronization mechanisms.

Two levels of synchronization in databases:
  Synchronize on user data to guarantee transaction semantics; database terminology: locks.
  Synchronize on database-internal data structures; short-duration locks, called latches in databases.

We'll now look at the latter, even when we say "locks."

SLIDE 75

Lock (Latch) Implementation

There are two strategies to implement locking:

Blocking (operating system service)
  De-schedule the waiting thread until the lock becomes free.
  Cost: two context switches (one to sleep, one to wake up) → ≈ 12–20 µs.

Spinning (can be done in user space)
  The waiting thread repeatedly polls the lock until it becomes free.
  Cost: two cache miss penalties (if implemented well) → ≈ 150 ns.
  The thread burns CPU cycles while spinning.

SLIDE 76

Implementation of Spinlocks

Implementation of a spinlock?
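One possible answer, sketched with C11 atomics: a test-and-test-and-set spinlock. The slide only poses the question; this particular variant is an illustrative assumption:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;   /* initialize to false */

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* try to grab the lock: atomically set it to true */
        if (!atomic_exchange_explicit(&l->locked, true,
                                      memory_order_acquire))
            return;                       /* previous value was false */
        /* test-and-test-and-set: spin on a plain read until the lock
         * looks free, so waiting cores spin in their own caches       */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            /* spin */ ;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

While the lock is held, waiting cores spin on their locally cached copy of locked; coherence traffic (the two cache misses mentioned on the previous slide) arises only when the lock is released and re-acquired.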

SLIDE 77

Thread Synchronization

[Timeline diagrams: with blocking, thread 2 is de-scheduled while thread 1 holds the lock and must be woken up after the release; with spinning, thread 2 spins while the lock is held and continues after only a short delay.]

SLIDE 78

Experiments: Locking Performance

Sun Niagara II (64 hardware contexts):

[Chart: throughput (ktps, 30–150) vs. number of threads (32–192) for blocking, spinning, and ideal scaling; the 100 % load point is marked.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 79

Spinning Under High Load

Under high load, spinning can cause problems:

[Timeline diagram: thread 1 holds the lock and gets preempted while thread 2 spins.]

  More threads than hardware contexts → the operating system preempts running tasks.
  Working and spinning threads all appear busy to the OS.
  The working thread likely had the longest time share already → it gets de-scheduled by the OS.
  Long delay before the working thread gets re-scheduled.
  By the time the working thread gets re-scheduled (and can now make progress), the waiting thread likely gets de-scheduled, too.

SLIDE 80

Spinning

[Chart: machine utilization (%, 20–100) vs. client threads (15–191), broken down into work, contention, and priority inversion.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 81

The Right Tool for the Right Purpose

The properties of spinning and blocking suggest their use for different purposes:

Spinning features quick lock hand-offs.
→ Use spinning to coordinate access to a shared data structure (contention).

Blocking reduces system load (→ scheduling).
→ Use blocking at longer time scales.
→ Block when the system load increases, to reduce scheduling overhead.

Idea: Monitor system load (using a separate thread) and control spinning/blocking behavior off the critical code path.

SLIDE 82

Load Controller

The load controller periodically
  determines the current load situation from the OS;
  if the system gets overloaded, "invites" threads to block with the help of a sleep slot buffer
  (size of the sleep slot buffer = number of threads that should block; threads register in the sleep slot buffer before going to sleep);
  when the load decreases, wakes up sleeping threads again.
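A rough sketch of such a controller and a sleep slot buffer, reduced to a counter plus a condition variable. Everything here (names, the load metric via getloadavg(), the 10 ms period) is an assumption for illustration, not the mechanism of Johnson et al.:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t slot_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot_cond = PTHREAD_COND_INITIALIZER;
static int target_sleepers = 0;             /* set by the controller     */
static int cur_sleepers    = 0;             /* threads currently blocked */

/* Called by a thread that failed to acquire a contended latch:
 * block only if the controller asked for more sleepers, else keep spinning. */
static int try_sleep_slot(void)
{
    int slept = 0;
    pthread_mutex_lock(&slot_mtx);
    if (cur_sleepers < target_sleepers) {
        cur_sleepers++;                     /* register in the buffer    */
        pthread_cond_wait(&slot_cond, &slot_mtx);
        cur_sleepers--;
        slept = 1;
    }
    pthread_mutex_unlock(&slot_mtx);
    return slept;
}

/* The controller runs in its own thread, off the critical code path. */
static void *load_controller(void *arg)
{
    (void)arg;
    int ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);
    for (;;) {
        double load[1] = { 0.0 };
        getloadavg(load, 1);                /* current load from the OS  */

        pthread_mutex_lock(&slot_mtx);
        if (load[0] > ncpus) {              /* overloaded: ask threads to block */
            target_sleepers = (int)load[0] - ncpus;
        } else {                            /* load dropped: wake sleepers */
            target_sleepers = 0;
            pthread_cond_broadcast(&slot_cond);
        }
        pthread_mutex_unlock(&slot_mtx);
        usleep(10000);                      /* controller period: ~10 ms */
    }
    return NULL;
}
```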

SLIDE 83

Lock Handling

A thread that wants to acquire a lock
  checks the regular spin lock;
  if the lock is already taken, it tries to enter the sleep slot buffer and blocks (otherwise it spins);
  the load controller will wake up the thread in time.

SLIDE 84

Controller Overhead

[Chart: throughput (ktps, 70–130) vs. controller update delay (100 µs–100,000 µs, log scale) for 98 %, 110 %, and 150 % load.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 85

Performance Under Load

[Charts: normalized throughput (20–80) vs. number of threads (1–127) for the Raytrace, TM-1, and TPC-C workloads, comparing pthread mutexes, TP-MCS locks, and the load controller (LC).]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.
