SLIDE 1

Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2019/20

SLIDE 2

Part V Execution on Multiple Cores

SLIDE 3

Example: Star Joins

Task: run parallel instances of the query (↗ introduction):

SELECT SUM(lo_revenue)
FROM   part, lineorder
WHERE  p_partkey = lo_partkey
  AND  p_category <= 5

(part is the dimension table, lineorder the fact table.)

[Plan sketch: a join of σ(part) and lineorder.]

To implement the join, use either a hash join or an index nested loops join.

SLIDE 4

Execution on “Independent” CPU Cores

Co-run independent instances on different CPU cores.

[Bar chart: performance degradation (0 %–60 %) for HJ alone, HJ + HJ, HJ + INLJ, INLJ alone, INLJ + HJ, and INLJ + INLJ.]

Concurrent queries may seriously affect each other's performance.

SLIDE 5

Shared Caches

In Intel Core 2 Quad systems, two cores share an L2 cache:

[Diagram: four CPU cores, each with a private L1 cache; each pair of cores shares an L2 cache that connects to main memory.]

What we saw was cache pollution.
→ How can we avoid this cache pollution?

SLIDE 6

Cache Sensitivity

Dependence on cache sizes for some TPC-H queries: some queries are more sensitive to cache sizes than others.

cache sensitive: hash joins
cache insensitive: index nested loops joins; hash joins with a very small or very large hash table

SLIDE 7

Locality Strength

This behavior is related to the locality strength of execution plans:

Strong locality: small data structure, reused very frequently (e.g., a small hash table).
Moderate locality: frequently reused data structure, roughly of cache size (e.g., a moderate-sized hash table).
Weak locality: data not reused frequently, or data structure ≫ cache size (e.g., a large hash table; index lookups).

SLIDE 8

Execution Plan Characteristics

Locality affects how caches are used:

                          strong    moderate    weak
  amount of cache used    small     large       large
  amount of cache needed  small     large       small

Plans with weak locality have the most severe impact on co-running queries: they occupy much cache that they do not actually need (cache pollution).

Impact of a co-runner (columns) on a query (rows):

              strong     moderate    weak
  strong      low        moderate    high
  moderate    moderate   high        high
  weak        low        low         low

SLIDE 9

Experiments: Locality Strength

[Chart: performance degradation (0 %–60 %) as a function of hash table size (0.4–18.6 MB) for the pairings Index Join to Index Join, Index Join to Hash Join, Hash Join to Index Join, Hash Join to Hash Join, and Index Join to Index Join (bitmap scan).]

Source: Lee et al. MCC-DB: Minimizing Cache Conflicts in Multi-core Processors for Databases. VLDB 2009.

SLIDE 10

Locality-Aware Scheduling

An optimizer could use knowledge about localities to schedule queries.

Estimate locality during query analysis:
  Index nested loops join → weak locality
  Hash join: hash table ≪ cache size → strong locality
             hash table ≈ cache size → moderate locality
             hash table ≫ cache size → weak locality

Co-schedule queries to minimize (the impact of) cache pollution. Which queries should be co-scheduled, which ones not?
  Only run weak-locality queries next to weak-locality queries.
  → They cause high pollution, but are not affected by pollution.
  Try to co-schedule queries with small hash tables.
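A minimal sketch, in C, of how an optimizer might encode this classification and pairing rule. The threshold constants (¼ and 4× the cache size) and all function names are illustrative assumptions, not part of MCC-DB:

```c
#include <stddef.h>

enum locality { STRONG, MODERATE, WEAK };

/* Classify a plan's locality from the optimizer's size estimates.
 * The 1/4 and 4x cut-offs are made-up values for illustration.     */
enum locality classify_plan(int is_index_nested_loops,
                            size_t est_hash_table_bytes,
                            size_t cache_bytes)
{
    if (is_index_nested_loops)
        return WEAK;                          /* index lookups: weak locality */
    if (est_hash_table_bytes < cache_bytes / 4)
        return STRONG;                        /* hash table << cache size     */
    if (est_hash_table_bytes > 4 * cache_bytes)
        return WEAK;                          /* hash table >> cache size     */
    return MODERATE;                          /* hash table ~ cache size      */
}

/* Pairing rule from the slide: weak-locality plans may only share a
 * cache with other weak-locality plans; otherwise prefer partners
 * with small (strong-locality) hash tables.                         */
int may_corun(enum locality a, enum locality b)
{
    if (a == WEAK || b == WEAK)
        return a == WEAK && b == WEAK;
    return a == STRONG || b == STRONG;
}
```

With the impact matrix from the previous slide in mind, may_corun() rejects pairs in which only one side has weak locality and otherwise favors pairs that include a small hash table.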

SLIDE 11

Experiments: Locality-Aware Scheduling

PostgreSQL; 4 queries (different p_category values); for each query: 2 × hash join plan, 2 × INLJ plan; impact reported for the hash joins:

[Bar chart: performance impact (0 %–50 %) per hash table size (0.78 MB, 2.26 MB, 4.10 MB, 8.92 MB), comparing default scheduling with locality-aware scheduling.]

Source: Lee et al. VLDB 2009.

SLIDE 12

Cache Pollution

Weak-locality plans cause cache pollution because they use much cache space even though they do not strictly need it.

By partitioning the cache we could reduce pollution with little impact on the weak-locality plan:

[Diagram: a shared cache split into a region for the moderate-locality plan and a region for the weak-locality plan.]

But: cache allocation is controlled by the hardware.

SLIDE 13

Cache Organization

Remember how caches are organized: the physical address of a memory block determines the cache set into which it can be loaded.

[Figure: byte address split into the block address (tag, set index) and the offset.]

Thus, we can influence hardware behavior by the choice of physical memory allocation.

SLIDE 14

Page Coloring

The address ↔ cache set relationship inspired the idea of page colors. Each memory page is assigned a color.⁵ Pages that map to the same cache sets get the same color.

[Figure: memory pages mapped to cache sets; pages of the same color map to the same region of the cache.]

How many colors are there in a typical system?
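The number of colors follows from the cache geometry: colors = (number of sets × line size) / page size. A small C sketch with assumed parameters (4 MB, 16-way, 64-byte lines, 4 kB pages), which yields 64 colors; the MCC-DB setup on a later slide uses 32, i.e., a different configuration:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry (example values only). */
#define CACHE_SIZE    (4u * 1024 * 1024)   /* 4 MB L2                 */
#define LINE_SIZE     64u                  /* 64-byte cache lines     */
#define ASSOCIATIVITY 16u                  /* 16-way set associative  */
#define PAGE_SIZE     4096u                /* 4 kB pages              */

#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * ASSOCIATIVITY))
#define NUM_COLORS ((NUM_SETS * LINE_SIZE) / PAGE_SIZE)

/* The color of a page: which group of cache sets its lines map to.
 * Consecutive physical pages cycle through all colors.              */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("%u sets, %u colors, %u kB of cache per color\n",
           NUM_SETS, NUM_COLORS, CACHE_SIZE / NUM_COLORS / 1024);
    printf("color of the page at physical address 0x12345000: %u\n",
           page_color(0x12345000ull));
    return 0;
}
```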

⁵ Memory is organized in pages. A typical page size is 4 kB.

SLIDE 15

Page Coloring

By using memory of only certain colors, we can effectively restrict the cache region that a query plan uses.

Note that applications (usually) have no control over physical memory. Memory allocation and the virtual ↔ physical mapping are handled by the operating system. We need OS support to achieve the desired cache partitioning.

SLIDE 16

MCC-DB: Kernel-Assisted Cache Sharing

MCC-DB ("Minimizing Cache Conflicts"):

Modified Linux 2.6.20 kernel
  Support for 32 page colors (4 MB L2 cache: 128 kB per color)
  Color specification file for each process (may be modified by the application at any time)

Modified instance of PostgreSQL
  Four colors for the regular buffer pool
  Implications on buffer pool size (16 GB main memory)?
  For strong- and moderate-locality queries, allocate colors as needed (i.e., as estimated by the query optimizer)

SLIDE 17

Experiments

Moderate-locality hash join and weak-locality co-runner (INLJ):

[Chart: L2 cache miss rate (0 %–50 %) as the number of colors given to the weak-locality plan shrinks (32, 24, 16, 8, 4); curves for the weak-locality INLJ and the moderate-locality HJ, with single-threaded execution as baselines.]

Source: Lee et al. VLDB 2009.

SLIDE 18

Experiments

Moderate-locality hash join and weak-locality co-runner (INLJ):

[Chart: execution time (10–70 sec) as the number of colors given to the weak-locality plan shrinks (32, 24, 16, 8, 4); curves for the weak-locality INLJ and the moderate-locality HJ, with single-threaded execution as baselines.]

Source: Lee et al. VLDB 2009.

SLIDE 19

Experiments: MCC-DB

PostgreSQL; 4 queries (different p_category values); for each query: 2 × hash join plan, 2 × INLJ plan; impact reported for the hash joins:

[Bar chart: performance impact (0 %–50 %) per hash table size (0.78 MB, 2.26 MB, 4.10 MB, 8.92 MB), comparing default scheduling, locality-aware scheduling, and page coloring.]

Source: Lee et al. VLDB 2009.

SLIDE 20

Building a Shared-Memory Multiprocessor

What the programmer likes to think of...

[Diagram: four CPU cores directly attached to a shared main memory.]

Scalability? Moore's Law?

SLIDE 21

Centralized Shared-Memory Multiprocessor

Caches help mitigate the bandwidth bottleneck(s).

[Diagram: four CPU cores, each with a private cache, behind a shared cache that connects to shared main memory.]

A shared bus connects CPU cores and memory.
→ The "shared bus" may or may not be shared physically.

The Intel Core architecture, e.g., implemented this design.

SLIDE 22

Centralized Shared-Memory Multiprocessor

The shared bus design with caches makes sense:

+ symmetric design; uniform access time for every memory item from every processor
+ private data gets cached locally → behavior identical to that of a uniprocessor
? shared data will be replicated to private caches
  → Okay for parallel reads.
  → But what about writes to the replicated data?
  → In fact, we'll want to use memory as a mechanism to communicate between processors.

The approach does have limitations, too:
– For large core counts, the shared bus may still be a (bandwidth) bottleneck.

SLIDE 23

Caches and Shared Memory

Caching/replicating shared data can cause problems:

[Diagram: two CPUs with private caches over a shared main memory holding x = 4. Both CPUs read x and cache the value 4; one CPU then writes x := 42 into its cache, but the other CPU's subsequent read still returns the stale value 4 from its own cache.]

Challenges:
  Need well-defined semantics for such scenarios.
  Must efficiently implement that semantics.

SLIDE 24

Cache Coherence

The desired property (semantics) is cache coherence. Most importantly:⁶

Writes to the same location are serialized; two writes to the same location (by any two processors) are seen in the same order by all processors.

Note: We did not specify which order will be seen by the processors. → Why?

⁶ We also demand that a read by processor P returns P's most recent write, provided that no other processor has written to the same location in the meantime. Also, every write must become visible to other processors after "some time."

SLIDE 25

Cache Coherence Protocol

Multiprocessor (or multicore) systems maintain coherence through a cache coherence protocol.

Idea: Know which cache/memory holds the current value of an item. Other replicas might be stale.

Two alternatives:

1 Snooping-Based Coherence

→ All processors communicate to agree on item states.

2 Directory-Based Coherence

→ A centralized directory holds information about state/whereabouts of data items.

SLIDE 26

Snooping-Based Cache Coherence

Rationale:
  All processors have access to a shared bus.
  They can "snoop" on the bus to track other processors' activities.
  Use this to track the sharing state of each cached item.

[Figure: cache block layout with (sharing) state, tag, and data.]

Metadata for each cache block:
  (sharing) state
  block identification (tag)

Ignoring multiprocessors for a moment, which "state" information might make sense to keep?

SLIDE 27

Strategy 1: Write Update Protocol

Idea: On every write, propagate the write to every copy.
→ Use the bus to broadcast writes.⁷

Pros/cons of this strategy?

⁷ The protocol is thus also called a write broadcast protocol.

SLIDE 28

Strategy 2: Write Invalidate Protocol

Idea: Before writing an item, invalidate all other copies.

Activity      Bus                Cache A          Cache B    Memory
(initially)                                                  x = 4
A reads x     cache miss for x   x = 4                       x = 4
B reads x     cache miss for x   x = 4            x = 4      x = 4
A reads x     – (cache hit)      x = 4            x = 4      x = 4
B writes x    invalidate x       (invalidated)    x = 42     x = 4⁸
A reads x     cache miss for x   x = 42           x = 42     x = 42

→ Caches will re-fetch invalidated items automatically.
Since the bus is shared, other caches may answer "cache miss" messages (necessary for write-back caches).

⁸ With write-through caches, memory will be updated immediately.

SLIDE 29

Write Invalidate—Realization

Realization: To invalidate, broadcast the address on the bus. All processors continuously snoop on the bus:

invalidate message for an address held in own cache
  → Invalidate own copy.

miss message for an address held in own cache
  → Reply with own copy (for write-back caches).
  → Memory will see this and abort its own read.

What if two processors try to write at the same time?

SLIDE 30

Write Invalidate—Tracking Sharing States

Through snooping, we can monitor all bus activities by all processors. → Track the sharing state.

Idea:
  Sending an invalidate will make the local copy the only valid one.
  → Mark the local cache line as modified (≈ exclusive).
  If a local cache line is already modified, writes need not be announced on the bus (no invalidate message).
  Upon a read request by another processor:
  → If the local cache line is in state modified, answer the request by sending the local version.
  → Change the local cache state to shared.
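A compact C sketch of these per-cache-line transitions (the MSI state machine shown on the following slides); the event names and the bus helper functions are assumptions made for illustration:

```c
#include <stdio.h>

/* MSI sharing states, tracked per cache line. */
enum msi_state { INVALID, SHARED, MODIFIED };

/* Events as seen by one local cache: requests from its own CPU and
 * messages snooped from the shared bus.                             */
enum msi_event {
    CPU_READ, CPU_WRITE,
    BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE
};

/* Stand-ins for real bus/memory actions. */
static void put_read_miss_on_bus(void)  { puts("bus: read miss");   }
static void put_write_miss_on_bus(void) { puts("bus: write miss");  }
static void put_invalidate_on_bus(void) { puts("bus: invalidate");  }
static void write_back_block(void)      { puts("write back block"); }

static enum msi_state msi_next(enum msi_state s, enum msi_event e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  { put_read_miss_on_bus();  return SHARED;   }
        if (e == CPU_WRITE) { put_write_miss_on_bus(); return MODIFIED; }
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE) { put_invalidate_on_bus(); return MODIFIED; }
        if (e == BUS_WRITE_MISS || e == BUS_INVALIDATE) return INVALID;
        return SHARED;                /* reads keep the line shared      */
    case MODIFIED:
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED;  }
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return MODIFIED;              /* local hits cause no bus traffic */
    }
    return s;
}

int main(void)
{
    enum msi_state s = INVALID;
    s = msi_next(s, CPU_READ);        /* invalid  -> shared              */
    s = msi_next(s, CPU_WRITE);       /* shared   -> modified            */
    s = msi_next(s, BUS_READ_MISS);   /* modified -> shared + write back */
    return s == SHARED ? 0 : 1;
}
```

The sketch omits evictions caused by conflicting blocks; the diagrams on the next slides include those "CPU read/write miss" transitions as well.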

SLIDE 31

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram: states invalid, shared ("clean"), and modified ("dirty"); transitions labeled with CPU events, e.g., "CPU read miss: put read miss on bus", "CPU write miss: put write miss on bus", and misses from modified that additionally write back the data.]

Uniprocessor view → track "dirty".

SLIDE 32

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram as on the previous slide, extended with the transition "CPU write hit: put invalidate on bus" (shared → modified).]

Uniprocessor → track "dirty"; multiprocessor → also send invalidate.

SLIDE 33

Write Invalidate—State Machine

Local caches track sharing states using a state machine.

[State diagram as on the previous slides, now also reacting to snooped bus events: an invalidate or write miss for a held block moves it to invalid (writing the block back if it was modified); a read miss for a modified block writes the data back and moves it to shared.]

Uniprocessor → track "dirty"; multiprocessor (cont.) → also react to bus events.

SLIDE 34

Write Invalidate—Notes

Notes:

Because of the three states modified, shared, and invalid, the protocol on the previous slides is also called the MSI protocol.

The write invalidate protocol ensures that any valid cache block is either in the shared state in one or more caches or in the modified state in exactly one cache.

(Any transition to the modified state invalidates all other copies of the block; whenever another cache fetches a copy of the block, the modified state is left.)

The MSI protocol also ensures that every shared item has also been written back to memory.

SLIDE 35

MSI Protocol—Extensions

Actual systems often use extensions to the MSI protocol, e.g.:

MESI ("E" for exclusive)
  Distinguishes between exclusive (but clean) and modified (which implies that the copy is exclusive).
  Optimizes the (common) case where an item is first read (→ exclusive) and then modified (→ modified).

MESIF ("F" for forward)
  In M(E)SI, if shared items are served by caches (not only by memory), all caches might answer miss requests.
  MESIF extends the protocol so that at most one shared copy of an item is marked as forward. Only this cache will respond to misses on the bus.
  The Intel i7 employs the MESIF protocol.

SLIDE 36

MSI Protocol—Extensions

MOESI ("O" for owned)
  Owned marks an item that might be outdated in memory; the owning cache is "responsible" for the item.
  The owner must respond to data requests (since main memory might be outdated).
  MOESI allows moving dirty data around between caches.
  The AMD Opteron uses the MOESI protocol.
  MOESI avoids the need to write every shared cache block back to memory.

SLIDE 37

Limitations of a Shared Bus

Limitations of a shared bus:
  Large core counts → high bandwidth demand.
  Shared buses cannot satisfy the bandwidth demands of modern multiprocessor systems.

Therefore:
  Distribute the memory.
  Communicate through an interconnection network.

Consequence: non-uniform memory access (NUMA) characteristics.

SLIDE 38

Bandwidth Demand

E.g., Intel Xeon E7-8880 v3:
  2.3 GHz clock rate
  18 cores per chip (36 threads)
  up to 8 processors per system

Back-of-the-envelope calculation: 1 byte per cycle per core
→ 2.3 × 10⁹ cycles/s × 18 cores × 8 processors × 1 B ≈ 331 GB/s.
Data-intensive applications might demand much more!

Shared memory bus? Modern bus standards deliver at most a few tens of GB/s, and switching very high bandwidths is a challenge.

SLIDE 39

Distributed Shared Memory

Idea: Distribute the memory → attach it to individual compute nodes.

[Diagram: four CPUs, each with its own local memory, connected by an interconnect.]

SLIDE 40

Example: 8-Way Intel Nehalem-EX

[Diagram: eight CPUs, each with local memory (and I/O attached to four of them), connected by point-to-point links.]

Interconnect: "Intel QuickPath Interconnect (QPI)"⁹

Memory may be local, one hop away, or two hops away.
→ Non-uniform memory access (NUMA)

⁹ The AMD counterpart is "HyperTransport".

SLIDE 41

Distributed Memory and Snooping

Idea: Extend “snooping” to distributed memory. Broadcast coherence traffic, send data point-to-point. Problem solved?

SLIDE 42

Snooping-Based Cache Coherency: Scalability

[Charts: miss rate (split into coherence and capacity misses) as a function of processor count (1, 2, 4, 8, 16) for the scientific applications FFT, Ocean, LU, and Barnes.]

Example: scientific applications (↗ Hennessy & Patterson, Sect. I.5).
→ The AMD Opteron is a system that still uses this approach.

SLIDE 43

Directory-Based Cache Coherence

To avoid an all-broadcast coherence protocol:
  Use a directory to keep track of which item is replicated where.
  Direct coherence messages only to those nodes that actually need them.

Directory:
  Either keep a global directory (→ scalability?).
  Or define a home node for each memory address.
  → The home node holds the directory for that item.
  → Typically: distribute the directory along with the memory.

The protocol now involves the directory/-ies (at the items' home nodes) and the individual caches (local to the processors). Parties communicate point-to-point (no broadcasts).

SLIDE 44

Directory-Based Cache Coherence

Messages sent by individual nodes:

Message type      Source           Destination      Contents   Function of this message
Read miss         Local cache      Home directory   P, A       Node P has a read miss at address A; request data and make P a read sharer.
Write miss        Local cache      Home directory   P, A       Node P has a write miss at address A; request data and make P the exclusive owner.
Invalidate        Local cache      Home directory   A          Request to send invalidates to all remote caches that are caching the block at address A.
Invalidate        Home directory   Remote cache     A          Invalidate a shared copy of data at address A.
Fetch             Home directory   Remote cache     A          Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared.
Fetch/invalidate  Home directory   Remote cache     A          Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  Home directory   Local cache      D          Return a data value from the home memory.
Data write-back   Remote cache     Home directory   A, D       Write back a data value for address A.

↗ Hennessy & Patterson, Computer Architecture, 5th edition, page 381.

SLIDE 45

Directory-Based Coherence—State Machine

Individual caches use a state machine similar to the snooping-based MSI state machine shown earlier (slides 31–33).

[State diagram: states invalid, shared, and modified; CPU read/write misses send read/write miss messages to the home directory, a write hit on a shared line sends an invalidate message; messages from the home directory (invalidate, fetch, write miss) invalidate the block or force a write-back.]

SLIDE 46

Directory-Based Coherence—State Machine

The directory has its own state machine.

[State diagram: directory states uncached, shared, and exclusive; a read miss replies with the data value and adds P to the sharers, a write miss replies with the data, invalidates other sharers if necessary, and sets Sharers = {P}, a write-back empties the sharer set; for an exclusive block, a read miss first fetches the block from the owner and then Sharers ∪= {P}.]

SLIDE 47

Cache Coherence Cost

Experiment: several threads randomly increment elements of an integer array (Zipfian probability distribution, no synchronization¹⁰).

[Bar chart: nanoseconds per iteration (up to ~100 ns); roughly 6.6 ns for 1 thread, 13.2 ns for 2 threads on the same core, 19.6 ns for 2 threads on the same chip, and 80.7 ns for 2 threads off-chip, plus results for 1–8 threads on the same chip. Intel Nehalem EX; 1.87 GHz; 2 CPUs, 8 cores/CPU.]

¹⁰ In general, this will yield incorrect counter values.

SLIDE 48

Cache Coherence Cost

Two types of coherence misses:

true sharing miss
  → Data is actually shared among processors.
  → Often-used mechanism to communicate between threads.
  → These misses are unavoidable.

false sharing miss
  → Processors use different data items, but the items reside in the same cache line.
  → Items get invalidated/migrated, even though no data is actually shared.

How can false sharing misses be avoided?
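The usual remedy is to pad or align per-thread data so that items written by different threads never share a cache line. A minimal sketch, assuming a 64-byte line size; the counter layout and sizes are illustrative:

```c
#include <stdalign.h>
#include <pthread.h>

#define CACHE_LINE 64                   /* assumed cache line size */
#define NTHREADS   4

/* Naive layout: all counters packed into one cache line.
 * Every increment then invalidates the line in the other cores' caches. */
struct counters_packed {
    long c[NTHREADS];
};

/* Padded layout: one cache line per counter -> no false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].value++;           /* touches only this thread's line */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```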

SLIDE 49

NUMA—Non-Uniform Memory Access

Distribution makes memory access locality-sensitive. → Non-Uniform Memory Access (NUMA)

[Diagram: four sockets (0–3), each with local memory; numbered access paths 1–4 of increasing distance.]

path    bandwidth    latency
 1      24.7 GB/s    150 ns
 2      10.9 GB/s    185 ns
 3      10.9 GB/s    230 ns
 3/4¹¹   5.3 GB/s    235 ns

↗ Li et al. NUMA-Aware Algorithms: The Case of Data Shuffling. CIDR 2013.

¹¹ Path 3 with cross traffic along path 4.

SLIDE 50

Sorting and NUMA

[Diagram: the input relation is split across NUMA regions; each thread sorts its part locally ("local sort"), and the sorted runs are then merged.]

SLIDE 51

Resulting Throughput

[Chart: throughput [M tuples/sec] (50–300) vs. number of threads (1–64); scaling flattens out at a memory bottleneck.]

SLIDE 52

NUMA and Bandwidth

Problem: Merging is bandwidth-bound. → Merge multiple runs (from NUMA regions) at once

(Two-way merging would be more CPU-efficient because of SIMD.)

→ Might need more instructions, but brings bandwidth and compute into balance.

[Diagram: one thread merges runs from NUMA regions 0–3 through small, cache-resident buffers.]
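A minimal sketch of such a multi-way merge: a plain linear scan over the k run heads (one run per NUMA region). A real implementation would add the cache-resident buffers from the figure and SIMD-friendly merge kernels; the function and its signature are illustrative assumptions:

```c
#include <stdlib.h>

/* Merge k sorted runs into out[]; runs[i] points to run i and len[i]
 * is its length. A linear scan over the k run heads is fine here,
 * because k stays small (one run per NUMA region).                  */
static void multiway_merge(const int **runs, const size_t *len,
                           size_t k, int *out)
{
    size_t *pos = calloc(k, sizeof(size_t));    /* read position per run */
    for (;;) {
        size_t best = k;                        /* run with smallest head */
        for (size_t i = 0; i < k; i++)
            if (pos[i] < len[i] &&
                (best == k || runs[i][pos[i]] < runs[best][pos[best]]))
                best = i;
        if (best == k)                          /* all runs exhausted */
            break;
        *out++ = runs[best][pos[best]++];
    }
    free(pos);
}
```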

SLIDE 53

Throughput With Multi-Way Merging

[Chart: throughput [M tuples/sec] (50–300) vs. number of threads (1–64), now with multi-way merging.]

SLIDE 54

NUMA Effects in Detail

Bandwidth: single links have lower bandwidth than the memory controllers.

[Diagrams: Intel Nehalem EX — four CPUs with 25.6 GB/s per local memory controller and 12.8 GB/s (bidirectional) per link; Intel Sandy Bridge EP — four CPUs with 51.2 GB/s per local memory controller and 16 GB/s (bidirectional) per link.]

SLIDE 55

Parallelize Database Workloads

To leverage the hardware potential, databases must use parallelism.

→ Inter-query parallelism?
  Requires a sufficient number of co-running queries.
  May work well for OLTP workloads (they tend to consist of many, many queries, each of which is very simple).
  Data analytics/OLAP workloads often don't fulfill this requirement.
  Won't help an individual query.

Therefore: intra-query parallelism is a must. It should still allow (a few) co-running queries.

SLIDE 56

Parallelization Strategies

Parallelization strategies for intra-query parallelism:

Pipeline parallelism?
Data partitioning / parallel operator implementations?

SLIDE 57

Data Partitioning / Parallel Operator Implementations

E.g., parallel hash joins (radix joins):

[Chart: time [billion cycles] (2–8) per thread id (1–8), broken down into the phases (1–5) of the radix join.]

↗ Balkesen et al. Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware. ICDE 2013.

SLIDE 58

Data Partitioning / Parallel Operator Implementations

E.g., TPC-H Query 10 on MonetDB:

[Profiler trace: TPC-H Query 10 at scale factor 10 on MonetDB; 672 MAL instructions executed at 62.1 % parallelism usage, with most time spent in algebra.join (16.49 sec), aggr.sum (12.18 sec), and language.dataflow (9.61 sec).]

↗ Mrunal Gawade. Multi-Core Parallelism in a Column-Store. PhD Thesis, Universiteit van Amsterdam. 2017.

SLIDE 59

Data Partitioning

Lessons learned:
→ Use fine-granular partitioning. The increased scheduling overhead seems bearable.
→ Assign partitions/tasks dynamically to processors. This makes load balancing easier.

E.g., morsel-driven parallelism (as implemented in HyPer):
  Break operator inputs into chunks of ≈ 100,000 tuples ("morsels").
  Fixed number of operator threads.
  Morsels are dispatched to threads dynamically (task queue).

↗ Leis et al. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. SIGMOD 2014.
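A minimal sketch of morsel-wise dispatch: a shared atomic cursor hands out fixed-size morsels to a fixed pool of worker threads. The 100,000-tuple morsel size follows the slide; everything else (names, structure) is an illustrative assumption, not HyPer code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MORSEL_SIZE 100000              /* ~100,000 tuples per morsel */
#define NWORKERS    8

struct dispatcher {
    _Atomic size_t next;                /* next unprocessed tuple index */
    size_t         total;               /* total number of input tuples */
};

/* Grab the next morsel; returns 0 when the input is exhausted. */
static int next_morsel(struct dispatcher *d, size_t *begin, size_t *end)
{
    size_t b = atomic_fetch_add(&d->next, MORSEL_SIZE);
    if (b >= d->total)
        return 0;
    *begin = b;
    *end   = b + MORSEL_SIZE < d->total ? b + MORSEL_SIZE : d->total;
    return 1;
}

static void process_morsel(size_t begin, size_t end)
{
    /* e.g., probe every tuple in [begin, end) against the hash tables */
    (void)begin; (void)end;
}

static struct dispatcher disp = { 0, 10000000 };

static void *worker(void *arg)
{
    (void)arg;
    size_t b, e;
    while (next_morsel(&disp, &b, &e))  /* threads pull work dynamically */
        process_morsel(b, e);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Because a slow thread simply grabs fewer morsels, load balancing falls out of the dynamic assignment by itself.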

SLIDE 60

Morsel-Driven Parallelism: Idea

[Figure: a dispatcher hands out morsels of R to threads, which probe into the hash tables HT(S) and HT(T) and store the results.]

Probe phase of the join query R ⋈ S ⋈ T: R is broken up into segments; threads grab segments and, for each tuple, probe into the hash tables HT(S) and HT(T).

SLIDE 61

Morsel-Driven Parallelism: Query Pipelines

HyPer compiles plan segments between pipeline breakers into machine code. E.g., three pipelines:

1 Scan and filter T, build HT(T).
2 Scan and filter S, build HT(S).
3 Scan and filter R, probe into both hash tables.

After compilation, each pipeline becomes one "operator".

[Figure: plan for R ⋈ S ⋈ T with the three pipelines and the global hash tables HT(S) and HT(T).]

Data dependencies:
→ Pipelines 1 and 2 must complete before pipeline 3 begins.
→ But pipelines 1 and 2 can run in parallel.

SLIDE 62

Avoid Synchronization / Increase Locality

HyPer, in fact, breaks up hash table builds into two phases:
  Phase 1: process the input morsel-wise; threads apply the filter (σ) first and store qualifying tuples in a NUMA-local storage area.
  Phase 2: scan the NUMA-local storage areas and insert pointers to the tuples into the global hash table.
Advantages?

[Figure: per-core storage areas; in phase 2 each core scans its own area and inserts pointers into the global hash table HT(T).]
SLIDE 63

Morsel-Driven Parallelism: NUMA Awareness

NUMA awareness:
  Annotate morsels with the NUMA region where their data resides.
  When dispatching to cores, favor NUMA-local assignments.

Elasticity:
  There's only one task pool for the entire system.
  Multiple queries share the same worker threads.
  → Parallelism across and within queries.

[Figure: lock-free dispatcher data structures — a list of pending pipeline jobs (possibly belonging to different queries) and per-socket lists of morsels; dispatch(core) returns a (pipeline job, morsel) pair, preferring morsels located on the requesting core's socket. Example NUMA multi-core server with 4 sockets and 32 cores.]

SLIDE 64

Joins Over Data Streams:

[Figure: sliding window wR over stream R and sliding window wS over stream S, joined by predicate p.]

Task: Find all (r, s) ∈ wR × wS that satisfy p(r, s).

SLIDE 65

Implementation [Kang et al., ICDE 2003]

[Figure: each arriving tuple is probed against the opposite window.]

For every new tuple (say, r ∈ R): 1. scan the window of S for matches, 2. insert the new tuple into the window of R, 3. invalidate old (expired) tuples.
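A minimal sketch of this three-step procedure for one side (tuples arriving on R); the data structures, the fixed-size window, and the equality predicate are illustrative assumptions rather than the paper's implementation:

```c
#include <stddef.h>

struct stream_tuple { long key; long ts; };

struct window {
    struct stream_tuple buf[1024];
    size_t n;
};

/* Called for every new tuple r arriving on stream R
 * (the symmetric code runs for tuples arriving on S). */
static void process_r(struct stream_tuple r,
                      struct window *win_r, struct window *win_s,
                      long window_len,
                      void (*emit)(struct stream_tuple, struct stream_tuple))
{
    /* 1. scan the opposite window for join partners */
    for (size_t i = 0; i < win_s->n; i++)
        if (win_s->buf[i].key == r.key)          /* predicate p(r, s) */
            emit(r, win_s->buf[i]);

    /* 2. insert the new tuple into R's window */
    if (win_r->n < sizeof win_r->buf / sizeof win_r->buf[0])
        win_r->buf[win_r->n++] = r;

    /* 3. invalidate expired tuples (older than the window length) */
    size_t keep = 0;
    for (size_t i = 0; i < win_r->n; i++)
        if (r.ts - win_r->buf[i].ts <= window_len)
            win_r->buf[keep++] = win_r->buf[i];
    win_r->n = keep;
}
```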

NUMA-Aware Execution?

SLIDE 66

CellJoin [Gedik et al., VLDBJ 2009]

[Figure: one input stream is replicated to all cores, the other is partitioned across them; problem spots 1–3 are marked.]

1 bandwidth bottlenecks
2 long-distance communication
3 centralized coordination and memory

→ Parallel, but not NUMA-aware.

SLIDE 67

SLIDE 68

Handshake Join Idea

Handshake Join:

[Figure: the windows for R and S pass by each other; comparisons happen where tuples of the two streams meet.]

Streams flow by in opposite directions.
Compare tuples when they meet.

SLIDE 69

Handshake Join on Many Cores

Data flow representation → parallelization:

[Figure: the two windows are segmented across cores 1–5; tuples of R and S flow through neighboring cores in opposite directions.]

1 No bandwidth bottleneck.
2 Communication/synchronization stays local.

SLIDE 70

Synchronization

Coordination can now be done autonomously:

[Figure: cores 1–5 pass R and S tuples only to their direct neighbors.]

3 No more centralized coordination.

Autonomous load balancing.
Lock-free message queues between neighbors.
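Such a neighbor-to-neighbor queue can be a single-producer/single-consumer ring buffer, which needs no locks at all. A minimal C11 sketch; the queue size and element type are assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSIZE 1024                       /* must be a power of two */

/* Single-producer / single-consumer ring buffer: the upstream core
 * enqueues tuples, its neighbor dequeues them. head is written only
 * by the producer, tail only by the consumer.                       */
struct spsc_queue {
    uint64_t       items[QSIZE];
    _Atomic size_t head;                 /* next slot to write */
    _Atomic size_t tail;                 /* next slot to read  */
};

static bool enqueue(struct spsc_queue *q, uint64_t item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QSIZE)
        return false;                    /* queue full */
    q->items[head & (QSIZE - 1)] = item;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

static bool dequeue(struct spsc_queue *q, uint64_t *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false;                    /* queue empty */
    *item = q->items[tail & (QSIZE - 1)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```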

SLIDE 71

Example: AMD “Magny Cours” (48 cores)

[Figure: topology of the 48-core AMD "Magny Cours" system, with positions 1–7 marked along the chain of cores.]

SLIDE 72

Experiments (AMD Magny Cours, 2.2 GHz)

[Chart: throughput per stream (tuples/sec, 1000–4000) vs. number of processing cores n (4–44) for window sizes of 10 min and 15 min, compared against CellJoin.]

SLIDE 73

Beyond 48 Cores. . . (FPGA-based simulation)

[Chart: achievable clock frequency (MHz, 50–250) vs. number of join cores n (50–200) in an FPGA-based simulation; 96 % chip utilization.]

SLIDE 74

Highly Concurrent Workloads

Databases are often faced with highly concurrent workloads.

Good news: they can exploit the parallelism offered by hardware (increasing number of cores).
Bad news: this increases the relevance of synchronization mechanisms.

Two levels of synchronization in databases:
  Synchronize on user data to guarantee transaction semantics; database terminology: locks.
  Synchronize on database-internal data structures; short-duration locks, called latches in databases.

We'll now look at the latter, even when we say "locks."

SLIDE 75

Lock (Latch) Implementation

There are two strategies to implement locking:

Blocking (operating system service)
  De-schedule the waiting thread until the lock becomes free.
  Cost: two context switches (one to sleep, one to wake up) → ≈ 12–20 µs.

Spinning (can be done in user space)
  The waiting thread repeatedly polls the lock until it becomes free.
  Cost: two cache miss penalties (if implemented well) → ≈ 150 ns.
  The thread burns CPU cycles while spinning.

SLIDE 76

Implementation of Spinlocks

Implementation of a spinlock?
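One possible answer, sketched with C11 atomics: a test-and-test-and-set spinlock. The slide only poses the question; this particular variant is an illustrative assumption:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;   /* initialize to false */

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* try to grab the lock: atomically set it to true */
        if (!atomic_exchange_explicit(&l->locked, true,
                                      memory_order_acquire))
            return;                       /* previous value was false */
        /* test-and-test-and-set: spin on a plain read until the lock
         * looks free, so waiting cores spin in their own caches       */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            /* spin */ ;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

While the lock is held, waiting cores spin on their locally cached copy of locked; coherence traffic (the two cache misses mentioned on the previous slide) arises only when the lock is released and re-acquired.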

SLIDE 77

Thread Synchronization

[Timeline diagrams: with blocking, thread 2 is de-scheduled while thread 1 holds the lock and must be woken up after the release; with spinning, thread 2 spins while the lock is held and continues after only a short delay.]

SLIDE 78

Experiments: Locking Performance

Sun Niagara II (64 hardware contexts):

[Chart: throughput (ktps, 30–150) vs. number of threads (32–192) for blocking, spinning, and ideal scaling; the 100 % load point is marked.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 79

Spinning Under High Load

Under high load, spinning can cause problems:

[Timeline diagram: thread 1 holds the lock and gets preempted while thread 2 spins.]

  More threads than hardware contexts → the operating system preempts running tasks.
  Working and spinning threads all appear busy to the OS.
  The working thread likely had the longest time share already → it gets de-scheduled by the OS.
  Long delay before the working thread gets re-scheduled.
  By the time the working thread gets re-scheduled (and can now make progress), the waiting thread likely gets de-scheduled, too.

SLIDE 80

Spinning

[Chart: machine utilization (%, 20–100) vs. client threads (15–191), broken down into work, contention, and priority inversion.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 81

The Right Tool for the Right Purpose

The properties of spinning and blocking suggest their use for different purposes:

Spinning features quick lock hand-offs.
→ Use spinning to coordinate access to a shared data structure (contention).

Blocking reduces system load (→ scheduling).
→ Use blocking at longer time scales.
→ Block when the system load increases, to reduce scheduling overhead.

Idea: Monitor system load (using a separate thread) and control spinning/blocking behavior off the critical code path.

SLIDE 82

Load Controller

The load controller periodically
  determines the current load situation from the OS;
  if the system gets overloaded, "invites" threads to block with the help of a sleep slot buffer
  (size of the sleep slot buffer = number of threads that should block; threads register in the sleep slot buffer before going to sleep);
  when the load decreases, wakes up sleeping threads again.
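A rough sketch of such a controller and a sleep slot buffer, reduced to a counter plus a condition variable. Everything here (names, the load metric via getloadavg(), the 10 ms period) is an assumption for illustration, not the mechanism of Johnson et al.:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t slot_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot_cond = PTHREAD_COND_INITIALIZER;
static int target_sleepers = 0;             /* set by the controller     */
static int cur_sleepers    = 0;             /* threads currently blocked */

/* Called by a thread that failed to acquire a contended latch:
 * block only if the controller asked for more sleepers, else keep spinning. */
static int try_sleep_slot(void)
{
    int slept = 0;
    pthread_mutex_lock(&slot_mtx);
    if (cur_sleepers < target_sleepers) {
        cur_sleepers++;                     /* register in the buffer    */
        pthread_cond_wait(&slot_cond, &slot_mtx);
        cur_sleepers--;
        slept = 1;
    }
    pthread_mutex_unlock(&slot_mtx);
    return slept;
}

/* The controller runs in its own thread, off the critical code path. */
static void *load_controller(void *arg)
{
    (void)arg;
    int ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);
    for (;;) {
        double load[1] = { 0.0 };
        getloadavg(load, 1);                /* current load from the OS  */

        pthread_mutex_lock(&slot_mtx);
        if (load[0] > ncpus) {              /* overloaded: ask threads to block */
            target_sleepers = (int)load[0] - ncpus;
        } else {                            /* load dropped: wake sleepers */
            target_sleepers = 0;
            pthread_cond_broadcast(&slot_cond);
        }
        pthread_mutex_unlock(&slot_mtx);
        usleep(10000);                      /* controller period: ~10 ms */
    }
    return NULL;
}
```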

SLIDE 83

Lock Handling

A thread that wants to acquire a lock
  checks the regular spin lock;
  if the lock is already taken, it tries to enter the sleep slot buffer and blocks (otherwise it spins);
  the load controller will wake up the thread in time.

SLIDE 84

Controller Overhead

[Chart: throughput (ktps, 70–130) vs. controller update delay (100 µs–100,000 µs, log scale) for 98 %, 110 %, and 150 % load.]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.

SLIDE 85

Performance Under Load

[Charts: normalized throughput (20–80) vs. number of threads (1–127) for the Raytrace, TM-1, and TPC-C workloads, comparing pthread mutexes, TP-MCS locks, and the load controller (LC).]

Source: Johnson et al. Decoupling Contention Management from Scheduling. ASPLOS 2010.
