Data Processing on Modern Hardware
Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2019/20
© Jens Teubner · Data Processing on Modern Hardware · Winter 2019/20
Part V: Execution on Multiple Cores
[Figure: performance degradation (y-axis, 0% to 60%) as a function of hash table size in MB (x-axis, 0.4 to 18.6). Series: Index Join to Index Join, Index Join to Hash Join, Hash Join to Index Join, Hash Join to Hash Join, and Index Join to Index Join (bitmap scan).]
Source: Lee et al. VLDB 2009.
5 Memory is organized in pages. A typical page size is 4 kB.
Source: Lee et al. VLDB 2009.
Source: Lee et al. VLDB 2009.
Source: Lee et al. VLDB 2009.
6 We also demand that a read by processor P will return P’s most recent write,
1 Snooping-Based Coherence
2 Directory-Based Coherence
7 The protocol is thus also called write broadcast protocol.
8 With write-through caches, memory will be updated immediately.
9 The AMD counterpart is “HyperTransport”.
[Figure: miss rate vs. processor count (1, 2, 4, 8, 16) for four benchmarks (FFT, Ocean, LU, Barnes), each broken down into coherence miss rate and capacity miss rate.]
Message type | Source | Destination | Message contents | Function of this message
Read miss | Local cache | Home directory | P, A | Node P has a read miss at address A; request data and make P a read sharer.
Write miss | Local cache | Home directory | P, A | Node P has a write miss at address A; request data and make P the exclusive owner.
Invalidate | Local cache | Home directory | A | Request to send invalidates to all remote caches that are caching the block at address A.
Invalidate | Home directory | Remote cache | A | Invalidate a shared copy of data at address A.
Fetch | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared.
Fetch/invalidate | Home directory | Remote cache | A | Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply | Home directory | Local cache | D | Return a data value from the home memory.
Data write-back | Remote cache | Home directory | A, D | Write back a data value for address A.
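To make the message flow in the table more concrete, the following is a minimal sketch of how a home directory might react to read and write misses. The class names (`DirectoryEntry`, `HomeDirectory`), the state names ("uncached", "shared", "exclusive"), and the message log are assumptions made for illustration; the sketch ignores data transfer, write-backs, and message races.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Directory state for one block at its home node (hypothetical layout)."""
    state: str = "uncached"                       # "uncached", "shared", or "exclusive"
    sharers: set = field(default_factory=set)     # nodes that currently cache the block


class HomeDirectory:
    """Toy home directory; it only records which messages it would send."""

    def __init__(self):
        self.entries = {}   # address -> DirectoryEntry
        self.log = []       # (message type, destination node, address)

    def _send(self, msg, dest, addr):
        self.log.append((msg, dest, addr))

    def read_miss(self, p, addr):
        """Node p has a read miss at addr: make p a read sharer."""
        e = self.entries.setdefault(addr, DirectoryEntry())
        if e.state == "exclusive":
            (owner,) = e.sharers
            if owner != p:
                # The owner holds the only valid copy: fetch it, owner becomes a sharer.
                self._send("fetch", owner, addr)
        e.state = "shared"
        e.sharers.add(p)
        self._send("data value reply", p, addr)

    def write_miss(self, p, addr):
        """Node p has a write miss at addr: make p the exclusive owner."""
        e = self.entries.setdefault(addr, DirectoryEntry())
        if e.state == "shared":
            for q in sorted(e.sharers - {p}):
                self._send("invalidate", q, addr)
        elif e.state == "exclusive":
            (owner,) = e.sharers
            if owner != p:
                self._send("fetch/invalidate", owner, addr)
        e.state = "exclusive"
        e.sharers = {p}
        self._send("data value reply", p, addr)


d = HomeDirectory()
d.read_miss(0, 0x40)    # node 0 becomes a sharer
d.read_miss(1, 0x40)    # node 1 becomes a sharer
d.write_miss(2, 0x40)   # nodes 0 and 1 are invalidated, node 2 becomes owner
d.read_miss(0, 0x40)    # block is fetched from node 2, then shared by 0 and 2
```

After the four requests, `d.log` lists the messages the directory would have sent, in order, and the entry for address `0x40` is back in the shared state.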
6.6 13.2 19.6 80.7
Intel Nehalem EX; 1.87 GHz; 2 CPUs, 8 cores/CPU.
10 In general, this will yield incorrect counter values.
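Why an unsynchronized counter goes wrong can be sketched as follows. The first function replays, deterministically, one interleaving of two unsynchronized increments in which an update is lost; the second protects the increment with a lock. Both function names are hypothetical, and the sketch says nothing about the cost of either variant on real hardware.

```python
import threading


def interleaved_increment():
    """Replay one bad interleaving of two unsynchronized `counter += 1` operations."""
    counter = 0
    reg_a = counter        # thread A loads 0
    reg_b = counter        # thread B loads 0 before A has stored its result
    counter = reg_a + 1    # A stores 1
    counter = reg_b + 1    # B stores 1 and overwrites A's update
    return counter


def locked_increments(n_threads, n_iters):
    """Increment a shared counter from several threads, one lock-protected step at a time."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_iters):
            with lock:     # load, add, and store happen as one critical section
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


lost = interleaved_increment()       # two increments, but only one is visible
safe = locked_increments(4, 1000)    # every increment is counted
```

The lock serializes all increments, which is exactly the kind of contention that makes a shared counter expensive on a multi-core machine.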
[Figure: NUMA system with four sockets (Socket 0 to Socket 3), each with its own local memory.]
[Figure: execution trace for query sf10-18 (Sat Sep 15 12:30:25 2012). Panels: IO per ms (reads, writes), memory in GB, and CPU threads over time in milliseconds (up to 9614.24). Parallelism usage: 62.1 %.]

Operator | Calls | Time
aggr.sum | 20 | 12.18 sec
algebra.leftjoin | 74 | 2.89 sec
algebra.join | 59 | 16.49 sec
algebra.semijoin | 9 | 2.40 ms
algebra.kdifference | 83 | 11.47 ms
algebra.kunion | 58 | 30.13 ms
algebra.slice | 1 | 0.10 ms
algebra.markT | 59 | 7.44 ms
algebra.thetauselect | 1 | 1.48 sec
algebra.* | 31 | 3.13 sec
bat.mirror | 35 | 49.45 ms
bat.reverse | 81 | 8.36 ms
group.multicolumns | 10 | 2.21 ms
group.* | 11 | 5.81 sec
language.dataflow | 2 | 9.61 sec
mat.pack | 8 | 1.95 sec
pqueue.* | 2 | 1.85 ms
io.stdout | 1 | 0.07 ms
sql.* | 127 | 20.71 ms

672 MAL instructions executed.
[Figure: morsel-wise processing of the probe pipeline.]
R (processed morsel by morsel): Z = a, c, i, b, e, j, d, f, ...; A = 16, 7, 10, 27, 18, 5, 7, 5, ...
HT(S): A = 16, 18, 27, 5, 7; B = 8, 33, 10, 5, 23
HT(T): B = 8, 33, 10, 5, 23; C = v, x, y, z, u
Processing steps shown: probe(16) into HT(S), probe(8) into HT(T), store; probe(27) into HT(S), probe(10) into HT(T), store.
Result (Z, A, B, C): (a, 16, 8, v), ...; (b, 27, 10, y), ...
1 Scan, filter T, build HT(T).
2 Scan, filter S, build HT(S).
3 Scan, filter R, probe into both hash tables.
[Figure: query plan joining R with S and T, split into the three pipelines listed above.]
Pipelines 1 and 2 must complete before Pipeline 3 begins.
Pipelines 1 and 2 can run in parallel with each other.
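The dependency between the three pipelines can be sketched as follows. The helper names (`build_hash_table`, `probe_pipeline`), the accept-everything filter predicates, and the positional pairing of the sample values into rows are assumptions for illustration; the sample values themselves follow the example relations R(Z, A), S(A, B), and T(B, C) from the preceding figure.

```python
from concurrent.futures import ThreadPoolExecutor


def build_hash_table(relation, key, predicate=lambda row: True):
    """Scan + filter + build: map each key value to the list of rows that carry it."""
    table = {}
    for row in relation:
        if predicate(row):
            table.setdefault(row[key], []).append(row)
    return table


def probe_pipeline(r, ht_s, ht_t, predicate=lambda row: True):
    """Scan + filter R, probe HT(S) on A, then probe HT(T) on B, and store matches."""
    result = []
    for r_row in r:
        if not predicate(r_row):
            continue
        for s_row in ht_s.get(r_row["A"], []):
            for t_row in ht_t.get(s_row["B"], []):
                result.append((r_row["Z"], r_row["A"], s_row["B"], t_row["C"]))
    return result


# Sample data; pairing values by position is an assumption.
T = [{"B": b, "C": c} for b, c in zip([8, 33, 10, 5, 23], "vxyzu")]
S = [{"A": a, "B": b} for a, b in zip([16, 18, 27, 5, 7], [8, 33, 10, 5, 23])]
R = [{"Z": z, "A": a} for z, a in zip("acibejdf", [16, 7, 10, 27, 18, 5, 7, 5])]

# Pipelines 1 and 2 are independent of each other and may run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_t = pool.submit(build_hash_table, T, "B")   # Pipeline 1: build HT(T)
    fut_s = pool.submit(build_hash_table, S, "A")   # Pipeline 2: build HT(S)
    ht_t, ht_s = fut_t.result(), fut_s.result()     # wait for both builds

# Pipeline 3 starts only after both hash tables are complete.
result = probe_pipeline(R, ht_s, ht_t)
```

Waiting on both futures before the probe is the barrier between the build pipelines and the probe pipeline; within each pipeline, a real system would additionally split the scan into morsels and process those in parallel.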
[Figure: NUMA-aware build of the global hash table HT(T).]
Phase 1: process T morsel-wise and store NUMA-locally.
Phase 2: scan NUMA-local storage area and insert pointers into HT.
Each core (red, green, blue) has its own storage area. In Phase 2, each core scans its own storage area and inserts a pointer to each tuple into the global hash table HT(T).
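The two phases can be sketched as follows. The function names, the round-robin assignment of morsels to workers (standing in for dynamic assignment), and the representation of a "pointer" as a `(worker, slot)` pair are assumptions for illustration; the sketch runs sequentially and does not model NUMA placement itself.

```python
def morsels(relation, size):
    """Split a relation into consecutive morsels of at most `size` rows."""
    for start in range(0, len(relation), size):
        yield relation[start:start + size]


def build_two_phase(t, n_workers, morsel_size, predicate=lambda row: True):
    # Phase 1: each worker filters its morsels and appends the surviving
    # tuples to its own (conceptually NUMA-local) storage area.
    storage = [[] for _ in range(n_workers)]
    for i, morsel in enumerate(morsels(t, morsel_size)):
        worker = i % n_workers   # assumption: round-robin instead of dynamic dispatch
        storage[worker].extend(row for row in morsel if predicate(row))

    # Phase 2: each worker scans its own storage area and inserts a pointer
    # to every tuple into the single global hash table.
    hash_table = {}
    for worker, area in enumerate(storage):
        for slot, row in enumerate(area):
            key = row[0]
            hash_table.setdefault(key, []).append((worker, slot))
    return storage, hash_table


T = [(8, "v"), (33, "x"), (10, "y"), (5, "z"), (23, "u")]
storage, ht = build_two_phase(T, n_workers=3, morsel_size=2)

# Following a pointer from the global table leads back into a local storage area.
worker, slot = ht[10][0]
tuple_for_10 = storage[worker][slot]
```

Because the global table holds only pointers, the tuples stay where Phase 1 placed them, and only the small table entries are shared by all cores.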
[Figure: morsel dispatcher on an example NUMA multi-core server with 4 sockets and 32 cores.]
Dispatcher code: dispatch(0) returns (J1, Mr1), i.e., pipeline-job J1 on morsel Mr1.
Lock-free data structures of the dispatcher:
List of pending pipeline-jobs (J1, J2, ...), possibly belonging to different queries.
(Virtual) lists of morsels to be processed: Mr1, Mr2, Mr3, Mg1, Mg2, Mg3, Mb1, Mb2, Mb3 (colors indicate on which socket/core the morsel is located).
Hardware: four sockets connected by an interconnect; each socket has 8 cores (Core0 to Core7 on the first socket, Core8 onward on the second, and so on) and its own DRAM.
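A dispatcher along these lines can be sketched as follows. The `Dispatcher` class, the mapping of the red, green, and blue morsel lists to sockets 0, 1, and 2, the morsel name `M1` for job J2, and the fallback to a remote morsel when no local one is left are all assumptions for illustration. The sketch is single-threaded and uses plain Python containers, whereas the figure describes lock-free data structures.

```python
from collections import deque

CORES_PER_SOCKET = 8   # example server: 4 sockets, 32 cores


class Dispatcher:
    def __init__(self):
        # Pending pipeline-jobs: (job name, {socket: deque of morsels}).
        self.jobs = deque()

    def add_job(self, name, morsels_by_socket):
        lists = {s: deque(m) for s, m in morsels_by_socket.items()}
        self.jobs.append((name, lists))

    def dispatch(self, core):
        """Return (job, morsel) for the requesting core, or None if no work is left."""
        socket = core // CORES_PER_SOCKET
        while self.jobs:
            name, lists = self.jobs[0]
            local = lists.get(socket)
            if local:
                # Prefer a morsel that resides on the requesting core's socket.
                return name, local.popleft()
            # Assumption: otherwise hand out a morsel from another socket.
            for other in sorted(lists):
                if lists[other]:
                    return name, lists[other].popleft()
            self.jobs.popleft()   # no morsels left to hand out for this job
        return None


d = Dispatcher()
d.add_job("J1", {0: ["Mr1", "Mr2", "Mr3"],
                 1: ["Mg1", "Mg2", "Mg3"],
                 2: ["Mb1", "Mb2", "Mb3"]})
d.add_job("J2", {0: ["M1"]})   # hypothetical second job

first = d.dispatch(0)    # core 0 sits on socket 0 and gets a red morsel
second = d.dispatch(8)   # core 8 sits on socket 1 and gets a green morsel
third = d.dispatch(24)   # socket 3 has no local morsels and takes a remote one
```

Handing out work one morsel at a time means a core that finishes early simply asks again, so load imbalance is bounded by the size of a single morsel.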
[Figure: normalized throughput (y-axis, 20 to 80) for the pthread, TP-MCS, and LC variants in three panels (Raytrace, TM-1, TPC-C); x-axis ticks in each panel: 1, 15, 31, 63, 71, 95, 127.]