Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS - PowerPoint PPT Presentation

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2014 � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 1

Part III Instruction Execution � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 90

Pipelining in CPUs Pipelining is a CPU implementation technique whereby multiple instructions are overlapped in execution . Break CPU instructions into smaller units and pipeline. E.g. , classical five-stage pipeline for RISC: 0 1 2 3 4 5 clock instr. i IF ID EX MEM WB instr. i + 1 IF ID EX MEM WB instr. i + 2 IF ID EX MEM WB parallel execution � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 91

Pipelining in CPUs Ideally, a k -stage pipeline improves performance by a factor of k . Slowest (sub-)instruction determines clock frequency. Ideally, break instructions into k equi-length parts. Issue one instruction per clock cycle (IPC = 1). Example: Intel Pentium 4: 31+ pipeline stages. � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 92

Hazards The effectiveness of pipelining is hindered by hazards . Structural Hazard Different pipeline stages need same functional unit (resource conflict; e.g. , memory access ↔ instruction fetch) Data Hazard Result of one instruction not ready before access by later instruction. Control Hazard Arises from branches or other instructions that modify PC (“data hazard on PC register”). Hazards lead to pipeline stalls that decrease IPC. � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 93

Structural Hazards A structural hazard will occur if a CPU has only one memory access unit and instruction fetch and memory access are scheduled in the same cycle. 0 1 2 3 4 5 clock IF ID EX MEM WB instr. i instr. i + 1 IF ID EX MEM WB instr. i + 2 IF ID EX MEM WB instr. i + 3 instr. i + 3 stall IF ID IF EX ID MEM EX MEM WB Resolution: Provision hardware accordingly ( e.g. , separate fetch units) Schedule instructions (at compile- or runtime) � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 94

Structural Hazards Structural hazards can also occur because functional units are not fully pipelined . E.g. , a (complex) floating point unit might not accept new data on every clock cycle. Often a space/cost ↔ performance trade-off. � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 95

Data Hazards LD R1, 0(R2) Instructions read R1 before it was DSUB R4, R1, R5 written by DADD (stage WB writes AND R6, R1, R7 register results). OR R8, R1, R9 Would cause incorrect execution result. XOR R10, R1, R11 0 1 2 3 4 5 clock IF ID EX MEM WB LD R1,0(R2) IF ID EX MEM WB DSUB R4,R1,R5 IF ID EX MEM WB AND R6,R1,R7 OR R8,R1,R9 IF ID EX MEM WB XOR R10,R1,R11 IF ID EX MEM � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 96

Data Hazards Resolution: Forward result data from instruction to instruction. Could resolve hazard LD ↔ AND on previous slide (forward R1 between cycles 3 and 4). Cannot resolve hazard LD ↔ DSUB on previous slide. Schedule instructions (at compile- or runtime). Cannot avoid all data hazards. Detecting data hazards can be hard, e.g. , if they go through memory. SD R1, 0(R2) LD R3, 0(R4) � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 97

Tight loops are a good candidate to improve instruction scheduling. for (i = 1000; i > o; i = i + 1) x[i] = x[i] + s; l: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10, -16(R1) L.D F14, -24(R1) l: L.D F0, 0(R1) ADD.D F4, F0, F2 l: L.D F0, 0(R1) DADDUI R1, R1, #-8 ADD.D F8, F6, F2 ADD.D F4, F0, F2 ADD.D F4, F0, F2 ADD.D F12, F10, F2 S.D F4, 0(R1) stall ADD.D F16, F14, F2 DADDUI R1, R1, #-8 stall S.D F4, 0(R1) BNE R1, R2, l S.D F4, 0(R1) S.D F8, -8(R1) BNE R1, R2, l DADDUI R1, R1, #-32 S.D F12, 16(R1) S.D F16, 8(R1) BNE R1, R2, l na ¨ ıve code re-schedule loop unrolling source: Hennessy & Patterson, Chapter 2 � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 98

Control Hazards Control hazards are often more severe than are data hazards. Most simple implementation: flush pipeline , redo instr. fetch 0 1 2 3 4 5 clock branch instr. i IF ID EX MEM WB instr. i + 1 IF idle idle idle idle target instr. IF ID EX MEM WB target instr. + 1 IF ID EX MEM WB With increasing pipeline depths, the penalty gets worse . � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 99

Control Hazards A simple optimization is to only flush if the branch was taken . Penalty only occurs for taken branches. If the two outcomes have different (known) likeliness: Generate code such that a non-taken branch is more likely. Aborting a running instruction is harder when the branch outcome is known late. → Should not change exception behavior . This scheme is called predicted-untaken . → Likewise: predicted-taken (but often less effective) � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 100

Branch Prediction Modern CPUs try to predict the target of a branch and execute the target code speculatively . Prediction must happen early (ID stage too late). Thus: Branch Target Buffers (BTBs) Lookup Table: PC → � predicted target, taken? � . Lookup PC Predicted PC Taken? . . . . . . . . . Consult Branch Target Buffer parallel to instruction fetch . If entry for current PC can be found: follow prediction. If not, create entry after branching. Inner workings of modern branch predictors are highly involved (and typically kept secret). � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 101

Selection Conditions Selection queries are sensitive to branch prediction: SELECT COUNT(*) FROM lineitem WHERE quantity < n Or, written as C code: for (unsigned int i = 0; i < num_tuples; i++) if (lineitem[i].quantity < n ) count++; � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 102

Selection Conditions (Intel Q6700) 900 750 execution time [a.u.] 600 450 300 150 0 0 % 20 % 40 % 60 % 80 % 100 % predicate selectivity � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 103

Predication Predication: Turn control flow into data flow . for (unsigned int i = 0; i < num_tuples; i++) count += (lineitem[i].quantity < n ); This code does not use a branch any more. 3 The price we pay is a + operation for every iteration. Execution cost should now be independent of predicate selectivity. 3 except to implement the loop � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 104

Predication (Intel Q6700) 900 750 execution time [a.u.] 600 450 300 150 0 0 % 20 % 40 % 60 % 80 % 100 % predicate selectivity � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 105

Predication This was an example of software predication . ✛ How about this query? SELECT quantity FROM lineitem WHERE quantity < n Some CPUs also support hardware predication . E.g. , Intel Itanium2: Execute both branches of an if-then-else and discard one result. � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 106

Experiments (AMD AthlonMP / Intel Itanium2) int sel_lt_int_col_int_val(int n, int* res, int* in, int V) { for(int i=0,j=0; i<n; i++){ Itanium2 branch 100 Itanium2 predicated /* branch version */ msec. AthlonMP branch if (src[i] < V) AthlonMP predicated 80 out[j++] = i; /* predicated version */ 60 bool b = (src[i] < V); out[j] = i; 40 j += b; } 20 return j; 20 40 60 80 100 } 0 query selectivity ր Boncz, Zukowski, Nes. MonetDB/X100: Hyper-Pipelining Query Execution. CIDR 2005 . � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 107

Two Cursors The count += . . . still causes a data hazard . This limits the CPUs possibilities to execute instructions in parallel. Some tasks can be rewritten to use two cursors : for (unsigned int i = 0; i < num_tuples / 2; i++) { count1 += (data[i] < n ); count2 += (data[i + num_tuples / 2] < n ); } count = count1 + count2; � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 108

Experiments (Intel Q6700) 900 750 execution time [a.u.] 600 450 300 150 0 0 % 20 % 40 % 60 % 80 % 100 % predicate selectivity � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 109

Conjunctive Predicates In general, we have to handle multiple predicates: SELECT A 1 , . . . , A n FROM R WHERE p 1 AND p 2 AND . . . AND p k The standard C implementation uses && for the conjunction: for (unsigned int i = 0; i < num_tuples; i++) if ( p 1 && p 2 && . . . && p k ) ...; � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 110

Conjunctive Predicates The && introduce even more branches. The use of && is equivalent to: for (unsigned int i = 0; i < num_tuples; i++) if ( p 1 ) if ( p 2 ) . . . if ( p k ) ...; An alternative is the use of the logical & : for (unsigned int i = 0; i < num_tuples; i++) if ( p 1 & p 2 & . . . & p k ) ...; � Jens Teubner · Data Processing on Modern Hardware · Summer 2014 c 111

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS - PowerPoint PPT Presentation

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2014 Jens Teubner Data Processing on Modern Hardware Summer 2014 c 1 Part III Instruction Execution Jens Teubner

MODERN 1 MODERN 2 MODERN 3 MODERN 4 MODERN A peep at some distant orb has power to raise

Hardware Observability Framework Hardware Observability Framework Hardware Observability

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group

Bare Metal Library Abstractions for modern hardware Cyprien Noel Plan 1. Modern Hardware? 2.

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

VC. VC. Hardware Startup The Hardware Revolu/on The Hardware Revolution Removing Barriers to

Sec Secure ure Hardware Hardware and Hardware and Hardware- En Enabled abled Security

5/24/10 Modern Hardware is Complex Modern systems built on layers of hardware Tamper Evident

Modern Risk Modern Risk Modern Risk Management Modern Risk Management anagement Concepts:

Digital Signal Processing Solutions Digital Signal Processing Solutions SIGNAL PROCESSING

FOOD PROCESSING FOOD PROCESSING GREEN BEAN PROCESSING GREEN BEAN PROCESSING GREEN BEAN

GPU-accelerated Data Management Data Processing on Modern Hardware Sebastian Bre TU Dortmund

Software Quality It doesnt happen by accident. Dimensions of Software Quality Software

x A n toine Craske 1 TestCon Europe 2020 2 TestCon Europe 2020 La Redoute ? TestCon Europe

ARCHI TECTURAL CHOI CES FOR DEPENDABLE SYSTEMS Nicole Levy Laboratoire PRISM Universit de

Game Design Document (GDD) Rongkai Guo Why Documentation? The purpose of design documentation

Resource Allocation for Hardware Implementations of Map Richard Townsend Martha A. Kim Stephen

Testing: Our Experiences Test Case Sof tware Testing Software to be tested Output When to

ASIC accelerators 1 Part 2 serial codes out Part 1 due tomorrow, 11:59PM Homework 3

Superscalar Processors Raul Queiroz Feitosa Parts of these slides are from the support material