CPU design effects that can degrade performance of your programs
Jakub Beránek (jakub.beranek@vsb.cz)
whoami
- PhD student @ VSB-TUO, Ostrava, Czech Republic
- Research assistant @ IT4Innovations (HPC center)
- HPC, distributed systems, program optimization
How do we get maximum performance?
- Select the right algorithm
- Use a low-overhead language
- Compile properly
- Tune to the underlying hardware
Why should we care?
- We write code for the C++ abstract machine
- Intel CPUs fulfill the contract of this abstract machine
- But inside they can do whatever they want
- Understanding CPU trade-offs can get us more performance
C++ abstract machine example
How fast are the individual array increments?

void foo(int* arr, int count) {
    for (int i = 0; i < count; i++) {
        arr[i]++;
    }
}
Hardware effects
- Performance effects caused by a specific CPU/memory implementation
- Demonstrate some CPU/memory trade-off or assumption
- Impossible to predict from (C++) code alone
Hardware is getting more and more complex
Source: karlrupp.net
Microarchitecture (Haswell)
- Frontend
- Backend
Source: Intel Architectures Optimization Reference Manual
How bad is it?
- C++ final draft: well over a thousand pages
- Intel x86 manual: 5764 pages!

http://www.open-std.org/jtc/sc/wg/docs/papers//n.pdf
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
Plan of attack
- Show example C++ programs
- short, (hopefully) comprehensible
- compiled with -O3
- Demonstrate weird performance behaviour
- Let you guess what might cause it
- Explain (a possible) cause
- Show how to measure and fix it
- Disclaimer #1: Everything will be Intel x86 specific
- Disclaimer #2: I'm not an expert on this and I may be wrong :-)
Let's see some examples...
Code (backup)
std::vector<float> data = /* 32K random floats in [1, 10] */;
float sum = 0;
// std::sort(data.begin(), data.end());
for (auto x : data) {
    if (x < 6.0f) {
        sum += x;
    }
}
Result (backup)
Most upvoted Stack Overflow question
What is going on? (Intel VTune Amplifier)
What is going on? (perf)
$ perf stat ./example0a --benchmark_filter=nosort

    853,672012      task-clock (msec)    #    0,997 CPUs utilized
            30      context-switches     #    0,035 K/sec
             0      cpu-migrations       #    0,000 K/sec
           199      page-faults          #    0,233 K/sec
 3 159 530 915      cycles               #    3,701 GHz
 1 475 799 619      instructions         #    0,47  insn per cycle
   419 608 357      branches             #  491,533 M/sec
   102 425 035      branch-misses        #   24,41% of all branches
Branch predictor
CPU pipeline
Fetch | Decode | Execute | Write

 7 xor rax,rdx
 8 add rax,rcx
 9 cmp rax,rbx
10 je 15
11 inc rcx
...
15 ret

Instructions flow through the pipeline stages one after another. By the time `je 15` reaches the execute stage, the fetch stage already needs to know which instruction comes next (11 or 15) - without a prediction it would have to stall.
Branch predictor
- CPU tries to predict results of branches
- Misprediction can cost tens of cycles!
Simple branch predictor - unsorted array

if (data[i] < 6) { ... }

- Prediction: repeat the last observed outcome (taken / not taken)
- With random data the outcome keeps flipping, so the prediction is wrong roughly half of the time
- Result: many misses, low hit rate
Simple branch predictor - sorted array

if (data[i] < 6) { ... }

- Prediction: repeat the last observed outcome (taken / not taken)
- With sorted data the branch is taken for one long run of elements and then not taken for the rest
- The predictor only misses around the transition point
- Result: almost all predictions hit, high hit rate
How can the compiler help?
- With float, there are two branches per iteration
- With int, one branch is removed (using cmov)
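The transformation the compiler applies can also be written by hand: turn the condition into a value selection so no data-dependent jump remains. A sketch (whether this beats the compiler's own cmov depends on the compiler and CPU):

```cpp
#include <vector>

// Branchy version: the `if` compiles to a conditional jump
// that the predictor must guess.
float sum_branchy(const std::vector<float>& data) {
    float sum = 0;
    for (float x : data) {
        if (x < 6.0f) sum += x;
    }
    return sum;
}

// Branchless version: the comparison selects the addend (x or 0),
// which compilers typically lower to a conditional move.
float sum_branchless(const std::vector<float>& data) {
    float sum = 0;
    for (float x : data) {
        sum += (x < 6.0f) ? x : 0.0f;
    }
    return sum;
}
```

Both versions compute the same sum; only the second removes the unpredictable branch from the hot loop.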
How to measure?
branch-misses - How many times was a branch mispredicted?

$ perf stat -e branch-misses ./example0a
with sort    ->     383 902
without sort -> 101 652 009
How to help the branch predictor?
- More predictable data
- Profile-guided optimization
- Remove (unpredictable) branches
- Compiler hints (use with caution)
if (__builtin_expect(will_it_blend(), 0)) {
    // this branch is not likely to be taken
}
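Since C++20 the same hint has a portable spelling via attributes. A sketch, where `will_it_blend` is the hypothetical predicate from the slide (stubbed out here so the snippet compiles):

```cpp
// C++20: [[likely]] / [[unlikely]] are a portable spelling of the
// same hint as __builtin_expect.
bool will_it_blend() { return false; } // stub for the slide's hypothetical predicate

int process() {
    if (will_it_blend()) [[unlikely]] {
        return 1; // this branch is not likely to be taken
    }
    return 0;
}
```

As with `__builtin_expect`, this only steers code layout; it cannot fix a branch whose outcome is genuinely unpredictable at runtime.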
Branch target prediction
- Target of a jump is not known at compile time:
- Function pointer
- Function return address
- Virtual method
Code (backup)
struct A {
    virtual void handle(size_t* data) const = 0;
};
struct B : public A {
    void handle(size_t* data) const final { *data += 1; }
};
struct C : public A {
    void handle(size_t* data) const final { *data += 2; }
};

std::vector<std::unique_ptr<A>> data = /* 4K random B/C instances */;
// std::sort(data.begin(), data.end(), /* sort by instance type */);
size_t sum = 0;
for (auto& x : data) {
    x->handle(&sum);
}
Result (backup)
perf (backup)
$ perf stat -e branch-misses ./example0b
with sort    ->     337 274
without sort -> 84 183 161
Code (backup)
// Addresses of N integers, each `offset` bytes apart
std::vector<int*> data = ...;
for (auto ptr : data) {
    *ptr += 1;
}
// Offsets: 4, 64, 4000, 4096, 4128
Result (backup)
Cache memory
How are (L1) caches implemented
- N-way set associative table
  - Hardware hash table
  - Key = address
  - Entry = cache line (64 B)
N-way set associative cache
- Size = total number of cache lines
- Associativity (N) = number of cache lines per bucket
- Number of buckets = Size / N
- N = 1: direct mapped
- N = Size: fully associative
How are addresses hashed?
- 64-bit address: Tag | Index | Offset
- Offset
  - Selects byte within a cache line
  - log2(cache line size) bits
- Index
  - Selects bucket within the cache
  - log2(bucket count) bits
- Tag
  - Used for matching
N-way set associative cache (example)
- Cache lines A, B, C share the same index bits, so they all map to the same bucket
- With a small N the bucket fills up: inserting another conflicting line evicts one that is already cached
- With a larger N (more ways per bucket) all three lines fit at once and no eviction is needed
Intel L1 cache

$ getconf -a | grep LEVEL1_DCACHE
LEVEL1_DCACHE_SIZE       32768
LEVEL1_DCACHE_ASSOC      8
LEVEL1_DCACHE_LINESIZE   64

- Cache line size - 64 B (6 offset bits)
- Associativity (N) - 8
- Size - 32768 B
- 32768 / 64 => 512 cache lines
- 512 / 8 => 64 buckets (6 index bits)
Offset = 4 B
- Numbers A, B, C, D differ only in the offset bits
- Same bucket, same cache line for each number
- Most efficient, no space is wasted
Offset = 64 B
- Numbers A, B, C, D differ in the index bits
- Different bucket for each number
- Wastes 60 B in each cache line (only one 4 B int per 64 B line is used)
- Equally distributed among buckets
Offset = 4096 B
- Numbers A, B, C, D have the same index bits but different tags
- Same bucket, but different cache lines for each number!
- Bucket full => evictions necessary
How to measure?
l1d.replacement - How many times was a cache line loaded into L1?

$ perf stat -e l1d.replacement ./example1
4B offset    ->     149 558
4096B offset -> 426 218 383
Code (backup)
float F = static_cast<float>(std::stof(argv[1]));
std::vector<float> data(4 * 1024 * 1024, 1);
for (int r = 0; r < 100; r++) {
    for (auto& item : data) {
        item *= F;
    }
}
Result (backup)
Denormal floating point numbers
- Zero exponent, non-zero significand
- Numbers close to zero
- Hidden bit = 0, smaller bias
- Operations on denormal numbers are slow!
Floating point handling
How to measure?
fp_assist.any - How many times the CPU switched to the microcode FP handler?

$ perf stat -e fp_assist.any ./example2
0   -> 0
0.3 -> 15 728 640
How to fix it?
- The nuclear option: -ffast-math
  - Sacrifice correctness to gain more FP performance
- Set CPU flags:
  - Flush-to-zero - treat denormal outputs as 0
  - Denormals-to-zero - treat denormal inputs as 0

_mm_setcsr(_mm_getcsr() | 0x8040);
// or
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
There are many other effects
- NUMA
- 4K aliasing
- Misaligned accesses, cache line boundaries
- Instruction data dependencies
- Software prefetching
- Non-temporal stores & cache pollution
- Bandwidth saturation
- DRAM refresh intervals
- AVX/SSE transition penalty
- ...
Thank you!
For more examples visit: github.com/kobzol/hardware-effects
Jakub Beránek
Slides built with github.com/spirali/elsie
Code (backup)
// tid - [0, NO_OF_THREADS)
void thread_fn(int tid, double* data) {
    size_t repetitions = 1024 * 1024 * 1024UL;
    for (size_t i = 0; i < repetitions; i++) {
        data[tid] *= i;
    }
}
Result (backup)
Cache system
Cache coherency
- Memory holds A and B next to each other in one cache line
- Core 1 reads A -> the whole cache line (A and B) is loaded into Core 1's cache
- Core 2 reads B -> the same cache line is also loaded into Core 2's cache
- Core 2 writes B -> the copy in Core 1's cache must be invalidated
- Every subsequent access by Core 1 has to re-fetch the line
False sharing

double arr[16];

- Each element is 8 B, so the 16 doubles span two 64 B cache lines, with the boundary between arr[7] and arr[8]
- Thread 0 and Thread 1 write to different elements that happen to share a cache line
- The writes are logically independent, but cache coherency bounces the line between the cores' caches, so the threads slow each other down