UNDERSTANDING PROCESSOR CACHE EFFECTS WITH VALGRIND & VTUNE
Chester Rebeiro, Embedded Lab, IIT Kharagpur
SIZE = 64MBytes; unsigned int A[SIZE];
Iterations A:
Iterations B:
Is Time(A) / Time(B) = 16 ?
Not really!
We get Time(A)/Time(B) = 3! Straightforward pencil-and-paper analysis will not explain this.
A deeper understanding is needed. For this we use profiling tools.
Static Program Modification: automatic insertion of code to record performance attributes.
Example : QPT (Quick program profiling and tracing)
Hardware Counters:
Requires support from the processor for hardware performance counters.
Examples: VTune (commercial, Intel), oprofile, perfmon
Simulators:
For simulation of the platform behavior.
Examples: Valgrind (x86 simulation), SimpleScalar
Open source: http://valgrind.org
Valgrind is an instrumentation framework for building dynamic analysis tools.
There are tools for:
Memory checking (Memcheck): to detect memory management problems.
Cachegrind: a cache profiler.
Callgrind: extends Cachegrind and in addition provides call-graph information.
Massif: a heap profiler.
Helgrind: useful in multi-threaded programs (detects synchronization errors).
Cachegrind pinpoints the sources of cache misses in the code.
It can simulate the I1, D1, and L2 cache memories.
On modern processors:
An L1 cache miss costs around 10 clock cycles.
An L2 cache miss can cost as much as 200 clock cycles.
SIZE = 64MBytes; unsigned int A[SIZE];
Iterations A:
Iterations B:
Is the ratio of Time(A) / Time(B) = 16 ?
Console Output :
valgrind --tool=cachegrind ./executable
The three parts of the invocation are the tool, the output file name (cachegrind.out.<pid> by default), and the executable.
Number of Instruction misses in the L1 cache
Number of Data Read misses in L1
Number of Data Write misses in L1
All Data Writes
An unsigned int takes 4 bytes.
A data cache line is 64 bytes, so every 16th element falls in a new cache line and results in a cache miss.
Consider a Direct Mapped Cache with:
1024 bytes total, 32-byte cache lines
Number of Cache Lines = 1024/32 = 32
Assume the memory address is 32 bits.
| tag (22 bits) | line (5 bits) | offset (5 bits) |
For ex: Address = 0x12345678
Offset: (11000)2 = 24, Line: (10011)2 = 19
Thrashing in Cache Memories: elements such as A[31][0] and A[32][0] can map to the same cache line, so alternating accesses to them keep evicting each other.
Consider a 2-way Set-Associative Cache with:
1024 bytes total, 32-byte cache lines
Number of Cache Lines = 1024/32 = 32 (5 bits)
Number of sets = 32/2 = 16 (4 bits)
Assume the memory address is 32 bits.
| tag (23 bits) | set (4 bits) | offset (5 bits) |
For ex: Address = 0x12345678
Offset: (11000)2 = 24, Set: (0011)2 = 3
Direct Mapped vs. 2-way Set-Associative
ROW MAJOR Miss Rate/Iteration: 1/8. COLUMN MAJOR Miss Rate/Iteration: 1.
We need to multiply C = A*B.
Matrix A is accessed in row-major order; Matrix B is accessed in column-major order.
There is a huge miss rate because B is accessed in column-major order: each access to B results in a cache miss.
A solution is to first compute the transpose of B; then only row-major accesses are needed.
This reduces the number of misses by almost 98%.
Tiling: partition the matrix A into tiles. Each sub-matrix Ar,s is known as a tile.
[Figure: the tiles Ar,s and As,r of matrix A, shown relative to the cache memory]
[Figure: each tile As,r is transposed to (As,r)T while it is resident in the cache memory]
Cache-Oblivious Algorithms: an algorithm designed to take advantage of a CPU cache without knowing its exact size or line length.
A new branch of algorithm design.
Optimal cache-oblivious algorithms are known for the following problems:
Cooley-Tukey FFT
Matrix Multiplication
Sorting
Matrix Transposition
Easy-to-use tool to analyze cache memory behavior.
Slow: around 20x to 100x slower than normal execution.
What you simulate is not what you may get!
What is needed is a way to analyze software at run time, on real hardware.
Related Data Accesses vs. Unrelated Data Accesses
Time(Related Data Accesses) = 5 x Time(Unrelated Data Accesses)
VTune is a tool for real-time performance analysis.
Unlike Valgrind, it has low overhead.
It uses MSRs: Model-Specific Performance-Monitoring Registers.
Model specific because MSRs for one processor may not work on another.
There are two banks of registers:
IA32_PERFEVTSELx: performance event select MSRs
IA32_PMCx: performance monitoring event counters
processor-cache-effects/