UNDERSTANDING PROCESSOR CACHE EFFECTS WITH VALGRIND & VTUNE



SLIDE 1

UNDERSTANDING PROCESSOR CACHE EFFECTS WITH VALGRIND & VTUNE

Chester Rebeiro, Embedded Lab, IIT Kharagpur

SLIDE 2

Is Time Proportional to Iterations?

SIZE = 64 MBytes; unsigned int A[SIZE];

Iterations A:

for (i = 0; i < SIZE; i += 1) A[i] *= 3;

Iterations B:

for (i = 0; i < SIZE; i += 16) A[i] *= 3;

Is Time(A) / Time(B) = 16 ?

SLIDE 3

Is Time Proportional to Iterations?

Not really!

We get Time(A)/Time(B) = 3! Straightforward pencil-and-paper analysis will not suffice.

A deeper understanding is needed. For this we use profiling tools.

SLIDE 4

Tools for Profiling Software

Static Program Modification

Automatic insertion of code to record performance attributes at run time.

Examples: QPT (Quick Program Profiling and Tracing) for MIPS and SPARC systems, gprof, ATOM.

Hardware Counters

Requires support from the processor for hardware performance monitoring.

Examples: VTune (commercial, Intel), OProfile, perfmon.

Simulators

For simulation of the platform behavior.

Examples: Valgrind (x86 simulation), SimpleScalar.

SLIDE 5

Valgrind

Open source: http://valgrind.org

Valgrind is an instrumentation framework for building dynamic analysis tools.

There are tools for:

Memory checking: detects memory-management problems such as use of uninitialized data, memory leaks, overlapping memcpy arguments, etc.

Cachegrind: a cache profiler.

Callgrind: extends Cachegrind and in addition provides information about call graphs.

Massif: a heap profiler.

Helgrind: useful in multi-threaded programs.

SLIDE 6

Cachegrind

Pinpoints the sources of cache misses in the code. Can simulate the I1, D1, and L2 cache memories.

On modern processors: an L1 cache miss costs around 10 clock cycles; an L2 cache miss can cost as much as 200 clock cycles.

SLIDE 7

Iteration Example Revisited with Cachegrind

SIZE = 64 MBytes; unsigned int A[SIZE];

Iterations A:

for (i = 0; i < SIZE; i += 1) A[i] *= 3;

Iterations B:

for (i = 0; i < SIZE; i += 16) A[i] *= 3;

Is the ratio of Time(A) / Time(B) = 16 ?

SLIDE 8

Running Cachegrind

Console output:

[Figure: the command line, annotated with the tool, the output file name, and the executable. The console summary reports the number of instructions and the number of misses in I1.]
SLIDE 9

Output of Cachegrind (cg1.out)

  • No. of instructions
  • No. of instructions missing L1 cache
  • No. of instructions missing L2 cache
  • No. of data reads
  • No. of data reads missing L1
  • No. of data reads missing L2
  • No. of data writes
  • No. of data writes missing L1
  • No. of data writes missing L2
SLIDE 10

cg_annotate

SLIDE 11

Effects of Cache Line

An unsigned int takes 4 bytes.

A data cache line is 64 bytes, so every 16th unsigned int falls in a new cache line and results in a cache miss.

SLIDE 12

Direct Mapped Cache

Consider a direct mapped cache with

1024 bytes, 32-byte cache lines

Number of cache lines = 1024/32 = 32. Assume the memory address is 32 bits.

The address splits into three fields: tag (22 bits) | line (5 bits) | offset (5 bits)

For ex: Address = 0x12345678

Offset: (11000)2, Line: (10011)2

SLIDE 13

Direct Mapped Cache

SLIDE 14

Cachegrind Results for Direct Mapped

[Figure: A[31][0] and A[32][0] map to the same cache line, causing thrashing in the cache memory.]

SLIDE 15

Set Associative Cache

Consider a cache with

1024 bytes, 32-byte cache lines, 2-way set-associative

Number of cache lines = 1024/32 = 32 (5 bits). Number of sets = 32/2 = 16 (4 bits). Assume the memory address is 32 bits.

The address splits into three fields: tag (23 bits) | set (4 bits) | offset (5 bits)

For ex: Address = 0x12345678

Offset: (11000)2, Set: (0011)2

SLIDE 16

2-way Cache Prevents Thrashing

[Figure: direct mapped vs. 2-way set associative]

SLIDE 17

Traversal for Large Matrices

ROW MAJOR — miss rate per iteration: 8/B. COLUMN MAJOR — miss rate per iteration: 1.

SLIDE 18

Matrix Multiplication Example

We need to multiply C = A*B.

Matrix A is accessed in row major; matrix B is accessed in column major.

SLIDE 19

Analysis of Matrix Multiplication

Huge miss rate, because B is accessed in column-major fashion: each access to B results in a cache miss.

A solution is to compute B transpose first; then only row-major traversal is required.

SLIDE 20

Matrix Multiplication (Naïve Transpose)

The number of misses is reduced by almost 98%.

SLIDE 21

A Better Transpose

Partition the matrix A into tiles: each sub-matrix Ar,s is known as a tile.

[Figure: matrix A partitioned into tiles Ar,s and As,r, next to the cache memory.]

SLIDE 22

A Better Transpose (load)

[Figure: the tile pair Ar,s and As,r is loaded from A into the cache memory.]

SLIDE 23

A Better Transpose (transpose)

[Figure: each tile is transposed inside the cache, producing (Ar,s)T and (As,r)T.]

SLIDE 24

A Better Transpose (transfer)

[Figure: the transposed tiles (Ar,s)T and (As,r)T are written back to A in their swapped positions.]

SLIDE 25

Cache Oblivious Algorithms

An algorithm designed to take advantage of a CPU cache without explicit knowledge of the cache parameters.

A new branch of algorithm design.

Optimal cache-oblivious algorithms are known for the Cooley-Tukey FFT algorithm, matrix multiplication, sorting, and matrix transposition.

SLIDE 26

Summary for Cachegrind

An easy-to-use tool to analyze cache memory behavior for various configurations.

Slow: around 20x to 100x slower than a normal run.

What you simulate is not what you may get! What is needed is a way to analyze software at run time.

SLIDE 27

Related vs Unrelated Memory Accesses

Related data accesses vs. unrelated data accesses:

Time(Related Data Accesses) = 5 × Time(Unrelated Data Accesses)

SLIDE 28

VTune

VTune is a tool for real-time performance analysis of software.

Unlike Valgrind, it has much lower overhead.

It uses MSRs: model-specific performance-monitoring counters. They are "model specific" because the MSRs of one processor may not be compatible with another.

There are two banks of registers:

IA32_PERFEVTSELx: performance event select MSRs

IA32_PMCx: performance monitoring event counters

SLIDE 29

References

  • Valgrind website: http://valgrind.org/
  • Intel, VTune: http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
  • Igor Ostrovsky, Gallery of Processor Cache Effects: http://igoro.com/archive/gallery-of-processor-cache-effects/
  • Siddhartha Chatterjee and Sandeep Sen, Cache Friendly Matrix Transposition
SLIDE 30

Thank You