Energy Efficiency in Graphics Rendering - PowerPoint PPT Presentation


SLIDE 1

Energy Efficiency in Graphics Rendering

Preeti Ranjan Panda
Department of Computer Science and Engineering
Indian Institute of Technology Delhi

Presentation at TU Dortmund, June 2011

SLIDE 2

Graphics Power Consumption

[Pie charts: system power breakdown for a desktop computer (Monitor 56%, Chipset 13%, CPU 7%, Graphics 6%, Power Supply Loss 22%, ...) and a mobile computer (14.1" LCD 33%, Graphics 14%, CPU 9%, HDD/DVD 7%, ...)]

[Ref: PC Energy-Efficiency Trends and Technology, source: intel.com]

B. V. N. Silpa and P. R. Panda, 2011

SLIDE 3

Observation

GPU/graphics rendering power is significant (greater than CPU), yet there has been very little research on GPU energy efficiency.

  • GPU performance was/is the primary design goal
  • Proprietary GPU architectures
SLIDE 4

Graphics Pipeline

From CPU: Command processor, Vertex processor, Clipping, Setup and Rasterize, Fragment processor, Image composition, Display. Texture memory feeds the fragment stage.

  • Command processor: receives vertices and commands from the CPU
  • Vertex processor: transforms vertices to screen space and lights the scene
  • Clipping: deletes unseen parts of the scene
  • Setup and Rasterize: generates pixels
  • Fragment processor: pixel coloring and Z-test
  • Image composition: blends with the frame buffer
SLIDE 5

Adding Energy Efficiency

IIT Delhi – Intel Collaboration

[Same graphics pipeline diagram as Slide 4, annotated with the two optimization points]

  • Component level: TEXTURE MAPPING (texture memory in the fragment stage)
  • System level: DVFS (across the whole pipeline)
SLIDE 6

LOW POWER TEXTURE MAPPING

[ICCAD'08]
SLIDE 7

Power Profile of the Pipeline

[Bar chart: normalized energy breakdown (transform and lighting, setup and rasterize, texture memory, fragment processing, frame buffer write) for the City, Fire, Teapot, and Tunnel benchmarks]

Texture memory consumes 30-40% of total power.

SLIDE 8

Texture Mapping

Adds detail and surface texture to an object, reducing the modeling effort for the programmer.

[Figure: Object + Texture = Texture-Mapped Object]
SLIDE 9

Texture Filtering

Texture space and object space can be at arbitrary angles to each other.

  • Nearest neighbor
  • Bilinear interpolation: weighted average of the four texels nearest to the pixel center
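The bilinear weighting above can be sketched as follows. This is a minimal illustration, not the hardware datapath; `texels` is a hypothetical 2D array indexed [row][column] in texel coordinates.

```python
import math

def bilerp(texels, u, v):
    """Bilinear filtering: weighted average of the four texels
    nearest the sample point (u, v), given in texel coordinates."""
    tx, ty = int(math.floor(u)), int(math.floor(v))  # top-left texel of the 2x2 footprint
    fx, fy = u - tx, v - ty                          # fractional position inside it
    top = texels[ty][tx] * (1 - fx) + texels[ty][tx + 1] * fx
    bot = texels[ty + 1][tx] * (1 - fx) + texels[ty + 1][tx + 1] * fx
    return top * (1 - fy) + bot * fy
```

With `texels = [[0, 1], [2, 3]]`, sampling at the exact center (0.5, 0.5) averages all four values to 1.5.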

SLIDE 10

Texture Access Pattern

Texture mapping exhibits high spatial and temporal locality:

  • Bilinear filtering requires the 4 neighbouring texels (tx,ty), (tx+1,ty), (tx,ty+1), (tx+1,ty+1) around the pixel center
  • Neighbouring pixels map to spatially local texels
  • Repetitive textures
SLIDE 11

Blocking and Texture Cache

Blocked representation:

  • Texels stored as 4x4 blocks
  • Reduces dependency on texture orientation, and exploits spatial locality

Texture memory is accessed through a cache hierarchy ("TEXTURE CACHE"). A familiar architectural space, BUT application knowledge could help improve the HW over a "standard cache".
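As an illustration of the blocked layout, a texel coordinate can be mapped to a block index plus an offset inside its 4x4 block. This is a sketch with invented names, assuming row-major ordering of blocks and of texels within a block.

```python
BLOCK = 4  # texels are stored as 4x4 blocks

def block_addr(tx, ty, blocks_per_row):
    """Map texel (tx, ty) to (linear block index, offset within block)."""
    bx, by = tx // BLOCK, ty // BLOCK             # block coordinates
    offset = (ty % BLOCK) * BLOCK + (tx % BLOCK)  # row-major position inside the block
    return by * blocks_per_row + bx, offset
```

Any texel orientation walks through nearby blocks, which is what makes the layout insensitive to texture orientation.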
SLIDE 12

Predictability in Texture Accesses

  • Access to the first texel gives information about accesses to the next 3 texels
  • The four texels (tx,ty) .. (tx+1,ty+1) could be mapped to either one, two, or four neighbouring blocks (Cases 1-4)

[Figure: pixel center and its four texels (tx,ty), (tx+1,ty) in blocks (bx,by), (bx1,by); (tx,ty+1), (tx+1,ty+1) in blocks (bx,by1), (bx1,by1); Cases 1-4]
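The number of blocks a bilinear footprint touches follows from whether (tx,ty)..(tx+1,ty+1) crosses a block boundary in each direction. A sketch; which split direction the deck labels Case 2 versus Case 3 is our assumption, so only the block count is returned.

```python
def blocks_touched(tx, ty, block=4):
    """How many 4x4 blocks the bilinear footprint at (tx, ty) spans:
    1 (case 1), 2 (cases 2 and 3), or 4 (case 4)."""
    split_x = (tx // block) != ((tx + 1) // block)  # crosses a vertical block boundary
    split_y = (ty // block) != ((ty + 1) // block)  # crosses a horizontal block boundary
    return (1 + split_x) * (1 + split_y)
```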
SLIDE 13

Low Power Texture Memory Architecture

A lower-power memory architecture than a cache for texturing:

  • Use a few registers to filter accesses to blocks expected to be reused
  • Access stream has predictability: a controlled access mechanism reduces tag lookups
SLIDE 14

How many blocks to buffer?

Need to buffer up to 4 blocks.

  • A buffer is a set of 4x4 registers, each 32 bits
  • A Texture Buffer Array is a group of 4 such buffers
SLIDE 15

Texture Lookup

Case 1:
  • Lookup (block0)
  • Get the 4 texels from the block using offsets
  • SAVING: 3 LOOKUPS

Cases 2 & 3:
  • Lookup (block 0)
  • Get texel0 and texel1 from this block
  • Lookup (block 2)
  • Get texel2 and texel3 from this block
  • SAVING: 2 LOOKUPS
SLIDE 16

Contd..

Case 4:
  • Lookup all 4 blocks and get the texels from the respective blocks using offsets

Power savings from:
  • Reduced tag lookups
  • Smaller buffer than a cache
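Putting the four cases together: the scheme does one tag lookup per distinct block, versus the four a texel-per-lookup cache would need. An illustrative sketch:

```python
def tag_lookups(case):
    """Tag lookups done and saved per bilinear access, by case:
    one lookup per distinct block instead of one per texel."""
    blocks = {1: 1, 2: 2, 3: 2, 4: 4}[case]
    return blocks, 4 - blocks  # (lookups done, lookups saved)
```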
SLIDE 17

Distribution of accesses among the four cases

[Bar chart: fraction of texture accesses falling into case 1 through case 4]

The average number of comparisons per access is 1.38 instead of 4.
SLIDE 18

Architecture of Texture Filter Memory

[Block diagram: a controller (Load Hit, Enable, Cur Level, Bank Sel, R/W signals) in front of two 256-byte banks (Bank-I, Bank-II) with 512-bit block rows; address comparators (Addr Comp-I, Addr Comp-II) with registers and an encoder match the incoming Block Addr against the Texture Buffer Array; a Texel Fetch Unit selects the 32-bit texel via the offset; misses are served from the L1 cache]

SLIDE 19

Hit Rate into TFM

[Bar chart: hit rates for Fire, Teapot, Tunnel, Gloss, Gearbox, Sphere under a 16KB 2-way associative cache, a 512B direct mapped cache, a 512B fully associative cache, and TFM]

TFM gives a 4.5% better hit rate than a direct mapped filter of the same size.

SLIDE 20

Energy per Access

[Bar chart: energy per access (nJ) for Fire, Teapot, Tunnel, Gloss, Gearbox, Sphere under a 16KB 2-way associative L1, 512B direct mapped L1, 512B direct mapped filter, 512B fully associative filter, and TFM]

TFM consumes 75% less energy than the conventional texture cache.

SLIDE 21

Texture Filter Memory Summary

In addition to high spatial locality, the texture mapping access pattern also has predictability.

Replaced high-energy cache lookups with low-energy register buffer reads.

TFM consumes ~75% less energy than a conventional texture mapping system.

Overheads:
  • TFM access 4x faster than cache access
  • 0.48% area overhead over the texture cache subsystem
SLIDE 22

DYNAMIC VOLTAGE AND FREQUENCY SCALING (DVFS)

[CODES+ISSS'10]
SLIDE 23

Tiled Graphics Rendering

[Figure: a frame divided into Tiles 1-4, with primitives A, B, C sorted into per-tile Bins 1-4 (e.g. Bin 1: A,B; Bin 2: C)]

Geometry Pipeline: Vertex Shader, Primitive Assembly, Clipping (Geometry Processing), then Tiling. Pixel Pipeline: Setup & Rasterize, Pixel Shading, Raster Operations (Pixel Processing).
SLIDE 24

Workload of games

Different games have significant but gradual workload variation within a game.

[Line plots: cycles per frame over frame numbers 1-99 for two games, showing gradual variation]
SLIDE 25

Spatial Correlation in frames

Continuity of motion leads to frame-level spatial correlation, resulting in slow workload variation.
SLIDE 26

Temporal Correlation of Tile Workload

Many tiles are correlated, even if the workloads of consecutive frames differ: 80% of tiles are within a 10% difference.
SLIDE 27

Dynamic Voltage and Frequency Scaling

[Figure: predicted workload runs Tiles 1 and 2 at V/2 over time 2T; if tile #1 is over-predicted (finishes at T/2), slow tile #2 down to V/3; if tile #1 is under-predicted (finishes at 3T/2), speed tile #2 up to V]

Continuously track and take corrective action after rendering each tile.
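The per-tile corrective step can be sketched as picking the lowest frequency that still fits the remaining predicted work into the remaining time budget. Names are illustrative; the slide only gives the V/2, V/3, V example.

```python
def next_frequency(remaining_cycles, remaining_time, f_max):
    """Frequency for the next tile: cycles still predicted, divided by
    time left in the frame budget, clamped to the maximum frequency."""
    if remaining_time <= 0:
        return f_max  # behind schedule: run at full speed
    return min(remaining_cycles / remaining_time, f_max)
```

Over-prediction leaves extra time (frequency drops, as in V/2 to V/3); under-prediction leaves too little (frequency rises toward V).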
SLIDE 28

Frame Rank (R_G, R_p, R_t, R_r)

Vertex processing workload of a primitive of V vertices using a shader N_v instructions long:

  • Shader workload ~ V * N_v
  • Clipping and binning ~ V

R_G = Σ_Batches (VertexCount × VertexShaderLength + PrimitiveCount)

Pixel shading workload:

  • Number of pixels per primitive ~ area of the bounding box of the primitive

R_p = Σ_Batches Σ_Primitives (PrimitiveArea × PixelShaderLength)
SLIDE 29

Frame Rank (R_G, R_p, R_t, R_r)

Texture mapping workload:

  • Texture footprint: number of texels to be filtered per pixel

R_t = Σ_Batches Σ_Primitives (PrimitiveArea × TextureCount × TextureFootPrint)

Raster operations workload:

  • Each raster operation results in a read and a write to the frame buffer

R_r = Σ_Batches Σ_Primitives (PrimitiveArea × RasterOps × 2)
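The four rank estimates can be sketched directly from the formulas above. Field names are ours, not the deck's: each batch carries its vertex data and a list of primitives.

```python
def frame_ranks(batches):
    """Compute (R_G, R_p, R_t, R_r). Each batch: vertex_count,
    vertex_shader_len, primitives; each primitive: bbox_area,
    pixel_shader_len, texture_count, texture_footprint, raster_ops."""
    r_g = r_p = r_t = r_r = 0
    for b in batches:
        # geometry: shader work per vertex, plus clipping/binning per primitive
        r_g += b["vertex_count"] * b["vertex_shader_len"] + len(b["primitives"])
        for p in b["primitives"]:
            r_p += p["bbox_area"] * p["pixel_shader_len"]
            r_t += p["bbox_area"] * p["texture_count"] * p["texture_footprint"]
            r_r += p["bbox_area"] * p["raster_ops"] * 2  # read + write per raster op
    return r_g, r_p, r_t, r_r
```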

SLIDE 30

Tile Rank (T_p, T_t, T_r)

Tile rank computation is similar to frame rank computation.

Pixel count is computed as the overlap area of the primitive's bounding box and the tile (approximate number of pixels).
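The overlap computation is a standard rectangle intersection. A sketch, with rectangles given as (x0, y0, x1, y1):

```python
def overlap_area(bbox, tile):
    """Approximate a primitive's pixel count within a tile as the
    intersection area of its bounding box with the tile rectangle."""
    w = min(bbox[2], tile[2]) - max(bbox[0], tile[0])
    h = min(bbox[3], tile[3]) - max(bbox[1], tile[1])
    return max(w, 0) * max(h, 0)  # zero when the rectangles do not overlap
```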
SLIDE 31

Rank Based DVFS Scheme

Divide the tiles into a set of Heavy tiles (tile rank in the current frame greater than its rank in the previous frame) and Light tiles.

Frame_Rank(current) > Frame_Rank(previous)?
  • Yes: process Heavy tiles at frequency F = F_Max
  • No: process Heavy tiles at a frequency determined by frame history
  • Use the tile-history-based scheme for light tiles
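The flowchart's decision logic, as a sketch; the history-derived frequencies are assumed to be computed elsewhere, and the names are illustrative.

```python
def tile_frequency(frame_rank_cur, frame_rank_prev, is_heavy,
                   f_max, f_frame_hist, f_tile_hist):
    """Pick a tile's frequency per the rank-based DVFS flowchart."""
    if not is_heavy:
        return f_tile_hist  # light tiles: tile-history-based scheme
    if frame_rank_cur > frame_rank_prev:
        return f_max        # heavy tiles in a heavier frame: max frequency
    return f_frame_hist     # otherwise: frequency from frame history
```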

SLIDE 32

Tile Rank Based DVFS Summary

Tile Rank Based DVFS gives 75% better performance than the history-based scheme.

Energy/FrameRate is minimum for the Tile Rank based DVFS scheme.

Overheads:
  • < 0.01% computation
  • < 0.01% storage
SLIDE 33

Future Work

  • Extension to multi-core GPUs
  • Other stages of the graphics pipeline
SLIDE 34

Thank You!