SLIDE 1

Master Thesis

Atlas Tracking Optimization on GPU

Luis Domingues

Professor: Frédéric Bapst
Supervisors: Paolo Calafiura, Wim Lavrijsen
Expert: Mathieu Monney

02/25/2015

SLIDE 2

Luis Domingues - January 2015 2

Target

SLIDE 3

Code we started from

  • Demonstrator of the ATLAS trigger on GPUs
  • Basic host side

– Take the data
– Send the data to the GPU and compute
– Sleep while waiting for the response
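The host loop above can be sketched roughly as follows. This is a minimal illustration of the take/send/wait pattern, not the thesis code; the kernel and buffer names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the trigger processing.
__global__ void processEvent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

void handleEvent(const float* hostIn, float* hostOut, int n) {
    float *devIn, *devOut;
    cudaMalloc(&devIn,  n * sizeof(float));
    cudaMalloc(&devOut, n * sizeof(float));

    // "Take the data": copy the event from host to device.
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    // "Send the data to the GPU and compute": launch the kernel.
    processEvent<<<(n + 255) / 256, 256>>>(devIn, devOut, n);

    // "Sleep while waiting for the response": the blocking copy waits
    // for the kernel, then brings the result back.
    cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(devIn);
    cudaFree(devOut);
}
```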

SLIDE 4

Code we started from

SLIDE 5

Overlapping pixels and SCT

  • The pixel and SCT processing are done in sequence
  • Same event, but sequential processing...

[Timeline diagram: the pixel kernels run first, then the SCT kernels, each bracketed by time stamps]

SLIDE 6

Overlapping pixels and SCT

SLIDE 7

CUDA Streams

  • A stream is an execution queue of GPU operations
  • Operations in different non-default streams can execute in parallel

[Diagram: three streams, each running H2D → Kernel → D2H, overlapping in time. H2D = host-to-device transfer, D2H = device-to-host transfer]
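The three-stream pipeline in the diagram looks roughly like this in code. A minimal sketch: the kernel is a placeholder, and pinned host memory is used because asynchronous copies require it.

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // placeholder work
}

int main() {
    const int nStreams = 3, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *host[nStreams], *dev[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        // Pinned host memory is required for truly asynchronous copies.
        cudaMallocHost(&host[s], n * sizeof(float));
        cudaMalloc(&dev[s], n * sizeof(float));
    }

    // Each stream queues H2D -> kernel -> D2H; the driver may overlap
    // copies and kernels belonging to different streams.
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(dev[s], host[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(dev[s], n);
        cudaMemcpyAsync(host[s], dev[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(host[s]);
        cudaFree(dev[s]);
    }
    return 0;
}
```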

SLIDE 8

Overlapping pixels and SCT

  • Use CUDA Streams
  • Start the SCT processing before the pixel processing ends

[Timeline diagram: the pixel stream and the SCT stream run their kernels concurrently, each bracketed by time stamps]
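A hedged sketch of the idea: the two detectors go on separate streams so the SCT launch no longer waits for the pixel kernels, and CUDA events stand in for the time stamps of the diagram. Kernel names are hypothetical, not the thesis code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pixelKernels(float* d, int n) { /* placeholder */ }
__global__ void sctKernels(float* d, int n)   { /* placeholder */ }

int main() {
    const int n = 1 << 20;
    float *dPixel, *dSct;
    cudaMalloc(&dPixel, n * sizeof(float));
    cudaMalloc(&dSct,   n * sizeof(float));

    cudaStream_t pixelStream, sctStream;
    cudaStreamCreate(&pixelStream);
    cudaStreamCreate(&sctStream);

    // Events play the role of the "time stamps" in the diagram.
    cudaEvent_t start, pixelDone, sctDone;
    cudaEventCreate(&start);
    cudaEventCreate(&pixelDone);
    cudaEventCreate(&sctDone);

    cudaEventRecord(start);
    // Both launches are queued at once; the GPU may overlap them.
    pixelKernels<<<(n + 255) / 256, 256, 0, pixelStream>>>(dPixel, n);
    sctKernels<<<(n + 255) / 256, 256, 0, sctStream>>>(dSct, n);
    cudaEventRecord(pixelDone, pixelStream);
    cudaEventRecord(sctDone, sctStream);
    cudaEventSynchronize(pixelDone);
    cudaEventSynchronize(sctDone);

    float msPixel, msSct;
    cudaEventElapsedTime(&msPixel, start, pixelDone);
    cudaEventElapsedTime(&msSct, start, sctDone);
    printf("pixel: %.2f ms, SCT: %.2f ms\n", msPixel, msSct);

    cudaEventDestroy(start);
    cudaEventDestroy(pixelDone);
    cudaEventDestroy(sctDone);
    cudaStreamDestroy(pixelStream);
    cudaStreamDestroy(sctStream);
    cudaFree(dPixel);
    cudaFree(dSct);
    return 0;
}
```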

SLIDE 9

Overlapping pixels and SCT

SLIDE 10

Overlapping pixels and SCT

  • For 2000 events, without overlapping:

– Avg Pixel: 2.03 ms
– Avg SCT: 1.95 ms
– Total avg: 3.98 ms

  • For 2000 events, with overlapping:

– Avg Pixel: 2.3 ms
– Avg SCT: 2.5 ms

SLIDE 11

Overlapping pixels and SCT

  • Total execution time

– Without overlapping: 8.65 s
– With overlapping: 6.53 s

SLIDE 12

Multi-thread server side

  • Huge amount of “small” data

– They do not fill the GPU

  • Parallelize the event-level processing with streams
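A hedged sketch of this host-side design, assuming worker threads that each own a non-default CUDA stream and pop events from a shared FIFO; all names are illustrative, not the thesis code.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <vector>

struct Event { int id; };            // stand-in for one event's detector data

std::queue<Event> fifo;              // clients push events here
std::mutex m;
std::condition_variable cv;
bool done = false;

__global__ void processEvent(int id) { /* placeholder kernel */ }

// Each worker owns one non-default stream, so small events from
// different workers can overlap on the GPU instead of queuing.
void worker() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !fifo.empty() || done; });
        if (fifo.empty()) break;     // shutdown requested and queue drained
        Event ev = fifo.front();
        fifo.pop();
        lock.unlock();

        processEvent<<<1, 32, 0, stream>>>(ev.id);
        cudaStreamSynchronize(stream);  // wait for this event only
    }
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);

    for (int i = 0; i < 2000; ++i) { // clients enqueue events
        { std::lock_guard<std::mutex> lock(m); fifo.push({i}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : pool) t.join();
    return 0;
}
```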

SLIDE 13

Multi-thread server side

[Diagram: many clients feeding events into a server-side FIFO queue]

SLIDE 14

Multi-thread server side

  • Life of a thread

SLIDE 15

Multi-thread server side

SLIDE 16

Multi-thread server side

  • Execution times

– Without overlapping: 8.65 s
– With overlapping: 6.53 s
– Multi-threaded server side: 4.7 s

SLIDE 17

CUDA Occupancy

  • A good grid/block size configuration on the card can make a significant difference
  • CUDA offers an API to maximize the occupancy of the kernels
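The occupancy API in question can be used as in the following minimal sketch; the kernel here is a placeholder, not one of the tracking kernels.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy
    // for this kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       myKernel,
                                       0 /* dynamic shared mem */,
                                       0 /* no block size limit */);

    const int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);

    float* d;
    cudaMalloc(&d, n * sizeof(float));
    myKernel<<<gridSize, blockSize>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

As the following slides show, the block size this API suggests maximizes occupancy but is not necessarily the fastest configuration overall.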

SLIDE 18

CUDA Occupancy

[Diagram: a GPU composed of multiprocessors, each containing CUDA cores]

SLIDE 19

CUDA Occupancy

  • Bad block size setup

[Diagram: GPU multiprocessors and CUDA cores running Kernel 1 and Kernel 2 with intra-block synchronization under a bad block size setup]

SLIDE 20

CUDA Occupancy

  • Better block size setup

[Diagram: GPU multiprocessors and CUDA cores running Kernel 1 and Kernel 2 with intra-block synchronization under a better block size setup]

SLIDE 21

CUDA Occupancy

  • Maximizing the occupancy kills the overall performance
  • Run results for 2000 events:

– Big block size: 10.88 s
– Original configuration: 4.7 s
– Small block size: 4.4 s

SLIDE 22

CUDA Occupancy

  • Maximizing the occupancy kills the overall performance
  • Run results for 2000 events:

– Big block size: 3 kernels in parallel (max 5)
– Small block size: 4 kernels in parallel (max 7)

SLIDE 23

Conclusion

  • Important points when using a GPU

– Porting an algorithm to the GPU
– Communicating with the GPU
– Host-side design

  • Keep the GPU busy
  • High occupancy does not allow the GPU to schedule its tasks efficiently