Landing GNU-Based OpenMP on CELL: Progress Report and Perspectives



SLIDE 1

2007/6/19 Gao-CELL-06-2007 1

Landing GNU-Based OpenMP on CELL:

Progress Report and Perspectives

Guang R. Gao

Computer Architecture and Parallel System Laboratory
Department of Electrical & Computer Engineering
University of Delaware
ggao@capsl.udel.edu

SLIDE 2

Outline

  • Background
  • Why GNU OpenMP on CELL?
  • Project Status Report
  • Preliminary Results
  • Future Perspectives

SLIDE 3

CAPSL Research Layout

High End Computing Architecture & Programming Models

Fine Grained Multithreading (i.e. EARTH, CARE)

Infrastructure & Tools

  • System Tools
  • Simulation/Emulation
  • Analytical Modeling

  • Base Execution Model
  • High Performance Bio-computing Kernels
  • Other High End Applications
  • Scientific Computation Kernels

SLIDE 4

Outline

  • Background
  • Why GNU OpenMP on CELL?
  • Project Status Report
  • Preliminary Results
  • Future Perspectives

SLIDE 5

CBE Architecture Overview

  • Local storage size per SPU: 256 KB
  • Area: 221 mm²
  • Technology: 90 nm SOI
  • Total number of transistors: 234M
  • Observed clock speed: a wide range of operating frequencies is supported to optimize for power and yield
  • Peak performance (single precision): > 256 GFlops
  • Peak performance (double precision): > 26 GFlops

SLIDE 6

State on Parallel Languages

(based on a recent survey by G. Pfister, IBM)

  • 200+ parallel language efforts in the past.
  • At first glance: most of them are not used!
  • When talking about parallel languages, you usually hear MPI (90% of the time) and OpenMP (10%).
  • Auto-parallelization has drifted from the general scene toward obscurity.

SLIDE 7

Why OpenMP?

  • OpenMP is an industry standard for writing parallel programs on shared-memory architectures.
  • OpenMP is available.
  • OpenMP is being productively used.
  • OpenMP is …

SLIDE 8

OpenMP

Major Issues and Challenges

  • For compiler writers, not users
  • Pragma / directive based
  • Default is set to make it easy to write fast (but not necessarily correct) programs
  • OpenMP does not support sequential consistency
  • Data layout and locality management
  • Lack of support for OpenMP by the GCC compilers for the CBE
  • It has only 10% of the parallel programming user community

ACK: this list comes from private communication with a number of people: William Gropp, John Mellor-Crummey, Rick Stevens, Thomas Sterling, Ross Towle, Kathy Yelick, etc.
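The point about defaults that favor fast over correct code can be illustrated with a minimal sketch. The function name `sum_to` is hypothetical and not from the talk. A variable declared outside a parallel loop is shared by default, so accumulating into it from several threads is a data race; the `reduction` clause is what makes the loop below correct.

```c
#include <assert.h>

/* Sums 1..n in parallel.
 * `sum` is declared outside the loop, so OpenMP would share it among
 * threads by default and the unsynchronized updates would race.
 * reduction(+ : sum) gives each thread a private partial sum and
 * combines the partial sums when the loop ends. */
long sum_to(int n)
{
    long sum = 0;
    int i;

    assert(n >= 0);
#pragma omp parallel for reduction(+ : sum)
    for (i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

With GCC the pragma takes effect only when compiling with `-fopenmp`; without that flag the loop runs sequentially and returns the same value. Dropping the `reduction` clause still compiles without complaint, which is exactly the hazard the list item describes.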

SLIDE 9

Issue #7

“I think it's a waste of time to focus on trying to force these old broken poor parallel processing languages/protocols into the new approach.”

However, OpenMP is widely available today. This is evident from its inclusion in the GNU Compiler Collection in Release 4.2.0.

SLIDE 10

GOMP Status

  • See http://gcc.gnu.org/projects/gomp/
  • OpenMP support for C, C++, and Fortran 95
  • Will support 2.5 and 3.0 soon
  • Released on May 5, 2007 as part of the official release of GCC 4.2

SLIDE 11

Outline

  • Background
  • Why GNU OpenMP on CELL?
  • Project Description
  • Preliminary Results
  • Future Perspectives

SLIDE 12

GNU Based OpenMP on CELL

Objectives

  • A working OpenMP-CELL platform with the following features:
      • Single source compilation
      • Code partition and overlay
      • Software caching
      • A simple runtime system
  • Should finish in 1 year, pass a set of (non-toy) benchmarks, and publish papers
  • Optimization is NOT an objective, but the project should propose a wish list of research topics for the next phase
  • Try to leverage knowledge/experience from the Cyclops-64 project

SLIDE 13

Single Source Compilation

Progress Report

[Figure: source code is compiled by SPU-cc and PPU-cc (modified compiler, assembler and linker) into an SPU exec and a PPU exec; an embedder combines them into the final exec.]

  • SPU path (SPU-cc → SPU exec)
      • Partition creation by clustering
      • Addition of assembly directives
      • Insertion of library calls
      • Outlining of parallel functions
      • Result: SPU binary plus partition manager and software cache libraries
  • PPU path (PPU-cc → PPU exec)
      • Insertion of library calls
      • Outlining of sequential code
      • Result: PPU binary plus GOMP & SPE libraries
  • Embedder → final exec
      • Result: final executable with all the necessary (static) libraries
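As a rough illustration of what "outlining of parallel functions" means, the sketch below moves the body of a parallel loop into its own function and replaces the original call site with a runtime call. All names here (`scale_args`, `scale_outlined`, `run_on_workers`, `NUM_WORKERS`) are hypothetical, and the stand-in runtime simply calls the outlined function once per worker in sequence; the real tool-chain would dispatch it to SPU threads.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WORKERS 4 /* hypothetical worker count, not from the talk */

/* Shared variables of the parallel region, packed into one struct so
 * the outlined function can receive them through a single pointer. */
struct scale_args {
    float *data;
    size_t n;
    float factor;
};

/* Outlined loop body: each worker scales its own contiguous chunk. */
static void scale_outlined(void *p, int worker, int num_workers)
{
    struct scale_args *a = p;
    size_t chunk = (a->n + (size_t)num_workers - 1) / (size_t)num_workers;
    size_t begin = (size_t)worker * chunk;
    size_t end = begin + chunk < a->n ? begin + chunk : a->n;
    size_t i;

    for (i = begin; i < end; i++)
        a->data[i] *= a->factor;
}

/* Stand-in for the runtime entry point (an assumption): runs the
 * outlined function once per worker id, one after another. */
static void run_on_workers(void (*fn)(void *, int, int), void *args)
{
    int w;

    for (w = 0; w < NUM_WORKERS; w++)
        fn(args, w, NUM_WORKERS);
}

/* What the original call site looks like after outlining. */
void scale(float *data, size_t n, float factor)
{
    struct scale_args args;

    assert(data != NULL || n == 0);
    args.data = data;
    args.n = n;
    args.factor = factor;
    run_on_workers(scale_outlined, &args);
}
```

Packing the shared variables into a struct keeps the outlined function's signature uniform, which is what lets a single runtime entry point launch any parallel region.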

SLIDE 14

The Code Overlay Problem

SLIDE 15

Our Code Overlay Manager

Features

  • Semi-static sub-division of buffer
  • Replacement policies and buffer behaviors
      • LRU vs. other replacement policies
      • Lazy reuse [cache-like] buffer behavior

Modified Toolchain

  • User-aided and automatic code partitioning
  • Command line options

Remarks

  • The compiler does not need to break object code into multiple files and explicitly put the names of the files into a linker script; simply link the partition manager library and use the default GNU linker script.

SLIDE 16

Software Cache

Why software caching?

Features:

  • Cache coherence enforced at synchronization points (e.g. barrier, lock, etc.)
  • Handle false sharing at byte level

Other cache design decisions

  • Cache parameters: 32-bit address, block size 128 B, 128 blocks (16 KB)
  • Cache organization: set-associative (current: 4-way)
  • Write back vs. write through
  • Replacement policy: LRU

Remark: Only used as a backup solution
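The cache parameters above imply a particular address split, sketched below. With 128-byte blocks the offset takes 7 bits; 128 blocks in 4 ways give 32 sets, so the set index takes 5 bits; the remaining 20 bits of the 32-bit address form the tag. The bit ordering (offset in the low bits, then set index, then tag) is the conventional layout and an assumption here, as are the names `cache_addr` and `split_address`.

```c
#include <assert.h>
#include <stdint.h>

enum {
    BLOCK_SIZE = 128,                 /* bytes per block (from the slide) */
    NUM_BLOCKS = 128,                 /* total blocks, 16 KB (from the slide) */
    NUM_WAYS = 4,                     /* associativity (from the slide) */
    NUM_SETS = NUM_BLOCKS / NUM_WAYS, /* 32 sets */
    OFFSET_BITS = 7,                  /* log2(BLOCK_SIZE) */
    INDEX_BITS = 5                    /* log2(NUM_SETS) */
};

struct cache_addr {
    uint32_t tag;    /* upper 20 bits */
    uint32_t set;    /* 5-bit set index */
    uint32_t offset; /* 7-bit byte offset within the block */
};

/* Splits a 32-bit effective address into tag, set index and offset. */
struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;

    assert((1 << OFFSET_BITS) == BLOCK_SIZE);
    assert((1 << INDEX_BITS) == NUM_SETS);

    a.offset = addr & (uint32_t)(BLOCK_SIZE - 1);
    a.set = (addr >> OFFSET_BITS) & (uint32_t)(NUM_SETS - 1);
    a.tag = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}
```

For example, address `0x12345678` splits into tag `0x12345`, set 12 and offset `0x78`.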

SLIDE 17

Software Cache

An Overview

  • Smooth the heterogeneity among different memory modules
  • The SPEs can simultaneously source/sink 8 bytes per processor cycle (25.6+25.6 GB/s at 3.2 GHz)
  • 6-cycle load latency to the 256 KB local storage (LS) on an SPE
  • Byte-wise dirty bits, but adaptive
  • Cache line fills/flushes are performed via DMA transfer

[Figure: the Element Interconnect Bus connects the PPU (with its cache, $), SPU0 to SPU7 (each with its own LS), and main memory. Each software cache line holds tag & status (4 bytes), a dirty bit vector (0-16 bytes), and data (128 bytes).]
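Byte-wise dirty bits are what let two SPEs write different bytes of the same 128-byte line without clobbering each other on flush. The sketch below is a simplified model under stated assumptions: it uses a fixed 16-byte dirty vector (one bit per data byte) instead of the adaptive 0-16 byte vector in the figure, and plain stores stand in for the DMA transfer. The names `cache_line`, `line_store` and `line_flush` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128 /* bytes per cache line (from the slide) */

struct cache_line {
    uint8_t data[LINE_SIZE];
    uint8_t dirty[LINE_SIZE / 8]; /* one bit per data byte: 16 bytes */
};

/* Stores one byte into the cached copy and marks that byte dirty. */
void line_store(struct cache_line *line, unsigned offset, uint8_t value)
{
    assert(offset < LINE_SIZE);
    line->data[offset] = value;
    line->dirty[offset / 8] |= (uint8_t)(1u << (offset % 8));
}

/* Writes back only the dirty bytes to `memory` (the line's home in
 * main memory), clears the dirty bits, and returns how many bytes
 * were written. Clean bytes are left untouched, so another writer's
 * updates to them survive. */
unsigned line_flush(struct cache_line *line, uint8_t *memory)
{
    unsigned written = 0;
    unsigned i;

    for (i = 0; i < LINE_SIZE; i++) {
        if (line->dirty[i / 8] & (1u << (i % 8))) {
            memory[i] = line->data[i];
            written++;
        }
    }
    memset(line->dirty, 0, sizeof line->dirty);
    return written;
}
```

If two cached copies of the same line each modify a different byte, flushing both in either order leaves both updates in main memory, which a whole-line write-back would not guarantee.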

SLIDE 18

A Simple Runtime System

  • Why a simple runtime system?
  • Features of our simple runtime system
      • Shadow (PPU) threads and worker (SPU) threads
  • Mainly used for testing the compiler and tool-chain

SLIDE 19

A Simple Runtime System

An Overview

[Figure: PPU side and SPU side. Thread 0 serves as the master thread and creates all other threads. Legend: POSIX thread, SPU thread, communication.]

SLIDE 20

A Simple Runtime System

Threads and Communication

[Figure: a POSIX thread and an SPU thread communicating through a command buffer. Labeled messages: initial signal, command buffer request, command buffer reply, completion signal. Legend: incoming signal, outgoing signal.]

SLIDE 21

Status Summary

  • Code partition between SPU and PPU
      • Single source compilation
      • Outline parallel sections for SPU
  • Explicit data movement between main memory and SPU
      • Software cache
      • Double buffering
  • Code overlay to support large programs
      • Code partition support by the tool-chain
      • Object code format changes
      • Partition manager: decides when to load a new partition
  • OpenMP runtime
      • PPU and SPU work together

SLIDE 22

Outline

  • Background
  • Why GNU OpenMP on CELL?
  • Project Description
  • Preliminary Results
  • Future Perspectives

SLIDE 23

Experimental Framework

  • Hardware: PS3*
  • O.S.: Yellow Dog Linux v5.0
  • Software
      • Tool-chain (modified components): spu-gcc v4.2.0, spu-as v2.16.1, spu-ld v2.16.1
      • Extra libraries: software cache, partition manager

*PS3 is a trademark of Sony Corporation

SLIDE 24

Benchmarks

Benchmark Name                Description
huff, huff2                   Huffman decoding from MPEG2
idct, idct_2                  IDCT and IQuantization from MPEG2
resize, resize_2              YUV file resizing algorithm
alphablend                    Combining a translucent foreground color with a background (stream) file
convert                       YUV2RGB: convert YUV file to raw stream file
prgb2gm                       Convert RGB file into BMP file
gzip                          SPEC compression utility
OpenMP Validation Suite V1.0  OpenMP test cases from University of Houston

SLIDE 25

Preliminary Experimental Results

  • Pass preliminary tests for all benchmarks
  • The automatic code overlay works and provides important performance gains for different applications
      • Modulus is better when the code / partitions have no re-use
      • LRU is better when the code / partitions have re-use
      • Degradation of 8% in the worst case
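The contrast between the two replacement policies can be sketched with a small load-count simulation. It rests on several assumptions not stated in the slides: "modulus" is read as direct placement of partition p in buffer slot p mod NUM_SLOTS, the buffer has two slots, and only partition loads are counted (the per-call bookkeeping cost of LRU is not modeled). The names `loads_modulus` and `loads_lru` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 2 /* hypothetical number of overlay buffer slots */
#define EMPTY (-1)  /* marks a slot that holds no partition yet */

/* Modulus placement (assumed meaning): partition p always lives in
 * slot p % NUM_SLOTS. Returns the number of partition loads. */
size_t loads_modulus(const int *trace, size_t len)
{
    int slot[NUM_SLOTS];
    size_t loads = 0;
    size_t i;
    int s;

    for (s = 0; s < NUM_SLOTS; s++)
        slot[s] = EMPTY;
    for (i = 0; i < len; i++) {
        assert(trace[i] >= 0);
        s = trace[i] % NUM_SLOTS;
        if (slot[s] != trace[i]) {
            slot[s] = trace[i];
            loads++;
        }
    }
    return loads;
}

/* LRU placement: a partition may occupy any slot; on a miss the least
 * recently used slot is replaced. Returns the number of loads. */
size_t loads_lru(const int *trace, size_t len)
{
    int slot[NUM_SLOTS];
    size_t last_use[NUM_SLOTS]; /* 0 means never used */
    size_t loads = 0;
    size_t i;
    int s;

    for (s = 0; s < NUM_SLOTS; s++) {
        slot[s] = EMPTY;
        last_use[s] = 0;
    }
    for (i = 0; i < len; i++) {
        int hit = -1;
        int victim = 0;

        assert(trace[i] >= 0);
        for (s = 0; s < NUM_SLOTS; s++) {
            if (slot[s] == trace[i])
                hit = s;
            if (last_use[s] < last_use[victim])
                victim = s;
        }
        if (hit < 0) {
            slot[victim] = trace[i];
            loads++;
            hit = victim;
        }
        last_use[hit] = i + 1; /* timestamps start at 1 */
    }
    return loads;
}
```

On a trace with no re-use, such as partitions 0, 1, 2, 3, 4, 5, both policies load six times, so the cheaper lookup of modulus placement has nothing to lose. On a trace that alternates between partitions 0 and 2, both map to slot 0 under modulus and every call reloads, while LRU keeps both resident after two loads.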

SLIDE 26

Outline

  • Background
  • Project and Problem Formulation
  • Status Report
  • Results
  • Related Work
  • Future Perspectives

SLIDE 27

Related Work

  • Manzano et al., IWOMP 2007
  • O'Brien et al., IWOMP 2007
  • IBM Research, “Compiler and Runtime Support for Code Partitioning.” Cell Broadband Engine Programming Handbook. Version 1.0. 2006. Pages 616, 617.
  • Software model for Cyclops-64 chips/system
  • Many others (including work presented in this workshop)

SLIDE 28

Summary and Future Work

  • A preliminary GNU OpenMP on CELL is making good progress
  • Release plan
      • Alpha release is being planned
      • Beta release partner(s): please contact us
  • Compiler optimization and infrastructure

SLIDE 29

Acknowledgement

  • The CELL team at UD and ETI
  • Other members of CAPSL
  • IBM CELL team (especially Peter Hofstee, Michael Gschwind, Kevin O'Brien, etc.)
  • Other collaborators
  • Our hosts

Other collaborators Our hosts