Programming of hierarchic array processors: The physical layer - PowerPoint PPT Presentation



SLIDE 1

Programming of hierarchic array processors: The physical layer

February 18, 2013

Many core

SLIDE 2

Problems of many core

• Cores need to communicate with the memory
• Memory bandwidth needs to be shared between cores in an optimal way
• Cores need to communicate and synchronize with each other
• The solutions to these problems give rise to different many-core architectures

SLIDE 3

Why architecture matters?

General parameters:
• FLOPS: Floating-point Operations Per Second
• MIPS: Million Instructions Per Second
• Memory bandwidth (GB/s)
• Chip area
• Power consumption
General parameters cannot give real information about how fast an algorithm can run on the architecture.

SLIDE 4

Why architecture matters?

General parameters:
• FLOPS: Floating-point Operations Per Second
• MIPS: Million Instructions Per Second
• Memory bandwidth (GB/s)
• Chip area
• Power consumption
General parameters cannot give real information about how fast an algorithm can run on the architecture.
• The more general an architecture, the more inefficient it tends to be

SLIDE 5

Why architecture matters?

General parameters:
• FLOPS: Floating-point Operations Per Second
• MIPS: Million Instructions Per Second
• Memory bandwidth (GB/s)
• Chip area
• Power consumption
General parameters cannot give real information about how fast an algorithm can run on the architecture.
• The more general an architecture, the more inefficient it tends to be
• A strictly serial (and irreducible) algorithm on a parallel architecture can be orders of magnitude slower!

SLIDE 6

Why architecture matters?

• The memory access pattern has a huge impact on performance
• Some architectures have a very strict preferred access pattern
• Thread communication is limited, and in many cases it can be a bottleneck

SLIDE 7

Architectures

Turing completeness
Every architecture should be Turing complete (in a sense) to be useful for general-purpose processing. This means it can emulate any other Turing machine (or any other machine) and run any algorithm. In practice this is not enough, because it tells us nothing about how fast the architecture can run an algorithm.

Maximal algorithmic speed
Here we are interested in the maximal achievable speed of a given algorithm on a given architecture. We usually allow any modification of the algorithm as long as the outputs match for all inputs. Computing this speed exactly is mathematically impossible, so in practice we have to estimate it. Knowing the fine details of the architecture is essential for any meaningful estimate.

SLIDE 8

Programmed machines

• We use a Universal Turing Machine, so we can store the program on the Tape
• We allow the machine to read from anywhere on the Tape, i.e. we address the Tape
• We separate reading data from the Tape from reading instructions
• What we get is effectively a Programmed Machine, where the Tape is the memory and the program is stored as instructions

SLIDE 9

Data and Instructions

Data
Data is the information we want to process, or have processed. We store data in “registers” right before and after processing. A “register” is the fastest type of memory, nearest to the processing units (ALU, FPU). It is usually the most limited and least flexible, but by far the fastest, memory resource we have.

Instructions
Instructions define what the machine should do with the Data it was given. The Instruction Set is a very descriptive feature of an architecture.
• Data-flow instructions: only process Data; executed on Processing Units (FPU, ALU)
• Control-flow instructions: change the flow of control inside the program (branches and loops); executed on Control Units
• Memory access instructions: read and/or write various kinds of memory

SLIDE 10

Memory hierarchy

Memory
Most practical architectures employ multiple levels of memory to mitigate the weak points of the physical realizations. The general idea is to combine small, fast memories with bigger but slower ones, where the fast memory is generally orders of magnitude more expensive than the slow one.
• Registers: innermost memory; fastest, but a very limited resource; runs at core clock speed, or double speed
• Local memory: addressable but small memory, only accessible from nearby cores
• L1 cache: most architectures have it; an associative memory, used with a cache coherence algorithm to simulate global access
• L2 cache: bigger and slower than L1 cache
• System RAM: usually Dynamic Random Access Memory (DRAM)
• System storage: slowest and biggest memory

SLIDE 11

DRAM is not truly random access

• Dynamic RAM cells are slow to read, and reading is destructive
• Lines in the dynamic capacitor array need to be prepared for a read operation
• DRAMs can only achieve high bandwidth by reading/writing big parts of the array at the same time
• Knowing where the next access will happen is also helpful
• These requirements mean that we can only access DRAM efficiently in big chunks (bursts)
• Architectures usually map these chunks into their memory address space in a predefined way
• Any efficient access pattern must follow this mapping!
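The burst arithmetic can be put in numbers with a rough sketch. The 64-byte burst size and 4-byte element width below are illustrative assumptions, not values from the slides; the point is only that a strided walk touches far more bursts than a contiguous one for the same element count:

```python
# Count how many distinct DRAM bursts an access pattern touches.
# Assumed (illustrative): 64-byte bursts, 4-byte array elements.
BURST_BYTES = 64
ELEM_BYTES = 4

def bursts_touched(indices):
    """Number of distinct bursts hit by the given element indices."""
    return len({(i * ELEM_BYTES) // BURST_BYTES for i in indices})

n = 256
sequential = range(n)            # contiguous walk: follows the mapping
strided = range(0, n * 16, 16)   # same element count, stride of 16

print(bursts_touched(sequential))  # 16 bursts (256 * 4 / 64)
print(bursts_touched(strided))     # 256 bursts, one per element
```

Under these assumptions the strided pattern forces sixteen times as much burst traffic for the same useful data, which is why the mapping must be followed.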

SLIDE 12

Memory architectures

• DRAM: dynamic RAM; bits are stored in capacitors; reading is slow and destructive; must be refreshed
• SRAM: static RAM; bits are stored in bistable circuits; access is fast, but SRAMs are expensive
• Multi-port: SRAMs can be multi-ported, meaning they have more than one port through which they can be read/written in parallel
• Associative: caches are usually associative, fine-tuned to the cache coherence algorithm; the granularity of the associativity is very important. GPUs have texture caches, where the granularity is 2D tiles of a texture.

SLIDE 13

Fetch-Decode-Execute cycle

Fetch
Fetch the instruction from memory into the instruction register. It is usually cached by the instruction cache. This step is not always trivial, because instructions can have different sizes.

Decode
Often a simple lookup in a table that contains the breakdown of the instruction into parts for the various processing units (register I/O, FPU, ALU, memory I/O). For x86 this stage is very complex, because it needs to compile the x86 instruction into the internal RISC instruction set. For GPUs this stage is partially missing: the instruction is directly wired to the processing units with very little logic between.

SLIDE 14

Fetch-Decode-Execute cycle

Execute
Actually doing what the instruction instructs the core to do. In many cases it is preferred that this step takes only a single clock cycle. It can usually be broken down into:
• Read registers
• Read memory
• Do processing
• Write memory and/or registers
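The whole fetch-decode-execute cycle can be sketched as a toy interpreter. The opcodes, register names, and program below are invented for illustration; real instruction sets are vastly richer, but the loop structure is the same:

```python
# Toy fetch-decode-execute loop over an invented instruction set.
def run(program):
    regs = {"r0": 0, "r1": 0}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]          # fetch the instruction
        pc += 1
        if op == "set":                  # decode + execute: dispatch on opcode
            regs[args[0]] = args[1]
        elif op == "add":                # data-flow: register + register
            regs[args[0]] += regs[args[1]]
        elif op == "addi":               # data-flow: register + immediate
            regs[args[0]] += args[1]
        elif op == "jnz":                # control-flow: branch if nonzero
            if regs[args[0]] != 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs

prog = [
    ("set", "r0", 0),     # accumulator
    ("set", "r1", 3),     # loop counter
    ("add", "r0", "r1"),  # r0 += r1
    ("addi", "r1", -1),   # r1 -= 1
    ("jnz", "r1", 2),     # loop back while r1 != 0
]
print(run(prog))  # {'r0': 6, 'r1': 0}  (sums 3 + 2 + 1)
```

Note how the three instruction classes from the previous slide show up: `add`/`addi` are data-flow, `jnz` is control-flow, and the register file stands in for memory access.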

SLIDE 15

Pipelining

• Most of the time the steps of the FDE cycle cannot be done in a single cycle; sometimes they take more than 10 cycles
• The basic idea of pipelining is that the operations are done on different units, or different parts of units, so they can run in parallel
• Everything happens at the same time, but on different data; the data shifts between pipeline stages
• A fully pipelined logic does any operation effectively in a single cycle, as long as it has enough operations/data to fill it
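The last point can be put in numbers with a toy model: a fully pipelined unit of depth D needs D cycles to fill, then delivers one result per cycle, so n independent operations take D + n - 1 cycles. The depth of 10 below is illustrative:

```python
# Toy model of pipelined throughput: fill latency plus one result per cycle.
def pipelined_cycles(depth, n_ops):
    return depth + n_ops - 1 if n_ops else 0

DEPTH = 10  # illustrative pipeline depth

print(pipelined_cycles(DEPTH, 1))     # 10 cycles for a lone operation
print(pipelined_cycles(DEPTH, 1000))  # 1009 cycles for 1000 operations
print(1000 / pipelined_cycles(DEPTH, 1000))  # ~0.99 ops/cycle once full
```

With enough work in flight the fill cost is amortized away, which is exactly the "effectively a single cycle" claim above.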

SLIDE 16

Pipelining and scale-down

Scale-down means smaller wires and transistors:
• it decreases the speed of wires, but increases the speed of transistors
• wires are much slower than transistors at the moment
• signal speed strictly limits the distance reachable within a single cycle
• combined with increasing clock speeds, the amount of logic reachable within a single cycle decreases
• these factors force us to use longer and longer pipelines to fully utilize the processing power

SLIDE 17

Pipelining: the smart solution

• x86 does a HUGE amount of processing to figure out the intentions and future of the code it is running
• Loops are tracked, and statistics are collected about every loop and branch
• Knowing the future helps fill the pipeline with instructions yet to be executed
• It also reorders instructions on the fly, based on their data-flow graph, to run them in parallel on multiple ALUs/FPUs
• It hides the parallelism inside its core architecture
• The algorithms that achieve this are among the best-protected trade secrets of x86 technology
• The programmer has almost zero worries about pipelining

SLIDE 18

Pipelining: the dumb solution

• GPUs use a radically different approach from x86 architectures
• Pipelines are filled in the most primitive way: by scheduling different threads into different stages of the pipeline
• This means the minimum number of threads needed to fully utilize the architecture is the number of cores multiplied by the pipeline depth
• It is easy to ensure a good pipeline fill this way, if we have enough threads
• Memory operations can hold up threads, but if we use a few more threads, we can feed the pipeline while some threads are waiting for the memory
• Global memory operations are either implicitly or explicitly implemented by messages, where we have a choice of when to wait for completion. This way, even inside a single thread, memory access and computation can proceed in parallel
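The thread-count rule above is simple enough to sketch as a back-of-the-envelope calculation. The core count, pipeline depth, and slack factor below are illustrative, not figures for any particular GPU:

```python
# Minimum threads to keep every pipeline stage of every core busy:
# cores * pipeline depth, as the slide states. A slack factor > 1
# adds extra threads to cover memory-stall latency.
def min_threads(cores, pipeline_depth, latency_slack=1.0):
    return int(cores * pipeline_depth * latency_slack)

print(min_threads(32, 20))        # 640 threads just to fill the pipelines
print(min_threads(32, 20, 1.5))   # 960 threads with headroom for memory waits
```

This is why GPU programming models push for thousands of threads even on modest hardware: anything less leaves pipeline stages empty.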

SLIDE 19

Many cores: Topology vs. hierarchy

Topological many cores
• Memory and inter-core communication are organized into a topology
• Very good fit for processing data with the given topology: usually very computation efficient, with potentially high memory bandwidth
• Very poor fit for different or non-topological computations
• Examples: CNN, systolic arrays

SLIDE 20

Many cores: Topology vs. hierarchy

Hierarchical many cores
• Memory and inter-core communication are organized into a hierarchy
• Good, but not perfect, fit for a wide range of topologies
• Synchronization between cores can sometimes be difficult to implement efficiently
• Practical for general-purpose many-core computing
• Examples: GPU, Cell BE, multicore x86

SLIDE 21

Caches

• Coherency is used to simulate global memory access
• Caches are opaque memory levels

False sharing
A typical failure of vertical cache coherency between cores: a memory address is frequently written on one core while a different address is accessed on another, but both fall into the same L1 cache line, so the coherency algorithm treats them as the same address. This generates an unacceptable amount of traffic between cores (caches). In cases of heavy false sharing the algorithm is usually much slower than in the single-core case.
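The addressing arithmetic behind false sharing is just integer division by the line size. A minimal sketch, assuming a hypothetical 64-byte cache line and illustrative byte offsets for two per-core counters:

```python
# Two addresses conflict (false sharing) when they map to the same
# cache line. Assumed (illustrative): 64-byte cache lines.
LINE_BYTES = 64

def same_line(addr_a, addr_b):
    return addr_a // LINE_BYTES == addr_b // LINE_BYTES

# Two 8-byte counters packed back to back: the line ping-pongs
# between cores on every write.
print(same_line(0, 8))    # True  -> false sharing

# The same counters padded to separate lines: each core keeps its line.
print(same_line(0, 64))   # False -> no false sharing
```

This is why per-thread data is commonly padded to a full cache line: it wastes a little memory to avoid the coherency traffic.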

SLIDE 22

Caches

Access prediction
For many small memory accesses, caches only work well if they can predict the next accessed memory well enough.

Cache is opaque
When caches don’t work well, it is very difficult to work around the problem, because on x86 the running code has no explicit access to the cache algorithm. On GPUs there is limited control of the caches. Opaqueness is otherwise a good thing, because neither the programmer nor the compiler needs to take heed of the inner workings of the cache.

SLIDE 23

VLIW

• Very Long Instruction Word
• It usually means that more than one instruction can be compounded to be executed at the same time
• Different sub-instructions execute on different Processing Units
• Can have internal reduction support for implementing complex instructions built from different sub-instructions
• Used in older AMD (ATI Radeon) GPUs and Intel Itanium (IA-64)

SLIDE 24

SIMD

SIMD: Single Instruction Multiple Data. We execute the same instruction on multiple data at the same time. It generally means that more than one Processing Unit is connected to the same Control Unit to execute the same instruction.

SLIDE 25

SIMD

• SIMD architecture is optimal for vector processing
• Smarter variants of SIMD can handle “splitting” the control flow, which enables them to run non-vector processing jobs
• Used in GPUs, SSE/MMX for x86, and Cell BE
• GPU SIMDs have a physical length of 16-32 and a logical length of 32-64 elements
• GPUs map a thread to an element of the SIMD, emulating a non-SIMD architecture

SLIDE 26

SIMD: Why so (in)efficient

• SIMD naturally uses vector reads/writes, which are burst memory operations: a very good fit for DRAMs
• It only fetches a single instruction for performing 32-64 operations, which means the time and bandwidth spent reading instructions are negligible; unrolling loops is always a good idea
• Non-vector operations are inefficient; in particular, emulating non-vector operations can be very suboptimal

SLIDE 27

Branches on modern SIMD

Modern SIMD can deal with branches in two ways:

1. Execute all possible paths in series, but throw away the results of the lanes that did not take the path

2. Reschedule the threads executing different paths on the fly, and walk them until they reach a convergence point

The first way is usually done by the compiler, but it needs the code to be trivially reducible to loops and branches. The second way can deal with arbitrary branching code, but is much less efficient in practice. Nvidia GPUs mostly employ the second way, while AMD GPUs favor the first. For simple computations in branches, the first one should be favored!
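The first strategy (predicated execution) can be sketched in scalar Python: every lane computes both branch paths, then a per-lane mask selects which result to keep. The function name and data are illustrative, not any real intrinsic:

```python
# Predicated SIMD branching: compute both paths for all lanes,
# then select per lane with a mask (illustrative sketch).
def simd_select(mask, then_vals, else_vals):
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

x = [1, -2, 3, -4]              # one value per SIMD lane
mask = [v > 0 for v in x]       # per-lane branch predicate
then_path = [v * 10 for v in x] # all lanes compute the "then" result...
else_path = [-v for v in x]     # ...and the "else" result

print(simd_select(mask, then_path, else_path))  # [10, 2, 30, 4]
```

Both paths cost full execution time regardless of the mask, which is why this approach only pays off when the branch bodies are cheap, as the slide concludes.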

SLIDE 28

Memory transfers on SIMD

• SIMD likes vector memory transfers best
• This means we should coalesce memory transfers as best we can
• For the threaded approach, it implies that the memory address should depend linearly on the thread-id
• Modern GPUs employ very advanced auto-coalescing caches, where most of these constraints can be relaxed
• Generally speaking, the memory read by a thread should be close in address to that read by other threads which are close in thread-id
• On GPUs we also have options to choose which caches we would like to use for our memory transfers
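A rough sketch of why the linear thread-id mapping matters, assuming a hypothetical 128-byte memory transaction and 4-byte elements (both are illustrative assumptions, not figures for a specific GPU):

```python
# Count memory transactions needed to serve one 32-thread group.
# Assumed (illustrative): 128-byte transactions, 4-byte elements.
TXN_BYTES = 128
ELEM_BYTES = 4

def transactions(addresses):
    """Number of distinct transactions covering the given byte addresses."""
    return len({a // TXN_BYTES for a in addresses})

warp = range(32)
coalesced = [tid * ELEM_BYTES for tid in warp]       # address linear in thread-id
strided = [tid * ELEM_BYTES * 32 for tid in warp]    # 128-byte stride per thread

print(transactions(coalesced))  # 1 transaction serves the whole group
print(transactions(strided))    # 32 transactions, one per thread
```

Under these assumptions the coalesced pattern needs a single transaction where the strided one needs thirty-two, which is the whole argument for keeping addresses linear in the thread-id.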
