GPGPU Programming in Haskell with Accelerate Trevor L. McDonell - PowerPoint PPT Presentation

Dot product four ways • CUDA (parallel): - Step 2: vector reduction … is somewhat complex struct ¡SharedMemory ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡2]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡switch ¡(threads) ¡ ¡ ¡ ¡__device__ ¡inline ¡operator ¡float ¡*() ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡2) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡512: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡extern ¡__shared__ ¡int ¡__smem[]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡1]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<512, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡return ¡(float ¡*)__smem; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡256: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<256, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡__device__ ¡inline ¡operator ¡const ¡float ¡*() ¡const ¡ ¡ ¡ ¡if ¡(tid ¡== ¡0) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡g_odata[blockIdx.x] ¡= ¡sdata[0]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡128: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡extern ¡__shared__ ¡int ¡__smem[]; } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<128, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡return ¡(float ¡*)__smem; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡} void ¡getNumBlocksAndThreads(int ¡n, ¡int ¡maxBlocks, ¡int ¡maxThreads, ¡int ¡&blocks, ¡int ¡&threads) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡64: }; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡64, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡cudaDeviceProp ¡prop; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; template ¡<unsigned ¡int ¡blockSize, ¡bool ¡nIsPow2> ¡ ¡ ¡ ¡int ¡device; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡32: __global__ ¡void ¡ ¡ ¡ ¡checkCudaErrors(cudaGetDevice(&device)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡32, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); reduce_kernel(float ¡*g_idata, ¡float ¡*g_odata, ¡unsigned ¡int ¡n) ¡ ¡ ¡ ¡checkCudaErrors(cudaGetDeviceProperties(&prop, ¡device)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡16: ¡ ¡ ¡ ¡float ¡*sdata ¡= ¡SharedMemory(); ¡ ¡ ¡ ¡threads ¡= ¡(n ¡< ¡maxThreads*2) ¡? ¡nextPow2((n ¡+ ¡1)/ ¡2) ¡: ¡maxThreads; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡16, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡blocks ¡ ¡= ¡(n ¡+ ¡(threads ¡* ¡2 ¡-‑ ¡1)) ¡/ ¡(threads ¡* ¡2); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡unsigned ¡int ¡tid ¡ ¡ ¡ ¡ ¡ ¡= ¡threadIdx.x; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡8: ¡ ¡ ¡ ¡unsigned ¡int ¡i ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡blockIdx.x*blockSize*2 ¡+ ¡threadIdx.x; ¡ ¡ ¡ ¡if ¡(blocks ¡> ¡prop.maxGridSize[0]) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡8, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡unsigned ¡int ¡gridSize ¡= ¡blockSize*2*gridDim.x; ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡blocks ¡ ¡/= ¡2; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡4: ¡ ¡ ¡ ¡float ¡sum ¡= ¡0; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡threads ¡*= ¡2; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡4, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡while ¡(i ¡< ¡n) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡2: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sum ¡+= ¡g_idata[i]; ¡ ¡ ¡ ¡blocks ¡= ¡min(maxBlocks, ¡blocks); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡2, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(nIsPow2 ¡|| ¡i ¡+ ¡blockSize ¡< ¡n) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡1: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sum ¡+= ¡g_idata[i+blockSize]; float ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡1, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); reduce(int ¡n, ¡float ¡*d_idata, ¡float ¡*d_odata) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡i ¡+= ¡gridSize; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡threads ¡ ¡ ¡ ¡= ¡0; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡blocks ¡ ¡ ¡ ¡ ¡= ¡0; ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum; ¡ ¡ ¡ ¡int ¡maxThreads ¡= ¡256; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡size ¡= ¡(size ¡+ ¡(threads*2-‑1)) ¡/ ¡(threads*2); ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡int ¡maxBlocks ¡ ¡= ¡64; ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡size ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡n ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡512) ¡{ ¡ ¡ ¡ ¡float ¡sum; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡256) ¡{ ¡ ¡ ¡ ¡while ¡(size ¡> ¡1) ¡{ ¡ ¡ ¡ ¡checkCudaErrors(cudaMemcpy(&sum, ¡d_odata, ¡sizeof(float), ¡cudaMemcpyDeviceToHost)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡256]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡getNumBlocksAndThreads(size, ¡maxBlocks, ¡maxThreads, ¡blocks, ¡threads); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡int ¡smemSize ¡= ¡(threads ¡<= ¡32) ¡? ¡2 ¡* ¡threads ¡* ¡sizeof(float) ¡: ¡threads ¡* ¡sizeof(float); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1, ¡1); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡dim3 ¡dimGrid(blocks, ¡1, ¡1); ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡256) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(isPow2(size)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡128) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡128]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡switch ¡(threads) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡512: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<512, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡256: ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡128) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<256, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡ ¡64) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡ ¡64]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡128: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<128, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡64: ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡64, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡if ¡(tid ¡< ¡32) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡32: ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡32, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡volatile ¡float ¡*smem ¡= ¡sdata; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡16: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡64) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡16, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡32]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡8: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡8, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡32) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡16]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡4: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡4, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡16) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡2: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡8]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡2, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡1: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡8) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡1, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡4]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡4) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡else Friday, 17 May 13

Dot product four ways • CUDA (parallel): - Step 2: vector reduction … is somewhat complex struct ¡SharedMemory ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡2]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡switch ¡(threads) ¡ ¡ ¡ ¡__device__ ¡inline ¡operator ¡float ¡*() ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡2) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡512: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡extern ¡__shared__ ¡int ¡__smem[]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡1]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<512, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡return ¡(float ¡*)__smem; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡256: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<256, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡__device__ ¡inline ¡operator ¡const ¡float ¡*() ¡const ¡ ¡ ¡ ¡if ¡(tid ¡== ¡0) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡g_odata[blockIdx.x] ¡= ¡sdata[0]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡128: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡extern ¡__shared__ ¡int ¡__smem[]; } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<128, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡return ¡(float ¡*)__smem; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡} void ¡getNumBlocksAndThreads(int ¡n, ¡int ¡maxBlocks, ¡int ¡maxThreads, ¡int ¡&blocks, ¡int ¡&threads) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡64: }; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡64, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡cudaDeviceProp ¡prop; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; template ¡<unsigned ¡int ¡blockSize, ¡bool ¡nIsPow2> ¡ ¡ ¡ ¡int ¡device; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡32: __global__ ¡void ¡ ¡ ¡ ¡checkCudaErrors(cudaGetDevice(&device)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡32, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); reduce_kernel(float ¡*g_idata, ¡float ¡*g_odata, ¡unsigned ¡int ¡n) ¡ ¡ ¡ ¡checkCudaErrors(cudaGetDeviceProperties(&prop, ¡device)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡16: ¡ ¡ ¡ ¡float ¡*sdata ¡= ¡SharedMemory(); ¡ ¡ ¡ ¡threads ¡= ¡(n ¡< ¡maxThreads*2) ¡? ¡nextPow2((n ¡+ ¡1)/ ¡2) ¡: ¡maxThreads; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡16, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡blocks ¡ ¡= ¡(n ¡+ ¡(threads ¡* ¡2 ¡-‑ ¡1)) ¡/ ¡(threads ¡* ¡2); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡unsigned ¡int ¡tid ¡ ¡ ¡ ¡ ¡ ¡= ¡threadIdx.x; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡8: ¡ ¡ ¡ ¡unsigned ¡int ¡i ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡blockIdx.x*blockSize*2 ¡+ ¡threadIdx.x; ¡ ¡ ¡ ¡if ¡(blocks ¡> ¡prop.maxGridSize[0]) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡8, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡unsigned ¡int ¡gridSize ¡= ¡blockSize*2*gridDim.x; ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡blocks ¡ ¡/= ¡2; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡4: ¡ ¡ ¡ ¡float ¡sum ¡= ¡0; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡threads ¡*= ¡2; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡4, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡while ¡(i ¡< ¡n) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡2: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sum ¡+= ¡g_idata[i]; ¡ ¡ ¡ ¡blocks ¡= ¡min(maxBlocks, ¡blocks); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡2, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(nIsPow2 ¡|| ¡i ¡+ ¡blockSize ¡< ¡n) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡1: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sum ¡+= ¡g_idata[i+blockSize]; float ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡1, ¡false><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); reduce(int ¡n, ¡float ¡*d_idata, ¡float ¡*d_odata) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡i ¡+= ¡gridSize; { ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡threads ¡ ¡ ¡ ¡= ¡0; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡blocks ¡ ¡ ¡ ¡ ¡= ¡0; ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum; ¡ ¡ ¡ ¡int ¡maxThreads ¡= ¡256; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡size ¡= ¡(size ¡+ ¡(threads*2-‑1)) ¡/ ¡(threads*2); ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡int ¡maxBlocks ¡ ¡= ¡64; ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡int ¡size ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡n ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡512) ¡{ ¡ ¡ ¡ ¡float ¡sum; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡256) ¡{ ¡ ¡ ¡ ¡while ¡(size ¡> ¡1) ¡{ ¡ ¡ ¡ ¡checkCudaErrors(cudaMemcpy(&sum, ¡d_odata, ¡sizeof(float), ¡cudaMemcpyDeviceToHost)); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡256]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡getNumBlocksAndThreads(size, ¡maxBlocks, ¡maxThreads, ¡blocks, ¡threads); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} } ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡int ¡smemSize ¡= ¡(threads ¡<= ¡32) ¡? ¡2 ¡* ¡threads ¡* ¡sizeof(float) ¡: ¡threads ¡* ¡sizeof(float); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1, ¡1); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡dim3 ¡dimGrid(blocks, ¡1, ¡1); ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡256) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(isPow2(size)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡128) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡128]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡switch ¡(threads) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡512: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<512, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; o_O ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡256: ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡128) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<256, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(tid ¡< ¡ ¡64) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡sdata[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡sdata[tid ¡+ ¡ ¡64]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡128: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel<128, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡__syncthreads(); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡64: ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡64, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡if ¡(tid ¡< ¡32) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡32: ¡ ¡ ¡ ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡32, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡volatile ¡float ¡*smem ¡= ¡sdata; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡16: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡64) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡16, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡32]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡8: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡8, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡32) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡16]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡4: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡4, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡16) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡2: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡8]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡2, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡case ¡ ¡1: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡8) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡reduce_kernel< ¡ ¡1, ¡true><<< ¡dimGrid, ¡dimBlock, ¡smemSize ¡>>>(d_idata, ¡d_odata, ¡size); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡smem[tid] ¡= ¡sum ¡= ¡sum ¡+ ¡smem[tid ¡+ ¡ ¡4]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡break; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if ¡(blockSize ¡>= ¡ ¡ ¡4) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡else Friday, 17 May 13

Dot product four ways • Accelerate (parallel): dotp ¡:: ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Scalar ¡Float) dotp ¡xs ¡ys ¡= ¡fold ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) Friday, 17 May 13

Dot product four ways • Accelerate (parallel): dotp ¡:: ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Scalar ¡Float) dotp ¡xs ¡ys ¡= ¡fold ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) • Recall the sequential Haskell version: dotp ¡:: ¡[Float] ¡-‑> ¡[Float] ¡-‑> ¡Float dotp ¡xs ¡ys ¡= ¡foldl ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) Friday, 17 May 13

Dot product four ways • Accelerate (parallel): dotp ¡:: ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Scalar ¡Float) dotp ¡xs ¡ys ¡= ¡fold ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) • Recall the sequential Haskell version: dotp ¡:: ¡[Float] ¡-‑> ¡[Float] ¡-‑> ¡Float dotp ¡xs ¡ys ¡= ¡foldl ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) left-to-right traversal Friday, 17 May 13

Dot product four ways • Accelerate (parallel): dotp ¡:: ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Scalar ¡Float) dotp ¡xs ¡ys ¡= ¡fold ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) neither left nor right: happens in parallel (tree-like) • Recall the sequential Haskell version: dotp ¡:: ¡[Float] ¡-‑> ¡[Float] ¡-‑> ¡Float dotp ¡xs ¡ys ¡= ¡foldl ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) left-to-right traversal Friday, 17 May 13

Dot product four ways • Accelerate (parallel): dotp ¡:: ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Vector ¡Float) ¡-‑> ¡Acc ¡(Scalar ¡Float) dotp ¡xs ¡ys ¡= ¡fold ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) neither left nor right: happens in parallel (tree-like) • Recall the sequential Haskell version: dotp ¡:: ¡[Float] ¡-‑> ¡[Float] ¡-‑> ¡Float dotp ¡xs ¡ys ¡= ¡foldl ¡(+) ¡0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡( ¡zipWith ¡(*) ¡xs ¡ys ¡) left-to-right traversal But… how does it perform? Friday, 17 May 13

Dot product four ways Dot product 100 10 Run Time (ms) 1 sequential Accelerate 0.1 CUBLAS 2 4 6 8 10 12 14 16 18 20 Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz) Elements (millions) Friday, 17 May 13

Dot product four ways Dot product 100 10 Run Time (ms) 1 1.2x sequential Accelerate 0.1 CUBLAS 2 4 6 8 10 12 14 16 18 20 Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz) Elements (millions) Friday, 17 May 13

Dot product four ways Dot product 100 30x 10 Run Time (ms) 1 1.2x sequential Accelerate 0.1 CUBLAS 2 4 6 8 10 12 14 16 18 20 Tesla T10 (240 cores @ 1.3 GHz) vs. Xenon E5405 (2GHz) Elements (millions) Friday, 17 May 13

Friday, 17 May 13

Mandelbrot fractal Friday, 17 May 13

Mandelbrot fractal n-body gravitational simulation Friday, 17 May 13

Mandelbrot fractal n-body gravitational simulation Canny edge detection Friday, 17 May 13

Mandelbrot fractal n-body gravitational simulation SmoothLife cellular automata Canny edge detection Friday, 17 May 13

Mandelbrot fractal n-body gravitational simulation stable fluid flow SmoothLife cellular automata Canny edge detection Friday, 17 May 13

Mandelbrot fractal n-body gravitational simulation stable fluid flow ... d6b821d937a4170b3c4f8ad93495575d: ¡saitek1 d0e52829bf7962ee0aa90550ffdcccaa: ¡laura1230 494a8204b800c41b2da763f9bbbcc462: ¡lina03 d8ff07c52a95b30800809758f84ce28c: ¡Jenny10 e81bed02faa9892f8360c705241191ae: ¡carmen89 46f7d75718029de99dd81fd907034bc9: ¡mellon22 0dd3c176cf34486ec00b526b6920b782: ¡helena04 9351c4bc8c8ba17b58d5a6a1f839f356: ¡85548554 9c36c5599f40d08f874559ac824d091a: ¡585123456 4b4dce6c91b429e8360aa65f97342e90: ¡5678go 3aa561d4c17d9d58443fc15d10cc86ae: ¡momo55 Recovered ¡150/1000 ¡(15.00 ¡%) ¡digests ¡in ¡59.45 ¡s, ¡185.03 ¡MHash/sec Password “recovery” (MD5 dictionary attack) SmoothLife cellular automata Canny edge detection Friday, 17 May 13

Accelerate • Accelerate is a Domain-Specific Language for GPU programming Copy result back to Haskell Haskell/Accelerate program Compile with NVIDIA’s compiler & load onto the GPU Transform Accelerate program into CUDA program e o d c A D U C Friday, 17 May 13

Accelerate • Accelerate is a Domain-Specific Language for GPU programming - This process may happen several times during program execution - Code and data fragments get cached and reused • An Accelerate program is a Haskell program that generates a CUDA program - However, in many respects this still looks like a Haskell program - Shares various concepts with Repa , a Haskell array library for CPUs Friday, 17 May 13

Accelerate • To execute an Accelerate computation (on the GPU): run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a - run comes from whichever backend we have chosen (CUDA) Friday, 17 May 13

Accelerate • To execute an Accelerate computation (on the GPU): run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a - run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array , or tuple thereof Friday, 17 May 13

Accelerate • To execute an Accelerate computation (on the GPU): run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a - run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array , or tuple thereof • What is Acc ? Friday, 17 May 13

Accelerate • To execute an Accelerate computation (on the GPU): run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a - run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array , or tuple thereof • What is Acc ? - This is our DSL type - A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’ Friday, 17 May 13

Accelerate • To execute an Accelerate computation (on the GPU): run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a - run comes from whichever backend we have chosen (CUDA) - Arrays constrains the result to be an Array , or tuple thereof • What is Acc ? - This is our DSL type - A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’ Accelerate is a library of collective operations over arrays of type Acc ¡a Friday, 17 May 13

Accelerate run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a • Accelerate computations take place on arrays - Parallelism is introduced in the form of collective operations over arrays Accelerate Arrays in Arrays out computation Friday, 17 May 13

Accelerate run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a • Accelerate computations take place on arrays - Parallelism is introduced in the form of collective operations over arrays Accelerate Arrays in Arrays out computation • Arrays have two type parameters data ¡Array ¡sh ¡e Friday, 17 May 13

Accelerate run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a • Accelerate computations take place on arrays - Parallelism is introduced in the form of collective operations over arrays Accelerate Arrays in Arrays out computation • Arrays have two type parameters data ¡Array ¡sh ¡e - The shape of the array, or dimensionality Friday, 17 May 13

Accelerate run ¡:: ¡Arrays ¡a ¡=> ¡Acc ¡a ¡-‑> ¡a • Accelerate computations take place on arrays - Parallelism is introduced in the form of collective operations over arrays Accelerate Arrays in Arrays out computation • Arrays have two type parameters data ¡Array ¡sh ¡e - The shape of the array, or dimensionality - The element type of the array: Int , Float , etc. Friday, 17 May 13

Arrays data ¡Array ¡sh ¡e • Supported array element types are members of the Elt class: - () - Int , Int32 , Int64 , Word , Word32 , Word64 ... - Float , Double - Char - Bool - Tuples up to 9-tuples of these, including nested tuples • Note that Array itself is not an allowable element type. There are no nested arrays in Accelerate, regular arrays only! Friday, 17 May 13

Accelerate by example Mandelbrot fractal Friday, 17 May 13

Mandelbrot set generator • Basics - Pick a window onto the complex plane & divide into pixels | z | - A point is in the set if the value of does not diverge to infinity z n +1 = c + z 2 n - Each pixel has a value given by its coordinates in the complex plane c - Colour depends on number of iterations before divergence n • Each pixel is independent: lots of data parallelism Friday, 17 May 13

Mandelbrot set generator data ¡Array ¡sh ¡e • First, some types: data ¡Complex ¡ ¡ = ¡(Float, ¡Float) - A pair of floating point numbers for the real and imaginary parts Friday, 17 May 13

Mandelbrot set generator data ¡Array ¡sh ¡e • First, some types: data ¡Complex ¡ ¡ = ¡(Float, ¡Float) data ¡ComplexPlane ¡ = ¡Array ¡DIM2 ¡Complex - A pair of floating point numbers for the real and imaginary parts - DIM2 is a type synonym for a two dimensional Shape Friday, 17 May 13

data ¡Z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡Z Shapes data ¡tail ¡:. ¡head ¡= ¡tail ¡:. ¡head • Shapes determines the dimensions of the array and the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right • Examples: - One-dimensional array ( Vector ) indexed by Int : (Z ¡:. ¡Int) - Two-dimensional array, indexed by Int : (Z ¡:. ¡Int ¡:. ¡Int) • This style is used at both the type and value level: Friday, 17 May 13

data ¡Z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡Z Shapes data ¡tail ¡:. ¡head ¡= ¡tail ¡:. ¡head • Shapes determines the dimensions of the array and the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right • Examples: - One-dimensional array ( Vector ) indexed by Int : (Z ¡:. ¡Int) - Two-dimensional array, indexed by Int : (Z ¡:. ¡Int ¡:. ¡Int) Friday, 17 May 13

data ¡Z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡Z Shapes data ¡tail ¡:. ¡head ¡= ¡tail ¡:. ¡head • Shapes determines the dimensions of the array and the type of the index - Z represents a rank-zero array (singleton array with one element) - (:.) increases the rank by adding a new dimension on the right • Examples: - One-dimensional array ( Vector ) indexed by Int : (Z ¡:. ¡Int) - Two-dimensional array, indexed by Int : (Z ¡:. ¡Int ¡:. ¡Int) • This style is used at both the type and value level: sh ¡:: ¡Z ¡:. ¡Int sh ¡= ¡ ¡Z ¡:. ¡10 Friday, 17 May 13

Mandelbrot set generator • The initial complex plane: z 0 = c • generate is a collective operation that yields a value for an index in the array generate ¡:: ¡(Shape ¡sh, ¡Elt ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡=> ¡Exp ¡sh ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡sh ¡-‑> ¡Exp ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Acc ¡(Array ¡sh ¡a) Friday, 17 May 13

Mandelbrot set generator • The initial complex plane: z 0 = c • generate is a collective operation that yields a value for an index in the array generate ¡:: ¡(Shape ¡sh, ¡Elt ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡=> ¡Exp ¡sh ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡sh ¡-‑> ¡Exp ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Acc ¡(Array ¡sh ¡a) - Supported shape and element types: we will use DIM2 and Complex Friday, 17 May 13

Mandelbrot set generator • The initial complex plane: z 0 = c • generate is a collective operation that yields a value for an index in the array generate ¡:: ¡(Shape ¡sh, ¡Elt ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡=> ¡Exp ¡sh ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡sh ¡-‑> ¡Exp ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Acc ¡(Array ¡sh ¡a) - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image Friday, 17 May 13

Mandelbrot set generator • The initial complex plane: z 0 = c • generate is a collective operation that yields a value for an index in the array generate ¡:: ¡(Shape ¡sh, ¡Elt ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡=> ¡Exp ¡sh ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡sh ¡-‑> ¡Exp ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Acc ¡(Array ¡sh ¡a) - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image - A function to apply at every index: generate the values of at each pixel c Friday, 17 May 13

Mandelbrot set generator • The initial complex plane: z 0 = c • generate is a collective operation that yields a value for an index in the array generate ¡:: ¡(Shape ¡sh, ¡Elt ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡=> ¡Exp ¡sh ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡sh ¡-‑> ¡Exp ¡a) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Acc ¡(Array ¡sh ¡a) - Supported shape and element types: we will use DIM2 and Complex - Size of the result array: number of pixels in the image - A function to apply at every index: generate the values of at each pixel c If Acc is our DSL type, what is Exp ? Friday, 17 May 13

A Stratified Language • Accelerate is split into two worlds: Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt Friday, 17 May 13

A Stratified Language • Accelerate is split into two worlds: Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt • Collective operations in Acc comprise many scalar operations in Exp , executed in parallel over Arrays - Scalar operations can not contain collective operations Friday, 17 May 13

A Stratified Language • Accelerate is split into two worlds: Acc and Exp - Acc represents collective operations over instances of Arrays - Exp is a scalar computation on things of type Elt • Collective operations in Acc comprise many scalar operations in Exp , executed in parallel over Arrays - Scalar operations can not contain collective operations • This stratification excludes nested data parallelism Friday, 17 May 13

Collective Operations • Collective operations comprise many scalar operations applied in parallel example ¡:: ¡Acc ¡(Vector ¡Int) example ¡= ¡generate ¡(constant ¡(Z:.10)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡f ¡ix) Friday, 17 May 13

Collective Operations • Collective operations comprise many scalar operations applied in parallel example ¡:: ¡Acc ¡(Vector ¡Int) example ¡= ¡generate ¡(constant ¡(Z:.10)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡f ¡ix) - constant lifts a plain value into Exp land of scalar expressions constant ¡:: ¡Elt ¡e ¡=> ¡e ¡-‑> ¡Exp ¡e Friday, 17 May 13

Collective Operations • Collective operations comprise many scalar operations applied in parallel example ¡:: ¡Acc ¡(Vector ¡Int) example ¡= ¡generate ¡(constant ¡(Z:.10)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡f ¡ix) - constant lifts a plain value into Exp land of scalar expressions constant ¡:: ¡Elt ¡e ¡=> ¡e ¡-‑> ¡Exp ¡e generate ¡(constant ¡(Z:.10)) ¡f Friday, 17 May 13

Collective Operations • Collective operations comprise many scalar operations applied in parallel example ¡:: ¡Acc ¡(Vector ¡Int) example ¡= ¡generate ¡(constant ¡(Z:.10)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡f ¡ix) - constant lifts a plain value into Exp land of scalar expressions constant ¡:: ¡Elt ¡e ¡=> ¡e ¡-‑> ¡Exp ¡e generate ¡(constant ¡(Z:.10)) ¡f f ¡0 f ¡1 f ¡2 f ¡3 f ¡4 f ¡5 f ¡6 f ¡7 f ¡8 f ¡9 f ¡:: ¡Exp ¡DIM1 ¡-‑> ¡Exp ¡Int Friday, 17 May 13

Collective Operations • Collective operations comprise many scalar operations applied in parallel example ¡:: ¡Acc ¡(Vector ¡Int) example ¡= ¡generate ¡(constant ¡(Z:.10)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡f ¡ix) - constant lifts a plain value into Exp land of scalar expressions constant ¡:: ¡Elt ¡e ¡=> ¡e ¡-‑> ¡Exp ¡e Acc ¡(Vector ¡Int) f ¡0 f ¡1 f ¡2 f ¡3 f ¡4 f ¡5 f ¡6 f ¡7 f ¡8 f ¡9 Friday, 17 May 13

Mandelbrot set generator genPlane ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡ComplexPlane genPlane ¡view ¡= ¡ ¡generate ¡(constant ¡(Z:.600:.800)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡let ¡Z ¡:. ¡y ¡:. ¡x ¡= ¡unlift ¡ix ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡lift ¡( ¡xmin ¡+ ¡(fromIntegral ¡x ¡* ¡viewx) ¡/ ¡800 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡, ¡ymin ¡+ ¡(fromIntegral ¡y ¡* ¡viewy) ¡/ ¡600 ¡)) ¡ ¡where ¡ ¡ ¡ ¡(xmin,ymin,xmax,ymax) ¡= ¡unlift ¡view ¡ ¡ ¡ ¡viewx ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡xmax ¡-‑ ¡xmin ¡ ¡ ¡ ¡viewy ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡ymax ¡-‑ ¡ymin - unlift has a funky type, but just unpacks “stu ff ”. lift does the reverse. - Even though operations are in Exp , can still use standard Haskell operations like (*) and (-‑) Friday, 17 May 13

Mandelbrot set generator genPlane ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡ComplexPlane genPlane ¡view ¡= ¡ ¡generate ¡(constant ¡(Z:.600:.800)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡let ¡Z ¡:. ¡y ¡:. ¡x ¡= ¡unlift ¡ix ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡lift ¡( ¡xmin ¡+ ¡(fromIntegral ¡x ¡* ¡viewx) ¡/ ¡800 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡, ¡ymin ¡+ ¡(fromIntegral ¡y ¡* ¡viewy) ¡/ ¡600 ¡)) ¡ ¡where unlift ¡:: ¡Exp ¡(Z ¡:. ¡Int ¡:. ¡Int) ¡ ¡ ¡ ¡(xmin,ymin,xmax,ymax) ¡= ¡unlift ¡view ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡Z ¡:. ¡Exp ¡Int ¡:. ¡Exp ¡Int ¡ ¡ ¡ ¡viewx ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡xmax ¡-‑ ¡xmin ¡ ¡ ¡ ¡viewy ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡ymax ¡-‑ ¡ymin - unlift has a funky type, but just unpacks “stu ff ”. lift does the reverse. Friday, 17 May 13

Mandelbrot set generator genPlane ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡ComplexPlane genPlane ¡view ¡= ¡ ¡generate ¡(constant ¡(Z:.600:.800)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡let ¡Z ¡:. ¡y ¡:. ¡x ¡= ¡unlift ¡ix ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡lift ¡( ¡xmin ¡+ ¡(fromIntegral ¡x ¡* ¡viewx) ¡/ ¡800 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡, ¡ymin ¡+ ¡(fromIntegral ¡y ¡* ¡viewy) ¡/ ¡600 ¡)) ¡ ¡where ¡ ¡ ¡ ¡(xmin,ymin,xmax,ymax) ¡= ¡unlift ¡view unlift ¡:: ¡Exp ¡(F,F,F,F) ¡ ¡ ¡ ¡viewx ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡xmax ¡-‑ ¡xmin ¡ ¡ ¡ ¡ ¡ ¡ ¡-‑> ¡(Exp ¡F, ¡Exp ¡F, ¡Exp ¡F, ¡Exp ¡F) ¡ ¡ ¡ ¡viewy ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡ymax ¡-‑ ¡ymin - unlift has a funky type, but just unpacks “stu ff ”. lift does the reverse. Friday, 17 May 13

Mandelbrot set generator genPlane ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡ComplexPlane genPlane ¡view ¡= ¡ ¡generate ¡(constant ¡(Z:.600:.800)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡let ¡Z ¡:. ¡y ¡:. ¡x ¡= ¡unlift ¡ix ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡lift ¡( ¡xmin ¡+ ¡(fromIntegral ¡x ¡* ¡viewx) ¡/ ¡800 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡, ¡ymin ¡+ ¡(fromIntegral ¡y ¡* ¡viewy) ¡/ ¡600 ¡)) ¡ ¡where ¡ ¡ ¡ ¡(xmin,ymin,xmax,ymax) ¡= ¡unlift ¡view ¡ ¡ ¡ ¡viewx ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡xmax ¡-‑ ¡xmin ¡ ¡ ¡ ¡viewy ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡ymax ¡-‑ ¡ymin - unlift has a funky type, but just unpacks “stu ff ”. lift does the reverse. Friday, 17 May 13

Mandelbrot set generator genPlane ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡ComplexPlane genPlane ¡view ¡= ¡ ¡generate ¡(constant ¡(Z:.600:.800)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡(\ix ¡-‑> ¡let ¡Z ¡:. ¡y ¡:. ¡x ¡= ¡unlift ¡ix ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡lift ¡( ¡xmin ¡+ ¡(fromIntegral ¡x ¡* ¡viewx) ¡/ ¡800 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡, ¡ymin ¡+ ¡(fromIntegral ¡y ¡* ¡viewy) ¡/ ¡600 ¡)) ¡ ¡where ¡ ¡ ¡ ¡(xmin,ymin,xmax,ymax) ¡= ¡unlift ¡view ¡ ¡ ¡ ¡viewx ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡xmax ¡-‑ ¡xmin ¡ ¡ ¡ ¡viewy ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡= ¡ymax ¡-‑ ¡ymin - unlift has a funky type, but just unpacks “stu ff ”. lift does the reverse. - Even though operations are in Exp , can still use standard Haskell operations like (*) and (-‑) Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Let’s define the function we will iterate next ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex next ¡c ¡z ¡= ¡plus ¡c ¡(times ¡z ¡z) Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Let’s define the function we will iterate next ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex next ¡c ¡z ¡= ¡plus ¡c ¡(times ¡z ¡z) plus ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex plus ¡a ¡b ¡= ¡lift ¡(ax+bx, ¡ay+by) ¡ ¡where ¡ ¡ ¡ ¡(ax,ay) ¡= ¡unlift ¡a ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) ¡ ¡ ¡ ¡(bx,by) ¡= ¡unlift ¡b ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) - Use lift and unlift as before to unpack the tuples Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Let’s define the function we will iterate next ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex next ¡c ¡z ¡= ¡plus ¡c ¡(times ¡z ¡z) plus ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex lift ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) plus ¡a ¡b ¡= ¡lift ¡(ax+bx, ¡ay+by) ¡ ¡ ¡ ¡ ¡-‑> ¡Exp ¡(Float, ¡Float) ¡ ¡where ¡ ¡ ¡ ¡(ax,ay) ¡= ¡unlift ¡a ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) ¡ ¡ ¡ ¡(bx,by) ¡= ¡unlift ¡b ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) - Use lift and unlift as before to unpack the tuples Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Let’s define the function we will iterate next ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex next ¡c ¡z ¡= ¡plus ¡c ¡(times ¡z ¡z) plus ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex ¡-‑> ¡Exp ¡Complex lift ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) plus ¡a ¡b ¡= ¡lift ¡(ax+bx, ¡ay+by) ¡ ¡ ¡ ¡ ¡-‑> ¡Exp ¡(Float, ¡Float) ¡ ¡where ¡ ¡ ¡ ¡(ax,ay) ¡= ¡unlift ¡a ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) ¡ ¡ ¡ ¡(bx,by) ¡= ¡unlift ¡b ¡ ¡:: ¡(Exp ¡Float, ¡Exp ¡Float) - Use lift and unlift as before to unpack the tuples - Note that we had to add some type signatures to unlift Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Complication: GPUs must do the same thing to lots of di ff erent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared - Conditionals can lead to SIMD divergence, so use sparingly Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Complication: GPUs must do the same thing to lots of di ff erent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged step ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡(Complex, ¡Int) ¡-‑> ¡Exp ¡(Complex, ¡Int) step ¡c ¡zi ¡= ¡f ¡(fst ¡zi) ¡(snd ¡zi) ¡ ¡where ¡ ¡ ¡ ¡f ¡z ¡i ¡= ¡let ¡z' ¡= ¡next ¡c ¡z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡ ¡dot ¡z' ¡>* ¡4 ¡? ¡( ¡zi, ¡lift ¡(z', ¡i+1) ¡) - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared - Conditionals can lead to SIMD divergence, so use sparingly Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Complication: GPUs must do the same thing to lots of di ff erent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged step ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡(Complex, ¡Int) ¡-‑> ¡Exp ¡(Complex, ¡Int) step ¡c ¡zi ¡= ¡f ¡(fst ¡zi) ¡(snd ¡zi) ¡ ¡where ¡ ¡ ¡ ¡f ¡z ¡i ¡= ¡let ¡z' ¡= ¡next ¡c ¡z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡ ¡dot ¡z' ¡>* ¡4 ¡? ¡( ¡zi, ¡lift ¡(z', ¡i+1) ¡) - fst and snd to extract individual tuple components Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Complication: GPUs must do the same thing to lots of di ff erent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged step ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡(Complex, ¡Int) ¡-‑> ¡Exp ¡(Complex, ¡Int) step ¡c ¡zi ¡= ¡f ¡(fst ¡zi) ¡(snd ¡zi) ¡ ¡where ¡ ¡ ¡ ¡f ¡z ¡i ¡= ¡let ¡z' ¡= ¡next ¡c ¡z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡ ¡dot ¡z' ¡>* ¡4 ¡? ¡( ¡zi, ¡lift ¡(z', ¡i+1) ¡) - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Complication: GPUs must do the same thing to lots of di ff erent data - Keep a pair (z,i) for each pixel, where i is the point iteration diverged step ¡:: ¡Exp ¡Complex ¡-‑> ¡Exp ¡(Complex, ¡Int) ¡-‑> ¡Exp ¡(Complex, ¡Int) step ¡c ¡zi ¡= ¡f ¡(fst ¡zi) ¡(snd ¡zi) ¡ ¡where ¡ ¡ ¡ ¡f ¡z ¡i ¡= ¡let ¡z' ¡= ¡next ¡c ¡z ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡in ¡ ¡dot ¡z' ¡>* ¡4 ¡? ¡( ¡zi, ¡lift ¡(z', ¡i+1) ¡) - fst and snd to extract individual tuple components - dot is the magnitude of a complex number squared - Conditionals can lead to SIMD divergence, so use sparingly (?) ¡:: ¡Elt ¡t ¡=> ¡Exp ¡Bool ¡-‑> ¡(Exp ¡t, ¡Exp ¡t) ¡-‑> ¡Exp ¡t Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Accelerate is a meta programming language - So, just use regular Haskell to unroll the loop a fixed number of times mandel ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡(Array ¡DIM2 ¡(Complex, ¡Int)) mandel ¡view ¡= ¡ ¡Prelude.foldl1 ¡(.) ¡(Prelude.replicate ¡255 ¡go) ¡zs0 ¡ ¡where ¡ ¡ ¡ ¡cs ¡ ¡ ¡ ¡ ¡= ¡genPlane ¡view ¡ ¡ ¡ ¡zs0 ¡ ¡ ¡ ¡= ¡zip ¡cs ¡(fill ¡(shape ¡cs) ¡0) ¡ ¡ ¡ ¡go ¡zs ¡ ¡= ¡zipWith ¡step ¡cs ¡zs Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Accelerate is a meta programming language - So, just use regular Haskell to unroll the loop a fixed number of times mandel ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡(Array ¡DIM2 ¡(Complex, ¡Int)) mandel ¡view ¡= ¡ ¡Prelude.foldl1 ¡(.) ¡(Prelude.replicate ¡255 ¡go) ¡zs0 ¡ ¡where ¡ ¡ ¡ ¡cs ¡ ¡ ¡ ¡ ¡= ¡genPlane ¡view ¡ ¡ ¡ ¡zs0 ¡ ¡ ¡ ¡= ¡zip ¡cs ¡(fill ¡(shape ¡cs) ¡0) ¡ ¡ ¡ ¡go ¡zs ¡ ¡= ¡zipWith ¡step ¡cs ¡zs - zipWith applies its function step pairwise to elements of the two arrays Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Accelerate is a meta programming language - So, just use regular Haskell to unroll the loop a fixed number of times mandel ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡(Array ¡DIM2 ¡(Complex, ¡Int)) mandel ¡view ¡= ¡ ¡Prelude.foldl1 ¡(.) ¡(Prelude.replicate ¡255 ¡go) ¡zs0 ¡ ¡where ¡ ¡ ¡ ¡cs ¡ ¡ ¡ ¡ ¡= ¡genPlane ¡view ¡ ¡ ¡ ¡zs0 ¡ ¡ ¡ ¡= ¡zip ¡cs ¡(fill ¡(shape ¡cs) ¡0) ¡ ¡ ¡ ¡go ¡zs ¡ ¡= ¡zipWith ¡step ¡cs ¡zs - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to z n +1 z n Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Accelerate is a meta programming language - So, just use regular Haskell to unroll the loop a fixed number of times mandel ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡(Array ¡DIM2 ¡(Complex, ¡Int)) mandel ¡view ¡= ¡ ¡Prelude.foldl1 ¡(.) ¡(Prelude.replicate ¡255 ¡go) ¡zs0 ¡ ¡where ¡ ¡ ¡ ¡cs ¡ ¡ ¡ ¡ ¡= ¡genPlane ¡view ¡ ¡ ¡ ¡zs0 ¡ ¡ ¡ ¡= ¡zip ¡cs ¡(fill ¡(shape ¡cs) ¡0) ¡ ¡ ¡ ¡go ¡zs ¡ ¡= ¡zipWith ¡step ¡cs ¡zs - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to z n +1 z n - A pplies the steps in sequence, beginning with the initial data zs0 Friday, 17 May 13

Mandelbrot set generator z n +1 = c + z 2 n • Accelerate is a meta programming language - So, just use regular Haskell to unroll the loop a fixed number of times mandel ¡:: ¡Exp ¡(Float,Float,Float,Float) ¡-‑> ¡Acc ¡(Array ¡DIM2 ¡(Complex, ¡Int)) mandel ¡view ¡= ¡ ¡Prelude.foldl1 ¡(.) ¡(Prelude.replicate ¡255 ¡go) ¡zs0 ¡ ¡where ¡ ¡ ¡ ¡cs ¡ ¡ ¡ ¡ ¡= ¡genPlane ¡view f (g x) ≣ (f . g) x ¡ ¡ ¡ ¡zs0 ¡ ¡ ¡ ¡= ¡zip ¡cs ¡(fill ¡(shape ¡cs) ¡0) ¡ ¡ ¡ ¡go ¡zs ¡ ¡= ¡zipWith ¡step ¡cs ¡zs - zipWith applies its function step pairwise to elements of the two arrays - Replicate the transition step from to z n +1 z n - A pplies the steps in sequence, beginning with the initial data zs0 Friday, 17 May 13

In the workshop… • More example code, computational kernels - Common pitfalls - Tips for good performance - Figuring out what went wrong (or knowing who to blame) Friday, 17 May 13

In the workshop… • More example code, computational kernels - Common pitfalls - Tips for good performance - Figuring out what went wrong (or knowing who to blame) me Friday, 17 May 13

Questions? https://github.com/AccelerateHS/ http://xkcd.com/365/ Friday, 17 May 13

Extra Slides… Friday, 17 May 13

Seriously? Friday, 17 May 13

Arrays data ¡Array ¡sh ¡e • Create an array from a list: fromList ¡:: ¡(Shape ¡sh, ¡Elt ¡e) ¡=> ¡sh ¡-‑> ¡[e] ¡-‑> ¡Array ¡sh ¡e - Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order • Example: ghci> ¡fromList ¡(Z:.10) ¡[1..10] Friday, 17 May 13

Arrays data ¡Array ¡sh ¡e • Create an array from a list: > ¡fromList ¡(Z:.10) ¡[1..10] ¡:: ¡Vector ¡Float Array ¡(Z ¡:. ¡10) ¡[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0] Friday, 17 May 13

Arrays data ¡Array ¡sh ¡e • Create an array from a list: > ¡fromList ¡(Z:.10) ¡[1..10] ¡:: ¡Vector ¡Float Array ¡(Z ¡:. ¡10) ¡[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0] • Multidimensional arrays are similar: - Elements are filled along the right-most dimension first > ¡fromList ¡(Z:.3:.5) ¡[1..] ¡:: ¡Array ¡DIM2 ¡Int Array ¡(Z ¡:. ¡3 ¡:. ¡5) ¡[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Friday, 17 May 13

Arrays data ¡Array ¡sh ¡e • Array indices start counting from zero > ¡let ¡mat ¡= ¡fromList ¡(Z:.3:.5) ¡[1..] ¡:: ¡Array ¡DIM2 ¡Int > ¡indexArray ¡mat ¡(Z:.2:.1) 12 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Friday, 17 May 13

GPGPU Programming in Haskell with Accelerate Trevor L. McDonell - PowerPoint PPT Presentation

GPGPU Programming in Haskell with Accelerate Trevor L. McDonell University of New South Wales @tlmcdonell tmcdonell@cse.unsw.edu.au https://github.com/AccelerateHS Friday, 17 May 13 What is GPGPU Programming? General Purpose Programming

Welcome! Global Agenda: 1. GPGPU (1) : Introduction, architecture, concepts 2. GPGPU (2) :

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

Welcome! Todays Agenda: GPU Execution Model GPGPU Flow GPGPU Low Level Notes

Parallel Incep+on MPP Databases GPGPU Kyle Dunn Me Data nerd for Recovering HPC/GPGPU

Welcome! Todays Agenda: Introduction to GPGPU Example: Voronoi Noise GPGPU

Welcome! Todays Agenda: Practical GPGPU: Verlet Fluid GPGPU Algorithms Optimizing

Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.2015 Efficient

Haskell-RL An Equational Specification of Haskell in Maude Andrew Bennett Presented on 24 April

Haskell Overview David Grisham 31 October 2017 Haskell Overview David Grisham

wrangling the internet of things with haskell production haskell Reid Draper @reiddraper

Bringing Haskell to the World www.fpcomplete.com Experience Report Building Haskell Development

haskell cons In haskell consing is done via the infix operator (:). For example: (cons 1 (cons 2

An overview of Haskell Haggai Eran 23/7/2007 Haggai Eran An overview of Haskell Introduction

Dr. Strange- Todd L. Montgomery @toddlmontgomery Haskell Erlang Haskell Clojure

Metaprogramming Haskell, Metaprogramming Haskell, Metaprogramming Haskell, The Racket Way The

ACCELERATE AUDIT ACCELERATE ATTAIN ALIGN ACCREDIT THE 4 STAGE PROCESS ACCELERATE ACCREDIT

Parsing beyond context-free grammar: adjunction: replacing an internal node with a new tree.

Weather Unit Weather 101 Video from National Geographic 3:19 Weather Vocabulary 1. Atmosphere

Day 2 Large scale models Todays schedule (morning) 9:00-9:10 Wit, Insight, and Matters of

East Asia Regional Reanalysis Project of CMA Xudong LIANG, JinFang YIN, Feng CHEN, Ying LIU,

Course schedule INTERFACE AESTHETICS Beyond desktop 2/04 Beyond Desktop 2/11 Typography I

Flat Bunches in the Tevatron Chandra Bhat Fermilab (LARP) Tevatron Accelerator Studies Workshop

Queue ADT Tiziana Ligorio 1 Todays Plan Announcements Queue ADT Applications 2

Project Takeaway submitted. Next submission: Standard Template Library Nim