
[Plot residue: performance over input sizes 16 to 1M]

...
t282 = _mm_addsub_ps(t268, U247);
t283 = _mm_add_ps(t282, _mm_addsub_ps(U247, _mm_shuffle_ps(t275, t275, _MM_SHUFFLE(2, 3, 0, 1))));
t284 = _mm_add_ps(t282, _mm_addsub_ps(U247, _mm_sub_ps(_mm_setzero_ps(), ………)
s217 = _mm_addsub_ps(t270, U247);
s218 = _mm_addsub_ps(_mm_mul_ps(t277, _mm_set1_ps((- 0.70710678118654757))), ………)
t285 = _mm_add_ps(s217, s218);
t286 = _mm_sub_ps(s217, s218);
s219 = _mm_shuffle_ps(t278, t280, _MM_SHUFFLE(1, 0, 1, 0));
s220 = _mm_shuffle_ps(t278, t280, _MM_SHUFFLE(3, 2, 3, 2));
s221 = _mm_shuffle_ps(t283, t285, _MM_SHUFFLE(1, 0, 1, 0));
...

[Plot: Matrix multiplication, Performance [Gflop/s]]
[Plot: WiFi Receiver, Performance [Mbit/s]]

// straightforward code
for(i = 0; i < N; i += 1)
  for(j = 0; j < N; j += 1)
    for(k = 0; k < N; k += 1)
      c[i][j] += a[i][k]*b[k][j];

// unrolling + scalar replacement
for(i = 0; i < N; i += MU) {
  for(j = 0; j < N; j += NU) {
    for(k = 0; k < N; k += KU) {
      t1 = A[i*N + k];
      t2 = A[i*N + k + 1];
      t3 = A[i*N + k + 2];
      t4 = A[i*N + k + 3];
      t5 = A[(i + 1)*N + k];
      <more copies>
      t10 = t1 * t9;
      t17 = t17 + t10;
      t21 = t1 * t8;
      t18 = t18 + t21;
      t12 = t5 * t9;
      t19 = t19 + t12;
      t13 = t5 * t8;
      t20 = t20 + t13;
      <more ops>
      C[i*N + j] = t17;
      C[i*N + j + 1] = t18;
      C[(i+1)*N + j] = t19;
      C[(i+1)*N + j + 1] = t20;
    }
  }
}
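The unrolling and scalar-replacement idea can be sketched in a self-contained form. The version below is only an illustration under simplifying assumptions: it fixes MU = NU = 2 and does not unroll k, it requires an even n, and the function names mmm_naive and mmm_unrolled_2x2 are hypothetical (they do not appear on the slide). It keeps the 2 x 2 block of C in local scalars across the whole k loop, so each element of C is loaded and stored once per block.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward triple loop: C += A * B for row-major n x n matrices. */
void mmm_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Same computation with i and j unrolled by 2 and scalar replacement:
   the 2 x 2 block of C lives in local variables during the k loop.
   Assumption of this sketch: n is even. */
void mmm_unrolled_2x2(size_t n, const double *A, const double *B, double *C) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            /* Load the C block once. */
            double c00 = C[i*n + j],       c01 = C[i*n + j + 1];
            double c10 = C[(i + 1)*n + j], c11 = C[(i + 1)*n + j + 1];
            for (size_t k = 0; k < n; k++) {
                /* Each loaded element of A and B is reused twice. */
                double a0 = A[i*n + k], a1 = A[(i + 1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            /* Store the C block once. */
            C[i*n + j]           = c00;
            C[i*n + j + 1]       = c01;
            C[(i + 1)*n + j]     = c10;
            C[(i + 1)*n + j + 1] = c11;
        }
    }
}
```

Both functions accumulate the products for each element of C in the same order of k, so they should produce identical results on the same inputs.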


Correct code: easy. Fast code: very difficult.

void sub(double *y, double *x) {
  double f0, f1, f2, f3, f4, f7, f8, f10, f11;
  f0 = x[0] - x[3];
  f1 = x[0] + x[3];
  f2 = x[1] - x[2];
  f3 = x[1] + x[2];
  f4 = f1 - f3;
  y[0] = f1 + f3;
  y[2] = 0.7071067811865476 * f4;
  f7 = 0.9238795325112867 * f0;
  <more lines>

[Figure labels: A on Processor 0, Processor 1, Processor 2, Processor 3; x, y]

void dft(int n, cpx *y, cpx *x) {
  if (use_dft_base_case(n))
    dft_bc(n, y, x);
  else {
    int k = choose_dft_radix(n);
    /* m and t are not declared on the slide; presumably m = n/k
       and t is a temporary buffer of n elements */
    for (int i = 0; i < k; ++i)
      dft_strided(m, k, t + m*i, x + m*i);
    for (int i = 0; i < m; ++i)
      dft_scaled(k, m, precomp_d[i], y + i, t + i);
  }
}
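The two-stage recursion can be made concrete with a small, self-contained sketch. This is a textbook decimation-in-time Cooley-Tukey recursion, not the slide's implementation: the helpers on the slide (dft_bc, dft_strided, dft_scaled, precomp_d) are not reproduced, and all names below (cpx as a struct, dft_naive, dft_rec, smallest_factor, the parameters xs, w, ws) are assumptions made for illustration. The caller passes a precomputed table w of the n-th roots of unity, w[e] = exp(-2*pi*i*e/n), and calls dft_rec with xs = 1 and ws = 1.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double re, im; } cpx;

static cpx cpx_add(cpx a, cpx b) {
    cpx r = { a.re + b.re, a.im + b.im };
    return r;
}

static cpx cpx_mul(cpx a, cpx b) {
    cpx r = { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re };
    return r;
}

/* Smallest factor >= 2 of n; returns n itself when n is prime or 1. */
static int smallest_factor(int n) {
    for (int f = 2; f * f <= n; f++)
        if (n % f == 0) return f;
    return n;
}

/* O(n^2) base case on a strided input: y[j] = sum_l x[l*xs] * w_n^(j*l).
   w[ws*e] is the n-th root of unity raised to the power e. */
void dft_naive(int n, cpx *y, const cpx *x, int xs, const cpx *w, int ws) {
    for (int j = 0; j < n; j++) {
        cpx acc = { 0.0, 0.0 };
        for (int l = 0; l < n; l++)
            acc = cpx_add(acc, cpx_mul(x[l * xs], w[ws * ((j * l) % n)]));
        y[j] = acc;
    }
}

/* Recursive DFT of size n = k*m, with k the smallest factor of n. */
void dft_rec(int n, cpx *y, const cpx *x, int xs, const cpx *w, int ws) {
    int k = smallest_factor(n);
    if (k == n) {               /* prime size (or n == 1): base case */
        dft_naive(n, y, x, xs, w, ws);
        return;
    }
    int m = n / k;
    cpx *t = malloc((size_t)n * sizeof *t);
    assert(t != NULL);
    /* Stage 1: k DFTs of size m on the interleaved inputs x[i], x[i+k], ... */
    for (int i = 0; i < k; i++)
        dft_rec(m, t + m * i, x + i * xs, k * xs, w, ws * k);
    /* Stage 2: for each j < m, combine the k partial results.
       Twiddle factor and size-k DFT are merged into one exponent:
       w_n^(i*j) * w_k^(i*q) = w_n^(i*(j + m*q)). */
    for (int j = 0; j < m; j++)
        for (int q = 0; q < k; q++) {
            cpx acc = { 0.0, 0.0 };
            for (int i = 0; i < k; i++)
                acc = cpx_add(acc, cpx_mul(t[m * i + j],
                                           w[ws * ((i * (j + m * q)) % n)]));
            y[j + m * q] = acc;
        }
    free(t);
}
```

For n = 4 with the table {1, -i, -1, i} and the real input (1, 2, 3, 4), dft_rec yields (10, -2+2i, -2, -2-2i). Allocating t on every call keeps the sketch short; a tuned library would typically avoid that and precompute the twiddle factors per stage.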


[Figure labels: configure/make; d = dft(n); d(x,y)]

[Plot: DFT on Sandy Bridge (3.3 GHz, 4 cores, AVX); Performance [Gflop/s]]

[Figure labels: MKL, BTO]

SPL / Data flow graph / Scala function

Scala function:

  def f(x: Array[Double], y: Array[Double]) = {
    for (i <- 0 until 2) {
      y(2*i) = x(i*2) + x(i*2+1)
      y(2*i+1) = x(i*2) - x(i*2+1)
    }
  }

Staged array elements:

  def f(x: Array[Rep[Double]], y: Array[Rep[Double]]) = {
    for (i <- 0 until 2) {
      y(2*i) = x(i*2) + x(i*2+1)
      y(2*i+1) = x(i*2) - x(i*2+1)
    }
  }

  Resulting code:

  t0 = s0 + s1;
  t1 = s0 - s1;
  t2 = s2 + s3;
  t3 = s2 - s3;

Staged arrays:

  def f(x: Rep[Array[Double]], y: Rep[Array[Double]]) = {
    for (i <- 0 until 2) {
      y(2*i) = x(i*2) + x(i*2+1)
      y(2*i+1) = x(i*2) - x(i*2+1)
    }
  }

  Resulting code:

  t0 = x[0];
  t1 = x[1];
  t2 = t0 + t1;
  y[0] = t2;
  t3 = t0 - t1;
  y[1] = t3;
  t4 = x[2];
  t5 = x[3];
  t6 = t4 + t5;
  y[2] = t6;
  t7 = t4 - t5;
  y[3] = t7;

Staged arrays and staged loop:

  def f(x: Rep[Array[Double]], y: Rep[Array[Double]]) = {
    for (i <- 0 until 2: Rep[Range]) {
      y(2*i) = x(i*2) + x(i*2+1)
      y(2*i+1) = x(i*2) - x(i*2+1)
    }
  }

  Resulting code:

  for (int i = 0; i < 2; i++) {
    t0 = x[2*i];
    t1 = x[2*i+1];
    t2 = t0 + t1;
    y[2*i] = t2;
    t3 = t0 - t1;
    y[2*i+1] = t3;
  }
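As a rough illustration of what the Scala function computes, the two array-based code shapes (fully unrolled and looped) can be written as plain C and checked against each other. This is a hand-written sketch, not generator output; the names f_unrolled, f_looped and check_variants_agree are hypothetical.

```c
#include <assert.h>

/* Fully unrolled shape: both loop iterations written out. */
void f_unrolled(const double *x, double *y) {
    double t0 = x[0], t1 = x[1];
    y[0] = t0 + t1;
    y[1] = t0 - t1;
    double t2 = x[2], t3 = x[3];
    y[2] = t2 + t3;
    y[3] = t2 - t3;
}

/* Looped shape: the loop over i is kept in the code. */
void f_looped(const double *x, double *y) {
    for (int i = 0; i < 2; i++) {
        double t0 = x[2*i], t1 = x[2*i + 1];
        y[2*i]     = t0 + t1;
        y[2*i + 1] = t0 - t1;
    }
}

/* Checks that both shapes agree on one sample input. */
void check_variants_agree(void) {
    const double x[4] = { 1.0, 2.0, 3.0, 5.0 };
    double y1[4], y2[4];
    f_unrolled(x, y1);
    f_looped(x, y2);
    for (int i = 0; i < 4; i++)
        assert(y1[i] == y2[i]);
}
```

Both functions compute a sum and a difference for each adjacent input pair; the only difference is whether the loop survives into the emitted code.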

Scala function, generic in the loop and array types:

  def f[L[_],A[_],T](looptype: L, x: A[Array[T]], y: A[Array[T]]) = {
    for (i <- 0 until 2: L[Range]) {
      y(2*i) = x(i*2) + x(i*2+1)
      y(2*i+1) = x(i*2) - x(i*2+1)
    }
  }

≥ 70 papers on side threads
