Software Vector Chaining
M. Anton Ertl, TU Wien

Data Parallelism and SIMD instructions
• Data parallelism in programming problems
• Hardware provides SIMD instructions
  Cray-1 vector instructions, Intel/AMD SSE/AVX, ARM Neon/SVE
  vmulpd %ymm2, %ymm3, %ymm1
  [Diagram: the lanes of ymm2 and ymm3 are multiplied pairwise (* * * *) into ymm1]
• Little programming language support
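The lane-wise behaviour of the vmulpd example can be modelled in a few lines. This is only an illustrative sketch in Python, not how the hardware or the author's library is implemented; the function name vmulpd_model and the sample values are assumptions made for the example.

```python
def vmulpd_model(src1, src2):
    """Model of a 4-lane packed double multiply: every lane is independent."""
    assert len(src1) == len(src2) == 4   # a 256-bit ymm register holds 4 doubles
    return tuple(a * b for a, b in zip(src1, src2))

# Hypothetical register contents, chosen so the products are exact.
ymm2 = (1.0, 2.0, 3.0, 4.0)
ymm3 = (0.5, 0.5, 2.0, 2.0)
ymm1 = vmulpd_model(ymm2, ymm3)
print(ymm1)  # (0.5, 1.0, 6.0, 8.0)
```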

Programming language support: How?
• Manual vectorization
• Application vector length
• Opaque, immutable vectors with value semantics
• Vector stack

  : vcomp ( va vb -- vc )
    vdup sf*v vswap vdup sf*v sf+v sfnegatev ;
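To see what vcomp computes, the vector stack can be modelled with immutable tuples. The sketch below is a Python model under stated assumptions: the class VectorStack and its method names are hypothetical stand-ins for the Forth words vdup, vswap, sf*v, sf+v and sfnegatev, and the element-wise meaning of those words is inferred from their names. Tracing the stack shows that vcomp leaves -(va² + vb²).

```python
class VectorStack:
    """Hypothetical model of a vector stack holding immutable vectors."""

    def __init__(self):
        self._items = []

    def push(self, values):
        self._items.append(tuple(float(x) for x in values))

    def pop(self):
        return self._items.pop()

    def vdup(self):                      # ( v -- v v )
        self._items.append(self._items[-1])

    def vswap(self):                     # ( v1 v2 -- v2 v1 )
        self._items[-1], self._items[-2] = self._items[-2], self._items[-1]

    def _binary(self, op):
        b = self.pop()
        a = self.pop()
        assert len(a) == len(b)
        self._items.append(tuple(op(x, y) for x, y in zip(a, b)))

    def sf_times_v(self):                # models sf*v
        self._binary(lambda x, y: x * y)

    def sf_plus_v(self):                 # models sf+v
        self._binary(lambda x, y: x + y)

    def sf_negate_v(self):               # models sfnegatev
        self._items.append(tuple(-x for x in self.pop()))


def vcomp(stack):
    """( va vb -- vc ), same word order as the Forth definition."""
    stack.vdup()
    stack.sf_times_v()
    stack.vswap()
    stack.vdup()
    stack.sf_times_v()
    stack.sf_plus_v()
    stack.sf_negate_v()


s = VectorStack()
s.push([1, 2, 3])   # va
s.push([4, 5, 6])   # vb
vcomp(s)
result = s.pop()
print(result)  # (-17.0, -29.0, -45.0)
```

Because the vectors are tuples, vdup can share the same object without copying, which mirrors the value semantics the slide asks for.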

Properties, benefits and drawbacks
• Vectors are immutable (value semantics)
− Explicit conversion from/to memory arrays
+ gives control to programmer, who can make conversions infrequent
+ Padding to SIMD granularity
+ Aligning to SIMD granularity
+ No aliasing problems
+ Results do not overlap input operands
+ only explicit dependences
+ vectors are a separate world
+ Compiler can arrange computations

[Diagram: scalars and vectors as separate worlds. Scalar side: mul, addressing, indexing, add, control flow, ... Conversions: new FloatVect (array to vector), intoArray (vector to array), sum (vector to scalar).]
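The explicit conversions are what make padding to SIMD granularity possible: the vector world never exposes its storage, so it can round sizes up freely. A minimal sketch in Python, assuming a 32-byte granularity (one AVX register), 4-byte elements and zero padding; the function names to_vector and into_array are hypothetical and only loosely follow the diagram's new FloatVect and intoArray.

```python
SIMD_BYTES = 32   # assumption: one AVX ymm register
ELEM_BYTES = 4    # assumption: single-precision floats


def padded_size(nbytes, granularity=SIMD_BYTES):
    """Round nbytes up to the next multiple of the SIMD granularity."""
    return -(-nbytes // granularity) * granularity


def to_vector(array):
    """Array to vector: copy once, pad with zeros (padding value is an assumption)."""
    n = len(array)
    lanes = padded_size(n * ELEM_BYTES) // ELEM_BYTES
    return (tuple(array) + (0.0,) * (lanes - n), n)   # (padded data, logical length)


def into_array(vector):
    """Vector to array: drop the padding again."""
    data, n = vector
    return list(data[:n])


v = to_vector([1.0, 2.0, 3.0, 4.0])   # 16 bytes of payload
print(len(v[0]))                       # 8 lanes, i.e. 32 bytes
print(into_array(v))                   # [1.0, 2.0, 3.0, 4.0]
```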

Implementation

simple (sf+v):

  simple: vmovaps (%rdi,%r10,1),%ymm0
          vaddps  (%rsi,%r10,1),%ymm0,%ymm0
          vmovaps %ymm0,(%rdx,%r10,1)
          add     $0x20,%r10
          cmp     %r10,%rcx
          ja      simple

fused (vcomp):

  fused:  vmovaps (%rdi,%r10,1),%ymm0
          vmulps  %ymm0,%ymm0,%ymm1
          vmovaps (%rsi,%r10,1),%ymm2
          vmulps  %ymm2,%ymm2,%ymm3
          vaddps  %ymm1,%ymm3,%ymm1
          vxorps  %ymm1,%ymm4,%ymm1
          vmovaps %ymm0,(%rdx,%r10,1)
          add     $0x20,%r10
          cmp     %r10,%r9
          ja      fused

... but how?
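The difference between the two strategies can be sketched at a higher level. In the Python model below, simple_vcomp runs one loop per vector operation and writes a full-length temporary each time, while fused_vcomp does all the arithmetic for one element before moving to the next. Both function names are hypothetical, and the computation -(va² + vb²) is the one read off the vcomp definition; this is a model of the idea, not of the generated machine code.

```python
def simple_vcomp(va, vb):
    """One loop per operation; every step materializes a temporary vector."""
    t1 = [b * b for b in vb]                  # sf*v
    t2 = [a * a for a in va]                  # sf*v
    t3 = [x + y for x, y in zip(t1, t2)]      # sf+v
    return [-x for x in t3]                   # sfnegatev


def fused_vcomp(va, vb):
    """One loop for the whole expression; intermediates stay in locals."""
    out = [0.0] * len(va)
    for i in range(len(va)):
        a = va[i]
        b = vb[i]
        out[i] = -(b * b + a * a)
    return out


va = [1.0, 2.0, 3.0, 4.0]
vb = [2.0, 2.0, 2.0, 2.0]
print(simple_vcomp(va, vb))  # [-5.0, -8.0, -13.0, -20.0]
print(fused_vcomp(va, vb))   # [-5.0, -8.0, -13.0, -20.0]
```

The simple version touches memory for three temporaries of the full vector length; the fused version reads two inputs and writes one result.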

Who performs vector loop fusion?

  Compiler                          Run-time library
  + Low run-time overhead           − High run-time overhead
  − High implementation effort?     + Low implementation effort
  − Control flow may limit fusion   + Fuses across control flow
  − Aliasing plays a role           + Dependencies resolved

Run-time library approach: Software Vector Chaining

Implementing a vector operation

simple:
• make result vector
• perform operation loop
• done

chaining:
• add to trace
• make result vector stub
• trace ends? If no: done
• If yes: trace in cache? If no: generate code, cache code
• allocate result vector memory
• execute code
• clear trace
• done
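The chaining path of the flowchart can be sketched as a small lazy evaluator. Everything in this Python sketch is an assumption made for illustration: the class names Chainer and Stub, the rule that a trace ends when a result is requested through force, the cache key (the trace itself), and the use of a closure in place of real machine-code generation. Only the requested result is materialized; other stubs of the same trace are treated as temporaries.

```python
import operator

OPS = {"sf*v": operator.mul, "sf+v": operator.add, "sfnegatev": operator.neg}


class Stub:
    """Result vector stub: no memory is allocated until the trace executes."""
    def __init__(self, node, epoch):
        self.node, self.epoch, self.data = node, epoch, None


class Chainer:
    def __init__(self):
        self.trace, self.inputs = [], []   # pending operations and their input data
        self.cache = {}                    # trace -> generated function
        self.generated = 0                 # how often code was generated
        self.epoch = 0                     # bumped whenever the trace is cleared

    def _add(self, entry):                 # add to trace, make result vector stub
        self.trace.append(entry)
        return Stub(len(self.trace) - 1, self.epoch)

    def vector(self, values):
        self.inputs.append(tuple(values))
        return self._add(("input", len(self.inputs) - 1))

    def _node(self, stub):
        if stub.data is not None:          # materialized earlier: re-enter as input
            return self.vector(stub.data).node
        if stub.epoch != self.epoch:
            raise ValueError("temporary from an already executed trace")
        return stub.node

    def apply(self, opname, *stubs):
        return self._add((opname,) + tuple(self._node(s) for s in stubs))

    def _generate(self, trace):            # stand-in for machine-code generation
        def fused(inputs, n, result_node):
            out = [0.0] * n                # allocate result vector memory
            for i in range(n):             # a single loop for the whole trace
                lane = []
                for entry in trace:
                    if entry[0] == "input":
                        lane.append(inputs[entry[1]][i])
                    else:
                        lane.append(OPS[entry[0]](*(lane[j] for j in entry[1:])))
                out[i] = lane[result_node]
            return tuple(out)
        return fused

    def force(self, stub):                 # assumed trace end: a result is needed
        key = tuple(self.trace)
        if key not in self.cache:          # trace in cache? no: generate and cache
            self.cache[key] = self._generate(key)
            self.generated += 1
        n = len(self.inputs[0])            # assumes all inputs have equal length
        stub.data = self.cache[key](self.inputs, n, stub.node)   # execute code
        self.trace, self.inputs = [], []   # clear trace
        self.epoch += 1
        return stub.data


def vcomp(c, va, vb):
    return c.apply("sfnegatev",
                   c.apply("sf+v", c.apply("sf*v", vb, vb), c.apply("sf*v", va, va)))


c = Chainer()
r1 = c.force(vcomp(c, c.vector([1.0, 2.0]), c.vector([3.0, 4.0])))
r2 = c.force(vcomp(c, c.vector([5.0, 6.0]), c.vector([7.0, 8.0])))
print(r1, r2, c.generated)  # (-10.0, -20.0) (-74.0, -100.0) 1
```

The second call builds a trace with the same shape as the first, so it hits the cache and no code is generated again. The per-operation bookkeeping is also visible here, which is where the overhead for short vectors comes from.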

Generate code

Source: vdup sf*v vswap vdup sf*v sf+v sfnegatev

Trace:

  $24147C0 refs= 0 bytes=16 $24147A0 :14
  $2414B10 refs= 0 bytes=16 $2415150 :15
  sftimesv_ 15 15 temporary :16
  sftimesv_ 14 14 temporary :17
  sfplusv_ 16 17 temporary :18
  sfnegatev_ 18 0 $2415030 refs= 0 bytes=16 $2417300 :19

Generated code:

  fused:  vmovaps (%rdi,%r10,1),%ymm0
          vmulps  %ymm0,%ymm0,%ymm1
          vmovaps (%rsi,%r10,1),%ymm2
          vmulps  %ymm2,%ymm2,%ymm3
          vaddps  %ymm1,%ymm3,%ymm1
          vxorps  %ymm1,%ymm4,%ymm1
          vmovaps %ymm0,(%rdx,%r10,1)
          add     $0x20,%r10
          cmp     %r10,%r9
          ja      fused
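The step from trace to fused loop can be imitated by generating source text and compiling it at run time. The Python sketch below reads the trace dump as "operation, operand ids, result id after the colon", which is an interpretation of the slide rather than a documented format. The generate function, the TEMPLATES table and the tuple layout are hypothetical, and the output is Python source rather than AVX machine code.

```python
# One statement template per trace operation (names taken from the trace dump).
TEMPLATES = {
    "sftimesv_": "t{dst} = t{a} * t{b}",
    "sfplusv_": "t{dst} = t{a} + t{b}",
    "sfnegatev_": "t{dst} = -t{a}",
}


def generate(trace, inputs, result):
    """Emit and compile one fused loop.

    trace:  list of (opname, a, b, dst) with node ids as in the dump
    inputs: node ids that are loaded from memory
    result: node id that is stored to the output vector
    """
    params = ", ".join(f"in{i}" for i in inputs)
    lines = [f"def fused(n, out, {params}):", "    for i in range(n):"]
    for i in inputs:                                   # loads
        lines.append(f"        t{i} = in{i}[i]")
    for op, a, b, dst in trace:                        # arithmetic, in trace order
        lines.append("        " + TEMPLATES[op].format(dst=dst, a=a, b=b))
    lines.append(f"        out[i] = t{result}")        # store
    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)                            # compile the generated text
    return source, namespace["fused"]


trace = [
    ("sftimesv_", 15, 15, 16),
    ("sftimesv_", 14, 14, 17),
    ("sfplusv_", 16, 17, 18),
    ("sfnegatev_", 18, 0, 19),
]
source, fused = generate(trace, inputs=[14, 15], result=19)
print(source)

out = [0.0] * 4
fused(4, out, [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0])
print(out)  # [-5.0, -8.0, -13.0, -20.0]
```

As in the assembly listing, the temporaries 16 to 18 never reach memory; only the inputs are loaded and only the final result is stored.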

Evaluation

Multiply 50 × 50 with 50 × n double matrix for varying n, 500 times on Core i5 6600K (Skylake)

[Two plots: instructions and cycles as a function of n (1 to 12000) for the variants simple, compiler fused, fused unrolled, and chaining]

Conclusion
• How to use SIMD instructions for data parallelism?
• Manual vectorization, application vector size, opaque vectors:
  these give freedom to the compiler/library writer
• Software vector chaining
  Build trace at run-time
  Compile if not cached
  + Can be implemented as a library (315 source lines of code)
  + For long vectors > 2 × as fast as simple
  − High per-operation overhead
    Useful only for long vectors
    Select between simple and chaining per operation
• github.com/AntonErtl/vectors
  Paper at ManLang 2018: https://www.complang.tuwien.ac.at/papers/ertl18.pdf
