

SLIDE 1

Writing better code with help from the compiler

Thiago Macieira

Qt Developer Days & LinuxCon Europe – October/2014

SLIDE 2

Who am I?

SLIDE 3

Example scenario

Interview question

You have 2 MB of data and you want to calculate how many bits are set. How would you do it? Memory usage is not a constraint (within reason).

SLIDE 4

Approach 1: count the number of bits in each byte

static unsigned char data[2*1024*1024];
int bitcount()
{
    int result = 0;
    for (int i = 0; i < sizeof(data); ++i) {
        unsigned char x = data[i];
        result += !!(x & 1);
        result += !!(x & 2);
        result += !!(x & 4);
        result += !!(x & 8);
        result += !!(x & 16);
        result += !!(x & 32);
        result += !!(x & 64);
        result += !!(x & 128);
    }
    return result;
}

static unsigned char data[2*1024*1024];
int bitcount()
{
    int result = 0;
    for (int i = 0; i < sizeof(data); ++i) {
        unsigned char x = data[i];
        for ( ; x; ++result)
            x &= x - 1;
    }
    return result;
}
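The second variant uses Kernighan's trick: x &= x - 1 clears the lowest set bit, so the inner loop runs once per set bit instead of once per bit position. For x = 0b10110, for instance, it finishes in three iterations.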

SLIDE 5

Approach 2: use a lookup table

static unsigned char data[2*1024*1024];
extern const ushort bitcount_table[65536];
int bitcount()
{
    int result = 0;
    for (int i = 0; i < sizeof(data); i += 2)
        result += bitcount_table[*(ushort*)(data + i)];
    return result;
}
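The slide declares the table but doesn't show how it's filled. A minimal sketch of one way to build it at startup (my assumption, not part of the presentation — the real table could just as well be generated offline and compiled in as const data):

// Entry i holds the number of set bits in i; entry 0 stays zero.
ushort bitcount_table[65536];

void init_bitcount_table()
{
    // popcount(i) == popcount(i >> 1) + lowest bit of i
    for (int i = 1; i < 65536; ++i)
        bitcount_table[i] = bitcount_table[i >> 1] + (i & 1);
}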

SLIDE 6

My answer

  • Use the POPCNT instruction

– Added with the first Intel Core i7 generation, Nehalem (SSE4.2, but a separate CPUID bit)

SLIDE 7

How do you use the POPCNT instruction?

  • Write assembly
  • Use the GCC intrinsic: __builtin_popcount()
  • Use the Intel intrinsic: _mm_popcnt_u32()
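
A minimal sketch of the three options side by side (the inline-assembly variant uses GCC syntax and is my illustration, not the slides'; build with -mpopcnt or -msse4.2 so the intrinsic compiles):

#include <nmmintrin.h>   // _mm_popcnt_u32

// GCC builtin: emits POPCNT when code generation for it is enabled,
// otherwise falls back to a library helper.
int count_builtin(unsigned v) { return __builtin_popcount(v); }

// Intel intrinsic: always emits the POPCNT instruction.
int count_intrinsic(unsigned v) { return _mm_popcnt_u32(v); }

// Inline assembly (GCC syntax): verbose and non-portable.
int count_asm(unsigned v)
{
    unsigned result;
    asm("popcnt %1, %0" : "=r"(result) : "r"(v));
    return (int)result;
}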
SLIDE 8

When can I use the instruction?

  • Use it unconditionally!
  • Check CPUID
  • Ask the linker for help
  • Check if the surrounding code already requires a CPU that supports the feature anyway

SLIDE 9

Choosing the solution

  • What affects the choice:

– CPUs it will run on
– Compilers / toolchains it will be compiled with
– Libraries you're using

SLIDE 10

Other architectures

  • Intrinsics exist for ARM and PowerPC too (NEON and AltiVec)
  • Not all compiler features work on those architectures yet
  • They are not discussed in this presentation
SLIDE 11

Using intrinsics

SLIDE 12

Finding out which intrinsic to use

  • Use the SDM, Luke! (Intel's Software Developer's Manual)
SLIDE 13

Examples using intrinsics

  • The population count
  • Calculating CRC32

static unsigned char data[2*1024*1024];
int bitcount()
{
    int result = 0;
    for (int i = 0; i < sizeof(data); i += 4)
        result += __builtin_popcount(*(unsigned int*)(data + i));
    return result;
}

static unsigned char data[2*1024*1024];
int crc32()
{
    int h = 0;
    for (int i = 0; i < sizeof(data); i += 4)
        h = _mm_crc32_u32(h, *(unsigned int*)(data + i));
    return h;
}

SLIDE 14

Where are intrinsics allowed?

For all compilers: they must be recent enough to know the intrinsics (e.g., GCC 4.7 for AVX2, 4.9 for AVX512F, etc.)

Compiler and permitted usage:

  • Microsoft Visual Studio, Intel C++ Compiler: anywhere, no special build options required
  • Clang, GCC 4.8 or earlier: anywhere, as long as code generation is enabled (e.g., -mavx / -mavx2 / -march=core-avx-i / etc. active)
  • GCC 4.9: code generation enabled; or functions decorated with __attribute__((target("avx"))) (etc.)
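
To illustrate the GCC 4.9 row: with the target attribute, a file built without -mavx can still contain an AVX function. A sketch under that assumption (the function name is mine):

#include <immintrin.h>

__attribute__((target("avx")))
void add8floats(float *dst, const float *a, const float *b)
{
    // AVX intrinsics are allowed here even though the rest of the
    // translation unit is compiled without -mavx
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));
}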

SLIDE 15

How I solved this for Qt 5.4

  • Macro for testing with #if
  • Macro that expands to __attribute__((target("xxx"))) (or empty)

#if QT_COMPILER_SUPPORTS_HERE(SSE4_2)
QT_FUNCTION_TARGET(SSE4_2)
static uint crc32(const char *ptr, size_t len, uint h)
{
    // Implementation using _mm_crc32_u64 / u32 / u16 / u8 goes here
}
#else
static uint crc32(...)
{
    Q_UNREACHABLE();
    return 0;
}
#endif
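
A plausible call site combines this with the runtime CPU check from the next section. A sketch of the expected usage, not verbatim Qt code (fallback_checksum is a hypothetical generic implementation):

uint checksum(const char *ptr, size_t len, uint seed)
{
#if QT_COMPILER_SUPPORTS_HERE(SSE4_2)
    if (qCpuHasFeature(SSE4_2))            // runtime dispatch, see below
        return crc32(ptr, len, seed);      // hardware CRC32 path
#endif
    return fallback_checksum(ptr, len, seed);  // hypothetical portable path
}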

SLIDE 16

Runtime dispatching

SLIDE 17

Runtime dispatching basics

1) Detect the CPU
2) Determine the best implementation
3) Run it

void function_sse2();
void function_plain();
void function()
{
    if (/* CPU supports SSE2 */)
        function_sse2();
    else
        function_plain();
}

With GCC 4.8 (doesn't work with Clang, ICC or MSVC):

void function_sse2();
void function_plain();
void function()
{
    if (__builtin_cpu_supports("sse2"))
        function_sse2();
    else
        function_plain();
}
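
A common refinement, not on the slide, caches the decision in a function pointer so the feature test runs only once. A sketch (strictly speaking the unsynchronized write is a data race, though every thread would store the same value; a relaxed atomic would make it clean):

void function_sse2();
void function_plain();

static void (*resolved)() = nullptr;

void function()
{
    if (!resolved)
        resolved = __builtin_cpu_supports("sse2") ? function_sse2
                                                  : function_plain;
    resolved();   // subsequent calls skip the CPU test
}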

SLIDE 18

Identifying the CPU

  • Running CPUID is left as an exercise for the reader
  • Just remember: cache the result

extern int qt_cpu_features;
extern void qDetectCpuFeatures(void);   // the actual CPUID querying goes here

static inline int qCpuFeatures()
{
    int features = qt_cpu_features;
    if (Q_UNLIKELY(features == 0)) {
        qDetectCpuFeatures();
        features = qt_cpu_features;
    }
    return features;
}
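
For the curious, a minimal sketch of what the detection routine might do with GCC's <cpuid.h> (the bit layout here is my invention, not Qt's actual encoding):

#include <cpuid.h>

int qt_cpu_features = 0;

void qDetectCpuFeatures(void)
{
    unsigned eax, ebx, ecx, edx;
    int features = 1;                    // bit 0 marks "detection has run"
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        if (ecx & bit_SSE4_2)
            features |= 1 << 1;          // hypothetical SSE4.2 flag
        if (ecx & bit_POPCNT)
            features |= 1 << 2;          // hypothetical POPCNT flag
    }
    qt_cpu_features = features;          // cached; qCpuFeatures() won't re-run this
}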

SLIDE 19

Checking surrounding code

SLIDE 20

Putting it together

  • Result on 64-bit: an unconditional call to the SSE2 version (SSE2 is part of the x86-64 baseline, so the compiler folds the feature check to true)

void function_sse2();
void function_plain();
void function()
{
    if (qCpuHasFeature(SSE2))
        function_sse2();
    else
        function_plain();
}

SLIDE 21

Asking the linker and dynamic linker for help

  • Requires:

– Glibc 2.11.1, Binutils 2.20.1, GCC 4.8 / ICC 14.0
– Not supported with Clang or on Android (due to Bionic)

void *memcpy(void *, const void *, size_t)
    __attribute__((ifunc("resolve_memcpy")));   // the magic goes here

decltype(memcpy) memcpy_avx, memcpy_sse2;
auto resolve_memcpy()
{
    return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2;
}

The same, spelled out without C++11:

void *memcpy(void *, const void *, size_t)
    __attribute__((ifunc("resolve_memcpy")));

void *memcpy_avx(void *, const void *, size_t);
void *memcpy_sse2(void *, const void *, size_t);
static void *(*resolve_memcpy(void))(void *, const void *, size_t)
{
    return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2;
}

SLIDE 22

GCC 4.9 auto-dispatcher (a.k.a. “Function Multi Versioning”)

  • C++ only!

__attribute__((target("popcnt"))) int bitcount() { int result = 0; for (int i = 0; i < sizeof(data); i += 4) result += __builtin_popcount(*(uint*)(data + i)); return result; } __attribute__((target("default"))) int bitcount() { int result = 0; for (int i = 0; i < sizeof(data); i += 2) result += bitcount_table[*(ushort*)(data + i)]; return result; } __attribute__((target("popcnt"))) int bitcount() { int result = 0; for (int i = 0; i < sizeof(data); i += 4) result += __builtin_popcount(*(uint*)(data + i)); return result; } __attribute__((target("default"))) int bitcount() { int result = 0; for (int i = 0; i < sizeof(data); i += 2) result += bitcount_table[*(ushort*)(data + i)]; return result; }

SLIDE 23

Finding better answers to interview questions

  • “How would you write a function that returns a 32-bit random number?”
  • “How would you zero-extend a block of data from 8- to 16-bit?”
  • “How do you calculate the next power of 2 for a given non-zero integer?”

uint32 nextPowerOfTwo(uint32 v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    ++v;
    return v;
}
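Worked example: for v = 17 (binary 10001), the decrement gives 10000, the shifts smear the top bit down to give 11111 (31), and the final increment yields 100000 = 32.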

SLIDE 24

Better answer

uint32 nextPowerOfTwo_x86(uint32 v)
{
    int idx = _bit_scan_reverse(v);
    return 2U << idx;
}
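Worked example: _bit_scan_reverse(17) returns 4, the index of the highest set bit, and 2U << 4 = 32. Note that the two versions disagree on exact powers of two: the bit-smearing version maps 16 to 16, while this one maps 16 to 32.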

SLIDE 25

Summary

  • Learn from the SDM: use intrinsics
  • Check the CPU at compile time and at run time, and dispatch
  • Use library, compiler and linker tools
SLIDE 26

Zero-extending from 8- to 16-bit

  • Highly parallelisable

– No inter-element dependencies

  • Used in Latin-1 to UTF-16 conversion

while (size--)
    *dst++ = (uchar)*str++;

SLIDE 27

Left to the whims of the compiler (-O3)

GCC 4.8:
2d8c: movdqu (%rsi,%rax,1),%xmm1
2d91: add $0x1,%r10
2d95: movdqa %xmm1,%xmm3
2d99: punpckhbw %xmm0,%xmm1
2d9d: punpcklbw %xmm0,%xmm3
2da1: movdqu %xmm1,0x10(%rdi,%rax,2)
2da7: movdqu %xmm3,(%rdi,%rax,2)
2dac: add $0x10,%rax
2db0: cmp %r9,%r10
2db3: jb 2d8c

Clang 3.4:
2150: movq (%rsi,%rcx,1),%xmm1
2155: punpcklbw %xmm0,%xmm1
2159: movq 0x8(%rsi,%rcx,1),%xmm2
215f: punpcklbw %xmm0,%xmm2
2163: pand %xmm0,%xmm1
2167: pand %xmm0,%xmm2
216b: movdqu %xmm1,(%rdi,%rcx,2)
2170: movdqu %xmm2,0x10(%rdi,%rcx,2)
2176: add $0x10,%rcx
217a: cmp %rcx,%r9
217d: jne 2150

ICC 14:
7d3: movq (%r8,%rsi,1),%xmm1
7d9: punpcklbw %xmm0,%xmm1
7dd: movdqa %xmm1,(%rdi,%r8,2)
7e3: add $0x8,%r8
7e7: cmp %rax,%r8
7ea: jb 7d3

SLIDE 28

Left to the whims of the compiler (-O3 -mavx2)

GCC 4.9:
2bb6: vmovdqu (%rsi,%rdi,1),%ymm0
2bbb: add $0x1,%r11
2bbf: vpmovzxbw %xmm0,%ymm1
2bc4: vextracti128 $0x1,%ymm0,%xmm0
2bca: vpmovzxbw %xmm0,%ymm0
2bcf: vmovdqa %ymm1,(%rbx,%rdi,2)
2bd4: vmovdqa %ymm0,0x20(%rbx,%rdi,2)
2bda: add $0x20,%rdi
2bde: cmp %r11,%rax
2be1: ja 2bb6

Clang 3.4:
21a0: vmovdqu -0x20(%rsi),%xmm0
21a5: vmovdqu -0x10(%rsi),%xmm1
21aa: vmovdqu (%rsi),%xmm2
21ae: vpmovzxbw %xmm0,%ymm0
21b3: vpmovzxbw %xmm1,%ymm1
21b8: vpmovzxbw %xmm2,%ymm2
21bd: vmovdqu %ymm0,-0x40(%rdi)
21c2: vmovdqu %ymm1,-0x20(%rdi)
21c7: vmovdqu %ymm2,(%rdi)
21cb: add $0x60,%rdi
21cf: add $0x30,%rsi
21d3: add $0xffffffffffffffd0,%rcx
21d7: cmp %rcx,%r8
21da: jne 21a0

ICC 14:
7dc: vpmovzxbw (%r8,%rsi,1),%ymm0
7e2: vmovdqu %ymm0,(%rdi,%r8,2)
7e8: add $0x10,%r8
7ec: cmp %rax,%r8
7ef: jb 7dc

SLIDE 29

Helping out the compiler

  • GCC's implementation was the best with SSE2

– ICC produces better code for AVX2

  • Let's rewrite using intrinsics

const char *e = str + size;
qptrdiff offset = 0;
const __m128i nullMask = _mm_set1_epi32(0);

// we're going to read str[offset..offset+15] (16 bytes)
for ( ; str + offset + 15 < e; offset += 16) {
    const __m128i chunk = _mm_loadu_si128((__m128i*)(str + offset)); // load 16 bytes

    // unpack the first 8 bytes, padding with zeros
    const __m128i firstHalf = _mm_unpacklo_epi8(chunk, nullMask);
    _mm_storeu_si128((__m128i*)(dst + offset), firstHalf); // store 16 bytes

    // unpack the last 8 bytes, padding with zeros
    const __m128i secondHalf = _mm_unpackhi_epi8(chunk, nullMask);
    _mm_storeu_si128((__m128i*)(dst + offset + 8), secondHalf); // store next 16 bytes
}

SLIDE 30

Code generated with the intrinsics

Before:
2d8c: movdqu (%rsi,%rax,1),%xmm1
2d91: add $0x1,%r10
2d95: movdqa %xmm1,%xmm3
2d99: punpckhbw %xmm0,%xmm1
2d9d: punpcklbw %xmm0,%xmm3
2da1: movdqu %xmm1,0x10(%rdi,%rax,2)
2da7: movdqu %xmm3,(%rdi,%rax,2)
2dac: add $0x10,%rax
2db0: cmp %r9,%r10
2db3: jb 2d8c

After:
2d70: movdqu (%rsi,%rcx,1),%xmm0
2d75: add $0x10,%rax
2d79: add %rcx,%rcx
2d7c: cmp %r8,%rax
2d7f: movdqa %xmm0,%xmm2
2d83: punpckhbw %xmm1,%xmm0
2d87: punpcklbw %xmm1,%xmm2
2d8b: movdqu %xmm0,0x10(%rdi,%rcx,1)
2d91: movdqu %xmm2,(%rdi,%rcx,1)
2d96: mov %rax,%rcx
2d99: jne 2d70

Better or worse?

SLIDE 31

Extending to AVX2 support

const char *e = str + size;
qptrdiff offset = 0;

// we're going to read str[offset..offset+15] (16 bytes)
for ( ; str + offset + 15 < e; offset += 16) {
    const __m128i chunk = _mm_loadu_si128((__m128i*)(str + offset)); // load 16 bytes
#ifdef __AVX2__
    // zero-extend to a YMM register
    const __m256i extended = _mm256_cvtepu8_epi16(chunk);
    // store 32 bytes
    _mm256_storeu_si256((__m256i*)(dst + offset), extended);
#else
    const __m128i nullMask = _mm_set1_epi32(0);

    // unpack the first 8 bytes, padding with zeros
    const __m128i firstHalf = _mm_unpacklo_epi8(chunk, nullMask);
    _mm_storeu_si128((__m128i*)(dst + offset), firstHalf); // store 16 bytes

    // unpack the last 8 bytes, padding with zeros
    const __m128i secondHalf = _mm_unpackhi_epi8(chunk, nullMask);
    _mm_storeu_si128((__m128i*)(dst + offset + 8), secondHalf); // store next 16 bytes
#endif
}
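
The loop only covers complete 16-byte chunks; presumably a scalar tail like the following finishes the remainder (my assumption — the tail handling isn't on the slide):

// handle the remaining 0-15 bytes with the original scalar loop
for ( ; str + offset < e; ++offset)
    dst[offset] = (uchar)str[offset];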

SLIDE 32

Thiago Macieira
thiago.macieira@intel.com
http://google.com/+ThiagoMacieira