writing better code with writing better code with help
play

Writing better code with Writing better code with help from the - PowerPoint PPT Presentation

Writing better code with Writing better code with help from the compiler help from the compiler Thiago Macieira Thiago Macieira Qt Developer Days & LinuxCon Europe October/2014 Qt Developer Days & LinuxCon Europe October/2014


  1. Writing better code with Writing better code with help from the compiler help from the compiler Thiago Macieira Thiago Macieira Qt Developer Days & LinuxCon Europe – October/2014 Qt Developer Days & LinuxCon Europe – October/2014

  2. Who am I? 2

  3. Example scenario Interview questjon You have 2 MB of data and you want to calculate how many bits are set, how would you do it? Memory usage is not a constraint (within reason). 3

  4. Approach 1: count the number of bits in each byte static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; int bitcount() int bitcount() int bitcount() int bitcount() { { { { int result = 0; int result = 0; int result = 0; int result = 0; for (int i = 0; i < sizeof(data); ++i) { for (int i = 0; i < sizeof(data); ++i) { unsigned char x = data[i]; for (int i = 0; i < sizeof(data); ++i) { unsigned char x = data[i]; for (int i = 0; i < sizeof(data); ++i) { result += !!(x & 1); result += !!(x & 1); unsigned char x = data[i]; unsigned char x = data[i]; result += !!(x & 2); result += !!(x & 2); for ( ; x; ++result) for ( ; x; ++result) result += !!(x & 4); result += !!(x & 4); x &= x - 1; x &= x - 1; result += !!(x & 8); result += !!(x & 8); } } result += !!(x & 16); result += !!(x & 16); return result; result += !!(x & 32); return result; result += !!(x & 32); result += !!(x & 64); result += !!(x & 64); } } result += !!(x & 128); result += !!(x & 128); } } return result; return result; } } 4

  5. Approach 2: use a lookup table static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; extern const ushort bitcount_table[65536]; extern const ushort bitcount_table[65536]; int bitcount() int bitcount() { { int result = 0; int result = 0; for (int i = 0; i < sizeof(data); i += 2) for (int i = 0; i < sizeof(data); i += 2) result += bitcount_table[*(ushort*)(data + i)]; result += bitcount_table[*(ushort*)(data + i)]; return result; return result; } } 5

  6. My answer • Use the POPCNT instructjon – Added with the fjrst Intel Core-i7 generatjon, Nehalem (SSE4.2, but separate CPUID bit) 6

  7. How do you use the POPCNT instruction? • Write assembly • Use the GCC intrinsic: __builtin_popcount() • Use the Intel intrinsic: _mm_popcnt_u32() 7

  8. When can I use the instruction? • Use unconditjonally! • Check CPUID • Ask the linker for help • Check if surrounding code already requires a CPU that supports the feature anyway 8

  9. Choosing the solution • What afgects the choice: – CPUs it will run on – Compilers / toolchains it will be compiled with – Libraries you're using 9

  10. Other architectures • Intrinsics exist for ARM and PowerPC too (Neon and Altjvec) • Not all compiler features work on those architectures yet • But not discussed on this presentatjon 10

  11. Using intrinsics Using intrinsics 11

  12. Finding out which intrinsic to use • Use the SDM, Luke! 12

  13. Examples using intrinsics • The populatjon count • Calculatjng CRC32 static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; static unsigned char data[2*1024*1024]; int crc32() int bitcount() int bitcount() int crc32() { { { { int h = 0; int result = 0; int h = 0; int result = 0; for (int i = 0; i < sizeof(data); i += 4) for (int i = 0; i < sizeof(data); i += 4) for (int i = 0; i < sizeof(data); i += 4) for (int i = 0; i < sizeof(data); i += 4) h = _mm_crc32_u32 (h, *(unsigned int*)(data + i)); h = _mm_crc32_u32 (h, *(unsigned int*)(data + i)); result += result += return h; return h; __builtin_popcount (*(unsigned int*)(data + i)); __builtin_popcount (*(unsigned int*)(data + i)); } } return result; return result; } } 13

  14. Where are intrinsics allowed? For all compilers: recent enough (e.g., GCC 4.7 for AVX2, 4.9 for AVX512F, etc.) Compiler Permitted usage Microsoft Visual Studio Anywhere, no special build options required Intel C++ Compiler Clang Anywhere, as long as code generation is enabled (e.g., -mavx / -mavx2 / -march=core-avx-i / etc. active) GCC 4.8 or earlier Code generation enabled; or GCC 4.9 functions decorated with __attribute__((target("avx"))) (etc.) 14

  15. How I solved this for Qt 5.4 • Macro for testjng with #if • Macro that expands to __attribute__((target(xxx)) (or empty) #if QT_COMPILER_SUPPORTS_HERE(SSE4_2) #if QT_COMPILER_SUPPORTS_HERE(SSE4_2) QT_FUNCTION_TARGET(SSE4_2) QT_FUNCTION_TARGET(SSE4_2) static uint crc32(const char *ptr, size_t len, uint h) static uint crc32(const char *ptr, size_t len, uint h) { { // Implementation using _mm_crc32_u64 / u32 / u16 / u8 goes here // Implementation using _mm_crc32_u64 / u32 / u16 / u8 goes here } } #else #else static uint crc32(...) static uint crc32(...) { { Q_UNREACHABLE(); Q_UNREACHABLE(); return 0; return 0; } } #endif #endif 15

  16. Runtime dispatching Runtime dispatching 16

  17. Runtime dispatching basics 1) Detect CPU void function_sse2(); void function_sse2(); void function_plain(); void function_plain(); void function() void function() 2) Determine best implementatjon { { if (/* CPU supports SSE2 */) if (/* CPU supports SSE2 */) function_sse2(); 3) Run it function_sse2(); else else function_plain(); function_plain(); } } void function_sse2(); void function_sse2(); With GCC 4.8: void function_plain(); void function_plain(); (doesn't work with void function() void function() Clang, ICC or MSVC) { { if (__builtin_cpu_supports("sse2")) if (__builtin_cpu_supports("sse2")) function_sse2(); function_sse2(); else else function_plain(); function_plain(); } } 17

  18. Identifying the CPU • Running CPUID lefu as an exercise to the reader • Just remember: cache the result extern int qt_cpu_features; extern int qt_cpu_features; CPUID goes extern void qDetectCpuFeatures(void); extern void qDetectCpuFeatures(void); here static inline int qCpuFeatures() static inline int qCpuFeatures() { { int features = qt_cpu_features; int features = qt_cpu_features; if (Q_UNLIKELY(features == 0)) { if (Q_UNLIKELY(features == 0)) { qDetectCpuFeatures(); qDetectCpuFeatures(); features = qt_cpu_features; features = qt_cpu_features; } } return features; return features; } } 18

  19. Checking surrounding code 19

  20. Putting it together • Result on 64-bit: unconditjonal call to the SSE2 version void function_sse2(); void function_sse2(); void function_plain(); void function_plain(); void function() void function() { { if (qCpuHasFeature(SSE2)) if (qCpuHasFeature(SSE2)) function_sse2(); function_sse2(); else else function_plain(); function_plain(); } } 20

  21. Asking the linker and dynamic linker for help • Requires: – Glibc 2.11.1, Binutjls 2.20.1, GCC 4.8 / ICC 14.0 Magic – Not supported with Clang or on Android (due to Bionic) goes here void *memcpy(void *, const void *, size_t) void *memcpy(void *, const void *, size_t) void *memcpy(void *, const void *, size_t) void *memcpy(void *, const void *, size_t) __attribute__((ifunc("resolve_memcpy"))); __attribute__((ifunc("resolve_memcpy"))); __attribute__((ifunc("resolve_memcpy"))); __attribute__((ifunc("resolve_memcpy"))); decltype(memcpy) memcpy_avx, memcpy_sse2; void *memcpy_avx(void *, const void *, size_t); decltype(memcpy) memcpy_avx, memcpy_sse2; void *memcpy_avx(void *, const void *, size_t); void *memcpy_sse2(void *, const void *, size_t); void *memcpy_sse2(void *, const void *, size_t); static void *(*resolve_memcpy(void))(void *, const void *, size_t) auto resolve_memcpy() auto resolve_memcpy() static void *(*resolve_memcpy(void))(void *, const void *, size_t) { { { { return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2; return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2; return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2; return qCpuHasFeature(AVX) ? memcpy_avx : memcpy_sse2; } } } } 21

  22. GCC 4.9 auto-dispatcher (a.k.a. “Function Multi Versioning”) • C++ only! __attribute__((target("popcnt"))) __attribute__((target("popcnt"))) int bitcount() int bitcount() { { int result = 0; int result = 0; for (int i = 0; i < sizeof(data); i += 4) for (int i = 0; i < sizeof(data); i += 4) result += __builtin_popcount(*(uint*)(data + i)); result += __builtin_popcount(*(uint*)(data + i)); return result; return result; } } __attribute__((target("default"))) __attribute__((target("default"))) int bitcount() int bitcount() { { int result = 0; int result = 0; for (int i = 0; i < sizeof(data); i += 2) for (int i = 0; i < sizeof(data); i += 2) result += bitcount_table[*(ushort*)(data + i)]; result += bitcount_table[*(ushort*)(data + i)]; return result; return result; } } 22

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend