OPTIMIZING BUILDS ON WINDOWS: SOME PRACTICAL CONSIDERATIONS

Alexandre Ganea, Ubisoft (alexandre.ganea@ubisoft.com)
2019 Bay Area LLVM Developers' Meeting, Oct. 22-23


slide-1
SLIDE 1

OPTIMIZING BUILDS

ON WINDOWS

SOME PRACTICAL CONSIDERATIONS

Alexandre Ganea, Ubisoft alexandre.ganea@ubisoft.com 2019 Bay Area LLVM Developers' Meeting, Oct.22-23

slide-2
SLIDE 2

SUMMARY

PART 1

PREAMBLE

PART 2

EXPERIMENTS

PART 3

PROPOSAL

PART 4

NEXT STEPS

slide-3
SLIDE 3

PART 1

PREAMBLE:

CHALLENGES

slide-4
SLIDE 4

PART 1 – PREAMBLE

Lines of Code (Assassin's Creed, Far Cry)

slide-5
SLIDE 5

Game production constraints @ Ubisoft

  • Concurrent AAA games: 20 – 25
  • LoC/game: 30 – 50 M
  • Programmers/title: 100 – 250
  • Code changes/day: 100 – 150 (peak: 400)
  • Build targets/platform: 5 – 6
  • Platforms/game: 4+
  • Code workspace: 70 – 100 GB
  • Data workspace: 100 – 200 GB
  • Game builds/day: 100 – 150
  • Stripped build: 1 – 6 GB
  • Final build: 50 – 90 GB

Editor Build

  • 20,000 .CPP
  • 25,000 .H
  • 23 GB .OBJ
  • 9 GB .DEBUG$T
  • 10 M type records
  • 42 M symbols
  • 300 M .EXE
  • 2 GB .PDB
  • Windows 10
  • Fastbuild, distributed
  • Always Unity builds

slide-6
SLIDE 6

AAA GAME, CLEAN REBUILD X64 EDITOR RELEASE (FASTBUILD)

  • 2017 (MSVC): compiler 08 min 50 sec, linker 10 min 20 sec
  • 2018 (MSVC): compiler 08 min 33 sec, linker 06 min 46 sec
  • Fall 2018 (MSVC + LLD): compiler 08 min 50 sec, linker 01 min 18 sec
  • 2019 (MSVC + LLD): compiler 08 min 20 sec, linker 43 sec
  • 2019 (Clang): compiler 07 min 00 sec, linker 29 sec
  • 100% cache hit, local SSD: compiler 04 min 00 sec, linker 29 sec
  • 100% cache hit, 1 Gbps network: compiler 04 min 15 sec, linker 29 sec

slide-7
SLIDE 7

PART 2

EXPERIMENTS

slide-8
SLIDE 8

2.1 Clang-scan-deps & Fastbuild cache

PART 2 – EXPERIMENTS
slide-9
SLIDE 9

FASTBUILD CACHE READ ALGORITHM

[Diagram: two ways to look up a.cpp in the cache]

  • a.cpp -> clang-cl /E -> md5sum -> curl https://store/ -> found, or not found -> clang-cl
  • a.cpp -> clang-scan-deps -> deps.txt -> while read x; do md5sum $x; done -> deps+MD5.txt

Timings shown on the diagram: 5-10 sec, 0.02 sec, 0.02 sec
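
The dependency-based lookup on this slide can be sketched as below. This is a minimal illustration under stated assumptions, not Fastbuild's implementation: `hashBytes` and `cacheKeyFromDeps` are hypothetical names, and FNV-1a stands in for MD5 so the example needs only the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// FNV-1a, used here only as a stand-in for MD5 (assumption of this sketch).
inline uint64_t hashBytes(const std::string &Bytes,
                          uint64_t Seed = 14695981039346656037ULL) {
  uint64_t H = Seed;
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

// Folds the name and content of every dependency, in order, into one key.
// Returns false if a dependency cannot be opened.
inline bool cacheKeyFromDeps(const std::vector<std::string> &Deps,
                             uint64_t &Key) {
  Key = 14695981039346656037ULL;
  for (const std::string &Path : Deps) {
    std::ifstream In(Path, std::ios::binary);
    if (!In)
      return false;
    std::string Content((std::istreambuf_iterator<char>(In)),
                        std::istreambuf_iterator<char>());
    Key = hashBytes(Path, Key);    // the file name is part of the key
    Key = hashBytes(Content, Key); // and so is the file content
  }
  return true;
}
```

The point of the design is that the key depends only on the files a translation unit reads, so it can be computed without running the preprocessor.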

slide-10
SLIDE 10

100% NETWORK CACHE HITS

AAA GAME, X64 EDITOR RELEASE (FASTBUILD)

  • VS2017 15.9.16: compiler/cache 06 min 10 sec, linker 40 sec
  • Network cache: compiler/cache 04 min 05 sec, linker 40 sec
  • Network cache + clang-scan-deps: compiler/cache 35 sec, linker 40 sec
slide-11
SLIDE 11

clang-scan-deps + network cache, LLD (MSVC OBJs + ghash)

7 GB -> 22.6 GB, 50k files

Figure axis unit: ms

Intel Xeon W-2135 @ 3.7 GHz, 128 GB, NVMe SSD, 1 Gbps network

slide-12
SLIDE 12

2.2 StringMap

slide-13
SLIDE 13

11.5% process time

CLANG-SCAN-DEPS STANDALONE (50K FILES)

avg ~90% cpu

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

slide-14
SLIDE 14

STRINGMAP

slide-15
SLIDE 15

STRINGMAP

slide-16
SLIDE 16

DOWN THE RABBIT HOLE

  sizeof(std::error_code) -> 16 bytes
  sizeof(llvm::ErrorOr<DirectoryEntry&>) -> 24 bytes
  sizeof(llvm::StringMapEntry<llvm::ErrorOr<DirectoryEntry&>>) -> 32 bytes (+string contents)
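
One way to see where those bytes could come from is to rebuild the nesting with simplified stand-ins. The sketch below does not use the LLVM types: `ErrorOrRef` and `EntryHeader` are hypothetical structs that only mimic the shape (value-or-error plus a flag, then a key length in front), and the sizes they produce depend on the target; on typical 64-bit targets they come out as 16, 24 and 32 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <system_error>

struct DirectoryEntry; // opaque here; only a pointer to it is stored

// Simplified stand-in for llvm::ErrorOr<DirectoryEntry&>: either a pointer to
// the value or an error code, plus a flag saying which one is active.
struct ErrorOrRef {
  union {
    DirectoryEntry *Value;
    std::error_code Error;
  };
  bool HasError;
  ErrorOrRef() : Value(nullptr), HasError(false) {}
};

// Simplified stand-in for a map entry header: the key length, then the value.
// The key characters would follow the header in the same allocation.
struct EntryHeader {
  std::size_t KeyLength;
  ErrorOrRef Value;
};
```

The one-byte flag costs a full pointer-sized slot because of alignment, which is how a 16-byte error code grows into a 24-byte value and a 32-byte entry header.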

slide-17
SLIDE 17

STRINGMAP: MEMORY LAYOUT

  • Bucket array: NumBuckets entries of type StringMapEntry* (e.g. nullptr nullptr nullptr 0x15f238a92 nullptr nullptr nullptr)
  • Hash array: NumBuckets entries of type uint32_t (e.g. 0x12345678)
  • StringMapEntry: count (size_t), value (T), string (count characters)

slide-18
SLIDE 18

STRINGMAP (VTUNE)

slide-19
SLIDE 19

STRINGMAP STATS

[Two histograms; data labels listed in chart order, axis ticks omitted]

  • Hash collisions / call: 60.2%, 14.7%, 8.0%, 5.3%, 3.5%, 1.7%, 0.8%, 0.5%, 0.2%, 0.1%, 0.1%
  • Cachelines hit / call: 79.4%, 11.7%, 3.5%, 1.6%, 1.0%, 0.5%, 0.3%, 0.2%, 0.1%, 0.1%

187 M samples

slide-20
SLIDE 20

DenseMap<uint64_t,T> + xxHash64() + StringSaver
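
A rough sketch of the idea on this slide, under loud assumptions: `std::unordered_map` stands in for `DenseMap`, FNV-1a stands in for `xxHash64()`, and a `std::deque<std::string>` stands in for `StringSaver`. `HashedStringMap` and `hash64` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>

// FNV-1a, standing in for xxHash64() (assumption of this sketch).
inline uint64_t hash64(const std::string &S) {
  uint64_t H = 14695981039346656037ULL;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

// A map keyed by the 64-bit hash of a string instead of the string itself.
template <typename T> class HashedStringMap {
  std::unordered_map<uint64_t, T> Map; // stand-in for DenseMap<uint64_t, T>
  std::deque<std::string> Saved;       // stand-in for StringSaver

public:
  // Inserts Value unless an entry with the same key hash already exists.
  // Returns true if a new entry was created.
  bool insert(const std::string &Key, T Value) {
    bool Inserted = Map.emplace(hash64(Key), std::move(Value)).second;
    if (Inserted)
      Saved.push_back(Key); // keep the key text alive next to the map
    return Inserted;
  }

  // Returns the stored value for Key, or nullptr if there is none.
  const T *lookup(const std::string &Key) const {
    auto It = Map.find(hash64(Key));
    return It == Map.end() ? nullptr : &It->second;
  }

  std::size_t size() const { return Map.size(); }
};
```

Lookups compare fixed-size integers and never touch the key characters, which is the cache-friendly part. The cost is that two different keys with the same 64-bit hash would alias silently, since the strings are never compared.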

slide-21
SLIDE 21

DenseMap<__int128,T> + XXH128() + StringSaver

slide-22
SLIDE 22

2.4 Multithreading LLD (COFF driver)

slide-23
SLIDE 23

LINK AAA GAME, X64 EDITOR RELEASE (22.8GB MSVC OBJS)

  • VS2019 16.2: 58 sec
  • LLD 9.0: 62 sec
  • LLD 8 + // GHASH: 49 sec

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

slide-24
SLIDE 24

  • Clang 9.0, no Ghash: 19.21 s
  • Clang 8.0 + // Ghash (12-byte buckets): 7.42 s
  • Clang 8.0 + // Ghash (8-byte buckets): 5.29 s
  • Clang 8.0 + // Ghash (8-byte buckets) + 2MB pages: 4.20 s

Bucket layouts:

  • 12-byte bucket: GHash (uint64_t) + TypeIndex (uint32_t)
  • 8-byte bucket: GHash + TypeIndex packed in one uint64_t
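
To illustrate why an 8-byte bucket helps a parallel table, here is a hedged sketch, not LLD's code: the hash tag and the type index are packed into one 64-bit word so that a bucket can be claimed with a single compare-and-swap. The packing (upper 32 bits of the hash as a tag, lower 32 bits as index + 1) and the name `GHashTable8` are assumptions made for this example; matching on a 32-bit tag alone is lossy.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class GHashTable8 {
  std::vector<std::atomic<uint64_t>> Buckets; // 0 means "empty"

public:
  explicit GHashTable8(std::size_t NumBuckets) : Buckets(NumBuckets) {
    for (auto &B : Buckets)
      B.store(0, std::memory_order_relaxed);
  }

  // Returns the type index registered for Hash: NewIndex if this call
  // inserted it, otherwise the index stored by an earlier insertion.
  // Assumes the table never fills up and NewIndex < 0xffffffff.
  uint32_t insertOrGet(uint64_t Hash, uint32_t NewIndex) {
    const uint64_t Tag = Hash >> 32;
    const uint64_t Packed = (Tag << 32) | (uint64_t(NewIndex) + 1);
    std::size_t I = Hash % Buckets.size();
    for (;;) {
      uint64_t Cur = Buckets[I].load(std::memory_order_acquire);
      if (Cur == 0) {
        if (Buckets[I].compare_exchange_strong(Cur, Packed,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire))
          return NewIndex;
        // CAS failed: Cur now holds what another thread stored here.
      }
      if ((Cur >> 32) == Tag)
        return uint32_t(Cur & 0xffffffffULL) - 1;
      I = (I + 1) % Buckets.size(); // linear probing
    }
  }
};
```

A 12-byte bucket cannot be updated with one 64-bit atomic operation, so shrinking the bucket both removes that obstacle and fits more buckets per cache line.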
slide-25
SLIDE 25

2.3 Process Creation

slide-26
SLIDE 26

COMPILING WITH CLANG 9.0

slide-27
SLIDE 27

CLANG CC1 IN PROCMON

93 ms

slide-28
SLIDE 28

MAKING CC1 REENTRANT

clang/tools/driver/driver.cpp

  int main(int argc_, const char **argv_) {
    noteBottomOfStack();
    llvm::InitLLVM X(argc_, argv_);
    SmallVector<const char *, 256> argv(argv_, argv_ + argc_);
    if (llvm::sys::Process::FixupStandardFileDescriptors())
      return 1;
    llvm::InitializeAllTargets();
    return ClangDriverMain(argv);
  }

  int ClangDriverMain(SmallVectorImpl<const char *>& argv) {
    static LLVM_THREAD_LOCAL bool EnterPE = true;
    if (EnterPE) {
      // First entry: publish the entry point so it can be looked up later.
      llvm::sys::DynamicLibrary::AddSymbol("ClangDriverMain", (void*)(i..
      EnterPE = false;
    } else {
      // Re-entry: clear option state left over from the previous run.
      llvm::cl::ResetAllOptionOccurrences();
    }
    auto TargetAndMode = ToolChain::getTargetAndModeFromProgramName(arg..

clang/lib/Driver/Job.cpp

  int Command::Execute(ArrayRef<llvm::Optional<StringRef>> Redirects,
                       std::string *ErrMsg, bool *ExecutionFailed) const {
    [...]
    typedef int (*ClangDriverMainFunc)(SmallVectorImpl<const char *> &);
    ClangDriverMainFunc ClangDriverMain = nullptr;
    [...]
    if (ClangDriverMain) {
      [...]
      // In-process path: call the driver entry point under crash recovery.
      llvm::CrashRecoveryContext CRC;
      CRC.EnableExceptionHandler = true;
      const void *PrettyState = llvm::SavePrettyStackState();
      int Ret = 0;
      auto ExecuteClangMain = [&]() { Ret = ClangDriverMain(Argv); };
      if (!CRC.RunSafely(ExecuteClangMain)) {
        llvm::RestorePrettyStackState(PrettyState);
        return CRC.RetCode;
      }
      return Ret;
    } else {
      // Fallback: spawn a separate process as before.
      auto Args = llvm::toStringRefArray(Argv.data());
      return llvm::sys::ExecuteAndWait(Executable, Args, Env, Redirects,
                                       /*secondsToWait*/ 0,
                                       /*memoryLimit*/ 0, ErrMsg,
                                       ExecutionFailed);
    }
  }
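
The pattern on this slide (call a registered entry point in-process when one exists, otherwise spawn a process) can be reduced to a standalone sketch. Everything named here is hypothetical: `inProcessTools` and `execute` are illustration-only helpers, and the sketch leaves out what the slide's code adds on top, namely crash recovery and resetting global option state between runs.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <map>
#include <string>
#include <vector>

using ToolMain = std::function<int(const std::vector<std::string> &)>;

// Registry of tools that can be entered as a plain function call.
inline std::map<std::string, ToolMain> &inProcessTools() {
  static std::map<std::string, ToolMain> Tools;
  return Tools;
}

// Runs Tool with Args: in-process if an entry point is registered,
// otherwise by spawning a child process through the shell.
inline int execute(const std::string &Tool,
                   const std::vector<std::string> &Args) {
  auto It = inProcessTools().find(Tool);
  if (It != inProcessTools().end())
    return It->second(Args); // no process creation on this path

  std::string Cmd = Tool;
  for (const std::string &A : Args)
    Cmd += " " + A; // no quoting or escaping: illustration only
  return std::system(Cmd.c_str());
}
```

For example, after `inProcessTools()["cc1"] = someEntryPoint;`, a call to `execute("cc1", Args)` becomes a function call. The hard part in practice is what the sketch skips: the callee must tolerate being entered more than once in the same process.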

slide-29
SLIDE 29

CLANG DRIVER & CC1 MERGED

slide-30
SLIDE 30

BYPASSING THE CC1 PROCESS: CLEAN REBUILD LLVM, CLANG & LLD

  • 6-core - W10 build 1803: VS2019 16.2 34 min 00 sec, Clang 9.0 32 min 30 sec, Clang 9.0 + cc1 bypass 22 min 46 sec
  • 6-core - W10 build 1903: VS2019 16.2 28 min 00 sec, Clang 9.0 30 min 16 sec, Clang 9.0 + cc1 bypass 19 min 54 sec
  • 36-core - W10 build 1709: VS2019 16.2 12 min 00 sec, Clang 9.0 13 min 10 sec, Clang 9.0 + cc1 bypass 07 min 10 sec
slide-31
SLIDE 31

2.5 CRT Allocator

slide-32
SLIDE 32

LINKING RAINBOW6: SIEGE WITH THINLTO :-(

96% idle 4%
slide-33
SLIDE 33

THINLTO: ALLOCATOR CONTENTION

slide-34
SLIDE 34

REPLACING THE CRT ALLOCATOR

  $ LD_PRELOAD=/path/to/my/malloc.so /bin/ls

llvm/lib/Support/Windows/Memory.inc

  #include <cstddef>
  #include "rpmalloc/rpmalloc.c"

  // Route the CRT allocation entry points to rpmalloc.
  extern "C" {
  _ACRTIMP _CRTRESTRICT void *malloc(size_t size) { return rpmalloc(size); }
  _ACRTIMP void free(void *p) { rpfree(p); }
  _ACRTIMP _CRTRESTRICT void *calloc(size_t n, size_t elem_size) {
    return rpcalloc(n, elem_size);
  }
  _ACRTIMP _CRTRESTRICT void *realloc(void *ptr, size_t size) {
    return rprealloc(ptr, size);
  }
  }

  // Bypass CRT debug allocator
  #ifdef _DEBUG
  void *operator new(decltype(sizeof(0)) n) noexcept(false) { return malloc(n); }
  void __CRTDECL operator delete(void *const block) noexcept { free(block); }
  // Dynamic exception specifications were removed in C++17, so the array
  // forms use noexcept(false) / noexcept here.
  void *operator new[](std::size_t s) noexcept(false) { return malloc(s); }
  void operator delete[](void *p) noexcept { free(p); }
  #endif

https://github.com/mjansson/rpmalloc

slide-35
SLIDE 35

THINLTO (CLEAN REBUILD) RAINBOW 6: SIEGE, PC GAME PROFILE

  • 6-core (W10 build 1903): VS 2017 15.9.16 57 min 00 sec, Clang 9.0 ThinLTO 20 min 13 sec, Clang 9.0 ThinLTO + rpmalloc 16 min 19 sec
  • 36-core (W10 build 1709): VS 2017 15.9.16 37 min 12 sec, Clang 9.0 ThinLTO > 1 h 30 min, Clang 9.0 ThinLTO + rpmalloc 03 min 57 sec
slide-36
SLIDE 36

PART 3

PROPOSAL

PROOF-OF-CONCEPT

slide-37
SLIDE 37

PART 3 – PROPOSAL

FASTBUILD

clang.exe lld-link.exe llvm-tblgen.exe clang-tblgen.exe llvm-lib.exe ml64.exe (masm) rc.exe cmake.exe

PREVIOUS BUILD PROCESS

slide-38
SLIDE 38

Maybe there’s a better way

slide-39
SLIDE 39

Image Credit: Caterpillar
slide-40
SLIDE 40

FASTBUILD

ml64.exe (masm) rc.exe cmake.exe

LLVM-BUILDOZER

clang.exe lld-link.exe llvm-tblgen.exe clang-tblgen.exe llvm-lib.exe

BUILDING WITH BUILDOZER

slide-41
SLIDE 41

FASTBUILD

ml64.exe (masm) rc.exe cmake.exe

Worker 1 Worker 2 Worker 3 Worker 4 Worker 5 Local Local Local

slide-42
SLIDE 42

RUNNING THE DOZER

  int buildozer::ImportEXE(llvm::StringRef EXE) {
    [..]
    // LoadLibraryA expects a null-terminated string; StringRef::data()
    // does not guarantee one, so EXE must refer to such a string.
    HINSTANCE H = LoadLibraryA(EXE.data());
    if (!H)
      return 0;
    RemapIAT(H); // IAT: Import Address Table
    InitDebInfo();
    PatchRPMalloc(M);
    InitializeStaticTLS(H);
    InitializeCRT(M);
    FindEntryPoints(M);
    [..]
  }

"LoadLibrary can also be used to load other executable modules. [..] However, do not use LoadLibrary to run an .exe file. Instead, use the CreateProcess function." (MSDN)

slide-43
SLIDE 43

RUNNING THE DOZER

  int buildozer::ImportEXE(llvm::StringRef EXE) {
    [..]
    HINSTANCE H = LoadLibraryA(EXE.data());
    if (!H)
      return 0;
    RemapImportAddressTable(H);
    InitDebInfo();
    PatchRPMalloc(M);
    InitializeStaticTLS(H);
    InitializeCRT(M);
    FindEntryPoints(M);
    [..]
  }

slide-44
SLIDE 44

RUNNING THE DOZER

  Pool.emplace(NumWorkers, [&]() {
    while (true) {
      buildozer::WorkUnit *WU = AcquireWork(..);
      if (!WU)
        break;
      int Mod = IdentifyMOD(WU);
      llvm::CrashRecoveryContext CRC;
      CRC.RunSafely([&] {
        buildozer::Launch(Mod, WU->Directory, WU->Arguments);
      });
      [..]
    }
  });
  Pool.join();
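
The worker loop on this slide can be approximated with the standard library alone. This is a sketch under assumptions, not Buildozer's code: `WorkUnit`, `runAll` and the `Launch` callback are hypothetical stand-ins, work is handed out through an atomic counter instead of `AcquireWork(..)`, and the crash recovery wrapper is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical work unit: where to run and with which arguments.
struct WorkUnit {
  std::string Directory;
  std::vector<std::string> Arguments;
  int ExitCode = -1; // filled in by the worker that ran the unit
};

// Runs Launch(Unit) for every unit, using NumWorkers threads that pull
// the next unprocessed unit until none are left.
template <typename LaunchFn>
void runAll(std::vector<WorkUnit> &Units, unsigned NumWorkers,
            LaunchFn Launch) {
  std::atomic<std::size_t> Next{0};
  auto Worker = [&] {
    for (;;) {
      std::size_t I = Next.fetch_add(1);
      if (I >= Units.size())
        break; // no more work
      Units[I].ExitCode = Launch(Units[I]);
    }
  };
  std::vector<std::thread> Pool;
  for (unsigned W = 0; W < NumWorkers; ++W)
    Pool.emplace_back(Worker);
  for (std::thread &T : Pool)
    T.join();
}
```

Each unit is claimed by exactly one worker, so writing `ExitCode` needs no lock; the results are safe to read once `runAll` has joined all threads.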

slide-45
SLIDE 45

RUNNING THE DOZER

slide-46
SLIDE 46

Local build, AAA game, x64 Editor Release

  • Clang 9.0: 19 min 34 sec
  • Buildozer: 11 min 53 sec

Intel Xeon W-2135 @ 3.7 GHz (6-core), 128 GB, NVMe SSD

slide-47
SLIDE 47

PART 4

NEXT STEPS

slide-48
SLIDE 48

SHORT TERM

PART 4 – NEXT STEPS
  • Remove OS jitter (in-RAM file content & stat cache)
  • OBJ cache (in-RAM)
  • Clang-LLD in-memory bridge
  • Incrementally link along the way
  • Incrementally compile along the way (SN Systems’ Program Repository)
  • Remote API for distribution & caching
slide-49
SLIDE 49

LONG TERM


BUILD TARGET

slide-50
SLIDE 50

LONG TERM

PLATFORM

BUILD TARGET (x6)

slide-51
SLIDE 51

LONG TERM

PLATFORM (x5), each with BUILD TARGET (x6)

slide-52
SLIDE 52

LONG TERM

PLATFORM (x5), each with BUILD TARGET (x6)

DAILY COMMITS

slide-53
SLIDE 53

LONG TERM

PLATFORM (x5), each with BUILD TARGET (x6)

DAILY COMMITS, ACTIVE BRANCHES

slide-54
SLIDE 54

LONG TERM

PLATFORM (x5), each with BUILD TARGET (x6)

DAILY COMMITS, ACTIVE BRANCHES, GAME PRODUCTION

slide-55
SLIDE 55

LONG TERM

PLATFORM (x5), each with BUILD TARGET (x6)

DAILY COMMITS, ACTIVE BRANCHES, GAME PRODUCTION

Multipliers shown on the slide: 5 min, x6, x6, x100, x4, x20
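
Read as a single product, the multipliers on this slide add up quickly. The sketch below only does that arithmetic; treating all the factors as multiplying one 5-minute build is an assumption of this example, and the constant names are made up.

```cpp
#include <cassert>

// 5 minutes per build, multiplied by every factor shown on the slide.
constexpr long long TotalMinutes = 5LL * 6 * 6 * 100 * 4 * 20;
constexpr long long TotalHours = TotalMinutes / 60;
constexpr long long MachineDays = TotalHours / 24;

static_assert(TotalMinutes == 1440000, "5 * 6 * 6 * 100 * 4 * 20");
static_assert(TotalHours == 24000, "1,440,000 minutes in hours");
static_assert(MachineDays == 1000, "24,000 hours in days");
```

Under that reading, one day of production would ask for about 1,000 machine-days of build time, which is the scale the closing question is about.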

slide-56
SLIDE 56

Is there a better way?

slide-57
SLIDE 57

THANK YOU

slide-58
SLIDE 58

Q&A


Alexandre Ganea, Ubisoft alexandre.ganea@ubisoft.com