SQL Server on Linux, will it perform?
Slava Oks
Thank You!
Microsoft Research, Windows team, Midori
Our goal is to make SQL Server perform and scale on any platform or hardware of the customer's choice
Prologue: Meet the PALs
Intro to Drawbridge: A container technology to achieve isolation, security and density in the cloud
- Modified Windows kernel to run in user mode, aka Library OS or LibOS
- Designed for running on Windows and leverages the Pico-process feature
- A Pico-process is an NT process with an empty address space
NT process (shared address space): user32, gdi32 and ntdll reach the host OS (ntoskrnl, win32k) through 400+ NT calls and 800+ Win32 calls
Picoprocess (isolated address space): the PAL crosses the ABI boundary to the host OS (security monitor, ntoskrnl) through 45 calls
- All 1200+ system calls blocked from user mode (NTOS and win32k)
- Enforced by a 35-line change to KiSystemServiceHandler
- No perf impact to other processes: leverages the "slow path" used by UMS
- 45 new system calls added to the process (Drawbridge system calls)
- Even hard-coded traps can't break out
LibOS: A user mode runtime library exposing semantics of Windows kernel
Diagram components (layers NT UM, DRTL, PAL): Network Stack, I/O, Object Manager, Process Manager, Simple Heap, Union FS, AFD, Wait Pool Threads, Memory Manager, Loader, PEB/TEB, ABI Handler, Sync Objects, Threads, Streams, Memory Manager, APC
- Storage Manager
  - Asynchronous I/O is submitted to the host and registered with WaitPool threads
  - On completion, WaitPool threads deliver I/Os to the original thread through APC
  - Original threads deliver I/Os to their final destination
- Network Manager
  - Custom version of AFD (WinSock semantics) with a thread pool
  - AFD asynchronous I/O is submitted to the host and registered with WaitPool threads
  - On completion, WaitPool threads deliver I/Os to AFD threads through APC
  - AFD threads deliver network requests through APC to the original threads that initiated the I/O
  - Original threads deliver I/Os to their final destination
- I/O General
  - No proper support for Scatter/Gather
- Memory Manager
  - Global Virtual Address Descriptor (VAD) list
  - Global Heap
- Object Manager
  - Global Directory
- Process Manager
  - Per process runtime libraries, no image sharing
- Threads
  - APC "injection" through polling
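The APC hand-off described above (a WaitPool thread queues work, the target thread picks it up by polling) can be pictured with a small queue. The following is only a sketch under stated assumptions: the names `apc_queue`, `apc_enqueue`, `apc_drain` and `apc_record` are hypothetical and do not come from LibOS, and the lock-free list built on C11 atomics is an illustrative choice, not the mechanism the slides describe.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not LibOS code. */
typedef void (*apc_fn)(void *arg);

typedef struct apc {
    struct apc *next;
    apc_fn      fn;
    void       *arg;
} apc;

typedef struct {
    _Atomic(apc *) head;            /* LIFO list of pending APCs */
} apc_queue;

void apc_queue_init(apc_queue *q)
{
    atomic_init(&q->head, (apc *)NULL);
}

/* Called by any thread (for example a WaitPool thread) to queue work. */
void apc_enqueue(apc_queue *q, apc *a, apc_fn fn, void *arg)
{
    a->fn = fn;
    a->arg = arg;
    a->next = atomic_load(&q->head);
    while (!atomic_compare_exchange_weak(&q->head, &a->next, a)) {
        /* a->next now holds the current head; retry the push */
    }
}

/* Called by the owning thread at a polling point. Returns APCs run. */
int apc_drain(apc_queue *q)
{
    apc *list = atomic_exchange(&q->head, (apc *)NULL);
    apc *fifo = NULL;
    int  ran  = 0;

    /* Reverse the LIFO list so APCs run in the order they were queued. */
    while (list != NULL) {
        apc *next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    while (fifo != NULL) {
        apc *next = fifo->next;     /* read before fn may reuse the node */
        fifo->fn(fifo->arg);
        fifo = next;
        ran++;
    }
    return ran;
}

/* Example APC body: appends the int behind arg to a small log. */
int apc_log[8];
int apc_log_len = 0;

void apc_record(void *arg)
{
    if (apc_log_len < 8) {
        apc_log[apc_log_len++] = *(int *)arg;
    }
}
```

In this sketch every delivery costs a queue operation plus a wait until the owner next polls, which is the kind of extra hop that the per-thread APC delivery in the list above implies.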
SQL OS (SOS): A user mode runtime library providing performance, scalability and diagnostic foundation for SQL Server
Diagram: each Memory Node contains CPU Nodes; each CPU Node has a Network Manager and several Schedulers, and each Scheduler has its own Storage Manager
- Network Manager
  - I/O completion port/thread per CPU Node
  - Asynchronous delivery
- Storage Manager
  - I/O queue per scheduler
  - Synchronous delivery through periodic polling
- Memory Manager / Object Manager / Scheduling Manager
  - NUMA awareness
  - Partitioned heaps
  - Non-preemptive scheduling & User Mode Threads
  - Synchronization primitives
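The Storage Manager item (an I/O queue per scheduler, completed by periodic polling) can be illustrated as follows. This is a minimal sketch, assuming hypothetical names (`scheduler`, `io_request`, `scheduler_poll_io`, `io_mark_done`) that are not SQLOS APIs; the completion flag stands in for the overlapped structure that the real code polls.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not SQLOS code. */
typedef struct io_request {
    struct io_request *next;
    atomic_int         done;        /* set by the completion side */
    void (*on_complete)(struct io_request *);   /* may be NULL */
} io_request;

typedef struct {
    io_request *pending;            /* touched only by the owning scheduler */
    int         completed;          /* running total of completed requests */
} scheduler;

void io_request_init(io_request *r, void (*cb)(io_request *))
{
    r->next = NULL;
    atomic_init(&r->done, 0);
    r->on_complete = cb;
}

/* The scheduler records an I/O it has just issued. */
void scheduler_submit(scheduler *s, io_request *r)
{
    r->next = s->pending;
    s->pending = r;
}

/* Completion side (host, device emulation, ...) flags the request. */
void io_mark_done(io_request *r)
{
    atomic_store_explicit(&r->done, 1, memory_order_release);
}

/* Called by the scheduler at a yield point. Returns requests completed. */
int scheduler_poll_io(scheduler *s)
{
    int n = 0;
    io_request **link = &s->pending;

    while (*link != NULL) {
        io_request *r = *link;
        if (atomic_load_explicit(&r->done, memory_order_acquire)) {
            *link = r->next;        /* unlink before completing */
            r->next = NULL;
            if (r->on_complete != NULL) {
                r->on_complete(r);
            }
            n++;
        } else {
            link = &r->next;
        }
    }
    s->completed += n;
    return n;
}
```

Because only the owning scheduler walks its list, completion in this sketch needs no lock and no thread switch, which matches the non-preemptive model in the list above.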
Chapter 1: SQL & PALs
The marriage in heaven or…
SQL Server On Top Of PALs
Ring 3: SQL Server, SQLOS, Win32, LibOS, PAL
Ring 0: Linux Kernel
Drawbridge Technologies   SQL   LibOS   Host Extension
Object Management         ✔     ✔       ✔
Memory Management         ✔     ✔       ✔
Threading/Scheduling      ✔     ✔       ✔
Synchronization           ✔     ✔       ✔
I/O (Disk, Network)       ✔     ✔       ✔
Chapter 2: The sign is on the wall
Introducing Intelligent Hacks
- Kernel aio
- Pump threads vs WaitPool threads
- Fast I/O
    // We can do Fast I/O if and only if it follows rules employed by SQL Server
    // disk I/O: which is delivered nonpreemptively through polling an overlapped
    // data structure
    //  - I/O is asynchronous
    //  - No user mode APC required
    //  - No I/O completion port specified
    //  - Contains an event to be signaled (leveraged by SQL Server to wake up idle scheduler)
    //  - Disk I/O
    //
    canDoFastIO = WaitForCompletion == FALSE;
    canDoFastIO = canDoFastIO && (ApcRoutine == NULL && FileObject != NULL);
    canDoFastIO = canDoFastIO && (Args->SkipCompletionPort ||
                                  NtpGetCompletionPortObject(FileObject, &CompletionKey) == NULL);
    canDoFastIO = canDoFastIO && (Args->EventObject != NULL && IoStatusBlock != NULL);
    canDoFastIO = canDoFastIO && (NtpGetObjectType(Args->Object) == NTUM_FILE &&
                                  NtpIsIoAsynchronous(Args->Object));
    canDoFastIO = canDoFastIO && ((FileObject->Type & NtpSeekableFile) &&
                                  (Type == NTUM_IO_READ || Type == NTUM_IO_WRITE ||
                                   Type == NTUM_IO_WRITE_GATHER || Type == NTUM_IO_READ_SCATTER));
    // If it is Gather/Scatter I/O then length can't exceed DK_UIO_MAXIOV supported by the Host
    //
    canDoFastIO = canDoFastIO && (!(Type == NTUM_IO_WRITE_GATHER || Type == NTUM_IO_READ_SCATTER) ||
                                  Length <= DK_UIO_MAXIOV);
- Pump threads vs WaitPool
- Fast I/O ~ AFD pass through
- SQLOS completion threads are pump threads ~ no context switch on completion
    // Complete I/Os received via the IOPort are submitted to the I/O
    // completion port queue
    Status = NtpTryToProcessIoCompletion(IoCompletionPort, IoCompletionInformation);
    // Process any APCs or interruptions for this thread.
    //
    NtpProcessKernelApc(threadObject);
    Request.IOPort = IoCompletionPort->IOPort;
    Request.PendingIOs = &PendingIOs;
    Status = DrtlReadStreamSync(IoCompletionPort->Stream, 0, 0, (PVOID)&Request, NULL);
    while (PendingIOs != NULL) {
        //
        // Remember I/O to complete and move to the next I/O before
        // we complete the current one since by the time we return from
        // completion routine the completed I/O will be freed
        //
        CompletedIO = PendingIOs;
        PendingIOs = (PDK_ASYNC_RESULTS_LINKED)PendingIOs->Next;
        //
        // Complete I/O
        //
        NtpCompleteNetworkIoRequest((PNTUM_IO_REQUEST)CompletedIO->Request);
    }
- Multiple Heaps
- I/O Request free list per thread
- Per process Virtual Address Space Manager
- NUMA support
- Processor Affinity
    PVOID
    DrtlAllocate(
        __in ULONG Flags,
        __in SIZE_T Size,
        __in ULONG Tag
        )
    {
        ULONG heapIdx;
        //
        // Early boot we might not have a thread
        //
        heapIdx = DrtlGetCurrentThreadId() % g_DrtlNumberHeaps;
        return DrtlpAllocate(&g_DrtlHeaps[heapIdx], Flags, Size, Tag);
    }
    NtpAllocateIORequestRaw(
        __in NTUM_IO_TYPE Type)
    {
        // Use cache if we have I/O request
        //
        LocalRequest = (PNTUM_IO_REQUEST)ExpInterlockedPopEntrySList(
            &RequestingThread->IORequestsCache);
        // If the cache was empty allocate a new request structure.
        //
        if (LocalRequest == NULL) {
            LocalRequest = (PNTUM_IO_REQUEST)ExAllocatePoolWithTag(
                PagedPool, sizeof(*LocalRequest), ' PRI');
        }
Chapter 3: Pressure is On
Hardware Configuration
Power Settings: OS Control power option, High Performance in OS, HT OFF, Turbo boost OFF
Network: 1x10 GB network connection per machine
Machine configuration (server and client): Gen3 systems
Model/Processors: Intel Xeon CPU E5-2660 0 @ 2.20 GHz (2S/16C), Memory: 128 GB
Storage: 4x447.13 GB SSDs. All SSDs are striped together and mounted as 1 volume. Both data and log are stored on this volume.
Hardware Configuration
Power Settings: OS Control power option, High Performance in OS, HT OFF, Turbo boost OFF
Network: 1x10 GB network connection per machine
Machine configuration (server and client): 4S systems (for TPCC test)
Model/Processors: Intel Xeon CPU E7-4850 0 @ 2.00 GHz (4S/40C), Memory: 768 GB
Data Storage: 2x1.46 TB Fusion IO disks. All disks are striped together and mounted as 1 volume.
Log Storage: 1x5.54 TB HDD
Chapter 4: The ultimate PAL
Introducing SQLPAL
Principles:
- Remove redundancy
- Optimize performance critical paths (I/O)
- Shrink code path-length
Technologies           LibOS and Win32   SQL SOSv2   Host Extension
Object Management      ❌                ✔           ❌
Memory Management      ❌                ✔           ✔ Host translation (jemalloc)
Threading/Scheduling   ❌                ✔           ✔ Host translation (pthreads)
Synchronization        ❌                ✔           ✔ Host translation (condition variables)
I/O (Disk, Network)    ❌                ✔           ✔ Host translation (kaio)
Ring 3: SQL Server on top of SQLPAL (Win32, SOSv2, LibOS, Host Extension)
Ring 0: Linux Kernel
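The "Host translation (condition variables)" entry suggests that Windows-style synchronization objects are mapped onto host primitives. As a sketch only, assuming a hypothetical `host_event` type that is not taken from SQLPAL, an auto-reset event can be built from a pthread mutex and condition variable like this:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical auto-reset event; the names are not SQLPAL APIs. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signaled;
} host_event;

int host_event_init(host_event *e)
{
    int rc;

    e->signaled = false;
    rc = pthread_mutex_init(&e->lock, NULL);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_cond_init(&e->cond, NULL);
    if (rc != 0) {
        pthread_mutex_destroy(&e->lock);
    }
    return rc;
}

void host_event_destroy(host_event *e)
{
    pthread_cond_destroy(&e->cond);
    pthread_mutex_destroy(&e->lock);
}

/* Signal the event and wake at most one waiter. */
void host_event_set(host_event *e)
{
    pthread_mutex_lock(&e->lock);
    e->signaled = true;
    pthread_cond_signal(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

/* Block until the event is signaled, then consume the signal. */
void host_event_wait(host_event *e)
{
    pthread_mutex_lock(&e->lock);
    while (!e->signaled) {          /* loop guards against spurious wakeups */
        pthread_cond_wait(&e->cond, &e->lock);
    }
    e->signaled = false;            /* auto-reset */
    pthread_mutex_unlock(&e->lock);
}

/* Non-blocking variant: returns true if a signal was consumed. */
bool host_event_try_wait(host_event *e)
{
    bool got;

    pthread_mutex_lock(&e->lock);
    got = e->signaled;
    e->signaled = false;
    pthread_mutex_unlock(&e->lock);
    return got;
}
```

The predicate loop around `pthread_cond_wait` is what makes the wait safe against spurious wakeups; a manual-reset event would differ only in leaving `signaled` set and using `pthread_cond_broadcast`.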
SQLPAL and SOSv2 Architecture
Diagram components: Host Extension and Integration, HE Debugger Bridge, SOSv2 (Memory, Scheduling, Synchronization), Storage Manager, Network Manager, Resource Manager, Process Manager, Security Manager, Availability Manager, NT User Mode, Config Manager, PAL Debugger Extension, Hosted Windows APIs, SQL Server, SOS Direct APIs
Chapter 5: Natural Habita(n)t
Linux Process Layout
- Host Extension is a native Linux process
- The Host Extension loads the SQLPAL native Windows library
- SQLPAL loads SQL Server into a virtual Windows process
SQL Server (Windows Binary)
  Win32 Calls (1200+)
LibOS (Windows Kernel in User Mode)
  ABI Calls (50)
Host Extension (Linux or OS X)
  Linux or OS X OS Calls
Linux or OS X OS
LLDB Debugger
Debugger
- Debugger bridge for WinDbg
  - For most scenarios debugging is identical to Windows
- Live Debugging
  - Start SQL on Linux under debugger bridge
  - Attach with WinDbg
  - Dscripts etc. work same as against Windows
- Crash Dump
  - Run debugger bridge passing in crash dump file
  - Attach with WinDbg and it's the same as Windows
- Extract Windows dump from Linux Core dump
  - Able to extract a Windows dump from Linux core dump
  - Loses Linux information
- Linux Enlightenment
  - The debugger extension also adds commands to debug Linux parts of the PAL
  - Commands mirror normal WinDbg commands
  - Examples:
    - 'k' shows Windows stack
    - '!k' shows Linux stack
    - Same for dv (!dv), dt (!dt), etc.
  - Source can be listed and source stepping works
- VTune is a cross platform performance tool
- Process
  - Capture on Linux and resolve on Linux
  - Copy the project to Windows
  - Resolve symbols and rerun analysis
  - This adds the Windows information to the project
- After processing all the code is available for analysis: Linux code, sqlpal.dll, Win32, and SQL