SQL Server on Linux, will it perform? Slava Oks Thank You! - - PowerPoint PPT Presentation



slide-1
SLIDE 1

SQL Server on Linux, will it perform?

Slava Oks

slide-2
SLIDE 2

Thank You!

Microsoft Research, Windows team, Midori

slide-3
SLIDE 3

Our goal is to make SQL Server perform and scale on any platform or hardware of the customer's choice

slide-4
SLIDE 4

Prologue: Meet the PALs

slide-5
SLIDE 5
Intro to Drawbridge: A container technology to achieve isolation, security and density in the cloud

  • Modified Windows Kernel to run in user mode, aka Library OS or LibOS
  • Designed for running on Windows and leverages Pico-process feature
  • Pico-process is an NT process with empty address space

[Diagram: an NT process (shared address space) with user32, gdi32 and ntdll reaches the host OS (ntoskrnl, win32k) through 400+ NT calls and 800+ Win32 calls. A picoprocess (isolated address space) reaches the host OS (security monitor, ntoskrnl) across the ABI boundary (PAL) through 45 calls.]

  • All 1200+ system calls blocked from user-mode (NTOS and win32k)
  • Enforced by 35-line change to KiSystemServiceHandler
  • No perf impact to other processes; leverages “slow path” used by UMS
  • 45 new system calls added to process (Drawbridge system calls)
  • Even hard-coded traps can’t break out
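
The blocking rule in the bullets above can be pictured as a check at dispatch time: a picoprocess may only issue calls from the small Drawbridge table, and every NT or win32k call is refused. The sketch below is illustrative only. The actual 35-line change to KiSystemServiceHandler is not shown in the slides, all names here are hypothetical, and the treatment of ordinary processes is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only the count of 45 comes from the slide; every name here is hypothetical. */
#define DRAWBRIDGE_CALL_COUNT 45u

typedef enum {
    CALL_TABLE_NT,
    CALL_TABLE_WIN32K,
    CALL_TABLE_DRAWBRIDGE
} call_table_t;

typedef struct {
    bool is_picoprocess; /* true for a process created with an empty address space */
} process_t;

/* Returns true when the requested call may be dispatched. */
bool syscall_allowed(const process_t *proc, call_table_t table, uint32_t number)
{
    assert(proc != NULL);

    if (!proc->is_picoprocess) {
        /* Assumption: ordinary processes keep the NT and win32k tables
         * and are not offered the Drawbridge calls. */
        return table != CALL_TABLE_DRAWBRIDGE;
    }

    /* Picoprocess: everything except the Drawbridge table is blocked. */
    return table == CALL_TABLE_DRAWBRIDGE && number < DRAWBRIDGE_CALL_COUNT;
}
```

With this shape, a hard-coded trap into an NT service number from inside a picoprocess fails the same check as any other call, which is one way to read the last bullet.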

slide-6
SLIDE 6

LibOS: A user mode runtime library exposing semantics of Windows kernel

[Diagram components: NT UM, Network Stack, I/O, Object Manager, Process Manager, DRTL, Simple Heap, Union FS, AFD, WaitPool, Threads, Memory Manager, Loader, PEB/TEB, PAL, ABI Handler, Sync Objects, Threads, Streams, Memory Manager, APC]

slide-7
SLIDE 7
  • Storage Manager
      • Asynchronous I/O is submitted to the host and registered with WaitPool threads
      • On completion, WaitPool threads deliver I/Os to the original thread through APC
      • Original threads deliver I/Os to their final destination
  • Network Manager
      • Custom version of AFD (WinSock semantics) with a thread pool
      • AFD asynchronous I/O is submitted to the host and registered with WaitPool threads
      • On completion, WaitPool threads deliver I/Os to AFD threads through APC
      • AFD threads deliver network requests to the original threads that initiated the I/O through APC
      • Original threads deliver I/Os to their final destination
  • I/O General
      • No proper support for Scatter/Gather
slide-8
SLIDE 8
  • Memory Manager
      • Global Virtual Address Descriptor (VAD) list
      • Global Heap
  • Object Manager
      • Global Directory
  • Process Manager
      • Per process runtime libraries – no image sharing
  • Threads
      • APCs “injection” through polling
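
APC "injection" through polling means a queued routine runs only when the target thread reaches one of its own polling points and drains its queue. A minimal sketch of that idea follows. All names are hypothetical and do not come from the LibOS source, and the queue is single-threaded for brevity; a real one would need synchronization because other threads enqueue into it.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*apc_routine_t)(void *context);

typedef struct apc {
    struct apc   *next;
    apc_routine_t routine;
    void         *context;
} apc_t;

typedef struct {
    apc_t *head; /* oldest queued APC */
    apc_t *tail; /* newest queued APC */
} apc_queue_t;

/* Queue an APC for the thread that owns the queue. Nothing runs yet. */
void apc_queue(apc_queue_t *q, apc_t *apc, apc_routine_t routine, void *context)
{
    assert(q != NULL && apc != NULL && routine != NULL);
    apc->next = NULL;
    apc->routine = routine;
    apc->context = context;
    if (q->tail != NULL) {
        q->tail->next = apc;
    } else {
        q->head = apc;
    }
    q->tail = apc;
}

/* Called by the owning thread at a polling point. Runs queued APCs in
 * FIFO order and returns how many were delivered. */
int apc_poll(apc_queue_t *q)
{
    int ran = 0;
    assert(q != NULL);
    while (q->head != NULL) {
        apc_t *apc = q->head;
        q->head = apc->next;      /* unlink before running the routine */
        if (q->head == NULL) {
            q->tail = NULL;
        }
        apc->routine(apc->context);
        ran++;
    }
    return ran;
}

/* Example routine: counts deliveries through an int passed as context. */
void apc_increment(void *context)
{
    ++*(int *)context;
}
```

The cost of this design is latency: an APC waits until the next poll, which is why the frequency of polling points matters for the I/O paths described earlier.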
slide-9
SLIDE 9

SQL OS (SOS): A user mode runtime library providing performance, scalability and diagnostic foundation for SQL Server

[Diagram: a Memory Node holding CPU Nodes; each CPU Node shows a Network Manager and Schedulers, each Scheduler paired with a Storage Manager]

slide-10
SLIDE 10
  • Network Manager
      • I/O completion port/thread per CPU Node
      • Asynchronous delivery
  • Storage Manager
      • I/O queue per scheduler
      • Synchronous delivery through periodic polling
  • Memory Manager / Object Manager / Scheduling Manager
      • NUMA awareness
      • Partitioned heaps
      • Non-preemptive scheduling & User Mode Threads
      • Synchronization primitives
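
The Storage Manager bullets describe a per-scheduler queue of outstanding disk I/Os that the scheduler itself checks periodically, delivering completions on its own thread. The sketch below illustrates that pattern under simplifying assumptions: every name is hypothetical, the `done` flag stands in for the overlapped structure that the host updates, and no cross-thread synchronization is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct io_request io_request_t;
typedef void (*io_callback_t)(io_request_t *io);

struct io_request {
    io_request_t *next;
    bool          done;        /* stand-in for the overlapped status set by the host */
    io_callback_t on_complete; /* runs on the scheduler that issued the I/O */
    void         *context;
};

typedef struct {
    io_request_t *pending;     /* outstanding I/Os owned by this scheduler */
} scheduler_t;

/* Record an issued I/O on the scheduler's own list. */
void scheduler_submit(scheduler_t *s, io_request_t *io,
                      io_callback_t on_complete, void *context)
{
    assert(s != NULL && io != NULL && on_complete != NULL);
    io->done = false;
    io->on_complete = on_complete;
    io->context = context;
    io->next = s->pending;
    s->pending = io;
}

/* Called at the scheduler's polling points. Delivers every finished I/O
 * synchronously and returns the number delivered. */
int scheduler_poll_io(scheduler_t *s)
{
    int delivered = 0;
    io_request_t **link = &s->pending;

    while (*link != NULL) {
        io_request_t *io = *link;
        if (io->done) {
            *link = io->next;  /* unlink first: the callback may reuse the request */
            io->next = NULL;
            io->on_complete(io);
            delivered++;
        } else {
            link = &io->next;
        }
    }
    return delivered;
}

/* Example callback: counts completions through an int passed as context. */
void io_count_completion(io_request_t *io)
{
    ++*(int *)io->context;
}
```

Because delivery happens only inside `scheduler_poll_io`, no completion thread or APC is needed for disk I/O, which fits the non-preemptive scheduling model listed above.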
slide-11
SLIDE 11

Chapter 1: SQL & PALs

The marriage in heaven or…

slide-12
SLIDE 12

SQL Server On Top Of PALs

[Diagram: Ring 3: SQL Server, SQLOS, Win32, LibOS, PAL; Ring 0: Linux Kernel; label: Drawbridge]

Technologies            SQL   LibOS   Host Extension
Object Management       ✔     ✔       ✔
Memory Management       ✔     ✔       ✔
Threading/Scheduling    ✔     ✔       ✔
Synchronization         ✔     ✔       ✔
I/O (Disk, Network)     ✔     ✔       ✔

slide-13
SLIDE 13

Chapter 2: The sign is on the wall

Introducing Intelligent Hacks

slide-14
SLIDE 14
  • Kernel aio
  • Pump threads vs WaitPool threads
  • Fast I/O

    // We can do Fast I/O if and only if it follows rules employed by SQL Server
    // disk I/O: which is delivered nonpreemptively through polling an overlapped
    // data structure
    // - I/O is asynchronous
    // - No user mode APC required
    // - No I/O completion port specified
    // - Contains an event to be signaled (leveraged by SQL Server to wake up idle scheduler)
    // - Disk I/O
    //
    canDoFastIO = WaitForCompletion == FALSE;
    canDoFastIO = canDoFastIO && (ApcRoutine == NULL && FileObject != NULL);
    canDoFastIO = canDoFastIO && (Args->SkipCompletionPort ||
                                  NtpGetCompletionPortObject(FileObject, &CompletionKey) == NULL);
    canDoFastIO = canDoFastIO && (Args->EventObject != NULL && IoStatusBlock != NULL);
    canDoFastIO = canDoFastIO && (NtpGetObjectType(Args->Object) == NTUM_FILE &&
                                  NtpIsIoAsynchronous(Args->Object));
    canDoFastIO = canDoFastIO && ((FileObject->Type & NtpSeekableFile) &&
                                  (Type == NTUM_IO_READ ||
                                   Type == NTUM_IO_WRITE ||
                                   Type == NTUM_IO_WRITE_GATHER ||
                                   Type == NTUM_IO_READ_SCATTER));

    // If it is Gather/Scatter I/O then length can't exceed DK_UIO_MAXIOV supported by the Host
    //
    canDoFastIO = canDoFastIO && (!(Type == NTUM_IO_WRITE_GATHER ||
                                    Type == NTUM_IO_READ_SCATTER) ||
                                  Length <= DK_UIO_MAXIOV);

slide-15
SLIDE 15
  • Pump threads vs WaitPool
  • Fast I/O ~ AFD pass through
  • SQLOS completion threads are pump threads ~ no context switch on completion

    // Complete I/Os received via the IOPort are submitted to the I/O
    // completion port queue
    Status = NtpTryToProcessIoCompletion(IoCompletionPort, IoCompletionInformation);

    // Process any APCs or interruptions for this thread.
    //
    NtpProcessKernelApc(threadObject);

    Request.IOPort = IoCompletionPort->IOPort;
    Request.PendingIOs = &PendingIOs;
    Status = DrtlReadStreamSync(IoCompletionPort->Stream, 0, 0, (PVOID)&Request, NULL);

    while (PendingIOs != NULL) {
        //
        // Remember I/O to complete and move to the next I/O before
        // we complete the current one since by the time we return from
        // completion routine the completed I/O will be freed
        //
        CompletedIO = PendingIOs;
        PendingIOs = (PDK_ASYNC_RESULTS_LINKED)PendingIOs->Next;

        //
        // Complete I/O
        //
        NtpCompleteNetworkIoRequest((PNTUM_IO_REQUEST)CompletedIO->Request);
    }
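
The comment inside the loop on this slide describes a general list-draining rule: read the `Next` pointer before invoking a completion routine that frees the node. A small standalone sketch of that rule follows. The types and function names are hypothetical stand-ins, not the `PDK_ASYNC_RESULTS_LINKED` or `NtpCompleteNetworkIoRequest` definitions, which the slides do not show.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct async_result {
    struct async_result *next;
    int                  request_id;
} async_result_t;

/* Hypothetical completion routine: consumes the result and frees it,
 * so the caller must not touch the node afterwards. */
static void complete_request(async_result_t *result, int *completed_sum)
{
    *completed_sum += result->request_id;
    free(result);
}

/* Drain a linked list of completed I/Os. Returns how many were completed. */
int drain_completions(async_result_t *pending, int *completed_sum)
{
    int count = 0;
    assert(completed_sum != NULL);

    while (pending != NULL) {
        async_result_t *current = pending;
        pending = pending->next;   /* advance before the node is freed */
        complete_request(current, completed_sum);
        count++;
    }
    return count;
}

/* Helper for building a list: pushes a new result at the head. */
async_result_t *push_result(async_result_t *head, int request_id)
{
    async_result_t *node = malloc(sizeof *node);
    assert(node != NULL);
    node->next = head;
    node->request_id = request_id;
    return node;
}
```

Reading `pending->next` after the call to `complete_request` would be a use-after-free, which is the failure the original comment guards against.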

slide-16
SLIDE 16
  • Multiple Heaps
  • I/O Request free list per thread
  • Per process Virtual Address Space Manager
  • NUMA support
  • Processor Affinity

    PVOID
    DrtlAllocate(
        __in ULONG Flags,
        __in SIZE_T Size,
        __in ULONG Tag
        )
    {
        ULONG heapIdx;

        //
        // Early boot we might not have a thread
        //
        heapIdx = DrtlGetCurrentThreadId() % g_DrtlNumberHeaps;
        return DrtlpAllocate(&g_DrtlHeaps[heapIdx], Flags, Size, Tag);
    }

    NtpAllocateIORequestRaw(
        __in NTUM_IO_TYPE Type)
    {
        // Use cache if we have I/O request
        //
        LocalRequest = (PNTUM_IO_REQUEST)ExpInterlockedPopEntrySList(
            &RequestingThread->IORequestsCache);

        // If the cache was empty allocate a new request structure.
        //
        if (LocalRequest == NULL) {
            LocalRequest = (PNTUM_IO_REQUEST)ExAllocatePoolWithTag(
                PagedPool, sizeof(*LocalRequest), ' PRI');
        }

slide-17
SLIDE 17

Chapter 3: Pressure is On

slide-18
SLIDE 18

Hardware Configuration

Power Settings: OS Control power option, High Performance in OS, HT OFF, Turbo boost OFF
Network: 1x10 GB Network connection per machine
Machine configuration (server and client): Gen3 systems
Model/Processors: Intel Xeon CPU E5-2660 0 @ 2.20 GHz (2S/16C), Memory: 128 GB
Storage: 4x447.13 GB SSDs. All SSDs are striped together and mounted as 1 volume. Both data and log are stored on this volume.

slide-19
SLIDE 19

Hardware Configuration

Power Settings: OS Control power option, High Performance in OS, HT OFF, Turbo boost OFF
Network: 1x10 GB Network connection per machine
Machine configuration (server and client): 4S systems (for TPCC test)
Model/Processors: Intel Xeon CPU E7-4850 0 @ 2.00 GHz (4S/40C), Memory: 768 GB
Data Storage: 2x1.46 TB Fusion IO disks. All disks are striped together and mounted as 1 volume.
Log Storage: 1x5.54 TB HDD

slide-20
SLIDE 20

Chapter 4: The ultimate PAL

slide-21
SLIDE 21

Introducing SQLPAL

Principles:

  • Remove redundancy
  • Optimize performance critical paths (I/O)
  • Shrink code path-length

Technologies            LibOS and Win32   SQL SOSv2   Host Extension
Object Management       ❌                ✔           ❌
Memory Management       ❌                ✔           ✔ Host translation (jemalloc)
Threading/Scheduling    ❌                ✔           ✔ Host translation (pthreads)
Synchronization         ❌                ✔           ✔ Host translation (condition variables)
I/O (Disk, Network)     ❌                ✔           ✔ Host translation (kaio)

[Diagram: Ring 3: SQL Server, Win32, SOSv2, LibOS, Host Extension; Ring 0: Linux Kernel; label: SQLPAL]
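
The Synchronization row says the Host Extension translates to condition variables. As a rough illustration of what such a translation can look like, the sketch below builds a Windows-style event (auto-reset or manual-reset) on a pthread mutex and condition variable. It is an assumption-laden example, not the SQLPAL implementation: the `host_event_*` names are invented here, and timeouts and error handling are mostly omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signaled;
    bool            manual_reset; /* false: one waiter consumes each signal */
} host_event_t;

/* Returns 0 on success or the pthread error code. */
int host_event_init(host_event_t *ev, bool manual_reset, bool initially_signaled)
{
    int rc;
    assert(ev != NULL);
    ev->signaled = initially_signaled;
    ev->manual_reset = manual_reset;
    rc = pthread_mutex_init(&ev->lock, NULL);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_cond_init(&ev->cond, NULL);
    if (rc != 0) {
        pthread_mutex_destroy(&ev->lock);
    }
    return rc;
}

/* Signal the event: wake all waiters (manual-reset) or one (auto-reset). */
void host_event_set(host_event_t *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->signaled = true;
    if (ev->manual_reset) {
        pthread_cond_broadcast(&ev->cond);
    } else {
        pthread_cond_signal(&ev->cond);
    }
    pthread_mutex_unlock(&ev->lock);
}

/* Block until the event is signaled. */
void host_event_wait(host_event_t *ev)
{
    pthread_mutex_lock(&ev->lock);
    while (!ev->signaled) {       /* loop guards against spurious wakeups */
        pthread_cond_wait(&ev->cond, &ev->lock);
    }
    if (!ev->manual_reset) {
        ev->signaled = false;     /* auto-reset: this waiter consumes the signal */
    }
    pthread_mutex_unlock(&ev->lock);
}

/* Non-blocking check with the same consume rule as host_event_wait. */
bool host_event_try_wait(host_event_t *ev)
{
    bool was_signaled;
    pthread_mutex_lock(&ev->lock);
    was_signaled = ev->signaled;
    if (was_signaled && !ev->manual_reset) {
        ev->signaled = false;
    }
    pthread_mutex_unlock(&ev->lock);
    return was_signaled;
}
```

Keeping the `signaled` flag under the mutex is what lets a set that happens before the wait still be observed, which a bare condition variable would not guarantee.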

slide-22
SLIDE 22

SQL PAL and SOSv2 Architecture

[Diagram components: Host Extension and Integration, HE Debugger Bridge, SOSv2 (Memory, Scheduling, Synchronization), Storage Manager, Network Manager, Resource Manager, Process Manager, Security Manager, Availability Manager, NT User Mode, Config Manager, PAL Debugger Extension, Hosted Windows APIs, SQL Server, SOS Direct APIs]

slide-23
SLIDE 23

Chapter 5: Natural Habita(n)t

slide-24
SLIDE 24

Linux Process Layout

  • Host Extension is a native Linux process
  • The Host Extension loads the SQLPAL native Windows library
  • SQLPAL loads SQL Server into a virtual Windows Process.

[Diagram: SQL Server (Windows Binary) makes Win32 Calls (1200+) into the LibOS (Windows Kernel in User Mode), which makes ABI Calls (50) into the Host Extension (Linux or OS X), which makes Linux or OS X OS Calls into the Linux or OS X OS. An LLDB Debugger is also shown.]

slide-25
SLIDE 25

Debugger

  • Debugger bridge for Windbg
  • For most scenarios debugging is identical to Windows
  • Live Debugging
      • Start SQL on Linux under debugger bridge
      • Attach with Windbg
      • Dscripts etc. work same as against Windows
  • Crash Dump
      • Run debugger bridge passing in crash dump file
      • Attach with Windbg and it’s the same as Windows
  • Extract Windows dump from Linux Core dump
      • Able to extract a Windows dump from Linux core dump
      • Loses Linux information
  • Linux Enlightenment
      • The debugger extension also adds commands to debug Linux parts of the PAL
      • Commands mirror normal Windbg commands
      • Examples:
          • ‘k’ shows Windows stack
          • ‘!k’ shows Linux stack
          • Same for dv (!dv), dt (!dt), etc.
      • Source can be listed and source stepping works

slide-26
SLIDE 26
  • VTune is a cross platform performance tool
  • Process
      • Capture on Linux and resolve on Linux
      • Copy the project to Windows
      • Resolve symbols and rerun analysis
      • This adds the Windows information to the project
  • After processing all the code is available for analysis: Linux code, sqlpal.dll, Win32, and SQL

slide-27
SLIDE 27

Chapter 6: The game is ON

slide-28
SLIDE 28

Thank You