Unit OS10: Fault Tolerance
Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze



Roadmap for Section 10.1

The Notion of Fault-Tolerance
Fault-Tolerance Support in NTFS
Volume Management - Striped and Spanned Volumes
Distributed File System (DFS) and File Replication Service (FRS)
Network Load Balancing (NLB)
Windows Clustering (MSCS)


Fault-tolerant Systems

Fault-tolerance is the property of a system that continues operating properly in the event of failure of some of its parts

If its operating quality decreases at all, the decrease is proportional to the severity of the failure
Fault-tolerance is particularly sought-after in high-availability or life-critical systems

Fault-tolerance is not just a property of individual machines; it may also characterize the rules by which they interact


Fault Models and Protocols

A fault model must be specified when discussing fault-tolerant (FT) systems
All FT mechanisms in Windows deal only with crash faults of computers or applications
Crash faults can be handled by replication in space or time


Fault-tolerance (FT) by duplication

Three approaches toward FT systems

Replication:

multiple identical system instances; tasks or requests are directed to all of them in parallel, and the correct result is chosen on the basis of a quorum

Redundancy:

fail-over among multiple identical system instances; fall-back or backup

Diversity:

multiple different implementations of the same specification, used like replicated systems to cope with errors in a specific implementation


Fault-tolerance in NTFS - Increasing System Availability

Transaction-based logging scheme
Fast, even for large disks
Recovery is limited to file system data

Use transaction processing (like SQL Server) for user data
Tradeoff: performance versus a fully fault-tolerant File System (FS)

Design options for file I/O & caching:

Careful write: VAX/VMS FS, other proprietary OS FS Lazy write: most UNIX FS, OS/2 HPFS


Recoverable File System (Journaling File System)

Safety of a careful write FS, performance of a lazy write FS
Log file + fast recovery procedure

Log file imposes some overhead
Optimization over lazy write: distance between cache flushes increased

NTFS supports cache write-through and cache flushing triggered by applications

No extra disk I/O to update FS data structures is necessary: all changes to the FS structure are recorded in the log file, which can be written in a single operation
In the future, NTFS may support logging for user files (hooks are in place)


Recovery - Principles

NTFS performs automatic recovery based on update records and checkpoints in the log file
Update records store the sub-operations that change the file system structure
NTFS writes a checkpoint every 5 sec.; it includes a copy of the transaction table and the dirty page table
The checkpoint includes the LSNs of the log records containing the tables; it is really a series of records, interleaved with update records
Recovery depends on two NTFS in-memory tables:
Transaction table: keeps track of active (not completed) transactions; sub-operations of these transactions must be removed from disk
Dirty page table: records which pages in the cache contain modifications to the file system structure that have not yet been written to disk

(Figure: log file layout - update records interleaved with a checkpoint record; the begin and end of the checkpoint operation bracket copies of the dirty page table and the transaction table)


Recovery - Passes

1. Analysis pass
  • NTFS scans forward in the log file from the beginning of the last checkpoint
  • Updates the transaction/dirty page tables it copied in memory
  • NTFS scans the tables for the oldest update record of a non-committed transaction

2. Redo pass
  • NTFS looks for "page update" records, which contain volume modifications that might not have been flushed to disk
  • NTFS redoes these updates in the cache until it reaches the end of the log file
  • The cache manager's lazy writer thread begins to flush the cache to disk

3. Undo pass
  • Roll back any transactions that were not committed when the system failed
  • After the undo pass, the volume is in a consistent state
  • Write an empty LFS restart area; no recovery is needed if the system fails now
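The three passes can be sketched as a toy log replay. All names and record fields here are invented for illustration; real NTFS log records are far richer:

```python
# Toy sketch of NTFS-style three-pass recovery (invented record format).
from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int          # log sequence number
    txn: int          # transaction id
    redo: tuple       # (page, value) to reapply
    undo: tuple       # (page, value) to roll back
    commit: bool = False

def recover(log, checkpoint_lsn):
    # 1. Analysis pass: scan forward from the last checkpoint and
    #    rebuild the transaction table (active = never committed).
    active = set()
    for rec in log:
        if rec.lsn < checkpoint_lsn:
            continue
        if rec.commit:
            active.discard(rec.txn)
        else:
            active.add(rec.txn)

    cache = {}
    # 2. Redo pass: reapply every logged update, committed or not.
    for rec in log:
        if rec.lsn >= checkpoint_lsn and not rec.commit:
            page, value = rec.redo
            cache[page] = value

    # 3. Undo pass: roll back updates of transactions never committed,
    #    scanning the log backwards.
    for rec in reversed(log):
        if rec.txn in active and not rec.commit:
            page, value = rec.undo
            cache[page] = value
    return cache
```

A committed transaction's updates survive; an uncommitted one is rolled back to its undo values.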


Undo Pass - Example

Transaction 1 was committed before the power failure
Transaction 2 was still active

NTFS must log undo operations in log file!

Power might fail again during recovery; NTFS would have to redo its undo operations

(Figure: log records LSN 4044-4049 at the time of the power failure - a "Transaction committed" record for transaction 1, followed by update records for transaction 2, each carrying a redo/undo pair: Redo: Allocate/initialize an MFT file record / Undo: Deallocate the file record; Redo: Add the filename to the index / Undo: Remove the filename from the index; Redo: Set bits 3-9 in the bitmap / Undo: Clear bits 3-9 in the bitmap)


NTFS Recovery - Conclusions

Recovery returns the volume to some preexisting consistent state (not necessarily the state before the crash)
Lazy commit algorithm: the log file is not immediately flushed when a "transaction committed" record is written
The Log File Service batches records
Flush occurs when the cache manager calls, when a checkpoint record is written (once every 5 sec.), or when the log is full
Several parallel transactions might have been active before the crash
NTFS uses the log file mechanisms for error handling
Most I/O errors are not file system errors
NTFS might create an MFT record and then detect that the disk is full when allocating space for the file in the bitmap
NTFS uses the log info to undo its changes and returns a "disk full" error to the caller


Fault Tolerance Support - Using Multiple Disks

NTFS‘ capabilities are enhanced by the fault-tolerant volume managers FtDisk/DMIO

Lies above the hard disk drivers in the I/O system's layered driver scheme
FtDisk - for basic disks
DMIO - for dynamic disks

Volume management capabilities:

Redundant data storage
Dynamic data recovery from bad sectors on SCSI disks

NTFS itself implements bad-sector recovery for non-SCSI disks


Terminology

A disk is a physical storage device such as a hard disk, a 3.5-inch floppy disk, or a CD-ROM
A disk is divided into sectors, addressable blocks of fixed size

Sector sizes are determined by hardware
All current x86-processor hard disk sectors are 512 bytes, and CD-ROM sectors are typically 2048 bytes
Future x86 systems might support larger hard disk sector sizes

Partitions are collections of contiguous sectors on a disk
A partition table or other disk-management database stores a partition's starting sector, size, and other characteristics
Simple volumes are objects that represent sectors from a single partition that file system drivers manage as a single unit
Multipartition volumes are objects that represent sectors from multiple partitions and that file system drivers manage as a single unit

Multipartition volumes offer performance, reliability, and sizing features that simple volumes do not


Basic vs Dynamic Disks

Two disk partitioning schemes used by Windows:

Basic disk partitioning Dynamic disk partitioning

Basic disks rely on MS-DOS-style disk partitioning

Basic disks are really Windows legacy disks
Partition information for each disk is stored on the disk
Multipartition information is not stored on disk

can be lost when disk moved, OS reinstalled

Dynamic disks implement a more flexible partitioning scheme

The configuration of multipartition volumes is stored on disk and mirrored across the dynamic disks of the same computer. This allows easy migration and minimizes the chance of disk configuration loss.
Disadvantage: the partitioning is not understood by other OSs
Laptops usually support only basic disks (typically only one disk, and disks are not removable)

All disks are basic disks unless created new as dynamic disks or converted


Basic Disk Partitioning

A disk's first sector is the Master Boot Record (MBR), which defines the first level of partitioning with its partition table

(Figure: MBR layout - boot code plus a four-entry partition table; partitions 1-3 are primary, partition 4 is an extended partition whose extended partition boot records define further partitions, each beginning with a boot sector)


Basic Disk Partitioning

The MBR describes up to 4 primary partitions

The first record of each primary partition is a boot record
One primary partition can be marked "bootable"
Each partition has a partition type (FAT, FAT32, NTFS, ...)

To overcome the 4-partition limit, basic disks define a special type of partition called an extended partition

Like a subdisk, complete with its own MBR

In NT 4, configuration for multipartition volumes is stored in the Registry’s HKLM\System\Disk subkey

This is lost if the system is reinstalled or the disk is moved to another system
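The MBR's four-entry partition table can be read straight out of sector 0; offsets follow the standard MBR layout (the dictionary field names here are invented for illustration):

```python
# Sketch: parse the four 16-byte partition-table entries that start at
# byte offset 446 of an MBR sector.
import struct

def parse_partition_table(sector0):
    """Return the four primary-partition entries of an MBR sector."""
    entries = []
    for i in range(4):
        off = 446 + 16 * i                  # partition table starts at byte 446
        boot_flag = sector0[off]            # 0x80 marks the bootable partition
        ptype = sector0[off + 4]            # partition type, e.g. 0x07 = NTFS
        # Starting LBA and sector count are little-endian 32-bit fields.
        start_lba, num_sectors = struct.unpack_from("<II", sector0, off + 8)
        entries.append({"bootable": boot_flag == 0x80,
                        "type": ptype,
                        "start_lba": start_lba,
                        "sectors": num_sectors})
    return entries
```

An extended partition is simply an entry whose type byte names it as such; its contents are parsed the same way, one level down.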


Dynamic Disk Partitioning

The dynamic disk partitioning scheme is defined by a component called Logical Disk Manager (LDM)

LDM consists of service and driver components
The dynamic disk partitioning scheme was co-developed with Veritas Software, which ported LDM from its UNIX implementations

LDM maintains one unified database that stores all partitioning information, for all disks in the system.

The database also stores the multipartition configuration
The database occupies the last 1 MB of each dynamic disk and is mirrored across a system's dynamic disks

Veritas offers add-on software that allows dynamic disks to be managed in subsets called disk groups

(Figure: dynamic disk layout - master boot record, LDM partition area, and the 1-MB LDM database at the end of the disk)


Dynamic Partitioning

A computer’s boot and system disks have a mix of dynamic and basic disk partitioning

NTLDR only understands basic-disk partitioning
LDM partitions are called soft partitions, whereas basic-disk partitions are hard partitions

Even on “pure” dynamic disks, the MBR contains a basic-disk partition table that defines the entire usable area of the disk as a single hard partition of type LDM

LDM manages the space within the LDM partition in its database


Multipartition Volumes

The multipartition volumes available in Windows are:

Spanned volumes
Mirrored volumes
Striped volumes
RAID-5 volumes

All partitions that make up new multipartition volumes must be on dynamic disks

Windows preserves NT 4 multipartition volumes on basic disks during an upgrade


Volume Management Features – Spanned Volumes

Spanned Volumes:

Single logical volume composed of a maximum of 32 areas of free space on one or more disks
NTFS volume sets can be dynamically increased in size (only the bitmap file, which stores allocation status, needs to be extended)
FtDisk/DMIO hide the physical configuration of disks from the file system
Tool: Windows Disk Management MMC snap-in
Spanned volumes were called volume sets in Windows NT 4.0

(Figure: volume set D: occupies half of each of two disks; the disks also hold volumes C: and E:, each 100 MB)


Striped Volumes (RAID-0)

Series of partitions of the same size, one partition per disk
Combined into a single logical volume
FtDisk/DMIO optimize data storage and retrieval times

Stripes are narrow: 64 KB
Data tends to be distributed evenly among the disks
Multiple pending read/write operations will operate on different disks
Latency for disk I/O is often reduced (parallel seek operations)

(Figure: a striped volume across three 150-MB partitions; consecutive logical blocks are distributed round-robin across the disks)
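The round-robin placement behind striping is simple arithmetic. A sketch, with the 64-KB stripe unit from above and the disk count as a free parameter:

```python
# Sketch: map a RAID-0 volume byte offset to (disk index, offset on disk).

STRIPE = 64 * 1024  # 64-KB stripe unit, as used by FtDisk/DMIO

def locate(volume_offset, num_disks):
    stripe_index = volume_offset // STRIPE      # which stripe unit of the volume
    disk = stripe_index % num_disks             # round-robin disk choice
    row = stripe_index // num_disks             # stripe row on that disk
    disk_offset = row * STRIPE + volume_offset % STRIPE
    return disk, disk_offset
```

Because consecutive stripe units land on different disks, a run of pending I/Os naturally spreads across all spindles.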


Fault Tolerant Volumes

FtDisk/DMIO implement redundant storage schemes

Mirror sets (RAID-1)
Stripe sets with parity (RAID-5)
Sector sparing

Tools: Windows Disk Management MMC snap-in

Mirrored Volumes:

The contents of a partition on one disk are duplicated on another disk
FtDisk/DMIO write the same data to both locations
Read operations are done simultaneously on both disks (load balancing)

(Figure: a mirrored volume - C: and its mirror copy on a second disk)


Mirrored Volumes

Performance improvement: reads of different data can occur in parallel through dynamic load balancing


RAID-5 Volumes

Fault-tolerant version of a regular stripe set
Parity: logical sum (XOR)
Parity info is distributed evenly over the available disks
FtDisk/DMIO reconstruct missing data using the XOR operation

(Figure: a RAID-5 volume - data and parity stripes rotated across the disks)
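The XOR reconstruction fits in a few lines: the parity block is the byte-wise XOR of the data blocks, and any single missing block (data or parity) is the XOR of all survivors:

```python
# Sketch: RAID-5 parity computation and single-block reconstruction.
from functools import reduce

def parity(blocks):
    # Byte-wise XOR across equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving_blocks):
    # A single lost block is the XOR of all surviving blocks,
    # because x ^ x = 0 cancels everything except the missing one.
    return parity(surviving_blocks)
```

This is why RAID-5 tolerates exactly one disk failure: with two blocks missing, the single XOR equation is underdetermined.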


The Disk Management MMC Snapin


The Volume Manager

The LDM Volume Manager inserts itself above the disk drivers
Exports "volumes" to file systems
Takes volume-oriented requests and can create sub-requests aimed at different disks of multipartition volumes


Bad Cluster Recovery

Sector sparing is supported by FtDisk/DMIO

Dynamic copying of recovered data to spare sectors
Without intervention from the file system or the user
Works for certain SCSI disks
FtDisk/DMIO return a bad-sector warning to NTFS

Sector re-mapping is supported by NTFS

NTFS will not reuse bad clusters
NTFS copies data recovered by FtDisk/DMIO into a new cluster

NTFS cannot recover data from bad sector without help from FtDisk/DMIO

NTFS will never write to bad sector (re-map before write)


Bad-cluster re-mapping

(Figure: bad-cluster re-mapping - a user file's MFT record holds the filename, standard info, security descriptor, and data runs; the run list maps VCNs 0-1 to LCN 1355 (2 clusters), VCN 2 to LCN 1049 (1 cluster), and VCNs 3-6 to LCN 1588 (4 clusters); after a cluster goes bad, the affected VCN is re-mapped to a new 1-cluster run starting at LCN 1357)
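The VCN-to-LCN translation behind this example can be sketched as a linear scan over the run list (run values mirror the slide's example; real NTFS stores runs in a compact on-disk encoding):

```python
# Sketch: NTFS-style run list lookup. Each run is
# (starting VCN, starting LCN, cluster count).

RUNS = [(0, 1355, 2),   # VCNs 0-1 stored at LCNs 1355-1356
        (2, 1049, 1),   # VCN 2 stored at LCN 1049
        (3, 1588, 4)]   # VCNs 3-6 stored at LCNs 1588-1591

def vcn_to_lcn(vcn, runs=RUNS):
    for start_vcn, start_lcn, count in runs:
        if start_vcn <= vcn < start_vcn + count:
            return start_lcn + (vcn - start_vcn)
    raise ValueError("VCN not mapped")
```

Re-mapping a bad cluster just replaces one run entry with a new one pointing at a good LCN; the file's VCNs, and hence everything above the file system, are unaffected.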


Distributed File System (DFS)

Data Replication for High Availability

A strategic storage management solution

Namespaces: simplified views of folders regardless of where those files physically reside in a network

A namespace abstracts away file paths
A change of a file server's name does not break virtual DFS paths
DFS stores path names logically as a single namespace

Replicated file servers for high availability


How DFS Works

Client and a server component

Any Windows system can be a DFS client
Windows NT/2000/2003 Server include the DFS server component

The view of shared folders on different servers is called the DFS namespace

Like a virtual UNC path
A single namespace can map to physical resources residing on multiple servers

DFS Operation

DFS operates in multiple steps

1. Client makes a request of the DFS namespace
2. DFS returns the appropriate path to the data (including Active Directory site-costing information when AD is in use)
3. Client makes a connection to the server and share


DFS Authentication

DFS is a multi-protocol architecture

Uses SMB and LAN Manager authentication protocols to communicate between a DFS client and a DFS server

A DFS server can redirect requests to various types of shares, each with protocol-specific authentication:

Server Message Block (SMB) servers
Network File System (NFS) servers
Services for Macintosh™ (AFP) servers
Netware™ Core Protocol (NCP) servers

Windows client machines must install suitable redirector drivers


DFS Request Routing

for replicated servers

Files/Folders on a DFS server can be replicated using File Replication Service (FRS)

FRS resolves file and folder name conflicts to make data consistent among the replica members
FRS uses a "last writer wins" rule to resolve conflicts

DFS requests are then routed to the closest server

If a server becomes unavailable, DFS ensures that requests are routed to the next closest server by using site-costing

Active Directory site and costing information will be used for routing decisions

I.e., whether sites are connected via inexpensive, high-speed links or by expensive WAN links


File Replication Service (FRS) Details

Continuous replication

Subject to replication schedule, server load, and network load
When a file or folder is changed and closed, FRS begins replicating within three seconds

Fault-tolerant replication path

Fault-tolerant distribution by way of multiple connection paths between members
Identical file data is sent no more than once to any replica member


FRS Details (contd.)

Replication scheduling

Replication can be scheduled to occur at specified times and durations
Replicating data during off-hours may free up network bandwidth for other uses

Replication integrity

Files are replicated only after they have been changed and closed
FRS relies on the update sequence number (USN) journal to log records of files that have changed on a replica member
FRS does not lose track of a changed file even if a replica member shuts down abruptly


Distributed File System Solution in Microsoft Windows Server 2003 R2

DFS Replication is a new state-based, multimaster replication engine

Supports replication scheduling and bandwidth throttling
New compression protocol called Remote Differential Compression (RDC)

Allows files to be updated efficiently over a limited-bandwidth network

RDC detects insertions, removals, and re-arrangements of data in files

RDC replicates only the changes when files are updated

Cross-file RDC can help reduce the amount of bandwidth required to replicate new files
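The core idea of replicating only changes can be illustrated with a naive fixed-size chunk-hash comparison. This is not the actual RDC protocol (which uses content-defined chunks and recursive signatures); it is just a sketch of the principle:

```python
# Sketch: transfer only the chunks of a new file whose hashes the
# receiver's old copy does not already contain.
import hashlib

CHUNK = 4  # tiny chunk size for illustration only

def chunk_hashes(data):
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def changed_chunks(old, new):
    have = chunk_hashes(old)  # signatures the receiver can compute locally
    # Send only chunks whose hashes are missing on the receiving side.
    return [new[i:i + CHUNK] for i in range(0, len(new), CHUNK)
            if hashlib.sha256(new[i:i + CHUNK]).hexdigest() not in have]
```

An insertion in the middle of a file defeats fixed-size chunking (every later chunk shifts), which is why RDC's content-defined chunk boundaries matter in practice.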

Substantial improvements over File Replication Service (FRS)


Increased Availability with DFS and FRS

DFS may point to multiple volumes that can act as alternates for each other

DFS manages failover to an alternate volume
Multiple copies of read-only shares can be mounted under the same logical DFS name (replication)
Client accesses to DFS volumes are evenly distributed across multiple alternate network shares

FRS and DFS replication (in Server 2003 R2) can be used to maintain consistency among replicated volumes


Network Load Balancing (NLB)

A clustering technology for stateless services

Part of all Windows 2000 Server and Windows Server 2003 family operating systems
Uses a distributed algorithm to load-balance network traffic across a number of hosts

Helps to enhance the scalability and availability of mission-critical, IP-based services: Web, Virtual Private Networking, Streaming Media, Terminal Services, Proxy, etc.

High availability by detecting host failures and redistributing traffic to operational hosts


NLB versus Server Clusters

A server cluster (MSCS) is a collection of servers

Provides a single, highly available platform
Applications can be failed over (SQL Server, Exchange Server data stores, file and print servers)
MSCS clusters are used for stateful applications

NLB clusters distribute network traffic

NLB clusters provide a highly available and scalable platform for applications such as IIS, ISA Server, etc.
NLB is used for stateless applications, i.e. those that do not build any state as a result of a request


NLB for High Availability

How Does NLB Detect a Server Failure?

Each NLB cluster host emits heartbeats
A convergence process removes a failed host from the cluster
By default, five seconds are required to detect a failed host
The convergence process takes three seconds to evict the failed host and redistribute its load


NLB Load Balancing Algorithm

Fully distributed filtering algorithm to map incoming clients to the cluster hosts

All hosts simultaneously inspect arriving packets and determine which host should handle each packet
A randomization function determines the destination, calculating a host priority based on IP address, port, etc.
The destination host forwards the packet to its TCP/IP network stack
The other cluster hosts discard the packet

Mapping remains unchanged unless the membership of cluster hosts changes

A given client's IP address and port will always map to the same cluster host
Client affinity settings modify the statistical mapping algorithm
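The key property is that every host computes the same deterministic mapping over the packet headers, so no per-packet coordination is needed. A toy version (the hash function here is invented; it only stands in for NLB's actual randomization function):

```python
# Sketch: deterministic client-to-host mapping, NLB-style. Every cluster
# member runs this same function on each arriving packet; only the member
# whose id matches the result accepts the packet.
import hashlib

def owning_host(src_ip, src_port, num_hosts):
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.md5(key).digest()
    # Same input -> same host on every member, with no coordination.
    return int.from_bytes(digest[:4], "big") % num_hosts
```

When membership changes (a host fails or joins), `num_hosts` changes and the cluster converges on a new mapping, which is exactly the convergence process described above.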


NLB Implementation and Overhead

NLB has a kernel component called wlbs.sys

This is an intermediate NDIS driver
NLB also has user-mode management components

NLB creates additional CPU load

Increases linearly with increased throughput on network interface


Server Cluster (Windows Server 2003)

Clustering technology for stateful applications

A dramatically improved version of the Microsoft Cluster Service (MSCS) component

MSCS was included with Windows 2000 Advanced Server and Windows 2000 Datacenter Server

Between two and eight servers act as nodes in the cluster
Cluster resources include network names, IP addresses, applications, services, and disk drives


Server Cluster Operation (single quorum)

Nodes in a cluster use a quorum to track which node owns a clustered application

The quorum is the storage device controlled by the primary node for a clustered application
Only one node at a time may own the quorum
On failover, the backup node takes ownership of the quorum
The quorum may be created on a storage device attached to all nodes (single quorum device server cluster)


Majority node set (MNS) server clusters

Quorum stored on a locally attached storage device connected to each of the cluster nodes

The backup node must have a copy of the data stored within the quorum
The server cluster handles this requirement by replicating quorum data across the network
The network can be a LAN, WAN, or VPN


Availability of a Server Cluster

To effectively fail over between nodes, majority node set clusters must have at least three nodes

More than half of the cluster nodes must be active at all times
I.e., in a cluster with three nodes, two of them must be active for the cluster to be functional
Eight-node clusters must have five nodes active to remain online

Single quorum device server clusters require that only a single node continues to function
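The majority rule above can be stated in one line (a sketch; real clusters also distinguish configured from active membership and can use tie-breakers):

```python
# Sketch: majority-node-set quorum check. The cluster stays online only
# while strictly more than half of its configured nodes are active, which
# prevents two partitioned halves from both claiming ownership.
def mns_online(total_nodes, active_nodes):
    return active_nodes > total_nodes // 2
```

This reproduces the slide's numbers: a 3-node cluster needs 2 active nodes, and an 8-node cluster needs 5.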


Hardware and Software Failures

Failed disks, memory, processors, power, and network equipment are all common sources of unplanned downtime

Server Cluster and NLB can be used to provide availability in the event of a failure of a processor, memory chip, power supply, or other hardware component
Windows Server 2003 clusters provide availability at many other layers

To provide complete redundancy, all layers of your application must be clustered

NLB provides availability and scalability for the firewalls, front-end servers, and application servers; Server Cluster provides high availability for the database


Clustered Applications - the Bigger Picture

(Figure: a tiered deployment - client laptops on a LAN, an NLB cluster of firewalls, an NLB cluster of ASP.NET application servers, and a server cluster of SQL Servers with shared cluster storage)


Further Reading

Mark E. Russinovich and David A. Solomon, Microsoft Windows Internals, 4th Edition, Microsoft Press, 2004.

Chapter 12 - File Systems: NTFS Recovery Support (from p. 775)
Chapter 13 - Networking: Network Load Balancing and File Replication Service (from p. 841)
Chapter 10 - Storage Management: Volume Management (from p. 622)

Distributed File System (DFS) and File Replication Services (FRS)

http://www.microsoft.com/windowsserver2003/technologies/storage/dfs/default.mspx

Network Load Balancing (NLB) for Windows 2000 and Windows Server 2003,

http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/nlbfaq.mspx

Windows Server 2003 Clustering

http://www.microsoft.com/windowsserver2003/techinfo/overview/bdmt dm/default.mspx


Source Code References

Windows Research Kernel (WRK) sources

\base\ntos\fstub – partition table/MBR support code

Note: the other topics covered in this unit are not included with the WRK