Multiple Processor Systems - PowerPoint PPT Presentation



SLIDE 1

Multiple Processor Systems

SLIDE 2

Multiprocessor Systems

  • Continuous need for faster computers
  • Multiprocessors: shared-memory model, access time in nanoseconds (ns)
  • Multicomputers: message-passing multicomputer, access time in microseconds (μs)
  • Distributed systems: wide-area distributed system, access time in milliseconds (ms)

SLIDE 3

Multiprocessors

Definition: A computer system in which two or more CPUs share full access to a common RAM.

Memory access:
  • UMA - Uniform Memory Access
  • NUMA - Nonuniform Memory Access

SLIDE 4

Bus-Based UMA Multiprocessors

  • Contention for the bus
  • Only one CPU can access the memory at any time
  • If the bus is busy, the requesting CPU must wait
  • Limits the number of CPUs
  • Internal caches reduce memory accesses
  • Need: cache-coherence protocol (several copies of a word may exist)

SLIDE 5

UMA Multiprocessor Using a Crossbar Switch

  • Allows n CPUs to connect to k memories using n*k crosspoints
  • Nonblocking network
  • A CPU can always connect, provided the target memory module is available
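The nonblocking property can be made concrete with a toy model: a connection attempt fails only when the target memory module itself is busy, never because the switch fabric is exhausted. This is a sketch, not a hardware model; the class and method names are my own.

```python
class Crossbar:
    """Toy model of an n-by-k crossbar: a CPU-to-memory connection only
    fails when the target memory module is already in use (nonblocking)."""
    def __init__(self, n_cpus, k_memories):
        self.busy_memory = [False] * k_memories   # one column per memory module

    def connect(self, cpu, memory):
        if self.busy_memory[memory]:
            return False          # memory module occupied by another CPU
        self.busy_memory[memory] = True
        return True

    def release(self, memory):
        self.busy_memory[memory] = False

xbar = Crossbar(4, 4)
assert xbar.connect(0, 2)        # CPU 0 gets memory 2
assert not xbar.connect(1, 2)    # CPU 1 is blocked only because memory 2 is busy
assert xbar.connect(1, 3)        # any free module is reachable at once
```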

SLIDE 6

NUMA Multiprocessor

Characteristics:

1. Single address space visible to all CPUs
2. Access to remote memory via ordinary LOAD and STORE instructions
3. Access to remote memory slower than to local memory

SLIDE 7

Multicore Chips

  • ”What to do with all the transistors on the chip?”
  • Adding cache -> after a point it does not improve performance
  • Add a CPU, called a core -> multicore chip
  • Cores may share a cache, each have their own, or both
  • A failure in a shared component can bring down several cores
  • System-on-a-chip designs also place special-purpose cores (for video, audio, crypto, network) alongside the CPUs
  • To the software, symmetric multicore chips look like UMA multiprocessors
  • ”How to use them with the software?”
  • Parallel coding, lack of algorithms
  • Synchronization, race conditions, deadlocks
  • Benefits

SLIDE 8

Multiprocessor OS Types

  • Each CPU has its own operating system
  • Partition the memory for private use
  • Each CPU and its OS operate independently
  • Master-Slave multiprocessors
  • One CPU is the master; the others are slaves
  • Only the master may run the operating system
  • Symmetric multiprocessors (SMP)
  • One copy of the OS; any CPU may run it
  • Balances processes and memory dynamically
  • Synchronization issues arise in the OS code itself
  • Critical regions must be protected (disabling interrupts is not enough)

SLIDE 9

Each CPU Has Its Own OS

  • System calls are handled by the private OS using its own private data structures
  • No sharing of processes, no load balancing
  • No sharing of memory pages, no reallocation of free pages
  • No buffer cache for shared file systems, to avoid inconsistencies
  • Not used any more

SLIDE 10

Master-Slave Multiprocessors

  • Only one instance of the OS and its data structures
  • The master allocates load (processes/threads) to the other CPUs
  • Shared memory; pages can be allocated for all processes
  • Shared buffer cache; inconsistencies do not occur
  • The master may become a bottleneck!
  • It executes all system calls for all processes

SLIDE 11

Symmetric Multiprocessors (SMP)

  • One copy of the OS; any CPU can run it
  • All CPUs can execute system calls
  • Access to critical regions must be controlled
  • Mutexes control access to multiple independent critical regions
  • Mutexes control access to critical tables (used in several critical regions)
  • Only one CPU can access a particular part of the OS at any time!
  • Deadlocks might freeze the system
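The idea of one mutex per independent critical region can be sketched with Python's `threading` module: unrelated parts of a (simulated) OS each get their own lock, so they do not serialize on a single big lock. The table names here are invented for illustration.

```python
import threading

# One mutex per independent critical region, as on the slide above.
process_table_lock = threading.Lock()
page_table_lock = threading.Lock()

process_table = []
free_pages = list(range(8))

def fork_process(pid):
    with process_table_lock:          # critical region 1: the process table
        process_table.append(pid)

def grab_page():
    with page_table_lock:             # critical region 2, independent of region 1
        return free_pages.pop() if free_pages else None

threads = [threading.Thread(target=fork_process, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(process_table) == 100      # no lost updates under the mutex
```

Without `process_table_lock`, concurrent `append` calls on a real multiprocessor could interleave and lose updates, which is exactly the synchronization issue the slide warns about.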

SLIDE 12

Multiprocessor Synchronization

  • A correct mutex implementation is not simple!
  • Disabling interrupts does not work with multiple processors
  • Test and Set Lock (TSL) without locking the bus does not work (see figure)
  • Test and Set with locking the bus before the instruction and releasing it afterwards works
  • The waiting CPU spins (loops, testing) rapidly, waiting for the spin lock
  • Instead of spinning, the CPU could switch to another thread
  • A TSL instruction can fail if the bus cannot be locked
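A spin lock built on an atomic test-and-set can be sketched as follows. CPython exposes no user-level TSL instruction, so `Lock.acquire(blocking=False)` stands in for the bus-locked TSL here: it atomically reads and sets the flag and reports the old state. This is an illustration of the spinning pattern, not of real bus locking.

```python
import threading

class SpinLock:
    """Sketch of a TSL-style spin lock; acquire(blocking=False) plays the
    role of the atomic, bus-locked test-and-set instruction."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass                      # spin: keep testing until TSL succeeds

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(10_000):
        lock.acquire()
        counter += 1                  # protected critical region
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000              # all increments survive
```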

SLIDE 13

Spinning versus Switching

  • In some cases the CPU must wait
  • e.g. it waits to acquire the ready list
  • In other cases a choice exists
  • Spinning wastes CPU cycles
  • Switching also uses up CPU cycles
  • It could be possible to make a separate decision each time a locked mutex is encountered
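The trade-off can be put in numbers with a toy cost model (the model is my assumption, not from the slides): spinning burns cycles for the whole expected wait, while switching pays roughly two context switches (out and back in) regardless of how long the lock is held.

```python
def waiting_cost(expected_wait, context_switch, policy):
    """Toy cost model, in arbitrary time units. 'spin' costs the full wait
    in burned cycles; 'switch' costs two context switches either way."""
    if policy == "spin":
        return expected_wait
    return 2 * context_switch

# Short critical region: spinning is cheaper. Long one: switching wins.
assert waiting_cost(5, 50, "spin") < waiting_cost(5, 50, "switch")
assert waiting_cost(500, 50, "switch") < waiting_cost(500, 50, "spin")
```

Under this model the per-encounter decision the slide mentions reduces to comparing the expected hold time against the context-switch cost.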

SLIDE 14

Multiprocessor Scheduling

  • What is scheduled?
  • User-level threads: the OS schedules processes
  • Kernel-level threads: the OS schedules threads
  • Where to run it?
  • Which CPU?
  • Are threads scheduled independently or in groups?
  • Timesharing
  • Space sharing
  • Gang scheduling

SLIDE 15

Timesharing

  • Single scheduling data structure for all CPUs
  • Automatic load balancing
  • Smart scheduling: a thread in a critical region is not switched off
  • Affinity scheduling: make an effort to run a thread on the same CPU again
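Affinity scheduling over a single shared ready queue can be sketched like this: when a CPU asks for work, it prefers a thread that last ran on it (whose cache state may still be warm) and only falls back to the queue head otherwise. The class and its two-pass lookup are my own illustration.

```python
from collections import deque

class AffinityScheduler:
    """Sketch of affinity scheduling: one ready queue shared by all CPUs,
    with a preference for threads that last ran on the requesting CPU."""
    def __init__(self):
        self.ready = deque()          # single scheduling data structure
        self.last_cpu = {}            # thread -> CPU it last ran on

    def enqueue(self, thread):
        self.ready.append(thread)

    def pick(self, cpu):
        for thread in self.ready:     # first pass: look for an affine thread
            if self.last_cpu.get(thread) == cpu:
                self.ready.remove(thread)
                return thread
        thread = self.ready.popleft() if self.ready else None
        if thread is not None:
            self.last_cpu[thread] = cpu
        return thread

sched = AffinityScheduler()
for t in ("A", "B", "C"):
    sched.enqueue(t)
sched.last_cpu["C"] = 1               # C ran on CPU 1 before
assert sched.pick(1) == "C"           # affinity wins over queue order
assert sched.pick(0) == "A"           # no affine thread: take the head
```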

SLIDE 16

Space Sharing

  • Multiple related threads run at the same time across multiple CPUs
  • Simple model:
  • All threads are allocated CPUs at the same time
  • They hold on to their CPUs even while waiting for I/O
  • CPUs are released only when the thread finishes
  • More complex alternatives exist

SLIDE 17

Multiprocessor Scheduling (4)

  • Problem with the threads in the figure:
  • they belong to the same process A and need to communicate,
  • but run in different phases, which makes A slow!
  • Solution: Gang Scheduling

SLIDE 18

Gang Scheduling

  • Groups of related threads are scheduled as a unit (a gang)
  • All members of a gang run simultaneously on different timeshared CPUs
  • All gang members start and end their time slices together

(Figure: successive time slices run A's threads, then B and C, then D and E0, then the rest of E, repeating.)
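The scheduling table above can be generated by a toy gang scheduler: each time slot is given entirely to one gang, so its threads start and end the slice together. Gangs wider than the machine are deliberately not handled in this sketch.

```python
def gang_schedule(gangs, n_cpus):
    """Sketch of gang scheduling: one gang per time slot, all of its
    threads co-scheduled, unused CPUs left idle for that slot."""
    slots = []
    for name, width in gangs:
        threads = [f"{name}{i}" for i in range(width)]
        # pad with idle CPUs so every slot shows all n_cpus columns
        slots.append(threads + ["idle"] * (n_cpus - width))
    return slots

table = gang_schedule([("A", 4), ("B", 2), ("C", 3)], n_cpus=4)
assert table[0] == ["A0", "A1", "A2", "A3"]      # gang A runs together
assert table[1] == ["B0", "B1", "idle", "idle"]  # co-scheduled, CPUs 2-3 idle
```

The cost of the idle columns is the price paid for guaranteeing that communicating threads of one gang are always running at the same time.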

SLIDE 19

Multicomputers

  • Definition: tightly-coupled CPUs that do not share memory
  • Also known as:
  • cluster computers
  • clusters of workstations (COWs)

SLIDE 20

Multicomputers

  • Hardware
  • Topology
  • Communication
  • Network interfaces
  • Communication software
  • Low-level: from memory to network
  • User-level: send & receive, blocking vs. nonblocking
  • Remote Procedure Call (RPC)
  • Distributed shared memory
  • Scheduling & load balancing

SLIDE 21

Interconnection Topology

(Figure: single switch, ring, grid, double torus, cube, 4D hypercube.)

SLIDE 22

Communication

  • Store-and-forward packet switching
  • Each switch that receives a packet forwards it to the next one
  • Flexible and efficient, but latency grows with network size
  • Circuit switching
  • A path is established first, during a setup phase
  • All packets travel along the same path
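The latency difference can be seen from a toy model (my assumption, not from the slides): with store-and-forward, the whole packet must be received at each switch before being forwarded, so the packet time is paid once per hop; once a circuit is set up, the packet streams along the reserved path without per-hop buffering.

```python
def store_and_forward_latency(hops, packet_time, switch_delay=0.0):
    """Packet time is paid at every hop: latency grows linearly with path length."""
    return hops * (packet_time + switch_delay)

def circuit_switched_latency(hops, packet_time, setup_time):
    """After the setup phase, latency is dominated by a single packet time."""
    return setup_time + packet_time

# Latency grows with size for store-and-forward, not for an established circuit.
assert store_and_forward_latency(8, 10) > store_and_forward_latency(2, 10)
assert circuit_switched_latency(8, 10, setup_time=5) == circuit_switched_latency(2, 10, setup_time=5)
```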

SLIDE 23

Low-Level Communication

  • Packets are copied several times during transmission

SLIDE 24

User-Level Communication Software

  • Minimum services provided: send and receive commands
  • These are blocking (synchronous) calls

(Figure: (a) blocking send call, (b) nonblocking send call.)

SLIDE 25

Remote Procedure Call (1)

  • The client makes a (normal) procedure call
  • The client stub marshalls the parameters for the network
  • The network passes the request to the server
  • The server stub unmarshalls the parameters
  • and makes a normal procedure call on the server
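The client-stub/server-stub pipeline can be demonstrated with Python's standard-library `xmlrpc` module, which supplies the stub machinery (marshalling, dispatch, unmarshalling) for us. The host, port, and `add` procedure below are illustrative choices, not anything from the slides.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register the procedure; the server stub unmarshalls incoming
# requests and calls it as an ordinary local function.
server = SimpleXMLRPCServer(("127.0.0.1", 8765), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy acts as the client stub; calling add() marshalls
# the arguments, ships them over the network, and unmarshalls the result.
client = ServerProxy("http://127.0.0.1:8765")
result = client.add(2, 3)
assert result == 5                    # looks exactly like a local call
server.shutdown()
```

From the caller's point of view `client.add(2, 3)` is indistinguishable from a normal procedure call, which is precisely the illusion RPC aims for.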

SLIDE 26

Remote Procedure Call (2)

Implementation Issues

  • Cannot pass pointers
  • call by reference becomes copy-restore (but might fail)
  • Weakly typed languages
  • the client stub cannot determine the size
  • Not always possible to determine parameter types
  • Cannot use global variables
  • the program may get moved to a remote machine
  • RFC 1057
  • RPC v2: RFC 1831

SLIDE 27

RPC: a Schematic View

(Figure: thread P on system A calls Y=FNCT(X,Y); the RPC package copies a:=X, b:=Y across to system B, where FNCT(a,b) computes c and returns it as Y.)

SLIDE 28

Distributed Shared Memory

  • Can be implemented in:
  • hardware
  • operating system
  • user-level software

SLIDE 29

Scheduling & Load Balancing

  • Each node has its own set of processes
  • Local scheduling decisions
  • Global decision on allocation of processes
  • Gang scheduling over multicomputers is possible
  • Processor allocation algorithms (for load balancing):
  • A graph-theoretic deterministic algorithm
  • A sender-initiated distributed heuristic algorithm
  • initiated by an overloaded node
  • A receiver-initiated distributed heuristic algorithm
  • initiated by an underloaded node
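The sender-initiated heuristic can be sketched as follows: an overloaded node probes a few randomly chosen peers and migrates one unit of work to the first peer it finds below the threshold. The probe count, threshold, and one-unit migrations are my assumptions for illustration.

```python
import random

def sender_initiated_balance(loads, threshold, probes=3, seed=0):
    """Sketch of the sender-initiated distributed heuristic: each node
    above the threshold probes `probes` random peers and offloads one
    unit of work to the first underloaded peer it finds."""
    rng = random.Random(seed)         # seeded for a reproducible sketch
    nodes = list(range(len(loads)))
    for sender in nodes:
        if loads[sender] <= threshold:
            continue                  # only overloaded nodes initiate
        for peer in rng.sample([n for n in nodes if n != sender], k=probes):
            if loads[peer] < threshold:
                loads[peer] += 1      # migrate one process to the peer
                loads[sender] -= 1
                break
    return loads

balanced = sender_initiated_balance([9, 1, 2, 8, 1], threshold=4)
assert sum(balanced) == 21                    # work is conserved
assert max(balanced) < max([9, 1, 2, 8, 1])   # the peak load went down
```

The receiver-initiated variant is the mirror image: an underloaded node probes peers asking for work, which shifts the probing overhead to times when the system is lightly loaded.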

SLIDE 30

Virtualization

SLIDE 31

Background and Motivation

  • Consolidating servers
  • Each service in its own virtual machine
  • Still not dependent on the others
  • Each has its own OS, libraries, and configuration files
  • Each can fail independently, with no effect on other virtual machines
  • Fewer physical machines
  • Reduced hardware and energy costs
  • A hardware (or hypervisor) failure brings down all services on that server
  • Other benefits:
  • Checkpointing and migration are straightforward (memory image)
  • Running legacy systems (no hardware available any more)
  • Software development (testing on several OSes)

SLIDE 32

Requirements

  • A virtual machine must act just like the real hardware
  • Booting the machine, installing an operating system, etc.
  • The hypervisor provides this illusion
  • The hypervisor emulates the hardware by ”interpreting” the machine code instructions
  • Hardware support is necessary for type 1 hypervisors
  • Privileged instructions (trap if run in user mode)
  • Sensitive instructions (may be executed only in kernel mode)
  • Virtualizable only when the sensitive instructions are a subset of the privileged ones
  • Intel 386 not virtualizable:
  • Some sensitive instructions are ignored in user mode
  • Some instructions can read sensitive data without a trap
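The subset condition stated above is a one-liner to check. The instruction sets below are illustrative, but the 386 flaw is real: `POPF` silently ignores the interrupt flag in user mode, so it is sensitive without being privileged.

```python
def is_virtualizable(sensitive, privileged):
    """The slide's condition: a machine is (classically) virtualizable only
    when every sensitive instruction is also privileged, i.e. traps in
    user mode so the hypervisor can intercept it."""
    return set(sensitive) <= set(privileged)

# Hypothetical, well-behaved instruction set: all sensitive ops trap.
assert is_virtualizable(sensitive={"HLT", "LGDT"},
                        privileged={"HLT", "LGDT", "INVLPG"})

# 386-style flaw: POPF is sensitive but does not trap in user mode.
assert not is_virtualizable(sensitive={"HLT", "POPF"},
                            privileged={"HLT"})
```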

SLIDE 33

Type 1 Hypervisor

  • Execution of privileged instructions:
  • the hardware detects it!
  • Trap to the hypervisor
  • Check whether the instruction came from the guest OS or a user application
  • Execute the instruction on behalf of the guest OS
  • Emulate the actual behavior for user applications

SLIDE 34

Type 2 Hypervisor

  • No hardware support needed
  • The hypervisor itself runs in user mode
  • Executing a program (or OS) in the virtual machine is done using binary translation
  • First, the hypervisor scans the code for basic blocks: runs of instructions without any jumps or other changes to the program counter
  • It replaces each sensitive instruction, and the last instruction of each block, with a call to the hypervisor's own procedure
  • The modified block is cached within the hypervisor and executed
  • Binary translation can be used on type 1 hypervisors too, to avoid traps
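The scan-and-rewrite step above can be sketched over a symbolic instruction stream: split at branches into basic blocks, then rewrite every sensitive instruction and each block's final instruction into a hypervisor call. The opcode names and `CALL_HYPERVISOR` marker are illustrative only.

```python
SENSITIVE = {"POPF", "SGDT"}          # hypothetical sensitive opcodes
BRANCHES = {"JMP", "CALL", "RET"}     # instructions that change the PC

def translate_block(code):
    """Sketch of binary translation: find basic blocks, then replace each
    sensitive instruction and each block's last instruction with a call
    into the hypervisor, as described on the slide."""
    blocks, current = [], []
    for instr in code:
        current.append(instr)
        if instr in BRANCHES:          # a branch ends the basic block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    translated = []
    for block in blocks:
        out = []
        for i, instr in enumerate(block):
            last = (i == len(block) - 1)
            if instr in SENSITIVE or last:
                out.append(f"CALL_HYPERVISOR({instr})")
            else:
                out.append(instr)      # innocuous instruction runs natively
        translated.append(out)
    return translated

result = translate_block(["MOV", "POPF", "ADD", "JMP", "SUB", "RET"])
assert result[0] == ["MOV", "CALL_HYPERVISOR(POPF)", "ADD", "CALL_HYPERVISOR(JMP)"]
assert result[1] == ["SUB", "CALL_HYPERVISOR(RET)"]
```

Rewriting the final instruction of every block is what lets the hypervisor regain control before execution leaves the translated (and cached) block.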

SLIDE 35

Paravirtualization

  • Use only a modified guest operating system
  • Do the modifications (done in binary translation) directly to the code of the guest OS
  • The modified OS can no longer run directly on the hardware on its own; it needs the hypervisor
  • Create a special Virtual Machine Interface
  • The implementation under the interface can change (VMILinux using VMIL)

SLIDE 36

VMI - Virtual Machine Interface

  • The modified kernel calls procedures of the VMI for any sensitive operations
  • The VMI forms a low-level layer that interfaces with the hardware or hypervisor
  • (just change the procedure implementations)

SLIDE 37

Distributed Systems

  • The difference between a multicomputer and a distributed system depends on the viewpoint
  • Distributed systems are loosely coupled
  • Covered in our Distributed Systems course

SLIDE 38

Distributed Systems

Achieving uniformity with middleware

SLIDE 39

The Internet