Niagara (T1): A CMT Processor, by Rao Shoaib, Solaris Core Technology

SLIDE 1

Niagara (T1): A CMT Processor

Rao Shoaib, Solaris Core Technology group, rao.shoaib@sun.com

SLIDE 2

Sun Proprietary Information

Agenda:

  • Why CMT Processors
  • Highlights of Sun Niagara Processor
  • Performance characteristics of T1
  • Need for Virtualization
  • CMT & Virtualization
  • Sun Virtualization Solutions
  • HW and Software Network Virtualization

SLIDE 3

Case For CMT Processors

SLIDE 4

Traditional processor behavior

[Figure: two single-thread timelines that alternate compute (C) and memory latency (M) phases, one for a single scalar processor and one for a processor optimized for ILP. A "Time Saved" label marks the difference between the two timelines.]

SLIDE 5

Characteristics of Commercial Workloads

  • High degree of thread-level parallelism (TLP)
  • Large working sets result in poor locality of reference, leading to high cache miss rates
  • Significant data sharing among threads results in coherence misses
  • Low instruction-level parallelism (ILP) due to high cache miss rates, difficult-to-predict branches, etc.
  • Performance is bottlenecked by stalls on memory access

SLIDE 6

Sun Solution

NIAGARA

Chip Multi Threaded Processor

SLIDE 7

Niagara (T1)

  • Uses CPU threads to exploit TLP
    – Memory and pipeline stall times are hidden by multiple threads
    – Shared L2 cache allows efficient data sharing between threads
  • Memory system is designed for high throughput
    – High-bandwidth interface to L2 cache for L1 misses
    – Highly associative L2 cache
    – High-bandwidth interface to DRAM

SLIDE 8

Designed for Performance and Efficiency

  • Dedicated integrated memory controllers
  • Clean-sheet design delivers highest performance, efficiency
  • On-chip simplicity means no wait latency
  • Integrated internal communications bus

[Figure: block diagram. Labels: cores C1 to C8, four L2$ banks, crossbar (Xbar), four DDR-2 SDRAM interfaces, FPU, Sys I/F Buffer Switch Core.]

SLIDE 9

Niagara Specs

  • Up to 32 threads, 8 cores
  • Unique L1$ per core: 16 KB instruction, 8 KB data
  • Shared L2$: 3 MB, 134 GB/s, 12-way associative
  • Radically changed cache coherency processing
  • 4x DDR2 on-chip memory controllers, 23 GB/s
  • Up to 128 GB memory
  • SSL support: 7X the RSA throughput of Xeon
  • Requires about 70 watts
  • Each thread requires just about 2.0 watts
  • No recompilation required

SLIDE 10

Thread Selection Policy

  • The CPU switches between available threads every cycle, giving priority to the least recently executed thread
  • Threads become unavailable due to:
    – Long-latency ops: loads, branches, mul, div
    – Pipeline stalls such as cache misses, traps, and resource conflicts
  • Loads are speculated as cache hits, and the thread is switched in with lower priority
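
The selection policy can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the names `select_thread` and `run` are hypothetical, stalls are supplied by the caller as a table, and the lower-priority handling of speculated loads is not modeled.

```python
def select_thread(last_issued, available):
    """Pick the available thread that issued least recently (None if all are stalled)."""
    candidates = [t for t in range(len(last_issued)) if available[t]]
    if not candidates:
        return None
    return min(candidates, key=lambda t: last_issued[t])


def run(num_threads, cycles, stalls):
    """Simulate thread selection for `cycles` cycles.

    `stalls` maps (cycle, thread) to the number of cycles the thread is
    unavailable after issuing in that cycle (an assumed input, standing in
    for long-latency ops and pipeline stalls).
    """
    last_issued = [-1] * num_threads   # cycle in which each thread last issued
    ready_at = [0] * num_threads       # first cycle each thread is available again
    trace = []
    for cycle in range(cycles):
        available = [ready_at[t] <= cycle for t in range(num_threads)]
        t = select_thread(last_issued, available)
        trace.append(t)
        if t is None:
            continue                   # every thread is stalled: the cycle is lost
        last_issued[t] = cycle
        ready_at[t] = cycle + 1 + stalls.get((cycle, t), 0)
    return trace
```

With no stalls, four threads simply take turns; when thread 0 stalls for five cycles after its first issue, the remaining three threads absorb its slots.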

SLIDE 11

Multithreaded Process on Niagara

[Figure: timelines for eight pipelines (Pipe0 to Pipe7), each interleaving Thread 1 to Thread 4 so that compute phases of one thread overlap the memory latency of the others. Horizontal axis: time.]

A larger number of memory references outstanding from overlapping h/w threads leads to higher throughput.
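
A rough way to see why overlapping threads raise throughput is an idealized utilization model. The model and the function name `pipeline_utilization` are assumptions for illustration, not taken from the slides: each thread alternates a fixed number of compute cycles with a fixed memory latency, and thread switching is free.

```python
def pipeline_utilization(compute_cycles, memory_cycles, threads):
    """Fraction of cycles in which one pipeline does useful work.

    Idealized model: a single thread is busy for
    compute / (compute + memory) of the time, and additional threads
    fill the idle cycles until the pipeline saturates at 1.0.
    """
    per_thread = compute_cycles / (compute_cycles + memory_cycles)
    return min(1.0, threads * per_thread)


# Illustrative numbers only: 1 compute cycle per 3 cycles of memory latency.
one_thread = pipeline_utilization(1, 3, 1)    # 0.25
four_threads = pipeline_utilization(1, 3, 4)  # 1.0
```

In this model a single thread leaves the pipeline idle three quarters of the time, while four threads keep it fully busy. Real workloads have variable latencies and contend for shared caches, so measured gains will differ.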

SLIDE 12

Sun Confidential: Sun Employees and Authorized Partners Only

SWaP (Space, Watts and Performance)

SWaP Rating = Performance / (Space × Watts)

Sun Fire T2000:

  • Space: 2 RU
  • Watts: 312
  • Performance: 19,000 users (Lotus R6iNotes)
  • SWaP Rating: 30.4
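
The rating on this slide can be reproduced directly from the formula. The function name `swap_rating` is hypothetical; the inputs are the Sun Fire T2000 figures quoted on the slide.

```python
def swap_rating(performance, space_ru, watts):
    """SWaP rating = performance / (space in rack units * power in watts)."""
    return performance / (space_ru * watts)


# Sun Fire T2000 figures from the slide: 19,000 users, 2 RU, 312 W.
rating = swap_rating(19_000, 2, 312)
print(f"{rating:.1f}")  # prints 30.4
```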

SLIDE 13

Sun Fire T1000 Crushes Xeon and p5+

[Figure: Sun Fire T1000 vs. Dell SC1425 and vs. IBM p5+ 520 on SPECjbb2005, compared on space, power usage, performance, and SWaP. Values shown on the slide, in extraction order: 2.1X; 1/2, Same, 4.4X, 1.6X, 1/2, 1/4, 14X.]

SLIDE 14

Niagara-2 (T2): True System on a Chip

  • Better performance than Niagara-1
  • Up to 8 cores
  • Up to 64 threads per CPU
  • Same power envelope as T1
  • On-chip NICs
  • And much more that I cannot state

SLIDE 15

Performance Characteristics of T1

SLIDE 16

Positive Characteristics

  • If a strand is stalled, its cycles can be utilized by other threads
  • Multiple threads running the same application benefit by sharing text and data in the L2 cache
  • These characteristics make CMT ideal for throughput computing

SLIDE 17

Not so Positive Characteristics

  • If one thread is thrashing the L1 instruction cache, data cache, or TLBs on a core, it can adversely affect other threads on that core
  • If all of an application's threads run on the same core, each gets only about one-quarter of the CPU time
  • So CMT is not ideal for real-time applications

SLIDE 18

Scaling issues to be aware of

  • Hot locks are the most common reason applications fail to scale on CMT processors
  • Tune critical sections
  • Apply more threads, as CMT is a thread-rich environment
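
One common way to cool a hot lock is to split it into several independent locks (lock striping). The sketch below is an illustration of that general technique, not something taken from the slides; the class name `StripedCounter` and its parameters are hypothetical. It is written in Python only to show the structure, since CPython's global interpreter lock limits actual parallel speedup.

```python
import threading


class StripedCounter:
    """Counter split into independently locked stripes.

    Threads that map to different stripes never contend for the same
    lock, so no single lock becomes hot.
    """

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [0] * stripes

    def increment(self, key):
        i = hash(key) % len(self._locks)
        with self._locks[i]:           # short critical section on one stripe
            self._counts[i] += 1

    def total(self):
        # Acquire every stripe lock (always in the same order) so the
        # sum is a consistent snapshot.
        for lock in self._locks:
            lock.acquire()
        try:
            return sum(self._counts)
        finally:
            for lock in self._locks:
                lock.release()
```

The design trades a slightly more expensive read (`total` takes every lock) for much cheaper, less contended writes, which suits workloads with many updating threads.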

SLIDE 19

Server Virtualization

SLIDE 20

Benefits of Virtualization

  • Virtualization is the masking and sharing of server resources
  • Results in:
    – Server consolidation
    – Higher server utilization
    – Increased operational efficiency
    – Improved manageability

SLIDE 21

CMT and Virtualization

  • CMT provides hooks for server virtualization
  • Each strand can be a virtual CPU
  • Niagara-2 also provides support for network virtualization

SLIDE 22

Solaris Virtualization Solutions

  • Containers (similar to BSD jails)
  • Logical Domains (individual OS instance per domain)
  • Xen

SLIDE 23

Logical Domains + Zones

[Figure: a hypervisor on the hardware hosts three logical domains (LDom 1: Solaris 10, LDom 2: Solaris 10, LDom 3: Solaris 11), each with its own CPU and memory. Applications run inside the domains, some of them within zones (Zone 1, Zone 2). CPU, memory, and I/O are shared underneath.]

  • Partitioning capability
    > Create virtual machines, each with a subset of resources
    > Protection and isolation using a HW + firmware combination

SLIDE 24

Network Virtualization

SLIDE 25

HW-Based Network Virtualization

  • Niagara-2 (T2) has on-chip network interfaces
  • Supports network virtualization/partitioning
    – Multiple partitions can coexist within a port
    – Only the cable, MAC, and RX FIFOs are shared
  • Virtualization/partitioning can be based on:
    – VLANs: up to 4K per port
    – MAC addresses: up to 16 per port
    – Service addresses (IP addresses, TCP/UDP ports): up to 256 per device
  • Interrupts for a flow are sent to a particular CPU
  • Full register sets are provided to control RX rings

SLIDE 26

NIU RX Classification Model

[Figure: incoming traffic enters the MAC, passes through the NIU flow classifier, and is distributed across multiple RX DMA channels.]

Incoming flows are classified at layer 2, 3, or 4 and placed into an RX DMA channel according to the classification rule that matched the flow.

Solaris classification interface:

  • m_l2_classify_add()
  • m_l2_classify_remove()
  • m_classify_add()
  • m_classify_remove()
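
The classification step can be sketched in software as rule matching that maps a packet to an RX DMA channel. Everything in this sketch is an assumption made for illustration: the names `Packet`, `Rule`, and `classify`, the chosen header fields, the first-match-wins ordering, and the default channel. It does not reproduce the NIU hardware or the Solaris `m_classify_*` interface, whose signatures are not given in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    dst_mac: str            # layer 2
    vlan: Optional[int]     # layer 2
    dst_ip: str             # layer 3
    dst_port: int           # layer 4


@dataclass(frozen=True)
class Rule:
    channel: int                       # RX DMA channel for matching flows
    dst_mac: Optional[str] = None      # None means "match anything"
    vlan: Optional[int] = None
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None

    def matches(self, pkt):
        fields = ("dst_mac", "vlan", "dst_ip", "dst_port")
        return all(
            getattr(self, f) is None or getattr(self, f) == getattr(pkt, f)
            for f in fields
        )


def classify(pkt, rules, default_channel=0):
    """Return the channel of the first matching rule, else the default channel."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.channel
    return default_channel


rules = [
    Rule(channel=2, dst_port=443),                  # layer-4 rule
    Rule(channel=1, dst_mac="00:11:22:33:44:55"),   # layer-2 rule
]
```

For example, with these rules a packet to port 443 lands on channel 2, a packet to the listed MAC address on another port lands on channel 1, and anything else falls through to channel 0.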

SLIDE 27

Software-Based Network Virtualization

  • Not all NICs have HW support for virtualization
  • Software creates virtual stacks over 1Gb and 10Gb NICs
  • Virtual stacks are isolated from each other (for both resource and security purposes)
  • Each virtual stack can be tuned separately

SLIDE 28

Virtualized Networking

[Figure: a physical NIC feeds a flow classifier that steers traffic into per-zone memory areas (Global Zone, Zone 1, ..., Zone n), with virtual NICs above them. Zone 1 uses a shared network stack with the Global Zone (Global Zone squeue, Zone 1 squeue); Zone 2 has an exclusive network stack with its own squeue. The diagram marks which parts are specific to containers and which are common to all virtual machines.]

SLIDE 29

Virtual Network with Xen

[Figure: a physical NIC feeds a flow classifier that steers traffic into separate memory areas: Host OS (all traffic), Guest OS 2 (all traffic), and Guest OS 1 (HTTP, HTTPS, and default). Labeled components include virtual NICs and NIC virtualization engines. The Solaris host OS and Solaris guest OS 2 each have a single virtual squeue and VNIC for all traffic; Solaris guest OS 1 has separate HTTP, HTTPS, and default squeues.]

SLIDE 30

Future Work

  • More work is needed to characterize different workloads on CMT processors and define best practices
  • Open interfaces are needed to implement virtualization
  • Network bandwidth/resource control support is needed in HW

SLIDE 31

References

  • Various Sun internal and external documents and publications on Niagara

SLIDE 32

Niagara (T1): A CMT Processor

Rao Shoaib, Solaris Core Technology group, rao.shoaib@sun.com