Niagara (T1): A CMT Processor
Rao Shoaib, Solaris Core Technology group, rao.shoaib@sun.com
Sun Proprietary Information
Agenda:
- Why CMT Processors
- Highlights of Sun Niagara Processor
- Performance characteristics of T1
- Need for Virtualization
- CMT & Virtualization
- Sun Virtualization Solutions
- HW and Software Network Virtualization
Case For CMT Processors
[Figure: Traditional processor behavior. A single thread alternates compute (C) and memory latency (M) phases: C M C M C M. Two timelines are compared, a single scalar processor and a processor optimized for ILP, with the difference between them labeled "Time Saved".]
Characteristics of Commercial Workloads
- High degree of thread-level parallelism (TLP)
- Large working sets result in poor locality of reference, leading to high cache miss rates
- Significant data sharing among threads results in coherence misses
- Low instruction-level parallelism (ILP) due to high cache miss rates, hard-to-predict branches, etc.
- Performance is bottlenecked by stalls on memory access
Sun Solution
NIAGARA
Chip Multi Threaded Processor
Niagara(T1)
- Uses CPU threads to exploit TLP
– Memory and pipeline stall times are hidden by running multiple threads
– Shared L2 cache allows efficient data sharing between threads
- Memory system is designed for high throughput
– High bandwidth interface to L2 cache for L1 misses
– Highly associative L2 cache
– High bandwidth interface to DRAM
Designed for Performance and Efficiency
- Clean sheet design delivers highest performance and efficiency
- Dedicated integrated memory controllers
- On-chip simplicity means no wait latency
- Integrated internal communications bus
[Figure: T1 block diagram. Labels: cores C1 through C8, Xbar, four L2$ banks, four DDR-2 SDRAM interfaces, FPU, Sys I/F Buffer Switch Core.]
Niagara Specs
- Up to 32 threads, 8 cores
- Unique L1$ 16KB-I, 8KB-D per core
- Shared L2$: 3MB, 134GB/s, 12-way associative
- Radically changed cache coherency processing
- 4 x DDR2 on-chip memory controllers, 23GB/s
- Up to 128GB memory
- SSL support: 7X the RSA throughput of Xeon
- Requires about 70 Watts
- Each thread requires just about 2.0 watts
- No Recompilation required
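The per-thread power figure can be sanity-checked from the numbers on this slide. The sketch below is only arithmetic, and it assumes the whole chip budget is divided evenly across threads, which is a simplification.

```python
# Figures taken from the spec list above.
CHIP_WATTS = 70.0   # "about 70 Watts"
CORES = 8
THREADS = 32

threads_per_core = THREADS // CORES
# Assumption: chip power is split evenly across all hardware threads.
watts_per_thread = CHIP_WATTS / THREADS
watts_per_core = CHIP_WATTS / CORES

print(f"{threads_per_core} threads per core")
print(f"{watts_per_thread:.2f} W per thread, {watts_per_core:.2f} W per core")
```

The result is a little over 2 W per thread, in line with the rounded figure quoted on the slide.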
Thread Selection Policy
- CPU switches between available threads every cycle, giving priority to the least recently executed thread
- Threads become unavailable due to:
– Long latency ops: loads, branch, mul, div
– Pipeline stalls such as cache misses, traps, and resource conflicts
- Loads are speculated as cache hits, and the thread is switched in with lower priority
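The selection policy above can be illustrated with a small simulation. This is a minimal sketch, not the hardware logic: it models one core with four threads, picks the least recently executed available thread each cycle, and does not model load speculation or the lower priority it implies. The names `select_thread` and `run` are hypothetical.

```python
from collections import deque

def select_thread(lru_order, available):
    """Pick the least recently executed thread that is available.

    lru_order: deque of thread ids, least recently executed first.
    available: set of thread ids that are not stalled this cycle.
    Returns the chosen thread id, or None if every thread is stalled.
    """
    chosen = next((t for t in lru_order if t in available), None)
    if chosen is not None:
        lru_order.remove(chosen)
        lru_order.append(chosen)  # now the most recently executed
    return chosen

def run(cycles, stalls):
    """stalls maps a cycle number to the set of thread ids stalled in it."""
    lru_order = deque([0, 1, 2, 3])
    schedule = []
    for cycle in range(cycles):
        available = {0, 1, 2, 3} - stalls.get(cycle, set())
        schedule.append(select_thread(lru_order, available))
    return schedule

# Thread 1 is stalled (for example on a cache miss) during cycles 1 to 3.
schedule = run(8, {1: {1}, 2: {1}, 3: {1}})
print(schedule)
```

In this run thread 1 is skipped while it is stalled, the other three threads fill its slots, and thread 1 is selected first once it becomes available again.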
Multithreaded Process on Niagara
[Figure: timelines for Pipe0 through Pipe7, each interleaving Thread 1 through Thread 4 over time; segments are marked as compute or memory latency.]
A larger number of outstanding memory references from overlapping h/w threads leads to higher throughput.
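A simple way to see why overlapping threads raise throughput is an idealized utilization model. The sketch below assumes every thread alternates a fixed compute phase with a fixed memory stall and that switching threads costs nothing; the 25/75 cycle split is an illustrative assumption, not a measured T1 figure.

```python
def pipeline_utilization(threads, compute_cycles, memory_cycles):
    """Fraction of cycles in which a pipeline has work to issue.

    Idealized model: each thread computes for compute_cycles, then
    stalls for memory_cycles, and stalls of different threads overlap
    perfectly with the compute phases of the others.
    """
    per_thread = compute_cycles / (compute_cycles + memory_cycles)
    return min(1.0, threads * per_thread)

for n in (1, 2, 4):
    util = pipeline_utilization(n, compute_cycles=25, memory_cycles=75)
    print(f"{n} thread(s): {util:.0%} pipeline utilization")
```

Under these assumptions one thread keeps the pipeline busy a quarter of the time, and four threads are enough to hide the memory latency completely.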
Sun Confidential: Sun Employees and Authorized Partners Only
SWaP (Space, Watts and Perf)
- SWaP Rating = Performance / (Space * Watts)
- Sun Fire T2000: Performance 19,000 users (1), Space 2 RU, Power 312 Watts
- Sun Fire T2000 SWaP Rating = 19,000 / (2 * 312) = 30.4
(1) Lotus R6iNotes
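The SWaP formula is easy to reproduce. The sketch below uses only the Sun Fire T2000 figures from this slide; the function name `swap_rating` is a hypothetical helper.

```python
def swap_rating(performance, rack_units, watts):
    """SWaP rating = Performance / (Space * Watts)."""
    return performance / (rack_units * watts)

# Sun Fire T2000 figures from the slide: 19,000 users, 2 RU, 312 W.
t2000 = swap_rating(performance=19_000, rack_units=2, watts=312)
print(f"Sun Fire T2000 SWaP rating: {t2000:.1f}")
```

Because space and power sit in the denominator, halving either one doubles the rating at the same performance.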
Sun Fire T1000 Crushes Xeon and p5+
Sun Fire T1000 vs. Dell SC1425 (SPECjbb2005): Performance 2.1X, Power Usage 1/2, Space Same, SWaP 4.4X
Sun Fire T1000 vs. IBM p5+ 520 (SPECjbb2005): Performance 1.6X, Power Usage 1/2, Space 1/4, SWaP 14X
Niagara-2 (T2): True System on a Chip
- Better performance than Niagara-1
- Up to 8 Cores
- Up to 64 threads per CPU
- Same power envelope as T1
- On-chip NICs
- And much more that I cannot state
Performance Characteristics of T1
Positive Characteristics
- If a strand is stalled, its cycles can be utilized by other threads
- Multiple threads running the same application benefit by sharing text and data in L2 cache
- These characteristics make CMT ideal for throughput computing
Not so Positive Characteristics
- If one thread is thrashing the L1 instruction cache, data cache, or TLBs on a core, it can adversely affect the other threads on that core
- If four threads share one core, each gets only about one-quarter of that core's CPU time
- So CMT is not ideal for real-time applications
Scaling issues to be aware of
- Hot locks are the most common reason applications fail to scale on CMT processors
- Tune critical sections
- Apply more threads, as CMT is a thread-rich environment
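One common way to cool a hot lock is to split the protected data into stripes, each with its own lock, so threads rarely contend for the same one. The sketch below is a generic illustration of that idea, not a technique named in the slides; `StripedCounter` and `worker` are hypothetical names, and in CPython the example shows the structure rather than a measurable speedup.

```python
import threading

class StripedCounter:
    """Counter split into stripes, each guarded by its own lock."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [0] * stripes

    def increment(self, key):
        # Threads with different keys usually land on different stripes,
        # so they do not serialize on a single hot lock.
        i = hash(key) % len(self._locks)
        with self._locks[i]:
            self._counts[i] += 1

    def total(self):
        total = 0
        for i, lock in enumerate(self._locks):
            with lock:
                total += self._counts[i]
        return total

def worker(counter, worker_id, n):
    for _ in range(n):
        counter.increment(worker_id)

counter = StripedCounter(stripes=8)
threads = [threading.Thread(target=worker, args=(counter, w, 1000))
           for w in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())
```

The critical section here is a single increment, which also reflects the second bullet: the less work done while holding a lock, the less time other threads spend waiting.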
Server Virtualization
Benefits of Virtualization
- Virtualization is the masking and sharing of server resources
- Results in:
– Server consolidation
– Higher server utilization
– Increased operational efficiency
– Improved manageability
CMT and Virtualization
- CMT provides hooks for server virtualization
- Each strand can be a virtual CPU
- Niagara-2 also provides support for network virtualization
Solaris Virtualization Solutions
- Containers (similar to BSD Jails)
- Logical Domains (an individual OS instance per domain)
- Xen
Logical Domains + Zones
[Figure: Logical Domains + Zones. A Hypervisor runs on the hardware and hosts LDom 1 (Solaris 10), LDom 2 (Solaris 10), and LDom 3 (Solaris 11), each with its own CPU and Mem. Apps run inside the domains, some within Zone 1 and Zone 2. CPU, memory, and I/O are shared underneath.]
- Partitioning capability
> Create virtual machines, each with a subset of resources
> Protection & isolation using a HW + firmware combination
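Since each strand can be presented as a virtual CPU, partitioning amounts to handing disjoint sets of strands to domains. The sketch below illustrates that bookkeeping for a T1-sized chip (8 cores, 4 strands per core); it is not the Logical Domains tooling, and `partition_strands` and the domain names are hypothetical.

```python
CORES = 8
STRANDS_PER_CORE = 4

def partition_strands(requests):
    """Assign disjoint strands to domains, in request order.

    requests: dict mapping a domain name to its virtual CPU count.
    Returns a dict mapping each domain name to a list of
    (core, strand) pairs.
    """
    strands = [(c, s) for c in range(CORES) for s in range(STRANDS_PER_CORE)]
    if sum(requests.values()) > len(strands):
        raise ValueError("more virtual CPUs requested than strands available")
    assignment, start = {}, 0
    for name, count in requests.items():
        assignment[name] = strands[start:start + count]
        start += count
    return assignment

domains = partition_strands({"ldom1": 16, "ldom2": 8, "ldom3": 8})
for name, vcpus in domains.items():
    cores = sorted({core for core, _ in vcpus})
    print(f"{name}: {len(vcpus)} virtual CPUs on cores {cores}")
```

Allocating in whole-core multiples, as in this example, keeps each domain's threads from sharing L1 caches and TLBs with another domain's threads.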
Network Virtualization
HW Based Network Virtualization
- Niagara-2 (T2) has on-chip network interfaces
- Supports network virtualization/partitioning
– Multiple partitions can co-exist within a port
– Only the cable, MAC and RX FIFOs are shared
- Virtualization/partitioning can be based on:
– VLANs: up to 4K per port
– MAC addresses: up to 16 per port
– Service addresses (IP addresses, TCP/UDP ports): up to 256 per device
- Interrupts for a flow are sent to a particular CPU
- Full register sets are provided to control RX rings
NIU RX Classification Model
[Figure: Incoming Traffic -> MAC -> NIU Flow Classifier -> multiple RX DMA channels.]
Incoming flows are classified at layer 2, 3, or 4 and placed into the RX DMA channel selected by the classification rule that matched the flow.
Solaris Classification Interface:
- m_l2_classify_add()
- m_l2_classify_remove()
- m_classify_add()
- m_classify_remove()
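The classification model can be sketched in software as an ordered rule table that maps packet header fields to an RX DMA channel. This is only an illustration of the idea: the class and field names below are hypothetical, and it does not reproduce the Solaris classification interface, whose signatures are not given in the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Packet:
    dst_mac: str
    vlan: Optional[int] = None
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None

@dataclass(frozen=True)
class Rule:
    channel: int                      # RX DMA channel for matching flows
    dst_mac: Optional[str] = None     # layer 2
    vlan: Optional[int] = None        # layer 2
    dst_ip: Optional[str] = None      # layer 3
    dst_port: Optional[int] = None    # layer 4

    def matches(self, pkt):
        # A field left as None acts as a wildcard.
        fields = ("dst_mac", "vlan", "dst_ip", "dst_port")
        return all(getattr(self, f) is None or getattr(self, f) == getattr(pkt, f)
                   for f in fields)

class FlowClassifier:
    def __init__(self, default_channel=0):
        self.rules = []
        self.default_channel = default_channel

    def add(self, rule):
        self.rules.append(rule)

    def remove(self, rule):
        self.rules.remove(rule)

    def classify(self, pkt):
        # First matching rule wins; unmatched traffic uses the default channel.
        for rule in self.rules:
            if rule.matches(pkt):
                return rule.channel
        return self.default_channel

clf = FlowClassifier()
clf.add(Rule(channel=3, dst_ip="10.0.0.5", dst_port=443))  # layer 3/4 rule
clf.add(Rule(channel=1, dst_mac="00:14:4f:00:00:01"))      # layer 2 rule

https = Packet("00:14:4f:00:00:01", dst_ip="10.0.0.5", dst_port=443)
http = Packet("00:14:4f:00:00:01", dst_ip="10.0.0.5", dst_port=80)
other = Packet("00:14:4f:00:00:99")
print(clf.classify(https), clf.classify(http), clf.classify(other))
```

Here the more specific layer 3/4 rule is added first so it takes precedence over the layer 2 rule for the same MAC address; a hardware classifier would apply its own precedence rules.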
Software Based Network Virtualization
- Not all NICs have HW support for virtualization
- Software creates virtual stacks over 1Gb and 10Gb NICs
- Virtual stacks are isolated from each other (for both resource and security purposes)
- Each virtual stack can be tuned separately
Virtualized Networking
[Figure: Virtualized networking with Containers. A physical NIC feeds a Flow Classifier that steers traffic into per-zone memory areas (Global Zone, Zone 1, ... Zone n), each served by a Virtual NIC and an Squeue (Global Zone Squeue, Zone 1 Squeue, Zone 2 Squeue). Zone 1 uses the shared network stack with the Global Zone; Zone 2 has an exclusive network stack. The diagram marks which parts are specific to Containers and which are common to all virtual machines.]
Virtual Network with XEN
[Figure: Virtual network with Xen. A physical NIC and Flow Classifier steer traffic into separate memory areas: Host OS (all traffic), Guest OS 2 (all traffic), and Guest OS 1 (HTTP, HTTPS, and default). Each path passes through a NIC Virtualization Engine and a Virtual NIC. The Solaris Host OS and Solaris Guest OS 2 each have a VNIC with one virtual squeue for all traffic; Solaris Guest OS 1 has HTTP, HTTPS, and default squeues.]
Future Work
- More work is needed to characterize different workloads on CMT processors and define best practices
- Open interfaces are needed to implement virtualization
- Network bandwidth/resource control support is needed in HW
References
- Various Sun internal and external documents and publications on Niagara