System X: Virginia Tech's Supercomputer
The fastest academic supercomputer
Project #2, CS466, Fall 2004
By Raj Bharath Swaminathan and Hareesh Nagarajan
{rswamina, hnagaraj}@cs.uic.edu
University of Illinois at Chicago
How was it built?
- Virginia Tech faculty (the Terascale Computing
Facility – TCF) worked closely with vendor partners
- 1100 Power Mac G5s were put into racks and
the construction began
- In parallel, device drivers, hand optimization of
numerical libraries, and code porting were going on
- The supercomputer was on paper in Feb 2003
and was built by September 2003
- Unfortunately, the system couldn't perform
scientific computation, as ECC RAM was required and the G5 didn't support it. Enter the Xserve G5.
The TCF lab went from looking like this (left) to this (bottom)
Specification
- Nodes: 1100 Apple Xserve G5 2.3 GHz dual-
processor cluster nodes (4 GB RAM, 80 GB S-ATA HD)
– 4.4 TB (4400 GB) of RAM
– 88 TB (88000 GB) of HDD
– 2200 processors
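The aggregate figures above follow directly from the per-node specification; a quick sanity check (all numbers are taken from this slide):

```python
# Sanity check of the aggregate System X figures quoted above.
nodes = 1100
ram_per_node_gb = 4
disk_per_node_gb = 80
cpus_per_node = 2

total_ram_gb = nodes * ram_per_node_gb    # 4400 GB = 4.4 TB
total_disk_gb = nodes * disk_per_node_gb  # 88000 GB = 88 TB
total_cpus = nodes * cpus_per_node        # 2200 processors

print(total_ram_gb, total_disk_gb, total_cpus)
```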
- Primary Communication
24 Mellanox 96-port InfiniBand switches (4X InfiniBand, 10 Gbps)
- Secondary Communication
6 Cisco 4506 Gigabit Ethernet switches
- Cooling: Liebert X-treme Density System cooling
- Software
Mac OS X, MVAPICH, XLC & XLF
- Current Linpack
– Rpeak = 20.24 Teraflops
– Rmax = 12.25 Teraflops
– Nmax = 620000
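The Rpeak figure can be reconstructed from the processor count and clock speed, assuming each G5 retires one fused multiply-add (2 flops) per cycle on each of its two FPUs. A sketch of that derivation, plus the resulting Linpack efficiency:

```python
# Deriving Rpeak from the slide's hardware figures (the 4 flops/cycle
# assumption follows from two FPUs each doing a fused multiply-add).
cpus = 2200
clock_ghz = 2.3
flops_per_cycle = 2 * 2  # 2 FPUs x 2 flops (fused multiply-add)

rpeak_tflops = cpus * clock_ghz * flops_per_cycle / 1000  # 20.24 TFlops
efficiency = 12.25 / rpeak_tflops  # Rmax / Rpeak, roughly 0.61
```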
Some facts
- System X comes in at #7 on the top500.org list
- Each of the 1100 Xserve servers was custom built by
Apple.
- $5.8 million price tag ($5.2 million for the initial
machines, and $600,000 for the Xserve upgrade)
- New (custom built!) Xserve servers are about 15%
faster than the desktop machines; the new System X operates about 20 percent faster, adding almost 2 teraflops
- The extra 5-percent performance boost came from
optimized software
- Typically, System X runs several projects
simultaneously, each tying up 400 to 500 processors, for research into weather and molecular modeling
Power PC G5 Processor – key features
- Based on IBM's PowerPC 970FX series
- 64-bit PowerPC architecture
- Native support for 32-bit applications
- Front-side bus speed up to 1.25 GHz
- Superscalar execution core with 12 functional units
supporting up to 215 in-flight instructions
- Uses a dedicated, optimized 128-bit Velocity Engine for
accelerated SIMD processing
- Can address up to 4 TB of RAM
Specifications
- 90nm Silicon on Insulator (SOI) process with copper
interconnects
- Consumes 42W of power at 1.3V.
- Around 58 million transistors.
- Uses a 2 Level Cache
- Registers:
– 32 64-bit general-purpose registers
– 32 64-bit floating-point registers
– 32 128-bit vector registers
- Eight deep issue queues for each functional unit
- Uses a 16 stage pipeline
Front-side bus
- The bus runs at half the core clock speed, double data
rate (DDR). So for the 2.3 GHz processor, the front-side bus runs at 1.15 GHz DDR
- The bus is composed of two unidirectional channels, each
32 bits wide; the total theoretical peak bandwidth for the 1.15 GHz bus is close to 10 GB/sec. Dual processors mean twice the bandwidth, i.e. around 20 GB/sec
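The "close to 10 GB/sec" figure can be checked from the channel widths and transfer rate quoted in the slides (the exact result is 9.2 GB/sec, which the slides round up):

```python
# Peak front-side bus bandwidth from the slide's figures.
transfers_per_sec = 1.15e9  # 1.15 GHz effective (DDR) transfer rate
channel_width_bytes = 4     # each unidirectional channel is 32 bits wide
channels = 2                # one channel in each direction

per_channel_gbs = transfers_per_sec * channel_width_bytes / 1e9  # 4.6 GB/s
bus_total_gbs = per_channel_gbs * channels  # 9.2 GB/s, "close to 10"
dual_cpu_gbs = bus_total_gbs * 2            # ~18.4 GB/s, "around 20"
```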
Cache
- L1 data cache:
32 KB, write-through, 2-way set-associative
- L1 instruction cache:
64 KB, direct-mapped
- L2 cache:
512 KB, 8-way set-associative
- L1 cache is parity protected
- L2 cache is protected using ECC (error-correcting code)
logic
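The difference matters: a parity bit (as in the L1) can detect a single flipped bit but not repair it, while ECC (as in the L2) can correct it. A minimal even-parity sketch, with illustrative values of my own choosing:

```python
# Even parity: one extra bit per word detects any single-bit flip
# (the L1's scheme); it cannot correct the flip, unlike the L2's ECC.
def parity(word: int) -> int:
    return bin(word).count("1") % 2

stored = (0b1011_0010, parity(0b1011_0010))  # data word + its parity bit
corrupted = stored[0] ^ 0b0000_1000          # flip one bit in the word
assert parity(corrupted) != stored[1]        # parity mismatch: flip detected
```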
Fetch, Decode & Issue
- Eight instructions per cycle are fetched from the 64 KB
instruction cache into an instruction queue
- 9 pipeline stages are devoted to instruction fetch and
decode
- The "decode, crack, and group formation" phase breaks
down instructions into simpler IOPs (internal operations), which resemble RISC instructions
- 5 IOPs are dispatched per clock (4 instructions + 1
branch) in program order to a set of issue queues
- Out-of-order execution logic pulls instructions from
these issue queues to feed the chip's eight functional units
Branch prediction
- On each instruction fetch, the front end's branch unit scans
the eight instructions and picks out up to two branches. Prediction is done using one of two branch prediction schemes.
- 1. Standard BHT scheme – 16K entries, 1-bit branch
predictor.
- 2. Global predictor table scheme – 16K entries. Each entry
has an associated 11-bit vector that records the actual execution path taken by the previous 11 fetch groups, and a 1-bit branch predictor.
- A third 16K-entry selector table keeps track of which of the two
schemes works best for each branch. When each branch is finally evaluated, the processor compares the success of both schemes and records in this selector table which scheme has done the best job so far of predicting the outcome of that particular branch.
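This is a tournament-predictor arrangement. A toy sketch of the idea, with tiny tables and 1-bit entries (the 970FX uses three 16K-entry tables and 11 bits of path history; the table size and indexing here are simplifications of my own):

```python
# Toy tournament branch predictor: a bimodal table, a global-history
# table, and a selector that remembers which scheme predicts better
# for each branch.
SIZE = 16
bimodal = [0] * SIZE       # 1-bit predictor indexed by branch address
global_table = [0] * SIZE  # 1-bit predictor indexed by address ^ history
selector = [0] * SIZE      # <= 0: trust bimodal, > 0: trust global
history = 0                # recent branch outcomes (11 bits on the 970FX)

def predict(pc):
    b = bimodal[pc % SIZE]
    g = global_table[(pc ^ history) % SIZE]
    return g if selector[pc % SIZE] > 0 else b

def update(pc, taken):
    global history
    b_ok = bimodal[pc % SIZE] == taken
    g_ok = global_table[(pc ^ history) % SIZE] == taken
    # Nudge the selector toward whichever scheme was right.
    if g_ok and not b_ok:
        selector[pc % SIZE] = min(selector[pc % SIZE] + 1, 1)
    elif b_ok and not g_ok:
        selector[pc % SIZE] = max(selector[pc % SIZE] - 1, -1)
    bimodal[pc % SIZE] = taken
    global_table[(pc ^ history) % SIZE] = taken
    history = ((history << 1) | taken) & 0x7FF  # keep 11 bits of history

# A branch that is always taken quickly becomes predicted correctly:
for _ in range(4):
    update(0x40, 1)
```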
Integer unit
- 2 integer units attached to 80 GPRs (32 architectural +
48 rename)
- Simple, non-dependent integer IOPs can issue and finish
at a rate of one per cycle. Dependent integer IOPs need 2 cycles
- Condition register logical unit (CRU): dedicated unit for
handling logical operations related to the PowerPC's condition register
Load Store Unit
- Two identical load-store units that execute all of the
LOADs and STOREs
- Dedicated address-generation hardware is part
of the load-store units. Hence address generation
takes place as part of the execution phase of the load-store units' pipeline
Integer Issue Queue
Floating point unit
- Two identical FPUs, each of which can execute the fastest
floating- point instructions in 6 cycles. Single- and double- precision operations take the same amount of time to execute.
- FPUs are fully pipelined for all operations except
floating- point divides.
- 80 total microarchitectural registers, where 32 are
PowerPC architectural registers and the remaining 48 are rename registers.
- The floating-point units can complete both a multiply
operation and an add operation as part of the same
machine instruction (fused multiply-add), thereby accelerating matrix multiplication, vector dot products, and other scientific computations
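A dot product maps directly onto fused multiply-add, which is why these workloads benefit. A plain Python sketch of the access pattern (the real hardware fuses the multiply and add into one instruction with a single rounding):

```python
# Each loop iteration is one a*b + acc step: exactly the shape of a
# fused multiply-add, the operation the G5's FPUs accelerate.
def dot(xs, ys):
    acc = 0.0
    for a, b in zip(xs, ys):
        acc = a * b + acc  # one multiply-add per element pair
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```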
Floating point Issue queue
Vector Unit
- Contains 4 fully pipelined vector processing units:
- 1. Vector Permute Unit (VPU)
- Vector Arithmetic Logic Unit (VALU), comprising:
- 2. Vector Simple Integer Unit (VSIU)
- 3. Vector Complex Integer Unit (VCIU)
- 4. Vector Floating-point Unit (VFPU)
- Up to four vector IOPs per cycle total can be issued to the
two vector issue queues – two IOPs per cycle maximum to the 16-entry VPU queue and two IOPs per cycle maximum to the 20-entry VALU queue
Vector Issue Queue
Conclusion (on the processor; the presentation isn't over!)
- Dual processors provide the high- density power and
scalability required by the research and computational clustering environments of System X.
- The
PowerPC G5 is designed for symmetric multiprocessing.
- Dual independent frontside buses allow each processor
to handle its own tasks at maximum speed with minimal interruption.
- With sophisticated multiprocessing capabilities built in,
Mac OS X and Mac OS X Server dynamically manage multiple processing tasks across the two processors. This allows dual PowerPC G5 systems to accomplish up to twice as much as a single-processor system in the same amount of time, without requiring any special
optimization of the application
A brief intro to Interconnection Networks
- Shared media has disadvantages (collisions)
- Switches allow communication directly from source to
destination, without intermediate nodes to interfere with these signals
- A crossbar switch allows any node to communicate
with any other node in one pass through interconnection
- An Omega interconnection uses less hardware but
contention is more likely. Contention is called blocking
- A fat tree switch has more bandwidth added higher in
the tree to match the requirements of common communication patterns
More...
- A Storage Area Network (SAN) that tries to
optimize based on shorter distances is
InfiniBand
- High-performance clusters such as System
X utilize "fat tree" or Constant Bisectional Bandwidth (CBB) networks to construct large node-count non-blocking switch configurations
- Here, integrated crossbars with a relatively low
number of ports are used to build a non-blocking switch topology supporting a much larger number of endpoints
Crossbar switch (left); CBB network (below) used in System X:
P = 96 ports per switch, 24 Mellanox switches; 96/2 × 24 = 1152 ≈ 1100 nodes
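The port arithmetic in that caption follows from the fat-tree construction: half of each switch's ports face the nodes, half face the rest of the fabric.

```python
# Endpoint count of the System X fabric, from the caption's figures.
ports_per_switch = 96
switches = 24

node_facing_ports = ports_per_switch // 2     # 48 per switch in a fat tree
endpoints = node_facing_ports * switches      # 1152, enough for ~1100 nodes
```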
How does it apply to System X?
Used in the System X
InfiniBand is a switch-based serial I/O interconnect architecture operating at a base speed of 10 Gb/s in each direction per port.
A cluster making use of an InfiniBand system fabric
Note: we were unable to obtain the exact schematic of the System X network
The Mellanox Switch
Apple's new liquid cooling system
- 1. G5 processor at point of
contact to the heatsink.
- 2. G5 processor card from IBM
- 3. Heatsink
- 4. Cooling fluid output from the
radiator to the pump
- 5. Liquid cooling system pump
- 6. Pump power cable
- 7. Cooling fluid radiator input
from the G5 processor
- 8. Radiant grille
- 9. Airflow direction
More on the cooling system...
- 1. Liquid cooling system
pump
- 2. G5 processors
- 3. Radiator output
- 4. Radiator
- 5. Pump power cable
- 6. Radiator input
The cooling system used for SystemX
- Liebert’s XDR system utilizes a
cooling module that is attached to the back door of the computer rack enclosure.
- Fans in the module move room-
temperature air from the front of the enclosure, past the equipment in the rack, then past a cooling coil, and expel it from the back of the unit, chilled to the point where the impact on the room is close to neutral
- The XDR system can be configured
to take care of uneven heat loads within the room.
Software used
- Operating system: Mac OS X
- MVAPICH (pronounced 'em-vah-pich'): a
high-performance implementation of MPI-1
over InfiniBand, based on MPICH1
- Compilers: XL C/C++ Advanced Edition V6.0
for Mac OS X and XL Fortran Advanced Edition for Mac OS X (both made by IBM)
Performance of MVAPICH2 on G5
- Testbed:
– Each node of our testbed has dual 2.0 GHz PowerPC
G5 processors with 512 KB L2 cache.
– Each node also has 512 MB of memory and one
PCI-X 64-bit 133 MHz bus. They are equipped with MT23108 HCAs with PCI-X interfaces
– An InfiniScale MTS2400 switch is used to connect
all the nodes
– Experiments were conducted using the Small Tree
3.2 VAPI driver
– The operating system used was OS X
– GCC compilers were used for all the test programs
The point is: by using InfiniBand and highly
optimized …