SLIDE 1

Disk to Disk Data Transfers at 100Gbps
SuperComputing 2011
Azher Mughal, Caltech (HEP)

CENIC 2012 http://supercomputing.caltech.edu

SLIDE 2

Agenda

  • Motivation behind SC 2011 Demo
  • Collaboration (Caltech, Univ of Victoria, Vendors)
  • Network & Servers Design
  • 100G Network testing
  • Disk to Disk Transfers
  • PCIe Gen3 Server Performance
  • How to design a Fast Data Transfer Kit
  • Questions ?
SLIDE 3

Motivation behind SC 2011 Demo

  • The LHC experiments, with their distributed computing models and world-wide hands-on involvement in LHC physics, have brought a renewed focus on networks, and thus a renewed emphasis on network "capacity" and "reliability".
  • Experiments have seen exponential growth in capacity: 10X growth in usage every 47 months in ESnet over 18 years, and about 6M times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to USLHCNet today).
  • LHC experiments (CMS / ATLAS) are generating massively large data sets which need to be transferred efficiently to end sites anywhere in the world.
  • A sustained ability to use ever-larger continental and transoceanic networks effectively: high throughput transfers.
  • HEP as a driver of R&E and mission-oriented networks: testing the latest innovations in both software and hardware.

Harvey Newman, Caltech

SLIDE 4
  • Fiber distance from the Seattle show floor to the Univ of Victoria: about 217 km
  • Optical switches: Ciena OME-6500 using 100GE OTU4 cards
  • Brocade MLXe-4 routers
    – One 100GE card with LR4
    – One 8+8 port 10GE line card
  • Force10 Z9000 40GE switch
  • Mellanox PCIe Gen3 dual-port 40GE NICs
  • Servers with PCIe Gen3 slots using Intel E5 Sandy Bridge processors

CMS Data Transfer Volume (Oct 2010 – Feb. 2011): 10 PetaBytes transferred over 4 months = 8.0 Gbps average (15 Gbps peak)
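As a quick sanity check on the average rate quoted above (assuming decimal petabytes and roughly 120 days in the four-month window), the arithmetic works out to about 8 Gbps:

```python
# Rough check of the CMS transfer figure: 10 PB moved over ~4 months.
PETABYTE = 1e15                    # bytes (decimal petabyte assumed)
volume_bits = 10 * PETABYTE * 8    # total bits transferred
seconds = 120 * 24 * 3600          # ~4 months

avg_gbps = volume_bits / seconds / 1e9
print(f"Average rate: {avg_gbps:.1f} Gbps")   # ~7.7 Gbps, consistent with the 8.0 Gbps average
```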

SLIDE 5

SuperComputing 2011 Collaborators

Caltech Booth 1223

Courtesy of Ciena

SLIDE 6

SC11: Hardware Facts

Caltech Booth

  • SuperMicro : SandyBridge E5 based Servers
  • Mellanox 40GE ConnectX-3, Active Fiber or Passive Copper Cables.
  • Dell-Force10 Z9000 Switch (all 40GE ports, 54MB shared buffer)
  • Brocade MLXe-4 Router with 100GE LR4 port and 16 x 10GE ports
  • LSI 9265-8i RAID Controllers (with FAST Path), OCZ Vertex 3 SSD Drives

SCinet

  • Ciena OME6500 OTN switch

BC Net

  • Ciena OME6500 OTN switch

Univ of Victoria

  • Brocade MLXe-4 Router with 100GE LR4 port and 16 x 10GE ports
  • Dell R710 servers with 10GE Intel NICs and OCZ Deneva SSD disks
SLIDE 7

SC11 - WAN Design for 100G

SLIDE 8

Caltech Booth - Detailed Network

SLIDE 9

Key Challenges

  • First-hand experience with PCIe Gen3 servers using sample E5 Sandy Bridge processors; not many vendors were available for testing.
  • Will FDT be able to get close to the line rate of the Mellanox ConnectX-3 network cards, 39.6 Gbps (theoretical peak)? (See the sketch after this list.)
  • What about the firmware and drivers for both Mellanox and LSI?
  • LAG between the Brocade 100G router and the Z9000, 10 x 10GE: will it work?
  • End-to-end 100G and 40G testing: any transport issues?
  • What do we know about the BIOS settings for Gen3?
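The 39.6 Gbps theoretical peak quoted above is consistent with the per-frame overhead of 40 Gigabit Ethernet at a 9000-byte MTU. A minimal sketch of that calculation (the exact header sizes, e.g. the TCP timestamps option, are assumptions):

```python
# Estimate achievable TCP goodput on 40GE with 9000-byte (9K) frames.
MTU = 9000                       # bytes of IP payload per frame
ETH_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header + FCS + preamble + inter-frame gap
IP_TCP_HEADERS = 20 + 20 + 12    # IPv4 + TCP + TCP timestamps option (assumed)

wire_bytes = MTU + ETH_OVERHEAD          # bytes occupying the wire per frame
payload_bytes = MTU - IP_TCP_HEADERS     # TCP payload carried per frame

goodput_gbps = 40.0 * payload_bytes / wire_bytes
print(f"Theoretical TCP goodput: {goodput_gbps:.1f} Gbps")   # ~39.6 Gbps
```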
SLIDE 10

Issues we faced

  • SuperMicro servers and Mellanox CX3 drivers were all at the BETA stage.
  • The Mellanox NIC randomly threw interface errors, though no physical errors.
  • The QSFP passive copper cables had issues at full rates: occasional cable errors.
  • The LAG between the Brocade and the Z9000 had hashing issues for 10 x 10GE, so we moved to 12 x 10GE.
  • The LSI drivers are single threaded, utilizing a single core to its maximum.
SLIDE 11

How it looks from inside

[Block diagram: two Sandy Bridge CPUs connected by QPI, each socket with four DDR3 channels and several PCIe Gen3 x8 slots; the Mellanox NIC (MSI interrupts: 2/4/8) on PCIe Gen3, two LSI RAID controllers on PCIe Gen2 x8, and the FDT transfer application driving the LSI RAID controller.]

SLIDE 12

SC11: Software and Tuning

  • RHEL 6.1 / Scientific Linux 6.1 distribution
  • Fast Data Transfer (FDT) utility for moving data among the sites
    – Writing on the RAID-0 (SSD disk pool)
    – /dev/zero → /dev/null memory test
  • Kernel smp affinity:
    – Keep the Mellanox NIC driver queues on the processor cores where the NIC's PCIe lanes are connected
    – Move the LSI driver IRQs to the second processor
  • Use numactl to bind the FDT application to the second processor
  • Change the kernel TCP/IP parameters (see the tuning sketch after this list)
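A minimal sketch of this kind of host tuning, run as root; the IRQ numbers, CPU masks, and buffer sizes below are illustrative assumptions, not the exact SC11 values:

```python
# Sketch of the SC11-style host tuning described above (Linux, run as root).
import pathlib

# 1) Size TCP buffers from the bandwidth-delay product of the path:
#    ~40 Gbps per server over an assumed RTT of ~3 ms (Seattle <-> Victoria).
bdp_bytes = int(40e9 / 8 * 0.003)            # ~15 MB
tcp_mem = f"4096 87380 {2 * bdp_bytes}"      # min / default / max

settings = {
    "net/core/rmem_max": str(2 * bdp_bytes),
    "net/core/wmem_max": str(2 * bdp_bytes),
    "net/ipv4/tcp_rmem": tcp_mem,
    "net/ipv4/tcp_wmem": tcp_mem,
}
for key, value in settings.items():
    pathlib.Path("/proc/sys", key).write_text(value)

# 2) IRQ affinity: NIC queues stay on socket 0 (cores 0-7), LSI RAID IRQs move
#    to socket 1 (cores 8-15). The IRQ numbers are placeholders; read the real
#    ones from /proc/interrupts.
for irq, cpumask in {"70": "00ff", "71": "00ff", "90": "ff00"}.items():
    pathlib.Path(f"/proc/irq/{irq}/smp_affinity").write_text(cpumask)

# 3) FDT itself (a Java application) is then bound to the second NUMA node, e.g.:
#    numactl --cpunodebind=1 --membind=1 java -jar fdt.jar ...
```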
SLIDE 13

Hardware Setting/Tuning

  • SuperMicro motherboards
    – PCIe slots need to be manually set to Gen3; otherwise they default to Gen2 (a verification sketch follows this list)
    – Disable Hyper-Threading
    – Set the PCIe max payload size to the maximum (for Mellanox NICs)
  • Z9000
    – Flow control needs to be turned on for the Mellanox NIC to work properly with the Z9000 switch
    – Single-queue compared to a 4-queue model
  • Mellanox
    – Got the latest firmware, which helped reduce the interface errors the NIC was throwing
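One quick way to confirm that a slot actually negotiated Gen3 (8 GT/s per lane) rather than falling back to Gen2 (5 GT/s) is to read the PCIe link attributes from sysfs. A minimal sketch; the PCI address below is a placeholder:

```python
# Check the negotiated PCIe link speed and width of a NIC or RAID controller.
from pathlib import Path

dev = Path("/sys/bus/pci/devices/0000:05:00.0")   # placeholder PCI address (see lspci)
for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(attr, "=", (dev / attr).read_text().strip())
# A Gen3 x8 slot should report 8 GT/s and a link width of 8.
```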

SLIDE 14

Servers Testing, reaching 100G

[Traffic graph, inbound/outbound in Gbps: sustained 186 Gbps; enough to transfer 100,000 Blu-ray discs per day]
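For scale, the Blu-ray comparison above works out if one assumes roughly 20 GB per disc (an assumption, since disc capacities vary):

```python
# Rough check of the "100,000 Blu-ray per day" comparison at a sustained 186 Gbps.
rate_gbps = 186
bytes_per_day = rate_gbps * 1e9 / 8 * 86400   # ~2e15 bytes (~2 PB) per day
disc_bytes = 20e9                             # assumed ~20 GB per Blu-ray disc
print(f"{bytes_per_day / disc_bytes:,.0f} discs per day")   # ~100,000
```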

SLIDE 15

Disk to Disk Results; Peaks of 60 Gbps

Disk write on 7 SuperMicro and Dell servers with a mix of 40GE and 10GE servers.

[Throughput graph with a 60 Gbps peak annotation]

SLIDE 16

Total Transfers

Total Traffic among all the Links during the Show: 4.657 PetaBytes

SLIDE 17

Single Server Gen3 performance: 36.8 Gbps (TCP) during SC11

Post SC11: reaching 37.5 Gbps inbound, with peaks of 38 Gbps

[Throughput graphs annotated at 37 Gbps (SC11) and at 37.5 Gbps with a 38 Gbps spike, comparing CUBIC and HTCP congestion control]
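The CUBIC/HTCP comparison above refers to Linux's pluggable TCP congestion control. A minimal sketch of switching between the two for such tests (assuming the htcp kernel module is available; run as root):

```python
# Switch the TCP congestion control algorithm used for new connections.
from pathlib import Path

cc = Path("/proc/sys/net/ipv4/tcp_congestion_control")
avail = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")
print("available:", avail.read_text().strip())
cc.write_text("htcp")     # or "cubic"; requires the corresponding module to be loaded
print("now using:", cc.read_text().strip())
```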

SLIDE 18

40GE Server Design Kit

  • SandyBridge E5 based servers (SuperMicro X9DRi-F or Dell R720), Intel E5-2670 with C1 or C2 stepping
  • Mellanox ConnectX-3 PCIe Gen3 NIC
  • Dell-Force10 Z9000 40GE switch
  • Mellanox QSFP active fiber cables
  • LSI 9265-8i, 8-port SATA 6G RAID controller
  • OCZ Vertex 3 SSDs, 6 Gb/s

Server cost = ~$10k
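As a rough sizing note for the SSD pool in such a kit, the per-drive write rate below is an assumption (real Vertex 3 numbers vary with data compressibility and wear); it only illustrates why roughly a dozen SSDs spread over the RAID controllers are needed to sink a full 40GE stream:

```python
# Rough sizing of the RAID-0 SSD pool needed to keep up with one 40GE NIC.
nic_gbps = 39.6                 # achievable TCP goodput on the 40GE NIC
per_ssd_write_mb_s = 450        # assumed sustained sequential write per SSD

needed_bytes_per_s = nic_gbps * 1e9 / 8
drives = needed_bytes_per_s / (per_ssd_write_mb_s * 1e6)
print(f"~{drives:.0f} SSDs needed to sink {nic_gbps} Gbps")   # ~11 drives
```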

SLIDE 19

Future Directions

  • Finding bottlenecks in the LSI RAID card driver; a new driver supporting MSI vectors (many configurable queues) is available
  • A better refined approach to distributing the application and drivers among the cores
  • Optimizing the Linux kernel, timers, and other unknowns
  • New driver available from Mellanox, 1.5.7.2
  • We are working with the Mellanox team to find why performance is not reaching close to the possible 39.6 Gbps Ethernet rate using 9K packets
  • Ways to lower CPU utilization …
  • Understand/overcome SSD wear-out problems over time
  • Yet to run tests with E5-2670 C2 stepping chips, which arrived a week ago
SLIDE 20

Summary

  • The 100Gbps network technology has shown the potential to transfer petascale physics datasets around the world in a matter of hours.
  • A couple of highly tuned servers can reach the 100GE line rate, effectively utilizing PCIe Gen3 technology.
  • Individual server tests using E5 processors and PCIe Gen3 based network cards have shown stable performance reaching close to 37.5 Gbps.
  • The Fast Data Transfer (FDT) application achieved an aggregate disk write of 60 Gbps.
  • The MonALISA intelligent monitoring software effectively recorded and displayed the traffic on the 40/100G and the other 10GE links.

SLIDE 21

Questions ?