Huge Data Transfer Experimentation over Lightpaths
Corrie Kost, Steve McDonald
TRIUMF
Wade Hong
Carleton University
Motivation
- LHC expected to come on line in 2007
- data rates expected to exceed a petabyte a year
- large Canadian HEP involvement in the ATLAS experiment
- UBC, SFU, and UVic
- Vancouver
- federal budget
- ability to provide dedicated point-to-point bandwidth over lightpaths under user control
- to establish an end-to-end lightpath from Canada to CERN
- to the wide area
- protocol
- limited by the optical components
- Ethernet LAN
- business opportunities for carriers
- a complement to the routed networks, not a replacement
- hardware
- between TRIUMF and CERN for iGrid 2002
- OC-48
- BlackDiamond 6808 with 10GbE LRi blades
- CERN using bbftp and tsunami
traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
(Tsunami)
- IGT to continue with experimentation
- UofA for real-time remote farms, and Carleton U for transferring 700 GB of ATLAS FCAL test beam data
- Force10 Networks E600 switches, Ixia network testers, servers from Intel and CERN OpenLab
- trans-Atlantic lightpath between Carleton U and CERN
Network diagram: Carleton U (Ottawa) to CERN (Geneva) via Toronto, Chicago, and Amsterdam; Cisco ONS 15454s, Force10 E600s, Ixia 400T testers, HP Itanium-2, Intel Itanium-2, and Intel Xeon servers; 10GE WAN PHY and LAN PHY over OC-192c.
10 GbE WAN PHY over an OC-192 circuit using lightpaths provided by SURFnet and CA*net 4:
- 9.24 Gbps using traffic generators
- 6 Gbps using UDP on PCs
- 5.65 Gbps using TCP on PCs
Plots: single-stream UDP throughput and single-stream TCP throughput.
- Data rates were limited by the PCs, even for memory-to-memory tests
- UDP uses fewer resources than TCP on high bandwidth-delay product networks
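A sketch of how single-stream memory-to-memory tests of this kind can be run with iperf; the host name, window size, target rate, and durations are illustrative assumptions, not the actual test parameters:
# Hypothetical single-stream memory-to-memory tests; host, window, rate, and
# durations are illustrative assumptions.
iperf -s -w 8M                                   # TCP server on the receiver
iperf -c receiver.example.org -w 8M -t 60        # single TCP stream for 60 s
iperf -s -u                                      # UDP server (instead of the TCP one)
iperf -c receiver.example.org -u -b 6000M -t 60  # single UDP stream at ~6 Gbps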
In 2004, looked at establishing a 10 GbE lightpath from TRIUMF
Foundry NetIron 40Gs, Foundry NetIron 1500, servers from Sun Microsystems, and custom-built disk servers from Ciara Technologies.
OME 6500 in Vancouver
between TRIUMF and Carleton U over a 10 GbE lightpath
Linux-based disk servers
processors
a lightpath to CERN and lightpaths to Canadian Tier 2 ATLAS sites from TRIUMF
facilitate
Test setup diagram: Storm 1, Storm 2, and Sun 1 servers connected through a Foundry NI1500 and MRV FD.
(software RAID0 of 8 SATA disks on each of a pair of hardware RAID0 RocketRaid 1820A controllers on Storm2)
(from Storm2 to Storm1, with software RAID0 of 4 disks on each of three 3ware 9500S-4 controllers in RAID0)
Streams averaged 18, 24, 27, and 29 MB/s disk-to-disk (only 1 disk at CERN; max write speed 48 MB/s).
Problems were encountered and resolved; details available on request. "Don't do test flights too close to the ground":
echo 100000 > /proc/sys/vm/min_free_kbytes
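A minimal sketch of related kernel tuning, assuming a 2.6 kernel host; the TCP buffer values are illustrative for a ~100 ms, multi-gigabit path and are not the exact settings used on these servers:
# Hypothetical runtime tuning; buffer sizes are illustrative assumptions.
sysctl -w vm.min_free_kbytes=100000             # same effect as the echo above
sysctl -w net.core.rmem_max=16777216            # 16 MB max socket receive buffer
sysctl -w net.core.wmem_max=16777216            # 16 MB max socket send buffer
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"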
(Seagate ST3300831AS)
Note: 64 bit * 133 MHz = ~8.5 Gb/s (theoretical PCI-X bus bandwidth)
TYAN K8S S2882 vs. SunFire V40z:
- dd /dev/zero > /dev/null: 60 GB/sec vs. 32 GB/sec
- CPU: dual 2.5 GHz Opterons vs. quad 2.5 GHz Opterons
- PCI-X (64 bit): 2@133 MHz (100 for two) and 2@100 MHz (66 for two) vs. 4@133 MHz full length, 1@133 MHz full length, 1@100 MHz half length, 1@66 MHz half length
- Memory: 4 GB vs. 8 GB
- Disks: 16 x 300 GB SATA plus 2 x 73 GB 10K SCSI 320 vs. 3 x 147 GB 10K SCSI 320
- I/O: see slide "Optimal I/O Results" vs. 3 x 147 GB as RAID0/JBOD, 160 to 123 MB/s write, 176 to 130 MB/s read
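The dd entry in the table is a /dev/zero to /dev/null copy; a minimal sketch of how such a test might be invoked (block size and count are arbitrary choices, not the parameters actually used):
# Hypothetical memory/null-device copy test; bs and count are arbitrary.
dd if=/dev/zero of=/dev/null bs=1M count=100000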
Tuned the PCI registers of the Intel 10 GbE NIC (vendor 8086, device 1048) to increase the transmit burst length on the bus.
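One widely circulated way to raise the PCI-X maximum memory read byte count on these Intel cards is via setpci; the register offset and value below follow the commonly published recipe and are assumptions here, not necessarily the exact tuning used in these tests:
# Hypothetical sketch: raise MMRBC to 4 KB on all Intel 82597EX 10 GbE NICs
# (vendor 8086, device 1048); offset and value are the commonly published
# recipe and may not match the exact setting used here.
setpci -d 8086:1048 e6.b=2e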
Controllers: 3ware 9500S-4, 3ware 9500S-8, Areca 1160, HighPoint RocketRaid 1820A, SuperMicro DAC-SATA-MV8
PROS / CONS:
- PRO: internal & external web access. CON: flaky; external hangs require a reboot, internal requires starting a new port.
- PRO: many options: display disk temps, SATA300 + NCQ, email alerts. CON: trial and error to use them, since there are few examples in the documentation.
- PRO: supports filesystems >2 TB, 16 disks, 64 bit/133 MHz (24-disk / PCI-Express x8 versions available). CON: JBOD performance mostly equals a single disk.
- PRO: 15-disk RAID5 W/R 301/390 MB/s; 15-disk RAID6 W/R 237/328 MB/s; two RAID0s (7 & 8 disks) W/R 361/405 MB/s. CON: RAID0 of 12 disks W/R 349/306 MB/s.
- PRO: RAID6 is very robust; background rebuilds have low impact on I/O performance. CON: background rebuilds are 50-100 slower than fast builds (at 20% priority).
Extensive tests were done by tweakers.net on the Areca and 8 others: www.tweakers.net/benchdb/search/product/104629, www.tweakers.net/reviews/557
Proceeded to RAID6.
Controller | # of Disks | Config | Slot/Freq | Result
RocketRaid 1820A | 8 | /dev/sda RAID5 | 2/133 | 248 MB/s write, 341 MB/s read
RocketRaid 1820A | 8 (2nd must be installed) | md0 of RAID0 | 4/100 | 364 MB/s write, 330 MB/s read
Two 1820A RocketRaids | 8/8 | md0 of 2 RAID5s | 1/133, 4/100 | 254 MB/s write, 620 MB/s read
Two 1820A RocketRaids | 7/8 | md0 of 2 RAID5s | 2/133, 4/100 | 414 MB/s write, 540 MB/s read
MV8 + MV8 (JBOD) | 8 + 7 | md0 | 4/100, 2/133 | Oops (>4TB limit?)
MV8 (JBOD) | 8 | md0 | 4/100 | 410 MB/s write, 436 MB/s read
MV8 + MV8 (JBOD) | 8 + 6 (7 bad) | md0, md1 | 4/100 Bridge A, 3/100 Bridge A | 2 streams, 500 MB/s read
MV8 + MV8 (JBOD) | 8 + 7 | md0, md1 | 4/100 Bridge A, 2/133 Bridge B | 2 streams, 750 MB/s read
TYAN S2882 (JBOD) | 4 | md0 | on-board SATA | 60 MB/s write, 90 MB/s read
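A minimal sketch of how an md0 stripe over two controller-exported arrays like those in the table might be assembled; the device names, chunk size, and filesystem are assumptions, not the exact configuration tested:
# Hypothetical: stripe (RAID0) md0 across two arrays exported as /dev/sda and
# /dev/sdb; device names, chunk size, and filesystem are illustrative only.
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 /dev/sda /dev/sdb
mkfs.xfs /dev/md0     # a filesystem that handles >2 TB volumes
mount /dev/md0 /data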
Plot: MV controllers, read speed from 8 and 6 disks with PCI-X set to 100 MHz (aggregate ~500 MB/sec); KBytes/sec vs. time (sec); series: 8 disks @ 100 MHz, 6 disks @ 100 MHz, 8 disks @ 133 MHz, 6 disks @ 133 MHz.
Plots: read speed vs. time (sec) for md0 of two 8-disk RAID5s on RR 1820As, and for /dev/sda (e.g. Areca 1160 15-disk RAID5: 190 to 323 MB/s).
Bi-stable state for reads. A useful tool to display which disk may be slowing I/O is iostat -x 1:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
hda 0.00 0.00 1.00 0.00 8.00 0.00 4.00 0.00 8.00 0.01 9.00 9.00 0.90
md0 0.00 0.00 1920.00 0.00 491520.00 0.00 245760.00 0.00 256.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 239.00 0.00 61440.00 0.00 30720.00 0.00 257.07 10.88 45.31 4.19 100.10  BAD
sdb 0.00 0.00 238.00 0.00 61440.00 0.00 30720.00 0.00 258.15 2.80 11.76 2.46 58.50
sdc 0.00 0.00 240.00 0.00 61440.00 0.00 30720.00 0.00 256.00 2.85 11.91 2.40 57.70
sdd 0.00 0.00 240.00 0.00 61440.00 0.00 30720.00 0.00 256.00 3.01 12.61 2.58 61.80
sde 0.00 0.00 237.00 0.00 61440.00 0.00 30720.00 0.00 259.24 2.94 12.39 2.57 61.00
sdf 0.00 0.00 236.00 0.00 61440.00 0.00 30720.00 0.00 260.34 2.96 12.47 2.61 61.60
sdg 0.00 0.00 239.00 0.00 61440.00 0.00 30720.00 0.00 257.07 3.04 12.77 2.51 60.00
sdh 0.00 0.00 235.00 0.00 61440.00 0.00 30720.00 0.00 261.45 3.02 12.72 2.49 58.60
When working properly, the output looks like this:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
hda 0.00 1.00 1.00 37.00 8.00 304.00 4.00 152.00 8.21 0.09 2.37 0.21 0.80
md0 0.00 0.00 3520.00 0.00 901120.00 0.00 450560.00 0.00 256.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 434.00 0.00 112640.00 0.00 56320.00 0.00 259.54 8.57 19.52 2.30 100.00
sdb 0.00 0.00 446.00 1.00 112640.00 0.00 56320.00 0.00 251.99 8.07 20.50 2.20 98.30
sdc 0.00 0.00 440.00 0.00 112640.00 0.00 56320.00 0.00 256.00 6.11 13.89 2.25 98.80
sdd 0.00 0.00 440.00 0.00 112640.00 0.00 56320.00 0.00 256.00 4.63 10.52 2.18 96.10
sde 0.00 0.00 439.00 0.00 112640.00 0.00 56320.00 0.00 256.58 4.64 10.54 2.18 95.70
sdf 0.00 0.00 441.00 0.00 112640.00 0.00 56320.00 0.00 255.42 6.26 14.22 2.25 99.20
sdg 0.00 0.00 437.00 0.00 112640.00 0.00 56320.00 0.00 257.76 4.89 11.11 2.19 95.80
sdh 0.00 0.00 439.00 0.00 112640.00 0.00 56320.00 0.00 256.58 5.21 11.84 2.19 96.10
Solution? Replace the 'slow' disk with a normal one.
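Besides iostat, reading each member disk directly is a quick way to confirm which one is slow; a minimal sketch, assuming the members appear as /dev/sda through /dev/sdh (run as root):
# Hypothetical per-disk raw read check; the device list and read size are
# illustrative assumptions.
for d in /dev/sd[a-h]; do
  echo "$d"
  dd if="$d" of=/dev/null bs=1M count=2048 2>&1 | grep copied
done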
Shows the drop in read speed depending on the location of the file. Reads are significantly faster on the outer part of the software RAID0 (JBOD) set.
established since April 18th 2005
ATLAS Service Challenge
lightpath to CERN by Jan/Feb 2006
ATLAS Tier1 Service Challenge 3 (primary contact: Reda Tafirout, tafirout@triumf.ca)
- 3 Ciara servers:
  - Intel SE7520BD2 (dual GigE, PCI-X, etc.)
  - dual 3 GHz Nocona EM64T (1 MB cache / 800 MHz FSB)
  - 2 GB RAM
  - 1 system disk: 80 GB IDE (laptop)
  - 8 x 250 GB SATA150 (Seagate Barracuda, NCQ, 8 MB)
  - 3ware 9500S-8MI, RAID5
  - InfiniBand connections
- 1 Evetek server (management node):
  - dual Opteron 246, 2.0 GHz (800 MHz FSB)
  - 2 GB RAM
  - 1 system disk: WD 80 GB SATA
  - 2 x 250 GB WD SATA on a 3ware 9500S-LP (4 channels)
  - Adaptec Ultra160 SCSI 29160-LP
- Tape system: 2 x IBM 4560SLX SDLT libraries
The tape libraries have a fibre channel interface card. All systems are running FC3 x86_64 with a 2.6 kernel, and dCache for disk management (with GridFTP + SRM access doors).
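With GridFTP doors in front of dCache, transfers are typically driven by a client such as globus-url-copy; the door host name, pnfs path, stream count, and TCP buffer size below are placeholders only:
# Hypothetical GridFTP transfer into a dCache door; endpoint, path,
# parallel-stream count (-p), and TCP buffer size (-tcp-bs) are placeholders.
globus-url-copy -p 5 -tcp-bs 8388608 \
  file:///data/atlas/testfile \
  gsiftp://dcache-door.example.org/pnfs/example.org/data/atlas/testfile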
Servers:
- 5 x dual Opteron 250 (2.4 GHz), 2 GB memory, 16 x 300 GB SATA drives
- SunFire V40z: quad Opteron 850 (2.4 GHz), 8 GB memory, 3 x 146 GB SCSI
Network cards:
- Intel PRO/10GbE-LR
- S2io/Neterion Xframe
RAID and SATA controllers:
- 3ware 9500, 8-port
- RocketRaid 1820A, 8-port
- SuperMicro MV8, 8-port
- Areca 1160, 16-port
Network:
- MRV CWDM
- Foundry NI1500 & NI40G
- 10G-ER 1550 nm LAN PHY
- 10G-LR 1310 nm LAN/WAN PHY
- CA*net 4
- OME 6500
- 1 GbE disk-to-disk transfers between TRIUMF and Carleton (Ottawa) over the 10G circuit: 115 MB/s sustained for ~5 days, equivalent to ~46 TB
- iperf between TRIUMF & Ottawa, memory-to-memory, for 1 week: 3.74 Gbps average (460 MB/s), 350 TB transferred (errors ignored)
Disk-to-memory, back-to-back (short distance), 24 hrs, single TCP stream: average of 2.4 Gbps (300 MB/s)
(max disk read 361 MB/s, 16 disks RAID5)
Disk-to-disk, back-to-back (short distance), 76 TB in ~4 days, bbftp with 5 TCP streams: average of 1.8 Gbps (220 MB/s)
(max disk write 303 MB/s, 15-disk Areca RAID5; max disk read 361 MB/s, 16 disks as 2x8 RR1820A RAID5)
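A sketch of what a multi-stream bbftp transfer like the one above might look like; the remote host, user name, and file paths are placeholders, not the actual endpoints:
# Hypothetical bbftp invocation using 5 parallel TCP streams; host, user,
# and file paths are placeholders.
bbftp -p 5 -u atlas -e 'put /data/atlas/bigfile /data/atlas/bigfile' remotehost.example.org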
Bottleneck analysis table (Disk Read / Memory / Network / CPU): disk reads of 320-500 MB/s, memory at 360-575 MB/s, network at 320-500 MB/s (parallel streams), with CPU utilization of 60-100%.
Bottleneck: buffering? What are the solutions? Zero-copy.
- vendor space has matured
- permanent: under consideration
- how protocols behave over long-haul networks
- V40z with Solaris 10 (native iSCSI stack)
- 10 GbE compatible NICs