SLIDE 1

Huge Data Transfer Experimentation over Lightpaths

Corrie Kost, Steve McDonald

TRIUMF

Wade Hong

Carleton University

SLIDE 2

Motivation

  • LHC expected to come on line in 2007
  • data rates expected to exceed a petabyte a year
  • large Canadian HEP community involved in the ATLAS experiment
  • establishment of a Canadian Tier 1 at TRIUMF
  • replicate all/part of the experimental data
  • need to be able to transfer “huge data” to our Tier 1

SLIDE 3

TRIUMF

  • Tri-University Meson Facility
  • Canada’s Laboratory for Particle and Nuclear Physics
  • operated as a joint venture by UofA, UBC, Carleton U, SFU, and UVic
  • located on the UBC campus in Vancouver
  • five-year funding from 2005 - 2009 announced in the federal budget
  • planned as the Canadian ATLAS Tier 1

SLIDE 4

TRIUMF

SLIDE 5

Lightpaths

  • a significant design principle of CA*net 4 is the ability to provide dedicated point-to-point bandwidth over lightpaths under user control
  • a similar philosophy at SURFnet provides the ability to establish an end-to-end lightpath from Canada to CERN
  • optical bypass isolates “huge data transfers” from other users of the R&E networks
  • lightpaths permit the extension of Ethernet LANs to the wide area

SLIDE 6

Ethernet: local to global

  • the de facto LAN technology
  • original Ethernet
    • shared media, half duplex, distance limited by the protocol
  • modern Ethernet
    • point to point, full duplex, switched, distance limited by the optical components
  • cost effective

SLIDE 7

Why native Ethernet Long Haul?

  • more than 90% of Internet traffic originates from an Ethernet LAN
  • data traffic on the LAN increases due to new applications
  • Ethernet services with incremental bandwidth offer new business opportunities for carriers
  • why not native Ethernet?
    • scalability, reliability, service guarantees
    • all of the above are research areas
  • native Ethernet long-haul connections can be used today as a complement to the routed networks, not a replacement

SLIDE 8

Experimentation

  • experimenting with 10 GbE hardware for the past 3 years
  • engaged 10 GbE NIC and network vendors
  • mostly interested in disk-to-disk transfers with commodity hardware
  • tweaking performance of Linux-based disk servers
  • engaged hardware vendors to help build systems
  • testing data transfers over dedicated lightpaths
  • engineering solutions for the e2e lightpath last mile, especially for 10 GbE

SLIDE 9

2002 Activities

  • established the first end-to-end trans-Atlantic lightpath between TRIUMF and CERN for iGrid 2002
  • bonded dual GbEs transported across a 2.5 Gbps OC-48
  • initial experimentation with 10 GbE
    • alpha Intel 10 GbE LR NICs, Extreme BlackDiamond 6808 with 10 GbE LRi blades
  • transferred ATLAS DC data from TRIUMF to CERN using bbftp and tsunami

SLIDE 10

Live continent to continent

  • e2e lightpath up and running Sept 20 20:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms

SLIDE 11

iGrid 2002 Topology

SLIDE 12

Exceeding a Gbps (Tsunami)

SLIDE 13

2003 Activities

  • CANARIE-funded directed research project, CA*net 4 IGT, to continue with experimentation
  • Canadian HEP community and CERN
  • GbE lightpath experimentation between CERN and UofA for real-time remote farms
  • data transfers over a GbE lightpath between CERN and Carleton U for transferring 700 GB of ATLAS FCAL test beam data
    • took 6.5 hrs versus 67 days

SLIDE 14

Current IGT Topology

SLIDE 15

2003 Activities

  • re-establishment of 10 GbE experiments
    • newer Intel 10 GbE NICs and Force10 Networks E600 switches, IXIA network testers, servers from Intel and CERN OpenLab
  • established the first native 10 GbE end-to-end trans-Atlantic lightpath between Carleton U and CERN
  • demonstrated at ITU Telecom World 2003

SLIDE 16

Demo during ITU Telecom World 2003

[Diagram: Cisco ONS 15454 nodes, Force10 E600 switches, HP Itanium-2 servers, Ixia 400T traffic generators, and Intel Itanium-2/Xeon servers linked by 10GE WAN PHY, 10GE LAN PHY, and OC-192c circuits across Ottawa, Toronto, Chicago, Amsterdam, and Geneva]

  • 10 GbE WAN PHY over an OC-192 circuit using lightpaths provided by SURFnet and CA*net 4
  • 9.24 Gbps using traffic generators
  • 6 Gbps using UDP on PCs
  • 5.65 Gbps using TCP on PCs

SLIDE 17

Results on the transatlantic 10GbE

[Plots: single-stream UDP throughput and single-stream TCP throughput]

  • data rates limited by the PC, even for memory-to-memory tests
  • UDP uses fewer resources than TCP on high bandwidth-delay product networks

SLIDE 18

2004-2005 Activities

  • with the arrival of the third CA*net 4 lambda in the summer of 2004, looked at establishing a 10 GbE lightpath from TRIUMF
  • Neterion (S2io) Xframe 10 GbE NICs, Foundry NetIron 40Gs, Foundry NetIron 1500, servers from Sun Microsystems, and custom-built disk servers from Ciara Technologies
  • distance problem between TRIUMF and the CA*net 4 OME 6500 in Vancouver
    • XENPAK 10 GbE WAN PHY at 1310 nm

SLIDE 19

2004-2005 Activities

  • testing data transfers between TRIUMF and CERN, and between TRIUMF and Carleton U, over a 10 GbE lightpath
  • experimenting with robust data transfers
  • attempting to maximize disk I/O performance from Linux-based disk servers
    • experimenting with disk controllers and processors
  • ATLAS Service Challenges in 2005

SLIDE 20

2004-2005 Activities

  • exploring a more permanent 10 GbE lightpath to CERN and lightpaths from TRIUMF to Canadian Tier 2 ATLAS sites
  • CANARIE playing a lead role in helping to facilitate this
  • still need to solve some last-mile lightpath issues

SLIDE 21

Experimental Setup at TRIUMF

[Diagram: experimental setup at TRIUMF – the Storm 1 and Storm 2 disk servers, Sun 1, a Foundry NetIron 1500 (NI1500), and MRV FD equipment]

SLIDE 22

Xeon-based Servers

  • Dual 3.2 GHz Xeons
  • 4 GB memory
  • 4 x 3Ware 9500S-4LP (& -8) controllers
  • 16 x 120 GB SATA150 drives
  • 40 GB Hitachi 14R9200 drives
  • Intel 10 GbE PXLA8590LR NIC

SLIDE 23

Some Xeon Server I/O Results

  • read a pair of 80 GB (xfs) files for 67 hours – 120 TB – average 524 MB/s
    (software RAID0 of 8 SATA disks on each of a pair of hardware-RAID0 RocketRAID 1820A controllers on Storm2)
  • 10 GbE S2io NICs – back-to-back for 17 hrs – 10 TB – average 180 MB/s
    (from Storm2 to Storm1 with software RAID0 of 4 disks on each of 3 3Ware 9500S-4 controllers in RAID0)
  • 10 GbE lightpath: Storm2 to an Itanium machine at CERN – 10, 15, 20, 25 bbftp streams averaged 18, 24, 27, 29 MB/s disk-to-disk
    (only 1 disk at CERN – max write speed 48 MB/s)
  • continued Storm1 to Storm2 testing – many sustainability problems encountered and resolved; details available on request
    • don’t do test flights too close to the ground: echo 100000 > /proc/sys/vm/min_free_kbytes
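A minimal sketch of how such sustained read rates can be measured from the command line; /dev/md0 and the file paths are placeholders, not the actual Storm2 device names:

    #!/bin/bash
    # Sketch: sequential read tests against a software RAID (md) set.
    # Device and file names are examples only; adjust to the local setup.

    # raw-device read of ~100 GB in 1 MB blocks; dd reports the average rate
    dd if=/dev/md0 of=/dev/null bs=1M count=100000

    # read two large files in parallel, as in the 67-hour paired-file test above
    dd if=/xfs/data/file1 of=/dev/null bs=1M &
    dd if=/xfs/data/file2 of=/dev/null bs=1M &
    wait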

SLIDE 24

Opteron-based Servers

  • Dual 2.4 GHz Opterons
  • 4 GB memory
  • 1 x WD800JB 80 GB HD
  • 16 x 300 GB SATA HD (Seagate ST3300831AS)
  • 4 x 4-port InfiniBand-SATA multilane connections
  • 2 x RocketRAID 1820A
  • 10 GbE NIC
  • 2 x PCI-X at 133 MHz (*)
  • 2 x PCI-X at 100 MHz (*)

(*) Note: 64 bit x 133 MHz = 8.4 Gb/s

SLIDE 25

Multilane InfiniBand SATA

SLIDE 26

Server Specifications

TYAN K8S S2882:
  • dd /dev/zero > /dev/null: 60 GB/s
  • CPU: dual 2.5 GHz Opterons
  • PCI-X (64 bit): 2 @ 133 MHz (100 for two), 2 @ 100 MHz (66 for two)
  • memory: 4 GB
  • disks: 16 x 300 GB SATA
  • I/O: see the slide “Some ‘optimal’ I/O results”

SunFire V40z:
  • dd /dev/zero > /dev/null: 32 GB/s
  • CPU: quad 2.5 GHz Opterons
  • PCI-X (64 bit): 4 @ 133 MHz full length, 1 @ 133 MHz full length, 1 @ 100 MHz half length, 1 @ 66 MHz half length
  • memory: 8 GB
  • disks: 2 x 73 GB 10K SCSI 320, 3 x 147 GB 10K SCSI 320
  • I/O: 3 x 147 GB as RAID0 JBOD – 160 to 123 MB/s write, 176 to 130 MB/s read

SLIDE 27

The Parameters

  • 5 types of controllers
  • number of controllers to use (1 to 4)
  • number of disks/controller (1 to 16)
  • RAID0, RAID5, RAID6, JBOD
  • dual or quad Opteron systems
  • 4-6 possible PCI-X slots (1 reserved for 10 GigE)
  • Linux kernels (2.6.9, 2.6.10, 2.6.11)
  • many tuning parameters (in addition to WAN tuning), e.g. (collected in the sketch below)
    • blockdev --setra 8192 /dev/md0
    • chunk-size in mdadm (1024)
    • /sbin/setpci -d 8086:1048 e6.b=2e
      (modifies the MMRBC field in PCI-X configuration space for vendor 8086 and device 1048 to increase the transmit burst length on the bus)
    • echo 100000 > /proc/sys/vm/min_free_kbytes
    • ifconfig eth3 txqueuelen 100000
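A sketch collecting the tuning commands above into one place; the device names (/dev/md0, /dev/sda, /dev/sdb, eth3) and the 8086:1048 vendor:device ID are the examples from the slides and would need to match the local hardware:

    #!/bin/bash
    # Sketch: apply the disk, PCI-X, and network tuning listed above.
    # All device names and IDs are examples, not universal values.

    # software RAID0 (md) with a 1024 KB chunk over two hardware RAID volumes
    mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices=2 /dev/sda /dev/sdb

    # large read-ahead (in 512-byte sectors) on the md device
    blockdev --setra 8192 /dev/md0

    # raise the PCI-X Maximum Memory Read Byte Count (MMRBC) for the
    # Intel 10 GbE NIC (vendor 8086, device 1048) to lengthen bus bursts
    /sbin/setpci -d 8086:1048 e6.b=2e

    # keep a larger pool of free pages so the VM does not stall long transfers
    echo 100000 > /proc/sys/vm/min_free_kbytes

    # deepen the transmit queue on the 10 GbE interface
    ifconfig eth3 txqueuelen 100000
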
SLIDE 28

The SATA Controllers

  • 3Ware 9500S-4
  • 3Ware 9500S-8
  • Areca 1160
  • HighPoint RocketRAID 1820A
  • SuperMicro DAC-SATA-MV8

SLIDE 29

Areca 1160 Details

PROS:
  • internal & external web access
  • many options: display disk temps, SATA300 + NCQ, email alerts
  • supports filesystems >2 TB, 16 disks, 64 bit / 133 MHz (24-disk / PCI-Express x8 version available)
  • RAID6 very robust; background rebuilds have low impact on I/O performance

CONS:
  • flaky – external access hangs require a reboot, internal requires starting a new port
  • trial and error to use the options, since there are few examples in the documents
  • JBOD performance mostly equal to a single disk
  • background rebuilds 50-100x slower than fast builds (at 20% priority)

Throughput (write/read):
  • 15-disk RAID5: 301/390 MB/s
  • 15-disk RAID6: 237/328 MB/s
  • 2 x RAID0 (7 & 8 disks): 361/405 MB/s
  • RAID0 of 12 disks: 349/306 MB/s

Extensive tests of the Areca and 8 other controllers were done by tweakers.net: www.tweakers.net/benchdb/search/product/104629 and www.tweakers.net/reviews/557

SLIDE 30

Why do we need Raid 6?

  • our experience is that 1 out of 30 disks fails every 6 months
  • RAID5 rebuild under full operation of 15 x 300 GB disks takes ~100 hrs
  • probability that a second disk fails during the rebuild ~1% (see the estimate below)
  • Areca 1160 tests of 15 x 300 GB disks (1 broken):
    • RAID5 or 6 fast build in ~100 minutes
    • RAID5 or 6 background build – up to 100 hrs for a busy system
  • acid test:
    • RAID6 – removed a disk while very busy – degraded to RAID5
    • rebuild takes 100 hrs
    • removed a second disk – now critical – but after the RAID5 was rebuilt it proceeded to RAID6

RAID5 with ~4 TB of disk is too risky; the marginal cost of RAID6 is minimal
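A rough check of the ~1% figure, taking the failure rate above at face value and assuming independent failures: 1 failure per 30 disks per 6 months (about 4,380 hours) gives a per-disk rate of 1 / (30 x 4,380), roughly 7.6e-6 per hour; over a ~100-hour rebuild the 14 surviving disks then see about 14 x 100 x 7.6e-6, i.e. ~1.1%, chance of a second failure, consistent with the estimate above.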

SLIDE 31

Some “optimal” I/O results

Controller | # of disks | Config | Slot/Freq | Result

  • RocketRAID 1820A | 8 | RAID5 (/dev/sda) | 2/133 | 248 MB/s write, 341 MB/s read
  • RocketRAID 1820A | 8 (2nd must be installed) | md0 of RAID0 | 4/100 | 364 MB/s write, 330 MB/s read
  • Two RocketRAID 1820As | 8/8 | md0 of 2 RAID5s | 1/133, 4/100 | 254 MB/s write, 620 MB/s read
  • Two RocketRAID 1820As | 7/8 | md0 of 2 RAID5s | 2/133, 4/100 | 414 MB/s write, 540 MB/s read
  • MV8 (JBOD) + MV8 (JBOD) | 8 + 7 | md0 | 4/100, 2/133 | oops (>4 TB limit?)
  • MV8 (JBOD) | 8 | md0 | 4/100 | 410 MB/s write, 436 MB/s read
  • MV8 (JBOD) + MV8 (JBOD) | 8 + 6 (7 bad) | md0 + md1 | 4/100 bridge A, 3/100 bridge A | 2 streams, 500 MB/s read
  • MV8 (JBOD) + MV8 (JBOD) | 8 + 7 | md0 + md1 | 4/100 bridge A, 2/133 bridge B | 2 streams, 750 MB/s read
  • TYAN S2882 (JBOD) | 4 | md0 | on-board SATA | 60 MB/s write, 90 MB/s read

SLIDE 32

Some details of I/O

MV8 controllers: read speed from 8 and 6 disks with PCI-X set to 100 MHz (aggregate ~500 MB/s) or 133 MHz (aggregate ~650 MB/s)

[Plot: read throughput (KB/s) versus time (s) for 8 disks at 100 MHz, 6 disks at 100 MHz, 8 disks at 133 MHz, and 6 disks at 133 MHz]

SLIDE 33

Puzzling I/O results

  • read speeds for some 80 GB files consistently ~50% faster (620 MB/s) for md0 of a 2 x 8-disk RAID5 on RR 1820As
  • reads of other files consistently lower
  • read speeds up to 50% faster using /dev/md0 rather than /dev/sda directly (e.g. Areca 1160 15-disk RAID5: 190 to 323 MB/s)
  • bi-stable (fast/slow) read modes within the same file
  • diskscrubb utility re-maps bad blocks – takes ~2 hrs for a 300 GB drive
  • “weak” blocks that are not being remapped are a possible reason for slow spots
  • room temperature gradient suspected – tested – discounted

SLIDE 34

Puzzling I/O results

Bi-stable state for reads – a useful tool to display which disk may be slowing I/O is iostat -x 1:

Device:  rrqm/s  wrqm/s     r/s   w/s     rsec/s  wsec/s      rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
hda        0.00    0.00    1.00  0.00       8.00    0.00       4.00    0.00      8.00      0.01   9.00   9.00    0.90
md0        0.00    0.00 1920.00  0.00  491520.00    0.00  245760.00    0.00    256.00      0.00   0.00   0.00    0.00
sda        0.00    0.00  239.00  0.00   61440.00    0.00   30720.00    0.00    257.07     10.88  45.31   4.19  100.10  BAD
sdb        0.00    0.00  238.00  0.00   61440.00    0.00   30720.00    0.00    258.15      2.80  11.76   2.46   58.50
sdc        0.00    0.00  240.00  0.00   61440.00    0.00   30720.00    0.00    256.00      2.85  11.91   2.40   57.70
sdd        0.00    0.00  240.00  0.00   61440.00    0.00   30720.00    0.00    256.00      3.01  12.61   2.58   61.80
sde        0.00    0.00  237.00  0.00   61440.00    0.00   30720.00    0.00    259.24      2.94  12.39   2.57   61.00
sdf        0.00    0.00  236.00  0.00   61440.00    0.00   30720.00    0.00    260.34      2.96  12.47   2.61   61.60
sdg        0.00    0.00  239.00  0.00   61440.00    0.00   30720.00    0.00    257.07      3.04  12.77   2.51   60.00
sdh        0.00    0.00  235.00  0.00   61440.00    0.00   30720.00    0.00    261.45      3.02  12.72   2.49   58.60

When working properly this is...

Device:  rrqm/s  wrqm/s     r/s   w/s     rsec/s  wsec/s      rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
hda        0.00    1.00    1.00 37.00       8.00  304.00       4.00  152.00      8.21      0.09   2.37   0.21    0.80
md0        0.00    0.00 3520.00  0.00  901120.00    0.00  450560.00    0.00    256.00      0.00   0.00   0.00    0.00
sda        0.00    0.00  434.00  0.00  112640.00    0.00   56320.00    0.00    259.54      8.57  19.52   2.30  100.00
sdb        0.00    0.00  446.00  1.00  112640.00    0.00   56320.00    0.00    251.99      8.07  20.50   2.20   98.30
sdc        0.00    0.00  440.00  0.00  112640.00    0.00   56320.00    0.00    256.00      6.11  13.89   2.25   98.80
sdd        0.00    0.00  440.00  0.00  112640.00    0.00   56320.00    0.00    256.00      4.63  10.52   2.18   96.10
sde        0.00    0.00  439.00  0.00  112640.00    0.00   56320.00    0.00    256.58      4.64  10.54   2.18   95.70
sdf        0.00    0.00  441.00  0.00  112640.00    0.00   56320.00    0.00    255.42      6.26  14.22   2.25   99.20
sdg        0.00    0.00  437.00  0.00  112640.00    0.00   56320.00    0.00    257.76      4.89  11.11   2.19   95.80
sdh        0.00    0.00  439.00  0.00  112640.00    0.00   56320.00    0.00    256.58      5.21  11.84   2.19   96.10

Solution? Replace the ‘slow’ disk with a normal one.
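A small sketch of how the slow member disk can be flagged automatically; the column positions ($1 = device, $12 = await, $14 = %util) match the iostat -x output shown above, but other sysstat versions may order columns differently, and the 30 ms threshold is just an illustrative cutoff:

    #!/bin/bash
    # Sketch: flag member disks whose average wait time stands out in iostat -x output.
    iostat -x 1 2 | awk '
      /^sd/ {
          if ($12 + 0 > 30)    # await far above its peers suggests a weak disk
              printf "%s: await=%s ms, util=%s%%\n", $1, $12, $14
      }'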

SLIDE 35

I/O related results

Shows the drop in read speed depending on the location of the file: reads are significantly faster on the outer part of the software RAID0 (JBOD) set.

SLIDE 36

TRIUMF-CERN GbE lightpath

  • currently a GbE circuit, established since April 18th, 2005
  • uses an ONS 15454
  • used primarily for the ATLAS Service Challenge
  • hoping to have a 10 GbE lightpath to CERN by Jan/Feb 2006

SLIDE 37

Atlas SC3 Setup

ATLAS Tier 1 Service Challenge 3 (primary contact: Reda Tafirout, tafirout@triumf.ca)

3 Ciara servers:
  • Intel SE7520BD2 board (dual GigE, PCI-X, etc.)
  • dual 3 GHz Nocona EM64T (1 MB cache / 800 MHz FSB)
  • 2 GB RAM
  • 1 system disk: 80 GB IDE (laptop drive)
  • 8 x 250 GB SATA150 (Seagate Barracuda NCQ, 8 MB cache)
  • 3Ware 9500S-8MI RAID5
  • InfiniBand connections

1 Evetek server (management node):
  • dual Opteron 246, 2.0 GHz (800 MHz FSB)
  • 2 GB RAM
  • 1 system disk: WD 80 GB SATA
  • 2 x 250 GB WD SATA on a 3Ware 9500S-LP (4 channels)
  • Adaptec Ultra160 SCSI 29160-LP

Tape system: 2 x IBM 4560SLX SDLT libraries
  • each with 1 SDLT drive + 26 SDLT tapes
  • each has a fibre channel interface card

All systems are running FC3 x86_64 with a 2.6 kernel, and dCache for disk management (with GridFTP + SRM access doors)
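Since the data sits behind dCache GridFTP doors, one way such a transfer can be driven is with the standard globus-url-copy client; a minimal sketch, assuming a valid grid proxy, with a hypothetical door host and /pnfs path (the real names come from the local dCache configuration), and not claiming this is the exact tooling used for SC3:

    #!/bin/bash
    # Sketch: push a file into dCache through a GridFTP door with parallel TCP streams.
    # -p sets the number of streams and -tcp-bs the TCP buffer size, both of which
    # matter on long fat networks; host, port, and paths are placeholders.
    globus-url-copy -vb -p 5 -tcp-bs 4M \
        file:///data/atlas/testfile \
        gsiftp://dcache-door.example.triumf.ca:2811/pnfs/triumf.ca/data/atlas/testfile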

SLIDE 38

TRIUMF-Carleton U lightpath

Servers:
  • 5 x dual Opteron 250 (2.4 GHz), 2 GB memory, 16 x 300 GB SATA drives
  • SunFire V40z: quad Opteron 850 (2.4 GHz), 8 GB memory, 3 x 146 GB SCSI

Network cards:
  • Intel PRO/10GbE-LR
  • S2io/Neterion Xframe

RAID and SATA controllers:
  • 3Ware 9500s, 8-port
  • RocketRAID 1820A, 8-port
  • SuperMicro MV8, 8-port
  • Areca 1160, 16-port

Network:
  • MRV CWDM
  • Foundry NI1500 & NI40G
  • 10G-ER 1550 nm LAN PHY
  • 10G-LR 1310 nm LAN/WAN PHY
  • CA*net 4 OME 6500

SLIDE 39

Transfer results over 10 GbE

  • 1 GbE disk-to-disk transfers between TRIUMF and Carleton (Ottawa) over the 10G circuit: 115 MB/s sustained for ~5 days, equivalent to ~46 TB
  • iperf between TRIUMF & Ottawa, memory-to-memory, for 1 week: 3.74 Gbps averaged (460 MB/s), 350 TB transferred (errors ignored) – see the sketch below
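A minimal sketch of the kind of memory-to-memory iperf test quoted above (iperf 2 syntax; host name, window size, and stream count are placeholders to be tuned for the path, not the exact parameters used):

    #!/bin/bash
    # Receiver (e.g. at TRIUMF): TCP server with a large socket buffer
    # for the high bandwidth-delay product path.
    iperf -s -w 16M

    # Sender (e.g. at Ottawa): 5 parallel TCP streams, report every 60 s,
    # run for one week (604800 s).
    iperf -c triumf-host.example.ca -w 16M -P 5 -i 60 -t 604800

    # UDP variant for single-stream comparisons: offer 6 Gbit/s for 10 minutes.
    iperf -c triumf-host.example.ca -u -b 6000M -t 600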

SLIDE 40

Transfer results over 10 GbE

  • disk-to-memory, back-to-back over a short distance, 24 hrs, single TCP stream: average of 2.4 Gbps (300 MB/s)
    (max disk read 361 MB/s – 16-disk RAID5)
  • disk-to-disk, back-to-back over a short distance, 76 TB in ~4 days, bbftp with 5 TCP streams: average of 1.8 Gbps (220 MB/s) – see the sketch below
    (max disk write 303 MB/s – 15-disk Areca RAID5; max disk read 361 MB/s – 16 disks as 2 x 8 RR1820A RAID5)
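A sketch of how a multi-stream bbftp transfer of this kind is typically invoked; the option names (-u user, -p parallel streams, -e control commands) follow common bbftp usage but should be checked against the installed bbftp version, and the user, host, and paths are placeholders:

    #!/bin/bash
    # Sketch: disk-to-disk transfer with 5 parallel TCP streams using bbftp.
    # Verify the exact flags against the local bbftp documentation.
    bbftp -u atlas -p 5 \
          -e 'put /data/atlas/bigfile /data/atlas/bigfile' \
          storm1.example.triumf.ca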

SLIDE 41

Pumping data into a 10GbE circuit

[Diagram: data path from disk read through memory to the network, showing measured rates at each stage (disk reads of 320-500 MB/s, memory copies up to 575 MB/s, network output of 360-500 MB/s for single and parallel streams) and CPU utilization of 60-100%]

Bottleneck – buffering? What are the solutions? Zero-copy?

SLIDE 42

Conclusions/Observations

  • dual Opterons may still be I/O limited – exploring hot-wired quad Opterons
  • SATA drives may need more quality control/screening/repair
  • RAID5 for 1-4 TB, RAID6 for larger sets (now up to 24 disks/controller)
  • some cards have a 2 TB limit
  • GbE delivers stable disk-to-disk long-distance transfers at 120 MB/s
  • there are critical tuning requirements – servers cannot be used blindly
  • achieving “robustness” is not easy!
  • lightpaths, however, make this much easier!

SLIDE 43

Further Explorations

  • 10 GbE network infrastructure
    • over the past 3 years the 10 GbE networking vendor space has matured
    • perhaps time to acquire something more permanent – under consideration
    • XFP-based optics are the latest trend
  • re-visit evaluation of different data transfer protocols

SLIDE 44

Further Explorations

  • ATA over Ethernet
    • had some discussions with Coraid
    • explore how Ethernet-attached drives would behave over long-haul networks
  • iSCSI
    • iSCSI over long-haul networks
    • Sun V40z with Solaris 10 (native iSCSI stack)
    • demonstrated I/O over 500 MB/s

SLIDE 45

Further Explorations

  • 10 GbE NICs
    • NICs with TOE
    • Myrinet recently announced new lower-cost 10 GbE-compatible NICs
  • PCI-Express
    • emergence of PCI-E disk controllers and NICs

SLIDE 46

Thank You!

kost@triumf.ca
mcdonald@triumf.ca
xiong@physics.carleton.ca