SLIDE 1

ASGC Tier1 Center & Service Challenges activities

ASGC, Jason Shih

SLIDE 2

Outline

- Tier1 center operations
  - Resource status, QoS and utilization
  - User support
  - Other activities at ASGC (excluding HEP)
- Biomed DC2
  - Service availability
- Service challenges
  - SC4 disk-to-disk throughput testing
- Future remarks
  - SA improvement
  - Resource expansion

SLIDE 3

ASGC T1 operations

SLIDE 4

WAN connectivity

SLIDE 5

ASGC Network

SLIDE 6

Computing resources

- Instability of the information system (IS) caused ASGC service endpoints to be removed from the experiments' BDII (see the query sketch below)
- High load on the CE affects the site information being published (the site GIIS runs on the CE)
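When the top-level BDII stops seeing a site's resources, the experiment-level BDIIs built from it drop that site's endpoints. A minimal sketch of how one might check what a BDII currently publishes, assuming the conventional BDII port 2170 and the `o=grid` Glue schema base, and using the third-party ldap3 library; the hostname is a placeholder, not an actual ASGC endpoint:

```python
# List the computing elements a BDII currently publishes.
# Assumes the Glue schema base (o=grid) and the standard BDII port 2170;
# the hostname below is a placeholder.
from ldap3 import Server, Connection, ALL

BDII_HOST = "bdii.example.org"  # hypothetical top-level BDII

server = Server(f"ldap://{BDII_HOST}:2170", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous, read-only bind

# Each published CE carries a GlueCEUniqueID attribute.
conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID"],
)

for entry in conn.entries:
    print(entry.GlueCEUniqueID)
```

If a site's CEs are missing from this output while its GIIS is overloaded, the symptom above (endpoints dropped from the experiment BDII) follows.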

SLIDE 7

Job execution at ASGC

- Instability of the site GIIS causes errors in publishing dynamic information
- High load on the CE leads to abnormal functioning of maui

SLIDE 8

OSG/LCG resource integration

- Mature technology helps integrate resources
  - GCB introduced to help integrate with the IPAS T2 computing resources
  - CDF/OSG users can submit jobs by gliding in to the GCB box
  - T1 computing resources accessible through the "twgrid" VO
- Customized UI to help access backend storage resources
  - helps local users who are not yet ready for the grid
  - lets HEP users access T1 resources

SLIDE 9

ASGC Helpdesk

- Currently supports the following services (queues):
  - CIC/ROC
  - PRAGMA
  - HPC
  - SRB
- Classification of the CIC/ROC sub-queues:
  - T1
  - CASTOR
  - SC
  - SSC

SLIDE 10

ASGC TRS: Accounting

| Statistic (Tot/Ave) |        |
|---------------------|--------|
| Total tickets       | 435/40 |
| Closed tickets      | 425/39 |
| Open tickets        | 10     |

SLIDE 11

Biomed DC2

- Added 90 KSI2k dedicated to DC2 activities; an additional subcluster introduced in the IS
- Maintaining site functionality to help DC2 grid jobs run
  - troubleshooting grid-wide issues
  - collaborating with biomed operations in AP
- AP: GOG-Singapore devoted resources for DC2
- First run on part of the 36690 ligands started 4 April 2006; fourth run started 21 April

SLIDE 12

Biomed DC2 (cont'd)

- Two frameworks introduced: DIANE and WISDOM
- Average ~30% contribution from ASGC over the 4 DIANE runs

Per-run contribution (%) by CE/subcluster:

| CE/subcluster             | 1st  | 2nd  | 3rd  | 4th  |
|---------------------------|------|------|------|------|
| Q-HPC (quanta.grid)       | 15.4 | 28.8 | 8.4  | 12.4 |
| Prod. LCG (lcg00125.grid) |      | 10.6 | 37.2 | 9    |

(The per-run sums 15.4, 39.4, 45.6 and 21.4 average to ~30%, matching the stated ASGC contribution.)

SLIDE 13

Service Availability

SLIDE 14

Service challenges - 4

SLIDE 15

SC4 Disk-to-disk transfer

- Problem observed at ASGC:
  - the system crashed immediately when the TCP buffer size was increased (see the buffer-tuning sketch after this list)
- Castor experts helped troubleshoot, but the problem remained for the 2.6 kernel + XFS
- Downgraded the kernel to 2.4, with the 1.2.0rh9 gridftp + XFS
- Again, it crashed once the window size was tuned
- The problem was resolved only after downgrading gridftp to the same version used for the SC3 disk rerun (Apr. 27, 7 AM)
  - tried on one disk server first, then rolled forward to the remaining three
  - 120+ MB/s has been observed
  - continued running for one week
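The tuning in question enlarges the per-socket TCP buffers so the window can grow on long fat links. A minimal sketch using only the Python standard library; the 8 MB figure is illustrative, not ASGC's actual setting:

```python
# Request large per-socket TCP buffers and read back what the kernel grants.
import socket

REQUESTED = 8 * 1024 * 1024  # 8 MB, an illustrative value

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)

# Linux clamps the request to net.core.{w,r}mem_max and doubles it for
# bookkeeping, so the effective size can differ from the request.
print("SO_SNDBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```

It was window tuning of exactly this kind that triggered the crashes in the kernel/gridftp combinations listed on the next slide.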

SLIDE 16

Castor troubleshooting

| Kernel   | Gridftp version* | XFS | Stable | Stable, tuned$ |
|----------|------------------|-----|--------|----------------|
| 2.6      | 1.1.8-13d        | N   | Y      | Y              |
| 2.6      | 1.2.0rh9         | N   | Y      | N              |
| 2.6++    | 1.2.0rh9         | Y   | Y      | N              |
| 2.6++    | 1.1.8-13d        | Y   | Y      | Y              |
| 2.4(2)+  | 1.2.0rh9         | Y   | Y      | N              |
| 2.4(2)+  | 1.1.8-13d        | Y   | N/A    | N/A            |
| 2.4(1)   | 1.2.0rh9         | Y   | Y      | N              |
| 2.4(1)** | 1.1.8-13d        | Y   | N/A    | N/A            |

* gridftp bundled in castor
+ ver. 2.4, 2.4.21-40.EL.cern, adopted from CERN
** ver. 2.4, 2.4.20-20.9.XFS1.3.1, introduced by SGI
++ exact ver. no. 2.6.9-11.EL.XFS
$ TCP window size tuned, max 128 MB
Stack size recompiled to 8k for each experimental kernel adopted

SLIDE 17

SC Castor throughput: GridView

- Disk-to-disk nominal rate
  - ASGC has currently reached 120+ MB/s sustained throughput
  - round-robin SRM headnodes associated with 4 disk servers, each providing ~30 MB/s (4 x ~30 MB/s ≈ 120 MB/s)
  - kernel/castor software issues were being debugged in the early part of SC4 (rate down to only 25%, without further tuning)

SLIDE 18

Tier-1 Accounting: Jan – Mar, 2006

| Tier-1      | Site           | alice | atlas   | cms    | lhcb   | sum     | %     |
|-------------|----------------|-------|---------|--------|--------|---------|-------|
| AsiaPacific | Taiwan-LCG2    |       | 43244   | 18823  |        | 62067   | 2.33  |
| BNL         | BNL-LCG2       |       | 1271894 |        |        | 1271894 | 47.75 |
| CERN        | CERN-PROD      | 6630  | 123194  | 258790 | 53626  | 442240  | 16.6  |
| FNAL        | USCMS-FNAL-WC1 |       |         | 129620 |        | 129620  | 4.87  |
| FZK         | FZK-LCG2       |       | 97152   | 51935  | 10147  | 159234  | 5.98  |
| IN2P3       | IN2P3-CC       |       | 70349   | 27300  | 10107  | 107756  | 4.05  |
| INFN-T1     | INFN-T1        |       |         |        |        |         |       |
| NorduGrid   | Nordic         |       |         |        |        |         |       |
| PIC         | pic            |       | 95067   | 64920  | 32371  | 192358  | 7.22  |
| RAL         | RAL-LCG2       | 9031  | 156114  | 77025  | 21210  | 263380  | 9.89  |
| SARA/NIKHEF | SARA-MATRIX    | 783   | 5966    | 342    | 5744   | 12835   | 0.48  |
| TRIUMF      | TRIUMF-LCG2    |       | 20489   | 693    | 818    | 22000   | 0.83  |
| sum         |                | 16444 | 1883469 | 629448 | 134023 | 2663384 |       |
| %           |                | 0.62  | 70.72   | 23.63  | 5.03   |         |       |

SLIDE 19

Accounting: VO

SLIDE 20

Overall Accounting: CMS/Atlas

SLIDE 21

CMS usage: CRAB monitoring

SLIDE 22

SRM QoS monitoring: CMS Heartbeat

SLIDE 23

Castor2@ASGC

- Testbed was expected to be deployed by the end of March
- Delayed due to:
  - obtaining the LSF license from Platform
  - DB schema troubleshooting
  - manpower overlapping with the castor SC throughput debugging
  - the revised 2k6 Q1 quarterly report
- Separated into two phases:
  - phase (I): without tape functional testing
    - plan to connect to the tape system in the next phase
    - phase (I) expected to complete in mid-May
  - phase (II): planned to finish in mid-June

SLIDE 24

Future remarks

- Resource expansion plan
- QoS improvement
- Castor2 deployment
- New tape system installed
  - continue with disk-to-tape throughput validation
- Resource sharing with local users
  - for users more ready to use the grid
  - large storage resources required

SLIDE 25

Resource expansion: MoU

T1:

| Date (year) | CPU (#) | Disk (TB) | Tape (TB) | Racks (#) |
|-------------|---------|-----------|-----------|-----------|
| 2006        | 950     | 400       | 500       | 12        |
| 2007        | 1770    | 900       | 800       | 34        |
| 2008        | 3400    | 1500      | 1300      | 61        |
| 2009        | 3600    | 2400      | 2000      | 85        |

FTT*: CPU (#) 200 / 300 / 400; Disk (TB) 15 / 30 / 75

*FTT: Federated Taiwan Tier2

SLIDE 26

Resource expansion (I)

- CPU
  - Current status:
    - 430 KSI2k (composed of IBM HS20 and Quanta blades)
  - Goal: 950 KSI2k
  - Quanta blades:
    - 7U chassis, 10 blades, dual CPU, ~1.4 KSI2k/cpu
    - ratio ~30 KSI2k per 7U
    - need 19 chassis (~4 racks) to meet 950 KSI2k; see the sizing sketch after this list
  - IBM blades:
    - LV model available (saves 70% of power consumption)
    - higher density: 54 processors (dual core + SMP Xeon)
    - ratio ~80 KSI2k per 7U; only 13 chassis needed (~3 racks)
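A back-of-envelope check of the Quanta sizing, under two assumptions not stated on the slide: 42U of usable space per rack, and the 950 KSI2k goal being met by adding to the existing 430 KSI2k:

```python
# Back-of-envelope sizing for the Quanta blade expansion.
import math

CURRENT_KSI2K = 430          # existing capacity
GOAL_KSI2K = 950             # MoU target
RACK_U = 42                  # assumed usable units per rack

# One Quanta chassis: 7U, 10 blades, 2 CPUs each, ~1.4 KSI2k per CPU.
per_chassis = 10 * 2 * 1.4   # 28 KSI2k, i.e. ~30 KSI2k per 7U

needed = GOAL_KSI2K - CURRENT_KSI2K          # 520 KSI2k to add
chassis = math.ceil(needed / per_chassis)    # -> 19 chassis
racks = math.ceil(chassis * 7 / RACK_U)      # -> 4 racks

print(f"{per_chassis:.0f} KSI2k/chassis, {chassis} chassis, ~{racks} racks")
```

Under those assumptions the arithmetic reproduces the slide's 19 chassis and ~4 racks.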

SLIDE 27

Resource expansion (II)

- Disk
  - Current status:
    - 3U arrays, 400 GB drives, 14 drives per array
    - ratio: 4.4 TB/6U
  - Goal:
    - 400 TB, so ~90 arrays needed
    - ~9 racks (assuming 11 arrays per rack); see the sketch after this list
- Tape
  - New 3584 tape library to be installed in mid-May
  - 4 x LTO4 tape drives providing ~80 MB/s throughput
  - originally expected to be installed in mid-March; delayed due to:
    - internal procurement
    - updating project items with the funding agency
  - new tape system now expected to be in place in mid-May
  - full system in operation within two weeks of installation
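A quick check of the disk sizing above, assuming plain ceiling division (values from the slide):

```python
# Verify the disk-expansion arithmetic: arrays and racks for 400 TB.
import math

GOAL_TB = 400
TB_PER_ARRAY = 4.4      # usable capacity per array, from the slide
ARRAYS_PER_RACK = 11    # packing assumption stated on the slide

arrays = math.ceil(GOAL_TB / TB_PER_ARRAY)   # -> 91, i.e. ~90 arrays
racks = math.ceil(arrays / ARRAYS_PER_RACK)  # -> 9 racks

print(f"{arrays} arrays in ~{racks} racks")
```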

SLIDE 28

IBM 3584 vs. STK SL8500

| Attribute                                 | IBM 3584       | STK SL8500     |
|-------------------------------------------|----------------|----------------|
| Modular library design                    | 5 years        | New            |
| Redundant robotics                        | Yes            | TBD            |
| Accessors required for redundancy         | 2              | 8              |
| Any-to-any cartridge to drive access      | Yes            | No             |
| Min/Max single library slot configuration | 58 / 6,881     | 1,448 / 6,632  |
| Maximum tape drive configuration          | 192            | 64             |
| Maximum cartridge capacity supported      | 400 GB / LTO3  | 200 GB / 9940C |
| Maximum single library capacity           | 2.75 PB        | 1.33 PB        |
| Cartridge density in slots/sq-ft          | 41*            | 29             |
| Storage density in TB/sq-ft               | 16.4*          | 5.7            |
| Audit time                                | <60 sec/frame  | <60 min        |
| Average cell to drive time                | 1.8 sec*       | 5 sec          |
| Required software expense                 | None           | HSC/ACSLS      |
| Software required for remote management   | No             | Yes            |

SLIDE 29

Resource expansion (III)

- C2 area of the new IPAS machine room
  - Rack space design
  - AC/cooling requirement:
    - for 20 racks: 1,360,000 BTUH, or 113.3 tons of cooling (2800 KSI2k)
    - 36 racks: 1,150,000 BTUH, or 95 tons (1440 TB)
  - HVAC: ~800 kVA estimated (see the worked check below)
    - HS20: 4000 W x 5 x 20, plus STK arrays: 1000 W x 11 x 36
  - Generator
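A worked check of the cooling and power figures, assuming the standard 12,000 BTU/h per ton of cooling and treating watts ≈ volt-amperes for the rough kVA estimate (both assumptions):

```python
# Sanity-check the machine-room cooling and power estimates.
BTUH_PER_TON = 12_000  # definition of a ton of cooling

print(1_360_000 / BTUH_PER_TON)  # -> 113.3 tons for the 20-rack area
print(1_150_000 / BTUH_PER_TON)  # -> ~95.8 tons for the 36-rack area

# HVAC load, as itemized on the slide:
hs20_w = 4000 * 5 * 20     # 400 kW of HS20 blades
array_w = 1000 * 11 * 36   # 396 kW of STK disk arrays
print((hs20_w + array_w) / 1000)  # -> ~796 kW, i.e. ~800 kVA at PF ~1
```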

SLIDE 30

Summary

- New tape system ready in mid-May, full operation within two weeks
  - plan to run disk-to-tape throughput testing
- Split the batch system and the CE
  - helps stabilize scheduling functionality (mid-May)
  - the site GIIS is sensitive to high CPU load; move it to an SMP box
- CASTOR2 deployed in mid-June
  - connect it to the new tape library
  - migrate data from the disk cache

SLIDE 31

Acknowledgment

- CERN:
  - SC: Jamie, Maarten
  - Castor: Olof
  - Atlas: Zhong-Liang Ren
  - CMS: Chia-Ming Kuo
- ASGC:
  - Min, Hung-Che, J-S
  - Oracle: J.H.
  - Network: Y.L., Aries
  - CA: Howard
- IPAS: P.K., Tsan, & Suen

SLIDE 32

SRM & MSS deployed at each Tier-1

SLIDE 33

Nominal network/disk rates by sites

SLIDE 34

Target disk – tape throughput

SLIDE 36

Disk server snapshot (I)

- Host: lcg00116
- Kernel: 2.4.20-20.9.XFS1.3.1
- Castor gridftp ver.: VDT1.2.0rh9-1

SLIDE 37

Disk server snapshot (II)

- Host: lcg00118
- Kernel: 2.4.21-40.EL.cern
- Castor gridftp ver.: VDT1.2.0rh9-1

SLIDE 38

Disk server snapshot (III)

- Host: sc003
- Kernel version: 2.6.9-11.EL.XFS
- Castor gridftp ver.: VDTALT1.1.8-13d.i386

SLIDE 39

Accounting: normalized CPU time

| VO         | Jun-05 | Jul-05 | Aug-05 | Sep-05 | Oct-05 | Nov-05 | Dec-05 | Jan-06 | Feb-06 | Mar-06 | Apr-06  | May-06 |
|------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|---------|--------|
| alice      | 48     |        |        |        |        |        |        |        |        |        |         |        |
| atlas      | 187    | 162    | 1,841  | 4,674  | 5,739  | 3,290  | 1,959  | 994    | 3,706  | 38,544 | 99,269  | 2,892  |
| biomed     | 13     | 1,019  | 5,818  | 49     | 14     | 2,283  |        | 1      | 7      | 12,366 | 7,364   | 17     |
| cms        |        | 3      | 6      | 15,924 | 11,513 | 1,082  | 2,678  | 87     | 2,844  | 15,892 | 60,357  | 935    |
| dteam      | 1      |        | 1      | 2      | 1      | 3      | 4      | 2      | 11     | 2      | 2       |        |
| twgrid     |        |        |        |        |        |        |        | 30     | 25     |        | 60      | 38     |
| Total      | 249    | 1,184  | 7,666  | 20,649 | 17,267 | 6,658  | 4,641  | 1,114  | 6,593  | 66,804 | 167,052 | 3,882  |
| Percentage | 0.08%  | 0.39%  | 2.52%  | 6.80%  | 5.68%  | 2.19%  | 1.53%  | 0.37%  | 2.17%  | 21.99% | 54.99%  | 1.28%  |
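The Percentage row is each month's Total over the grand total; a quick recomputation (totals copied from the table above):

```python
# Recompute the Percentage row from the monthly totals.
totals = [249, 1184, 7666, 20649, 17267, 6658, 4641,
          1114, 6593, 66804, 167052, 3882]
grand = sum(totals)  # 303,759
print([f"{t / grand:.2%}" for t in totals])
# -> 0.08%, 0.39%, 2.52%, ..., 54.99%, 1.28%, matching the table
```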