ASGC Tier1 Center & Service Challenges activities
ASGC, Jason Shih
Outline
- Tier1 center operations
  - Resource status, QoS and utilization
  - User support
  - Other activities in ASGC (excluding HEP)
- Biomed DC2
  - Service availability
- Service challenges
  - SC4 disk-to-disk throughput testing
- Future remarks
  - SA improvement
  - Resource expansion
- Instability of the IS affected ASGC service endpoints
- High load on the CE impacted the site information published
- Instability of the site GIIS disrupted dynamic information publishing
- High CPU load on the CE led to abnormal functioning of the site GIIS
- Mature technologies help integrate resources
  - GCB introduced to help integrate with IPAS T2 computing resources
  - CDF/OSG users can submit jobs by gliding in through the GCB box
  - T1 computing resources accessible from the "twgrid" VO
- Customized UI to help access backend storage resources
  - Helps local users who are not yet ready for the grid
  - HEP users can access T1 resources
- Currently supported services (queues):
  - CIC/ROC
  - PRAGMA
  - HPC
  - SRB
- Sub-queues of CIC/ROC:
  - T1
  - CASTOR
  - SC
  - SSC
Ticket statistics (Total/Average):
- Total tickets: 435/40
- Closed tickets: 425/39
- Open tickets: 10
- Added 90 KSI2k dedicated to the biomed data challenge
- Maintaining site functionality
  - Troubleshooting grid-wide issues
  - Collaborating with biomed in the Asia-Pacific region
- AP partner: GOG-Singapore
- First run on part of the 36,690 ligands (started 4 April 2006)
- Fourth run started 21 April
- Two frameworks introduced
  - DIANE and WISDOM
- Average ~30% contribution from ASGC across the 4 runs (DIANE)

Per-run ASGC contribution by CE/subcluster (%):

CE/subcluster       | 1st  | 2nd  | 3rd  | 4th
(lcg00125.grid)     |      | 10.6 | 37.2 | 9
Q-HPC (quanta.grid) | 15.4 | 28.8 | 8.4  | 12.4
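The quoted ~30% average can be cross-checked by summing each run's two subcluster shares; a minimal sketch (the run-1 cell for lcg00125.grid is blank in the source and taken as 0 here):

```python
# Per-run ASGC contribution = sum of the two CE/subcluster shares (values in %).
runs = {
    1: (0.0, 15.4),   # (lcg00125.grid, Q-HPC quanta.grid); run-1 lcg00125 cell is blank
    2: (10.6, 28.8),
    3: (37.2, 8.4),
    4: (9.0, 12.4),
}
avg = sum(a + b for a, b in runs.values()) / len(runs)
print(round(avg))  # 30 -> consistent with the slide's "Ave. 30% contribution"
```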
- Problems observed at ASGC:
  - The system crashed immediately when the TCP buffer size was tuned up
  - CASTOR experts helped with troubleshooting, but the problem remained for the 2.6 kernel + XFS
  - Downgraded the kernel to 2.4 + 1.2.0rh9 gridftp + XFS
  - Again, crashed when the window size was tuned
  - The problem was resolved only after downgrading gridftp to the same version used for the SC3 disk rerun (Apr. 27, 7 AM)
  - Tried on one disk server first, then rolled out to the remaining three
  - 120+ MB/s has been observed
  - Continued running for one week
Kernel    | Gridftp version* | XFS | Stable | Stable (tuned$)
2.6       | 1.1.8-13d        | N   | Y      | Y
2.6       | 1.2.0rh9         | N   | Y      | N
2.6++     | 1.2.0rh9         | Y   | Y      | N
2.6++     | 1.1.8-13d        | Y   | Y      | Y
2.4(2)+   | 1.2.0rh9         | Y   | Y      | N
2.4(2)+   | 1.1.8-13d        | Y   | N/A    | N/A
2.4(1)    | 1.2.0rh9         | Y   | Y      | N
2.4(1)**  | 1.1.8-13d        | Y   | N/A    | N/A

*  gridftp bundled with CASTOR
+  ver. 2.4, 2.4.21-40.EL.cern, adopted from CERN
** ver. 2.4, 2.4.20-20.9.XFS1.3.1, introduced by SGI
++ exact ver. 2.6.9-11.EL.XFS
$  TCP window size tuned, max 128 MB
Stack size recompiled to 8 KB for each experimental kernel adopted.
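The 128 MB window cap is on the order of the bandwidth-delay product for a long-haul path. A rough check, where the 2 Gbit/s link speed and 300 ms RTT are illustrative assumptions, not figures from the slides:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_s

# Assumed values: ~2 Gbit/s trans-continental path, ~300 ms round-trip time.
bdp = bdp_bytes(2e9, 0.300)
print(f"{bdp / 2**20:.0f} MiB")  # ~72 MiB, same order as the 128 MB window cap
```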
- Disk-to-disk nominal rate
  - ASGC has currently reached a steady 120+ MB/s
  - Round-robin SRM headnodes associated with the 4 disk servers
  - Kernel/CASTOR software issues were debugged early in the exercise
Tier-1      | Site           | alice  | atlas     | cms     | lhcb    | sum       | %
AsiaPacific | Taiwan-LCG2    |        | 43244     | 18823   |         | 62067     | 2.33
BNL         | BNL-LCG2       |        | 1271894   |         |         | 1271894   | 47.75
CERN        | CERN-PROD      | 6630   | 123194    | 258790  | 53626   | 442240    | 16.6
FNAL        | USCMS-FNAL-WC1 |        |           | 129620  |         | 129620    | 4.87
FZK         | FZK-LCG2       |        | 97152     | 51935   | 10147   | 159234    | 5.98
IN2P3       | IN2P3-CC       |        | 70349     | 27300   | 10107   | 107756    | 4.05
INFN-T1     | INFN-T1        |        |           |         |         |           |
NorduGrid   | Nordic         |        |           |         |         |           |
PIC         | pic            |        | 95067     | 64920   | 32371   | 192358    | 7.22
RAL         | RAL-LCG2       | 9031   | 156114    | 77025   | 21210   | 263380    | 9.89
SARA/NIKHEF | SARA-MATRIX    | 783    | 5966      | 342     | 5744    | 12835     | 0.48
TRIUMF      | TRIUMF-LCG2    |        | 20489     | 693     | 818     | 22000     | 0.83
sum         |                | 16444  | 1883469   | 629448  | 134023  | 2663384   |
%           |                | 0.62   | 70.72     | 23.63   | 5.03    |           |
- Testbed expected to be deployed at the end of ...
- Delayed due to:
  - DB schema troubleshooting
  - Throughput
  - Revised 2006 Q1 quarterly report
- Separated into two phases:
  - Phase (I): without tape functional testing
    - Plan to connect to the tape system in the next phase
    - Phase (I) expected to complete in mid-May
  - Phase (II): planned to finish in mid-June
- Resource expansion plan
- QoS improvement
- Castor2 deployment
- New tape system installed
  - Continue with disk-to-tape throughput validation
- Resource sharing with local users
  - For users more ready to use the grid
  - Large storage resources required
*FTT: Federated Taiwan Tier2
- CPU
  - Current status:
    - 430 KSI2k (composed of IBM HS20 and Quanta blades)
  - Goal: 950 KSI2k
  - Quanta blades:
    - 7U chassis, 10 blades, dual CPU, ~1.4 KSI2k/CPU
    - Ratio ~30 KSI2k per 7U; to meet 950 KSI2k,
    - 19 chassis needed (~4 racks)
  - IBM blades:
    - LV model available (saves 70% power consumption)
    - Higher density: 54 processors (dual-core + SMP Xeon)
    - Ratio ~80 KSI2k per 7U; only 13 chassis needed (~3 racks)
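The chassis ratings can be sanity-checked from the per-blade figures; a quick sketch (the slide's own chassis counts include rounding and headroom):

```python
import math

# Quanta: 10 blades x 2 CPUs x ~1.4 KSI2k per CPU in one 7U chassis
quanta_per_chassis = 10 * 2 * 1.4
print(round(quanta_per_chassis))   # 28, quoted on the slide as ~30 KSI2k/7U

# IBM: ~80 KSI2k per 7U chassis; chassis needed to host the full 950 KSI2k goal
print(math.ceil(950 / 80))         # 12; the slide plans 13 chassis (headroom)
```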
- Disk
  - Current status:
    - 3U arrays, 400 GB drives, 14 drives per array
    - Ratio: 4.4 TB per 6U
  - Goal:
    - 400 TB: ~90 arrays needed
    - ~9 racks (assuming 11 arrays per rack)
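The array and rack counts follow from the stated per-array ratio; a minimal check, assuming ~4.4 TB usable per array:

```python
import math

goal_tb = 400
tb_per_array = 4.4                      # usable capacity per array, from the slide
arrays = math.ceil(goal_tb / tb_per_array)
racks = math.ceil(arrays / 11)          # slide assumes 11 arrays per rack
print(arrays, racks)  # 91 9 -> matches the slide's "~90 arrays, ~9 racks"
```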
n
Tape
q
New 3584 tape lib installed mid of May
q
4 x LTO4 tape drives provide ~ 80MB/s throughput
q
expected to be installed in mid-March
q
delayed due to
n
internal procurement
n
update items of projects from funding agency
q
Expected new tape system implemented at mid-May
q
full system in operation within two weeks after installed.
Feature                                 | IBM 3584       | STK SL8500
Modular library design                  | 5 years        | New
Redundant robotics                      | Yes            | TBD
Accessors required for redundancy       | 2              | 8
Any-to-any cartridge to drive access    | Yes            | No
Min/Max single library slot config      | 58 / 6,881     | 1,448 / 6,632
Maximum tape drive configuration        | 192            | 64
Maximum cartridge capacity supported    | 400 GB / LTO3  | 200 GB / 9940C
Maximum single library capacity         | 2.75 PB        | 1.33 PB
Cartridge density in slots/sq-ft        | 41*            | 29
Storage density in TB/sq-ft             | 16.4*          | 5.7
Audit time                              | < 60 sec/frame | < 60 min
Average cell to drive time              | 1.8 sec*       | 5 sec
Required software expense               | None           | HSC/ACSLS
Software required for remote management | No             | Yes
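The maximum-capacity rows are consistent with the slot counts and native cartridge sizes; a quick check:

```python
# Max library capacity = max slot count x native cartridge capacity (GB -> PB)
ibm_pb = 6_881 * 400 / 1e6     # IBM 3584 with 400 GB LTO3 cartridges
stk_pb = 6_632 * 200 / 1e6     # STK SL8500 with 200 GB 9940 cartridges
print(round(ibm_pb, 2), round(stk_pb, 2))  # 2.75 1.33 -> matches the table
```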
- C2 area of the new IPAS machine room
  - Rack space design
  - AC/cooling requirements:
    - 20 racks (2800 KSI2k): 1,360,000 BTUH, or 113.3 tons of cooling
    - 36 racks (1440 TB): 1,150,000 BTUH, or 95 tons
  - HVAC: ~800 kVA estimated
    - (HS20: 4000 W x 5 x 20) + (STK array: 1000 W x 11 x 36)
  - Generator
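The cooling and power figures can be reproduced from the stated loads; a back-of-the-envelope check (1 ton of refrigeration = 12,000 BTU/h; power factor ~1 assumed):

```python
BTUH_PER_TON = 12_000   # 1 ton of refrigeration = 12,000 BTU/h

print(round(1_360_000 / BTUH_PER_TON, 1))  # 113.3 tons for the 20-rack area
print(round(1_150_000 / BTUH_PER_TON, 1))  # 95.8 tons, quoted as ~95

# 20 racks x 5 HS20 chassis x 4 kW  +  36 racks x 11 arrays x 1 kW
load_kw = 20 * 5 * 4 + 36 * 11
print(load_kw)  # 796 -> the ~800 kVA estimate at power factor ~1
```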
- New tape system ready mid-May; full operation shortly after
  - Plan to run disk-to-tape throughput testing
- Split the batch system and the CE
  - Helps stabilize scheduling functionality (mid-May)
  - Site GIIS is sensitive to high CPU load; move it to an SMP box
- CASTOR2 deployed mid-June
  - Connect to the new tape library
  - Migrate data from the disk cache
- CERN:
  - SC: Jamie, Maarten
  - Castor: Olof
  - Atlas: Zhong-Liang Ren
  - CMS: Chia-Ming Kuo
- ASGC:
  - Min, Hung-Che, J-S
  - Oracle: J.H.
  - Network: Y.L., Aries
  - CA: Howard
- IPAS: P.K., Tsan, & Suen
- Host: lcg00116
- Kernel: 2.4.20-20.9.XFS1.3.1
- Castor gridftp ver.: VDT1.2.0rh9-1
- Host: lcg00118
- Kernel: 2.4.21-40.EL.cern
- Castor gridftp ver.: VDT1.2.0rh9-1
- Host: sc003
- Kernel version: 2.6.9-11.EL.XFS
- Castor gridftp ver.: VDTALT1.1.8-13d.i386
VO         | Jun-05 | Jul-05 | Aug-05 | Sep-05 | Oct-05 | Nov-05 | Dec-05 | Jan-06 | Feb-06 | Mar-06 | Apr-06  | May-06
alice      | 48     |        |        |        |        |        |        |        |        |        |         |
atlas      | 187    | 162    | 1,841  | 4,674  | 5,739  | 3,290  | 1,959  | 994    | 3,706  | 38,544 | 99,269  | 2,892
biomed     | 13     | 1,019  | 5,818  | 49     | 14     | 2,283  |        | 1      | 7      | 12,366 | 7,364   | 17
cms        |        | 3      | 6      | 15,924 | 11,513 | 1,082  | 2,678  | 87     | 2,844  | 15,892 | 60,357  | 935
dteam      | 1      |        | 1      | 2      | 1      | 3      | 4      | 2      | 11     | 2      | 2       |
twgrid     |        |        |        |        |        |        |        | 30     | 25     |        | 60      | 38
Total      | 249    | 1,184  | 7,666  | 20,649 | 17,267 | 6,658  | 4,641  | 1,114  | 6,593  | 66,804 | 167,052 | 3,882
Percentage | 0.08%  | 0.39%  | 2.52%  | 6.80%  | 5.68%  | 2.19%  | 1.53%  | 0.37%  | 2.17%  | 21.99% | 54.99%  | 1.28%