
AsiaPacific Regional Operation Center, Min-Hong Tsai, ASGC, ISGC



  1. AsiaPacific Regional Operation Center • Min-Hong Tsai, ASGC • ISGC, May 3, Academia Sinica • http://www.twgrid.org/aproc/

  2. Agenda • Introduction • ROC Status • Recent Activities

  3. APROC Introduction • APROC Goal • Provide deployment support to facilitate Grid expansion • Maximize the availability of Grid services • Supporting EGEE sites in Asia Pacific since April 2005 • EGEE CIC • CIC-on-duty rotation: EGEE global operations • Monitoring tool development: GStat and GGUS Search • VO services • EGEE ROC • Monitoring, diagnosis and problem tracking • M/W release deployment support • Security coordination • Site registration • Portal and documentation

  4. ASGCCA • Production service since July 2003 • Taiwan • LCG/EGEE users in Asia Pacific without a local CA • Member of both • EUGridPMA • APGridPMA • http://ca.grid.sinica.edu.tw

  5. VO Infrastructure Support • APROC hosts centralized services for VOs • Host VOMS server • VO assigns a manager to maintain membership • VO supplies AUP • Host LFC global file catalogue service • Resource Broker • Top-Level BDII • Currently supporting • TWGrid • APeSci

  6. EGEE Site Registration and Certification • Registration Procedure: • http://www.twgrid.org/aproc/doc/admin_intro/newrc/ • Guidance for user and host certificate registration • Registration into GOCDB • Recommended startup documentation • Instructions for further registration in • Mailing lists • VO membership • APROC ticketing system • Consulting on site architecture and deployment • Deployment support and troubleshooting • Site certification • Manual tests • SFT and GStat tests

  7. Middleware and Operations Support • Middleware Support • Installation support • New release testing • Supplementary release notes • Assist in coordination of updates and upgrades • Operations Support • Review and track GGUS and APROC tickets • Monitor and detect new problems • Provide detailed technical support to sites • Support Channels • Phone • Email • TRS Ticketing System

  8. APROC Portal • www.twgrid.org/aproc • Rollout Highlights • Supplemental documentation • Getting started links • Registration information • Contact info and TRS links • lists.grid.sinica.edu.tw/apwiki • Supplementary release notes • Site Operations Procedures • Technical Howtos • Troubleshooting FAQs • APF and GDA meeting minutes • Feel free to contribute!

  9. Agenda • Introduction • ROC Status • Recent Activities

  10. Members and Biweekly Meeting • 11 sites, 7 countries, ~600 CPUs • Australia • India • Japan • Korea • Pakistan • Singapore • Taiwan • APF Meetings • Short biweekly meeting between AP sites • Topics • Operation: M/W issues, operations news, review site status • Service challenge: news and announcements • Other topics welcome, such as BELLE or other regional topics

  11. Site Registration • Site Registration • Recently: • JP-KEK-CRC-01 • In progress • Australia-UNIMELB-LCG2 • JP-KEK-CRC-02 • TW-THU-HPC • PAKGRID3-LCG2 • Welcomed sites from CERN ROC • INDIACMS-TIFR • NCP-LCG2 • PAKGRID-LCG2

  12. APROC Usage I • Total computing capacity is increasing • But so is utilization (peak over 80%) • [Figure: totalCPU, runJob and % CPU usage, 4/1/2005 to 5/1/2006]
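
The utilization on this slide is presumably the number of running jobs divided by the total number of CPUs. A minimal sketch under that assumption follows; the series names come from the chart legend (totalCPU, runJob), while the function name and the sample numbers are hypothetical and are not data from the presentation.

```python
def utilization_percent(running_jobs: int, total_cpus: int) -> float:
    """Return running jobs as a percentage of total CPUs (0.0 if there are no CPUs)."""
    if total_cpus <= 0:
        return 0.0
    return 100.0 * running_jobs / total_cpus


# Hypothetical monthly samples as (totalCPU, runJob); invented for illustration.
samples = {"2006-03": (600, 420), "2006-04": (600, 492)}

for month, (total_cpu, run_job) in samples.items():
    print(month, f"{utilization_percent(run_job, total_cpu):.0f}%")
```

Guarding against a zero CPU count keeps the calculation usable for months in which no site published any capacity.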

  13. APROC Usage II • Jobs predominantly from Biomed, CMS and ATLAS VOs • Past year: 41 KSI2K Years • This April: 21 KSI2K Years
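
For readers unfamiliar with the unit, a KSI2K-year is normalized CPU time: CPU time scaled by the SpecInt2000 rating of the processor (in thousands) and expressed in years. The sketch below assumes a 365-day year and a single rating per batch of CPU hours; the function name and the example values are hypothetical, and the slide does not say how its figures were computed.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def ksi2k_years(cpu_hours: float, si2k_rating: float) -> float:
    """Convert raw CPU hours on a CPU with the given SI2K rating to KSI2K-years."""
    ksi2k_hours = cpu_hours * (si2k_rating / 1000.0)
    return ksi2k_hours / HOURS_PER_YEAR


# Hypothetical example: one CPU rated 1000 SI2K, busy for a full year.
print(ksi2k_years(cpu_hours=8760, si2k_rating=1000))
```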

  14. APROC Availability I • Ideal Grid World: May 3, 2006

  15. APROC Availability II • Daily snapshots of SFT results of each site • Availability of 60-70% • Better if weighted by number of CPUs • CT mostly replica management failure • Sensitive to Information System performance • Network issues • Network congestion and packet loss • APROC SmokePing to monitor network performance • But monitoring from CERN is more relevant • Scheduled downtime • Network and power maintenance • Hardware maintenance and upgrade • Middleware upgrade • [Figure: monthly share of SFT results by status (SD, CT, JL, JS, OK), 2005-04 to 2006-04, annotated with "2.4", "2.6", "2.7", "Decommissioned" and "Slow BDII"]
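
The difference between plain and CPU-weighted availability can be illustrated with a short sketch. It assumes one SFT status per site per day and counts only "OK" as available; the site names, CPU counts and snapshot data are hypothetical, and the actual APROC calculation may differ.

```python
def availability(snapshots, cpus=None):
    """Fraction of daily snapshots with status "OK".

    snapshots: iterable of (day, site, status) tuples.
    cpus: optional {site: cpu_count}; if given, each snapshot is weighted by it.
    """
    ok = 0.0
    total = 0.0
    for _day, site, status in snapshots:
        weight = cpus[site] if cpus else 1
        total += weight
        if status == "OK":
            ok += weight
    return ok / total if total else 0.0


# Hypothetical data: a large site that passes and a small site that fails once.
snapshots = [
    ("2006-04-01", "SITE-A", "OK"),
    ("2006-04-01", "SITE-B", "CT"),
    ("2006-04-02", "SITE-A", "OK"),
    ("2006-04-02", "SITE-B", "OK"),
]
cpus = {"SITE-A": 400, "SITE-B": 20}

print(f"unweighted:   {availability(snapshots):.1%}")
print(f"CPU-weighted: {availability(snapshots, cpus):.1%}")
```

In this made-up case a failure at the small site costs a quarter of the unweighted score but only a small share of the weighted one, which is consistent with the slide's remark that the figure looks better when weighted by CPU count.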

  16. Support Issues and Tickets • Remote troubleshooting • Email interaction is slow • Remote testing is limited • Reluctantly ask for access to services • Local diagnostic tools would be helpful • Ticket statistics (total / monthly average) • Open tickets: 10 • Closed tickets: 425 / 39 • Total tickets: 435 / 40

  17. ASGCCA Status • Improvements • Overhaul of certificate registration instructions and application forms • Step-by-step guide for browser certificate management • Addition of FAQ sections to address common tasks • In progress • Certificate import error related to Firefox 1.5 • Design and implement new RA procedures • Revise and update CP/CPS

  18. Agenda • Introduction • ROC Status • Recent Activities

  19. Security Service Challenge 1 • Purpose is to ensure that: • Sufficient information is available for audit trace (for IR) • Appropriate communication channels are available • Security challenge with OSCT • Sending test jobs • Sites recover evidence • DN of job submitter • IP address of submission UI • Executable name • Time when executable ran • Results • Completed March 2006 for a period of one week • Instructions and audit guide sent to participating sites • 4 of 7 APROC sites completed the challenge • Some sites could not participate due to SD or unavailability • Some results were incomplete since sites did not have a Resource Broker (RB) • Sites need to contact the RB admin for more information • Helpful learning exercise to familiarize security contacts with the auditing process for LCG • Improvements • Sharing of audit techniques between ROCs (GOCWiki) • Tools to extract security audit information • Helpful for future SSC to measure security patch response time
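
The four pieces of evidence listed on this slide can be treated as a checklist. The sketch below is only an illustration of that idea: the class, field names and sample values are hypothetical, and it does not reflect any actual SSC tooling.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AuditEvidence:
    """Evidence a site is asked to recover for one test job (hypothetical model)."""
    submitter_dn: Optional[str] = None      # DN of job submitter
    ui_ip_address: Optional[str] = None     # IP address of submission UI
    executable_name: Optional[str] = None   # Executable name
    run_time: Optional[str] = None          # Time when executable ran


def missing_items(evidence: AuditEvidence) -> List[str]:
    """Return the names of evidence fields that are still unset."""
    return [f.name for f in fields(evidence) if getattr(evidence, f.name) is None]


# Hypothetical report from a site that still needs information from the RB admin.
report = AuditEvidence(
    submitter_dn="/C=TW/O=Example/CN=Test User",
    executable_name="ssc_test.sh",
    run_time="2006-03-15T10:00:00Z",
)
print(missing_items(report))
```

A list of missing fields gives a simple way to flag an incomplete response, such as the cases where sites had to contact an RB admin for more information.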

  20. Pre-Production Service • APROC started PPS service in April 2006 • Previously managed by Application team • PPS deployment with glite-3.0 RC2 complete • Mix of LCG and gLite components • LCG-CE • gLite-CE • MON • Combined UI • Integration of production SE and SRM services • FTS still needs to be deployed • Summary • Good way to get experience with gLite middleware • Using YAIM is a very good transition for ROC staff • LCG components are more stable than gLite counterparts • Required significant support from CERN for gLite-CE • Integration with lcg-CE batch system was not trivial • Still troubleshooting • Need significant time to relearn administration and troubleshooting techniques • Administration documentation like that accumulated for LCG in GOCWiki would be helpful

  21. Grid Administrator Tutorial I • Goal and details • Educate and train EGEE Site Administrators • Two-day tutorial with instruction in Chinese • Hosted at Academia Sinica in March 2006 • Topics covered • Grid technology and components • Operations, administration and troubleshooting • Brief overview of Grid applications • Hands-on session to deploy functional sites • 36 Xen servers configured • Simple CA, RB, BDII, VOMS, LFC provided • 5 teams of 6 participants deployed sites (UI, MON, CE, WN, DPM-head, DPM-disk) • Based on Marco La Rosa’s KEK tutorial • http://lists.grid.sinica.edu.tw/apwiki/Grid_Administrator_Tutorial_Hands-on_Instructions

  22. Grid Administrator Tutorial II • Results • 30 participants from 15 institutes • 4.18/5.0 survey evaluation score • Only a couple of teams were able to complete a fully functional site • Not enough time • Set up YAIM configuration from scratch • Time consuming and error prone • More realistic and gives participants a chance to troubleshoot • Feedback • Break up hands-on session to practice after each lecture • Provide a reference cheat sheet • Acronyms • Grid architecture diagrams • Suggest Linux training material as prerequisite • Provide user and developer tutorials • Significant time to set up hands-on session servers for installation • Is this available in GLIDA?

  23. GStat Development • Instances created for • PPS Service • Regional projects • BalticGrid, EELA, EUChinaGrid, etc. • Usage calculations modified • PhysicalCPU • SizeTotal, SizeFree • Results published • To Service Availability Monitoring Environment (SAME) at CERN • Client tool to retrieve historical data • http://goc.grid.sinica.edu.tw/gocwiki/GStat_Client_Tools
