AsiaPacific Regional Operation Center Min-Hong Tsai ASGC ISGC May - - PowerPoint PPT Presentation

slide-1
SLIDE 1

1

AsiaPacific Regional Operation Center

Min-Hong Tsai

ASGC ISGC May 3, Academia Sinica

http://www.twgrid.org/aproc/

slide-2
SLIDE 2

2

Agenda

  • Introduction
  • ROC Status
  • Recent Activities
slide-3
SLIDE 3

3

APROC Introduction

  • APROC Goal
  • Provide deployment support facilitating Grid expansion
  • Maximize the availability of Grid services
  • Supports EGEE sites in Asia Pacific since April 2005
  • EGEE CIC
  • CIC-on-duty rotation: EGEE global operations
  • Monitoring tool development: GStat and GGUS Search
  • VO services
  • EGEE ROC
  • Monitoring, Diagnosis and Problem tracking
  • M/W release deployment support
  • Security Coordination
  • Site Registration

  • Portal and documentation
slide-4
SLIDE 4

4

ASGCCA

  • Production service since July 2003
  • Taiwan
  • LCG/EGEE users in Asia Pacific without local CA
  • Member of both
  • EUGridPMA
  • APGridPMA
  • http://ca.grid.sinica.edu.tw
slide-5
SLIDE 5

5

VO Infrastructure Support

  • APROC hosts centralized services for VOs
  • Host VOMS server
  • VO assigns a manager to maintain membership
  • VO supplies AUP
  • Host LFC global file catalogue service
  • Resource Broker
  • Top-Level BDII
  • Currently supporting
  • TWGrid
  • APeSci
slide-6
SLIDE 6

6

EGEE Site Registration and Certification

  • Registration Procedure:
  • http://www.twgrid.org/aproc/doc/admin_intro/newrc/
  • Guidance for user and host certificate registration
  • Registration into GOCDB
  • Recommend startup documentation
  • Instructions for further registration in
  • Mailing lists
  • VO membership
  • APROC ticketing system
  • Consulting on site architecture and deployment
  • Deployment support and troubleshooting
  • Site certification
  • Manual tests
  • SFT and GStat tests
slide-7
SLIDE 7

7

Middleware and Operations Support

  • Middleware Support
  • Installation support
  • New release testing
  • Supplementary release notes
  • Assist in coordination of updates and upgrades

  • Operations Support
  • Review and track GGUS and APROC tickets
  • Monitor and detect new problems
  • Provide detailed technical support to sites
  • Support Channels
  • Phone
  • Email
  • TRS Ticketing System
slide-8
SLIDE 8

8

APROC Portal

  • www.twgrid.org/aproc
  • Rollout Highlights
  • Supplemental documentation
  • Getting started links
  • Registration information
  • Contact Info and TRS links
  • lists.grid.sinica.edu.tw/apwiki
  • Supplementary release notes
  • Site Operations Procedures
  • Technical Howtos
  • Troubleshooting FAQs
  • APF and GDA meeting minutes
  • Feel free to contribute!
slide-9
SLIDE 9

9

Agenda

  • Introduction
  • ROC Status
  • Recent Activities
slide-10
SLIDE 10

10

Members and Biweekly meeting

  • 11 sites, 7 countries, ~600 CPUs
  • Australia
  • Japan
  • India
  • Korea
  • Pakistan
  • Singapore
  • Taiwan
  • APF Meetings
  • Short biweekly meeting between AP sites
  • Topics
  • Operation: M/W issues, operations news, review site status
  • Service challenge: news and announcements
  • Welcome other topics, such as BELLE or other regional topics
slide-11
SLIDE 11

11

Site Registration

  • Site Registration
  • Recently:
  • JP-KEK-CRC-01
  • In progress
  • Australia-UNIMELB-LCG2
  • JP-KEK-CRC-02
  • TW-THU-HPC
  • PAKGRID3-LCG2
  • Welcomed sites from CERN ROC
  • INDIACMS-TIFR
  • NCP-LCG2
  • PAKGRID-LCG2
slide-12
SLIDE 12

12

APROC Usage I

  • Total computing capacity is increasing
  • But so is utilization (peak over 80%)

[Chart: total CPUs (totalCPU) and running jobs (runJob), 4/1/05 to 5/1/06]

[Chart: % CPU Usage, 4/1/2005 to 5/1/2006]
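
The relation between the two charts on this slide can be sketched as a simple ratio. This is a minimal illustration, assuming that "% Usage" is the number of running jobs (the runJob series) divided by the total CPU count (the totalCPU series); the slide does not state the formula, and the sample values below are hypothetical, not read from the charts.

```python
def cpu_utilization(running_jobs: int, total_cpus: int) -> float:
    """Return the percentage of CPUs occupied by running jobs.

    Assumes one running job occupies one CPU (an assumption, not
    something the slide states).
    """
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    return 100.0 * running_jobs / total_cpus


# Hypothetical samples: (date, runJob, totalCPU).
samples = [("2005-04-01", 120, 300), ("2006-04-01", 480, 600)]

for day, run_job, total_cpu in samples:
    print(day, f"{cpu_utilization(run_job, total_cpu):.0f}%")
```

With these made-up numbers, capacity doubles while utilization also rises, which is the pattern the slide describes.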

slide-13
SLIDE 13

13

APROC Usage II

  • Jobs predominantly from Biomed, CMS and ATLAS VOs
  • Past year: 41 KSI2K Years
  • This April: 21 KSI2K Years
slide-14
SLIDE 14

14

APROC Availability I

  • Ideal Grid World: May 3, 2006
slide-15
SLIDE 15

15

APROC Availability II

  • Daily snapshots of SFT results of each site
  • Availability of 60-70%
  • Better if weighted by the number of CPUs at each site
  • CT: mostly replica management failures
  • Sensitive to Information System performance
  • Network Issues
  • Network congestion and packet loss
  • APROC SmokePing to monitor net performance
  • But monitoring from CERN is more relevant
  • Scheduled Downtime
  • Network and power maintenance
  • Hardware maintenance and upgrade
  • Middleware upgrade
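
The difference between plain and CPU-weighted availability can be sketched as follows. This is a minimal illustration, not the method used by APROC: it assumes one SFT status per site per day, treats only the "OK" status as available, and uses hypothetical site names and CPU counts. The status labels (SD, CT, JL, JS, OK) are the ones shown in the chart legend on this slide.

```python
# Hypothetical daily snapshots: (date, site, SFT status, CPU count).
SNAPSHOTS = [
    ("2006-04-01", "SiteA", "OK", 400),
    ("2006-04-01", "SiteB", "CT", 50),
    ("2006-04-01", "SiteC", "SD", 50),
    ("2006-04-02", "SiteA", "OK", 400),
    ("2006-04-02", "SiteB", "OK", 50),
    ("2006-04-02", "SiteC", "JS", 50),
]


def availability(snapshots, weighted=False):
    """Fraction of snapshots with status OK.

    With weighted=True each snapshot counts in proportion to the
    site's CPU count, so large sites dominate the result.
    """
    ok = 0
    total = 0
    for _date, _site, status, cpus in snapshots:
        weight = cpus if weighted else 1
        total += weight
        if status == "OK":
            ok += weight
    if total == 0:
        raise ValueError("no snapshots given")
    return ok / total


print(f"unweighted:   {availability(SNAPSHOTS):.0%}")
print(f"CPU-weighted: {availability(SNAPSHOTS, weighted=True):.0%}")
```

In this made-up data the large site is always OK, so the weighted figure is higher than the unweighted one, which is the effect the slide alludes to.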

[Chart: share of SFT results by status (SD, CT, JL, JS, OK), 2005-04 to 2006-04]

[Chart: share of SFT results by status (CT, JL, JS, OK), monthly, 2005-04 to 2006-04; annotations: 2.6, 2.7, 2.4, Decommissioned, Slow BDII]

slide-16
SLIDE 16

16

Support Issues and Tickets

  • Remote troubleshooting
  • Email interaction is slow
  • Remote testing is limited
  • Reluctantly ask for access to services
  • Local diagnostic tools would be helpful

Statistic (Total/Monthly Avg)
  • Total tickets: 435/40
  • Closed tickets: 425/39
  • Open tickets: 10
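
As a quick check on the ticket figures, the open count and the closure rate follow directly from the totals on this slide. The variable names are illustrative only.

```python
# Totals taken from the ticket statistics on this slide.
total_tickets = 435
closed_tickets = 425

open_tickets = total_tickets - closed_tickets
closure_rate = 100.0 * closed_tickets / total_tickets

print(f"open tickets: {open_tickets}")
print(f"closure rate: {closure_rate:.1f}%")
```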

slide-17
SLIDE 17

17

ASGCCA Status

  • Improvements
  • Overhaul of certificate registration instructions and application forms
  • Step-by-step guide for browser certificate management
  • Addition of FAQ sections to address common tasks
  • In progress
  • Certificate import error related to Firefox 1.5
  • Design and implement new RA procedures
  • Revise and update CP/CPS
slide-18
SLIDE 18

18

Agenda

  • Introduction
  • ROC Status
  • Recent Activities
slide-19
SLIDE 19

19

Security Service Challenge 1

  • Purpose to ensure that:
  • Sufficient information is available for audit trace (for IR)
  • Appropriate communication channels are available
  • Security Challenge with (OSCT)
  • Sending test jobs
  • Sites recover evidence
  • DN of job submitter
  • IP address of submission UI
  • Executable name
  • Time when executable ran

  • Results
  • Completed March 2006 for a period of one week
  • Instructions and audit guide sent to participating sites
  • 4 of 7 APROC sites completed challenge
  • Some sites could not participate due to SD or unavailability
  • Some results were incomplete since sites did not have Resource Broker (RB)
  • Sites need to contact RB admin for more information
  • Helpful learning exercise to familiarize security contacts with auditing process for LCG
  • Improvements
  • Sharing of audit techniques between ROCs (GOCWiki)
  • Tools to extract security audit information
  • Helpful for future SSC to measure security patch response time
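
The four evidence items a site was asked to recover can be sketched as a small record with a completeness check. This is a hypothetical illustration of the kind of audit tool the slide suggests, not an existing one; the class, field names and sample values are all assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AuditEvidence:
    """Evidence items listed on the slide; None means not recovered."""
    submitter_dn: Optional[str] = None
    ui_ip_address: Optional[str] = None
    executable_name: Optional[str] = None
    execution_time: Optional[str] = None


def missing_items(evidence: AuditEvidence) -> List[str]:
    """Return the names of evidence items that were not recovered."""
    return [f.name for f in fields(evidence)
            if getattr(evidence, f.name) is None]


# Hypothetical report from a site that could not recover one item.
report = AuditEvidence(
    submitter_dn="/C=XX/O=Example/CN=Test User",
    executable_name="test.sh",
    execution_time="2006-03-15T10:00:00Z",
)
print(missing_items(report))
```

A list of missing items per site would make incomplete results, such as those from sites without their own Resource Broker, easy to spot and follow up.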
slide-20
SLIDE 20

20

Pre-Production Service

  • APROC started PPS service in April 2006
  • Previously managed by Application team
  • PPS deployment with glite-3.0 RC2 complete
  • Mix of LCG and gLite components
  • LCG-CE
  • gLite-CE
  • MON
  • combined UI

  • Integration of production SE and SRM services
  • FTS still needs to be deployed
  • Summary
  • Good way to get experience with gLite middleware
  • Using YAIM is a very good transition for ROC staff
  • LCG components are more stable than gLite counterparts
  • Required significant support from CERN for gLite-CE
  • Integration with lcg-CE batch system was not trivial
  • Still troubleshooting
  • Need significant time to relearn administration and troubleshooting techniques
  • Administration documentation like ones accumulated for LCG in GOCWiki would be helpful
slide-21
SLIDE 21

21

Grid Administrator Tutorial I

  • Goal and details
  • Educate and train EGEE Site Administrators
  • Two day tutorial with instruction in Chinese
  • Hosted at Academia Sinica in March 2006
  • Topics covered
  • Grid technology and components
  • Operations, administration and troubleshooting
  • Brief overview of Grid applications
  • Hands-on session to deploy functional sites
  • 36 Xen servers configured
  • Simple CA, RB, BDII, VOMS, LFC provided
  • 5 teams of 6 participants deployed sites (UI, MON, CE, WN, DPM-head, DPM-disk)
  • Based on Marco La Rosa’s KEK tutorial
  • http://lists.grid.sinica.edu.tw/apwiki/Grid_Administrator_Tutorial_Hands-on_Instructions
slide-22
SLIDE 22

22

Grid Administrator Tutorial II

  • Results
  • 30 participants from 15 institutes
  • 4.18/5.0 survey evaluation scores
  • Only a couple of teams were able to complete a fully functional site
  • Not enough time
  • Set up YAIM configuration from scratch
  • Time consuming and error prone
  • More realistic and gives chance for participants to troubleshoot
  • Feedback
  • Break up hands-on session to practice after each lecture
  • Provide a reference cheat sheet
  • Acronyms
  • Grid architecture diagrams
  • Suggest Linux training material as prerequisite
  • Provide user and developer tutorials
  • Significant time needed to set up hands-on session servers for installation
  • Is this available in GLIDA?
slide-23
SLIDE 23

23

GStat Development

  • Instances created for
  • PPS Service
  • Regional projects
  • BalticGrid, EELA, EUChinaGrid, etc.
  • Usage calculations modified
  • PhysicalCPU
  • SizeTotal, SizeFree
  • Results published
  • To Service Availability Monitoring Environment (SAME) at CERN
  • Client tool to retrieve historical data
  • http://goc.grid.sinica.edu.tw/gocwiki/GStat_Client_Tools
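
One way a storage usage figure could be derived from the SizeTotal and SizeFree values named above is sketched below. The slide does not give the formula GStat uses, so this is an assumption for illustration; the function name and sample numbers are hypothetical.

```python
def storage_used_percent(size_total: float, size_free: float) -> float:
    """Percentage of storage in use, given total and free sizes.

    Both arguments must be in the same unit. This formula is an
    assumption; it is not taken from GStat.
    """
    if size_total <= 0:
        raise ValueError("size_total must be positive")
    if not 0 <= size_free <= size_total:
        raise ValueError("size_free must be between 0 and size_total")
    return 100.0 * (size_total - size_free) / size_total


# Hypothetical storage element with 1000 units total and 250 free.
print(f"{storage_used_percent(1000, 250):.0f}% used")
```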
slide-24
SLIDE 24

24

Summary

  • People:
  • Jinny Chien
  • Shu-Ting Liao
  • Jason Shih
  • Howard Su
  • Jeng-Hsueh Wu
  • Joanna Huang
  • Aries Hong
  • Hung-Che Jen
  • Min Tsai
  • APROC provides EGEE operations support services to AsiaPacific
  • There is significant room for improvement in availability
  • Middleware is becoming more reliable
  • Network monitoring is critical
  • Operations procedures to reduce Scheduled Downtime and improve time to recover
  • Diagnostic tools will be helpful for troubleshooting
  • Would there be interest in another Administration Tutorial in late Summer or Autumn?
  • If there is a significant increase in deployment in AsiaPacific, one ROC may not be scalable
  • Federation of APROC is another option
  • Please give us feedback on what we can improve
  • Contact us:
  • roc@lists.grid.sinica.edu.tw
  • http://www.twgrid.org/aproc
  • http://lists.grid.sinica.edu.tw/apwiki