ALHAD G. APTE, HEAD, COMPUTER DIVISION, BHABHA ATOMIC RESEARCH
Figure: Chromium-based rendering setup. A master client (Win32/Xlib, Chromium, OpenGL, graphics hardware) connects over the network to Rendering Servers 1 to 36, each with graphics hardware driving a projector/monitor.
- 4 Mbps links
- Resource sharing and coordinated problem solving across dynamic,
multiple R&D units
- Uses WLCG tools
Figure: grid components.
- UI (User Interface): interface for using the grid
- SE (Storage Element)
- CE (Computing Element)
- WN (Worker Node)
- Resource Broker (workload management) + MyProxy Server (proxy renewal) + Top BDII (information system)
- LFC (File Catalog)
- Certifying Authority (certificates)
- VOMS (Virtual Organization Membership Server)
- Central services, deployed only at BARC
Certification Authority (CA)
Virtual Organization Membership Service (VOMS)
Resource Broker (RB) + MyProxy Server + Top BDII
LCG File Catalogue (LFC)
Monitoring & Accounting Server
- Site services, deployed at all sites
Computing Element (one CE for every cluster)
Worker Nodes
User Interface
Figure: site services layout. Labels: Monitoring & Accounting Server; User Interface; Worker Nodes (PBS client); certificates; FMON server; Storage Element (GridFTP, RFIO and FMON agent); Computing Element (Gatekeeper, Site BDII, GRIS, information providers, PBS, FMON agent).
Site Services
- Computing Element
Gatekeeper Service (accepts job requests from the grid)
PBS (cluster resource management system)
GRIS, Site BDII (information system)
FMON (monitoring agents)
- Worker Nodes
PBS client
- Storage Element
DPM services
GridFTP, RFIO services
FMON server (monitoring server)
- User Interface
DAE-Grid Certification Authority (CA)
- An in-house developed monitoring service gives the complete state of the grid on a
single page (services, file system, PBS MOM alerts, etc.).
- It gives the status of the jobs in the queue.
- Job records are collected by the server from APEL (Accounting
Processor for Event Logs), which runs on every cluster, and graphs are generated from them.
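The accounting step above (collect job records from every cluster, then summarize them for graphs) can be sketched roughly as follows. This is only an illustration: the record layout (`cluster`, `cpu_hours`) and the function name `summarize_jobs` are assumptions made for the example, not the real APEL record format or the server's actual code.

```python
from collections import defaultdict


def summarize_jobs(records):
    """Aggregate per-job records into per-cluster totals.

    Each record is a dict with hypothetical keys "cluster" and "cpu_hours".
    Returns {cluster: {"jobs": count, "cpu_hours": total}}.
    """
    summary = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        entry = summary[rec["cluster"]]
        entry["jobs"] += 1
        entry["cpu_hours"] += rec["cpu_hours"]
    return dict(summary)
```

The per-cluster totals returned here are the kind of data a plotting step could turn into the job graphs mentioned above.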
- An RB service failure has the following impact:
New jobs cannot be submitted.
The status of existing jobs cannot be queried.
Jobs that have finished are not shown as completed until the RB service has been recovered.
Output data from jobs may be lost, since the jobs cannot copy their results to the output sandbox on the RB (a job retries a few times, waiting a random period between attempts, and then gives up).
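The retry behaviour noted in the last impact item (a few attempts separated by random waits, then giving up) can be sketched as below. This is a minimal illustration, not the gLite job wrapper itself: the function name `copy_with_retries`, the attempt count, and the maximum wait are all assumptions chosen for the example.

```python
import random
import time


def copy_with_retries(copy_once, max_attempts=3, max_wait_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call copy_once() up to max_attempts times.

    Between failed attempts, wait a random period in [0, max_wait_s).
    Returns True on the first success, False if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            copy_once()
            return True
        except OSError:
            if attempt == max_attempts:
                break  # give up: the output is lost
            sleep(rng() * max_wait_s)
    return False
```

The `sleep` and `rng` parameters exist only so the waits can be replaced in a test; in normal use the defaults apply. A `False` return corresponds to the case above where the job output is lost.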
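Figure: RB high-availability setup. An RB Master and an RB Standby exchange HA status packets through a switch and share state behind the name broker.barc.daegrid.gov.in. Clusters shown: RRCAT Indore (M.P.), VECC Kolkata (W.B.), IGCAR Kalpakkam, BARC Mumbai. Some links are marked as not being used.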
- The current gLite version has inherent support for using multiple resource brokers
for a single VO.
- The user job is directed to a resource broker that is up and running at
the time of submission.
- Jobs that are already queued need to be shifted.
- The Resource Broker maintains the job state in a MySQL database, and
Condor-G maintains the queue.
- Putting the jobs into the backup RB's Condor-G queue is not possible.
- Instead, the job state and sandbox directories on the main RB are taken and
given to the backup RB.
- The backup RB copies the sandbox directories and separately maintains the state of the
jobs that were initially submitted to the main RB.
- The DNS mapping needs to be changed for the main RB.
- The backup RB (as client) regularly queries the CEs (as servers) to
update the status of these jobs.
- Job management commands such as job status and job cancel need to take
this change into account automatically.
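The idea of directing a job to whichever resource broker is up at submission time can be illustrated with a small sketch. It is not how gLite implements broker selection; `pick_broker` and `tcp_is_up` are hypothetical helper names, and the port to probe is left to the caller because the source does not state one.

```python
import socket


def tcp_is_up(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def pick_broker(brokers, is_up):
    """Return the first broker in the list that is_up() reports as reachable.

    Returns None when no broker is reachable.
    """
    for broker in brokers:
        if is_up(broker):
            return broker
    return None
```

A caller would pass an ordered list such as the main RB name followed by a backup RB name, together with a probe like `lambda host: tcp_is_up(host, port)` for whatever port the site's RB service listens on.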
- When the main RB comes back up:
- Get the state of all the jobs (shifted from this RB) from the backup
RB.
- Update the MySQL state tables and sandbox directories
accordingly.
- Remove the DNS mapping from the main RB.
- This has been tested and is very useful in ensuring the smooth running
of jobs when a resource broker is switched off on schedule,
owing to A/C maintenance or scheduled electrical maintenance.
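The hand-over and hand-back of job state described above can be modelled in a few lines. This is a toy in-memory sketch under stated assumptions: the real state lives in MySQL tables and sandbox directories, and the names `JobRecord`, `Broker`, `shift_jobs`, `update_status` and `return_jobs` are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    job_id: str
    status: str
    sandbox_dir: str


@dataclass
class Broker:
    name: str
    jobs: dict = field(default_factory=dict)     # jobs submitted to this RB
    adopted: dict = field(default_factory=dict)  # jobs shifted from another RB


def shift_jobs(main, backup):
    """Before the main RB goes down: copy job state and sandbox paths to the backup."""
    for job_id, rec in main.jobs.items():
        backup.adopted[job_id] = JobRecord(rec.job_id, rec.status, rec.sandbox_dir)


def update_status(backup, job_id, status):
    """While the main RB is down: the backup tracks status changes separately."""
    backup.adopted[job_id].status = status


def return_jobs(main, backup):
    """After the main RB is back: write the tracked state back and drop the copies."""
    for job_id, rec in backup.adopted.items():
        main.jobs[job_id] = rec
    backup.adopted.clear()
```

Keeping the shifted jobs in a separate `adopted` table mirrors the point above that the backup RB maintains their state separately from its own jobs.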
Issues in automating the above solution (we are currently working on this)
- Maintaining the same global user DN -> local user mapping across RBs is
difficult.
- Rethinking is needed to automate this process completely; some steps, such as
the DNS mapping, are not yet automated.
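One small piece of the mapping problem, detecting where two RBs disagree on the DN -> local user mapping, could be checked along these lines. This is a sketch only: it assumes each RB's mapping has already been loaded into a plain dictionary, and `mapping_conflicts` is a hypothetical helper name.

```python
def mapping_conflicts(main_map, backup_map):
    """Compare two {user DN: local user} mappings.

    Returns {dn: (local_user_on_main, local_user_on_backup)} for every DN
    that is missing on one side or mapped to different local users.
    """
    conflicts = {}
    for dn in sorted(set(main_map) | set(backup_map)):
        on_main = main_map.get(dn)
        on_backup = backup_map.get(dn)
        if on_main != on_backup:
            conflicts[dn] = (on_main, on_backup)
    return conflicts
```

An empty result means the two RBs agree, which is the precondition for shifting jobs between them without changing job ownership.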
Figure: network connectivity. Labels: 4 Mbps links; 0.3/1 Gbps link to CERN/Geant; 100 Mbps link; Tier III; 2 Mbps IPLC; NLC; 34 Mbps link to Geant.
BARC, IOPB and 14 universities have been operational since 2007.
ALICE Tier-2 Centre at Kolkata
Running services
Service        Processor                 RAM
VOBOX          Dual Intel Xeon 2.4 GHz   4 GB
ALIEN LFC CE   Dual Intel Xeon 3.0 GHz   4 GB
SE             Dual Intel Xeon 3.0 GHz   4 GB
13 WN          Dual Intel Xeon 3.0 GHz   4 GB
Bandwidth status
The 100 Mbps link has been running fine since 14/01/2008 and will soon be upgraded to 155 Mbps.
- LEMON Architecture
- GRIDVIEW
- QUATTOR Architecture
- SHIVA
- CC Tracker
GridView Architecture
GridView Screen
LEMON architecture
Configuration Management Infrastructure; Node (Cluster) Management
Figure: GridView architecture. Labels:
- Data sources: SAM tests and SAM test results (SAM framework, publishing web service); RB job logs and GridFTP logs from RBs and SEs (gridftp), sent through a WS client; fabric monitoring system at site (LEMON / Nagios), availability metrics over HTTP/XML; GOCDB
- Service nodes: R-GMA archiver module, web service archiver module, SAM XSQL export module, GOCDB sync module
- Databases: SAM DB, GRIDVIEW DB
- Data analysis & summarization module
- Visualization module
- Graphs & reports