ALHAD .G. APTE HEAD, COMPUTER DIVISION, BHABHA ATOMIC RESEARCH - - PowerPoint PPT Presentation



slide-1
SLIDE 1
  • ALHAD G. APTE

HEAD, COMPUTER DIVISION, BHABHA ATOMIC RESEARCH CENTRE, MUMBAI, INDIA

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5


slide-6
SLIDE 6


slide-7
SLIDE 7

[Diagram: a Master Client (Win32/Xlib, Chromium, OpenGL, graphics hardware) connected over the network to Rendering Servers 1 to 36, each with its own graphics hardware and projector/monitor]

slide-8
SLIDE 8


slide-9
SLIDE 9
slide-10
SLIDE 10


slide-11
SLIDE 11
  • 4 Mbps links
  • Resource sharing and coordinated problem solving in dynamic, multiple R&D units
  • Uses WLCG tools

slide-12
SLIDE 12

[Diagram: grid components]

  • UI (User Interface): interface for using the grid
  • SE: Storage Element
  • CE: Computing Element
  • WN: Worker Node (six shown)
  • Resource Broker (workload management) + MyProxy Server (proxy renewal) + Top BDII (information system)
  • LFC: File Catalog
  • Certifying Authority: certificates
  • VOMS: Virtual Organization Membership Server

  • Central services, deployed only at BARC:
      • Certification Authority (CA)
      • Virtual Organization Membership Service (VOMS)
      • Resource Broker (RB) + MyProxy Server + Top BDII
      • LCG File Catalogue (LFC)
      • Monitoring & Accounting Server
  • All sites deploy the site services, namely:
      • Computing Element (one CE for every cluster)
      • Worker Nodes
      • User Interface

Monitoring & Accounting Server

slide-13
SLIDE 13

[Diagram: Site Services. Labels: User Interface (two shown), Worker Nodes, PBS client, Certificates, FMON Server, GridFTP, RFIO and FMON agent, Storage Element, Gatekeeper, Site BDII, GRIS, Information Providers, PBS, FMON agent, Computing Element]

  • Computing Element
      • Gatekeeper Service (accepts job requests from the Grid)
      • PBS (cluster resource management system)
      • GRIS, Site BDII (information system)
      • FMON (monitoring agents)
  • Worker Nodes
      • PBS Client
  • Storage Element
      • DPM Services
      • GridFTP, RFIO Services
      • FMON Server (monitoring server)
  • User Interface

slide-14
SLIDE 14

slide-15
SLIDE 15


DAE-Grid Certification Authority (CA)

slide-16
SLIDE 16
  • An in-house developed monitoring service that gives the complete state of the grid on a single page (services, file system, PBS MOM alerts, etc.)
  • Gives the status of the jobs in the queue.
  • Job records are collected, and graphs generated, by the server from APEL (Accounting Processor for Event Logs), which runs on every cluster.
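
A minimal sketch of how per-cluster job records could be summarised before graphs are drawn. The record fields (cluster, user, cpu_seconds), the cluster names and the function name are assumptions made for illustration; they are not APEL's actual schema or the in-house service's code.

```python
from collections import defaultdict

# Hypothetical job records; field names are illustrative, not APEL's schema.
job_records = [
    {"cluster": "cluster-a", "user": "u1", "cpu_seconds": 3600},
    {"cluster": "cluster-a", "user": "u2", "cpu_seconds": 1800},
    {"cluster": "cluster-b", "user": "u1", "cpu_seconds": 7200},
]


def summarize_by_cluster(records):
    """Return {cluster: (job_count, total_cpu_hours)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for rec in records:
        entry = totals[rec["cluster"]]
        entry[0] += 1                             # one more job on this cluster
        entry[1] += rec["cpu_seconds"] / 3600.0   # accumulate CPU hours
    return {cluster: (n, hours) for cluster, (n, hours) in totals.items()}


summary = summarize_by_cluster(job_records)
for cluster, (jobs, hours) in sorted(summary.items()):
    print(f"{cluster}: {jobs} jobs, {hours:.1f} CPU hours")
```

The resulting dictionary is the kind of table a plotting step could consume, one bar or point per cluster.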


slide-17
SLIDE 17
  • RB service failure has the following impact:
      • New jobs cannot be submitted.
      • The status of existing jobs cannot be queried.
      • Jobs that have finished will not be shown as completed until the RB service has been recovered.
      • Output data from jobs may be lost, since the jobs cannot copy their results to the output sandbox on the RB (a job retries a few times, waiting a random period between attempts, and then gives up).

[Diagram: RB Master and RB Standby connected through a switch; labels: HA status packets, state, broker.barc.daegrid.gov.in]
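
The retry behaviour noted for output data (a few attempts with random waits, then giving up) can be sketched as follows. This is an illustration under stated assumptions, not the middleware's code: the function names, the attempt limit and the maximum wait are all hypothetical, and the sleep function is injectable so the demo does not actually wait.

```python
import random
import time


def copy_with_retries(copy_fn, max_attempts=3, max_wait_s=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call copy_fn() until it succeeds or the attempts run out.

    Returns True on success, False if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            copy_fn()
            return True
        except OSError:
            if attempt == max_attempts:
                break                      # give up: output may be lost
            sleep(rng() * max_wait_s)      # wait a random period, then retry
    return False


# Demo: the copy fails twice (RB down), then succeeds.
attempts = []

def flaky_copy():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("RB unreachable")

waits = []                                 # record waits instead of sleeping
ok = copy_with_retries(flaky_copy, max_attempts=5, sleep=waits.append)

# Demo: the RB never comes back within the attempt limit.
def always_fail():
    raise OSError("RB unreachable")

waits_failed = []
gave_up = copy_with_retries(always_fail, max_attempts=3, sleep=waits_failed.append)
```

With a bounded number of attempts, an RB outage longer than the total wait leads to lost output, which is the motivation for a standby RB.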

slide-18
SLIDE 18

[Diagram: sites RRCAT Indore (M.P.), VECC Kolkata (W.B.), IGCAR Kalpakkam and BARC Mumbai; labels: Cluster, RB, links not being used]

  • The current gLite version has inherent support for using multiple resource brokers for a single VO.
  • The user job will be directed to a resource broker that is up and running at the time of submission.
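
A minimal sketch of the selection idea, assuming a client holds an ordered list of resource brokers and uses the first one that responds. This is not gLite's implementation: the helper names, the TCP probe and the second host name are assumptions for illustration, and the demo uses a canned status table rather than real network probes.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_resource_broker(brokers, is_up):
    """Return the first broker for which is_up(broker) is true, else None."""
    for broker in brokers:
        if is_up(broker):
            return broker
    return None


# Demo with a canned status table instead of real probes.
status = {
    "broker.barc.daegrid.gov.in": False,   # name from the slides; marked down for the demo
    "backup-rb.example.org": True,         # hypothetical second broker
}
chosen = pick_resource_broker(list(status), lambda host: status[host])
print(chosen)
```

In a live setting the lambda would be replaced by something like `lambda host: tcp_probe(host, port)`, with the port taken from the site configuration.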

slide-19
SLIDE 19
  • Currently queued jobs need to be shifted.
  • The Resource Broker maintains the job state in a MySQL database, and Condor-G maintains the queue.
  • Putting the jobs into the backup RB's Condor-G queue is not possible.
  • Instead, take the state of the jobs and the sandbox directories from the main RB and give them to the backup RB.
  • The backup RB copies the sandbox directories and separately maintains the state of the jobs that were initially submitted to the main RB.
  • The DNS mapping needs to be changed for the main RB.
  • A client (backup RB) / server (CEs) method is used by the backup RB to regularly update the status of these jobs.
  • Job management commands such as job status and job cancel need to take this into account automatically.
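
The hand-off above can be sketched as a small data structure: the backup RB keeps the shifted jobs in a table separate from its own, refreshes them by asking the CEs, and answers status queries from either table. All class, method and state names here are hypothetical; the real state lives in MySQL tables and sandbox directories, which this sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class BackupBroker:
    own_jobs: dict = field(default_factory=dict)      # job_id -> state
    shifted_jobs: dict = field(default_factory=dict)  # jobs taken over from the main RB

    def take_over(self, main_rb_state):
        """Import the job-state snapshot taken from the main RB."""
        self.shifted_jobs.update(main_rb_state)

    def refresh(self, query_ce):
        """Ask the CE for the current state of every shifted job."""
        for job_id in self.shifted_jobs:
            self.shifted_jobs[job_id] = query_ce(job_id)

    def status(self, job_id):
        """Answer a status query without the user knowing which RB owned the job."""
        if job_id in self.own_jobs:
            return self.own_jobs[job_id]
        return self.shifted_jobs.get(job_id, "UNKNOWN")


backup = BackupBroker(own_jobs={"job-3": "RUNNING"})
backup.take_over({"job-1": "SCHEDULED", "job-2": "RUNNING"})

# Hypothetical view of the same jobs as reported by the CEs.
ce_view = {"job-1": "RUNNING", "job-2": "DONE"}
backup.refresh(ce_view.get)
```

Keeping the shifted jobs in their own table makes it straightforward to hand exactly those jobs back when the main RB returns.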

slide-20
SLIDE 20
  • Main RB comes up.
  • Get the state of all the jobs (shifted from this RB) from the backup RB.
  • Update the MySQL state tables and sandbox directories accordingly.
  • Remove the DNS mapping from the main RB.
  • This has been tested and is very useful in ensuring the smooth running of jobs in the event of a scheduled switching off of a Resource Broker owing to A/C maintenance or scheduled electrical maintenance.

Issues in automating the above solution (we are currently working on this)

  • Maintaining the same Global User DN -> Local User mapping across RBs is difficult.
  • Re-thinking is needed to completely automate this process (parts of it, such as the DNS mapping, are not automated yet).
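
One way to make the DN-to-local-user issue visible is to compare the two mappings and list every DN that differs or is missing on one side. The sketch below assumes each RB's mapping is available as a plain dictionary; the DNs, account names and function name are hypothetical.

```python
def mapping_conflicts(main_map, backup_map):
    """Return {dn: (main_user, backup_user)} for DNs mapped differently.

    A missing entry on either side is reported as None.
    """
    conflicts = {}
    for dn in sorted(set(main_map) | set(backup_map)):
        main_user = main_map.get(dn)
        backup_user = backup_map.get(dn)
        if main_user != backup_user:
            conflicts[dn] = (main_user, backup_user)
    return conflicts


# Hypothetical mappings on the two resource brokers.
main_rb = {
    "/O=Example/CN=Alice": "grid001",
    "/O=Example/CN=Bob": "grid002",
}
backup_rb = {
    "/O=Example/CN=Alice": "grid001",
    "/O=Example/CN=Bob": "grid007",    # same user, different local account
    "/O=Example/CN=Carol": "grid003",  # known only to the backup RB
}

conflicts = mapping_conflicts(main_rb, backup_rb)
for dn, (a, b) in conflicts.items():
    print(f"{dn}: main={a} backup={b}")
```

An empty result would indicate that a job shifted between the two RBs runs under the same local account on both.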

slide-21
SLIDE 21


slide-22
SLIDE 22

  • 4 Mbps links
  • 0.3/1 Gbps link to CERN/GEANT
  • 100 Mbps link
  • Tier III: 2 Mbps IPLC NLC
  • BARC, IOPB and 14 universities have been operational since 2007
  • 34 Mbps link to GEANT

slide-23
SLIDE 23


slide-24
SLIDE 24

Running Services

Service        Processor                 RAM
VOBOX          Dual Intel Xeon 2.4 GHz   4 GB
ALIEN LFC CE   Dual Intel Xeon 3.0 GHz   4 GB
SE             Dual Intel Xeon 3.0 GHz   4 GB
13 WN          Dual Intel Xeon 3.0 GHz   4 GB

Bandwidth Status

The 100 Mbps link has been running fine since 14/01/2008 and is to be upgraded soon to 155 Mbps.

ALICE TIER-2 Centre at KOLKATA

slide-25
SLIDE 25


LEMON Architecture GRIDVIEW QUATTOR Architecture SHIVA CC Tracker

slide-26
SLIDE 26

GridView Architecture


slide-27
SLIDE 27

GridView Screen


slide-28
SLIDE 28

slide-29
SLIDE 29

slide-30
SLIDE 30

slide-31
SLIDE 31


slide-32
SLIDE 32


slide-33
SLIDE 33


slide-34
SLIDE 34

LEMON architecture


slide-35
SLIDE 35

  • Configuration Management Infrastructure
  • Node (Cluster) Management

slide-36
SLIDE 36

[Diagram labels: SAM DB; GRIDVIEW DB; Service Nodes; SAM tests; SAM Test Results; SAM Framework; Publishing Web Service; R-GMA Archiver Module; Web Service Archiver Module; SAM XSQL Export Module; RBs; SEs (gridftp); WS Client; RB Job Logs; Gridftp Logs; Fabric Monitoring System at Site (LEMON / Nagios); HTTP/XML; Availability Metrics; GOCDB; GOCDB Sync Module; Data Analysis & Summarization Module; Visualization Module; Graphs & Reports]
