SLIDE 1

The LHC Controls Infrastructure
Pierre Charrue – BE/CO
LHC Risk Review, 6 March 2009

SLIDE 2

• Preamble
• The LHC Controls Infrastructure
• External Dependencies
• Redundancies
• Control Room
• Power Loss
• Conclusion

SLIDE 3

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 4

• The Controls Infrastructure is designed to control the beams in the accelerators
• It is not designed to protect the machine nor to ensure personnel safety
  – See the Machine Protection or Access Infrastructures

SLIDE 5

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 6
• The 3-tier architecture
  – Hardware infrastructure
  – Software layers
• Resource tier
  – VME crates, PC gateways and PLCs dealing with high-performance acquisitions and real-time processing
  – Database where all the settings and configuration of all LHC devices exist
• Server tier
  – Application servers
  – Data servers
  – File servers
  – Central Timing
• Client tier
  – Interactive consoles
  – Fixed displays
  – GUI applications
• Communication to the equipment goes through the Controls Middleware (CMW)

[Diagram: the three tiers, with the Applications Layer on the client tier, the Business Layer on the server tier, and CMW linking them to the resource tier (DB, hardware, CTRL front ends)]
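The request path implied by this architecture can be sketched in a few lines: a client-tier application asks an application server for a device property, and the server reaches the front end through the middleware. This is a minimal illustration only; the class names and the device name are made up and do not reflect the real CMW API.

```python
# Minimal sketch of the 3-tier data flow: a client-tier application asks an
# application server for a device property; the server forwards the request
# to the front end (resource tier) through a middleware layer.
# All class names and the device name are hypothetical, not the real CMW API.

class FrontEnd:
    """Resource tier: a VME/PLC front end owning device properties."""
    def __init__(self):
        self._properties = {"SOME.DEVICE/Acquisition": {"value": 0.02}}

    def read(self, device_property):
        return self._properties[device_property]


class Middleware:
    """Stand-in for the controls middleware (CMW) transporting requests."""
    def __init__(self, front_end):
        self._front_end = front_end

    def get(self, device_property):
        # In reality this is a remote call over the Technical Network.
        return self._front_end.read(device_property)


class ApplicationServer:
    """Server tier: business logic between GUI clients and front ends."""
    def __init__(self, middleware):
        self._mw = middleware

    def get_property(self, device_property):
        return self._mw.get(device_property)


if __name__ == "__main__":
    server = ApplicationServer(Middleware(FrontEnd()))
    # Client tier: a console application requesting an acquisition.
    print(server.get_property("SOME.DEVICE/Acquisition"))
```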

SLIDE 7

• Since January 2006, accelerator operation is done from the CERN Control Centre (CCC) on the Prévessin site
• The CCC hosts around 100 consoles and around 300 screens
• The CCR is the rack room next to the CCC; it hosts more than 400 servers

SLIDE 8

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 9

• HARDWARE: Electricity · Cooling and Ventilation · Network (Technical Network / General Purpose Network)
• SOFTWARE: Oracle (servers in IT) · IT Authentication

SLIDE 10

• All Linux servers are HP ProLiants with dual power supplies
• They are cabled to two separate 230 V UPS sources
• High power consumption will drain the UPS batteries rapidly
  – 1 hour maximum autonomy
• Each ProLiant consumes an average of 250 W
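The 1 hour ceiling is easy to sanity-check with the figures on this and the previous slides. The usable UPS energy used below is an illustrative assumption, not a quoted specification.

```python
# Back-of-envelope estimate of UPS autonomy for the CCR back-end servers.
# The per-server draw (250 W) and the 1 hour ceiling come from the slide;
# the usable UPS energy figure below is a purely illustrative assumption.

SERVERS = 400                 # the CCR hosts "more than 400 servers"
WATTS_PER_SERVER = 250        # average draw per HP ProLiant (this slide)
USABLE_UPS_ENERGY_KWH = 100   # assumed usable battery energy, for illustration

load_kw = SERVERS * WATTS_PER_SERVER / 1000.0      # ~100 kW total load
autonomy_hours = USABLE_UPS_ENERGY_KWH / load_kw   # hours of battery time

print(f"Total load: {load_kw:.0f} kW")
print(f"Estimated autonomy: {min(autonomy_hours, 1.0) * 60:.0f} minutes "
      f"(capped at the 1 hour maximum quoted on the slide)")
```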

SLIDE 11

• Feb 2009: upgrade of the air flow and cooling circuits in the CCR
  – The CCR's vulnerability to cooling problems has been resolved
• In the event of a loss of refrigeration, the CCR will overheat very quickly
  – Monitoring with temperature sensors and alarms is in place to ensure rapid intervention by TI operators
• The CCR cooling state is monitored by the Technical Infrastructure Monitoring (TIM), with views which can show trends over the last 2 weeks
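The alarm side of that monitoring amounts to comparing sensor readings against thresholds and notifying the TI operators. The sketch below illustrates that idea; the sensor names and threshold values are assumptions, and the real TIM system is considerably richer.

```python
# Minimal sketch of a temperature-threshold check of the kind TIM performs
# for the CCR.  Sensor names and thresholds are illustrative assumptions.

WARNING_C = 27.0   # assumed warning threshold
ALARM_C = 32.0     # assumed alarm threshold

def classify(readings):
    """Return (sensor, level) pairs for readings above a threshold."""
    events = []
    for sensor, temp_c in readings.items():
        if temp_c >= ALARM_C:
            events.append((sensor, "ALARM"))
        elif temp_c >= WARNING_C:
            events.append((sensor, "WARNING"))
    return events

if __name__ == "__main__":
    latest = {"CCR-rack-A12": 24.5, "CCR-rack-B03": 29.1, "CCR-inlet": 33.0}
    for sensor, level in classify(latest):
        print(f"{level}: {sensor} above threshold -> notify TI operators")
```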

SLIDE 12

• Very reliable network topology
• Redundant network routes
• Redundant power supplies in routers and switches

SLIDE 13

• The LHC Controls infrastructure is highly DATA centric
  – All accelerator parameters & settings are stored in a DB located in B513
  – Databases include Controls Configuration, LSA Settings, HWC Measurements, Logging, E-Logbook and CESAR
  – [Diagram: database cluster nodes (2 x quad-core 2.8 GHz CPU, 8 GB RAM), 11.4 TB usable, clustered NAS shelves with 14 x 146 GB FC disks and 14 x 300 GB SATA disks; additional server for testing; standby database for LSA]
• Service Availability
  – The new infrastructure has high redundancy for high availability
  – Each service is deployed on a dedicated Oracle Real Application Cluster
  – The use of a standby database will be investigated, with the objective of reaching 100% uptime for LSA
  – The Logging infrastructure can sustain a 24 h unavailability of the DB: data is kept in local buffers
  – 'Golden' level support with intervention within 24 h
  – Secure database accounts, granting specific privileges to dedicated DB accounts
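The local-buffer point is the key to surviving a day-long database outage: records accumulate on the source side and are flushed once the DB is reachable again. A minimal sketch of that pattern follows; the writer interface and buffer sizing are assumptions, not the actual logging implementation.

```python
# Minimal sketch of the "keep data in local buffers" pattern used by the
# Logging service to survive a database outage.  The writer interface and
# buffer sizing are illustrative assumptions, not the real logging API.

from collections import deque

class BufferedLogger:
    def __init__(self, db_writer, max_buffered=1_000_000):
        self._db_writer = db_writer          # callable that may raise on outage
        self._buffer = deque(maxlen=max_buffered)

    def log(self, record):
        self._buffer.append(record)
        self._flush()

    def _flush(self):
        # Drain the local buffer; on failure keep everything for a later retry.
        while self._buffer:
            record = self._buffer[0]
            try:
                self._db_writer(record)
            except ConnectionError:
                return          # DB unavailable: data stays buffered locally
            self._buffer.popleft()
```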

SLIDE 14

• IT Authentication is needed online for Role Based Access Control (RBAC) and various web pages used by operators
• Not used for operational logins on Linux
• Windows caches recently used passwords

SLIDE 15

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 16

• Remote Reboot and Terminal Server functionality built in
• Excellent power supply and fan redundancy, partial CPU redundancy, ECC memory
• Excellent disk redundancy
  – Automatic warnings in case of a disk failure
• Several backup methods:
  – ADSM towards IT backup
  – Daily or weekly rsync towards a storage place in Meyrin
    ▪ Data will be recovered in case of a catastrophic failure in the CCR
• We are able to restore a BackEnd with destroyed disks in a few hours
• A 2nd PostMortem server is installed on the Meyrin site
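The rsync leg of the backup is a plain mirror of the back-end data towards a machine in Meyrin. The sketch below shows the general shape of such a job; the host name and paths are hypothetical placeholders, not the actual configuration.

```python
# Sketch of a daily rsync copy of a back-end server towards storage in
# Meyrin.  The host name and paths are hypothetical placeholders.

import subprocess

SOURCE = "/data/"                             # local back-end data
DESTINATION = "backup-meyrin:/backup/ccr/"    # hypothetical remote target

def nightly_backup():
    # -a preserve attributes, -z compress in transit, --delete mirror removals
    result = subprocess.run(
        ["rsync", "-az", "--delete", SOURCE, DESTINATION],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"rsync failed: {result.stderr}")

if __name__ == "__main__":
    nightly_backup()
```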

SLIDE 17

• VMEs
  – Can survive limited fan failure
  – Some VME systems have redundant power supplies
  – Otherwise no additional redundancy
  – Remote reboot and terminal server are vital
• PLCs
  – Generally very reliable
  – Rarely have remote reboot, because of the previous point
    ▪ some LHC alcove PLCs have a remote reboot

SLIDE 18

• LHC central timing: Master, Slave and Gateway using reflective memory, and a hot-standby switch
• Timing is distributed over a dedicated network to timing receivers (CTRx) in the front ends
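The hot-standby arrangement boils down to the standby taking over when the master falls silent. The sketch below illustrates that switching logic under an assumed heartbeat timeout; it is not the actual central timing implementation.

```python
# Minimal sketch of a hot-standby switch of the kind used for the LHC
# central timing: the standby takes over when the master stops publishing
# heartbeats.  The timeout value and interface are illustrative assumptions.

import time

HEARTBEAT_TIMEOUT_S = 3.0   # assumed: standby takes over after this silence

class HotStandby:
    def __init__(self):
        self.active = "master"
        self._last_heartbeat = time.monotonic()

    def heartbeat_from_master(self):
        self._last_heartbeat = time.monotonic()

    def check(self):
        silent_for = time.monotonic() - self._last_heartbeat
        if self.active == "master" and silent_for > HEARTBEAT_TIMEOUT_S:
            self.active = "slave"   # switch timing generation to the standby
        return self.active
```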

SLIDE 19

• Isolation of the Technical Network from external access
  – CNIC initiative to separate the General Purpose Network from the Technical Network
• NO dependence on resources from the GPN for operating the machines
• Very few hosts from the GPN are allowed to access the TN
• Regular Technical Network security scans
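Conceptually, the GPN-to-TN restriction is an allow-list: only explicitly trusted GPN hosts may reach the Technical Network. The sketch below expresses that rule; the host names are hypothetical, and the real CNIC enforcement lives in the network infrastructure rather than in application code.

```python
# Sketch of the "very few hosts from the GPN allowed to access the TN" rule
# as a simple allow-list check.  Host names are hypothetical; the real CNIC
# enforcement is done in the network infrastructure, not in application code.

TRUSTED_GPN_HOSTS = {"cs-ccr-gateway1", "cs-ccr-gateway2"}   # hypothetical

def may_access_technical_network(hostname, network):
    if network == "TN":
        return True                       # TN hosts talk to each other freely
    return hostname in TRUSTED_GPN_HOSTS  # GPN hosts need an explicit entry

print(may_access_technical_network("office-pc-42", "GPN"))   # False
```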

SLIDE 20

• High level tools to diagnose and monitor the controls infrastructure (DIAMON and LASER)
  – Easy-to-use first line diagnostics and tools to solve problems, or to help decide about responsibilities for first line intervention
  – [Screenshot: DIAMON group view with navigation tree, monitoring tests, details and repair tools]
• Protecting the device access: RBAC initiative
  – Device access is authorized upon RULES applied to ROLES given to specific USERS
• Protecting the Machine Critical Settings (e.g. BLM thresholds)
  ▪ Can only be changed by an authorized person
  ▪ Uses RBAC for Authentication & Authorization
  ▪ Signs the data with a unique signature to ensure critical parameters have not been tampered with since the last update
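Both ideas on this slide, rules applied to roles and signed critical settings, fit in a few lines. The role names, rule table and HMAC-based signature below are illustrative assumptions, not the actual RBAC or MCS implementation.

```python
# Minimal sketch of the RBAC idea (RULES applied to ROLES given to USERS)
# and of signing a critical setting so tampering can be detected.
# Role names, the rule table and the key handling are illustrative
# assumptions, not the real RBAC/MCS implementation.

import hashlib
import hmac

RULES = {  # device/property -> roles allowed to write it (assumed example)
    "BLM/Threshold": {"MCS-expert"},
    "RF/Voltage": {"LHC-operator", "RF-expert"},
}
USER_ROLES = {"jsmith": {"LHC-operator"}}        # assumed example

def may_set(user, device_property):
    return bool(USER_ROLES.get(user, set()) & RULES.get(device_property, set()))

def sign(value, key):
    # The MCS attaches a signature so a later change of the value is detectable.
    return hmac.new(key, repr(value).encode(), hashlib.sha256).hexdigest()

key = b"placeholder-key-held-by-the-MCS-service"
print(may_set("jsmith", "BLM/Threshold"))        # False: not an MCS expert
print(sign(1.25e-3, key)[:16], "...")            # truncated signature
```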

SLIDE 21

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 22

• Power loss in any LHC site:
  – No access to equipment from this site
    ▪ Machine Protection or OP will take action
• Power loss in the CCC/CCR:
  – The CCC can sustain 1 hour on UPS
  – CCR cooling will be a problem
  – Some CCR servers will still be up if the 2nd power source is not affected
  – 10 minutes on UPS for EOD1
  – 1 hour on UPS for EOD2 and EOD9

SLIDE 24
• The LHC machine itself is protected via a complete Machine Protection System, mainly based on hardware:
  – Beam Interlock System
  – Safe Machine Parameters system
  – Fast Magnet Current Change Monitors
  – Powering Interlock System
  – Warm Magnet Interlock System
  – Software Interlock System
• All devices in the PostMortem chain are protected for at least 15 minutes
  – In addition, the source FrontEnds can hold the data locally in case the network is the cause
  – The CCR servers for PostMortem are on UPS for one hour
  – A 2nd PostMortem mirror server is located on the Meyrin site
  – The archives are stored on RAID servers, with 2 levels of backups: one on a backup server maintained by BE/CO on the Meyrin site, and one on an ADSM backup server maintained by IT in building 513

SLIDE 25

Outline: Preamble · The LHC Controls Infrastructure · External Dependencies · Redundancies · Control Room · Power Loss · Conclusion

SLIDE 26

• High dependence on electricity distribution, network, cooling and ventilation, and databases
• Emphasis in the Controls Infrastructure on:
  – Redundancy
  – Remote monitoring and diagnostics
  – Remote reset
  – Quick recovery in case of a major problem
• The controls infrastructure can sustain a power loss for between 10 and 60 minutes
• Special care is taken to secure PostMortem data (collection and archives)
