NP04 Powercut Recovery and Cold Starts


  1. NP04 Powercut Recovery and Cold Starts
     Pengfei Dine, Bonnie King, Geoff Savage
     DUNE DAQ meeting, 29 July 2019

  2. Power cut #1, July 23
     • A power cut with a subsequent cooling failure brought down all servers except np04-srv-001 and np04-srv-004 (on UPS)

  3. Power cut #1, July 23: recovery
     • Manually pressed the power button on servers without IPMI configured/cabled (and on the IPMI head nodes)
     • Issued power-on commands via IPMI where possible, in no particular order
       ● cronjob to delay reboot was in place and mounts came back correctly
       ● had to restart supervisord where it came up before the NFS mount
       ● had to mount the CIFS mount manually
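The supervisord problem on this slide (services starting before the NFS mount was available) could be avoided by polling for the mounts before starting anything that depends on them. The following is a minimal sketch, not the procedure used on NP04: the function names and the example mount points are hypothetical, and it assumes a Linux-style `/proc/mounts` listing where the second field of each line is the mount point.

```python
def missing_mounts(required, mounts_text):
    """Return the required mount points that are absent from a /proc/mounts listing."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])  # second field is the mount point
    return [path for path in required if path not in mounted]


def wait_for_mounts(required, read_mounts, sleep, attempts=30, interval=10):
    """Poll up to `attempts` times; return the mount points still missing at the end."""
    missing = missing_mounts(required, read_mounts())
    for _ in range(attempts - 1):
        if not missing:
            break
        sleep(interval)  # give the NFS/CIFS mounts time to appear
        missing = missing_mounts(required, read_mounts())
    return missing
```

A wrapper script could call `wait_for_mounts` with a function that reads `/proc/mounts` and with `time.sleep`, and only start supervisord when the returned list is empty. Passing the reader and the sleep function in as arguments keeps the logic testable without touching a real system.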

  4. Power cut #2
     • Had to power down servers again due to a water pressure drop
     • This time, non-critical servers were gracefully shut down and the IPMI head nodes were kept up
     • Most servers came back, except srv-0[03, 04, 10, 11, 12, 21, 22, 24], due to a known issue (keystroke required to boot)
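When a known set of servers fails to come back, it can help to expand the bracket shorthand used on this slide into full host names and prepare one IPMI query per host. The sketch below only builds command lists and does not run them. It assumes the `np04-` prefix applies to every host, that each BMC is reachable under a host name you supply (the BMC naming is not given in the slides), and that `ipmitool` is used with the `lanplus` interface and the `-E` option, which reads the password from the `IPMI_PASSWORD` environment variable.

```python
import re


def expand_hosts(pattern, prefix="np04-"):
    """Expand shorthand such as 'srv-0[03, 04]' into full host names."""
    match = re.fullmatch(r"(.*)\[(.*)\](.*)", pattern)
    if match is None:
        return [prefix + pattern]
    head, body, tail = match.groups()
    suffixes = [s.strip() for s in body.split(",") if s.strip()]
    return [f"{prefix}{head}{s}{tail}" for s in suffixes]


def power_command(bmc_host, user, action="status"):
    """Build (but do not run) an ipmitool chassis power command for one BMC."""
    if action not in ("status", "on", "off", "cycle"):
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", action]
```

For example, `expand_hosts("srv-0[03, 04, 10, 11, 12, 21, 22, 24]")` yields eight names from `np04-srv-003` to `np04-srv-024`; each resulting command list could then be passed to `subprocess.run` once the BMC addresses are known.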

  5. Power cut #2 recovery
     • Some RAID volumes (not on UPS) got upset after the power cut
       ● recovered mounts on np04-srv-003 and np04-srv-004 by manually assembling volumes
       ● raw data was written to some mount areas while the mount was missing, filling up /
       ● moved data to the correct area
     Need to redirect boot loader and kernel init to serial console for IPMI access (plan to do this later today)
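The failure on this slide, where raw data landed in an empty mount directory and filled up `/`, is the kind of problem a writer can guard against by refusing to write unless the target is a real mount point. This is a minimal sketch using the standard-library `os.path.ismount`; the exception class and function name are hypothetical, and whether the NP04 data writers can be wrapped this way is an assumption.

```python
import os


class MountMissingError(RuntimeError):
    """Raised when a data directory is not backed by a mounted filesystem."""


def require_mount(path):
    """Return `path` if it is a mount point, otherwise raise MountMissingError."""
    if not os.path.ismount(path):
        raise MountMissingError(
            f"{path} is not a mount point; refusing to write to the parent filesystem"
        )
    return path
```

Calling `require_mount` on the data directory before opening output files turns a silently filling root filesystem into an immediate, visible error.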

  6. Planned improvements
     • Get serial console redirection to Serial over LAN working during boot (so keystrokes can be sent remotely)
     • Configure supervisord with the correct startup dependency
     • Audit Ansible playbooks: remove outdated playbooks, create a playbook to provision a new node from scratch
     • Alerting in Prometheus
       ● disk usage, missing mounts, etc.
     • Ansible roles for work done in the test period
     • np04-onl-XXX cabled for IPMI?
     • np04-srv-007 IPMI cable
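For the Prometheus alerting item, one possible starting point is a small script that emits disk usage in the Prometheus text exposition format, which the node_exporter textfile collector can pick up from `*.prom` files. This is a sketch under assumptions: the metric name `np04_disk_used_ratio` and the label `mountpoint` are made up for illustration, and the slides do not say which exporters are deployed.

```python
import shutil


def disk_usage_metrics(paths, usage=shutil.disk_usage):
    """Render used/total ratios for the given paths as Prometheus text-format lines."""
    lines = ["# TYPE np04_disk_used_ratio gauge"]  # hypothetical metric name
    for path in paths:
        total, used, _free = usage(path)
        ratio = used / total if total else 0.0
        lines.append(f'np04_disk_used_ratio{{mountpoint="{path}"}} {ratio:.4f}')
    return "\n".join(lines) + "\n"
```

An alert rule could then fire when the ratio for `/` or a data area stays above a chosen threshold; the `usage` parameter exists only so the function can be exercised with fake numbers.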
