Scott Stanford: Topology, Infrastructure, Backups & Disaster Recovery, Monitoring, Lessons Learned, Q&A


SLIDE 1

Scott Stanford

SLIDE 2

  • Topology
  • Infrastructure
  • Backups & Disaster Recovery
  • Monitoring
  • Lessons Learned
  • Q&A

SLIDE 3

SLIDE 4

[Diagram: P4D (Sunnyvale) serving traditional proxies in Boston, Pittsburgh, RTP, and Bangalore]

  • 1.2 TB database, mostly db.have
  • Average daily journal size: 70 GB
  • Average of 4.1 million daily commands
  • 3,722 users globally
  • 655 GB of depots
  • 254,000 clients, most with ~200,000 files
  • One Git-Fusion instance
  • 2014.1 version of Perforce
  • Environment has to be up 24x7x365

SLIDE 5

[Diagram: Commit server (Sunnyvale); Edge servers in Sunnyvale, RTP, and Bangalore; Boston and Pittsburgh proxies off the RTP Edge; traditional proxies in Boston, Pittsburgh, RTP, and Bangalore]

  • Currently migrating from a traditional model to Commit/Edge servers
  • Traditional proxies will remain until the migration completes later this year
  • Initial Edge database is 85 GB
  • Major sites have an Edge server; others use a proxy off of the closest Edge (50 ms improvement)

SLIDE 6

SLIDE 7

  • All large sites have an Edge server; these were formerly proxies
  • High-performance SAN storage used for the database, journal, and log storage
  • Proxies have a P4TARGET of the closest Edge server (RTP), as in the sketch below
  • All hosts deployed with an active/standby host pairing
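
As a rough illustration, a site proxy chained to its nearest Edge might be started like this; the hostname, port, and paths are assumptions, not the actual deployment values:

# Illustrative p4p invocation: -t sets the upstream target (P4TARGET),
# -r the local revision cache, -L the proxy log. All values assumed.
p4p -d -p 1666 -t rtp-edge.example.com:1666 -r /p4/proxy/cache -L /p4/proxy/logs/p4p.log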

SLIDE 8

  • Redundant connectivity to storage
    – FC: redundant fabric to each controller and HBA
    – SAS: each dual HBA connected to each controller
  • Filers have multiple redundant data LIFs
  • 2 x 10 Gig NICs, HA bond, for the network (NFS and p4d)
  • VIF for hosting the public IP / hostname
    – Perforce licenses are tied to this IP

SLIDE 9

Each Commit/Edge server is configured in a pair consisting of:

  • A production host, controlled through a virtual NIC
    – Allows for a quick failover of the p4d without any DNS changes or changes to the users' environment (sketched below)
  • Standby host with a warm database or read-only replica
  • Dedicated SAN volume for low-latency database storage
  • Multiple levels of redundancy (network, storage, power, HBA)
  • Common init framework for all Perforce daemon binaries
  • SnapMirrored volume used for hosting the infrastructure binaries & tools (Perl, Ruby, Python, P4, Git-Fusion, common scripts)
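
The virtual-NIC failover can be pictured with a minimal sketch like the following, assuming Linux with the iproute2/iputils tools; the interface name, address, and paths are hypothetical:

# On the standby host: claim the service IP that clients resolve to
# (interface, address, and database path are assumptions)
VIP=10.10.0.50
ip addr add "$VIP/24" dev eth0
# Send unsolicited ARP so switches and peers learn the new MAC quickly
arping -U -I eth0 "$VIP" -c 3
# Start p4d against the warm database root on the standby
p4d -r /p4/db -p 1666 -d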

SLIDE 10

  • Storage devices used:
    – NetApp EF540 w/ FC for the Commit server
      • 24 x 800 GB SSD
    – NetApp E5512 w/ FC or SAS for each Edge server
      • 24 x 600 GB 15k SAS
    – All RAID 10 with multiple spare disks, XFS, dual controllers, and dual power supplies
  • Used for:
    – Warm database or read-only replica on the standby host
    – Production journal
      • Hourly journal truncations, then copied to the filer
    – Production p4d log
      • Nightly log rotations, compressed and copied to the filer
SLIDE 11

  • NetApp cDOT clusters used at each site, with FAS6290 or better
  • 10 Gig data LIF
  • Dedicated vserver for Perforce
  • Shared NFS volumes between production/standby pairs for longer-term storage, snapshots, and offsite copies
  • Used for:
    – Depot storage
    – Rotated journals & p4d logs
    – Checkpoints
    – Warm database
      • Used for creating checkpoints, and for running the daemon if both hosts are down
    – Git-Fusion homedir & cache, with a dedicated volume per instance

SLIDE 12

SLIDE 13

  • Truncate the journal
  • Checksum the journal, copy it to NFS, and verify the checksums match
  • Create a snapshot of the NFS volumes
  • Remove any old snapshots
  • Replay the journal on the warm SAN database
  • Replay the journal on the warm NFS database
  • Once a week, create a temporary snapshot on the NFS database and create a checkpoint (p4d -jd)

[Flow, run every hour: p4d -jj → checksum journal on SAN → copy journal to NFS → compare checksums of local and NFS copies → create snapshot(s) → delete old snapshots → replay on warm standby → replay on warm NFS]
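
Condensed as a shell sketch, the hourly cycle might look like this; all paths, the journal-name scheme, and the snapshot helpers are illustrative assumptions, and most error handling is omitted:

#!/bin/sh
P4ROOT=/p4/db                # production database root (SAN), assumed
NFS=/p4nfs/backup            # filer-backed NFS volume, assumed
WARM_SAN=/p4/warmdb          # warm database on the standby's SAN
WARM_NFS=$NFS/warmdb         # warm database on NFS

# 1. Truncate (rotate) the journal
p4d -r "$P4ROOT" -jj

# 2. Checksum on SAN, copy to NFS, verify the copies match
jnl=$(ls -v "$P4ROOT"/journal.* | tail -1)    # newest rotated journal (naming assumed)
cp "$jnl" "$NFS/journals/"
src=$(md5sum "$jnl" | awk '{print $1}')
dst=$(md5sum "$NFS/journals/${jnl##*/}" | awk '{print $1}')
[ "$src" = "$dst" ] || { echo "journal checksum mismatch" >&2; exit 1; }

# 3. Snapshot the NFS volume and prune old snapshots
#    (snap_create / snap_prune stand in for the NetApp tooling)
snap_create perforce_vol "hourly.$(date +%Y%m%d%H)"
snap_prune perforce_vol --keep 8

# 4. Replay the journal on the warm SAN and warm NFS databases
p4d -r "$WARM_SAN" -jr "$jnl"
p4d -r "$WARM_NFS" -jr "$NFS/journals/${jnl##*/}"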

SLIDE 14

Warm database

  • Trigger on the Edge server's events.csv changing
  • If a jj event, then get the journals that may need to be applied:
    – p4 journals -F "jdate>=(event epoch - 1)" -T jfile,jnum
  • For each journal, run a p4d -jr
  • Weekly checkpoint from a snapshot

Read-only Replica from Edge

  • Weekly checkpoint
  • Created with: p4 -p localhost:<port> admin checkpoint -Z

[Flow: Edge server captures the event in events.csv → Monit triggers the backup on events.csv changing → determine which journals to apply → apply journals → Commit server truncates; a sketch follows below]
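
A schematic version of that trigger-and-apply step; the events.csv field positions, paths, and output parsing are all assumptions made for illustration:

#!/bin/sh
# Run by Monit when the Edge server's events.csv changes
EVENTS=/p4/edge/logs/events.csv      # path assumed
WARMDB=/p4/edge/warmdb               # warm database root, assumed

last=$(tail -1 "$EVENTS")
epoch=$(echo "$last" | cut -d, -f1)  # event epoch: field position assumed
event=$(echo "$last" | cut -d, -f3)  # event type:  field position assumed

if [ "$event" = "jj" ]; then
    # Per the slide: ask which journals may need to be applied
    p4 -p localhost:1666 journals -F "jdate>=$((epoch - 1))" -T jfile,jnum |
        awk '$2 == "jfile" {print $3}' |      # output parsing is schematic
        while read -r jfile; do
            p4d -r "$WARMDB" -jr "$jfile"     # replay each journal
        done
fi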

SLIDE 15

  • New process for Edge servers to avoid WAN NFS mounts
  • For all the clients on an Edge server, at each site (sketched below):
    – Save the change output for any open changes
    – Generate the journal data for the client
    – Create a tarball of the open files
    – Retained for 14 days
  • A similar process will be used by users to clone clients across Edge servers
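
A simplified sketch of that per-client pass; the server address, paths, and the choice to tar the whole client root (rather than only the opened files) are simplifying assumptions, and the per-client journal-data step is left out:

#!/bin/sh
BACKUPS=/p4backup/clients            # destination, assumed
P4="p4 -p edge:1666"                 # Edge server address, assumed

$P4 clients | awk '{print $2}' | while read -r client; do
    dest="$BACKUPS/$client.$(date +%Y%m%d)"
    mkdir -p "$dest"
    # Save the change spec of every pending change on this client
    $P4 changes -c "$client" -s pending | awk '{print $2}' |
        while read -r change; do
            $P4 change -o "$change" > "$dest/change.$change"
        done
    # Tarball of the client's files (simplified to the whole root)
    root=$($P4 -ztag -F %Root% client -o "$client")
    [ -d "$root" ] && tar czf "$dest/files.tgz" -C "$root" .
done
# 14-day retention
find "$BACKUPS" -maxdepth 1 -mtime +14 -exec rm -rf {} +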

SLIDE 16

  • Snapshots:
    – Main backup method
    – Created and kept for:
      • 4 hours of every-20-minutes snapshots (20 & 40 minutes past the hour)
      • 8 hours of hourly snapshots (top of the hour)
      • 3 weeks of nightly snapshots, taken during backups (@ midnight PT)
  • SnapVault:
    – Used for online backups
    – Created every 4 weeks, kept for 12 months
  • SnapMirrors:
    – Contain all of the data needed to recreate the instance
    – Sunnyvale
      • DataProtection (DP) mirror for data recovery
      • Stored in the cluster
      • Allows the possibility of fast test instances being created from production snapshots with FlexClone
    – DR
      • RTP is the Disaster Recovery site for the Commit server
      • Sunnyvale is the Disaster Recovery site for the RTP and Bangalore Edge servers
SLIDE 17

SLIDE 18

  • Monit & M/Monit
    – Monitors and alerts on:
      • Filesystem thresholds, space and inodes
      • Specific processes, and file changes (timestamp/md5)
      • OS thresholds
  • Ganglia
    – Used for identifying host or performance issues
  • NetApp OnCommand
    – Storage monitoring
  • Internal tools
    – Monitor both the infrastructure and the end-user experience
SLIDE 19

  • Daemon that runs on each system; sends data to a single M/Monit instance
  • Monitors core daemons (Perforce and system): ssh, sendmail, ntpd, crond, ypbind, p4p, p4d, p4web, p4broker
  • Able to restart or take actions when conditions are met (e.g. clean a proxy cache, or purge it entirely)
  • Configured to alert on process-children thresholds
  • Dynamic monitoring tied into the init framework
  • Additional checks added for issues that have affected production in the past (see the sketch below):
    – NIC errors
    – Number of filehandles
    – Known patterns in the system log
    – p4d crashes
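
To give the flavor of those rules, here is a shell snippet that installs a hypothetical Monit config; the match patterns, paths, and thresholds are all assumptions:

# Install an illustrative Monit config (patterns, paths, thresholds assumed)
cat > /etc/monit.d/perforce <<'EOF'
check process p4d matching "p4d.*-p 1666"
  start program = "/etc/init.d/p4d start"
  stop program  = "/etc/init.d/p4d stop"
  if children > 500 then alert

check filesystem p4_depots with path /p4/depots
  if space usage > 90% then alert
  if inode usage > 90% then alert

check file system_log with path /var/log/messages
  if match "Hardware Error" then alert
EOF
monit reload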

SLIDE 20

  • Multiple Monit instances (one per host) communicate their status to a single M/Monit instance
  • All alerts and rules are controlled through M/Monit
  • Provides the ability to remotely start/stop/restart daemons
  • Has a dashboard of all of the Monit instances
  • Keeps historical data of issues, both when they were found and when they were recovered from
SLIDE 21

  • Collect historical data (depot, database, and cache sizes, license trends, number of clients and opened files per p4d)
  • Benchmarks collected every hour with the top user commands (a sketch follows below)
    – Alerts if a site is 15% slower than a historical average
    – Runs for both the Perforce binary and internal wrappers
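
One such hourly benchmark could be as simple as the sketch below; the command choice, baseline file, and alert route are assumptions:

#!/bin/sh
# Time a representative top user command, compare to a rolling average
BASELINE=/p4/bench/sync.avg          # historical average in seconds, assumed
start=$(date +%s)
p4 -p edge:1666 sync -n //depot/large/project/... > /dev/null
elapsed=$(( $(date +%s) - start ))

avg=$(cat "$BASELINE")
limit=$(( avg * 115 / 100 ))         # the 15%-slower threshold from the slide
if [ "$elapsed" -gt "$limit" ]; then
    echo "sync took ${elapsed}s vs ${avg}s average" |
        mail -s "Perforce benchmark regression" p4-admins@example.com
fi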

SLIDE 22

SLIDE 23

  • Faster performance for end-users
    – Most noticeable for sites with higher-latency WAN connections
  • Higher uptime for services, since an Edge can service some commands when the WAN or the Commit site is inaccessible
  • Much smaller databases: from 1.2 TB to 82 GB on a new Edge server
  • Automatic "backup" of the Commit server data through Edge servers
  • Easily move users to new instances
  • Can partially isolate some groups from affecting all users

SLIDE 24

  • Helpful to disable csv log rotations for frequent journal truncations (see the commands below)
    – Set the dm.rotatelogwithjnl configurable to 0
  • Shared log volumes with multiple databases (warm or with a daemon) can cause interesting results with csv logs
  • Set global configurables where you can: monitor, rpl.*, track, etc.
  • Use multiple pull -u threads to ensure the replicas have warm copies of the depot files
  • Need rock-solid backups on all p4ds with client data
    – Warm databases are harder to maintain with frequent journal truncations; there is no way to trigger on these events
  • Shelves are not automatically promoted
  • Users need to log in to each Edge server, or have their ticket file updated from existing entries
  • Adjusting the Perforce topology may have unforeseen side-effects; pointing proxies to new P4TARGETs can cause increased load on the WAN, depending on the topology
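
A few of these lessons expressed as concrete commands; the ServerID and the number of pull threads are illustrative assumptions:

# Keep csv logs from rotating with each hourly journal truncation
p4 configure set dm.rotatelogwithjnl=0

# Set configurables globally where possible (values assumed)
p4 configure set monitor=1
p4 configure set track=1

# Multiple pull -u threads on an Edge/replica keep depot files warm
# ("edge-rtp" is a hypothetical ServerID)
p4 configure set edge-rtp#startup.1="pull -u -i 1"
p4 configure set edge-rtp#startup.2="pull -u -i 1"
p4 configure set edge-rtp#startup.3="pull -u -i 1"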

SLIDE 25

Scott Stanford sstanfor@netapp.com

SLIDE 26
Scott Stanford is the SCM Lead for NetApp, where he also functions as a worldwide Perforce Administrator and tool developer. Scott has twenty years of experience in software development, with thirteen years specializing in configuration management. Prior to joining NetApp, Scott was a Senior IT Architect at Synopsys.

SLIDE 27

RESOURCES

SnapShot:
http://www.netapp.com/us/technology/storage-efficiency/se-technologies.aspx

SnapVault & SnapMirror:
http://www.netapp.com/us/products/protection-software/index.aspx

Backup & Recovery of Perforce on NetApp:
http://www.netapp.com/us/system/pdf-reader.aspx?pdfuri=tcm:10-107938-16&m=tr-4142.pdf

Monit:
http://mmonit.com/