Scott Stanford: Topology, Infrastructure, Backups & Disaster Recovery, Monitoring, Lessons Learned, Q&A


SLIDE 1

Scott Stanford

SLIDE 2

  • Topology
  • Infrastructure
  • Backups & Disaster Recovery
  • Monitoring
  • Lessons Learned
  • Q&A

SLIDE 3

SLIDE 4

[Diagram: P4D (Sunnyvale) serving traditional proxies in Boston, Pittsburgh, RTP, and Bangalore]

  • 1.2 TB database, mostly db.have
  • Average daily journal size: 70 GB
  • Average of 4.1 million daily commands
  • 3,722 users globally
  • 655 GB of depots
  • 254,000 clients, most with ~200,000 files
  • One Git-Fusion instance
  • 2014.1 version of Perforce
  • Environment has to be up 24x7x365

SLIDE 5

[Diagram: Commit server (Sunnyvale); Edge servers in Sunnyvale, RTP, and Bangalore; Boston and Pittsburgh proxies off the RTP Edge; traditional proxies in Boston, Pittsburgh, RTP, and Bangalore]

  • Currently migrating from a traditional model to Commit/Edge servers
  • Traditional proxies will remain until the migration completes later this year
  • Initial Edge database is 85 GB
  • Major sites have an Edge server; others use a proxy off of the closest Edge (50 ms improvement)

SLIDE 6

SLIDE 7

  • All large sites have an Edge server; these were formerly proxies
  • High-performance SAN storage used for the database, journal, and log storage
  • Proxies have a P4TARGET of the closest Edge server (RTP), as in the sketch below
  • All hosts deployed with an active/standby host pairing
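
As a rough illustration, a site proxy chained to its nearest Edge might be started like this; the hostname, port, and paths are assumptions, not the actual deployment values:

# Illustrative p4p invocation: -t sets the upstream target (P4TARGET),
# -r the local revision cache, -L the proxy log. All values assumed.
p4p -d -p 1666 -t rtp-edge.example.com:1666 -r /p4/proxy/cache -L /p4/proxy/logs/p4p.log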

SLIDE 8

  • Redundant connectivity to storage
    – FC: redundant fabric to each controller and HBA
    – SAS: each dual HBA connected to each controller
  • Filers have multiple redundant data LIFs
  • 2 x 10 Gig NICs, HA bond, for the network (NFS and p4d)
  • VIF for hosting the public IP / hostname
    – Perforce licenses are tied to this IP

SLIDE 9

Each Commit/Edge server is configured in a pair consisting of:

  • A production host, controlled through a virtual NIC
    – Allows for a quick failover of the p4d without any DNS changes or changes to the users' environment (sketched below)
  • Standby host with a warm database or read-only replica
  • Dedicated SAN volume for low-latency database storage
  • Multiple levels of redundancy (network, storage, power, HBA)
  • Common init framework for all Perforce daemon binaries
  • SnapMirrored volume used for hosting the infrastructure binaries & tools (Perl, Ruby, Python, P4, Git-Fusion, common scripts)
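
The virtual-NIC failover can be pictured with a minimal sketch like the following, assuming Linux with the iproute2/iputils tools; the interface name, address, and paths are hypothetical:

# On the standby host: claim the service IP that clients resolve to
# (interface, address, and database path are assumptions)
VIP=10.10.0.50
ip addr add "$VIP/24" dev eth0
# Send unsolicited ARP so switches and peers learn the new MAC quickly
arping -U -I eth0 "$VIP" -c 3
# Start p4d against the warm database root on the standby
p4d -r /p4/db -p 1666 -d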

SLIDE 10

  • Storage devices used:
    – NetApp EF540 w/ FC for the Commit server
      • 24 x 800 GB SSD
    – NetApp E5512 w/ FC or SAS for each Edge server
      • 24 x 600 GB 15k SAS
    – All RAID 10 with multiple spare disks, XFS, dual controllers, and dual power supplies
  • Used for:
    – Warm database or read-only replica on the standby host
    – Production journal
      • Hourly journal truncations, then copied to the filer
    – Production p4d log
      • Nightly log rotations, compressed and copied to the filer
SLIDE 11

  • NetApp cDOT clusters used at each site, with FAS6290 or better
  • 10 Gig data LIF
  • Dedicated vserver for Perforce
  • Shared NFS volumes between production/standby pairs for longer-term storage, snapshots, and offsite copies
  • Used for:
    – Depot storage
    – Rotated journals & p4d logs
    – Checkpoints
    – Warm database
      • Used for creating checkpoints, and for running the daemon if both hosts are down
    – Git-Fusion homedir & cache, with a dedicated volume per instance

SLIDE 12

SLIDE 13

  • Truncate the journal
  • Checksum the journal, copy it to NFS, and verify the checksums match
  • Create a snapshot of the NFS volumes
  • Remove any old snapshots
  • Replay the journal on the warm SAN database
  • Replay the journal on the warm NFS database
  • Once a week, create a temporary snapshot on the NFS database and create a checkpoint (p4d -jd)

[Flow, run every hour: p4d -jj → checksum journal on SAN → copy journal to NFS → compare checksums of local and NFS copies → create snapshot(s) → delete old snapshots → replay on warm standby → replay on warm NFS]
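
Condensed as a shell sketch, the hourly cycle might look like this; all paths, the journal-name scheme, and the snapshot helpers are illustrative assumptions, and most error handling is omitted:

#!/bin/sh
P4ROOT=/p4/db                # production database root (SAN), assumed
NFS=/p4nfs/backup            # filer-backed NFS volume, assumed
WARM_SAN=/p4/warmdb          # warm database on the standby's SAN
WARM_NFS=$NFS/warmdb         # warm database on NFS

# 1. Truncate (rotate) the journal
p4d -r "$P4ROOT" -jj

# 2. Checksum on SAN, copy to NFS, verify the copies match
jnl=$(ls -v "$P4ROOT"/journal.* | tail -1)    # newest rotated journal (naming assumed)
cp "$jnl" "$NFS/journals/"
src=$(md5sum "$jnl" | awk '{print $1}')
dst=$(md5sum "$NFS/journals/${jnl##*/}" | awk '{print $1}')
[ "$src" = "$dst" ] || { echo "journal checksum mismatch" >&2; exit 1; }

# 3. Snapshot the NFS volume and prune old snapshots
#    (snap_create / snap_prune stand in for the NetApp tooling)
snap_create perforce_vol "hourly.$(date +%Y%m%d%H)"
snap_prune perforce_vol --keep 8

# 4. Replay the journal on the warm SAN and warm NFS databases
p4d -r "$WARM_SAN" -jr "$jnl"
p4d -r "$WARM_NFS" -jr "$NFS/journals/${jnl##*/}"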

SLIDE 14

Warm database

  • Trigger on the Edge server's events.csv changing
  • If a jj event, then get the journals that may need to be applied:
    – p4 journals -F "jdate>=(event epoch - 1)" -T jfile,jnum
  • For each journal, run a p4d -jr
  • Weekly checkpoint from a snapshot

Read-only Replica from Edge

  • Weekly checkpoint
  • Created with: p4 -p localhost:<port> admin checkpoint -Z

[Flow: Edge server captures the event in events.csv → Monit triggers the backup on events.csv changing → determine which journals to apply → apply journals → Commit server truncates; a sketch follows below]
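
A schematic version of that trigger-and-apply step; the events.csv field positions, paths, and output parsing are all assumptions made for illustration:

#!/bin/sh
# Run by Monit when the Edge server's events.csv changes
EVENTS=/p4/edge/logs/events.csv      # path assumed
WARMDB=/p4/edge/warmdb               # warm database root, assumed

last=$(tail -1 "$EVENTS")
epoch=$(echo "$last" | cut -d, -f1)  # event epoch: field position assumed
event=$(echo "$last" | cut -d, -f3)  # event type:  field position assumed

if [ "$event" = "jj" ]; then
    # Per the slide: ask which journals may need to be applied
    p4 -p localhost:1666 journals -F "jdate>=$((epoch - 1))" -T jfile,jnum |
        awk '$2 == "jfile" {print $3}' |      # output parsing is schematic
        while read -r jfile; do
            p4d -r "$WARMDB" -jr "$jfile"     # replay each journal
        done
fi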

SLIDE 15

  • New process for Edge servers to avoid WAN NFS mounts
  • For all the clients on an Edge server, at each site (sketched below):
    – Save the change output for any open changes
    – Generate the journal data for the client
    – Create a tarball of the open files
    – Retained for 14 days
  • A similar process will be used by users to clone clients across Edge servers
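
A simplified sketch of that per-client pass; the server address, paths, and the choice to tar the whole client root (rather than only the opened files) are simplifying assumptions, and the per-client journal-data step is left out:

#!/bin/sh
BACKUPS=/p4backup/clients            # destination, assumed
P4="p4 -p edge:1666"                 # Edge server address, assumed

$P4 clients | awk '{print $2}' | while read -r client; do
    dest="$BACKUPS/$client.$(date +%Y%m%d)"
    mkdir -p "$dest"
    # Save the change spec of every pending change on this client
    $P4 changes -c "$client" -s pending | awk '{print $2}' |
        while read -r change; do
            $P4 change -o "$change" > "$dest/change.$change"
        done
    # Tarball of the client's files (simplified to the whole root)
    root=$($P4 -ztag -F %Root% client -o "$client")
    [ -d "$root" ] && tar czf "$dest/files.tgz" -C "$root" .
done
# 14-day retention
find "$BACKUPS" -maxdepth 1 -mtime +14 -exec rm -rf {} +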

SLIDE 16

  • Snapshots:
    – Main backup method
    – Created and kept for:
      • 4 hours of every-20-minutes snapshots (20 & 40 minutes past the hour)
      • 8 hours of hourly snapshots (top of the hour)
      • 3 weeks of nightly snapshots, taken during backups (@ midnight PT)
  • SnapVault:
    – Used for online backups
    – Created every 4 weeks, kept for 12 months
  • SnapMirrors:
    – Contain all of the data needed to recreate the instance
    – Sunnyvale
      • DataProtection (DP) mirror for data recovery
      • Stored in the cluster
      • Allows the possibility of fast test instances being created from production snapshots with FlexClone
    – DR
      • RTP is the Disaster Recovery site for the Commit server
      • Sunnyvale is the Disaster Recovery site for the RTP and Bangalore Edge servers
SLIDE 17

SLIDE 18

  • Monit & M/Monit
    – Monitors and alerts on:
      • Filesystem thresholds, space and inodes
      • Specific processes, and file changes (timestamp/md5)
      • OS thresholds
  • Ganglia
    – Used for identifying host or performance issues
  • NetApp OnCommand
    – Storage monitoring
  • Internal tools
    – Monitor both the infrastructure and the end-user experience
SLIDE 19

  • Daemon that runs on each system; sends data to a single M/Monit instance
  • Monitors core daemons (Perforce and system): ssh, sendmail, ntpd, crond, ypbind, p4p, p4d, p4web, p4broker
  • Able to restart or take actions when conditions are met (e.g. clean a proxy cache, or purge it entirely)
  • Configured to alert on process-children thresholds
  • Dynamic monitoring tied into the init framework
  • Additional checks added for issues that have affected production in the past (see the sketch below):
    – NIC errors
    – Number of filehandles
    – Known patterns in the system log
    – p4d crashes
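
To give the flavor of those rules, here is a shell snippet that installs a hypothetical Monit config; the match patterns, paths, and thresholds are all assumptions:

# Install an illustrative Monit config (patterns, paths, thresholds assumed)
cat > /etc/monit.d/perforce <<'EOF'
check process p4d matching "p4d.*-p 1666"
  start program = "/etc/init.d/p4d start"
  stop program  = "/etc/init.d/p4d stop"
  if children > 500 then alert

check filesystem p4_depots with path /p4/depots
  if space usage > 90% then alert
  if inode usage > 90% then alert

check file system_log with path /var/log/messages
  if match "Hardware Error" then alert
EOF
monit reload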

SLIDE 20

  • Multiple Monit instances (one per host) communicate their status to a single M/Monit instance
  • All alerts and rules are controlled through M/Monit
  • Provides the ability to remotely start/stop/restart daemons
  • Has a dashboard of all of the Monit instances
  • Keeps historical data of issues, both when they were found and when they were recovered from
SLIDE 21

  • Collect historical data (depot, database, and cache sizes, license trends, number of clients and opened files per p4d)
  • Benchmarks collected every hour with the top user commands (a sketch follows below)
    – Alerts if a site is 15% slower than a historical average
    – Runs for both the Perforce binary and internal wrappers
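
One such hourly benchmark could be as simple as the sketch below; the command choice, baseline file, and alert route are assumptions:

#!/bin/sh
# Time a representative top user command, compare to a rolling average
BASELINE=/p4/bench/sync.avg          # historical average in seconds, assumed
start=$(date +%s)
p4 -p edge:1666 sync -n //depot/large/project/... > /dev/null
elapsed=$(( $(date +%s) - start ))

avg=$(cat "$BASELINE")
limit=$(( avg * 115 / 100 ))         # the 15%-slower threshold from the slide
if [ "$elapsed" -gt "$limit" ]; then
    echo "sync took ${elapsed}s vs ${avg}s average" |
        mail -s "Perforce benchmark regression" p4-admins@example.com
fi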

SLIDE 22

SLIDE 23

  • Faster performance for end-users
    – Most noticeable for sites with higher-latency WAN connections
  • Higher uptime for services, since an Edge can service some commands when the WAN or the Commit site is inaccessible
  • Much smaller databases: from 1.2 TB to 82 GB on a new Edge server
  • Automatic "backup" of the Commit server data through Edge servers
  • Easily move users to new instances
  • Can partially isolate some groups from affecting all users

SLIDE 24

  • Helpful to disable csv log rotations for frequent journal truncations (see the commands below)
    – Set the dm.rotatelogwithjnl configurable to 0
  • Shared log volumes with multiple databases (warm or with a daemon) can cause interesting results with csv logs
  • Set global configurables where you can: monitor, rpl.*, track, etc.
  • Use multiple pull -u threads to ensure the replicas have warm copies of the depot files
  • Need rock-solid backups on all p4ds with client data
    – Warm databases are harder to maintain with frequent journal truncations; there is no way to trigger on these events
  • Shelves are not automatically promoted
  • Users need to log in to each Edge server, or have their ticket file updated from existing entries
  • Adjusting the Perforce topology may have unforeseen side-effects; pointing proxies to new P4TARGETs can cause increased load on the WAN, depending on the topology
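
A few of these lessons expressed as concrete commands; the ServerID and the number of pull threads are illustrative assumptions:

# Keep csv logs from rotating with each hourly journal truncation
p4 configure set dm.rotatelogwithjnl=0

# Set configurables globally where possible (values assumed)
p4 configure set monitor=1
p4 configure set track=1

# Multiple pull -u threads on an Edge/replica keep depot files warm
# ("edge-rtp" is a hypothetical ServerID)
p4 configure set edge-rtp#startup.1="pull -u -i 1"
p4 configure set edge-rtp#startup.2="pull -u -i 1"
p4 configure set edge-rtp#startup.3="pull -u -i 1"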

SLIDE 25

Scott Stanford sstanfor@netapp.com

SLIDE 26
Scott Stanford is the SCM Lead for NetApp, where he also functions as a worldwide Perforce Administrator and tool developer. Scott has twenty years of experience in software development, with thirteen years specializing in configuration management. Prior to joining NetApp, Scott was a Senior IT Architect at Synopsys.

SLIDE 27

RESOURCES

SnapShot:
http://www.netapp.com/us/technology/storage-efficiency/se-technologies.aspx

SnapVault & SnapMirror:
http://www.netapp.com/us/products/protection-software/index.aspx

Backup & Recovery of Perforce on NetApp:
http://www.netapp.com/us/system/pdf-reader.aspx?pdfuri=tcm:10-107938-16&m=tr-4142.pdf

Monit:
http://mmonit.com/