Build and operate a CEPH Infrastructure - University of Pisa case study
Simone Spinelli simone.spinelli@unipi.it
17 TF-Storage meeting - Pisa, 13-14 October 2015
Agenda
- CEPH@unipi: an overview
- Infrastructure bricks:
  – Network
  – OSD nodes
  – Monitor nodes
  – Racks
  – MGMT tools
- Performance
- Our experience
- Conclusions
University of Pisa
- Large Italian university:
  – 70K students
  – 8K employees
  – Not a campus: spread all over the city → no big datacenter but many small sites
- Owns and manages an optical infrastructure with an MPLS-based MAN on top
- Proud host of a GARR network PoP
- Surrounded by other research/educational institutions (CNR, Sant'Anna, Scuola Normale…)
How we use CEPH
Currently in production as the backend of an Openstack installation, it hosts:
- department tenants (web servers, etc.)
- tenants for research projects (DNA sequencing, etc.)
- tenants for us: multimedia content from e-learning platforms
Working on:
- An email system for students hosted on Openstack → RBD
- A sync&share platform → RadosGW
Timeline
- Spring 2014: planning starts:
  – Capacity/replica planning
  – Rack engineering (power/cooling)
  – Bare metal management
  – Configuration management
- Dec 2014: first testbed
- Feb 2015: 12-node cluster goes into production
- Jul 2015: Openstack goes into production
- Oct 2015: start deploying new Ceph nodes (+12)
Overview
- 3 sites (we started with 2):
  – One replica per site
  – 2 sites active for computing and storage
  – 1 site for storage and quorum only
- 2 different network infrastructures:
  – services (1Gb and 10Gb)
  – storage (10Gb and 40Gb)
Network
- Ceph client and cluster networks are realized as VLANs on the same switching infrastructure
- Redundancy and load balancing are achieved with LACP (a host-side sketch follows below)
- Switching platforms:
  – Juniper EX4550: 32p SFP
  – Juniper EX4200: 24p copper
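A rough host-side sketch of such a bond on Ubuntu 14.04 (ifenslave installed; interface names and addresses are placeholders, not the actual ones used here):

    # /etc/network/interfaces
    auto p2p1
    iface p2p1 inet manual
        bond-master bond0

    auto p2p2
    iface p2p2 inet manual
        bond-master bond0

    auto bond0
    iface bond0 inet static
        address 10.10.20.11
        netmask 255.255.255.0
        bond-slaves none
        bond-mode 802.3ad              # LACP; must match the aggregated ports on the switch
        bond-miimon 100
        bond-xmit-hash-policy layer3+4 # spread flows across both links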
Storage ring
- Sites interconnected with a 2x40Gb ERP (Ethernet Ring Protection) ring
- For storage nodes: 1 Virtual Chassis per DC:
  – Maximizes the bandwidth: 128Gb back-end inside the VC
  – Easy to configure and manage (NSSU)
  – No more than 8 nodes per VC
  – Computing nodes use a different VC
Hardware: OSD nodes
DELL R720XD (2U):
- 2x Xeon E5-2603 @ 1.8GHz: 8 cores total
- 64GB DDR3 RAM
- 2x 10Gb Intel X520 network adapters
- 12x 2TB SATA disks (6 disks/RU)
- 2x Samsung 850 256GB SSDs:
  – mdadm RAID1 for the OS
  – 6 partitions per disk for the XFS journals (see the sketch below)
- Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
- Linux bonding driver:
  – No special functions
  – Less complex
- Really easy to deploy with iDRAC
- Intended to be the virtual machine pool (faster)
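A minimal sketch of how a data disk and a pre-made SSD journal partition become an OSD on this kind of node (ceph-disk was the standard tool at the time; device names are placeholders):

    # one SATA disk as OSD data, its FileStore journal on an SSD partition
    ceph-disk prepare --fs-type xfs /dev/sdc /dev/sda3
    ceph-disk activate /dev/sdc1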
Hardware: OSD nodes
Supermicro SSG6047R-OSD120H:
- 2x Xeon E5-2630v2 @ 2.60GHz: 24 cores total
- 256GB DDR3 RAM
- 4x 10Gb Intel X520 network adapters
- 30x 6TB SATA disks (7.5 disks/RU)
- 6x Intel S3700 SSDs for the XFS journals:
  – 1 SSD → 5 OSDs (see the partitioning sketch below)
- 2x SSD in RAID1 for the OS (dedicated)
- Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
- Linux bonding driver:
  – No special functions
  – Less complex
- Intended to be the object storage pool (slow)
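A rough sketch of carving one journal SSD for five OSDs (device name and the 5 GiB journal size are assumptions; the GUID is the partition type ceph-disk expects for journals):

    # five journal partitions on a hypothetical /dev/sdx
    for i in 1 2 3 4 5; do
        sgdisk --new=${i}:0:+5G \
               --typecode=${i}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 \
               /dev/sdx
    done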
Hardware: monitor nodes
Sun SunFire X4150:
- Physical hardware, not virtual (3 in production, going to be 5)
- Ubuntu 14.04, Linux 3.13.0-46-generic #77-Ubuntu
- 2x Intel Xeon X5355 @ 2.66GHz
- 2x 1Gb Intel NICs for the Ceph client network (LACP)
- 16GB RAM
- 5x 120GB Intel S3500 SSDs in RAID10 + hot spare
Racks plans
NOW: computing and storage are mixed.
- 24U OSD nodes
- 4U Computing nodes
- 2U monitor/cache
- 10U network
IN PROGRESS: computing and storage will be in dedicated racks. For storage:
- 32U OSD nodes
- 2U monitor/cache
- 8U network
For computing:
- 32U for computing nodes
- 10U network
The storage network fan-out is optimized
CRUSH configuration essentials

CRUSH tree (excerpt):
 -1  262.1   root default
-15   87.36      datacenter fibonacci
-16   87.36          rack rack-c03-fib
-14   87.36      datacenter serra
-17   87.36          rack rack-02-ser
-18   87.36          rack rack-03-ser
-35   87.36      datacenter ingegneria
-31                  rack rack-01-ing
-32                  rack rack-02-ing
-33                  rack rack-03-ing
-34                  rack rack-04-ing

rule serra_fibo_ing_high-end_ruleset {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 1 type host-high-end
    step emit
}
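To make a pool use such a rule, the pre-Luminous command has this form (the pool name here is only an example):

    ceph osd pool set volumes crush_ruleset 3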
Tools
Just 3 people working on CEPH (not at 100%) and we need to grow quickly → automation is REALLY important.
- Configuration management: Puppet (a sketch follows below)
  – Most of the classes are already production-ready
  – A lot of documentation (best practices, books, community)
- Bare metal installation: The Foreman
  – Complete lifecycle for hardware
  – DHCP, DNS, Puppet ENC
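A sketch of what a node could look like with the community puppet-ceph module (class and parameter names are the module's; fsid, addresses and devices are placeholders):

    class { 'ceph::repo': }

    class { 'ceph':
      fsid            => 'a5a6f2f0-0000-0000-0000-000000000000',
      mon_host        => '10.10.10.1,10.10.10.2,10.10.10.3',
      public_network  => '10.10.10.0/24',
      cluster_network => '10.10.20.0/24',
    }

    # one OSD per data disk, journal on a dedicated SSD partition
    ceph::osd { '/dev/sdc':
      journal => '/dev/sda3',
    }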
Tools
For monitoring/alarming:
- Nagios + CheckMK
  – alarms
  – graphing
- Rsyslog
- Looking at collectd + Graphite (a sketch follows below)
  – metrics correlation
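A minimal collectd sketch for that pipeline (daemon name, socket path and Graphite host are assumptions):

    LoadPlugin ceph
    LoadPlugin write_graphite

    <Plugin ceph>
      <Daemon "osd.0">
        SocketPath "/var/run/ceph/ceph-osd.0.asok"
      </Daemon>
    </Plugin>

    <Plugin write_graphite>
      <Node "graphite">
        Host "graphite.example.org"
        Port "2003"
        Protocol "tcp"
      </Node>
    </Plugin>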
Test environment: Vagrant and VirtualBox, used to test whatever is hardware-independent:
- new functionalities
- Puppet classes
- upgrade procedures
Openstack integration
- It works out of the box (a config sketch follows below)
- Ceph as a backend for:
  – Volumes
  – VMs
  – Images
- Copy-on-write: VMs are spawned as snapshots/clones of their image
- Shared storage → live migration
- Multiple pools are supported
- Current issues (OpenStack Juno, Ceph Giant):
  – Massive volume deletion
  – Evacuate
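For reference, the usual Juno-era wiring of Cinder/Glance/Nova to RBD looks roughly like this (pool names, user and secret UUID are placeholders, not necessarily what is used here):

    # cinder.conf
    [DEFAULT]
    enabled_backends = ceph
    [ceph]
    volume_driver   = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool        = volumes
    rbd_ceph_conf   = /etc/ceph/ceph.conf
    rbd_user        = cinder
    rbd_secret_uuid = 457eb676-0000-0000-0000-000000000000

    # glance-api.conf
    [glance_store]
    default_store  = rbd
    rbd_store_pool = images
    rbd_store_user = glance

    # nova.conf
    [libvirt]
    images_type     = rbd
    images_rbd_pool = vms
    rbd_user        = cinder
    rbd_secret_uuid = 457eb676-0000-0000-0000-000000000000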
Performance – rados bench writes
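The three runs below are rados bench write runs of different lengths; the command has the same shape as the read commands on the next slide, e.g. (pool name taken from the read slide, duration assumed):

    rados bench -p BenchPool 60 write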
                            60 s run      10 s run      120 s run
    Total time run          60.308706     10.353915     120.537838
    Total writes made       5942          1330          12593
    Write size              4194304       4194304       4194304
    Bandwidth (MB/sec)      394.106       513.815       417.894
    Stddev Bandwidth        103.204       161.337       84.4311
    Max bandwidth (MB/sec)  524           564           560
    Min bandwidth (MB/sec)  0             0             0
    Average Latency         0.162265      0.123224      0.153105
    Stddev Latency          0.211504      0.0928879     0.175394
    Max latency             2.71961       0.955342      2.05649
    Min latency             0.041313      0.045272      0.038814
Performance – rados bench reads
rados bench -p BenchPool 10 rand
    Total time run:       10.065519
    Total reads made:     1561
    Read size:            4194304
    Bandwidth (MB/sec):   620.336
    Average Latency:      0.102881
    Max latency:          0.294117
    Min latency:          0.04644

rados bench -p BenchPool 10 seq
    Total time run:       10.057527
    Total reads made:     1561
    Read size:            4194304
    Bandwidth (MB/sec):   620.829
    Average Latency:      0.102826
    Max latency:          0.328899
    Min latency:          0.041481
Performances: adding VMs
What to measure:
- How latency is influenced by IOPS, measured while adding VMs (fixed load generator)
- How total bandwidth decreases while adding VMs

Setup:
- 40 VMs on Openstack, each with two 10GB volumes (pre-allocated with dd):
  – one with a bandwidth cap (100MB/s)
  – one with an IOPS cap (200 total)
- fio is used as the benchmark tool, launched from a master node with dsh (command lines on the next slide)

Reference:
- Measure Ceph RBD performance in a quantitative way: https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
Fio
fio --size=1G \
    --runtime=60 \
    --ioengine=libaio \
    --direct=1 \
    --rw=randread [randwrite] \
    --name=fiojob \
    --blocksize=4K \
    --iodepth=2 \
    --rate_iops=200 \
    --output=randread.out

fio --size=4G \
    --runtime=60 \
    --ioengine=libaio \
    --direct=1 \
    --rw=read [write] \
    --name=fiojob \
    --blocksize=128K [256K] \
    --iodepth=64 \
    --output=seqread.out
Performance – write (charts)
Performance – read (charts)
Dealing with:
Software:
- Slow requests / blocked operations
- Scrub errors: fix them with pg repair and check the logs (commands below)

Automation:
- When something is broken, Puppet can make it worse

Hardware:
- Most of the problems came from hardware (disks, controllers, nodes): but maybe we are too small…
- More RAM = less PAIN (especially during recovery/rebalancing)
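A minimal sketch of the scrub-error routine (the PG id is a placeholder):

    ceph health detail                       # list the inconsistent PGs
    ceph pg repair 3.1a                      # ask the primary OSD to repair that PG
    grep ERR /var/log/ceph/ceph-osd.*.log    # find the object/disk behind the error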
...so what?
- Ceph is addressing our needs:
  – It performs (well?)
  – It's robust
- In about 9 months, production and non-production, nothing really bad has happened.

Now we are going to:
- Work more on monitoring and performance graphing
- Run more benchmarks to understand what to improve
- Add SSD cache
- Activate RadosGW (in production) and the slow pool
Questions ?
For you:
- VMware support?
- Xen/XenServer?
- SMB/NFS/iSCSI?