Build and operate a CEPH Infrastructure - University of Pisa case study


SLIDE 1

17 TF-Storage meeting - Pisa 13-14 October 2015

Build and operate a CEPH Infrastructure - University of Pisa case study

Simone Spinelli simone.spinelli@unipi.it

SLIDE 2

Agenda

  • CEPH@unipi: an overview
  • Infrastructure bricks:
    – Network
    – OSD nodes
    – Monitor nodes
    – Racks
    – Management tools
  • Performance
  • Our experience
  • Conclusions
SLIDE 3

University of Pisa

  • Large Italian university:
    – 70K students
    – 8K employees
    – Not a campus but spread all over the city → no big datacenter, many small sites
  • Owns and manages an optical infrastructure with an MPLS-based MAN on top
  • Proud host of a GARR network PoP
  • Surrounded by other research/educational institutions (CNR, Sant'Anna, Scuola Normale…)

SLIDE 4

How we use CEPH

Currently in production as the backend for an OpenStack installation, it hosts:

  • department tenants (web servers, etc.)
  • tenants for research projects (DNA sequencing, etc.)
  • tenants for us: multimedia content from e-learning platforms

Working on:

  • An email system for students hosted on OpenStack → RBD
  • A sync & share platform → RadosGW
SLIDE 5

Timeline

  • Spring 2014: we started to plan:
    – Capacity/replica planning
    – Rack engineering (power/cooling)
    – Bare-metal management
    – Configuration management
  • Dec 2014: first testbed
  • Feb 2015: 12-node cluster goes into production
  • Jul 2015: OpenStack goes into production
  • Oct 2015: start deploying new Ceph nodes (+12)
SLIDE 6

Overview

  • 3 sites (we started with 2):
    – One replica per site
    – 2 for active computing and storage
    – 1 for storage and quorum
  • 2 different network infrastructures:
    – services (1 Gb and 10 Gb)
    – storage (10 Gb and 40 Gb)

SLIDE 7

Network

  • Ceph client and cluster networks are realized as VLANs on the same switching infrastructure
  • Redundancy and load balancing are achieved with LACP (see the bonding sketch below)
  • Switching platforms:
    – Juniper EX4550: 32 SFP ports
    – Juniper EX4200: 24 copper ports
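A minimal sketch of the node-side LACP setup, assuming the plain Linux bonding driver on Ubuntu 14.04 as described on the OSD-node slides; the interface names, bond name and address are illustrative assumptions, not taken from the slides:

    # /etc/network/interfaces fragment (hypothetical names and addressing)
    auto bond0
    iface bond0 inet static
        address 10.10.0.11            # storage VLAN address, example only
        netmask 255.255.255.0
        bond-slaves p2p1 p2p2         # the two 10Gb X520 ports
        bond-mode 802.3ad             # LACP, matching the switch-side LAG
        bond-miimon 100
        bond-xmit-hash-policy layer3+4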

SLIDE 8

Storage ring

  • Sites interconnected with a 2x40Gb ERP ring
  • For storage nodes: 1 Virtual Chassis (VC) per DC:
    – Maximizes bandwidth: 128 Gbps backplane inside the VC
    – Easy to configure and manage (NSSU)
    – No more than 8 nodes per VC
    – Computing nodes use a different VC

SLIDE 9

Hardware: OSD nodes

DELL R720XD (2U):

  • 2x Xeon E5-2603 @ 1.8 GHz: 8 cores total
  • 64 GB DDR3 RAM
  • 2x 10Gb Intel X520 network adapters
  • 12x 2TB SATA disks (6 disks/RU)
  • 2x Samsung 850 256GB SSDs:
    – mdadm RAID 1 for the OS
    – 6 partitions per SSD for XFS journals (see the sketch below)
  • Ubuntu 14.04
  • Linux 3.13.0-46-generic #77-Ubuntu
  • Linux bonding driver:
    – No special functions
    – Less complex
  • Really easy to deploy with iDRAC
  • Intended to be the virtual machine pool (faster)
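As an illustration of how a data disk gets its journal on one of those SSD partitions, a hedged sketch using ceph-deploy; device names, partition sizes and the host name are assumptions, not taken from the slides:

    # carve the journal SSD into 6 partitions, one per data disk (~40 GB each)
    parted --script /dev/sdb mklabel gpt
    for i in $(seq 0 5); do
        parted --script /dev/sdb mkpart journal-$((i+1)) $((i*40))GB $(((i+1)*40))GB
    done

    # from the ceph-deploy admin node: one OSD with data on sdc, journal on sdb1
    ceph-deploy osd create osd-node-01:sdc:/dev/sdb1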

SLIDE 10

Hardware: OSD nodes

Supermicro SSG6047R-OSD120H:

  • 2x Xeon E5-2630v2 @ 2.60 GHz: 24 cores total
  • 256 GB DDR3 RAM
  • 4x 10Gb Intel X520 network adapters
  • 30x 6TB SATA disks (7.5 disks/RU)
  • 6 Intel 3700 SSDs for XFS journals
    – 1 SSD → journals for 5 OSDs
  • Ubuntu 14.04
  • Linux 3.13.0-46-generic #77-Ubuntu
  • 2 SSDs in RAID 1 for the OS (dedicated)
  • Linux bonding driver:
    – No special functions
    – Less complex
  • Intended to be the object storage pool (slow)
SLIDE 11

Hardware: monitor nodes

Sun SunFire X4150

  • Physical hardware, not virtual (3 in production, going to 5; see the sketch below)
  • Ubuntu 14.04 - Linux 3.13.0-46-generic #77-Ubuntu
  • 2x Intel Xeon X5355 @ 2.66 GHz
  • 2x 1Gb Intel NICs for the Ceph client network (LACP)
  • 16 GB RAM
  • 5x 120GB Intel 3500 SSDs in RAID 10 + hot spare
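Growing from 3 to 5 monitors can be done one node at a time; a hedged sketch with ceph-deploy, where the hostnames are placeholders:

    # add two more monitors to the existing quorum, one at a time
    ceph-deploy mon add mon-node-04
    ceph-deploy mon add mon-node-05

    # check that all monitors have joined the quorum
    ceph quorum_status --format json-pretty
    ceph mon stat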
SLIDE 12

Rack plans

NOW: computing and storage are mixed.

  • 24U OSD nodes
  • 4U Computing nodes
  • 2U monitor/cache
  • 10U network

IN PROGRESS: computing and storage will be in specific racks. For storage:

  • 32U OSD nodes
  • 2U monitor/cache
  • 8U network

For computing:

  • 32U for computing nodes
  • 10U network

The storage network fan-out is optimized

SLIDE 13

Configuration essentials

CRUSH hierarchy (id, weight, type, name):

    1   262.1   root default
    15   87.36    datacenter fibonacci
    16   87.36      rack rack-c03-fib
    14   87.36    datacenter serra
    17   87.36      rack rack-02-ser
    18   87.36      rack rack-03-ser
    35   87.36    datacenter ingegneria
    31              rack rack-01-ing
    32              rack rack-02-ing
    33              rack rack-03-ing
    34              rack rack-04-ing

CRUSH rule (one replica per datacenter, placed on high-end hosts):

    rule serra_fibo_ing_high-end_ruleset {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 1 type host-high-end
        step emit
    }
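A hedged sketch of how a rule like this is attached to a pool; the pool name is an assumption, and on Giant the pool setting is still called crush_ruleset:

    # inspect the rule as loaded into the cluster
    ceph osd crush rule ls
    ceph osd crush rule dump serra_fibo_ing_high-end_ruleset

    # use ruleset 3 for a pool and keep three replicas, one per datacenter
    ceph osd pool set volumes crush_ruleset 3
    ceph osd pool set volumes size 3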

SLIDE 14

Tools

Just 3 people work on Ceph (not full time) and we need to grow quickly → automation is REALLY important.

  • Configuration management: Puppet
    – Most of the classes are already production-ready
    – A lot of documentation (best practices, books, community)
  • Bare-metal installation: The Foreman (see the bootstrap sketch below)
    – Complete lifecycle for hardware
    – DHCP, DNS, Puppet ENC
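A minimal sketch of the bootstrap step on a node that The Foreman has just installed; the Puppet master hostname is an assumption:

    # first Puppet run on a freshly provisioned node: request a certificate
    # and apply the classes assigned to this host by The Foreman (Puppet ENC)
    puppet agent --test --server puppet.unipi.it --waitforcert 60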

SLIDE 15

Tools

For monitoring/alarming:

  • Nagios + Check_MK (see the health-check sketch below)
    – alarms
    – graphing
  • Rsyslog
  • Looking at collectd + Graphite
    – metrics correlation

Test environment (Vagrant and VirtualBox) to test what is hardware independent:

  • new functionality
  • Puppet classes
  • upgrade procedures
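As an illustration only, a tiny Nagios-style health check built around ceph health; the exit-code mapping follows the usual Nagios convention, and this exact script is not from the slides:

    #!/bin/bash
    # map overall Ceph cluster health to Nagios exit codes
    status=$(ceph health 2>/dev/null)
    case "$status" in
        HEALTH_OK*)   echo "OK - $status";       exit 0 ;;
        HEALTH_WARN*) echo "WARNING - $status";  exit 1 ;;
        HEALTH_ERR*)  echo "CRITICAL - $status"; exit 2 ;;
        *)            echo "UNKNOWN - $status";  exit 3 ;;
    esac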
SLIDE 16

OpenStack integration

  • It works out of the box
  • Ceph as a backend for (see the pool/keyring sketch below):
    – Volumes
    – VMs
    – Images
  • Copy-on-write: a VM disk is a snapshot (clone) of its image
  • Shared storage → live migration
  • Multiple pools are supported
  • Current issues (OpenStack = Juno, Ceph = Giant):
    – Massive volume deletion
    – Evacuate
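A hedged sketch of the pool and cephx setup usually paired with a Juno + Giant integration, following the upstream Ceph/OpenStack documentation; the pool and client names are the conventional upstream ones, not necessarily those used in Pisa:

    # pools for images, volumes and VM disks (PG counts are illustrative)
    ceph osd pool create images  128
    ceph osd pool create volumes 128
    ceph osd pool create vms     128

    # cephx users for Glance and Cinder/Nova
    ceph auth get-or-create client.glance \
        mon 'allow r' \
        osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
    ceph auth get-or-create client.cinder \
        mon 'allow r' \
        osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'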

SLIDE 17

Performance – rados bench writes

Run 1 (60 s):
    Total time run:         60.308706
    Total writes made:      5942
    Write size:             4194304
    Bandwidth (MB/sec):     394.106
    Stddev Bandwidth:       103.204
    Max bandwidth (MB/sec): 524
    Min bandwidth (MB/sec): 0
    Average Latency:        0.162265
    Stddev Latency:         0.211504
    Max latency:            2.71961
    Min latency:            0.041313

Run 2 (10 s):
    Total time run:         10.353915
    Total writes made:      1330
    Write size:             4194304
    Bandwidth (MB/sec):     513.815
    Stddev Bandwidth:       161.337
    Max bandwidth (MB/sec): 564
    Min bandwidth (MB/sec): 0
    Average Latency:        0.123224
    Stddev Latency:         0.0928879
    Max latency:            0.955342
    Min latency:            0.045272

Run 3 (120 s):
    Total time run:         120.537838
    Total writes made:      12593
    Write size:             4194304
    Bandwidth (MB/sec):     417.894
    Stddev Bandwidth:       84.4311
    Max bandwidth (MB/sec): 560
    Min bandwidth (MB/sec): 0
    Average Latency:        0.153105
    Stddev Latency:         0.175394
    Max latency:            2.05649
    Min latency:            0.038814
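These figures have the shape of rados write-bench output; runs matching the durations above would be launched roughly like this (the pool name comes from the read slide, and --no-cleanup is an assumption so the objects remain for the read tests):

    # 4 MB object writes against the benchmark pool for 10 / 60 / 120 seconds
    rados bench -p BenchPool 10  write --no-cleanup
    rados bench -p BenchPool 60  write --no-cleanup
    rados bench -p BenchPool 120 write --no-cleanup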

SLIDE 18

Performance – rados bench reads

rados bench -p BenchPool 10 rand
    Total time run:     10.065519
    Total reads made:   1561
    Read size:          4194304
    Bandwidth (MB/sec): 620.336
    Average Latency:    0.102881
    Max latency:        0.294117
    Min latency:        0.04644

rados bench -p BenchPool 10 seq
    Total time run:     10.057527
    Total reads made:   1561
    Read size:          4194304
    Bandwidth (MB/sec): 620.829
    Average Latency:    0.102826
    Max latency:        0.328899
    Min latency:        0.041481

SLIDE 19

Performance: adding VMs

What to measure:

  • How latency is influenced by IOPS, measuring it while we add VMs (fixed load generator).
  • How total bandwidth decreases as VMs are added.

Setup:

  • 40 VMs on OpenStack, each with two 10 GB volumes (pre-allocated with dd):
    – one with a bandwidth cap (100 MB/s)
    – one with an IOPS cap (200 total)
  • We use fio as the benchmark tool and dsh to launch it from a master node (see the dsh sketch after the fio command lines).

Reference:

  • Measure Ceph RBD performance in a quantitative way: https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i

SLIDE 20

Fio

Random I/O job (use --rw=randwrite for the write variant):

    fio --size=1G \
        --runtime=60 \
        --ioengine=libaio \
        --direct=1 \
        --rw=randread \
        --name=fiojob \
        --blocksize=4K \
        --iodepth=2 \
        --rate_iops=200 \
        --output=randread.out

Sequential I/O job (use --rw=write for the write variant, --blocksize=256K for the larger block size):

    fio --size=4G \
        --runtime=60 \
        --ioengine=libaio \
        --direct=1 \
        --rw=read \
        --name=fiojob \
        --blocksize=128K \
        --iodepth=64 \
        --output=seqread.out
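A hedged sketch of how such a job can be fanned out from the master node with dsh, as mentioned on the previous slide; the dsh group name and output path are assumptions:

    # run the same random-read fio job concurrently on every VM in the 'benchvms' group
    dsh -g benchvms -c -- \
        fio --size=1G --runtime=60 --ioengine=libaio --direct=1 \
            --rw=randread --name=fiojob --blocksize=4K --iodepth=2 \
            --rate_iops=200 --output=/tmp/randread.out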
SLIDE 21

Performance - write

SLIDE 22

Performance - write

SLIDE 23

Performance - read

SLIDE 24

Performance - read

SLIDE 25

Dealing with:

Software:

  • Slow requests / operations blocked
  • Scrub errors: fix them with pg repair, and check the logs (see the sketch below)

Automation:

  • When something is broken, Puppet can make it worse

Hardware:

  • Most of the problems came from hardware (disks, controllers, nodes): but maybe we are too small…
  • More RAM = less PAIN (especially during recovery/rebalancing)
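A minimal sketch of the scrub-repair workflow mentioned above, plus pausing Puppet while debugging; the PG id is a placeholder:

    # find the inconsistent placement groups and the OSDs involved
    ceph health detail
    ceph pg dump | grep inconsistent

    # ask Ceph to repair one of them, then follow the primary OSD's log
    ceph pg repair 3.1f        # 3.1f is a placeholder PG id

    # keep Puppet from "fixing" things while the node is being worked on
    puppet agent --disable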

SLIDE 26

...so what?

  • Ceph is addressing our needs:
    – It performs (well?)
    – It's robust
  • In about 9 months - production and non-production - nothing really bad has happened.

Now we are going to:

  • Work more on monitoring and performance graphing
  • Run more benchmarks to understand what to improve
  • Add an SSD cache
  • Activate RadosGW (in production) and the slow pool

SLIDE 27

Questions?

For you:

  • VMware support?
  • Xen/XenServer?
  • SMB/NFS/iSCSI?
SLIDE 28

Coffee time!