SLIDE 1

MySQL and Ceph

Yves Trudeau, Principal Architect, Percona
Santa Clara, California | April 24th – 27th, 2017

SLIDE 2

Who am I?

  • Physicist (by training)
  • MySQL (2007-2008)
  • Sun Microsystems (2008-2009)
  • Percona since 2009
  • Principal architect → HA and distributed systems
SLIDE 3

Plan

  • A Ceph 101 intro
  • What Ceph brings to MySQL
  • MySQL/Rados
SLIDE 4

A Ceph 101 intro

SLIDE 5

Ceph 101: general

  • An object store (on steroids!!)
  • Distributed to 1000+ nodes
  • Scalable to multiple PB and 100k+ IOPS
  • Highly available
  • Multiple APIs
  • A very popular OpenStack Cinder backend
SLIDE 6

Ceph 101: versions

  • Named releases are “stable”
  • In alphabetical order
  • Dumpling → Emperor → Firefly → …
  • One release per ~6 months
  • Current is Kraken (January 2017)
SLIDE 7

Ceph 101: nodes/processes

  • OSD → storage
  • MON → cluster manager and map
  • MDS → filesystem
  • + protocol proxies (status commands below)
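
As a quick illustration, the ceph CLI exposes each of these daemon types; a minimal sketch (output omitted):

:~$ ceph -s          # overall health, MON quorum, OSD and PG counts
:~$ ceph osd tree    # the OSD daemons and how they map to hosts and disks
:~$ ceph mds stat    # MDS state, only relevant when CephFS is in use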
SLIDE 8

Ceph 101: OSD nodes

  • The storage node
  • As you'd guess… stores and retrieves data
  • Performs replication (writes copies to other OSDs)
  • Atomic operations
  • Typically one per disk
  • NVMe drives may host more than one
SLIDE 9

Ceph 101: MON

  • The MONitor nodes
  • Should be many, in odd numbers (3, 5, etc.)
  • Paxos protocol for quorum
  • Provide the cluster maps to the clients
  • Monitor the OSDs
  • Perform maintenance tasks (inspection commands below)
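
A minimal sketch of how quorum and the maps can be inspected from a client:

:~$ ceph mon stat                    # list the monitors and the quorum leader
:~$ ceph quorum_status               # detailed (JSON) view of the Paxos quorum
:~$ ceph osd getmap -o /tmp/osdmap   # grab the current OSD map handed to clients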
SLIDE 10

Ceph 101: MDS

  • The MetaData Server
  • Optional, needed only for CephFS
  • Only one active, but the MONs can promote a standby automatically

SLIDE 11

Ceph 101: how objects are stored

[Diagram: the client sees objects 1, 2 and 3; the MON provides the CRUSH map and the client computes placement itself; the objects land on OSD.1, OSD.2 and OSD.3, with replicas 1', 2' and 3' on other OSDs]

SLIDE 12

Ceph 101: pools

  • Objects are stored in pools
  • Pools can have different configurations
  • Pools can use different sets of OSDs
  • Pools support snapshots
  • Replication is at the pool level
  • Access rights are at the pool level → CephX
  • Pools are sharded into Placement Groups (creation sketch below)
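
A minimal sketch of creating a replicated pool and a CephX user limited to it (pool and user names are illustrative):

:~$ ceph osd pool create Mypool 128 128        # 128 placement groups
:~$ ceph osd pool set Mypool size 3            # keep 3 copies of every object
:~$ ceph auth get-or-create client.mysql \
      mon 'allow r' osd 'allow rwx pool=Mypool'   # CephX user restricted to this pool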
SLIDE 13

Ceph 101: access

  • Linux NBD
  • iSCSI
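
For example, a minimal sketch using rbd-nbd (shipped with Jewel and later) to expose an RBD image as an NBD device; the image and mount point names are illustrative:

# rbd-nbd map Mypool/mydisk
/dev/nbd0
# mkfs.xfs /dev/nbd0 && mount /dev/nbd0 /var/lib/mysql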
SLIDE 14

Ceph 101: rados

  • An object store like S3 → radosgw for the S3 API
  • Objects can be edited in place
  • “rados” command line tool

:~$ rados -p Mypool put ibdata1 /var/lib/mysql/ibdata1
:~$ rados -p Mypool ls
ibdata1
:~$ rados -p Mypool get ibdata1 /tmp/ibdata1
:~$ rados -p Mypool rm ibdata1

SLIDE 15

Ceph 101: rbd

  • RBD stands for Rados Block Device (disks)
  • Mount using librbd (KVM) or the kernel module
  • Snapshot, clone, resize, thin provisioning, etc. (sketch below)

# rbd -p Mypool create mydisk -s 2G --image-format 2 \
    --image-feature layering
# rbd -p Mypool map mydisk
/dev/rbd0
# mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /var/lib/mysql
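
A minimal sketch of the snapshot/clone/resize features listed above (image and snapshot names are illustrative; older releases may want --size in MB):

# rbd snap create Mypool/mydisk@snap1                 # point-in-time snapshot
# rbd snap protect Mypool/mydisk@snap1                # must be protected before cloning
# rbd clone Mypool/mydisk@snap1 Mypool/mydisk-clone   # thin, copy-on-write clone
# rbd resize Mypool/mydisk --size 4G                  # grow the image online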

SLIDE 16

Ceph 101: CephFS

  • A distributed POSIX filesystem
  • FUSE and kernel module
  • Can be mounted by multiple clients

# mount -t ceph 10.2.2.20:6789:/ /mnt/ceph \
    -o name=admin,secret=AQAaV2Qda...==
# df -h
…
10.2.2.20:6789:/  7.2T  1.2T  6.0T  17%  /mnt/ceph

SLIDE 17

Ceph 101: Deploying

  • The ceph-deploy script, used throughout the documentation (minimal sketch below)
  • Have a look at ceph-ansible (in the Ceph GitHub)
    – Still challenging though, many options
    – Very good for large deployments
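A minimal ceph-deploy sketch, assuming three nodes node1–node3 each with a spare /dev/sdb (hostnames and disks are illustrative, and the osd sub-command syntax varies between releases):

:~$ ceph-deploy new node1 node2 node3        # write the initial ceph.conf and monmap
:~$ ceph-deploy install node1 node2 node3    # push the Ceph packages to the nodes
:~$ ceph-deploy mon create-initial           # start the MONs and form the quorum
:~$ ceph-deploy osd create node1:sdb node2:sdb node3:sdb   # one OSD per spare disk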

SLIDE 18

Ceph 101: My home Ceph cluster

SLIDE 19

What Ceph brings to MySQL

SLIDE 20

Ceph brings to MySQL: the basics...

  • Scalable storage
  • The possibility to leverage multiple disks to scale the IOPS
  • Efficient backups (snapshot sketch below)
  • Non-local storage, allowing services to be moved easily
  • Like an iSCSI SAN, right?
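
A minimal sketch of the “efficient backups” point: with the datadir on an RBD image, a crash-consistent backup is just a snapshot, optionally with the filesystem frozen first (image, snapshot and path names are illustrative):

# fsfreeze -f /var/lib/mysql                         # quiesce the filesystem briefly
# rbd snap create Mypool/mydisk@backup-2017-04-24    # instant, copy-on-write snapshot
# fsfreeze -u /var/lib/mysql                         # resume writes
# rbd export Mypool/mydisk@backup-2017-04-24 /backups/mydisk.img   # optional off-cluster copy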
SLIDE 21

Ceph brings to MySQL: Efficient storage

  • Thin provisioned slaves
  • Thin provisioned PXC nodes
  • No need for a full copy per node
SLIDE 22

Ceph brings to MySQL: thin prov. slaves

SLIDE 23

Ceph brings to MySQL: thin prov. slaves(2)

Example:

  • 3 servers with 1TB of disk each, MySQL+OSD+MON
  • Dataset of 500GB
  • Normally, each node holds 500GB, so 1.5TB in total
  • Here, the master has 1TB (replicated)
  • Slaves have… only the delta since their last snapshot (see the provisioning sketch below)
  • Slaves can use non-replicated and even localized pools (tricky)
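
A minimal provisioning sketch, assuming the master's datadir lives on an image called master-data (names are illustrative and the MySQL side is simplified):

# rbd snap create Mypool/master-data@prov1                 # consistent snapshot of the master image
# rbd snap protect Mypool/master-data@prov1
# rbd clone Mypool/master-data@prov1 Mypool/slave1-data    # the slave only stores the delta
# rbd map Mypool/slave1-data
/dev/rbd0
# mount /dev/rbd0 /var/lib/mysql && systemctl start mysql
# ...then point the slave at the master (CHANGE MASTER TO) using the binlog
# position captured when the snapshot was taken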

SLIDE 24

Ceph brings to MySQL: thin prov. slaves(3)

  • But the delta will grow over time!!
  • Just re-provision the slaves
  • Demo!!!
SLIDE 25

Ceph brings to MySQL: thin prov. PXC

  • Very similar to the thin provisioned slaves
  • Need to be careful with non-replicated pools
  • wsrep-sst-ceph script (configuration sketch below)
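
A minimal configuration sketch, assuming the script follows the usual Galera convention of wsrep_sst_method=X invoking a wsrep_sst_X script; the actual option values live in the ceph-related-tools repository:

[mysqld]
# assumes the wsrep-sst-ceph script is installed as wsrep_sst_ceph in $PATH
wsrep_sst_method = ceph
# the datadir sits on a Ceph-backed mount, as shown earlier
datadir = /var/lib/mysql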
SLIDE 26

Can we go further? MySQL/Rados

SLIDE 27

MySQL/Rados: the idea

  • InnoDB pages are basically objects
  • Rados is an object store
  • What about modifying InnoDB to store its pages in Rados?
  • With Galera or group replication, servers could share the same pages
  • An open-source Aurora?
SLIDE 28

MySQL/Rados: RadosFS

  • A CERN project
  • A simple filesystem over Rados
  • Can write in chunks → atomic ops
  • Allows us not to care too much about filesystem stuff

SLIDE 29

MySQL/Rados: Modifying InnoDB

  • Mostly in os0file.cc
  • Initialization in srv0start.cc
  • Some changes for temporary InnoDB files
SLIDE 30

MySQL/Rados: Status

  • POC coded
  • Compiles OK
  • Strange SEGV deep in librados
  • Need help!
  • https://github.com/y-trudeau/percona-server-rados
  • tools: https://github.com/y-trudeau/ceph-related-tools
SLIDE 31

Questions?

SLIDE 32

Rate My Session