Forget everything you knew about Swift Rings (here's everything you - - PowerPoint PPT Presentation



SLIDE 1

Forget everything you knew about Swift Rings

(here's everything you need to know about Rings)

SLIDE 2

Your Ring Professors

  • Christian Schwede

○ Principal Engineer @ Red Hat
○ Stand up guy

  • Clay Gerrard

○ Programmer @ SwiftStack
○ Loud & annoying

SLIDE 3

Rings 201

  • Why Rings Matter
  • What are Rings
  • How Rings Work
  • How to use Rings
  • Ninja Swift Ring Tricks
  • MOAR Awesome Stuff

SLIDE 4

Swift 101

Looking for a more general intro to Swift?

  • Swift 101: https://youtu.be/vAEU0Ld-GIU
  • Building webapps with Swift: https://youtu.be/4bhdqtLLCiM
  • Stuff to read: https://www.swiftstack.com/docs/introduction/openstack_swift.html

SLIDE 5

One Ring To Rule Them All

SLIDE 6

Devops

Can be a wild ride

(diagram: Swift operators as "Ring Masters")

SLIDE 7

Ring Features

  • Devices & Servers
  • Zones
  • Regions
    ○ Multi-Region
    ○ Cross-Region
    ○ Local-Region
  • Storage Policies
SLIDE 8

Swift’s Rings use Simple Concepts

Consistent Hashing: introduced by Karger et al. at MIT in 1997

The same year HTTP/1.1 was first specified, in RFC 2068 (the better-known RFC 2616 revision followed in 1999)

SLIDE 9

Consistent what?

(diagram: example values 94104 and 27601 placed with a modulo function)

  • Just remember the distribution function
  • No growing lookup tables!
  • Easy to distribute!
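The "distribution function" idea can be sketched as bare hash-plus-modulo placement. This is illustrative only: Swift's real ring maps md5 bits to a precomputed partition table rather than using a bare modulo, and the function name and example keys here are mine.

```python
import hashlib

# Naive modulo placement: any node can recompute an object's location
# from the key alone -- no growing lookup table to ship around.
def node_for(key: bytes, num_nodes: int) -> int:
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % num_nodes

# Placement is deterministic, so every node agrees on the answer.
node_for(b"94104", 4)
```

The drawback, which consistent hashing addresses, is that changing `num_nodes` remaps almost every key.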
SLIDE 10

Partitions in Swift

  • Object namespace is mapped to a fixed number of partitions
  • Each partition holds one or more objects

/srv/node/sdd/objects/9193/488/1c...88/1476361774.53303.data
(partition dir / suffix dir / hash dir / timestamp.data)

  • Suffix dir: the last 3 chars of the hashed object name
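The path layout above can be reproduced with a short sketch. Hedged: real Swift also mixes per-cluster hash prefix/suffix values from swift.conf into the md5 and reads the partition from the first 4 bytes of the digest; the part power and the account/container/object names here are made up.

```python
import hashlib

PART_POWER = 14  # assumed for illustration

def object_location(account: str, container: str, obj: str):
    path = f"/{account}/{container}/{obj}".encode()
    digest = hashlib.md5(path).hexdigest()
    # the top PART_POWER bits of the 128-bit hash pick the partition
    partition = int(digest, 16) >> (128 - PART_POWER)
    suffix = digest[-3:]  # last 3 chars of the hashed name -> suffix dir
    return partition, suffix, digest

part, suffix, hashed = object_location("AUTH_test", "photos", "cat.jpg")
# on disk: /srv/node/<device>/objects/<part>/<suffix>/<hashed>/<timestamp>.data
```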

SLIDE 11

replica2part2dev_id

Part #   Replica #1   Replica #2   Replica #3
0        Device #0    Device #1    Device #3
1        Device #3    Device #0    Device #1
2        Device #3    Device #4    Device #2
3        Device #2    Device #0    Device #1
4        Device #1    Device #4    Device #3
...      ...          ...          ...

Swift's Address Book
SLIDE 12

How to look up a partition

Primary: get_nodes(part)
  Part #2 -> Device #3, Device #4, Device #2

Handoff: get_more_nodes(part)
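A toy version of the replica2part2dev_id table and the primary-node lookup. This is a sketch: the real Ring.get_nodes returns full device dicts rather than bare ids, and get_more_nodes walks handoff candidates beyond the primaries.

```python
# Toy replica2part2dev_id: 5 partitions, 3 replicas, devices 0..4,
# matching the table on the earlier slide.
replica2part2dev_id = [
    [0, 3, 3, 2, 1],  # replica #1's device for parts 0..4
    [1, 0, 4, 0, 4],  # replica #2
    [3, 1, 2, 1, 3],  # replica #3
]

def get_nodes(part: int):
    """Primary devices holding each replica of a partition."""
    return [r2p2d[part] for r2p2d in replica2part2dev_id]

get_nodes(2)  # → [3, 4, 2]
```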

SLIDE 13

What makes a good ring

A good ring has good

  • Dispersion
  • Balance
  • Overload (some, but not too much!)

Reassigned 215 (83.98%) partitions. Balance is now 11.35. Dispersion is now 83.98

SLIDE 14

Fundamental Constraints

  • Devices (disks)
  • Servers
  • Zones (racks)
  • Regions (datacenters)

A Failure Domain FAILS TOGETHER

These are tiers

SLIDE 15
SLIDE 16

Dispersion: a measurement of whether the failure domain of each replica of a partition is as unique as possible

SLIDE 17

Fundamental Constraints

Balance

SLIDE 18
SLIDE 19

"Rings are not pixie dust that magic data off of hard drives"

- Darrell

The Rebalance Process

SLIDE 20

Rebalance Introduces a Fault!

SLIDE 21

Fundamental Constraints

min_part_hours

Only move one replica of a partition per rebalance

SLIDE 22
SLIDE 23

Monitoring Replication Cycle

  • Only rebalance after a full replication cycle
  • swift-dispersion-report is your friend

Queried 8192 objects for dispersion reporting, ...
There were 3190 partitions missing 0 copy.
There were 5002 partitions missing 1 copy.
79.65% of object copies found (19574 of 24576)
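The report's percentage can be cross-checked from the counts it prints:

```python
# Cross-checking the swift-dispersion-report numbers above.
queried, replicas = 8192, 3
total_copies = queried * replicas         # 24576 copies expected
missing = 5002 * 1 + 3190 * 0             # partitions missing 1 copy / 0 copies
found = total_copies - missing            # 19574
pct = round(100 * found / total_copies, 2)
pct  # 79.65
```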

SLIDE 24

(chart: Partitions Assigned vs. GB Used: STARTING TO FILL!)

SLIDE 25

(chart: Primary and Handoff partition counts, from ring push to first replication cycle finished)

SLIDE 26

OVERLOAD

SLIDE 27

Balance vs. Dispersion

FIGHT!

SLIDE 28

1.5

REPLICANTHS: the decimal fraction of one replica's worth of partitions

SLIDE 29

3 Replicas / 5 "units" = .6 replicanths per unit
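The arithmetic above, in code (3 replicas spread across 5 equal units of weight):

```python
# Replicanths per unit: each of 5 equal-weight units should hold
# 3/5 = 0.6 of one replica's worth of partitions.
replicas = 3
units = 5
replicanths_per_unit = replicas / units
replicanths_per_unit  # 0.6
```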

SLIDE 30

~1 Replica

.6 + .6 + .6 + 1 = 2.8

SLIDE 31

~1 Replica

.6 => .66 (~11%)

2 Replicas
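The .6 => .66 step above corresponds to roughly a 10% overload factor (a sketch; the variable names and the exact 10% figure are my assumptions):

```python
# With overload o, a device may take up to (1 + o) times its
# weighted share of replicanths: 0.6 -> 0.66 at 10% overload.
replicanths = 0.6
overload = 0.10
max_replicanths = replicanths * (1 + overload)
round(max_replicanths, 2)  # 0.66
```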

SLIDE 32

Overload

Too much => DRIVES FILL UP
Not enough => CORRELATED DISASTER

Just use 10% … it'll probably be fine

(Hopefully it was cat pics?)

SLIDE 33

Partition POWER

SLIDE 34

Balancing the unknowns

  • How do you distribute objects of unknown size in a well-balanced way?
    ○ Objects vary between 0 bytes and 5 GiB in size
  • => Store more than one partition per disk
  • => Aggregating random sizes balances out
SLIDE 35

Disk fill level vs. partition count

(chart: Max / Avg / Min disk fill level)

SLIDE 36

Choosing partition power

  • The number of partitions is fixed
  • More disks => fewer partitions per disk
  • Choose a part power that gives ~1,000 partitions per disk
    ○ Based on today's need, not on imaginary future growth
  • It is highly unlikely that your partition power should be >> 20, and it is definitely not 32

https://gist.github.com/clayg/6879840
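The rule of thumb above can be sketched as a helper (the function name and formula are mine, derived from the ~1,000-parts-per-disk guideline):

```python
import math

def suggest_part_power(num_disks: int, replicas: int = 3,
                       parts_per_disk: int = 1000) -> int:
    """Smallest power of two giving roughly parts_per_disk per disk."""
    target_partitions = num_disks * parts_per_disk / replicas
    return math.ceil(math.log2(target_partitions))

suggest_part_power(32)  # → 14 (2^14 = 16384; 16384 * 3 / 32 = 1536 per disk)
```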

SLIDE 37

You became a unicorn

  • Skyrocketing growth? Congrats!
  • We're working on increasing partition power for you, to keep your cluster balanced: https://review.openstack.org/#/c/337297/
  • Decreasing won't be possible - at least not without serious downtime

SLIDE 38

Wrapping Up

SLIDE 39

What's a good cluster?

(diagram: Region 1 = main datacenter, with Zone 1 and Zone 2 built from 8 x 4000 disk groups; Region 2 = 2nd datacenter, one rack + switch, 6 x 5000 disk groups; 64 TB and 60 TB totals)

Disk weight: (64 + 64 + 60) / 3 = 62.66
Part power 14 -> 2^14 = 16384 partitions
16384 partitions * 3 replicas / 32 disks = 1536 parts per disk

Overload: 4.5%
Dispersion: 0
Balance: 4.65
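The example's arithmetic, checked in code:

```python
# Numbers from the example cluster on this slide.
avg_disk_weight = (64 + 64 + 60) / 3    # ~62.67 (slide shows 62.66, truncated)
partitions = 2 ** 14                    # part power 14 -> 16384
parts_per_disk = partitions * 3 // 32   # 3 replicas over 32 disks -> 1536
```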

SLIDE 40

Questions?

Thanks! clay@swiftstack.com cschwede@redhat.com