Data Checking at Dropbox (David Mah, Dropbox)


slide-1
SLIDE 1

Data Checking at Dropbox

David Mah Dropbox

slide-2
SLIDE 2

Problems we are tackling
Examples of Checkers
Generic Model for a Checker

slide-3
SLIDE 3

Our garbage collector had a rarely hit off-by-one bug

slide-4
SLIDE 4

Our garbage collector had a rarely hit off-by-one bug that resulted in removing user data that should not have been deleted

slide-5
SLIDE 5

The erasure encoding library we use actually is not thread-safe

slide-6
SLIDE 6

The erasure encoding library we use actually is not thread-safe, and in 0.0001% of re-encodes, we would corrupt our user data blocks

slide-7
SLIDE 7

As data passed through a particular machine

slide-8
SLIDE 8

As data passed through a particular machine, it would flip some bits of user data

slide-9
SLIDE 9

Some classes of problems

Conditions of Scale
Race Conditions
Hardware Unreliability

slide-10
SLIDE 10

Problems we are tackling
Examples of Checkers
Generic Model for a Checker

slide-11
SLIDE 11

Block Scrubber

[Checksum 1][Block 1] [Checksum 2][Block 2] ..

slide-12
SLIDE 12

Block Scrubber

[Checksum 1][Block 1] [Checksum 2][Block 2] ..

Loop over every block, recompute the checksum, compare
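
The scrubber loop described above can be sketched roughly as follows. This is a minimal illustration, not Dropbox's implementation: the `StoredBlock` type, the `scrub` function, and the choice of SHA-256 as the checksum are all assumptions, since the slides do not name the checksum algorithm or the on-disk format.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredBlock:
    checksum: bytes  # checksum recorded when the block was written
    data: bytes      # block contents as currently stored


def checksum(data: bytes) -> bytes:
    # SHA-256 is an assumption; the slides do not say which checksum is used.
    return hashlib.sha256(data).digest()


def scrub(blocks):
    """Return the indexes of blocks whose recomputed checksum does not match."""
    return [i for i, b in enumerate(blocks) if checksum(b.data) != b.checksum]


blocks = [
    StoredBlock(checksum(b"alpha"), b"alpha"),
    StoredBlock(checksum(b"beta"), b"beta"),
]
# Simulate a single flipped bit in the second block.
blocks[1].data = bytes([blocks[1].data[0] ^ 0x01]) + blocks[1].data[1:]
bad = scrub(blocks)
```

Here `bad` holds the index of the block whose stored data no longer matches its recorded checksum.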

slide-13
SLIDE 13

Hash Database Scanner

key → [server, server, server …]
key → [server, server, server …]
...

slide-14
SLIDE 14

Hash Database Scanner

key → [server, server, server …]
key → [server, server, server …]
...

Loop over every key, RPC to those servers, “Do you have this block?”
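
A rough sketch of that scan, under stated assumptions: `FakeServer` is an in-memory stand-in for a storage server, and its `has_block` method plays the role of the "Do you have this block?" RPC. Neither name comes from the slides, and a real scanner would make network calls instead.

```python
class FakeServer:
    """In-memory stand-in for a storage server (hypothetical)."""

    def __init__(self, keys):
        self._keys = set(keys)

    def has_block(self, key):
        # Stands in for the "Do you have this block?" RPC.
        return key in self._keys


def scan(hash_db, servers):
    """Return (key, server_name) pairs where a listed server lacks the block."""
    missing = []
    for key, server_names in hash_db.items():
        for name in server_names:
            if not servers[name].has_block(key):
                missing.append((key, name))
    return missing


servers = {
    "s1": FakeServer({"k1", "k2"}),
    "s2": FakeServer({"k1"}),  # s2 has lost block k2
    "s3": FakeServer({"k1", "k2"}),
}
hash_db = {"k1": ["s1", "s2", "s3"], "k2": ["s1", "s2", "s3"]}
missing = scan(hash_db, servers)
```

Each entry in `missing` names a key and a server that the database believes holds the block but that reports otherwise.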

slide-15
SLIDE 15

Filesystem Verifier

File Tree ID → [mutation 1, mutation 2, mutation 3, ...]
File Tree ID → [mutation 1, mutation 2, mutation 3, ...]
(a log of mutations)

slide-16
SLIDE 16

Filesystem Verifier

File Tree ID → [mutation 1, mutation 2, mutation 3, ...]
File Tree ID → [mutation 1, mutation 2, mutation 3, ...]
(a log of mutations)

Read in rows for a file tree, running 15-20 checks against each
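
The slides do not list the 15-20 checks, so the sketch below uses two invented ones purely to show the shape of "read a file tree's log, run every check against it". The `Mutation` fields, the operation names, and both check functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    seq: int   # hypothetical sequence number within the file tree's log
    op: str    # "add" or "delete" (hypothetical operation names)
    path: str


def check_seq_increasing(log):
    """Hypothetical check: sequence numbers must strictly increase."""
    return [
        f"seq {b.seq} follows {a.seq}"
        for a, b in zip(log, log[1:])
        if b.seq <= a.seq
    ]


def check_delete_existing(log):
    """Hypothetical check: a delete must target a path that currently exists."""
    live, problems = set(), []
    for m in log:
        if m.op == "add":
            live.add(m.path)
        elif m.op == "delete":
            if m.path not in live:
                problems.append(f"seq {m.seq} deletes missing path {m.path}")
            live.discard(m.path)
    return problems


CHECKS = [check_seq_increasing, check_delete_existing]


def verify_tree(log):
    """Run every check against one file tree's mutation log."""
    return {check.__name__: check(log) for check in CHECKS}


log = [Mutation(1, "add", "/a"), Mutation(2, "delete", "/b"), Mutation(2, "add", "/c")]
result = verify_tree(log)
```

Keeping the checks in a plain list makes it cheap to add another one, which fits the observation that the checks are individually simple but numerous.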

slide-17
SLIDE 17

What is the Pattern?

Loop over every ‘unit’
Run a sanity check for each
Not particularly complex
Quantity of checks is high...

slide-18
SLIDE 18

Problems we are tackling
Examples of Checkers
Generic Model for a Checker

slide-19
SLIDE 19

Data Model

Unit

slide-20
SLIDE 20

Data Model

Unit → []Check

slide-21
SLIDE 21

Data Model

Unit → []Check → []Violation

slide-22
SLIDE 22

Data Model

Unit → []Check → []Violation
Partition → []Unit

slide-23
SLIDE 23

Data Model

Unit → []Check → []Violation
Partition → []Unit
Run → []Partition
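
One way to read this data model is as a handful of small types. The sketch below is an interpretation, not the talk's code: the field names, the `run_checks` helper, and the `size_is_non_negative` check are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Violation:
    unit_id: str
    check_name: str
    detail: str


@dataclass
class Unit:
    unit_id: str
    payload: dict  # whatever is being checked (hypothetical shape)


# A Check inspects one Unit and returns zero or more Violations.
Check = Callable[[Unit], List[Violation]]


@dataclass
class Partition:
    name: str
    units: List[Unit] = field(default_factory=list)


@dataclass
class Run:
    run_id: int
    partitions: List[Partition] = field(default_factory=list)


def run_checks(run, checks):
    """Apply every check to every unit in every partition of the run."""
    found = []
    for partition in run.partitions:
        for unit in partition.units:
            for check in checks:
                found.extend(check(unit))
    return found


def size_is_non_negative(unit):
    # Hypothetical example check.
    if unit.payload.get("size", 0) < 0:
        return [Violation(unit.unit_id, "size_is_non_negative", "negative size")]
    return []


run = Run(0, [
    Partition("1", [Unit("u1", {"size": 10})]),
    Partition("2", [Unit("u2", {"size": -1})]),
])
violations = run_checks(run, [size_is_non_negative])
```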

slide-24
SLIDE 24

Check Scheduling

Split the dataset into partitions

slide-25
SLIDE 25

Check Scheduling

Split the dataset into partitions
For each partition, maintain a cursor

slide-26
SLIDE 26

Check Scheduling

Split the dataset into partitions
For each partition, maintain a cursor
Hand out cursors to check runners (use a distributed worker system)

slide-27
SLIDE 27

Check Scheduling

RunId: 0
Partition: “1”, Cursor: “a”
Partition: “2”, Cursor: “b”

slide-28
SLIDE 28

Check Scheduling

RunId: 0
Partition: “1”, Cursor: “a”
Partition: “2”, Cursor: “b”

CheckChunk(Partition, CursorStart)
Returns []Violation, CursorEnd
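
The cursor hand-off can be sketched as below. Only the `CheckChunk(Partition, CursorStart)` shape and its `[]Violation, CursorEnd` return come from the slide. The in-memory `DATASET`, the chunk size, the negative-value check, and the single-process `run_all` loop (standing in for the distributed worker system) are assumptions.

```python
# Hypothetical dataset: each partition is a key-ordered list of (key, value) units.
DATASET = {
    "1": [("a", 1), ("b", -2), ("c", 3)],
    "2": [("b", 5), ("d", -7)],
}
CHUNK_SIZE = 2


def check_chunk(partition, cursor_start):
    """Check up to CHUNK_SIZE units whose key is >= cursor_start.

    Returns (violations, cursor_end); cursor_end is None once the
    partition has been fully checked.
    """
    units = [u for u in DATASET[partition] if u[0] >= cursor_start]
    chunk, rest = units[:CHUNK_SIZE], units[CHUNK_SIZE:]
    violations = [
        f"{partition}/{key}: negative value" for key, value in chunk if value < 0
    ]
    cursor_end = rest[0][0] if rest else None
    return violations, cursor_end


def run_all(cursors):
    """Advance every partition's cursor until all partitions are exhausted."""
    cursors = dict(cursors)  # do not mutate the caller's mapping
    found = []
    while any(c is not None for c in cursors.values()):
        for partition, cursor in cursors.items():
            if cursor is None:
                continue
            violations, cursors[partition] = check_chunk(partition, cursor)
            found.extend(violations)
    return found


found = run_all({"1": "a", "2": "b"})
```

Because each call only needs a partition and a starting cursor, a runner that crashes can be replaced by another that resumes from the last saved cursor.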

slide-29
SLIDE 29

Reporting

Shove all Violations into a database.

Dashboard graphs:
Previous run: num violations per check
Current run: num violations per check
Current run: cursor progress

slide-30
SLIDE 30

Reporting

Shove all Violations into a database.

Dashboard graphs:
Previous run: num violations per check
Current run: num violations per check
Current run: cursor progress

Alert the team if nonzero
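
As an illustration of the reporting step, the sketch below stores violations in an in-memory SQLite table and derives the per-check counts a dashboard would graph. The table schema, the function names, and the use of SQLite are assumptions; the slides do not say which database or alerting system is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per violation.
conn.execute(
    "CREATE TABLE violations (run_id INTEGER, check_name TEXT, unit_id TEXT)"
)


def record(run_id, check_name, unit_id):
    conn.execute(
        "INSERT INTO violations VALUES (?, ?, ?)", (run_id, check_name, unit_id)
    )


def violations_per_check(run_id):
    """Return {check_name: count} for one run."""
    rows = conn.execute(
        "SELECT check_name, COUNT(*) FROM violations "
        "WHERE run_id = ? GROUP BY check_name ORDER BY check_name",
        (run_id,),
    )
    return dict(rows.fetchall())


def should_alert(run_id):
    # Alert when the run has any violations at all.
    return sum(violations_per_check(run_id).values()) > 0


record(0, "checksum_mismatch", "block-7")
record(1, "checksum_mismatch", "block-7")
record(1, "missing_replica", "key-3")
previous, current = violations_per_check(0), violations_per_check(1)
```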

slide-31
SLIDE 31

Remediation

Correction scripts are extremely dangerous!
Back up your data
After correction, re-run checks

slide-32
SLIDE 32

Checking the Checkers

Periodically, pick a unit and corrupt it
Make sure the checker detects it
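
A minimal sketch of this self-test, assuming a checksum-style checker: pick a block at random, corrupt it, and confirm the checker reports exactly that block. The function names and SHA-256 are assumptions, and corrupting a copy rather than live data is a choice made for this example, not something the slide specifies.

```python
import hashlib
import random


def checksum(data):
    # SHA-256 is an assumption for this example.
    return hashlib.sha256(data).digest()


def scrub(blocks):
    """Return indexes of (expected_checksum, data) pairs that do not match."""
    return [
        i for i, (expected, data) in enumerate(blocks) if checksum(data) != expected
    ]


def checker_detects_corruption(blocks, rng):
    """Corrupt one randomly chosen block in a copy and confirm it is reported."""
    test_blocks = list(blocks)  # work on a copy, never the originals
    victim = rng.randrange(len(test_blocks))
    expected, data = test_blocks[victim]
    corrupted = bytes([data[0] ^ 0xFF]) + data[1:]  # flip every bit of byte 0
    test_blocks[victim] = (expected, corrupted)
    return scrub(test_blocks) == [victim]


blocks = [(checksum(d), d) for d in (b"alpha", b"beta", b"gamma")]
ok = checker_detects_corruption(blocks, random.Random(0))
```

If `ok` is ever false, the checker itself is broken, which is exactly the failure this periodic test is meant to catch.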

slide-33
SLIDE 33

Thanks for stopping by!

David Mah
mah@dropbox.com