

SLIDE 1

Scaling for Humongous amounts of data with MongoDB

Alvin Richards
Technical Director, EMEA
alvin@10gen.com
@jonnyeight
alvinonmongodb.com

SLIDE 2

From here...

http://bit.ly/OT71M4

SLIDE 3

...to here...

http://bit.ly/Oxcsis

SLIDE 4

...without one of these.

http://bit.ly/cnP77L

SLIDE 5

Warning!

  • This is a technical talk
  • But MongoDB is very simple!
SLIDE 6

Solving real world data problems with MongoDB

  • Effective schema design for scaling
  • Linking versus embedding
  • Bucketing
  • Time series
  • Implications of sharding keys with alternatives
  • Read scaling through replication
  • Challenges of eventual consistency
SLIDE 7
A quick word from MongoDB sponsors, 10gen

  • Founded in 2007 by Dwight Merriman and Eliot Horowitz
  • $73M+ in funding: Flybridge, Sequoia, Union Square, NEA
  • Worldwide expanding team: 170+ employees in NY, CA, UK and Australia

10gen sets the direction of and contributes code to MongoDB, fosters the community and ecosystem, and provides MongoDB cloud and support services.

SLIDE 8

Since the dawn of the RDBMS

  • Main memory: in 1970, the Intel 1103 held 1k bits; in 2012, 4GB of RAM costs $25.99
  • Mass storage: the IBM 3330 Model 1 held 100 MB; now 3TB of SuperSpeed USB storage costs $129
  • Microprocessor: in 1970 the 4004 was still in development (4 bits, 92,000 instructions per second); the Westmere EX has 10 cores, 30MB of L3 cache, and runs at 2.4GHz

SLIDE 9

More recent changes

  • Faster: a decade ago, buy a bigger server; now, buy more servers
  • Faster storage: a SAN with more spindles; now, SSD
  • More reliable storage: a more expensive SAN; now, more copies of local storage
  • Deployed in: your data center; now, the cloud (private or public)
  • Large data set: millions of rows; now, billions to trillions of rows
  • Development: waterfall; now, iterative

SLIDE 10

http://bit.ly/Qmg8YD

SLIDE 11

Is Scaleout Mission Impossible?

  • What about the CAP Theorem?
  • Brewer's theorem: Consistency, Availability, Partition Tolerance
  • It says that if a distributed system is partitioned, you can't update everywhere and keep consistency
  • So, either allow inconsistency or limit where updates can be applied

SLIDE 12
What MongoDB solves

  • Cost effective way to operationalize abundant data (clickstreams, logs, tweets, ...)
  • Relaxed transactional semantics enable easy scale out
  • Auto Sharding for scale down and scale up
  • Applications store complex data that is easier to model as documents
  • Schemaless DB enables faster development cycles

Agility, Flexibility, Cost

SLIDE 13

Design Goal of MongoDB

[Chart: scalability & performance plotted against depth of functionality, with memcached and key/value stores at the high-scale end and the RDBMS at the deep-functionality end; MongoDB aims for both]

SLIDE 14

Schema Design at Scale

SLIDE 15

Design Schema for Twitter

  • Model each user's activity stream
  • Users
  • Name, email address, display name
  • Tweets
  • Text
  • Who
  • Timestamp
SLIDE 16

Solution A Two Collections - Normalized

// users - one doc per user
{
  _id: "alvin",
  email: "alvin@10gen.com",
  display: "jonnyeight"
}

// tweets - one doc per user per tweet
{
  user: "bob",
  tweet: "20111209-1231",
  text: "Best Tweet Ever!",
  ts: ISODate("2011-09-18T09:56:06.298Z")
}
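With the normalized schema, rendering a user's stream takes two round trips, one per collection. A minimal shell sketch using the documents above:

// fetch the user
db.users.findOne( { _id: "alvin" } )

// fetch that user's latest tweets
db.tweets.find( { user: "alvin" } ).sort( { ts: -1 } ).limit(10)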

SLIDE 17

Solution B Embedded - Array of Objects

// users - one doc per user with all tweets
{
  _id: "alvin",
  email: "alvin@10gen.com",
  display: "jonnyeight",
  tweets: [
    {
      user: "bob",
      tweet: "20111209-1231",
      text: "Best Tweet Ever!",
      ts: ISODate("2011-09-18T09:56:06.298Z")
    }
  ]
}
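With the embedded schema, adding a tweet is a single in-place update. A sketch using $push (the tweet values here are made up for illustration):

db.users.update(
  { _id: "alvin" },
  { $push: { tweets: { user: "bob",
                       tweet: "20111209-1232",   // hypothetical tweet id
                       text: "Another tweet",
                       ts: new Date() } } } )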

SLIDE 18

Embedding

  • Great for read performance
  • One seek to load entire object
  • One roundtrip to database
  • Object grows over time when adding child objects
SLIDE 19

Linking or Embedding?

Linking can make some queries easy

// Find latest 10 tweets for "alvin"
> db.tweets.find( { user: "alvin" } )
           .sort( { ts: -1 } )
           .limit(10)

But what effect does this have on the system?

SLIDE 20

[Diagram: a collection (Collection 1) and its index (Index 1)]

SLIDE 21

[Diagram: Collection 1 and Index 1 mapped into Virtual Address Space 1]

This is your virtual memory size (mapped).

SLIDE 22

[Diagram: the pages of Virtual Address Space 1 currently held in Physical RAM]

This is your resident memory size.

SLIDE 23

[Diagram: Virtual Address Space 1 backed by Physical RAM, with the full Collection 1 and Index 1 on Disk]

SLIDE 24

[Diagram: Virtual Address Space 1 backed by Physical RAM and Disk]

RAM access = 100 ns; disk access = 10,000,000 ns

SLIDE 25

[Diagram: the query faults in pages 1, 2, 3 scattered across Virtual Address Space 1, each pulled from Disk into Physical RAM]

db.tweets.find( { user: "alvin" } )
         .sort( { ts: -1 } )
         .limit(10)

Linking = Many Random Reads + Seeks

SLIDE 26

[Diagram: the query reads one contiguous region of Virtual Address Space 1]

db.tweets.find( { _id: "alvin" } )

Embedding = Large Sequential Read

SLIDE 27

Problems

  • Large sequential reads
  • Good: disks are great at sequential reads
  • Bad: may read too much data
  • Many random reads
  • Good: ease of query
  • Bad: disks are poor at random reads (SSD?)
SLIDE 28

Solution C Buckets

// tweets - one doc per user per day
> db.tweets.findOne()
{
  _id: "alvin-2011/12/09",
  email: "alvin@10gen.com",
  tweets: [
    { user: "Bob",
      tweet: "20111209-1231",
      text: "Best Tweet Ever!" },
    { author: "Joe",
      date: "May 27 2011",
      text: "Stuck in traffic (again)" }
  ]
}
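New tweets are appended to the current day's bucket; an upsert creates the bucket on its first write. A sketch, assuming the "<user>-<yyyy/mm/dd>" _id convention shown above:

db.tweets.update(
  { _id: "alvin-2011/12/09" },
  { $push: { tweets: { user: "bob",
                       tweet: "20111209-1232",   // hypothetical tweet id
                       text: "Yet another tweet" } } },
  true )  // upsert: create the day's bucket if it does not exist yet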

SLIDE 29

Solution C Last 10 Tweets

// Get the latest bucket, slice the last 10 tweets
db.tweets.find( { _id: "alvin-2011/12/09" },
                { tweets: { $slice: -10 } } )
         .sort( { _id: -1 } )
         .limit(1)

SLIDE 30

[Diagram: the query reads one small contiguous region of Virtual Address Space 1]

db.tweets.find( { _id: "alvin-2011/12/09" },
                { tweets: { $slice: -10 } } )
         .sort( { _id: -1 } )
         .limit(1)

Bucket = Small Sequential Read

SLIDE 31

Sharding - Goals

  • Data location transparent to your code
  • Data distribution is automatic
  • Data re-distribution is automatic
  • Aggregate system resources horizontally
  • No code changes
SLIDE 32

Sharding - Range distribution

[Diagram: three empty shards - shard01, shard02, shard03]

sh.shardCollection( "test.tweets", { _id: 1 }, false )

SLIDE 33

Sharding - Range distribution

[Diagram: shard01 holds a-i, shard02 holds j-r, shard03 holds s-z]

SLIDE 34

Sharding - Splits

[Diagram: the j-r range splits: shard01 holds a-i, shard02 holds ja-jz and k-r, shard03 holds s-z]

SLIDE 35

Sharding - Splits

[Diagram: further splits: shard01 holds a-i, shard02 holds ja-ji, ji-js, js-jw and jz-r, shard03 holds s-z]

SLIDE 36

Sharding - Auto Balancing

[Diagram: the balancer migrates the js-jw and jz-r chunks off the overloaded shard02]

SLIDE 37

Sharding - Auto Balancing

[Diagram: after balancing: shard01 holds a-i and ja-ji, shard02 holds ji-js and js-jw, shard03 holds jz-r and s-z]

SLIDE 38

How does sharding affect Schema Design?

  • Sharding key choice
  • Access patterns (query versus write)
SLIDE 39

Sharding Key

{ photo_id : ???? , data : <binary> }

  • What's the right key? (each candidate is sketched below)
  • auto increment
  • MD5( data )
  • month() + MD5( data )
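For illustration only, here is how each candidate might be declared, assuming photos live in a test.photos namespace with md5 and month fields precomputed by the application:

sh.shardCollection( "test.photos", { photo_id: 1 } )       // auto increment
sh.shardCollection( "test.photos", { md5: 1 } )            // MD5( data )
sh.shardCollection( "test.photos", { month: 1, md5: 1 } )  // month() + MD5( data )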
SLIDE 40
Right balanced access

  • Only have to keep a small portion of the index in RAM
  • Right shard "hot"
  • Examples: time based, ObjectId, auto increment

SLIDE 41
Random access

  • Have to keep the entire index in RAM
  • All shards "warm"
  • Example: hash

SLIDE 42
Segmented access

  • Have to keep some of the index in RAM
  • Some shards "warm"
  • Example: month + hash

SLIDE 43

Solution A Shard by a single identifier

{ _id : "alvin", // shard key email: "alvin@10gen.com", display: "jonnyeight" li: "alvin.j.richards", tweets: [ ... ] } Shard on { _id : 1 } Lookup by _id routed to 1 node Index on { “email” : 1 }

SLIDE 44

Sharding - Routed Query

[Diagram: shard01 holds a-i and ja-ji, shard02 holds ji-js and js-jw, shard03 holds jz-r and s-z]

find( { _id: "alvin" } )

SLIDE 45

Sharding - Routed Query

[Diagram: the query is routed only to shard01, whose a-i range contains "alvin"]

find( { _id: "alvin" } )

SLIDE 46

Sharding - Scatter Gather

[Diagram: same chunk layout as above]

find( { email: "alvin@10gen.com" } )

SLIDE 47

Sharding - Scatter Gather

[Diagram: the query is broadcast to every shard, because email is not the shard key]

find( { email: "alvin@10gen.com" } )

SLIDE 48

Multiple Identities

  • User can have multiple identities
  • twitter name
  • email address
  • etc.
  • What is the best sharding key & schema design?
SLIDE 49

Solution B Shard by multiple identifiers

// identities
{ type: "_id", val: "alvin",            info: "1200-42" }
{ type: "em",  val: "alvin@10gen.com",  info: "1200-42" }
{ type: "li",  val: "alvin.j.richards", info: "1200-42" }

// tweets
{ _id: "1200-42", tweets: [ ... ] }

  • Shard identities on { type: 1, val: 1 }
  • Lookup by type & val routed to 1 node
  • Can create unique index on type & val
  • Shard info on { _id: 1 }
  • Lookup info on _id routed to 1 node (see the sketch below)
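A sketch of that two-step lookup and the supporting commands, assuming test.identities and test.tweets namespaces:

sh.shardCollection( "test.identities", { type: 1, val: 1 } )
sh.shardCollection( "test.tweets", { _id: 1 } )
db.identities.ensureIndex( { type: 1, val: 1 }, { unique: true } )

// resolve the identity, then fetch the activity - each query routed to one shard
var ident = db.identities.findOne( { type: "em", val: "alvin@10gen.com" } )
db.tweets.findOne( { _id: ident.info } )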
SLIDE 50

Sharding - Routed Query

[Diagram: identities chunks { type: "_id", val: a-z }, { type: "em", val: a-q }, { type: "em", val: r-z }, { type: "li", val: a-c }, { type: "li", val: d-r }, { type: "li", val: s-z } and tweets chunks "Min"-"1100", "1100"-"1200", "1200"-"Max" spread across shard01, shard02 and shard03]

SLIDE 51

Sharding - Routed Query

[Diagram: same chunk layout; the lookup is routed to the single shard holding { type: "em", val: a-q }]

find( { type: "em", val: "alvin@10gen.com" } )

SLIDE 52

Sharding - Routed Query

[Diagram: same chunk layout; each lookup is routed to a single shard]

find( { type: "em", val: "alvin@10gen.com" } )
find( { _id: "1200-42" } )

SLIDE 53

Sharding - Caching

[Diagram: a single shard01 holding the full a-z range]

300 GB of data against 96 GB of memory: a 3:1 data-to-memory ratio

SLIDE 54

Aggregate Horizontal Resources

[Diagram: a-i on shard01, j-r on shard02, s-z on shard03]

300 GB of data split into 100 GB per shard, each with 96 GB of memory: roughly a 1:1 data-to-memory ratio

SLIDE 55

Auto Sharding - Summary

  • Fully consistent
  • Application code unaware of data location
  • Zero code changes
  • Shard by Compound Key, Tag, Hash (2.4) (sketched after this list)
  • Add capacity
  • On-line
  • When needed
  • Zero downtime
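Sketches of the tag and hash variants (tag-aware commands are 2.2-era, hashed keys arrived in 2.4; shard and tag names are illustrative):

// tag aware sharding: pin a range to tagged shards
sh.addShardTag( "shard01", "EMEA" )
sh.addTagRange( "test.tweets", { _id: "a" }, { _id: "m" }, "EMEA" )

// hashed shard key (2.4): random, evenly spread distribution
sh.shardCollection( "test.tweets", { _id: "hashed" } )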
SLIDE 56

Time Series Data

  • Records votes by
  • Day, Hour, Minute
  • Show time series of votes cast
SLIDE 57

Solution A Time Series

// Time series buckets, hour and minute sub-docs
{
  _id: "20111209-1231",
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  daily: 67,
  hourly: { 0: 3, 1: 14, 2: 19, ..., 23: 72 },
  minute: { 0: 0, 1: 4, 2: 6, ..., 1439: 0 }
}

// Add one to the last minute before midnight
> db.votes.update(
    { _id: "20111209-1231",
      ts: ISODate("2011-12-09T00:00:00.037Z") },
    { $inc: { "hourly.23": 1, "minute.1439": 1, daily: 1 } } )
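Since an $inc that creates a missing key grows the document and can force it to be moved, a common companion pattern is to pre-allocate each day's bucket with zeroed counters so every update is in-place. A sketch of that pattern, which is an assumption rather than something shown in the talk:

// pre-create tomorrow's bucket with zeroed minute counters
var minutes = {};
for (var m = 0; m < 1440; m++) minutes[m] = 0;
db.votes.insert( { _id: "20111210-1231",
                   ts: ISODate("2011-12-10T00:00:00.000Z"),
                   daily: 0,
                   minute: minutes } )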

SLIDE 58
BSON Storage

  • Sequence of key/value pairs
  • NOT a hash map
  • Optimized to scan quickly

[Diagram: minute keys 0, 1, 2, 3, ..., 1439 laid out as one flat sequence]

What is the cost of updating the minute before midnight? The scan has to walk past all 1,439 preceding keys to reach it.

SLIDE 59
BSON Storage

  • Can skip sub-documents

[Diagram: minutes grouped into hour sub-documents 1 to 23, each holding 60 keys (0-59, 60-119, ..., 1380-1439); a scan can jump over whole hours]

How could this change the schema?

SLIDE 60

Solution B Time Series

// Time series buckets, each hour a sub-document
{
  _id: "20111209-1231",
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  daily: 67,
  minute: {
    0:  { 0: 0, 1: 7, ..., 59: 2 },
    ...,
    23: { 0: 15, ..., 59: 6 }
  }
}

// Add one to the last minute before midnight
> db.votes.update(
    { _id: "20111209-1231",
      ts: ISODate("2011-12-09T00:00:00.000Z") },
    { $inc: { "minute.23.59": 1, daily: 1 } } )
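A read then only needs the hour it cares about; dotted-path projection pulls back a single sub-document:

// fetch only hour 23's minute counters for the day
db.votes.findOne( { _id: "20111209-1231" },
                  { "minute.23": 1 } )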

SLIDE 61

Replica Sets

  • Data Protection
  • Multiple copies of the data
  • Spread across Data Centers, AZs
  • High Availability
  • Automated Failover
  • Automated Recovery (a minimal configuration sketch follows)
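A minimal three-member configuration sketch (hostnames are placeholders):

rs.initiate( {
  _id: "rs0",
  members: [
    { _id: 0, host: "db0.example.net" },  // data center A
    { _id: 1, host: "db1.example.net" },  // data center A
    { _id: 2, host: "db2.example.net" }   // data center B / other AZ
  ]
} )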
SLIDE 62

Replica Sets

[Diagram: the App writes to the Primary and reads from the Primary and both Secondaries; writes reach the Secondaries via asynchronous replication]

SLIDE 63

Replica Sets

Primary Secondary Secondary

Read Write Read Read

App

SLIDE 64

Replica Sets

Primary Primary Secondary

Read Write Read Automatic Election of new Primary

App

SLIDE 65

Replica Sets

Recovering Primary Secondary

Read Write Read New primary serves data

App

SLIDE 66

Replica Sets

Secondary Primary Secondary

Read Write Read Read

App

SLIDE 67

Replica Sets - Summary

  • Data Protection
  • High Availability
  • Scaling eventually consistent reads
  • Source to feed other systems
  • Backups
  • Indexes (Solr etc.)
SLIDE 68

Types of Durability with MongoDB

  • Fire and forget
  • Wait for error
  • Wait for fsync
  • Wait for journal sync
  • Wait for replication (each level is sketched below)
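A hedged shell sketch of these levels using the 2.x-era getLastError command, which checks the last write on the connection:

db.tweets.insert( { text: "hello" } )              // fire and forget: no check at all
db.runCommand( { getLastError: 1 } )               // wait for error (w: 1)
db.runCommand( { getLastError: 1, fsync: true } )  // wait for fsync
db.runCommand( { getLastError: 1, j: true } )      // wait for journal sync
db.runCommand( { getLastError: 1, w: 2 } )         // wait for replication to a secondary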
SLIDE 69

Least durability - Don't use!

[Diagram: the Driver sends a write to the Primary, which applies it in memory; no acknowledgment is requested]

SLIDE 70

More durability

[Diagram: the Driver sends a write with w:2 and calls getLastError; the Primary applies the write in memory, and the acknowledgment returns only after a Secondary has replicated it]

SLIDE 71

Durability Summary

Less durable -> More durable:

  • Memory: "Fire & Forget", w=1
  • Journal: w=1 plus j=true (comparable to an RDBMS default)
  • Secondary: w="majority" or w=n
  • Other Data Center: w="myTag"

SLIDE 72

Eventual Consistency Using Replicas for Reads

slaveOk()

  • driver will send read requests to Secondaries
  • driver will always send writes to Primary

Java examples

  • DB.slaveOk()
  • Collection.slaveOk()
  • find(q).addOption(Bytes.QUERYOPTION_SLAVEOK)
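Shell equivalents from the same era:

rs.slaveOk()                // allow secondary reads on this shell connection
db.getMongo().setSlaveOk()  // the underlying connection-level call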
SLIDE 73

Understanding Eventual Consistency

[Diagram: Application #1 inserts v1 on the Primary, updates it to v2, and reads v2 back; v1 and v2 reach the Secondary later via asynchronous replication]

SLIDE 74

Understanding Eventual Consistency

[Diagram: Application #1 writes v1 then v2 to the Primary while Application #2 reads from the Secondary; depending on replication lag, a read returns v2, v1, or finds no document at all]

SLIDE 75

Product & Roadmap

SLIDE 76

The Evolution of MongoDB

  • 1.8 (March '11): Journaling, Sharding and Replica set enhancements, Spherical geo search
  • 2.0 (Sept '11): Index enhancements to improve size and performance, Authentication with sharded clusters, Replica Set enhancements, Concurrency improvements
  • 2.2 (Aug '12): Aggregation Framework, Multi-Data Center Deployments, Improved performance and concurrency
  • 2.4 (winter '12): in development

SLIDE 77