ToroDB Stampede: Open-source BI connector for MongoDB lvaro - - PowerPoint PPT Presentation

torodb stampede open source bi connector for mongodb
SMART_READER_LITE
LIVE PREVIEW

ToroDB Stampede: Open-source BI connector for MongoDB lvaro - - PowerPoint PPT Presentation

ToroDB Stampede: Open-source BI connector for MongoDB lvaro Hernndez Tortosa <aht@8kdata.com> Percona live 2017 Who I am CEO, 8Kdata.com What we do @8Kdata: ALVARO HERNANDEZ Creators of ToroDB.com, NoSQL & SQL


slide-1
SLIDE 1

ToroDB Stampede: Open-source BI connector for MongoDB

Álvaro Hernández Tortosa <aht@8kdata.com> Percona live 2017

slide-2
SLIDE 2

Who I am

ALVARO HERNANDEZ

CEO, 8Kdata.com

Founder, President Spanish Postgres User Group postgrespana.es ~ 750 members

  • What we do @8Kdata:

✓Creators of ToroDB.com, NoSQL & SQL database ✓Database R&D, product development ✓Training and consulting in PostgreSQL ✓PostgreSQL Support

Linkedin: http://es.linkedin.com/in/alvarohernandeztortosa/ Twitter: @ahachete

slide-3
SLIDE 3

MongoDB Connector for BI

slide-4
SLIDE 4

MongoDB BI Connector

  • https://docs.mongodb.com/bi-

connector/master/

  • Developed by MongoDB themselves
  • Effectively allows to query mongo in a

SQL-ish way But...

slide-5
SLIDE 5

MongoDB BI Connector

  • How do they do it? You don’t know it.
  • Harder to detect bugs
  • They’re part of bigger solutions which

may or may not fit you

  • You need Enterprise version, you are

paying for the whole solution

slide-6
SLIDE 6

MongoDB does *not* have consistent reads

https://blog.meteor.com/mongodb-queries-dont-always-return-all-matching-documents-654b6594a827#.fplxodagr

slide-7
SLIDE 7

ToroDB Stampede

slide-8
SLIDE 8

ToroDB: from MongoDB to PostgreSQL

slide-9
SLIDE 9

How do we generate the schema?

{ name: “Alice”, address: { street: “Wonderland”, number: 42 }, hobbies: [ “reading”, “talking to rabbits” ] }

★There’s no need to generate it, it’s implicit!

implicit schema

slide-10
SLIDE 10

How do we generate the schema?

  • ToroDB analyzes every incoming document and separates

metadata (schema) from data (tuples)

  • With that schema info, ToroDB creates 1+ tables per

MongoDB collection

  • ToroDB also creates a RDBMS catalog schema per MongoDB

database

slide-11
SLIDE 11

How do we generate the schema?

did name_s _id_x 1 Alice \x01d4a63dc6944f37a61cf3d6

SELECT * FROM torodb.people; SELECT * FROM torodb.people_address;

did name_s seq number_d street_s 1 1 42 Wonderland

SELECT * FROM torodb.people_hobbies;

did rid seq value_s 1 1 “reading” 1 2 1 “talking to rabbits”

database: torodb collection: people { name: “Alice”, _id: (implicit val), address: { street: “Wonderland”, number: 42 }, hobbies: [ “reading”, “talking to rabbits” ] }

slide-12
SLIDE 12

How to react to dynamic changes in the schema?

[ { name: “Alice” }, { name: “Harry”, surname: “Potter” } ]

★What happens if a new document in the collection brings new fields?

slide-13
SLIDE 13

How to react to dynamic changes in the schema?

★ToroDB automatically performs: ✓ALTER TABLE ... ADD COLUMN surname_s text

did _id_x name_s surname_s 1 \x01d4a63dc6944f37a61cf3d6 Alice 2 \x21f4f563dc6944f37a61cf436 Harry Potter

slide-14
SLIDE 14

How to react to dynamic changes in the schema?

[ { name: “Harry”, surname: “Potter” } { name: “Henry”, surname: 8 }, ]

★What happens if a new document in the collection brings type conflicts?

slide-15
SLIDE 15

How to react to dynamic changes in the schema?

★ Type conflicts are resolved by suffixing column names with the type of the

data: "s" for text, "i" for integer and so forth

did _id_x name_s surname_s surname_i 1 \x01d4a63dc6944f37a61cf3d6 Harry Potter 2 \x21f4f563dc6944f37a61cf436 Henry 8

slide-16
SLIDE 16

ToroDB Architecture

ToroDB ToroDB works as a secondary node on a MongoDB replica set

This is your SQL replica!

slide-17
SLIDE 17

ToroDB Architecture

  • ToroDB works as a secondary node on a replica set
  • This node has two main modes:

✓ First, it imports all MongoDB data into PostgreSQL ✓ Then it stays tuned to changes in MongoDB Oplog and applies them in PostgreSQL

slide-18
SLIDE 18

Backends Data2Relational Abstract Doc Protocols

MongoDB*, Couchbase, Cassandra… Here is where ‘the magic’ happens transforming Docs into Tables and Rows{ PostgreSQL*, MySQL, SQL Server, Oracle…

ToroDB offers the interface of any NoSQL database, but persisting data in tables and rows within a SQL database.

* currently available

ToroDB Architecture

slide-19
SLIDE 19

Measuring I/O with iotop required to answer a query

Github Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 125 250 375 500 MongoDB PostgreSQL PostgreSQL + cstore

6,5 87,93 536,36

Exactly 100% of the storageSize!

MongoDB storageSize: 536.37 MB

db.githubarchive.aggregate([ { $group: { _id: '$actor.login', events: { $sum: 1 } } }, { $sort: { events: -1 }}, { $limit: 10 } ]) SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10;

MongoDB size: 1410,35 MB

slide-20
SLIDE 20

Building the benchmarks

  • Data from Github archive
  • Timing three different environments that contain the same data:

✓ MongoDB ✓ MongoDB with 3 shards ✓ PostgreSQL (obtained by migrating MongoDB environment via ToroDB)

  • We run 6 queries (PostgreSQL queries being the relational

equivalent to MongoDB queries) ✓ Available at https://gist.github.com/ahachete

slide-21
SLIDE 21

A B C D E F

Seconds (Less is better)

4.000 8.000 12.000 16.000

1.021 1.418 3.691 1.191 1.184 621 4.126 4.732 14.504 4.253 4.333 1.851 402 1.650 254 239 83 880

PostgreSQL MongoDB MongoDB (3 Shards)

Github 500GB

57x

slide-22
SLIDE 22

A B C D E F

Seconds (Less is better)

750 1.500 2.250 3.000

117 193 383 163 194 89 664 768 2.405 747 750 361 53 240 9 38 13 34

PostgreSQL MongoDB MongoDB (3 Shards)

Github 100GB

That’s 267x faster!

slide-23
SLIDE 23

Github 100GB no idx

Important because many BI queries don’t use indexes A B C D E F

Seconds (Less is better)

200 400 600 800

164 211 158 193 209 199 667 793 658 668 694 715 53 241 7 37 13 31

PostgreSQL MongoDB MongoDB (3 Shards) 94x

slide-24
SLIDE 24

What's on my MongoDB database?

slide-25
SLIDE 25

Stampede infers and clearly shows you your data

slide-26
SLIDE 26

Or use SQL tools to show your schema

slide-27
SLIDE 27

Installation

Prerequisites

ToroDB Stampede

  • PostgreSQL

  • Java SE 8

MongoDB connector for BI

  • MongoDB Enterprise 3.2+

  • OpenSSL
slide-28
SLIDE 28

ToroDB Stampede

wget https://www.torodb.com/download/torodb-stampede-1.0.0- beta1.tar.bz2 tar xjf torodb-stampede-1.0.0-beta1.tar.bz2

MongoDB connector for BI

wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64- ubuntu1604-3.4.3.tgz tar xjf mongodb-linux-x86_64-ubuntu1604-3.4.3.tgz sudo install -m755 bin/mongo* /usr/local/bin/

Installation

slide-29
SLIDE 29

MongoDB connector for BI

  • You have to create an schema for

each database (or collection).

mongodrdl --host example.com

  • d dbname [-c collname]
  • o schema.drdl
  • You can edit manually…

... It’s something like this

schema:

  • db: database

tables:

  • table: name_table

collection: name_collection pipeline: [] columns:

  • Name: _id

MongoType: bson.ObjectId SqlName: _id SqlType: varchar

  • Name: column_name

MongoType: type SqlName: sqlname SqlType: sqltype

Configuration

slide-30
SLIDE 30

ToroDB Stampede

  • Requires a new user (‘torodb') and a new database (’torod') in

PostgreSQL.

  • Access configuration to PostgreSQL detailed in .toropass:

host:port:torod:torodb:<password>

  • You can also check the configuration used by ToroDB Stampede using the
  • l parameter:

torodb-stampede -l

Configuration

slide-31
SLIDE 31

ToroDB Stampede

$TOROHOME/bin/torodb-stampede

  • And it starts the automagically replication

MongoDB connector for BI

mongosqld --schema schema.drdl --mongo-uri <your.mongohost.com>

  • You need one schema per database
  • It enables a MySQL connection

Start!

slide-32
SLIDE 32

ToroDB Stampede

  • Creates new columns and

tables if it’s necessary

MongoDB connector for BI

  • You will surely have to edit

manually the mongodrdl schema: ➡ mongodrdl infers the schema by sampling

Problem resolution

Insert Documents with new fields

slide-33
SLIDE 33

ToroDB Stampede

  • The column names contain

a postfix, indicating the data type ➡ If one path contains two different data types then two different columns are created.

MongoDB connector for BI

  • You should execute mongodrdl

again, but surely you don’t have the new type, so: ➡ You have to edit the schema and add a new field with different sql name.

Problem resolution

Type Conflicts

slide-34
SLIDE 34

Problem resolution

Array / Document exploding MongoDB Connector for BI

database: stampede collection: restaurants { _id: (implicit val), name: “Glorious Food” address: { street:”Wonderland”, number: 42, coord: [ -73.9832324, 40.6563633] } }

_id name address.street address.number

01d4a63 dc6944f

“Glorious Food” “Wonderland” 42

stampede.restaurants stampede.restaurants_address_coord

_id address.coord address.coord_idx

01d4a63 dc6944f

  • 73.9832324

01d4a63 dc6944f

40.6563633 1

slide-35
SLIDE 35

Problem resolution

Array / Document exploding ToroDB Stampede

did _id_x name_s address_e

\x01d4a “Glorious Food”

f

stampede.restaurants stampede.restaurants_address

did rid seq street_s number_d coord_e “Wonderland” 42 t

database: stampede collection: restaurants { _id: (implicit val), name: “Glorious Food” address: { street:”Wonderland”, number: 42, coord: [ -73.9832324, 40.6563633] } }

stampede.restaurants_address_coord

did rid pid seq v_d

  • 73,9832324

1 1 40.6563633

slide-36
SLIDE 36

Use Case with…

M

  • n

g

  • D

B B I 1 . r

  • w

s 8 2 . 4 1 2 m s ( 8 x ! ! ) PostgreSQL 100.000 rows 1.008 ms

SELECT * FROM twitter.message INNER JOIN twitter.message_user ON twitter.message.did = twitter.message_user.did LIMIT 100000 SELECT * FROM `twitter.message` LIMIT 100000

slide-37
SLIDE 37

Use Case With Tableau

Top ten users sending messages

We ran the following query:

SELECT screen_name_s, count(1) FROM (SELECT screen_name_s FROM twitter.message_user LIMIT 1000000) AS users GROUP BY 1 ORDER BY 2 desc LIMIT 10

PostgreSQL 3.791 ms MongoDB 7.107 ms (1,8x)

slide-38
SLIDE 38

Use Case With Tableau

Top user locations

We ran the with the following query:

SELECT location_s, count(1) FROM (SELECT location_s FROM twitter.message_user LIMIT 10000000) AS locations GROUP BY 1 ORDER BY 2 desc LIMIT 20

PostgreSQL 15.452 ms MongoDB 72.536 ms (4,7x)

slide-39
SLIDE 39

Use Case With Tableau

Top Hashtags

We did the with the following query:

SELECT text_s, count(1) FROM (SELECT text_s FROM twitter.message_entities_hashtags LIMIT 10000000)) AS hashtags GROUP BY 1 ORDER BY 2 desc LIMIT 20

PostgreSQL 4.819 ms MongoDB 305.021 ms (64x !!!)

slide-40
SLIDE 40

Summary

MongoDB Connector for BI ToroDB Stampede Installation Easy and fast Easy and fast Configuration Need to create schema Easy and fast Querying Indirect (convert to NoSQL) Direct (Keeps data synchronized) Adaptability Manual update schema for new fields or types Adapts automagically to changes in schema Query capabilities Limited, requires many self- joins (flattened schema) Advance, schema looks very natural Speed x Up to 100x faster License MongoDB Enterprise Open Source

slide-41
SLIDE 41

One more thing…

+

slide-42
SLIDE 42

DEMO

slide-43
SLIDE 43

DEMO

slide-44
SLIDE 44

DEMO

slide-45
SLIDE 45

DEMO

slide-46
SLIDE 46

DEMO

Tables:

slide-47
SLIDE 47

DEMO

slide-48
SLIDE 48

DEMO

slide-49
SLIDE 49

ToroDB Stampede: Open-source BI connector for MongoDB

Let’s Talk!

www.8kdata.com info@8kdata.com