[PPT] - Big Data Management and NoSQL Databases Lecture 12 PD Dr. Andreas PowerPoint Presentation

SLIDE 1

Big Data Management and NoSQL Databases

Lecture 12

PD Dr. Andreas Behrend

behrend@cs.uni-bonn.de

Acknowledgements: I am indebted to Prof. Dr.-Ing. Sebastian Michel, Prof. Johan Gamper, and Dr. Holubova for providing me with slides.

SLIDE 2

What is Big Data?

• buzzword?
• bubble?
• gold rush?
• revolution?

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” (Dan Ariely)

SLIDE 3

What is Big Data?

• No standard definition
• First occurrence of the term: High Performance Computing (HPC)

Gartner: “Big Data” is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.

The 3 (4, 5) Vs of Big Data: Volume, Variety, Velocity

SLIDE 4

What is Big Data?

IBM: Depending on the industry and organization, Big Data encompasses information from internal and external sources such as transactions, social media, enterprise content, sensors, and mobile devices. Companies can leverage data to adapt their products and services to better meet customer needs, optimize operations and infrastructure, and find new sources of revenue.

http://www.ibmbigdatahub.com/

• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)

SLIDE 5

Big Data Characteristics: Volume (Scale)

http://www.ibmbigdatahub.com/

Data volume is increasing exponentially, not linearly (figure labels: 10^9, 10^12, 10^18, 10^21)

SLIDE 6

Big Data Characteristics: Variety (Complexity)

http://www.ibmbigdatahub.com/

Various formats, types, and structures (from semi-structured XML to unstructured multimedia)
Static data vs. streaming data
(figure labels: 10^9, 10^18)

SLIDE 7

Big Data Characteristics: Velocity (Speed)

http://www.ibmbigdatahub.com/

Data is being generated fast and needs to be processed fast
Online Data Analytics

SLIDE 8

Big Data Characteristics: Veracity (Uncertainty)

http://www.ibmbigdatahub.com/

Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations (figure label: 10^12)

SLIDE 9

Some Numbers as of 2015

Estimated Size of Data
Google: 15 000 PB (= 15 Exabytes)
Facebook: 300 PB
eBay: 90 PB
Spotify: 10 PB

Data Processed per Day
Google: 100 PB
eBay: 100 PB
NSA: 29 PB
Facebook: 600 TB
Twitter: 100 TB
Spotify: 2.2 TB

Units: MB = 10^6 Bytes, GB = 10^9 Bytes, TB (Terabyte) = 10^12 Bytes, PB (Petabyte) = 10^15 Bytes, EB (Exabyte) = 10^18 Bytes

SLIDE 10

What Does Data Look Like?

Not necessarily the way you are used to from database lectures: usually not nicely structured (BCNF or 3NF) relations with known schema information.

But:

– Twitter Tweets
– Server Access Logs
– Web Pages
– Web Graph
– Huge CSV files in general (e.g., holding a “relation”)

SLIDE 11

{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823764586496,"id_str":"557920823764586496","text":"#TulsaAirport #Oklahoma Jan 21 08:53 Temperature 37\u00b0F clouds Wind NW 7 km\/h Humidity 85% .. http:\/\/t.co\/SnC8ST3gQC","source":"\u003ca href=\"http:\/\/www.woweather.com\/USA\/TulsaIAP.htm\" rel=\"nofollow\"\u003eupdate weather tulsa\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":255167921,"id_str":"255167921","name":"Weather Tulsa","screen_name":"wo_tulsa","location":"Tulsa","url":"http:\/\/itunes.apple.com\/app\/weatheronline\/id299504833?mt=8","description":"Weather Tulsa\n\nhttp:\/\/www.woweather.com\/USA\/Tulsa.htm","protected":false,"verified":false,"followers_count":111,"friends_count":60,"listed_count":5,"favourites_count":0,"statuses_count":33805,"created_at":"Sun Feb 20 20:31:42 +0000 2011","utc_offset":7200,"time_zone":"Athens","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"TulsaAirport","indices":[0,13]},{"text":"Oklahoma","indices":[14,23]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/SnC8ST3gQC","expanded_url":"http:\/\/bit.ly\/188eNcw","display_url":"bit.ly\/188eNcw","indices":[93,115]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1421853664710"}
{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823877464064,"id_str":"557920823877464064","text":"Anime episode updated: Kyoukai no

How to store or analyse such Data?
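
One possible first step towards an answer, as a minimal sketch: such data often arrives as one JSON object per line and can be parsed record by record, tolerating missing fields. The shortened sample records and the helper name `extract_hashtags` are illustrative assumptions, not part of the lecture.

```python
import json

# Shortened, hypothetical records in the same shape as the tweet above
# (one JSON object per line).
SAMPLE_LINES = [
    '{"id": 557920823764586496, "text": "#TulsaAirport #Oklahoma Temperature 37F", '
    '"user": {"screen_name": "wo_tulsa", "followers_count": 111}, '
    '"entities": {"hashtags": [{"text": "TulsaAirport"}, {"text": "Oklahoma"}]}}',
    '{"id": 1, "text": "no tags here", "user": {"screen_name": "someone"}}',
]


def extract_hashtags(line):
    """Parse one JSON record and return (screen_name, list of hashtags)."""
    tweet = json.loads(line)
    # There is no fixed schema, so every field access needs a default.
    user = tweet.get("user", {}).get("screen_name", "unknown")
    tags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    return user, tags


for line in SAMPLE_LINES:
    print(extract_hashtags(line))
```

The second record has no `entities` field at all, which is exactly the kind of irregularity a fixed relational schema would not accept without preprocessing.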

SLIDE 12

Processing Big Data

• OLTP: Online Transaction Processing (DBMSs)
  – Database applications
  – Storing, querying, multiuser access
• OLAP: Online Analytical Processing (Data Warehousing)
  – Answer multi-dimensional analytical queries
  – Financial/marketing reporting, budgeting, forecasting, …
• RTAP: Real-Time Analytic Processing (Big Data Architecture & Technology)
  – Data gathered & processed in a real-time streaming fashion
  – Real-time data queried and presented in an online fashion
  – Real-time and history data combined and mined interactively

SLIDE 13

Key Big Data-Related Technologies

• Distributed file systems
• NoSQL databases
• Grid computing, cloud computing
• MapReduce and other new paradigms
• Large scale machine learning

http://e-theses.imtlucca.it/34/

SLIDE 14

Relational Database Management Systems (RDBMSs)

• Predominant technology for storing structured data
• Established query languages, e.g. SQL, RA
• Often thought of as the only alternative for data storage
• Persistence, concurrency control, consistency control, …
• Alternatives: Object databases or XML stores
  – Never gained the same adoption and market share

SLIDE 15

Why Distributed File Systems?

Assume you have 10 TB of data on disk
Now, do some analysis of it
With a 100 MB/s disk, reading alone takes

– 100 000 seconds
– roughly 1 667 minutes
– roughly 27.8 hours
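
The arithmetic can be checked with a short sketch. The `disks` parameter is an added assumption: it models an idealised case in which the data is split evenly over several disks that are read in parallel, ignoring network and coordination overhead.

```python
TB = 10**12  # bytes, decimal units as used in this lecture
MB = 10**6


def read_time_seconds(data_bytes, throughput_bytes_per_s, disks=1):
    """Time to read data_bytes when it is split evenly across `disks` disks."""
    return data_bytes / (throughput_bytes_per_s * disks)


single = read_time_seconds(10 * TB, 100 * MB)
print(single, "seconds")
print(single / 60, "minutes")
print(single / 3600, "hours")

# Idealised scale-out: the same data spread over 1000 disks read in parallel.
parallel = read_time_seconds(10 * TB, 100 * MB, disks=1000)
print(parallel, "seconds with 1000 disks")
```

Under these assumptions the read time drops from more than a day to under two minutes, which is the basic motivation for spreading files over many machines.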

SLIDE 16

Need to do something about it

http://flickr.com/photos/jurvetson/157722937/ http://www.google.com/about/datacenter

SLIDE 17

Scale-up vs Scale-out

Scale-Up (vertical scaling):

More RAM, more CPU, more HDD

Scale-Out (horizontal scaling):

Same hardware, connected by network

SLIDE 18

Data Centers

source: http://www.google.com/about/datacenters/inside/index.html

SLIDE 19

Hardware Failures

Lots of machines (commodity hardware) → failure is not an exception but very common

P[machine fails today] = 1/365
n machines: P[failure of at least 1 machine] = 1 - (1 - P[machine fails today])^n

– for n=1: 0.0027
– for n=10: 0.02706
– for n=100: 0.239
– for n=1000: 0.9356
– for n=10 000: ~ 1.0
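
The formula above can be evaluated directly. This sketch assumes, as the formula does, that machines fail independently of each other; the function name `p_any_failure` is illustrative.

```python
def p_any_failure(n, p_single=1 / 365):
    """Probability that at least one of n independent machines fails today."""
    return 1 - (1 - p_single) ** n


for n in (1, 10, 100, 1000, 10_000):
    print(n, p_any_failure(n))
```

Already at around a thousand machines a failure somewhere in the cluster is the expected daily case, not the exception.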

source: google.com

SLIDE 20

Fallacies of Distributed Computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

source: Peter Deutsch and others at Sun

SLIDE 21

Failure Handling & Recovery

Hardware failures happen virtually at any time
Algorithms/Infrastructures have to compensate
Issues in distributed computing:
Replication of data
Logging of state
Redundancy in task execution

SLIDE 22

„NoSQL“

• 1998: first used for a relational database that omitted the use of SQL
  – Carlo Strozzi
• 2009: used for conferences of advocates of non-relational databases
  – Eric Evans
  – Blogger, developer at Rackspace

NoSQL movement = “the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for”

SLIDE 23

„NoSQL“

• Not „no to SQL“
  – Another option, not the only one
• Not „not only SQL“
  – Oracle DB or PostgreSQL would fit the definition
• „Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. Often more characteristics apply as: schema-free, easy replication support, simple API, eventually consistent (BASE, not ACID), a huge data amount, and more“

http://nosql-database.org/

SLIDE 24

The End of Relational Databases?

• Relational databases will not disappear
• Compelling arguments for most projects
  – Familiarity, stability, feature set, and available support
• We should see relational databases as one option for data storage
  – Polyglot persistence – using different data stores in different circumstances
  – Search for optimal storage for a particular application

SLIDE 25

Motivation for NoSQL Databases

• Huge amounts of data are now handled in real time
• Both data and use cases are getting more and more dynamic
• Social networks (relying on graph data) have gained impressive momentum
  – Special type of NoSQL databases: graph databases
• Full-texts have always been treated shabbily by RDBMS

SLIDE 26

Example: Facebook

http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/

Statistics from 2010

• 500 million users
• 570 billion page views per month
• 3 billion photos uploaded per month
• 1.2 million photos served per second
• 25 billion pieces of content (updates, comments) shared every month
• 50 million server-side operations per second

2008: 10,000 servers
2009: 30,000 servers
…
→ One RDBMS may not be enough to keep this going!

And even newer numbers: https://research.facebook.com/blog/facebook-s-top-open-data-problems/

SLIDE 27

Example: Facebook

Architecture from 2010

Cassandra
• NoSQL distributed storage system with no single point of failure
• For searching your inbox messages

Hadoop/Hive
• An open source MapReduce implementation
• Enables calculations on massive amounts of data
• Hive enables the use of SQL queries against Hadoop

SLIDE 28

Example: Facebook

Architecture from 2010 and later

Memcached
• Distributed memory caching system
• Caching layer between the web servers and MySQL servers
  – Since database access is relatively slow

HBase
• Hadoop database, used for e-mails, instant messaging and SMS
• Has recently replaced MySQL, Cassandra and a few others
• Built on Google’s BigTable model

SLIDE 29

NoSQL Databases

Five Advantages

1. Elastic scaling
• “Classical” database administrators scale up – buy bigger servers as database load increases
• Scaling out – distributing the database across multiple hosts as load increases

2. Big Data
• Volumes of data that are being stored have increased massively
• Opens new dimensions that cannot be handled with RDBMS

http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772

SLIDE 30

NoSQL Databases

Five Advantages

3. Goodbye DBAs (see you later?)
• Automatic repair, distribution, tuning, … vs. expensive, highly trained DBAs of RDBMS

4. Economics
• Based on cheap commodity servers → lower cost per transaction/second

5. Flexible Data Models
• Non-existing/relaxed data schema → structural changes cause no overhead

SLIDE 31

NoSQL Databases

Five Challenges

1. Maturity
• Still in pre-production phase
• Key features yet to be implemented

2. Support
• Mostly open source, result from start-ups
  – Enables fast development
• Limited resources or credibility

3. Administration
• Require a lot of skill to install and effort to maintain

SLIDE 32

NoSQL Databases

Five Challenges

4. Analytics and Business Intelligence
• Focused on web app scenarios
  – Modern Web 2.0 applications
  – Insert-read-update-delete
• Limited ad-hoc querying
  – Even a simple query requires significant programming expertise

5. Expertise
• Few NoSQL experts available in the market

SLIDE 33

Data Assumptions

RDBMS | NoSQL
integrity is mission-critical | OK as long as most data is correct
data format consistent, well-defined | data format unknown or inconsistent
data is of long-term value | data is expected to be replaced
data updates are frequent | write-once, read multiple (no updates, or at least not often)
predictable, linear growth | unpredictable growth (exponential)
non-programmers writing queries | only programmers writing queries
regular backup | replication
access through master server | sharding across multiple nodes

SLIDE 34

NoSQL Data Model

Aggregates

• Data model = the model by which the database organizes data
• Each NoSQL solution has a different model
  – Key-value, document, column-family, graph
  – The first three are oriented on aggregates
• Aggregate
  – A data unit with a complex structure
    – Not just a set of tuples like in RDBMS
  – Domain-Driven Design: “an aggregate is a collection of related objects that we wish to treat as a unit”
    – A unit for data manipulation and management of consistency
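
To make the contrast concrete, here is a small sketch of one order stored as an aggregate and the same data stored as flat tuples. All names and values (`order_aggregate`, the products, the prices) are made up for illustration.

```python
# The order, its shipping address and its line items form one aggregate:
# they are stored, retrieved and updated as a single unit.
order_aggregate = {
    "order_id": 99,
    "customer_id": 1,
    "shipping_address": {"city": "Bonn"},
    "line_items": [
        {"product": "NoSQL Distilled", "price": 32.45, "quantity": 1},
        {"product": "USB cable", "price": 5.00, "quantity": 2},
    ],
}

# The same data as flat relational tuples, spread over three "tables".
orders = [(99, 1)]          # (order_id, customer_id)
addresses = [(99, "Bonn")]  # (order_id, city)
line_items = [              # (order_id, product, price, quantity)
    (99, "NoSQL Distilled", 32.45, 1),
    (99, "USB cable", 5.00, 2),
]


def total_from_aggregate(aggregate):
    """Everything needed is inside the one aggregate."""
    return sum(i["price"] * i["quantity"] for i in aggregate["line_items"])


def total_from_tuples(order_id):
    """The relational form has to select the matching rows first."""
    return sum(price * qty for oid, _, price, qty in line_items if oid == order_id)


print(total_from_aggregate(order_aggregate), total_from_tuples(99))
```

The aggregate form answers "everything about order 99" with one lookup, while the tuple form stays neutral about which unit the application will want.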

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

NoSQL Data Model

Aggregates – aggregate-ignorant

• There is no universal strategy for drawing aggregate boundaries
  – Depends on how we manipulate the data
• RDBMS and graph databases are aggregate-ignorant
  – It is not a bad thing, it is a feature
  – Allows us to easily look at the data in different ways
  – Better choice when we do not have a primary structure for manipulating data


SLIDE 39

NoSQL Data Model

Aggregates – aggregate-oriented

• Aggregate orientation
  – Aggregates give the database information about which bits of data will be manipulated together
    – Which should live on the same node
  – Helps greatly with running on a cluster
    – We need to minimize the number of nodes we need to query when we are gathering data
• Consequence for transactions
  – NoSQL databases support atomic manipulation of a single aggregate at a time
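
A minimal sketch of why aggregates help on a cluster: if the aggregate key alone decides the node, a whole aggregate is written and read with a single node contact. The three node names and the simple hash-modulo placement are assumptions for illustration; real systems use more elaborate schemes.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster

# One in-memory dictionary per node stands in for that node's storage.
cluster = {node: {} for node in NODES}


def node_for(aggregate_key):
    """Map an aggregate key to exactly one node using a stable hash."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def put(key, aggregate):
    # The whole aggregate goes to one node in one step.
    cluster[node_for(key)][key] = aggregate


def get(key):
    # Only one node has to be queried to fetch the whole aggregate.
    return cluster[node_for(key)].get(key)


put("order:99", {"customer": 1, "line_items": [{"product": "x", "quantity": 2}]})
print(node_for("order:99"), get("order:99"))
```

An update that touches two different aggregates may involve two nodes, which is why atomicity is typically offered per aggregate only.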

SLIDE 40

NoSQL Databases

Materialized Views

• Disadvantage: the aggregated structure is given, other types of aggregations cannot be done easily
  – RDBMSs lack an aggregate structure → support for accessing data in different ways (using views)
• Solution: materialized views
  – Pre-computed and cached queries
• Strategies:
  – Update the materialized view when we update the base data
    – For more frequent reads of the view than writes
  – Run batch jobs to update the materialized views at regular intervals
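
The two strategies can be sketched as follows, with a plain list as base data and a dictionary as the materialized view. The names `orders`, `revenue_by_product`, `add_order_eager` and `rebuild_view_batch` are hypothetical.

```python
from collections import defaultdict

orders = []                              # base data: (product, amount)
revenue_by_product = defaultdict(float)  # materialized view


def add_order_eager(product, amount):
    """Strategy 1: update the view in the same step as the base data."""
    orders.append((product, amount))
    revenue_by_product[product] += amount


def rebuild_view_batch():
    """Strategy 2: recompute the view from the base data, e.g. in a periodic job."""
    view = defaultdict(float)
    for product, amount in orders:
        view[product] += amount
    return view


add_order_eager("book", 30.0)
add_order_eager("book", 12.5)
add_order_eager("cable", 5.0)
print(dict(revenue_by_product))
print(dict(rebuild_view_batch()))
```

The eager variant makes every write slightly more expensive but keeps reads always up to date; the batch variant keeps writes cheap and accepts a view that can be stale until the next run.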

SLIDE 41

NoSQL Databases

Schemalessness

• When we want to store data in an RDBMS, we need to define a schema
• Advocates of schemalessness rejoice in freedom and flexibility
  – Allows us to easily change the data storage as we learn more about the project
  – Easier to deal with non-uniform data
• Fact: there is usually an implicit schema present
  – The program working with the data must know its structure

SLIDE 42

Types of NoSQL Databases

Core:

• Key-value stores (databases)
• Document databases
• Column-family (column-oriented/columnar) stores (in contrast to relational columnar DBs like MonetDB)
• Graph databases

Non-core:

• Object databases
• XML databases
• …

http://nosql-database.org/

SLIDE 43

Key-value store

Basic characteristics

• The simplest NoSQL data stores
• A simple hash table (map), primarily used when all access to the database is via primary key
• A table in RDBMS with two columns, such as ID and NAME
  – ID column being the key
  – NAME column storing the value
    – A BLOB that the data store just stores
• Basic operations:
  – Get the value for the key
  – Put a value for a key
  – Delete a key from the data store
• Simple → great performance, easily scaled
• Simple → not for complex queries, aggregation needs
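
The three basic operations can be sketched with an in-memory dictionary. The class `KeyValueStore` is a hypothetical illustration, not the API of any particular product; note that the store treats the value as opaque bytes and only the application knows how to decode it.

```python
import json


class KeyValueStore:
    """In-memory key-value store; values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# The application, not the store, decides how a value is encoded.
store.put("users:2:settings", json.dumps({"theme": "dark", "cookies": False}).encode())

settings = json.loads(store.get("users:2:settings"))
print(settings["theme"])

store.delete("users:2:settings")
print(store.get("users:2:settings"))
```

Because the store never looks inside the value, a question such as "which users have the dark theme" cannot be answered without reading and decoding every value on the client side.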

SLIDE 44

Key-Value Stores

Data model: (key) -> value
Interface: CRUD (Create, Read, Update, Delete)

Key | Value: an opaque blob
users:2:friends | {23, 76, 233, 11}
users:2:inbox | [234, 3466, 86, 55]
users:2:settings | Theme → "dark", cookies → "false"

SLIDE 45

Key-value store

Representatives

[Logos of representatives; surviving text labels: MemcachedDB, “not open-source”, Project Voldemort, “open-source version”]

SLIDE 46

Key-value store

Suitable Use Cases

Storing Session Information
• Every web session is assigned a unique session_id value
• Everything about the session can be stored by a single PUT request or retrieved using a single GET
• Fast, everything is stored in a single object

User Profiles, Preferences
• Every user has a unique user_id, user_name + preferences such as language, colour, time zone, which products the user has access to, …
• As in the previous case: fast, single object, single GET/PUT

Shopping Cart Data
• Similar to the previous cases

SLIDE 47

Key-value store

When Not to Use

Relationships among Data
• Relationships between different sets of data
• Some key-value stores provide link-walking features
  – Not usual

Multioperation Transactions
• Saving multiple keys
  – Failure to save any one of them → revert or roll back the rest of the operations

Query by Data
• Search the keys based on something found in the value part

Operations by Sets
• Operations are limited to one key at a time
• No way to operate upon multiple keys at the same time

SLIDE 48

Column-Family Stores

Basic Characteristics

• Also “columnar” or “column-oriented”
• Column families = rows that have many columns associated with a row key
• Column families are groups of related data that is often accessed together
  – e.g., for a customer we access all profile information at the same time, but not orders

SLIDE 49

Wide-Column Stores

• Data model: (rowkey, column, timestamp) -> value
• Interface: CRUD, Scan
• Examples: Cassandra (AP), Google BigTable (CP), HBase (CP)

Row Key | Column
com.cnn.www | crawled: …
com.cnn.www | content : "<html>…" (three timestamped versions)
com.cnn.www | title : "CNN"
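
The data model (rowkey, column, timestamp) -> value and the Scan operation can be sketched in memory as below. The class `WideColumnTable` and its method names are hypothetical and do not mirror the API of Cassandra, BigTable or HBase.

```python
class WideColumnTable:
    """Maps (row key, column, timestamp) to a value, keeping all versions."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value)
        self._rows = {}

    def put(self, row_key, column, timestamp, value):
        self._rows.setdefault(row_key, {}).setdefault(column, []).append((timestamp, value))

    def get(self, row_key, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        versions = [
            (ts, value)
            for ts, value in self._rows.get(row_key, {}).get(column, [])
            if timestamp is None or ts <= timestamp
        ]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan(self, start, stop):
        """Row keys in sorted order with start <= key < stop."""
        return [key for key in sorted(self._rows) if start <= key < stop]


table = WideColumnTable()
table.put("com.cnn.www", "content", 1, "<html>v1")
table.put("com.cnn.www", "content", 2, "<html>v2")
table.put("com.cnn.www", "title", 1, "CNN")
table.put("org.example", "title", 1, "Example")

print(table.get("com.cnn.www", "content"))     # newest version
print(table.get("com.cnn.www", "content", 1))  # version as of timestamp 1
print(table.scan("c", "d"))                    # range scan over row keys
```

Keeping rows sorted by key is what makes range scans cheap, which is also why row keys such as reversed domain names are chosen so that related rows end up next to each other.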

SLIDE 50

Column-Family Stores

Representatives

Google’s BigTable

SLIDE 51

Column-Family Stores

Suitable Use Cases

Event Logging
• Ability to store any data structures → good choice to store event information

Content Management Systems, Blogging Platforms
• We can store blog entries with tags, categories, links, and trackbacks in different columns
• Comments can be either stored in the same row or moved to a different keyspace
• Blog users and the actual blogs can be put into different column families

SLIDE 52

Column-Family Stores

When Not to Use

Systems that Require ACID Transactions
• Column-family stores are not just a special kind of RDBMSs with a variable set of columns!

Aggregation of the Data Using Queries
• (such as SUM or AVG)
• Have to be done on the client side

For Early Prototypes
• We are not sure how the query patterns may change
• As the query patterns change, we have to change the column family design

SLIDE 53

Document Databases

Basic Characteristics

• Documents are the main concept
  – Stored and retrieved
  – XML, JSON, …
• Documents are
  – Self-describing
  – Hierarchical tree data structures
  – Can consist of maps, collections (lists, sets, …), scalar values, nested documents, …
• Documents in a collection are expected to be similar
  – Their schema can differ
• Document databases store documents in the value part of the key-value store
  – Key-value stores where the value is examinable

SLIDE 54

Document Stores

• Data model: (collection, key) -> document
• Interface: CRUD, Queries, Map-Reduce
• Examples: CouchDB (AP), Amazon SimpleDB (AP), MongoDB (CP)

ID/Key: order-12338
JSON Document:
{
  order-id: 23,
  customer: { name : "Felix Gessert", age : 25 },
  line-items : [ {product-name : "x", …} , …]
}
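
What distinguishes a document store from a plain key-value store is that the database can look inside the value. A minimal in-memory sketch, assuming a hypothetical `DocumentCollection` class with a dotted-path `find`; this is not the query API of CouchDB, SimpleDB or MongoDB.

```python
class DocumentCollection:
    """In-memory collection: key -> document, queryable by field."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs.get(key)

    def find(self, path, expected):
        """Keys of documents whose field at the dotted `path` equals `expected`."""
        hits = []
        for key, doc in self._docs.items():
            value = doc
            for part in path.split("."):
                # Documents may lack a field: schemas in a collection can differ.
                value = value.get(part) if isinstance(value, dict) else None
            if value == expected:
                hits.append(key)
        return hits


orders = DocumentCollection()
orders.insert("order-12338", {
    "order-id": 23,
    "customer": {"name": "Felix Gessert", "age": 25},
    "line-items": [{"product-name": "x"}],
})
orders.insert("order-12339", {"order-id": 24, "customer": {"name": "Ada"}})

print(orders.find("customer.name", "Felix Gessert"))
print(orders.find("customer.age", 99))
```

The second document has no `age` field, yet both live in the same collection; the query simply does not match it.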

SLIDE 55

Document Databases

Representatives

Lotus Notes Storage Facility

SLIDE 56

Document Databases

Suitable Use Cases

Event Logging
• Many different applications want to log events
  – Type of data being captured keeps changing
• Events can be sharded (i.e. divided) by the name of the application or type of event

Content Management Systems, Blogging Platforms
• Managing user comments, user registrations, profiles, web-facing documents, …

Web Analytics or Real-Time Analytics
• Parts of the document can be updated
• New metrics can be easily added without schema changes
  – E.g. adding a member of a list, set, …

E-Commerce Applications
• Flexible schema for products and orders
• Evolving data models without expensive data migration

SLIDE 57

Document Databases

When Not to Use

Complex Transactions Spanning Different Operations
• Atomic cross-document operations
  – Some document databases do support them (e.g., RavenDB)

Queries against Varying Aggregate Structure
• Design of aggregate is constantly changing → we need to save the aggregates at the lowest level of granularity
  – i.e. to normalize the data

SLIDE 58

Graph Databases

Basic Characteristics

• To store entities and relationships between these entities
  – Node is an instance of an object
  – Nodes have properties
    – e.g., name
  – Edges have directional significance
  – Edges have types
    – e.g., likes, friend, …
• Nodes are organized by relationships
  – Allow us to find interesting patterns
  – e.g., “Get all nodes employed by Big Co that like NoSQL Distilled”
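
The example pattern can be sketched with typed, directed edges stored as triples. The people, the edge type names and the helper functions are made up for illustration; a real graph database would use its own query language and index structures.

```python
# Each edge is (source node, edge type, target node); direction matters.
edges = [
    ("Anna", "employed_by", "Big Co"),
    ("Barbara", "employed_by", "Big Co"),
    ("Carol", "employed_by", "Small Co"),
    ("Anna", "likes", "NoSQL Distilled"),
    ("Carol", "likes", "NoSQL Distilled"),
    ("Barbara", "friend", "Anna"),
]


def sources(edge_type, target):
    """All nodes with an outgoing edge of `edge_type` pointing at `target`."""
    return {s for s, t, o in edges if t == edge_type and o == target}


def employed_by_and_likes(company, item):
    """Nodes employed by `company` that also like `item`."""
    return sources("employed_by", company) & sources("likes", item)


print(employed_by_and_likes("Big Co", "NoSQL Distilled"))
```

Adding a new relationship type only means adding more triples; no schema change is needed, which is the point made about graph databases versus RDBMS.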

SLIDE 59

Example:

SLIDE 60

Graph Databases

RDBMS vs. Graph Databases

• When we store a graph-like structure in RDBMS, it is for a single type of relationship
  – “Who is my manager”
• Adding another relationship usually means a lot of schema changes
• In RDBMS we model the graph beforehand based on the traversal we want
  – If the traversal changes, the data will have to change
  – In graph databases the relationship is not calculated at query time but persisted

SLIDE 61

Graph Databases

Representatives

FlockDB

SLIDE 62

Graph Databases

Suitable Use Cases

Connected Data
• Social networks
• Any link-rich domain is well suited for graph databases

Routing, Dispatch, and Location-Based Services
• Node = location or address that has a delivery
• Graph = nodes where a delivery has to be made
• Relationships = distance

Recommendation Engines
• “your friends also bought this product”
• “when invoicing this item, these other items are usually invoiced”

SLIDE 63

Graph Databases

When Not to Use

• When we want to update all or a subset of entities
  – Changing a property on all the nodes is not a straightforward operation
  – e.g., analytics solution where all entities may need to be updated with a changed property
• Some graph databases may be unable to handle lots of data
  – Distribution of a graph is difficult or impossible

SLIDE 64

NoSQL Data Model

Aggregates and NoSQL databases

Key-value database
• Aggregate = some big blob of mostly meaningless bits
  – But we can store anything
• We can only access an aggregate by lookup based on its key

Document database
• Enables us to see a structure in the aggregate
  – But we are limited by the structure when storing (similarity)
• We can submit queries to the database based on the fields in the aggregate

SLIDE 65

NoSQL Data Model

Aggregates and NoSQL databases

Column-family stores
• A two-level aggregate structure
  – The first key is a row identifier, picking up the aggregate of interest
  – The second-level values are referred to as columns
• Ways to think about how the data is structured:
  – Row-oriented: each row is an aggregate with column families representing useful chunks of data (profile, order history)
  – Column-oriented: each column family defines a record type (e.g., customer profiles) with rows for each of the records; a row is the join of records in all column families

SLIDE 66

History

Timeline figure, 2003 to 2013: Google File System, MapReduce, CouchDB, MongoDB, Dynamo, Cassandra, Riak, MegaStore, F1, Redis, HyperDex, Spanner, CouchBase, Dremel, Hadoop & HDFS, HBase, BigTable

SLIDE 67

References

• http://nosql-database.org/
• Pramod J. Sadalage, Martin Fowler: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
• Eric Redmond, Jim R. Wilson: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
• Sherif Sakr, Eric Pardede: Graph Data Management: Techniques and Applications
• Shashank Tiwari: Professional NoSQL