Systems Infrastructure for Data Science (Web Science Group, Uni Freiburg)

SLIDE 1

Systems Infrastructure for Data Science

Web Science Group Uni Freiburg WS 2012/13

SLIDE 2

Lecture VII: Introduction to Distributed Databases

SLIDE 3

Why do we distribute?

  • Applications are inherently distributed.
  • A distributed system is more reliable.
  • A distributed system performs better.
  • A distributed system scales better.

Uni Freiburg, WS2012/13 Systems Infrastructure for Data Science 3

SLIDE 4

Distributed Database Systems

  • Union of two technologies:
    – Database Systems + Computer Networks
  • Database systems provide
    – data independence (physical & logical)
    – centralized and controlled data access
    – integration
  • Computer networks provide distribution.
  • integration ≠ centralization
  • integration + distribution

SLIDE 5

DBMS Provides Data Independence

[Figure: File Systems vs. Database Management Systems]

SLIDE 7

Distributed Systems

  • Tanenbaum et al:
    “a collection of independent computers that appears to its users as a single coherent system”
  • Coulouris et al:
    “a system in which hardware and software components located at networked computers communicate and coordinate their actions only by passing messages”

SLIDE 8

Distributed Systems

  • Ozsu et al:
    “a number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks”

SLIDE 9

What is being distributed?

  • Processing logic
  • Function
  • Data
  • Control
  • For distributed DBMSs, all are required.

SLIDE 10

Centralized DBMS on a Network

What is being distributed here?

SLIDE 11

Distributed DBMS

And here?

SLIDE 12

Distributed DBMS Promises

  1. Transparent management of distributed and replicated data
  2. Reliability/availability through distributed transactions
  3. Improved performance
  4. Easier and more economical system expansion

SLIDE 13

Promise #1: Transparency

  • Hiding implementation details from users
  • Providing data independence in the distributed environment
  • Different transparency types, related:
  • Full transparency is neither always possible nor desirable!

SLIDE 14

Transparency Example

  • Employee (eno, ename, title)
  • Project (pno, pname, budget)
  • Salary (title, amount)
  • Assignment (eno, pno, responsibility, duration)

SELECT ename, amount
FROM Employee, Assignment, Salary
WHERE Assignment.duration > 12
  AND Employee.eno = Assignment.eno
  AND Salary.title = Employee.title
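
The query can be tried on a single node with a minimal sketch like the one below, which uses Python's built-in sqlite3 module. The sample rows and column types are invented for illustration and are not part of the lecture. The point of transparency is that, in a distributed DBMS, the same SQL text would be submitted unchanged even if the relations were fragmented or replicated across sites.

```python
import sqlite3

# In-memory, single-node database; a stand-in for the distributed case.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (eno TEXT, ename TEXT, title TEXT);
    CREATE TABLE Salary (title TEXT, amount INTEGER);
    CREATE TABLE Assignment (eno TEXT, pno TEXT, responsibility TEXT, duration INTEGER);
    -- Hypothetical sample data, for illustration only.
    INSERT INTO Employee VALUES ('E1', 'Ada', 'Engineer'), ('E2', 'Bob', 'Analyst');
    INSERT INTO Salary VALUES ('Engineer', 40000), ('Analyst', 34000);
    INSERT INTO Assignment VALUES ('E1', 'P1', 'Manager', 24), ('E2', 'P1', 'Analyst', 6);
""")

# The query from the slide: employees assigned for more than 12 months, with salary.
rows = conn.execute("""
    SELECT ename, amount
    FROM Employee, Assignment, Salary
    WHERE Assignment.duration > 12
      AND Employee.eno = Assignment.eno
      AND Salary.title = Employee.title
""").fetchall()
print(rows)
```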

SLIDE 15

Transparency Example

What types of transparencies are provided here?

SLIDE 16

Promise #2: Reliability & Availability

  • Distribution of replicated components
  • When sites or links between sites fail
    – No single point of failure
  • Distributed transaction protocols keep the database consistent via
    – Concurrency transparency
    – Failure atomicity

SLIDE 17

Promise #3: Improved Performance

  • Place data fragments closer to their users
    – less contention for CPU and I/O at a given site
    – reduced remote access delay
  • Exploit parallelism in execution
    – inter-query parallelism
    – intra-query parallelism

SLIDE 18

Promise #4: Easy Expansion

  • It is easier to scale a distributed collection of smaller systems than one big centralized system.

SLIDE 19

19 ETH Zurich, Fall 2010 Networked Information Systems

How do we distribute?

  • Basic distributed architectures:
    – Shared-Memory
    – Shared-Disk
    – Shared-Nothing

SLIDE 20

20 ETH Zurich, Spring 2009 Networked Information Systems

Shared-Memory

  • Fast interconnect
  • Single OS
  • Advantages:
    – Simplicity
    – Easy load balancing
  • Problems:
    – High cost (the interconnect)
    – Limited extensibility (~10)
    – Low availability

SLIDE 21

Shared-Disk

  • Separate OS per P-M
  • Advantages:
    – No distributed database design needed: easy migration/evolution
    – Load balancing
    – Availability
  • Problems:
    – Limited extensibility (~20): disk/interconnect bottleneck

SLIDE 22

Shared-Cache

  • Oracle RAC
  • The interconnect is used to communicate between nodes and disk: if data are missing from the local buffer, they are looked up first in the buffers of other nodes and only then on disk
  • The same pros/cons as shared-disk, just faster
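
The lookup order described above (local buffer, then the buffers of other nodes, then disk) can be sketched as follows. This is a toy model under simplifying assumptions: the `Node` class and `read_page` function are hypothetical names, and cache coherence, locking, and eviction are ignored entirely. It is not Oracle RAC's actual protocol.

```python
class Node:
    """A node with a private buffer cache (page id -> page contents)."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}


def read_page(page_id, local, peers, disk):
    """Return (contents, source), trying local buffer, peer buffers, then disk."""
    if page_id in local.buffer:
        return local.buffer[page_id], "local buffer"
    for peer in peers:
        if page_id in peer.buffer:
            # Fetch over the interconnect and cache the page locally.
            local.buffer[page_id] = peer.buffer[page_id]
            return local.buffer[page_id], f"buffer of {peer.name}"
    # Last resort: read from the shared disk and cache the page locally.
    local.buffer[page_id] = disk[page_id]
    return local.buffer[page_id], "disk"


disk = {1: "page-1", 2: "page-2"}
a, b = Node("A"), Node("B")
b.buffer[1] = disk[1]                 # node B already has page 1 cached

first = read_page(1, a, [b], disk)    # served from B's buffer
second = read_page(1, a, [b], disk)   # now cached locally on A
third = read_page(2, a, [b], disk)    # nobody has it, so it comes from disk
print(first, second, third)
```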

SLIDE 23

Shared-Nothing

  • Separate OS per P-M-D
  • E.g. DB2 Parallel Edition, Teradata
  • Advantages:
    – Extensibility and scalability
    – Lower cost
    – High availability
  • Problems:
    – Distributed database design for particular queries/workload

SLIDE 24

Retrospective summary

  • Shared-cache (disk) won in the enterprise because:
    – enterprises usually do not require extreme scalability
    – it was easy to migrate from a non-distributed database
  • Shared-Nothing is now popular because Web applications require extreme scalability

SLIDE 25

Basic Shared-Nothing Techniques

  • Data Partitioning
  • Data Replication
  • Query Decomposition and Function Shipping

SLIDE 26

Shared-Nothing Techniques: Partitioning

  • Each relation is divided into n partitions that are mapped onto different disks.
  • Makes it possible to store large amounts of data and improves performance
  • By key (the values of one or more columns):
    – Range
        • e.g. using a B-tree index
        • Supports range queries, but an index is required
    – Hashing
        • Hash function
        • Efficient only for exact-match queries, but no index is needed
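
The trade-off between the two schemes can be sketched as follows. This is a minimal illustration, not how any particular DBMS implements partitioning: the function names are hypothetical, and a sorted list of split points stands in for the index that range partitioning needs.

```python
import bisect
import hashlib


def range_partition(key, boundaries):
    """Partition index for key; boundaries is a sorted list of split points."""
    return bisect.bisect_right(boundaries, key)


def partitions_for_range(lo, hi, boundaries):
    """Partitions that may hold keys in [lo, hi] under range partitioning."""
    first = range_partition(lo, boundaries)
    last = range_partition(hi, boundaries)
    return list(range(first, last + 1))


def hash_partition(key, n):
    """Partition index for key under hashing; no lookup structure is needed."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


# Partition 0: key < 100, partition 1: 100 <= key < 200, partition 2: key >= 200
boundaries = [100, 200]

print(range_partition(150, boundaries))            # exact match: one partition
print(partitions_for_range(90, 120, boundaries))   # range query: only some partitions
print(hash_partition(150, 3))                      # exact match: one partition
# Under hashing, a range query such as [90, 120] has to visit all 3 partitions,
# because neighbouring keys are scattered by the hash function.
```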

SLIDE 27

Shared-Nothing Techniques: Replication

  • Storing copies of data on different nodes
  • Provides high availability and reliability
  • Requires distributed transactions to keep replicas consistent:
    – Two-phase commit: data are always consistent, but the system is fragile
    – Eventual consistency: replicas eventually become consistent, but the system is always writable
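
Why two-phase commit keeps replicas consistent yet is fragile can be seen in a toy sketch like the one below. The `Replica` class and `two_phase_write` function are hypothetical, and the sketch leaves out logging, timeouts, and coordinator failure, all of which a real protocol must handle.

```python
class Replica:
    """One copy of the data, with a staging slot for an uncommitted write."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}
        self.staged = None

    def prepare(self, key, value):
        """Phase 1: stage the write and vote; an unavailable replica votes no."""
        if not self.available:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the staged write visible."""
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        """Phase 2 (some vote was no): discard the staged write."""
        self.staged = None


def two_phase_write(replicas, key, value):
    """Write to all replicas atomically; return True if committed."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False


replicas = [Replica("r1"), Replica("r2")]
ok_first = two_phase_write(replicas, "x", 1)    # both replicas up: commits
replicas[1].available = False
ok_second = two_phase_write(replicas, "x", 2)   # one replica down: aborts
print(ok_first, ok_second, [r.data for r in replicas])
```

A single unavailable replica blocks every write, which is the fragility the slide refers to. An eventually consistent design would instead accept the write on the reachable replicas and reconcile the others later.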

SLIDE 28

Shared-Nothing Techniques: Query Decomposition and Shipping

  • Query operations are performed where the data resides.
    – The query is decomposed into subtasks according to the data placement (partitioning and replication).
    – Subtasks are executed at the corresponding nodes.
  • Any given data placement is good only for some queries, so:
    – it is hard to design the database
    – the database needs to be redesigned when the queries change
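
Decomposition and function shipping can be sketched with a simple aggregate. In this minimal illustration the partition contents, node names, and function names are all invented; each node computes a partial result over its own rows, and only those small partial results travel to the coordinator.

```python
# Hypothetical placement: Assignment rows partitioned across two nodes.
partitions = {
    "node1": [{"eno": "E1", "duration": 24}, {"eno": "E2", "duration": 6}],
    "node2": [{"eno": "E3", "duration": 18}, {"eno": "E4", "duration": 48}],
}


def local_subtask(rows, min_duration):
    """Runs at the node holding the rows: filter, then partial count and sum."""
    selected = [r["duration"] for r in rows if r["duration"] > min_duration]
    return len(selected), sum(selected)


def coordinator(partitions, min_duration):
    """Ship the subtask to every node and merge the partial results."""
    partials = [local_subtask(rows, min_duration) for rows in partitions.values()]
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count if count else None


# Average duration of assignments longer than 12 months.
avg = coordinator(partitions, 12)
print(avg)
```

The average is computed from per-node counts and sums rather than per-node averages, because averages of partitions of different sizes cannot simply be averaged again.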

SLIDE 29

Classes of shared-nothing databases

  • Two broad classes of shared-nothing systems we will talk about:
    – SQL DBMS: DB2 Parallel Edition (Enterprise apps)
    – Key-value store: Cassandra (Web apps)

SLIDE 30

Distributed DBMS Major Design Issues

  • Distributed DB design (Data storage)
    – partition vs. replicate
    – full vs. partial replicas
    – optimal fragmentation and distribution is NP-hard
  • Distributed metadata management
    – where to place directory data
  • Distributed query processing
    – cost-efficient query execution over the network
    – query optimization is NP-hard

SLIDE 31

Distributed DBMS Major Design Issues

  • Distributed transaction management
    – Synchronizing concurrent access
    – Consistency of multiple copies of data
    – Detecting and recovering from failures
    – Deadlock management
    – Providing ACID properties in general

=> Distributed Systems Lecture (Schindelhauer/Lausen)

SLIDE 32

[Silberschatz et al]

Typical Centralized DBMS Architecture

SLIDE 33

Important Architectural Dimensions for Distributed DBMSs

SLIDE 34

Client/Server DBMS Architecture

[Figure labels: Client machine, Server machine, Network, Cached data management, Data management]

SLIDE 35

Three-tier Client/Server Architecture

[Figure labels: User interface, Application programs, Data management]

SLIDE 36

Extensions to Client/Server Architectures

  • Multiple clients
  • Multiple application servers
  • Multiple database servers

SLIDE 37

Peer-to-Peer DBMS Systems

  • Classical (same functionality at each site)
  • Modern (as in P2P data sharing systems)
    – Large scale
    – Massive distribution
    – High heterogeneity
    – High autonomy

SLIDE 38

Classical Peer-to-Peer DBMS Architecture

Peer machine

  • Logical organization of data at all sites
  • Logical organization of data at local site
  • Physical organization of data at local site
  • User view
  • Transparency support

SLIDE 39

Multi-database System Architecture

Peer machines

  • Full autonomy
  • Potential heterogeneity

Middleware layer

SLIDE 40

What is a Distributed DBMS?

  • Distributed database:
    – “a collection of multiple, logically interrelated databases distributed over a computer network”
  • Distributed DBMS:
    – “the software system that permits the management of the distributed database and makes the distribution transparent to the users”
  • This definition is relaxed for modern networked information systems (e.g., web).
