Lecture 22: Introduction to Distributed Databases - Intro to Database Systems (CMU 15-445/645, Fall 2019)

  1. Lecture 22: Introduction to Distributed Databases. Intro to Database Systems, Andy Pavlo, 15-445/15-645, Computer Science, Carnegie Mellon University, Fall 2019.

  2. ADMINISTRIVIA. Homework #5: Monday Dec 3rd @ 11:59pm. Project #4: Monday Dec 10th @ 11:59pm. Extra Credit: Wednesday Dec 10th @ 11:59pm. Final Exam: Monday Dec 9th @ 5:30pm.

  3. ADMINISTRIVIA. Monday Dec 2nd – Oracle Lecture → Shasank Chavan (VP In-Memory Databases). Wednesday Dec 4th – Potpourri + Review → Vote for what system you want me to talk about: https://cmudb.io/f19-systems. Sunday Nov 24th – Extra Credit Check → Submit your extra credit assignment early to get feedback from me.

  4. UPCOMING DATABASE EVENTS. Oracle Research Talk → Tuesday December 4th @ 12:00pm → CIC 4th Floor.

  5. PARALLEL VS. DISTRIBUTED. Parallel DBMSs: → Nodes are physically close to each other. → Nodes connected with high-speed LAN. → Communication cost is assumed to be small. Distributed DBMSs: → Nodes can be far from each other. → Nodes connected using a public network. → Communication cost and problems cannot be ignored.

  6. DISTRIBUTED DBMSs. Use the building blocks that we covered in single-node DBMSs to now support transaction processing and query execution in distributed environments: → Optimization & Planning → Concurrency Control → Logging & Recovery.

  7. TODAY'S AGENDA. System Architectures, Design Issues, Partitioning Schemes, Distributed Concurrency Control.

  8. SYSTEM ARCHITECTURE. A DBMS's system architecture specifies what shared resources are directly accessible to CPUs. This affects how CPUs coordinate with each other and where they retrieve/store objects in the database.

  9. SYSTEM ARCHITECTURE. [Diagram: the four architectures, ordered by how much is shared over the network: Shared Everything, Shared Memory, Shared Disk, Shared Nothing.]

  10. SHARED MEMORY. CPUs have access to a common memory address space via a fast interconnect. → Each processor has a global view of all the in-memory data structures. → Each DBMS instance on a processor has to "know" about the other instances.

  11. SHARED DISK. All CPUs can access a single logical disk directly via an interconnect, but each has its own private memory. → Can scale the execution layer independently from the storage layer. → Must send messages between CPUs to learn about their current state.

  12. SHARED DISK EXAMPLE. [Diagram: an application server sends Get Id=101 and Get Id=200 to compute nodes; the nodes read Page ABC and Page XYZ from shared storage. When one node runs Update 101, the other nodes must refetch the new version of Page ABC.]

  13. SHARED NOTHING. Each DBMS instance has its own CPU, memory, and disk. Nodes only communicate with each other via network. → Hard to increase capacity. → Hard to ensure consistency. → Better performance & efficiency.

  14. SHARED NOTHING EXAMPLE. [Diagram: an application server sends Get Id=10 and Get Id=200 to the nodes that own those keys. Two nodes initially hold P1→ID:1-150 and P2→ID:151-300; when a third node is added, the ranges are rebalanced to P1→ID:1-100, P3→ID:101-200, and P2→ID:201-300.]

  15. EARLY DISTRIBUTED DATABASE SYSTEMS. MUFFIN – UC Berkeley (1979) [Stonebraker], SDD-1 – CCA (1979) [Bernstein], System R* – IBM Research (1984) [Mohan], Gamma – Univ. of Wisconsin (1986) [DeWitt], NonStop SQL – Tandem (1987) [Gray].

  16. DESIGN ISSUES. How does the application find data? How to execute queries on distributed data? → Push query to data. → Pull data to query. How does the DBMS ensure correctness?

  17. HOMOGENEOUS VS. HETEROGENEOUS. Approach #1: Homogeneous Nodes → Every node in the cluster can perform the same set of tasks (albeit on potentially different partitions of data). → Makes provisioning and failover "easier". Approach #2: Heterogeneous Nodes → Nodes are assigned specific tasks. → Can allow a single physical node to host multiple "virtual" node types for dedicated tasks.

  18. MONGODB HETEROGENEOUS ARCHITECTURE. [Diagram: the application server sends Get Id=101 to a router (mongos); a config server (mongod) stores the partition map P1→ID:1-100, P2→ID:101-200, P3→ID:201-300, P4→ID:301-400; the router uses it to forward the query to the shard (mongod) holding P2.]
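The routing step in the diagram above can be made concrete with a small sketch. This is not MongoDB's actual API; RangeRouter and its range table are hypothetical stand-ins for the config server's partition map, assuming inclusive id ranges like those shown on the slide.

```python
# Hypothetical sketch of range-based request routing: the router looks up which
# partition owns an id using the range table published by the config server,
# then forwards the query to the shard that holds that partition.
import bisect

class RangeRouter:
    def __init__(self, ranges):
        # ranges: list of (lo, hi, shard) with inclusive bounds.
        self.ranges = sorted(ranges)

    def shard_for(self, key):
        # Find the last range whose lower bound is <= key.
        los = [lo for lo, _, _ in self.ranges]
        i = bisect.bisect_right(los, key) - 1
        if i >= 0:
            lo, hi, shard = self.ranges[i]
            if lo <= key <= hi:
                return shard
        raise KeyError(f"no shard owns id {key}")

router = RangeRouter([(1, 100, "P1"), (101, 200, "P2"),
                      (201, 300, "P3"), (301, 400, "P4")])
assert router.shard_for(101) == "P2"   # Get Id=101 goes to the shard holding P2
```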

  19. DATA TRANSPARENCY. Users should not be required to know where data is physically located, or how tables are partitioned or replicated. A SQL query that works on a single-node DBMS should work the same on a distributed DBMS.

  20. DATABASE PARTITIONING. Split the database across multiple resources: → Disks, nodes, processors. → Sometimes called "sharding". The DBMS executes query fragments on each partition and then combines the results to produce a single answer (a sketch of this scatter-gather pattern follows below).
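The "execute fragments, then combine" step is essentially scatter-gather. Below is a minimal sketch under the assumption that partitions are plain in-memory lists; execute_fragment is a hypothetical stand-in for running a query fragment on the node that owns a partition.

```python
# A minimal scatter-gather sketch: send the same query fragment to every
# partition in parallel, then merge the partial results into one answer.
from concurrent.futures import ThreadPoolExecutor

def execute_fragment(partition, fragment):
    # In a real system this would run on the node that owns the partition;
    # here it just filters an in-memory list of tuples.
    return [t for t in partition if fragment(t)]

def scatter_gather(partitions, fragment):
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: execute_fragment(p, fragment), partitions))
    result = []
    for partial in partials:          # gather: combine the partial results
        result.extend(partial)
    return result

partitions = [[(1, "a"), (2, "b")], [(150, "c")], [(250, "d")]]
print(scatter_gather(partitions, lambda t: t[0] > 100))   # [(150, 'c'), (250, 'd')]
```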

  21. NAÏVE TABLE PARTITIONING. Each node stores one and only one table. Assumes that each node has enough storage space for a table.

  22. NAÏVE TABLE PARTITIONING. [Diagram: Table1 is placed entirely on one partition and Table2 entirely on another.] Ideal Query: SELECT * FROM table

  23. HORIZONTAL PARTITIONING. Split a table's tuples into disjoint subsets. → Choose column(s) that divide the database equally in terms of size, load, or usage. → Hash Partitioning, Range Partitioning. The DBMS can partition a database physically (shared nothing) or logically (shared disk).

  24. HORIZONTAL PARTITIONING. [Diagram: each tuple of Table1 is assigned to one of four partitions by hashing its partitioning key, e.g. hash(a)%4 = P2, hash(b)%4 = P4, hash(c)%4 = P3, hash(d)%4 = P2, hash(e)%4 = P1.] Ideal Query: SELECT * FROM table WHERE partitionKey = ?
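A minimal sketch of the two partitioning rules named on slide 23, using Python's built-in hash() as a stand-in for whatever stable hash function the DBMS actually uses; NUM_PARTITIONS and the range bounds are illustrative.

```python
# Hash partitioning: a tuple goes to the partition given by hashing its
# partitioning key modulo the number of partitions. A real DBMS would use a
# stable hash, not Python's per-process salted hash().
NUM_PARTITIONS = 4

def hash_partition(key):
    return hash(key) % NUM_PARTITIONS          # -> partition id 0..3

def range_partition(key, bounds=(100, 200, 300)):
    # Range partitioning alternative: partition i holds keys <= bounds[i].
    for i, upper in enumerate(bounds):
        if key <= upper:
            return i
    return len(bounds)                          # last partition holds the rest

# The ideal query SELECT * FROM table WHERE partitionKey = ? touches only the
# single partition returned by hash_partition(?) or range_partition(?).
```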

  25. CONSISTENT HASHING. [Diagram: six nodes (A-F) are placed at positions on a hash ring spanning [0, 1); a key such as key1 or key2 is assigned to the first node encountered moving clockwise from hash(key). With Replication Factor = 3, the key is also replicated to the next two nodes on the ring.]
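A minimal consistent-hashing sketch under the slide's assumptions: nodes on a ring in [0, 1), a key owned by the first node clockwise from its hash, and a replication factor of 3. ring_hash and ConsistentHashRing are hypothetical names, and the virtual nodes most real systems add per physical node are omitted.

```python
# Consistent hashing sketch: place nodes on a ring, assign each key to the
# first node clockwise from hash(key), and replicate to the next rf-1 nodes.
import bisect, hashlib

def ring_hash(value):
    # Stable hash mapped onto the unit ring [0, 1).
    h = hashlib.md5(str(value).encode()).hexdigest()
    return int(h, 16) / 16**32

class ConsistentHashRing:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def replicas_for(self, key):
        # First node clockwise from hash(key), plus the next rf-1 nodes.
        i = bisect.bisect_right(self.points, (ring_hash(key),))
        return [self.points[(i + k) % len(self.points)][1] for k in range(self.rf)]

ring = ConsistentHashRing(["A", "B", "C", "D", "E", "F"])
print(ring.replicas_for("key1"))   # the owning node plus its two ring successors
```

Adding or removing a node only moves the keys between that node and its ring neighbors, which is the property that makes this scheme attractive for rebalancing.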

  26. LOGICAL PARTITIONING. [Diagram: the application server sends Get Id=1 and Get Id=3 to the nodes logically responsible for those ids (one node owns Id=1,2; the other owns Id=3,4), but all of the tuples Id=1..4 reside on shared storage.]

  27. PHYSICAL PARTITIONING. [Diagram: the same requests, but each node physically stores its own partition on local storage (one node holds Id=1,2; the other holds Id=3,4).]

  28. SINGLE-NODE VS. DISTRIBUTED. A single-node txn only accesses data that is contained on one partition. → The DBMS does not need to coordinate the behavior of concurrent txns running on other nodes. A distributed txn accesses data at one or more partitions. → Requires expensive coordination.

  29. TRANSACTION COORDINATION. If our DBMS supports multi-operation and distributed txns, we need a way to coordinate their execution in the system. Two different approaches: → Centralized: Global "traffic cop". → Decentralized: Nodes organize themselves.

  30. TP MONITORS. An example of a centralized coordinator. Originally developed in the 1970-80s to provide txns between terminals and mainframe databases. → Examples: ATMs, Airline Reservations. Many DBMSs now support the same functionality internally.

  31. CENTRALIZED COORDINATOR. [Diagram: the application server sends lock requests for partitions P1-P4 to a coordinator, which acknowledges them; when the application sends the commit request, the coordinator asks the partitions whether it is safe to commit.]
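A simplified sketch of the commit decision shown in the diagram: the coordinator asks every partition touched by the txn whether it is safe to commit and commits only if all of them agree. The class and method names are hypothetical, and this glosses over the full two-phase commit protocol covered in the next lecture.

```python
# Centralized coordinator sketch: commit only if every partition votes yes,
# otherwise abort everywhere.
class Partition:
    def __init__(self, name):
        self.name = name
    def safe_to_commit(self, txn_id):
        return True            # e.g. all locks held, no conflicts detected
    def commit(self, txn_id):
        print(f"{self.name}: commit {txn_id}")
    def abort(self, txn_id):
        print(f"{self.name}: abort {txn_id}")

class Coordinator:
    def commit_request(self, txn_id, partitions):
        if all(p.safe_to_commit(txn_id) for p in partitions):
            for p in partitions:
                p.commit(txn_id)
            return "committed"
        for p in partitions:
            p.abort(txn_id)
        return "aborted"

coord = Coordinator()
print(coord.commit_request("txn-1", [Partition("P1"), Partition("P2")]))
```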

  32. CENTRALIZED COORDINATOR. [Diagram: a middleware layer sits in front of the partitions; the application server sends query requests and the commit request to the middleware, which routes them using the partition map (P1→ID:1-100, P2→ID:101-200, P3→ID:201-300, P4→ID:301-400) and asks the partitions whether it is safe to commit.]

  33. DECENTRALIZED COORDINATOR. [Diagram: the application server sends the begin request to one home partition, sends query requests directly to the partitions that hold the data, and on the commit request the home partition asks the other partitions whether it is safe to commit.]

  34. DISTRIBUTED CONCURRENCY CONTROL. Need to allow multiple txns to execute simultaneously across multiple nodes. → Many of the same protocols from single-node DBMSs can be adapted. This is harder because of: → Replication. → Network Communication Overhead. → Node Failures. → Clock Skew.

  35. DISTRIBUTED 2PL. [Diagram: T1 on Node 1 sets A=2 and then tries to set B=7 on Node 2, while T2 on Node 2 sets B=9 and then tries to set A=0 on Node 1. Each txn holds a lock the other needs, so the waits-for graph has a cycle (T1 → T2 → T1): a distributed deadlock.]
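One way to detect the deadlock in this example, assuming each node ships its local waits-for edges to wherever detection runs: union the edges into a global graph and look for a cycle. has_cycle is an illustrative helper, not any particular system's API.

```python
# Waits-for graph cycle detection: each edge (waiter, holder) means txn
# `waiter` is blocked on a lock held by txn `holder`. A cycle means deadlock.
def has_cycle(edges):
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)

    visiting, done = set(), set()
    def dfs(txn):
        visiting.add(txn)
        for nxt in graph.get(txn, ()):
            if nxt in visiting or (nxt not in done and dfs(nxt)):
                return True            # found a back edge -> cycle
        visiting.discard(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in graph if t not in done)

# Node 1 reports: T2 waits for T1's lock on A. Node 2 reports: T1 waits for T2's lock on B.
print(has_cycle([("T2", "T1"), ("T1", "T2")]))   # True -> deadlock, abort a victim
```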

  36. CONCLUSION. I have barely scratched the surface on distributed database systems… It is hard to get right. More info (and humiliation): → Kyle Kingsbury's Jepsen Project.

  37. NEXT CLASS. Distributed OLTP Systems, Replication, CAP Theorem, Real-World Examples.
