Università degli Studi di Roma “Tor Vergata”
Dipartimento di Ingegneria Civile e Ingegneria Informatica

NoSQL Data Stores
Course on Systems and Architectures for Big Data
Academic Year 2017/18
Valeria Cardellini

The reference Big Data stack
[Figure: the stack components are High-level Interfaces, Support / Integration, Data Processing, Data Storage, Resource Management]

Valeria Cardellini - SABD 2017/18
Traditional RDBMSs
• RDBMSs: the traditional technology for storing structured data in web and business applications
• SQL is good
  – Rich language and toolset
  – Easy to use and integrate
  – Many vendors
• RDBMSs promise ACID guarantees

ACID properties
• Atomicity
  – Either all statements in a transaction are executed, or the whole transaction is aborted without affecting the database (“all or nothing” principle)
• Consistency
  – The database is in a consistent state before and after a transaction
• Isolation
  – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions)
• Durability
  – Changes are written to disk before the database commits a transaction, so that committed data cannot be lost through a power failure
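The “all or nothing” principle can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module as a stand-in RDBMS. The accounts table, the names alice and bob, and the transfer function are hypothetical and chosen only for illustration. The credit is deliberately executed before the debit, so that a failing debit shows the already-applied credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a balance can never become negative.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed: the credit above was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update is applied
print(ok, failed, balances(conn))
```

The second transfer would leave alice with a negative balance, so the whole transaction is aborted and the balances are the same as before it started.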
RDBMS constraints
• Domain constraints
  – Restrict the domain of each attribute, i.e., the set of possible values for the attribute
• Entity integrity constraint
  – No primary key value can be null
• Referential integrity constraint
  – To maintain consistency among the tuples in two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation
• Foreign key
  – To cross-reference between multiple relations: it is a key in a relation that matches the primary key of another relation

Pros and cons of RDBMS
Pros
• Well-defined consistency model
• ACID guarantees
• Relational integrity maintained through entity and referential integrity constraints
• Well suited for OLTP apps (OLTP: OnLine Transaction Processing)
• Sound theoretical foundation
• Stable and standardized DBMSs available
• Well understood
Cons
• Performance as major constraint, scaling is difficult
• Limited support for complex data structures
• Complete knowledge of DB structure required to create ad hoc queries
• Commercial DBMSs are expensive
• Some DBMSs have limits on field size
• Data integration from multiple RDBMSs can be cumbersome
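As an illustration of the domain, entity integrity and referential integrity constraints listed under “RDBMS constraints”, the following sketch again uses Python's sqlite3 module. The hotel and room tables and the accepted helper are hypothetical. Two SQLite-specific details are assumptions of this sketch: foreign keys are only enforced after PRAGMA foreign_keys = ON, and a non-integer primary key needs an explicit NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("""
    CREATE TABLE hotel (
        code  TEXT PRIMARY KEY NOT NULL,                      -- entity integrity
        stars INTEGER NOT NULL CHECK (stars BETWEEN 1 AND 5)  -- domain constraint
    )""")
conn.execute("""
    CREATE TABLE room (
        hotel_code TEXT NOT NULL REFERENCES hotel(code),      -- foreign key
        number     INTEGER NOT NULL,
        PRIMARY KEY (hotel_code, number)
    )""")


def accepted(sql, params):
    """Return True if the statement commits, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid hotel": accepted("INSERT INTO hotel VALUES (?, ?)", ("ACE", 4)),
    "domain violation": accepted("INSERT INTO hotel VALUES (?, ?)", ("BAD", 9)),
    "null primary key": accepted("INSERT INTO hotel VALUES (?, ?)", (None, 3)),
    "valid room": accepted("INSERT INTO room VALUES (?, ?)", ("ACE", 101)),
    "dangling foreign key": accepted("INSERT INTO room VALUES (?, ?)", ("NOPE", 1)),
}
for case, ok in results.items():
    print(f"{case}: {'accepted' if ok else 'rejected'}")
```

Only the two valid rows are stored; each of the other three inserts violates exactly one of the constraint types and is rejected by the database itself rather than by application code.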
RDBMS challenges
• Web-based applications caused spikes
  – Internet-scale data size
  – High read-write rates
  – Frequent schema changes
• Let’s scale RDBMSs
  – RDBMSs were not designed to be distributed
• Possible solutions:
  – Replication
  – Sharding

Replication
• Primary backup with master/worker architecture
• Replication improves read scalability
• Write operations?
Sharding
• Horizontal partitioning of data across many separate servers
• Scales read and write operations
• Cannot execute transactions across shards (partitions)
• Consistent hashing is one form of sharding
  – Hash both data and nodes using the same hash function in the same ID space

Scaling RDBMSs is expensive and inefficient
[Figure] Source: Couchbase technical report
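The consistent hashing idea from the Sharding slide (hash both data and nodes with the same hash function into the same ID space) can be sketched as follows. This is a minimal illustration, not a production design: the class ConsistentHashRing, the node names and the 32-bit ID space derived from SHA-256 are assumptions, and virtual nodes and hash collisions between nodes are ignored.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or a data key into the same 32-bit ID space."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted positions of the nodes on the ring
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def node_for(self, key: str) -> str:
        """Return the first node found clockwise from the key's position."""
        idx = bisect.bisect_left(self._positions, ring_position(key))
        # Wrap around to the first node when the key hashes past the last one.
        return self._owner[self._positions[idx % len(self._positions)]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(200)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

When a node joins, the only keys that change owner are those that now fall to the new node; all other keys stay where they were, which is what makes this scheme attractive compared with hashing modulo the number of servers.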
NoSQL data stores
• NoSQL = Not Only SQL
  – SQL-style querying is not the crucial objective
• Main features of NoSQL data stores
  – Support flexible schema
    • No requirement for fixed rows in a table schema
  – Scale horizontally
    • Partitioning of data and processing over multiple nodes
  – Provide scalability and high availability by replicating data in multiple nodes, often across datacenters
  – Multiprocessor support
  – Mainly utilize shared-nothing architecture
    • With the exception of graph-based databases

NoSQL data stores (2)
• Main features of NoSQL data stores (continued)
  – Avoid unneeded complexity
    • E.g., elimination of join operations
  – Useful when working with Big Data whose nature does not require a relational model
  – Support weaker concurrency models than the standard ACID transaction model
    • Rather BASE: compromising reliability for better performance
ACID vs BASE
• Two design philosophies at opposite ends of the consistency-availability spectrum
  – Keep in mind the CAP theorem! Pick two of Consistency, Availability and Partition tolerance
• ACID: the traditional approach to address the consistency issue in RDBMSs
  – A pessimistic approach: prevent conflicts from occurring
    • Usually implemented with write locks managed by the system
  – But ACID does not scale well when handling petabytes of data (remember latency!)

ACID vs BASE (2)
• BASE stands for Basically Available, Soft state, Eventual consistency
  – An optimistic approach
    • Lets conflicts occur, but detects them and takes action to sort them out
    • Approaches:
      • conditional updates: test the value just before updating
      • save both updates: record that they are in conflict and then merge them
  – Basically Available: the system is available most of the time, although a subsystem may be temporarily unavailable
  – Soft state: data is not durable, in the sense that its persistence is in the hands of the user, who must take care of refreshing it
  – Eventually consistent: the system eventually converges to a consistent state
• Usually adopted in NoSQL databases
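The “conditional updates” approach (test the value just before updating) can be sketched with a version number attached to each value. This is a minimal single-process illustration; the class VersionedStore and its methods are hypothetical and do not correspond to the API of any specific NoSQL store.

```python
class VersionedStore:
    """In-memory key-value store where every write must name the version it read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Apply the write only if nobody else has updated the key in between."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False  # conflict detected: the caller must re-read and retry
        self._data[key] = (current_version + 1, value)
        return True


store = VersionedStore()
store.put_if_version("room-42", 0, "free")

# Two clients read the same version of the record...
version_seen_by_ann, _ = store.get("room-42")
version_seen_by_pathin, _ = store.get("room-42")

# ...and then both try to update it.
ann_ok = store.put_if_version("room-42", version_seen_by_ann, "booked by Ann")
pathin_ok = store.put_if_version("room-42", version_seen_by_pathin, "booked by Pathin")
print(ann_ok, pathin_ok, store.get("room-42"))
```

No lock is held between the read and the write: the conflict is allowed to happen, and the second writer finds out only when its update is rejected, at which point it can re-read the value and decide what to do.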
Consistency
• Biggest change from a centralized RDBMS to a cluster-oriented NoSQL
• RDBMS: strong consistency
  – Traditional RDBMSs are CA systems (or CP systems, depending on the configuration)
• NoSQL systems: mostly eventual consistency
  – AP systems

Consistency: an example
• Ann is trying to book a room at the Ace Hotel in New York through a node of a booking system located in London
• Pathin is trying to do the same on a node located in Mumbai
• The booking system uses a replicated database with the master located in Mumbai and the slave in London
• There is only one room available
• The network link between the two servers breaks
[Figure: Ann at the London node, Pathin at the Mumbai node]
Consistency: an example (2)
• CA system: neither user can book any hotel room
  – No tolerance to network partitions
• CP system:
  – Pathin can make the reservation
  – Ann can see the inconsistent room information but cannot book the room
• AP system: both nodes accept the hotel reservation
  – Overbooking!
• Remember that the tolerance to this situation depends on the application type
  – Blog, financial exchange, shopping cart, …

Pessimistic vs. optimistic approach
• Concurrency involves a fundamental tradeoff between:
  – Safety (avoiding errors such as update conflicts) and
  – Liveness (responding quickly to clients)
• Pessimistic approaches often:
  – Severely degrade the responsiveness of a system
  – Lead to deadlocks, which are hard to prevent and debug
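Returning to the hotel booking example, the three outcomes during the network partition can be reproduced with a toy simulation. It is a deliberately simplified sketch: the Replica class, the function names and the mode strings "CA", "CP" and "AP" are hypothetical, and only the behaviour while the link is broken is modelled.

```python
class Replica:
    """One node of the booking system with its local view of free rooms."""

    def __init__(self, name, is_master, rooms_free):
        self.name = name
        self.is_master = is_master
        self.rooms_free = rooms_free


def book_during_partition(node, mode):
    """Try to book one room on `node` while the link between nodes is broken."""
    if mode == "CA":
        return False  # no partition tolerance: nobody can book
    if mode == "CP" and not node.is_master:
        return False  # the replica can show data but refuses writes
    if node.rooms_free < 1:
        return False
    node.rooms_free -= 1  # "AP", or the master in "CP": accept locally
    return True


def simulate(mode):
    # Both nodes start with the same replicated state: one free room.
    mumbai = Replica("Mumbai", is_master=True, rooms_free=1)
    london = Replica("London", is_master=False, rooms_free=1)
    return {
        "Pathin": book_during_partition(mumbai, mode),
        "Ann": book_during_partition(london, mode),
    }


for mode in ("CA", "CP", "AP"):
    print(mode, simulate(mode))
```

In the AP case both bookings succeed against a single room, which is the overbooking situation; whether that is acceptable depends on the application, as noted in the example.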
NoSQL cost and performance
[Figure] Source: Couchbase technical report

Pros and cons of NoSQL
Pros
• Easy to scale out
• Higher performance for massive data scale
• Allows sharing of data across multiple servers
• Most solutions are either open-source or cheaper
• HA and fault tolerance provided by data replication
• Supports complex data structures and objects
• No fixed schema, supports unstructured data
• Very fast retrieval of data, suitable for real-time apps
Cons
• Do not provide ACID guarantees, less suitable for OLTP apps
• No fixed schema, no common data storage model
• Limited support for aggregation (sum, avg, count, group by)
• Performance for complex joins is poor
• No well-defined approach for DB design (different solutions have different data models)
• Lack of a consistent model can lead to solution lock-in
Barriers to NoSQL
• Main barriers to NoSQL adoption
  – No full ACID transaction support
  – Lack of standardized interfaces
  – Huge investments already made in existing RDBMSs
• A commercial example
  – AWS launched two NoSQL services (SimpleDB in 2007 and later DynamoDB in 2012) and one RDBMS service (RDS in 2009)

NoSQL data models
• A number of largely diverse data stores not based on the relational data model