nosql data stores
play

NoSQL Data Stores Corso di Sistemi e Architetture per Big Data A.A. - PDF document

Macroarea di Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica NoSQL Data Stores Corso di Sistemi e Architetture per Big Data A.A. 2019/20 Valeria Cardellini Laurea Magistrale in Ingegneria Informatica The reference Big


  1. Macroarea di Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica NoSQL Data Stores Corso di Sistemi e Architetture per Big Data A.A. 2019/20 Valeria Cardellini Laurea Magistrale in Ingegneria Informatica The reference Big Data stack High-level Interfaces Support / Integration Data Processing Data Storage Resource Management Valeria Cardellini - SABD 2019/20 1

  2. Traditional RDBMSs • Relational DBMSs (RDBMSs) – Traditional technology for storing structured data in web and business applications • SQL is good – Rich language and toolset – Easy to use and integrate – Many vendors • RDBMSs promise ACID guarantees Valeria Cardellini - SABD 2019/20 2 ACID properties • A tomicity – All statements in a transaction are either executed or the whole transaction is aborted without affecting the database: “all or nothing” rule that is, transactions do not occur partially • C onsistency – A database is in a consistent state before and after a transaction; it refers to the correctness of a database • I solation – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions) • D urability – Changes are written to disk (i.e., non-volatile memory) before a database commits a transaction so that committed data cannot be lost if a system failure occurs Valeria Cardellini - SABD 2019/20 3

  3. RDBMS constraints • Domain constraints – Restrict the domain of each attribute or the set of possible values for the attribute • Entity integrity constraint – No primary key value can be null • Referential integrity constraints – To maintain consistency among the tuples in two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation • Foreign key – To cross-reference between multiple relations: it is a key in a relation that matches the primary key of another relation Valeria Cardellini - SABD 2019/20 4 Pros and cons of RDBMS Pros Cons • Performance as major • Well-defined consistency constraint, scaling is difficult model • Limited support for complex • ACID guarantees data structures • Relational integrity • Complete knowledge of DB maintained through entity structure required to create and referential integrity ad hoc queries constraints • Commercial DBMSs are • Well suited for OLTP apps expensive OLTP : OnLine Transaction • Some DBMSs have limits on Processing fields size • Sound theoretical foundation • Data integration from multiple RDBMSs can be • Stable and standardized cumbersome DBMSs available • Well understood Valeria Cardellini - SABD 2019/20 5

  4. RDBMS challenges • Web-based applications cause spikes – Internet-scale data size – High read-write rates – Frequent schema changes • Let’s scale RDBMSs – But RDBMS were not designed to be distributed • How to scale RDBMSs? – Replication – Sharding Valeria Cardellini - SABD 2019/20 6 Replication • Primary backup with master/worker architecture • Replication improves read scalability • Write operations? Valeria Cardellini - SABD 2019/20 7

  5. Sharding • Horizontal partitioning of data across many separate servers • Read and write operations scale • Cannot execute transactions across shards (partitions) • Consistent hashing can be use to determine which server any shard is assigned to ⁃ Hash both data and server using the same hash function in the same ID space Valeria Cardellini - SABD 2019/20 8 Scaling RDBMSs is expensive and inefficient Source: Couchbase technical report Valeria Cardellini - SABD 2019/20 9

  6. NoSQL data stores • NoSQL = Not Only SQL – SQL-style querying is not the crucial objective Valeria Cardellini - SABD 2019/20 10 NoSQL data stores: main features • Support flexible schema – No requirement for fixed rows in a table schema – Well suited for Agile development process • Scale horizontally – Partitioning of data and processing over multiple nodes • Provide high availability – By replicating data in multiple nodes, often geo-distributed • Mainly utilize shared-nothing architecture – With exception of graph-based databases • Avoid unneeded complexity – E.g., elimination of join operations • Support weaker consistency models – BASE rather than ACID: compromising reliability for better performance Valeria Cardellini - SABD 2019/20 11

  7. ACID vs BASE • Two design philosophies at opposite ends of the consistency-availability spectrum - Keep in mind CAP theorem Pick two of Consistency, Availability and Partition tolerance • ACID: traditional approach for RDBMSs – Pessimistic approach: prevents conflicts from occurring • Usually implemented with write locks managed by the system • Leads to performance degradation and deadlocks (hard to prevent and debug) – Does not scale well when handling petabytes of data (remember of latency!) Valeria Cardellini - SABD 2019/20 12 ACID vs BASE (2) • BASE : B asically A vailable, S oft state, E ventual consistency – B asically A vailable: the system is available most of the time and there could exist a subsystem temporarily unavailable – S oft state: data is not durable that is, its persistence is in the hand of the user that must take care of refreshing it – E ventually consistent: the system eventually converges to a consistent state • Optimistic approach - Lets conflicts occur, but detects them and takes action to sort them out: how? • Conditional updates : test the value just before updating • Save both updates : record that they are in conflict and then merge them 13 Valeria Cardellini - SABD 2019/20

  8. NoSQL and consistency • Biggest change from RDBMS – RDBMS: strong consistency – Traditional RDBMS are CA systems (or CP systems, depending on the configuration) • Most NoSQL systems are eventual consistent – i.e., AP systems • But some NoSQL systems provide strong consistency or tunable consistency – E.g., Cassandra and MongoDB Valeria Cardellini - SABD 2019/20 14 NoSQL cost and performance Source: Couchbase technical report Valeria Cardellini - SABD 2019/20 15

  9. Pros and cons of NoSQL Pros Cons • No ACID guarantees, less • Easy to scale-out suitable for OLTP apps • Higher performance for • No fixed schema, no massive data scale common data storage model • Allows sharing of data • Lack of standardization across multiple servers (e.g., querying is unique for • Most solutions are either each NoSQL data store) open-source or cheaper • Limited support for aggregation ops (sum, avg, • HA and fault tolerance count, group by) provided by data replication Valeria Cardellini - SABD 2019/20 • Poor performance for • Supports complex data complex joins structures and objetcs • No well defined approach for • No fixed schema, supports DB design (different data unstructured data models) • Very fast retrieval of data, • Lack of model can lead to suitable for real-time apps solution lock-in 16 NoSQL data models • A number of largely diverse data stores not based on the relational data model Valeria Cardellini - SABD 2019/20 17

  10. NoSQL data models • Data model : set of constructs for representing information – Relational model: tables, columns and rows • Storage model : how the data store management system stores and manipulates data internally • A data model is usually independent of the storage model • Data models for NoSQL systems: – Aggregate-oriented models: key-value , document , and column-family – Graph-based models Valeria Cardellini - SABD 2019/20 18 Aggregates • Data as units having a complex structure – More structure than just a set of tuples – E.g., complex record with simple fields, arrays, records nested inside • Aggregate pattern in Domain-Driven Design – Collection of related objects that we treat as a unit – Unit for data manipulation and management of consistency • Advantages of aggregates – Easier for application programmers to work with – Easier for data store systems to handle operating on a cluster See http://thght.works/1XqYKB0 Valeria Cardellini - SABD 2019/20 19

  11. Transactions? • Relational databases have ACID transactions • Aggregate-oriented data stores: – Support atomic transactions, but only within a single aggregate – Don’t have ACID transactions that span multiple aggregates • In case of update over multiple aggregates: possible inconsistent reads – Take into account when deciding how to aggregate data • Graph databases tend to support ACID transactions Valeria Cardellini - SABD 2019/20 20 Key-value data model • Simple data model: data is represented as a schema- less collection of key-value pairs – Associative array (map or dictionary) as fundamental data model • Strongly aggregate-oriented – Lots of aggregates – Each aggregate has a key • Data model: – Set of <key, value> pairs – Value: aggregate instance • The aggregate is opaque to the data store – Just a big blob of mostly meaningless bit • Access to aggregate: lookup based on its key • Richer data models can be implemented on top Valeria Cardellini - SABD 2019/20 21

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend