
Outline Background & Motivation System Overview System Design - PowerPoint PPT Presentation

R-Store: A Scalable Distributed System for Supporting Real-time Analytics Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi @ICDE 2014 Presented by: Xiao Meng CS848, University of Waterloo Outline Background & Motivation System


  1. R-Store : A Scalable Distributed System for Supporting Real-time Analytics Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi @ICDE 2014 Presented by: Xiao Meng CS848, University of Waterloo

  2. Outline • Background & Motivation • System Overview • System Design • RTOLAP in R-Store • Evaluation • Conclusion • Q & A

  3. Background & Motivation • Situation for large-scale data processing – Systems are classified into 2 categories: OLTP and OLAP – Data is periodically transported to OLAP through ETL • Demand – Time-critical decision making (RTOLAP): the freshness of OLAP results matters, and fully real-time OLAP entails executing queries directly on OLTP data – OLAP & OLTP processed by one integrated system

  4. Background & Motivation • Problems with a simple combination – Resource contention: OLTP queries are blocked by OLAP – Inconsistency: a long-running OLAP query may access the same data set several times, and updates by OLTP could lead to incorrect OLAP results • Solution: R-Store – Resource contention: computation resource isolation – Inconsistency: multi-versioning storage system

  5. System Overview – A glimpse of R-Store • OLAP queries read data from the multi-versioning storage system based on the timestamp of query submission – Modified HBase as storage – MapReduce job for query execution • Periodically materialize real-time data into a data cube – A full HBase scan on every query is time-consuming (the entire table is scanned & shuffled during MR) – Streaming MapReduce to maintain the data cube

  6. System Overview – R-Store Architecture • OLTP submitted to the KV store • OLAP query processed by MapReduce – Scan on HBase • Refresh data cube through streaming MapReduce • MetaStore to generate the query timestamp T_Q & metadata

  7. System Design – A Glimpse of HBase

  8. System Design – Storage Design based on HBase • Extend Scan into 2 variants – FullScan for querying the data cube – IncrementalScan for querying real-time data • Infinite versions of data to maintain query consistency – Compaction to remove stale versions – Global compaction: immediately follows a data cube refresh – Local compaction: compacts old versions not accessed by any scan process

  9. System Design – IncrementalScan in detail • Target: find the changes since the last data cube materialization • Method – Take 2 timestamps as input, T_DC & T_Q, and return the values with the largest timestamps before T_DC & T_Q • Implementations – Naïve: access the memstore & storefiles in parallel – Adaptive: maintain the keys modified since the last materialization; first scan the memstore, then scan or random-access the keys based on cost
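
The IncrementalScan method above can be illustrated with a minimal in-memory sketch. It is not the R-Store or HBase implementation: the dictionary-based `store`, the function names, and the choice to treat "before" as strictly before are all assumptions made for illustration. For each key with a version written between the data cube timestamp and the query timestamp, it returns the value visible at each of the two timestamps.

```python
from bisect import bisect_left


def latest_before(versions, ts):
    """Return the value with the largest timestamp strictly before ts, or None.

    `versions` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in versions]
    idx = bisect_left(timestamps, ts)  # count of versions with timestamp < ts
    return versions[idx - 1][1] if idx > 0 else None


def incremental_scan(store, t_dc, t_q):
    """Yield (key, value_before_t_dc, value_before_t_q) for changed keys.

    `store` maps key -> sorted list of (timestamp, value) versions
    (a hypothetical stand-in for a multi-version table).
    A key counts as changed if some version has t_dc <= timestamp < t_q.
    """
    for key in sorted(store):
        versions = store[key]
        if any(t_dc <= t < t_q for t, _ in versions):
            yield key, latest_before(versions, t_dc), latest_before(versions, t_q)
```

A key that did not exist at the data cube timestamp yields `None` as its old value, which a caller can treat as an insert rather than an update.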

  10. System Design – Compaction in detail • Global compaction – Similar to HBase's default; retains only one version of each key – Triggered by completion of the data cube refresh • Local compaction – Compacted data is stored in different files so that running scan processes are not blocked – Files can be removed when not accessed by any scan – Triggered when #tuple/#key exceeds a threshold
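
The two compaction modes can be sketched over a toy multi-version store. This is a simplified illustration under stated assumptions, not the HBase-R code: the store is a hypothetical dict of key to sorted (timestamp, value) lists, every key is assumed to have at least one version, and "accessed by a scan" is approximated as "the newest version older than that scan's timestamp". Local compaction returns a new structure rather than modifying the old one, mirroring the idea of writing compacted data to separate files.

```python
def global_compaction(store):
    """Keep only the newest version of each key (simplified global compaction)."""
    return {key: [versions[-1]] for key, versions in store.items()}


def needs_local_compaction(store, threshold):
    """Trigger check: average number of versions (tuples) per key exceeds threshold."""
    n_tuples = sum(len(versions) for versions in store.values())
    n_keys = len(store)
    return n_keys > 0 and n_tuples / n_keys > threshold


def local_compaction(store, active_scan_timestamps):
    """Return a new store that drops versions no active scan can read.

    Kept per key: the newest version, plus, for each active scan timestamp,
    the newest version written strictly before that timestamp.
    """
    compacted = {}
    for key, versions in store.items():
        keep = {len(versions) - 1}  # always keep the newest version
        for ts in active_scan_timestamps:
            visible = [i for i, (t, _) in enumerate(versions) if t < ts]
            if visible:
                keep.add(visible[-1])
        compacted[key] = [versions[i] for i in sorted(keep)]
    return compacted
```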

  11. System Design – Data cube • Define a data cube for “Customer Profiles” • Dimensions: age, income, buys

  12. System Design – Data cube maintenance • Re-computation (first run) – FullScan on one region, generate a KV pair for each cuboid in the mapper, aggregate & output in the reducer • Incremental update (subsequent runs) – A propagation step computes the changes & an update step applies them to the cube – The streaming system updates the cube internally & periodically materializes it into storage
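
The re-computation path (one KV pair per cuboid in the mapper, aggregation in the reducer) might look roughly like the following single-process sketch. The dimension names come from the "Customer Profiles" example; the `amount` measure, the SUM aggregate, and all function names are assumptions added for illustration, and no actual MapReduce framework is involved.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age", "income", "buys")


def cuboid_keys(row):
    """Yield one cell key per cuboid, i.e. per subset of the dimensions."""
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            yield dims, tuple(row[d] for d in dims)


def map_row(row, measure="amount"):
    """Mapper: emit (cell key, measure value) for every cuboid of one row."""
    for key in cuboid_keys(row):
        yield key, row[measure]


def reduce_cells(pairs):
    """Reducer: sum the measure values that share a cell key."""
    cube = defaultdict(int)
    for key, value in pairs:
        cube[key] += value
    return dict(cube)


def recompute_cube(rows):
    """Full re-computation: map every row, then reduce all emitted pairs."""
    return reduce_cells(kv for row in rows for kv in map_row(row))
```

With three dimensions and no hierarchies, each row contributes to 2^3 = 8 cells, one per cuboid, from the apex `((), ())` down to the base cuboid.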

  13. System Design – HStreaming for cube maintenance • Each mapper is responsible for processing the updates within a key range – Maintains KVs locally, caches hot keys in memory – For each update, emits 2 KV pairs for each cuboid (+, -) • Reducers cache the output KVs of the mappers, invoke reduce every W_r, and refresh the cube every W_cube
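
A rough sketch of the "+" / "-" scheme described above: for each update, the mapper emits a "-" pair carrying the old value and a "+" pair carrying the new value for every cuboid, and the reducer folds these into cached deltas before refreshing the cube. This is an in-memory illustration, not HStreaming code; the `amount` measure, the SUM aggregate, the `CubeReducer` class, and its method names are hypothetical, and the periodic timers that drive reduce and refresh are left to the caller.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age", "income", "buys")


def cell_keys(row):
    """Yield one cell key per cuboid (every subset of the dimensions)."""
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            yield dims, tuple(row[d] for d in dims)


def map_update(old_row, new_row, measure="amount"):
    """Mapper: emit ('-', old value) and ('+', new value) for each cuboid.

    Pass old_row=None for an insert and new_row=None for a delete.
    """
    if old_row is not None:
        for key in cell_keys(old_row):
            yield key, ("-", old_row[measure])
    if new_row is not None:
        for key in cell_keys(new_row):
            yield key, ("+", new_row[measure])


class CubeReducer:
    """Caches mapper output as per-cell deltas and applies them on refresh."""

    def __init__(self):
        self.pending = defaultdict(int)  # deltas not yet applied to the cube
        self.cube = defaultdict(int)     # local data cube (SUM per cell)

    def collect(self, pairs):
        """Reduce step: fold tagged values into the cached deltas."""
        for key, (sign, value) in pairs:
            self.pending[key] += value if sign == "+" else -value

    def refresh(self):
        """Cube refresh: apply the cached deltas and clear the cache."""
        for key, delta in self.pending.items():
            self.cube[key] += delta
        self.pending.clear()
```

Keeping the deltas separate from the cube until `refresh` is called means the cube only changes at refresh boundaries, which is the behaviour the slide attributes to the periodic refresh.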

  14. System Design – Data Flow of R-Store 1. Updates arrive at HBase-R 2. HBase-R streams the updates to an HStreaming mapper 3. Reducers periodically materialize the local data cube to HBase-R & notify the MetaStore

  15. RTOLAP in R-Store – Query Processing • Map – Tag the values with ‘Q’, ‘+’, ‘-’ • Reduce – Do the calculation based on the aggregation function & the three values
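
One way to read the reduce step is sketched below for distributive aggregates. It assumes, as an interpretation of the slide rather than a confirmed detail, that ‘Q’ tags the value read from the materialized data cube, ‘+’ tags new values from real-time data, and ‘-’ tags the old values they replace. The function name and the restriction to SUM and COUNT are illustrative choices.

```python
def reduce_cell(tagged_values, agg="sum"):
    """Combine the tagged values of one cube cell into a fresh aggregate.

    `tagged_values` is an iterable of (tag, value) with tag in {'Q', '+', '-'}.
    For SUM and COUNT the result is cube value + new values - old values.
    """
    totals = {"Q": 0, "+": 0, "-": 0}
    for tag, value in tagged_values:
        totals[tag] += value
    if agg in ("sum", "count"):
        return totals["Q"] + totals["+"] - totals["-"]
    raise ValueError(f"aggregate not covered by this sketch: {agg}")
```

For example, a cell whose cube value is 100, with one row changed from 10 to 12 since the last materialization, gives `reduce_cell([("Q", 100), ("+", 12), ("-", 10)])`, i.e. 100 + 12 - 10 = 102. Aggregates such as MIN or MAX cannot be corrected this way from three sums and would need different handling.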

  16. Evaluation • Cluster of 144 nodes – Intel X3430 2.4 GHz processor – 8 GB of memory – 2x500 GB SATA disks – Gigabit Ethernet • TPC-H data

  17. Evaluation - Performance of Maintaining Data cube • HStreaming with 10 nodes has higher throughput than 40 HBase-R nodes • With 1.6 billion keys and 1% updated, the update algorithm is fast enough • Latency matches the HBase-R input speed

  18. Evaluation - Performance of RT querying • When updates fall in a small key range, less data is scanned in HBase-R and less data is processed

  19. Evaluation - Performance of OLTP

  20. Related Work • Database – C-Store (VLDB 05) • Main-memory database – HyPer (ICDE 11), HYRISE (VLDB 10) • Druid (SIGMOD 14)

  21. Conclusion • Multi-version concurrency control to support RTOLAP • Data cube to reduce storage requirements & improve performance • Streaming system to refresh the data cube • Available at https://github.com/lifeng5042/RStore

  22. Q & A

  23. Backup – OLAP Cube • A multi-dimensional generalization of a two- or three-dimensional spreadsheet; a hypercube for datasets with more than three dimensions • Dimensions: product, time, cities… • Cells: each cell of the cube holds a number that represents some measure of the business, e.g. sales, profits… • Slicer: the dimension held constant for all cells so that multi-dimensional information can be shown in the 2D physical space of a spreadsheet

  24. Backup – OLAP Cube • A data cube can be viewed as a lattice of cuboids • The bottom-most cuboid is the base cuboid • The top-most cuboid (apex) contains only one cell • How many cuboids are there in an n-dimensional cube with L levels? $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels of dimension $i$
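
As a quick check of the cuboid-count question above, the product of (L_i + 1) over all dimensions can be evaluated directly. The function name is hypothetical, and L_i is taken to exclude the virtual top level "all", which is what the "+ 1" accounts for.

```python
from math import prod


def count_cuboids(levels):
    """Number of cuboids for dimensions whose level counts are given in `levels`.

    Each dimension i contributes (L_i + 1) choices: one of its L_i levels,
    or the virtual top level 'all'.
    """
    return prod(level + 1 for level in levels)
```

For a cube with three dimensions and no hierarchies (one level each), `count_cuboids([1, 1, 1])` gives 2^3 = 8; with 10 dimensions of 4 levels each, `count_cuboids([4] * 10)` gives 5^10 = 9,765,625.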
