
Modeling Analytics for Computational Storage
Veronica Lagrange, Harry Li, Anahita Shayesteh
Memory Solutions Lab, Samsung Semiconductor, Inc.
07 April 2020, Version 1.2
ICPE 2020

Agenda
• Motivation
• Near storage opportunities
• Deconstruction of “big data” queries
• Push down to Near Storage
• Workload: TPC-DS
• Modeling Methodologies and Results

Motivation (figure: Server with HD)

Motivation (figure: Server with SSD)

Motivation: Near storage OLAP
SSD: read in all that hay… (figure: Server with SSD)

Motivation: Near storage OLAP
SmartSSD: read in just the needle. (figure: Server with SmartSSD)

Near storage opportunities
• Compression/Decompression
• Encoding/Decoding
• Filter
• Projection
• Some aggregates (SUM, COUNT)
• SORT
• Some JOINs
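One reason SUM and COUNT appear in the list above is that they can be computed in pieces and merged afterwards. The following is a minimal sketch of that idea, not the authors' implementation: `PartialAgg`, `device_partial`, and `host_combine` are hypothetical names, and the values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PartialAgg:
    """Partial SUM and COUNT produced by one device (hypothetical structure)."""
    total: float = 0.0
    count: int = 0


def device_partial(values: Iterable[float]) -> PartialAgg:
    # Would run near storage: only two numbers leave the device,
    # no matter how many values were scanned.
    agg = PartialAgg()
    for v in values:
        agg.total += v
        agg.count += 1
    return agg


def host_combine(parts: List[PartialAgg]) -> PartialAgg:
    # Would run on the server: merge the per-device partial results.
    return PartialAgg(sum(p.total for p in parts), sum(p.count for p in parts))


# Made-up values spread over two devices.
devices = [[1.0, 2.0, 3.0], [4.0, 5.0]]
combined = host_combine([device_partial(d) for d in devices])
average = combined.total / combined.count  # AVG derived from SUM and COUNT
```

An average such as `avg(ss_net_profit)` can be derived on the server from the merged SUM and COUNT, so the raw values never need to cross the storage interface in this sketch.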

Deconstruction of “big data” queries
TPC-DS Q44: “List the best and worst performing products measured by net profit.” For a specific store.

select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name worst_performing
from (select *
      from (select item_sk, rank() over (order by rank_col asc) rnk
            from (select ss_item_sk item_sk, avg(ss_net_profit) rank_col
                  from store_sales ss1
                  where ss_store_sk = 2
                  group by ss_item_sk
                  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col
                                                   from store_sales
                                                   where ss_store_sk = 2
                                                     and ss_hdemo_sk is null
                                                   group by ss_store_sk))V1)V11
      where rnk < 11) asceding,
     (select *
      from (select item_sk, rank() over (order by rank_col desc) rnk
            from (select ss_item_sk item_sk, avg(ss_net_profit) rank_col
                  from store_sales ss1
                  where ss_store_sk = 2
                  group by ss_item_sk
                  having avg(ss_net_profit) > 0.9*(select avg(ss_net_profit) rank_col
                                                   from store_sales
                                                   where ss_store_sk = 2
                                                     and ss_hdemo_sk is null
                                                   group by ss_store_sk))V2)V21
      where rnk < 11) descending,
     item i1,
     item i2
where asceding.rnk = descending.rnk
  and i1.i_item_sk = asceding.item_sk
  and i2.i_item_sk = descending.item_sk
order by asceding.rnk
limit 100;

Executive Summary

Push down to Near Storage
Operations pushed down:
• SCAN: I/O plus data transformation
• FILTER: row selection
• PROJECTION: column selection
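The three pushed-down operations can be sketched in a few lines. This is an illustrative model only: `near_storage_scan` is a hypothetical function, the rows are made up, and the column names are borrowed from the `store_sales` table used in TPC-DS Q44.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Row = Dict[str, object]


def near_storage_scan(rows: Sequence[Row],
                      predicate: Callable[[Row], bool],
                      columns: List[str]) -> Iterator[Row]:
    """SCAN stored rows, FILTER them, and PROJECT the requested columns."""
    for row in rows:                            # SCAN: read every stored row
        if predicate(row):                      # FILTER: row selection
            yield {c: row[c] for c in columns}  # PROJECTION: column selection


# Made-up rows standing in for data held on the device.
stored = [
    {"ss_store_sk": 2, "ss_item_sk": 10, "ss_net_profit": 5.0, "ss_quantity": 1},
    {"ss_store_sk": 3, "ss_item_sk": 11, "ss_net_profit": 7.5, "ss_quantity": 4},
    {"ss_store_sk": 2, "ss_item_sk": 12, "ss_net_profit": -1.0, "ss_quantity": 2},
    {"ss_store_sk": 1, "ss_item_sk": 10, "ss_net_profit": 2.0, "ss_quantity": 9},
]

# Only rows for store 2, and only the two columns the query needs,
# would be returned to the server.
returned = list(near_storage_scan(
    stored,
    predicate=lambda r: r["ss_store_sk"] == 2,
    columns=["ss_item_sk", "ss_net_profit"],
))
```

In this sketch the server receives two rows of two columns instead of four rows of four columns; the later stages of the query (aggregation, ranking, joins) are unchanged.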

Workload: TPC-DS
Two clusters:
• SPARK-SQL
• Presto
TPC-DS sf10,000 (10TB dataset)
99 TPC-DS queries have different characteristics and performance behavior.

Parquet File Format
Two 8-node Hadoop clusters:
• SPARK-SQL
• Presto
One file format, PARQUET:
• Columnar
• Designed for OLAP applications
• READ optimized
• Self-contained METADATA
• Existing Parquet Readers can FILTER/PROJECT certain datatypes using statistics in METADATA
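Parquet metadata carries per-row-group column statistics such as minimum and maximum values, which is what lets a reader skip row groups that cannot match a filter. The sketch below models that check in plain Python rather than calling a real Parquet library; `RowGroupStats` and `groups_to_read` are hypothetical names and the statistics are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (simplified model)."""
    group_id: int
    min_value: int
    max_value: int


def groups_to_read(stats: List[RowGroupStats], wanted: int) -> List[int]:
    # A row group can contain `wanted` only if min <= wanted <= max,
    # so every other group is skipped without reading its data pages.
    return [s.group_id for s in stats if s.min_value <= wanted <= s.max_value]


# Made-up statistics for a column such as ss_store_sk.
stats = [
    RowGroupStats(group_id=0, min_value=1, max_value=1),
    RowGroupStats(group_id=1, min_value=1, max_value=3),
    RowGroupStats(group_id=2, min_value=4, max_value=9),
]

selected = groups_to_read(stats, wanted=2)  # filter: ss_store_sk = 2
```

The check is conservative: a selected row group may still contain no matching rows, so the row-level filter must still run on whatever is read.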

Modeling methodologies

Modeling methodologies
SPARK-SQL modeling (stage timeline, timestamps 0 to 17):
Stage-0: Read dimension table: Scan, Filter, Project, Aggregate
Stage-1: Read fact table: Scan, Filter, Project, Aggregate
Stage-2: Read fact table: Scan, Filter, Project, Aggregate
Stage-3: Sort, Aggregate
Stage-4: Sort, Aggregate
Stage-5: Join

Modeling methodologies
SPARK-SQL modeling (two stage timelines with the same stages: the first spans timestamps 0 to 17, the second spans timestamps 0 to 11):
Stage-0: Read dimension table: Scan, Filter, Project, Aggregate
Stage-1: Read fact table: Scan, Filter, Project, Aggregate
Stage-2: Read fact table: Scan, Filter, Project, Aggregate
Stage-3: Sort, Aggregate
Stage-4: Sort, Aggregate
Stage-5: Join

Modeling methodologies
Presto modeling:
• Run query with original tables. Repeat query with model tables.
• Presto generates same query plan in both cases.

Modeling Results
Near Storage Speedup, 10TB dataset size (bar chart; y-axis: speedup, log scale from 1 to 100; x-axis: Q4, Q9, Q13, Q28, Q44, Q49, Q51, Q56, Q72, Q75, Q76, Q88; series: Presto, SPARK-SQL)
Geometric Mean:
- Presto: 3.76x
- SPARK-SQL: 2.80x
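For reference, a geometric mean of speedups is the n-th root of their product, usually computed through logarithms. The slides give only the resulting means, not the per-query values, so the inputs below are made up purely to show the calculation; `geometric_mean` is a hypothetical helper.

```python
import math
from typing import Sequence


def geometric_mean(speedups: Sequence[float]) -> float:
    # exp of the average log; every speedup must be greater than zero.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Made-up speedups, NOT the measured per-query results.
example = [1.0, 10.0, 100.0]
gm = geometric_mean(example)
arithmetic = sum(example) / len(example)
```

For these made-up inputs the geometric mean is 10 while the arithmetic mean is 37, which shows why the geometric mean is the usual summary when one query has a much larger speedup than the rest.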

Modeling Results

Modeling Results
Presto Q44 at sf10T is the best speedup observed.
• Total bytes READ much smaller with Model (must use LOG SCALE)
• Avg CPU utilization 4x smaller
• Response time decreases from 18+ minutes to 19 seconds
• Presto plan for Q44 does not scale
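The response-time figures above imply a lower bound on the Q44 speedup. The arithmetic below treats "18+ minutes" as exactly 18 minutes, which is an assumption; the true baseline, and therefore the speedup, is somewhat larger.

```python
# "18+ minutes" taken as a lower bound of exactly 18 minutes (assumption).
before_seconds = 18 * 60
after_seconds = 19

# Lower bound on the speedup implied by the two response times.
min_speedup = before_seconds / after_seconds
```

That works out to roughly 57x or more, far above the 3.76x geometric mean reported for Presto across the selected queries.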

Modeling Results

Conclusion
• Near Storage optimizations for OLAP are NOT universal
• Some queries see significant speedup from Near Storage opportunities
• We covered only basic operations (“low hanging fruit”)
• Other operations are also amenable to Push down to Near Storage
Near Storage Speedup, 10TB dataset size. Geometric Mean: Presto 3.76x, SPARK-SQL 2.80x
Questions?
