Scientific Big Data Benchmark Suite Xinhui Tian, Shaopeng Dai, - PowerPoint PPT Presentation

BigDataBench-S: An Open-source Scientific Big Data Benchmark Suite
Xinhui Tian, Shaopeng Dai, Zhihui Du, Wanling Gao, Rui Ren, Yaodong Cheng, Zhifei Zhang, Zhen Jia, Peijian Wang and Jianfeng Zhan
INSTITUTE OF COMPUTING TECHNOLOGY

Big Data in Scientific Domains
Domains: Astronomy, High Energy Physics, Bioinformatics
• ~10,000,000 events each night for 10 years
• 100 PB
• 40 PB gene data in EBI in 2014
• 140 PB up to 2014
• 15 PB new data per year
HPBDC 2017

Challenges to Data Management and Analytics
• Data Management
  • PB-level data storage
  • Low-latency DBMS operation processing and efficient query processing
• Data Analytics
  • Support for complex analytics
    • Linear Algebra
    • Machine Learning
    • …
  • Different types of analytic operations on the same data set
• System Requirements
  • Good scalability
  • Effective data organization
  • Flexible support for various types of complex analytic operations
  • In-situ data processing
  • Low-cost data sharing

Big Data Systems
• Scalable storage system:
  • Hadoop Distributed File System
• Various subsystems for different types of analytic operations
  • General-purpose frameworks: Spark, MapReduce, Flink, …
  • DBMS components: Hive, SparkSQL, …
  • Machine Learning: Mahout, Spark MLlib, …
  • …

Do current big data management and analytic systems perform well in the context of scientific big data?
A Comprehensive Scientific Big Data Benchmark Suite

Existing Scientific Benchmarks
• SS-DB [Stanford, xldb10]
  • Simulates an astronomical data management scenario
  • Queries include raw data cooking and observation data analysis
Limitations:
  • Only considers one scenario
  • Only includes DBMS operations

Existing Scientific Benchmarks
• GenBase [MIT, Sigmod14]
  • Simulates genomics research
  • Five mixed data management and analytics workloads
    • Data selection -> data analytics -> results extraction
Limitations:
  • Also only one scenario
  • Cannot be used to compare subsystems with the same functions
    • SparkSQL vs Hive?
    • Mahout vs Spark MLlib?

BigDataBench-S
• A new scientific big data benchmark suite for current big data analytics systems
  • Various representative scientific analytic workloads from different typical scientific research areas
  • Comparison among various components designed for the same operation types

Methodology
• Inherited from BigDataBench [Wang, HPCA2014]
[Figure: methodology diagram. Box labels, repeated for Scientific Domain 1, …, i, …, N: Scientific Domain; Benchmark Decomposition (Typical Operation Analysis, Workload Pattern Analysis, Data Model Analysis); Benchmark Specification; Benchmark Selection; Benchmark Subset; Workloads With Diverse Implementations; DataSets]

Data Analysis in Typical Scientific Research Areas
• High Energy Physics
  • LHC Events Discrimination: Classification and Regression
• Astronomy
  • Telescope Image Analysis
• Genomics
  • Microarray Data Analysis

Data Flow
Online Events Acquisition -> Offline Data Reconstruction -> Data Analysis -> Result Extraction [J.-R. Vlimant, 2016]

Overview: 3 data sets, 17 workloads

More Scientific Domains
• Gravitational Waves
• Neuroscience
• …

Comparison Study
• Performance comparison among a number of widely used big data analytics systems, using a subset of BigDataBench-S
  • Hadoop (MapReduce and Tez), Spark
  • Both DBMS queries and complex analytics workloads
  • Different data formats: row-based vs column-based
  • Different data sizes

Comparison Study
• Data Set
  • Simulated microarray data using the GenBase data generator
• Schema
  Matrix data:
    CREATE TABLE geo(
      geneid INT,
      patientid INT,
      expr_value FLOAT);
    CREATE TABLE go(
      geneid INT,
      goid INT,
      belongs INT);
  Meta data:
    CREATE TABLE genes(
      geneid INT,
      target INT,
      pos BIGINT,
      len INT,
      func INT);
    CREATE TABLE patients(
      patientid INT,
      age INT,
      gender INT,
      zipcode INT,
      disease INT,
      response FLOAT);
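To experiment with the schema at small scale, the tables can be created in an embedded database. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for Hive or SparkSQL (SQLite's type handling differs from Hive's). The helper names `create_schema` and `table_names` are hypothetical and are not part of BigDataBench-S.

```python
import sqlite3

# DDL taken from the slide; SQLite is only a stand-in for Hive/SparkSQL here.
SCHEMA = """
CREATE TABLE geo(geneid INT, patientid INT, expr_value FLOAT);
CREATE TABLE go(geneid INT, goid INT, belongs INT);
CREATE TABLE genes(geneid INT, target INT, pos BIGINT, len INT, func INT);
CREATE TABLE patients(patientid INT, age INT, gender INT, zipcode INT,
                      disease INT, response FLOAT);
"""

def create_schema(conn):
    """Create the two matrix tables and the two metadata tables."""
    conn.executescript(SCHEMA)

def table_names(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
print(table_names(conn))
```

An in-memory database keeps the sketch free of side effects; real benchmark runs would load generator output into HDFS-backed tables instead.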

Queries n Query ¡1: ¡Selec,on ¡ n Select ¡data ¡based ¡from ¡matrix ¡table ¡based ¡on ¡condi,ons ¡on ¡the ¡ metadata ¡table ¡ • Map ¡join ¡based ¡data ¡filter SELECT ¡geo.* ¡FROM ¡genes ¡ JOIN ¡geo ¡ON ¡(geo.geneid=genes.geneid) ¡WHERE ¡genes.func ¡< ¡X; 2017 HPBDC

Queries n Query ¡2: ¡Aggrega,on ¡ n An ¡aggregated ¡opera,on ¡on ¡all ¡data ¡in ¡a ¡matrix ¡data ¡table ¡ SELECT ¡geneid, ¡avg(expr_value) ¡as ¡avg_expr_value ¡FROM ¡geo ¡ GROUP ¡BY ¡geneid; 2017 HPBDC

Queries n Query ¡3: ¡Join ¡ n Join ¡data ¡from ¡the ¡geo ¡and ¡go ¡tables ¡ ¡ SELECT ¡go.goid ¡AS ¡go_col, ¡go.pid ¡AS ¡pid, ¡ ¡go.belongs ¡AS ¡cat, ¡gp.ev ¡AS ¡val ¡ FROM ¡ (SELECT ¡g.geneid ¡AS ¡gid, ¡ p.pa@en@d ¡AS ¡pid,g.expr_value ¡AS ¡ev ¡FROM ¡geo ¡g, ¡pa@ents ¡p ¡ WHERE ¡p.pa@en@d ¡< ¡5 ¡ AND ¡g.pa@en@d ¡= ¡p.pa@en@d ¡) ¡gp, ¡go ¡ WHERE ¡go.geneid ¡= ¡gp.gid; 2017 HPBDC

Join Plan in Spark
[Figure: Spark execution plan for Query 3. Labels: Job0, Job1; Stage0 to Stage3; Filter on patients; MapJoin of geo and patients on g.patientid = p.patientid; ShuffleJoin with go on go.geneid = gp.geneid]
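The shuffle join named in the plan can be illustrated outside Spark: both inputs are hash-partitioned on the join key so that matching rows land in the same partition, and each partition is then joined independently. The following is a minimal single-process sketch under that assumption. The function name `shuffle_join`, the partition count, and the sample rows are hypothetical and do not reflect Spark's internal implementation.

```python
from collections import defaultdict

def shuffle_join(left, right, left_key, right_key, num_partitions=4):
    """Hash-partition both inputs on the join key, then join per partition."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    # Shuffle phase: rows with equal keys are routed to the same partition.
    for row in left:
        left_parts[hash(row[left_key]) % num_partitions].append(row)
    for row in right:
        right_parts[hash(row[right_key]) % num_partitions].append(row)
    out = []
    # Join phase: each partition is joined on its own (sequentially here,
    # in parallel on a cluster) with a local hash table.
    for p in range(num_partitions):
        index = defaultdict(list)
        for row in left_parts[p]:
            index[row[left_key]].append(row)
        for row in right_parts[p]:
            for match in index[row[right_key]]:
                out.append(match + row)
    return out

# Hypothetical rows: gp = (gid, pid, ev), go = (geneid, goid, belongs).
gp = [(1, 1, 0.5), (2, 1, 0.7)]
go = [(1, 10, 1), (1, 11, 0), (3, 12, 1)]
joined = shuffle_join(gp, go, 0, 0)
print(sorted(joined))
```

Unlike the map join, neither input has to fit in memory as a whole, at the cost of moving both inputs across the network.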

Complex Analytics
• Covariance
  • Analyzes the relevance among the dimensions of multidimensional data
• SVD
  • Eliminates interference (noise) in raw data
• QR Decomposition
  • Common matrix decomposition used in linear regression, eigenvalue calculation, …
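Two of these operations are small enough to sketch directly. The following plain-Python example computes a sample covariance matrix and a QR decomposition by modified Gram-Schmidt; SVD is omitted for brevity. It is an illustrative sketch only: the function names and inputs are hypothetical, the QR routine assumes a matrix with full column rank, and Mahout and MLlib use their own distributed implementations.

```python
def covariance(rows):
    """Sample covariance matrix of the columns of `rows` (list of lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def qr_gram_schmidt(a):
    """QR decomposition via modified Gram-Schmidt (full column rank assumed).

    Returns (q, r) with a = q * r, where q has orthonormal columns and
    r is upper triangular.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column vectors
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the already computed orthonormal columns.
        for k in range(j):
            r[k][j] = sum(q_cols[k][i] * v[i] for i in range(m))
            v = [v[i] - r[k][j] * q_cols[k][i] for i in range(m)]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r

cov = covariance([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
q, r = qr_gram_schmidt([[3.0, 1.0], [4.0, 2.0]])
print(cov)
print(q, r)
```

Multiplying `q` by `r` should reproduce the input matrix up to floating-point rounding, which is a convenient correctness check for any QR routine.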

Compare with GenBase
                  | GenBase | BigDataBench-S
Workload Category | Mixed | Either data queries or complex analytics
Workload Number   | 5 mixed workloads | 3 data manipulation queries, 3 complex analytics workloads
Supported Systems | Traditional row and column stores + R/Madlib, Hadoop, SciDB | Large-scale data analytics systems, including Hadoop (MapReduce + Tez), Spark

Experiments n Configura,ons ¡ Config Node Number 10 ¡Huawei ¡RH2285 ¡servers CPU Intel ¡Xeon ¡E5645, ¡12 ¡cores Memory 32 ¡GB Disk 1TB ¡SATA 2017 HPBDC

Experiments n Configura,ons ¡ n 10 ¡servers, ¡each ¡with ¡12 ¡cores, ¡32 ¡GB ¡memory ¡and ¡1TB ¡disk Hadoop ¡ Tez Spark MapReduce Version 2.7.1 0.8.3 2.0.1 Query ¡ Hive ¡2.0.0 Hive ¡2.0.0 SparkSQL Processing Machine ¡ Mahout ¡ MLlib Learning 2017 HPBDC

A Brief View of Execution Model
[Figure: execution models of MapReduce and Tez. MapReduce panel: chained Map and Reduce phases with HDFS at the input, between jobs, and at the output. Tez panel: Map and Reduce phases with HDFS at the input and the output only]

A Brief View of Execution Model
• Spark
[Figure: data sharing in Spark. Upper panel: Input, iter. 1, iter. 2, … Lower panel: Input, one-time processing, query 1, query 2, query 3, …] [M. Zaharia, 2012]
