SLIDE 1

Trends and Challenges in Big Data

Ion Stoica

November 14, 2016

PDSW-DISCS’16

UC BERKELEY

SLIDE 2

Before starting…

Disclaimer: I know little about HPC and storage

More collaboration than ever between the HPC, Distributed Systems, and Big Data / Machine Learning communities

Hope this talk will help a bit in bringing us even closer

SLIDE 3

Big Data Research at Berkeley

AMPLab (Jan 2011- Dec 2016)

  • Mission: “Make sense of big data”
  • 8 faculty, 60+ students


Algorithms, Machines, People (AMP)

SLIDE 4

Big Data Research at Berkeley

AMPLab (Jan 2011- Dec 2016)

  • Mission: “Make sense of big data”
  • 8 faculty, 60+ students

Algorithms, Machines, People (AMP)

Goal: next generation of open-source data analytics stack for industry & academia: the Berkeley Data Analytics Stack (BDAS)

SLIDE 5

BDAS Stack (diagram):

  • Processing: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib/MLBase, BlinkDB, SampleClean, SparkR, Velox
  • Storage: Tachyon, Succinct; 3rd party: HDFS, S3, Ceph, …
  • Resource Management: Mesos; 3rd party: Hadoop YARN

SLIDE 6

Several Successful Projects

Apache Spark: most popular big data execution engine

  • 1000+ contributors
  • 1000+ orgs; offered by all major clouds and distributors

Apache Mesos: cluster resource manager

  • Manages 10,000+ node clusters
  • Used by 100+ organizations (e.g., Twitter, Verizon, GE)

Alluxio (a.k.a. Tachyon): in-memory distributed store

  • Used by 100+ organizations (e.g., IBM, Alibaba)
SLIDE 7

This Talk

Reflect on how

  • application trends, i.e., user needs & requirements
  • hardware trends

have impacted the design of our systems, and how we can use these lessons to design new systems

SLIDE 8

2009

SLIDE 9

2009: State-of-the-art in Big Data

Apache Hadoop

  • Large scale, flexible data processing engine
  • Batch computation (e.g., tens of minutes to hours)
  • Open Source

Getting rapid industry traction:

  • High profile users: Facebook, Twitter, Yahoo!, …
  • Distributions: Cloudera, Hortonworks
  • Many companies still in austerity mode
SLIDE 10

2009: Application Trends

Iterative computations, e.g., Machine Learning

  • More and more people aiming to get insights from data

Interactive computations, e.g., ad-hoc analytics

  • Query engines like Hive and Pig drove this trend

SLIDE 11

2009: Application Trends

Despite huge amounts of data, many working sets in big data clusters fit in memory

SLIDE 12

2009: Application Trends


*G. Ananthanarayanan, A. Ghodsi, S. Shenker, I. Stoica, "Disk-Locality in Datacenter Computing Considered Irrelevant", HotOS 2011

Percentage of jobs whose working sets fit in the given memory size*:

  Memory (GB) | Facebook (% jobs) | Microsoft (% jobs) | Yahoo! (% jobs)
  8           | 69                | 38                 | 66
  16          | 74                | 51                 | 81
  32          | 96                | 82                 | 97.5
  64          | 97                | 98                 | 99.5
  128         | 98.8              | 99.4               | 99.8
  192         | 99.5              | 100                | 100
  256         | 99.6              | 100                | 100

SLIDE 13

(repeats the table from Slide 12)

SLIDE 14


2009: Hardware Trends

Memory still riding Moore's law


[Chart: memory cost ($/GB) by year, 1955-2010; source: http://www.jcmit.com/memoryprice.htm]

SLIDE 15

2009: Hardware Trends

Memory still riding Moore's law

I/O throughput and latency stagnant

  • HDD dominating data clusters as storage of choice
  • Many deployments as low as 20MB/sec per drive

SLIDE 16

2009:

  • Applications: ad-hoc queries, ML algorithms (requirements); working sets fit in memory (enabler)
  • Hardware: memory growing with Moore's Law; I/O performance stagnant (HDDs)
  • Solution: in-memory processing, multi-stage BSP model

SLIDE 17

2009: Our Solution: Apache Spark

In-memory processing

  • Great for ad-hoc queries

Generalizes MapReduce to multi-stage computations

  • Implements the BSP model

Share data between stages via memory

  • Great for iterative computations, e.g., ML algorithms
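To make this concrete, here is a minimal PySpark sketch of the pattern (the file path, parsing, and update step are hypothetical, not from the talk): the working set is cached in cluster memory once, and each iteration of an ML-style loop runs as a new multi-stage computation over it without re-reading disk.

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")

    # load and parse once; cache() keeps the working set in cluster memory
    points = sc.textFile("hdfs:///data/points.txt") \
               .map(lambda line: [float(v) for v in line.split()]) \
               .cache()

    w = [0.0, 0.0]                      # toy 2-D model parameters
    for i in range(10):
        # each pass reuses the in-memory data instead of re-reading HDFS
        shift = points.map(lambda p: [p[0] - w[0], p[1] - w[1]]) \
                      .reduce(lambda a, b: [a[0] + b[0], a[1] + b[1]])
        w = [w[0] + 0.001 * shift[0], w[1] + 0.001 * shift[1]]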
SLIDE 18

2009: Technical Solutions

Low-overhead resilience mechanisms → Resilient Distributed Datasets (RDDs)

Efficient support for ML algorithms → powerful and flexible APIs

  • map and reduce are just two of 80+ operators
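A few of those operators beyond map and reduce, sketched with hypothetical in-memory data (sc is a SparkContext, as in the sketch above):

    users = sc.parallelize([(1, "ann"), (2, "bob")])
    clicks = sc.parallelize([(1, "page-a"), (1, "page-b"), (2, "page-c")])

    frequent = clicks.filter(lambda kv: kv[1] != "page-c")   # predicate filter
    joined = users.join(frequent)                            # relational-style join
    sampled = joined.sample(False, 0.5, 42)                  # 50% sample, no replacement
    print(sampled.collect())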
SLIDE 19

2012

SLIDE 20

2012: Application Trends

People started to assemble end-to-end (e2e) data analytics pipelines

This meant stitching together a hodgepodge of systems

  • Difficult to manage, learn, and use

Raw Data → ETL → Ad-hoc exploration → Advanced Analytics → Data Products

SLIDE 21

(repeats the 2009 summary from Slide 16, adding:)

2012:

  • Applications: build e2e big data pipelines (requirement)
  • Solution: unified platform for SQL, ML, graphs, streaming

SLIDE 22

2012: Our Solution: Unified Platform

Support a variety of workloads

Support a variety of input sources

Provide a variety of language bindings

[Diagram: Spark Core, with Python, Java, Scala, and R bindings, underlies Spark Streaming (real-time), Spark SQL (interactive), MLlib (machine learning), and GraphX (graph)]
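To see what "unified" buys, here is a hedged sketch (hypothetical file and column names, written against the Spark 2.x APIs current at the time of this talk) that moves from SQL to machine learning in one program, with no glue systems in between:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

    events = spark.read.json("events.json")          # hypothetical input source
    events.createOrReplaceTempView("events")

    # SQL for interactive exploration...
    recent = spark.sql("SELECT user, amount FROM events WHERE year = 2016")

    # ...feeding straight into MLlib for machine learning
    vecs = VectorAssembler(inputCols=["amount"], outputCol="features").transform(recent)
    model = KMeans(k=4, seed=1).fit(vecs)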

SLIDE 23

2014

SLIDE 24

2014: Application Trends

New users, new requirements

Users: from Spark early adopters, who understand MapReduce & functional APIs, to data engineers, data scientists, statisticians, R users, the PyData community, …

SLIDE 25


2014: Hardware Trends

Memory capacity still growing fast

[Chart: memory cost ($/GB) by year, 1955-2015; source: http://www.jcmit.com/memoryprice.htm]

SLIDE 26

2014: Hardware Trends

Memory capacity still growing fast

Many clusters and datacenters transitioning to SSDs

  • Orders-of-magnitude improvements in I/O throughput and latency
  • DigitalOcean: SSD-only instances since 2013

CPU performance growth slowing down

SLIDE 27

(repeats the 2009 and 2012 summaries from Slide 21, adding:)

2014:

  • Applications: new users, i.e., data scientists & analysts; improved performance (requirements)
  • Hardware: memory still growing fast; I/O performance improving; CPU stagnant
  • Solution: DataFrame API; binary, columnar storage representation; code generation

SLIDE 28

Computing the average age per department:

    # RDD API: aggregation spelled out by hand
    pdata.map(lambda x: (x.dept, [x.age, 1])) \
         .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
         .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
         .collect()

    # DataFrame API: the declarative equivalent
    data.groupBy("dept").avg("age")
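The one-line DataFrame version is not only shorter; because it is declarative, it runs through the shared query optimizer and execution engine described on the next slides, so it performs the same regardless of the language it is written in.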

SLIDE 29

DataFrame API

A DataFrame is logically equivalent to a relational table

Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew

Popularized by R and Python/pandas, the languages of choice for data scientists
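For instance, a hedged sketch of those statistical operators in PySpark 2.x, assuming a DataFrame df with a numeric "age" column:

    from pyspark.sql import functions as F

    # approximate median (0.5 quantile) with 1% relative error
    median = df.approxQuantile("age", [0.5], 0.01)

    # standard deviation and skewness as ordinary aggregates
    df.agg(F.stddev("age"), F.skewness("age")).show()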

SLIDE 30

DataFrames in Spark

Make DataFrame declarative; unify DataFrame and SQL

DataFrame and SQL share the same

  • query optimizer, and
  • execution engine

Tightly integrated with the rest of Spark

  • ML library takes DataFrames as input & output
  • Easily convert RDDs ↔ DataFrames

[Diagram: Python, Java/Scala, and R DataFrames all compile to one logical plan before execution]

Every optimization automatically applies to SQL and to Scala, Python, and R DataFrames
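A small sketch of that last conversion point (hypothetical data; spark is a SparkSession and sc its SparkContext):

    rows = sc.parallelize([("eng", 30), ("sales", 45)])

    # RDD -> DataFrame: attaching column names lets the optimizer reason about it
    df = spark.createDataFrame(rows, ["dept", "age"])

    # DataFrame -> RDD: drop back to functional processing when needed
    rdd = df.rdd.map(lambda row: (row.dept, row.age + 1))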

SLIDE 31

One Query Plan, One Execution Engine

[Chart: time for an aggregation benchmark in seconds, comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and SQL]

SLIDE 32

(repeats the benchmark chart from Slide 31)

SLIDE 33

What else does DataFrame enable?

Typical DB optimizations across operators:

  • Join reordering, predicate pushdown, etc.

Compact binary representation:

  • Columnar, compressed format for caching

Whole-stage code generation:

  • Remove expensive iterator calls
  • Fuse across multiple operators
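A hedged way to observe both effects in Spark 2.x (synthetic data; exact plan output varies by version):

    df = spark.range(10**7).selectExpr("id % 100 AS dept", "id AS age")

    df.cache()   # cached DataFrames use a compressed, columnar in-memory format

    # in Spark 2.0 plans, operators fused into a single generated function
    # are marked with '*', replacing per-row iterator calls
    df.groupBy("dept").avg("age").explain()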
SLIDE 34

[Chart: TPC-DS runtime in seconds, Spark 2.0 vs. 1.6 (lower is better)]

SLIDE 35

2016 (What's Next?)

SLIDE 36

What’s Next?

  • Application trends
  • Hardware trends
  • Challenges and techniques

SLIDE 37

Application Trends

Data is only as valuable as the decisions and actions it enables

What does that mean?

  • Faster decisions better than slower decisions
  • Decisions on fresh data better than on stale data
  • Decisions on personal data better than on aggregate data

SLIDE 38

Application Trends

Real-time decisions (decide in ms) on live data (the current state of the environment) with strong security (privacy, confidentiality, integrity)

SLIDE 39

Application           | Quality                         | Update latency | Decision latency | Security
Zero-time defense     | sophisticated, accurate, robust | sec            | sec              | privacy, integrity
Parking assistant     | sophisticated, robust           | sec            | sec              | privacy
Disease discovery     | sophisticated, accurate         | hours          | sec/min          | privacy, integrity
IoT (smart buildings) | sophisticated, robust           | min/hour       | sec              | privacy, integrity
Earthquake warning    | sophisticated, accurate, robust | min            | ms               | integrity
Chip manufacturing    | sophisticated, accurate, robust | min            | sec/min          | confidentiality, integrity
Fraud detection       | sophisticated, accurate         | min            | ms               | privacy, integrity
"Fleet" driving       | sophisticated, accurate, robust | sec            | sec              | privacy, integrity
Virtual assistants    | sophisticated, robust           | min/hour       | sec              | integrity
Video QoS at scale    | sophisticated                   | min            | ms/sec           | privacy, integrity

SLIDE 40

(table repeated from Slide 39)

[Diagram: a decision system: data → preprocess (e.g., train) → intermediate data (e.g., model) → query engine / automatic decision engine → decision]

SLIDE 41

(table repeated from Slide 39)

Addressing these challenges is the goal of Berkeley's next lab: the RISE (Real-time Intelligent Secure Execution) Lab

SLIDE 42

What’s next?

  • Application trends
  • Hardware trends
  • Challenges and techniques

SLIDE 43

Moore’s Law is Slowing Down

SLIDE 44

What Does It Mean?

CPUs affected most: just 20-30%/year performance improvements

  • More complex layouts → harder to scale
  • Gains come mostly from more cores → harder to take advantage of

Memory: still grows at 30-40%/year

  • Regular layouts, stacked technologies

Network: grows at 30-50%/year

  • 100/200/400 GbE NICs on the horizon
  • Full-bisection-bandwidth network topologies

CPU is the bottleneck, and it's getting worse!

SLIDE 45

What Does It Mean?

(repeats the CPU/memory/network breakdown from Slide 44)

Memory-to-core ratio is increasing, e.g., AWS: 7-8 GB/vcore → 17 GB/vcore (X1 instances)

SLIDE 46

Unprecedented Hardware Innovation

From CPU to specialized chips:

  • GPUs, FPGAs, ASICs/co-processors (e.g., TPU)
  • Tightly integrated, e.g., Intel’s latest Xeon integrates CPU & FPGA

New, disruptive memory technologies

  • HBM (High Bandwidth Memory), in the same package as the CPU

SLIDE 47

High Bandwidth Memory (HBM)

2 channels @ 128 bits; 8 channels = 1024 bits

http://www.amd.com/en-us/innovations/software-technologies/hbm

SLIDE 48

High Bandwidth Memory (HBM)

8 stacks = 4096 bits → 500 GB/sec

http://www.amd.com/en-us/innovations/software-technologies/hbm

SLIDE 49

Unprecedented Hardware Innovation

(repeats Slide 46's list, with the memory bullet updated:)

  • HBM2: 8 DRAM chips/package → 1 TB/sec

SLIDE 50

Unprecedented Hardware Innovation

(repeats Slide 49's list, adding:)

  • 3D XPoint

SLIDE 51

3D XPoint Technology

Developed by Intel and Micron

  • Announced last year (2015); products released this year (2016)

Characteristics:

  • Non-volatile memory
  • 2-5x DRAM latency!
  • 8-10x density of DRAM
  • 1000x more resilient than SSDs
SLIDE 52

Unprecedented Hardware Innovation

(repeats Slide 50's list)

“Renaissance of hardware design” – David Patterson

SLIDE 53

(repeats the 2009, 2012, and 2014 summaries from Slide 27, adding:)

2016:

  • Applications: real-time decisions on fresh data; strong security (requirements)
  • Hardware: memory rapidly evolving; specialized processing: GPUs, FPGAs, ASICs, SGX, 100s-of-cores CPUs
  • Solution: ?

SLIDE 54

What’s next?

  • Application trends
  • Hardware trends
  • Challenges and techniques

SLIDE 55

Complexity – Computation

[Diagram: before, software targeted the CPU alone; now it must target CPU, GPU, FPGA, and ASIC + SGX]

SLIDE 56

Complexity – Memory

2015:

  Level        | Latency   | Bandwidth  | Capacity
  L1/L2 cache  | ~1 ns     |            |
  L3 cache     | ~10 ns    |            |
  Main memory  | ~100 ns   | ~80 GB/s   | ~100 GB
  NAND SSD     | ~100 usec | ~10 GB/s   | ~1 TB
  Fast HDD     | ~10 msec  | ~100 MB/s  | ~10 TB

2020:

  Level           | Latency   | Bandwidth  | Capacity
  L1/L2 cache     | ~1 ns     |            |
  L3 cache        | ~10 ns    |            |
  HBM             | ~10 ns    | ~1 TB/s    | ~10 GB
  Main memory     | ~100 ns   | ~80 GB/s   | ~100 GB
  NVM (3D XPoint) | ~1 usec   | ~10 GB/s   | ~1 TB
  NAND SSD        | ~100 usec | ~10 GB/s   | ~10 TB
  Fast HDD        | ~10 msec  | ~100 MB/s  | ~100 TB

SLIDE 57

Complexity – More and More Choices

Amazon EC2: t2.nano, t2.micro, t2.small, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge, d2.2xlarge, d2.4xlarge, …

Google Compute Engine: n1-standard-1, n1-standard-2, n1-standard-4, n1-standard-8, n1-standard-16, n1-highmem-2, n1-highmem-4, n1-highmem-8, n1-highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32, f1-micro, g1-small, …

Microsoft Azure: Basic tier: A0, A1, A2, A3, A4; Optimized Compute: D1, D2, D3, D4, D11, D12, D13, D1v2, D2v2, D3v2, D11v2, …; Latest CPUs: G1, G2, G3, …; Network Optimized: A8, A9; Compute Intensive: A10, A11, …

SLIDE 58

Complexity – More and More Constraints

  • Latency
  • Accuracy
  • Cost
  • Security

SLIDE 59

Techniques for Conquering Complexity

  • Use additional choices to simplify!
  • Expose and control tradeoffs
  • Don't forget "tried & true" techniques

SLIDE 60

Use Choices to Simplify System Design

[Diagram: storage hierarchy over time]

  • < 2010: L1/L2 cache, L3 cache, main memory, fast HDD
  • 2011-2014: L1/L2 cache, L3 cache, main memory, NAND SSD, fast HDD
  • > 2014: L1/L2 cache, L3 cache, main memory, NAND SSD, fast HDD

SLIDE 61

Use Choices to Simplify System Design

Example: NVIDIA DGX-1 supercomputer for Deep Learning

[Diagram: Pascal P100 GPUs, each with its own HBM (720 GB/s, 16 GB), connected at 100 GB/s; HBM sits alongside main memory, NVM (3D XPoint), NAND SSD, and fast HDD in the node's hierarchy]

SLIDE 62

Use Choices to Simplify System Design

Possible datacenter architecture (e.g., FireBox, UC Berkeley)

[Diagram: many nodes, each with L1/L2 cache, L3 cache, and main memory, sharing ultra-fast persistent disaggregated storage (~10 usec / ~10 GB/s / ~1 PB) in place of per-node HBM, NVM (3D XPoint), NAND SSD, and fast HDD]

SLIDE 63

Use Choices to Simplify App Design

Maybe there is no need to optimize every algorithm for every specialized processor… if you run in the cloud, just pick the best instance type for your app!

SLIDE 64

Expose and Control Tradeoffs

Latency vs. accuracy

  • Approximate query processing (e.g., BlinkDB); sketched below
  • Ensembles and correction of ML models (e.g., Clipper)

Job completion time vs. cost

  • Predict response time for a given configuration (e.g., Ernest)

Security vs. latency vs. functionality

  • e.g., CryptDB, Opaque
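As a hedged illustration of the latency vs. accuracy knob using plain Spark sampling (not BlinkDB itself; df and its "amount" column are hypothetical):

    from pyspark.sql import functions as F

    # exact but slow: scan all rows
    exact = df.agg(F.avg("amount")).first()[0]

    # approximate but fast: aggregate over a 1% sample
    approx = df.sample(False, 0.01, 7).agg(F.avg("amount")).first()[0]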

SLIDE 65

Expose and Control Tradeoffs

Caching vs. memory

  • HBM can be configured either as a cache or as an explicitly managed memory region

Declarative vs. procedural

  • Enable users to pick specific query plans for complex declarative programs & complex environments

SLIDE 66

“Tried & True” Techniques

Sampling:

  • Scheduling (e.g., Sparrow), querying (e.g., BlinkDB), storage (e.g., KMN)

Speculation:

  • Replicate time-sensitive requests/jobs (e.g., Dolly); see the sketch after this list

Incremental updates:

  • Storage (e.g., IndexedRDDs) and ML models (e.g., Clipper)

Cost-based optimization:

  • Pick target hardware at run time
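For example, speculation is a single configuration switch away in Spark (these are real Spark settings; the multiplier value here is illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            # re-launch straggler tasks speculatively on other nodes
            .set("spark.speculation", "true")
            # a task counts as slow at 1.5x the median task runtime
            .set("spark.speculation.multiplier", "1.5"))

    sc = SparkContext(conf=conf)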

SLIDE 67

Summary

Application and hardware trends often determine the solution

We are at an inflection point in both application and hardware trends

Many research opportunities

Beware of complexity: use the myriad of choices to simplify!

SLIDE 68

Thanks!