Sunrise or Sunset: Exploring the Design Space of Big Data Software Stacks
HPBDC 2017: 3rd IEEE International Workshop on High-Performance Big Data Computing
May 29, 2017
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/
Spidal.org

Panel Tasking

- The panel will discuss the following three issues:
– Are big data software stacks mature or not?
– If yes, what is the new technology challenge? If not, what are the main driving forces for new-generation big data software stacks?
– What chances are provided for the academic community in exploring the design space of big data software stacks?
Are big data software stacks mature or not?

- The solutions are numerous and powerful.
– One can get good (and bad) performance in an understandable, reproducible fashion.
– They surely need better documentation and packaging.
- The problems (and users) are definitely not mature, i.e. not understood, with key issues not agreed on.
– Many academic fields are just starting to use big data, and some are still restricted to small data.
– e.g. Deep Learning is not understood in many cases outside the well-publicized commercial cases (voice, translation, images).
- In many areas, applications are pleasingly parallel or involve MapReduce; the performance issues are different from HPC's.
– A common pattern is lots of independent Python or R jobs.
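The "lots of independent Python jobs" pattern needs no inter-task communication at all, which is why its performance story differs from HPC. A minimal sketch using only the standard library (the `analyze` function and its inputs are hypothetical placeholders for a real per-item job):

```python
from concurrent.futures import ProcessPoolExecutor

def analyze(x):
    """Stand-in for an independent per-item analysis job."""
    return x * x

if __name__ == "__main__":
    inputs = range(8)
    # Pleasingly parallel: no shared state, no communication between tasks,
    # so throughput scales with the number of workers, not the interconnect.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(analyze, inputs))
    print(results)
```

Because the jobs never talk to each other, the same pattern runs unchanged on a laptop, a batch queue of single-node jobs, or a cloud of serverless functions.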
Why use Spark, Hadoop, or Flink rather than HPC?

- Yes, if you value ease of programming over performance.
– This is the case for most companies, which can find people who program in Spark/Hadoop much more easily than people who program in MPI.
– Most of the complications, including data and communications, are abstracted away to hide the parallelism, so an average programmer can use Spark/Flink easily and doesn't need to manage state, deal with file systems, etc.
– RDD data support is very helpful.
- Yes, for large data problems involving heterogeneous data sources, such as HDFS with unstructured data, databases such as HBase, etc.
- Yes, if one needs fault tolerance for one's programs.
– Our 13-node Moe "big data" (Hadoop Twitter analysis) cluster at IU faces such problems around once per month. One can always restart the job, but automatic fault tolerance is convenient.
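The abstraction described above can be illustrated without a cluster: a word count written as explicit map, shuffle (group by key), and reduce phases, which is roughly what Spark's RDD operations (flatMap, reduceByKey) hide behind one chained expression. A toy, single-process sketch, not a real Spark program:

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Emit (word, 1) pairs, as a mapper task would.
    return [(w, 1) for line in lines for w in line.split()]

def shuffle_phase(pairs):
    # Group values by key, as the framework's shuffle would.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum each key's values, as reduceByKey would.
    return {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}

lines = ["big data big stacks", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'stacks': 1}
```

The average programmer writes only the per-record functions; partitioning, shuffling, and fault tolerance are the framework's problem, which is exactly the trade of ease of programming against performance.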
Why use HPC and not Spark, Flink, or Hadoop?

- The performance of Spark, Flink, and Hadoop on classic parallel data analytics is poor/dreadful, whereas HPC (MPI) performance is good.
- One way to understand this is to note that most Apache systems deliberately support a dataflow programming model.
– e.g. for Reduce, Apache systems will launch a bunch of tasks and eventually bring the results back, whereas MPI runs a clever AllReduce as an interleaved "in-place" tree.
- Maybe one can preserve the Spark/Flink programming model but change the implementation "under the hood" where optimization is important.
- Note that explicit dataflow is efficient and preferred at coarse grain size, as used in workflow systems.
– The implementation needs to change for different problems.
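The cost difference between the two Reduce styles can be sketched without MPI or Spark: a dataflow reduce funnels every partial through a driver and sends the answer back out, 2(p−1) messages all touching one node, while a recursive-doubling allreduce finishes in log2(p) pairwise rounds with every rank ending up with the result in place. A toy simulation of the message pattern (the rank values are illustrative; this is not a real MPI or Spark run):

```python
def dataflow_reduce(values):
    # Dataflow style: a driver gathers every partial, reduces, and sends
    # the answer back out. 2*(p-1) messages all funnel through one node.
    p = len(values)
    total = sum(values)                  # reduction at the driver
    driver_messages = 2 * (p - 1)
    return [total] * p, driver_messages

def tree_allreduce(values):
    # Recursive doubling: each round, rank r swaps partials with rank
    # r XOR step; all ranks hold the full sum after log2(p) rounds
    # (p is assumed to be a power of two in this sketch).
    p = len(values)
    vals = list(values)
    rounds = 0
    step = 1
    while step < p:
        vals = [vals[r] + vals[r ^ step] for r in range(p)]
        rounds += 1
        step *= 2
    return vals, rounds

partials = [1, 2, 3, 4, 5, 6, 7, 8]      # one partial sum per "rank"
print(dataflow_reduce(partials))          # every rank gets 36; 14 driver messages
print(tree_allreduce(partials))           # every rank gets 36; only 3 rounds
```

The point is the critical path: the driver serializes 2(p−1) messages, while the tree needs only log2(p) concurrent rounds and no central hot spot, which is why MPI's in-place collectives win on classic parallel analytics.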
HPC Runtime versus ABDS distributed Computing Model on Data Analytics
Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does an in-place combined reduce/broadcast and is fastest.
Multidimensional Scaling (MDS) Results with Flink, Spark and MPI

[Left: MDS execution time on 16 nodes, 20 processes per node, with a varying number of points. Right: MDS execution time with 32,000 points on a varying number of nodes, each node running 20 parallel tasks.]

MDS performed poorly on Flink due to its lack of support for nested iterations. In Flink and Spark the algorithm doesn't scale with the number of nodes.
Twitter Heron Streaming Software using HPC Hardware

[Benchmark panels, small messages versus large messages: parallelism of 2 on 8 nodes of an Intel Haswell cluster (2.4 GHz processors, 56 Gbps InfiniBand, 1 Gbps Ethernet) and parallelism of 2 on 4 nodes of an Intel KNL cluster (1.4 GHz processors, 100 Gbps Omni-Path, 1 Gbps Ethernet).]
Knights Landing (KNL) Data Analytics: Harp, Spark, NOMAD. Single-node and cluster performance on 1.4 GHz 68-core nodes.

Strong scaling, single node, core parallelism (time per iteration in seconds vs. number of threads):

Threads:           8      16     32     64     128    256
Harp-DAAL-Kmeans:  163    84     44     27     42     36
Spark-Kmeans:      4,382  2,027  1,259  555    396    307
Harp-DAAL-SGD:     105    57     32     21     19     20
NOMAD-SGD:         291    149    80     55     59     67
Harp-DAAL-ALS:     89     44     45     42     44     43
Spark-ALS:         3,341  1,766  1,635  1,389  1,380  1,371

Strong scaling, multi-node, Omni-Path interconnect (time per iteration in seconds, with speedup):

Kmeans (10/20/30 nodes): Harp-DAAL 168, 85, 67 (speedup 10, 19.8, 25.1); Spark 2,635, 1,473, 987 (speedup 10, 17.9, 26.7)
SGD (10/20/30 nodes): Harp-DAAL 36, 14, 10 (speedup 10, 25.7, 36); NOMAD 75, 38, 27 (speedup 10, 19.9, 27.9)
ALS (2/4/6 nodes): Harp-DAAL 33.2, 38.3, 51.2 (speedup 1.4, 1.2, 0.9); Spark 1,338, 1,302, 1,409 (speedup 1.8, 1.9, 1.7)
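The per-iteration times above come from the standard k-means loop: an assignment step and a centroid-update step, where the update is precisely the quantity a parallel implementation must combine across workers (an allreduce in Harp/MPI, a shuffle in Spark). A minimal single-process NumPy sketch of one iteration (the data and initial centroids are illustrative):

```python
import numpy as np

def kmeans_iteration(points, centroids):
    """One k-means step: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    # Assignment step: pairwise distances, nearest centroid per point.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: per-cluster means. The per-cluster sums/counts are the
    # partials a parallel run combines (allreduce in MPI, shuffle in Spark).
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, centroids = kmeans_iteration(points, centroids)
print(labels)     # [0 0 1 1]
print(centroids)  # [[ 0.   0.5] [10.  10.5]]
```

Since the combine step runs once per iteration over all clusters, the cost of that collective is paid every iteration, which is why the runtime's reduction strategy dominates the tables above.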
Components of the Big Data Stack

- Google likes to show a timeline:
- 2002 Google File System (GFS) ~HDFS
- 2004 MapReduce ~Apache Hadoop
- 2006 BigTable ~Apache HBase
- 2008 Dremel ~Apache Drill
- 2009 Pregel ~Apache Giraph
- 2010 FlumeJava ~Apache Crunch
- 2010 Colossus, a better GFS
- 2012 Spanner, a horizontally scalable NewSQL database ~CockroachDB
- 2013 F1, a horizontally scalable SQL database
- 2013 MillWheel ~Apache Storm, Twitter Heron
- 2015 Cloud Dataflow ~Apache Beam with a Spark or Flink (dataflow) engine
- Functionalities not identified: security, data transfer, scheduling, serverless computing
What is the new technology challenge?

- Integrate systems that offer full capabilities:
– Scheduling
– Storage
– "Database"
– Programming model (dataflow and/or "in-place" control flow) and the corresponding runtime
– Analytics
– Workflow
– Function as a Service and event-based programming
- For both batch and streaming
- Distributed and centralized (grid versus cluster)
- Pleasingly parallel (local machine learning) and global machine learning (large-scale parallel codes)
What are the main driving forces for new-generation big data software stacks?

- Applications ought to drive new-generation big data software stacks, but (at many universities) academic applications lag commercial use of big data, and needs are quite modest.
– This will change, and we can expect big data software stacks to become more important.
- Note that university compute systems have historically offered HPC, not big data, expertise.
– We could anticipate users moving to public clouds (away from university systems), but users will still want support.
- We need a requirements analysis that builds in the application changes that might occur as users get more sophisticated.
- We need to help (train) users to explore big data opportunities.
What chances are provided for the academic community in exploring the design space of big data software stacks?

- We need more sophisticated applications to probe some of the most interesting areas.
- But most near-term opportunities are in pleasingly parallel (often streaming data) areas.
– Popular technology like DASK (http://dask.pydata.org) for parallel NumPy is pretty inefficient.
- Clarify when to use dataflow (and grid technology) and when to use HPC parallel computing (MPI).
– This is likely to require careful consideration of grain size and data distribution.
– It is certain to involve multiple mechanisms (hidden from the user) if one wants the highest performance (combining HPC and the Apache Big Data Stack).
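The grain-size point can be made concrete: systems like DASK split an array into chunks, run one task per chunk to produce a partial result, and then combine the partials, so the chunk size sets the ratio of scheduling overhead to useful work. A minimal NumPy sketch of that blocked pattern (the chunking here is hand-rolled for illustration and is not DASK's API):

```python
import numpy as np

def blocked_mean(x, chunk):
    """Mean of a 1-D array computed from per-chunk (sum, count) partials,
    the way a chunked-array system schedules one task per chunk."""
    partials = []
    for start in range(0, len(x), chunk):
        block = x[start:start + chunk]       # one "task" per chunk
        partials.append((block.sum(), len(block)))
    total = sum(s for s, _ in partials)      # combine step
    count = sum(n for _, n in partials)
    return total / count

x = np.arange(1_000_000, dtype=np.float64)
# Coarse grain: few tasks, low scheduling overhead. Fine grain: many tasks,
# and per-task overhead starts to dominate useful work -- the inefficiency
# the slide notes for naive parallel NumPy.
print(blocked_mean(x, chunk=100_000))
```

Dataflow systems are comfortable when each chunk carries substantial work (coarse grain, as in workflow systems); when the grain shrinks toward individual array operations, the per-task overhead argues for HPC-style in-place execution instead.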