Ibis Data Serialization in Apache Spark
By Dadepo Aderemi and Mathijs Visser
Supervisors:
- dr. Jason Maassen (eScience Center)
- Adam Belloum (UvA)
We live in a big data world
- Increase in data generation: IoT, mobile devices, social media, logs from large-scale software, etc.
software tools.
Image source: https://towardsdatascience.com/what-is-big-data-lets-answer-this-question-933b94709caf
business but in Science
Modeling, Medical and Pharmaceutical research etc.
2008 of Nature magazine talked about the challenges of dealing with big data.
cannot be managed speedily using traditional approaches.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Kubernetes, standalone, or in the cloud.
“...the mechanism for converting (graphs of) data (Java
(e.g., a stream of bytes, or XML)...”
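As an illustration of the definition quoted above, the following is a minimal sketch of converting a small Java object graph into a stream of bytes and back, using the JDK's built-in `java.io` serialization. It is not the Ibis or Kryo mechanism discussed in these slides; the class `SerializationSketch` and the `Node` type are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SerializationSketch {

    // Hypothetical example type: a node that references child nodes,
    // so serializing the root has to walk a (small) object graph.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Convert the object graph reachable from 'root' into a byte array.
    static byte[] serialize(Serializable root) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
            out.flush(); // push buffered data into 'bytes' before copying it
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the object graph from the bytes produced by serialize().
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        root.children.add(new Node("leaf"));

        byte[] data = serialize(root);
        Node copy = (Node) deserialize(data);

        // The copy is a distinct object with the same structure.
        System.out.println(copy.name + " -> " + copy.children.get(0).name);
        // prints "root -> leaf"
    }
}
```

The round trip yields a new, structurally equal object graph, which is the property any alternative serializer would also need to preserve.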
serialization techniques? Sub questions:
Ibis serialization?
computing. [1]
[1] https://www.cs.vu.nl/ibis/
Direct Memory Access (RDMA)
using Kryo serialization.
[1] “High-performance design of Apache Spark with RDMA and its benefits on various workloads”. In: 2016 IEEE International Conference on Big Data (BigData). IEEE. 2016, pp. 253–262
[2] “Accelerating Spark with RDMA for big data processing: Early experiences”. In: 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects. IEEE. 2014, pp. 9–16
Source: https://spark.apache.org/docs/latest/cluster-overview.html
- RDD (Resilient Distributed Dataset)
- DataFrames
- Datasets
Source: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
[1] https://github.com/EsotericSoftware/kryo
23
24
25
26
27
28
[Diagram labels: HDFS, YARN, Spark; Worker Node 1, Worker Node 2]
serialization techniques?
extensively use serialization
benchmarks