Big Data overview, issues, challenges and opportunities C. Onime - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Big Data

  • Overview, issues, challenges and opportunities
  • C. Onime (onime@ictp.it)

slide-2
SLIDE 2

Outline

  • Interactive session

– Introduction to Big-Data
– Issues/challenges
– Taxonomy classifications

  • Conclusion

– Opportunities and future

slide-3
SLIDE 3

Pre-exercise

  • Before providing a formal definition, let’s try to answer these questions:

– What exactly is Big-Data?
– Can you identify it?

slide-4
SLIDE 4

Definition(s)

  • By definition, the term Big-Data is used for data that is “massive” in at least one of the following areas:

– Volume: quantity
– Velocity: generated at high speed
– Variety: widely spread across diverse sources and types
– Variability: constantly changing meaning
– Veracity: making data accurate (removing bad data)
– Visualization: presenting and conveying meaning
– Value: applying findings and taking action

slide-5
SLIDE 5

Big-Data examples

  • Astronomical image data from a telescope exceeds 1 TB/day
  • Environmental monitoring
  • Government: census, national health records/systems, etc.
  • Industry: Amazon, Google, eBay...

slide-6
SLIDE 6

Worldwide storage

slide-7
SLIDE 7

Another forecast

  • 0.076 ZB = 76 EB
  • 76 EB = 76,000 PB = 76M TB
  • The current estimate is that 82% of global IP traffic will be video by 2020

Clement Onime - onime@ictp.it 7
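
The unit conversions on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, where each unit is 1000 times the previous one; the helper name `convert` is purely illustrative.

```python
# Decimal (SI) storage units, each 1000 times larger than the previous one.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between SI storage units (powers of 1000)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps


# The forecast figure expressed in progressively smaller units.
print("EB:", round(convert(0.076, "ZB", "EB"), 6))
print("PB:", convert(76, "EB", "PB"))
print("TB:", convert(76, "EB", "TB"))
```

Under this assumption one exabyte is a thousand petabytes, so the "millions" figure only appears once the quantity is expressed in terabytes.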

slide-8
SLIDE 8

Preamble

  • So what is driving Big Data?

– Mainly industry-related paradigms & applications

  • Data mining, Business Intelligence, Knowledge Management and now Big Data Management

slide-9
SLIDE 9

Data Mining

  • A process of analyzing data from different perspectives and summarizing it into useful information, [...] which allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

slide-10
SLIDE 10

Business Intelligence

  • A process of finding, gathering, aggregating and analyzing information for decision-making. It makes use of a set of technologies that allow the acquisition and analysis of data to improve company decision-making and workflows.

slide-11
SLIDE 11

Knowledge Management

  • A business process that formalizes the management and use of an enterprise’s intellectual assets. KM promotes a collaborative and integrative approach to the creation, capture, organization, access and use of information assets, including the tacit, un-captured knowledge of people.
  • A systematic process of finding, selecting, organizing, distilling and presenting information in a way that improves an employee’s comprehension in a specific area of interest, helping an organization to gain insight and understanding from its own experience.

slide-12
SLIDE 12

Big Data Management


slide-13
SLIDE 13

Other drivers

  • Scientific Research

– High Performance Computing (LHC, SKA, Genomics)

  • Improvements in hardware technology

– Heading towards nano-circuits, clocking resolutions, etc.

  • Improvements in computing platforms

– Networks: always-connected devices, capacity
– Clouds: anytime, anywhere, on-demand metered access to resources

  • Every user is now a provider/consumer

– Social networking

slide-14
SLIDE 14

Issues and challenges

  • Perspectives – backgrounds, use cases
  • Taxonomies, ontologies, schemas, workflow
  • Bits – raw data formats and storage methods
  • Cycles – algorithms and analysis
  • Infrastructure (screws) to support Big Data

– From presentation by Michael Cooper & Peter Mell of NIST


slide-15
SLIDE 15

Perspectives


slide-16
SLIDE 16

Six-dimensional taxonomy

Big Data

  • Data mapping
  • Compute infrastructure
  • Storage infrastructure
  • Analytics
  • Visualisation
  • Security & privacy

slide-17
SLIDE 17

Data Mapping examples

  • Axes: batch / near-real-time / real-time versus structured / semi-structured / unstructured
  • Application domains placed on these axes:

– Large-scale science (HEP, genomics)
– Visual media (video scene detection, image understanding)
– Financial (high-speed trading)
– Retail (sentiment & behaviour analysis)
– Social networking (trend analysis, query processing)
– Sensor data (ID, long-term trends, weather)
– Network security (ID, malware/virus attacks)

slide-18
SLIDE 18

Compute infrastructure

  • Batch

– MapReduce (Hadoop)

  • Bulk synchronous parallel

– Hama, Giraph, Pregel

  • Streaming

– S4, Storm, Spark

slide-19
SLIDE 19

Overview of Hadoop MapReduce
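
The MapReduce programming model can be illustrated without a cluster. The sketch below is a single-process word count in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample documents are assumptions chosen for illustration. In Hadoop the same three stages run distributed across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input records.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, both stages can be spread over many machines, which is what makes the model suitable for batch processing of large volumes.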


slide-20
SLIDE 20

Hadoop 2.0 Ecosystem


slide-21
SLIDE 21

Storm Cluster

  • Master node (Nimbus)
  • Zookeeper framework
  • Worker nodes, each running a Supervisor and worker processes
  • Worker processes, each running executors that carry out tasks

slide-22
SLIDE 22

Storm Basics

  • Tuple

– Key-value pairs

  • Streams

– Sequence of tuples

  • Spout

– Source of streams

  • Bolt

– Processing element (filters, joins, transforms, etc.)

slide-23
SLIDE 23

Storm topology

  • Graph of Computation

– Network of spouts and bolts
– Parallel & cyclic execution

  • Groupings

– Shuffle, all, global, fields

  • Example:

– Twitter analytics: a spout feeding bolts that parse, count, rank and report

[Figure: example topology in which two spouts feed four bolts]
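
The Twitter analytics example can be mimicked in a single process with Python generators. This is only a sketch of the spout/bolt idea, not the Storm API: the sample messages, the function names and the use of dictionaries as key-value tuples are assumptions made for illustration, and a real topology would run the bolts in parallel on worker nodes.

```python
from collections import Counter

# Hypothetical stand-in for a live message feed.
SAMPLE_MESSAGES = [
    "learning #bigdata with #storm",
    "#storm topologies are graphs",
    "#bigdata needs #storm and #hadoop",
]


def message_spout(messages):
    """Spout: source of the stream, emitting one tuple per message."""
    for text in messages:
        yield {"text": text}


def parse_bolt(stream):
    """Bolt: transform each message tuple into one tuple per hashtag."""
    for tup in stream:
        for word in tup["text"].split():
            if word.startswith("#"):
                yield {"tag": word.lower()}


def count_bolt(stream):
    """Bolt: keep a running count per hashtag and emit the updated count."""
    counts = Counter()
    for tup in stream:
        counts[tup["tag"]] += 1
        yield {"tag": tup["tag"], "count": counts[tup["tag"]]}


def rank_bolt(stream, top_n=2):
    """Bolt: remember the latest count per tag and return the top_n tags."""
    latest = {}
    for tup in stream:
        latest[tup["tag"]] = tup["count"]
    ordered = sorted(latest.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Wire the topology: spout -> parse -> count -> rank, then report.
ranking = rank_bolt(count_bolt(parse_bolt(message_spout(SAMPLE_MESSAGES))))
print(ranking)
```

In Storm, a fields grouping on the hashtag would route all tuples for the same tag to the same counting task, which is what keeps the per-tag counts consistent when the count bolt is parallelised.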

slide-24
SLIDE 24

Storage infrastructure

  • Relational (SQL)

– Examples: Oracle, MySQL, PostgreSQL, etc.

  • NoSQL

– Document oriented: MongoDB, CouchDB, Couchbase
– Key-value stores, in memory: Memcached, Redis, Aerospike
– Key-value stores, Dynamo inspired: Cassandra, Riak, Voldemort
– Big-Table: HBase, Cassandra
– Graph oriented: Giraph, Neo4j, OrientDB

  • NewSQL

– In memory: H-Store, VoltDB
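
The difference between the relational, key-value and document-oriented models can be sketched with the Python standard library alone. The example below uses an in-memory SQLite database, a plain dictionary and a JSON document as stand-ins for the three families; the sensor readings and key layout are invented for illustration and none of this reflects the API of the products listed above.

```python
import json
import sqlite3

# Relational (SQL): fixed schema, queried declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("s1", "2017-01-01", 20.5),
        ("s1", "2017-01-02", 21.0),
        ("s2", "2017-01-01", 18.0),
    ],
)
avg_s1 = conn.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = ?", ("s1",)
).fetchone()[0]

# Key-value: opaque values fetched by key; any query logic lives in the client.
kv_store = {
    "s1:2017-01-01": 20.5,
    "s1:2017-01-02": 21.0,
    "s2:2017-01-01": 18.0,
}
s1_first_day = kv_store["s1:2017-01-01"]

# Document oriented: self-describing, nested records stored as JSON.
document = {
    "sensor": "s1",
    "readings": [
        {"day": "2017-01-01", "value": 20.5},
        {"day": "2017-01-02", "value": 21.0},
    ],
}
restored = json.loads(json.dumps(document))

print(avg_s1, s1_first_day, len(restored["readings"]))
```

The relational version pushes aggregation into the query, while the key-value and document versions trade that query power for simpler, more flexible records.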

slide-25
SLIDE 25

Infrastructure mapping

  • Axes: batch / near-real-time / real-time versus structured / semi-structured / unstructured
  • Technologies placed on these axes:

– SQL (MySQL, PostgreSQL, SQLite)
– NoSQL (MongoDB, CouchDB, Cassandra)
– Neo4j
– Storm, Kinesis
– Shark, Spark
– VoltDB
– Titan
– Redis
– Aerospike

slide-26
SLIDE 26

Storage complexity/size


slide-27
SLIDE 27

Analytics

Machine learning algorithms

  • Supervised

– Regression (polynomials, MARS)
– Classification (decision trees, Naïve Bayes, support vector machines)

  • Unsupervised

– Clustering (K-means, Gaussian mixtures)
– Reduction (principal component analysis)

  • Semi-supervised

– Active
– Co-training

  • Reinforcement

– Markov decision process
– Q-Learning
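
As one concrete instance of the unsupervised branch, the sketch below implements plain K-means on 2-D points in pure Python. The sample points and starting centroids are invented for illustration, and fixing the initial centroids by hand is a simplification; practical implementations usually choose them randomly or with a seeding heuristic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the grouping emerges from the distances alone, which is what distinguishes clustering from the supervised methods in the list.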

slide-28
SLIDE 28

Comparison of Data analysis paradigms

Statistics                       Machine learning
Model                            Network, graphs
Data point                       Examples/instances
Response                         Label
Parameters                       Weights
Covariate                        Feature
Fitting/estimation               Learning
Test set performance             Generalization
Regression/classification        Supervised learning
Density estimation, clustering   Unsupervised learning

slide-29
SLIDE 29

Visualisation

  • Spatial layout

– Charts/plots: line/bar charts, scatter plots
– Trees/graphs: tree maps, arc diagrams

  • Abstract or summary

– Binning: data cubes, histograms
– Clustering: hierarchical aggregation

  • Interactive or real-time

– Deep zoom: MS Pivot viewer, Tableau
– Mixed reality: AR systems/tools
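
Binning is the simplest of the summary techniques listed here: rather than drawing every data point, values are counted per interval and only the counts are displayed. The following sketch builds an equal-width histogram and prints it as text; the function name `histogram` and the sample values are illustrative assumptions.

```python
def histogram(values, low, high, bins):
    """Count how many values fall in each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical measurements summarised into five bins over [0, 10].
values = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]
counts = histogram(values, 0, 10, 5)
for i, count in enumerate(counts):
    print(f"{i * 2:>2} to {(i + 1) * 2:<2} {'#' * count}")
```

The size of the output depends only on the number of bins, not on the number of input values, which is why binning scales to very large data sets.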

slide-30
SLIDE 30

Mixed Reality Environments

$FNS = \int (S + W)$ where

slide-31
SLIDE 31

VR and AR

Virtual Reality (VR) CAVE

  • Computer generated virtual environment
  • Creates a completely virtual environment without real objects
  • Portable

– Headsets, wearable devices
– Custom and typically not cost effective

Augmented Reality (AR)

  • Real-time integration of computer generated information into a 3D world
  • Blends into the real world and supports real objects
  • Mobile

– Commodity devices: smartphones and tablets
– Cost effective

slide-32
SLIDE 32

Some Examples

[Figures: VR environments; AR environments]

slide-33
SLIDE 33

AR Cubicle

AR immersive cubicle

User coverage: 180° horizontal by 3 markers on the walls and 90° vertical by a marker on the floor

slide-34
SLIDE 34

Security and privacy

  • Infrastructure

– Secure computations
– Best practices

  • Data privacy

– Privacy preservation
– Cryptography
– Access control

  • Data management

– Secure storage
– Transaction logs and audits
– Provenance

  • Integrity and reactive security

– End-point security
– Real-time monitoring

slide-35
SLIDE 35

Public Key Cryptography

  • Asymmetric cryptography

– A pair of keys: one public and the other private
– Useful for authentication and encryption
– Depends mainly on the impracticability of computing the private key from its public component
– The public key may be freely exchanged without secure channels, for example via public key servers
– Computationally intensive mathematical algorithms
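
The key-pair idea can be made concrete with a toy RSA example. RSA is used here only as one well-known asymmetric scheme; the slides do not name a specific algorithm. The primes below are deliberately tiny so the arithmetic is easy to follow, which also makes the key trivially breakable; the helper name `apply_key` is illustrative. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
# Toy RSA key pair built from tiny primes: for illustration only.
p, q = 61, 53
n = p * q                   # modulus shared by both keys
phi = (p - 1) * (q - 1)     # Euler's totient of n
e = 17                      # public exponent, chosen coprime with phi
d = pow(e, -1, phi)         # private exponent: modular inverse of e

public_key = (e, n)
private_key = (d, n)


def apply_key(number, key):
    """Raise a number to the key's exponent modulo the key's modulus."""
    exponent, modulus = key
    return pow(number, exponent, modulus)


message = 65

# Encryption: anyone can encrypt with the public key,
# only the private key holder can decrypt.
ciphertext = apply_key(message, public_key)
recovered = apply_key(ciphertext, private_key)

# Authentication: only the private key holder can sign,
# anyone can verify with the public key.
signature = apply_key(message, private_key)
verified = apply_key(signature, public_key) == message

print(recovered, verified)
```

Recovering `d` from `(e, n)` requires factoring `n`, which is easy for 3233 but impracticable for the key sizes used in practice; that asymmetry is what the third point above refers to.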

slide-36
SLIDE 36

Digital Certificates

  • Similar to a travel passport

– Provides forgery-resistant identifying information

  • Name of holder
  • Serial number
  • Expiration date
  • Copy of holder’s public key (used for encryption)
  • Digital signature of issuing authority (CA)
slide-37
SLIDE 37

SSL Transport

  • Client and server each hold a set of trusted certificates
  • Handshake sequence:

– Client hello
– Reply + certificate
– Key exchange + certificate
– Client OK
– Server OK
– Encrypted messages
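
On the client side, the handshake above is normally delegated to a library. The sketch below uses Python's standard `ssl` module, which implements TLS, the successor to SSL. The function names and default port are assumptions, and the connection helper is defined but not called here because it needs network access and a host name of the reader's choosing.

```python
import socket
import ssl


def make_client_context():
    """Client-side context that trusts the system's CA certificates."""
    # The default context requires a valid server certificate and
    # checks that it matches the host name being contacted.
    return ssl.create_default_context()


def fetch_server_certificate(host, port=443, timeout=5.0):
    """Perform the handshake and return the protocol version and certificate."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # wrap_socket runs the hello, certificate and key exchange steps.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.getpeercert()


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Calling `fetch_server_certificate` with a reachable host name would return the negotiated protocol version together with the certificate fields (subject, issuer, validity dates) that the server presented during the handshake.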

slide-38
SLIDE 38

Data colouring


slide-39
SLIDE 39

Conclusion

  • The potential of Big Data is now all around us in our everyday lives. Every device will be connected and constantly generating data.
  • Good mapping of big data is fundamental to understanding/selecting infrastructure (compute & storage), analytics, visualization and protection (security and privacy).
  • New frontiers such as “Data Science” are bringing many of the ideas/techniques from Big-Data Analytics to almost any field or discipline.

slide-40
SLIDE 40

2017 opportunities @ ICTP

  • Workshop on Open Source Solutions for the Internet of Things: June 28 to July 7, 2017
  • The CODATA RDA Advanced School of Research Data Science for Extreme sources of Data, Bioinformatics and IoT/Big-Data Analytics: July 3 to 28, 2017
  • Two other CODATA/RDA schools on Data Science

– South Africa and Brazil, possibly an HPC school in Mexico

  • Master’s degree in HPC, Trieste, Italy
  • Graduate studies @ East African Institute for Fundamental Research (EAIFR), Kigali, Rwanda

slide-41
SLIDE 41

References

  • Michael Cooper & Peter Mell, “Tackling Big Data”, NIST Information Technology Laboratory, 2010
  • Big Data Working Group, “Big Data Taxonomy”, Cloud Security Alliance, 2014
  • M. Bornschlegl et al., “IVIS4BigData: A Reference Model for Advanced Visual Interfaces Supporting Big Data Analysis in Virtual Research Environments”, 2016
  • S. Rajendran, “Apache Storm: A scalable distributed & fault tolerant real time computation system”, 2015

slide-42
SLIDE 42

That’s all folks!!

questions
