A State-of-the-art Review on Big Data Technologies

SLIDE 1

A State-of-the-art Review on Big Data Technologies

Semantic technologies for Big Data: Volume, Velocity, Variety and Veracity @ ICIST 2019

Marko Jelić, BSc EE & CS
Junior researcher, The Mihajlo Pupin Institute

Dea Pujić, BSc EE & CS
Junior researcher, The Mihajlo Pupin Institute

Hajira Jabeen, PhD
Senior researcher, Computer Science Institute, University of Bonn

SLIDE 2

Acknowledgment/Context

  • LAMBDA¹ (Learning, Applying, Multiplying, Big Data Analytics) is a twinning² H2020 project
  • The main goal of the project is to provide different knowledge transfer instruments (mentorships, brainstorming sessions, school-type activities) and different types of twinning relationships (institution to institution, institution to network)
  • The specific focus of the knowledge transfer process is placed on the Big data domain and the corresponding technologies and services

¹ https://project-lambda.org/
² https://ec.europa.eu/neighbourhood-enlargement/tenders/twinning_en

SLIDE 3

What is Big data?

  • Big data is used more as a buzzword than as a precisely defined scientific object or phenomenon
  • The term is generally used when referring to data loads that modern-day IT infrastructure cannot cope with at all or cannot handle in an efficient manner
  • More precisely, Big data usually refers to data sets that are sized in the order of magnitude of exabytes (10^18 B) or greater
  • The introduction of US social security in 1937 is considered by some as the start of the Big data era, but the term has gained most of its popularity only recently, following the development of data-heavy applications
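To put the exabyte scale mentioned above in perspective, the following back-of-the-envelope sketch computes how many drives would hold one exabyte and how long a single sequential read would take. The 1 TB drive capacity and the 1 GB/s read speed are illustrative assumptions, not figures from the presentation.

```python
# Back-of-the-envelope arithmetic for the exabyte scale (10^18 bytes).
# DRIVE_CAPACITY and READ_SPEED are assumed values chosen for illustration.

EXABYTE = 10**18                     # bytes, as stated on the slide
DRIVE_CAPACITY = 10**12              # assumption: one 1 TB drive
READ_SPEED = 10**9                   # assumption: 1 GB/s, in bytes per second
SECONDS_PER_YEAR = 365 * 24 * 3600


def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Number of drives required to hold total_bytes, rounded up."""
    return -(-total_bytes // drive_bytes)  # ceiling division on integers


def years_to_read(total_bytes: int, bytes_per_second: int) -> float:
    """Years needed to read total_bytes sequentially at the given rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(drives_needed(EXABYTE, DRIVE_CAPACITY))
    print(round(years_to_read(EXABYTE, READ_SPEED), 1))
```

Under these assumptions, one exabyte fills a million drives and a single reader would need roughly three decades to scan it, which is why such workloads are distributed across many machines.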


* Illustrations by https://www.freepik.com/macrovector

SLIDE 4

Nature of Big data

Big data is often characterized through the so-called V’s of Big data, which capture its complex nature:

  • Volume – amount of data that has to be captured, stored, processed and displayed
  • Velocity – the rate at which the data is being generated or analyzed
  • Variety – differences in data structure (format) or differences in the data sources themselves
  • Veracity – truthfulness (uncertainty) of data
  • Validity – suitability of the selected dataset for a given application
  • Volatility – temporal validity and fluency of the data
  • Value – (useful) information extracted from the data
  • Visualization – properly displaying and showcasing information
  • Vulnerability – security and privacy concerns associated with the data
  • Variability – the changing meaning of data

Groupings marked on the slide: 3V’s, 5V’s, 7V’s, 10V’s

SLIDE 5

Big data challenges

Table columns: Storing, Processing, Analytics, Visualization

Table rows with their marks:

  • Heterogeneity: + +
  • Uncertainty of data: + +
  • Scalability: + + +
  • Timeliness: + + +
  • Fault tolerance: + +
  • Data security: + +
  • Visualization: +

The core technological challenges of working with Big data that stem from its complex nature are:

  • Heterogeneity – differences in structure
  • Uncertainty – data reliability
  • Scalability – sizing the workflow and infrastructure
  • Timeliness – real-time requirements
  • Fault tolerance – sensitivity to errors
  • Data security – privacy issues, data leaks
  • Visualization – displaying of information

SLIDE 6

Big data Storage

NoSQL (not only SQL) databases

  • Key-value stores: Hazelcast, Redis, Membase/Couchbase, Riak, Voldemort, Infinispan
  • Wide-column: Apache HBase, Hypertable, Apache Cassandra
  • Document oriented: MongoDB, Apache CouchDB, Terrastore, RavenDB
  • Graph oriented: Neo4j, InfiniteGraph, InfoGrid, HyperGraphDB, AllegroGraph, Bigdata

Knowledge graphs

* Illustrations by https://aws.amazon.com/neptune/ and https://lod-cloud.net/versions/2014-08-30/lod-cloud.svg
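The four NoSQL families listed on this slide differ mainly in how a record is shaped and looked up. The sketch below expresses one made-up sensor reading in each model using only built-in Python structures. All keys, field names and values are hypothetical, and no real database client or API is involved.

```python
# One hypothetical sensor reading expressed in four NoSQL data models.
# Everything here is illustrative; no database product is being emulated.

# Key-value: an opaque value fetched by a single key.
key_value_store = {"reading:42": '{"sensor": "s1", "temp_c": 21.5}'}

# Wide-column: row key -> column family -> column -> value.
wide_column_store = {
    "s1#2019-03-01T12:00": {
        "meta": {"sensor": "s1"},
        "data": {"temp_c": 21.5},
    }
}

# Document: a self-describing, possibly nested record.
document_store = [
    {"_id": 42, "sensor": {"id": "s1", "room": "lab"}, "temp_c": 21.5}
]

# Graph: labelled edges stored as (subject, predicate, object) triples.
graph_edges = [
    ("reading:42", "measured_by", "sensor:s1"),
    ("sensor:s1", "located_in", "room:lab"),
]


def neighbours(edges, node):
    """Return (predicate, object) pairs for edges leaving the given node."""
    return [(p, o) for s, p, o in edges if s == node]


def room_of_reading(edges, reading):
    """Follow measured_by, then located_in: a two-hop graph traversal."""
    for first_predicate, sensor in neighbours(edges, reading):
        if first_predicate == "measured_by":
            for second_predicate, room in neighbours(edges, sensor):
                if second_predicate == "located_in":
                    return room
    return None
```

The graph form makes relationship queries such as `room_of_reading(graph_edges, "reading:42")` a matter of following edges, whereas the key-value form can only return the stored value for a known key.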

SLIDE 7

Big data analytics

Processing the data and applying inference (i.e. through machine learning) on Big data is key for proper knowledge (value) extraction

Algorithms (table columns): linear regression, logistic regression, SVM, naive Bayes, discriminant analysis, survival regression, isotonic regression, decision trees, random forest, gradient boosting tree, isolation forest, bagging, CART, C4.5, generalized linear model, ensembles, XGBoost, NN, kNN, drift classifier, model-fitting

Tools (table rows) with their support marks:

  • Apache Spark: + + + + + + + + + +
  • H2O: + + + + + + + +
  • R: + + + + + + + + + + +
  • MOA: + + + +
  • Scikit-learn: + + + + + + + + + + + + + + +
  • BigML: + + + + + + +
  • Weka: + + + + + +

Systematization of regression and classification learning algorithms in Big data tools
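As a small illustration of the first algorithm named in the table, linear regression, the sketch below fits a line to one-dimensional data by ordinary least squares in plain Python. The function names `fit_line` and `predict` are hypothetical and do not correspond to the API of any tool listed on the slide, which provide their own distributed or optimized implementations.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means the slope is undefined.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    # Sum of cross-deviations of x and y.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate a fitted (slope, intercept) model at x."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(model, predict(model, 4))
```

The closed-form solution works here because there is a single input feature; the tools in the table generalize the same idea to many features and to data that does not fit on one machine.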

SLIDE 8

Big data analytics

If the data is not already labeled, i.e. separated into appropriate classes, clustering algorithms need to be applied first in order to determine adequate class limits

Systematization of clustering learning algorithms in Big data tools

Algorithms (table columns): K-means, G-means, Gaussian mixture, PIC, LDA, aggregator, PAM, CLARA, Fuzzy clustering, Model-based, Hierarchical, Density based, Affinity propagation

Tools (table rows) with their support marks:

  • Apache Spark: + + + + +
  • H2O: + +
  • R: + + + + + + +
  • Giraph: + +
  • BigML: + + +
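To make the idea of clustering unlabeled data concrete, here is a minimal sketch of K-means, the first algorithm in the list above, written as Lloyd's algorithm on 2-D points in plain Python. The function name `kmeans`, the caller-supplied initial centroids and the sample points are assumptions made for illustration; the tools on the slide use their own initialization schemes and distributed implementations.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids.

    Returns (labels, centroids), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

The resulting labels can then serve as the class assignments that the supervised algorithms on the previous slide require.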

SLIDE 9

Big data visualization

JavaScript libraries (open source): Chart.js, Leaflet, Chartist.js, n3-charts, Sigma JS, Polymaps, Processing.js, Dygraphs

  • Timelines: Timeline JS
  • Chart tools: Fusion Charts, Chart.js, Chartist.js, n3-charts, Canvas
  • Map tools: Leaflet, Polymaps
  • Images: Processing.js
  • Graphs and networks: Sigma JS
  • Multi-purpose: D3.js, Ember-charts, Google charts

Non-web: Cuttlefish, Cytoscape, Gephi, Graphviz, Graph-tool

Cross-platform: NodeXL, Pajek, SocNetV, Sentinel Visualizer, Statnet, Tulip, Visone

Commercial (desktop): Tableau, Infogram

SLIDE 10

Questions?

Thank you for your attention!

Look for the full paper “A State-of-the-art Review on Big Data Technologies” in the ICIST 2019 proceedings after April 15th!
