SLIDE 1

Big Data Learning in Practice

12th September 2016

Isaac Triguero

School of Computer Science, University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html
SLIDE 2

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 3

What is Big Data?

There is no standard definition!

“Big Data” involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.

Data-intensive applications
SLIDE 4

What is Big Data? The 5V's definition
SLIDE 5

Big data has many faces

SLIDE 6

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 7
How to deal with data-intensive applications?

• Problem statement: scalability to big data sets.
• Example:
  – Exploring 100 TB with 1 node @ 50 MB/sec = 23 days
  – Exploration with a cluster of 1000 nodes = 33 minutes
• What happens if we have to manage 1000 or 10000 TB?
• Solution → divide and conquer
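The arithmetic behind these figures is worth a quick sanity check (assuming, as the slide does, a 50 MB/s sequential scan rate per node and perfect parallelism across 1000 nodes):

```python
volume = 100e12   # 100 TB, in bytes
rate = 50e6       # 50 MB/s scan rate for a single node

# One node: total scan time in seconds, converted to days.
one_node_days = volume / rate / 86400
# 1000 nodes scanning in parallel: seconds, converted to minutes.
cluster_minutes = volume / (rate * 1000) / 60

print(round(one_node_days, 1))    # ~23.1 days
print(round(cluster_minutes, 1))  # ~33.3 minutes
```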

SLIDE 8

MapReduce

• Parallel programming model
• Divide & conquer strategy:
  – divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
  – conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
• Based on simplicity and transparency for the programmer, and assumes data locality.
• Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
SLIDE 9

Traditional HPC way of doing things

[Figure: lots of worker nodes, each with its own OS, connected by a fast communication network (Infiniband) and a separate I/O network to central storage; the input data is relatively small, with lots of computation and lots of communication, but limited I/O.]

Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis
SLIDE 10

Data-intensive jobs

[Figure: low compute intensity and limited communication between nodes; lots of input data must be pulled from central storage over the I/O network to every node, so the approach is dominated by I/O and doesn't scale.]
SLIDE 11

Data-intensive jobs

[Figure: low compute intensity and limited communication; the input data chunks are spread across the local disks of the nodes.]

Solution: store data on the local disks of the nodes that perform computations on that data (“data locality”)
SLIDE 12

Hadoop

http://hadoop.apache.org/

• Hadoop is:
  – An open-source framework written in Java
  – Distributed storage of very large data sets (Big Data)
  – Distributed processing of very large data sets
• The framework consists of a number of modules:
  – Hadoop Common
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN – resource manager
  – Hadoop MapReduce – programming model
SLIDE 13
Hadoop MapReduce: Main Characteristics

• Automatic parallelization:
  – Depending on the size of the input data → there will be multiple MAP tasks!
  – Depending on the number of <key, value> pairs → there will be multiple REDUCE tasks!
• Scalability:
  – It may work on any data center or cluster of computers.
• Transparent for the programmer:
  – Fault-tolerance mechanism.
  – Automatic communication among computers.
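For instance, the number of map tasks follows directly from the input size: Hadoop creates roughly one map task per input split, which by default equals one HDFS block (128 MB in Hadoop 2.x; the 10 GB input below is a hypothetical example, not a figure from the slides):

```python
import math

block_size = 128 * 1024**2   # default HDFS block size in Hadoop 2.x (configurable)
input_size = 10 * 1024**3    # hypothetical 10 GB input file

# One map task per input split (= one HDFS block by default).
map_tasks = math.ceil(input_size / block_size)
print(map_tasks)  # 80 map tasks
```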

SLIDE 14

Data Sharing in Hadoop MapReduce

[Figure: an iterative job (iter. 1, iter. 2, …) writes its result to HDFS and reads it back between every step; likewise, each of several queries (query 1, query 2, query 3) over the same input repeats an HDFS read to produce its result.]

Slow due to replication, serialization, and disk IO
SLIDE 15

Paradigms that do not fit with Hadoop MapReduce

• Directed Acyclic Graph (DAG) model:
  – The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
• Graph model:
  – More complex graph models that better represent the dataflow of the application.
  – Cyclic models → iterativity.
• Iterative MapReduce model:
  – An extended programming model that supports iterative MapReduce computations efficiently.
SLIDE 16

New platforms to overcome Hadoop's limitations

• GIRAPH (Apache project) – iterative graph processing – http://giraph.apache.org/
• GPS – A Graph Processing System (Stanford) – http://infolab.stanford.edu/gps/ – Amazon's EC2
• Distributed GraphLab (Carnegie Mellon Univ.) – https://github.com/graphlab-code/graphlab – Amazon's EC2
• HaLoop (University of Washington) – http://clue.cs.washington.edu/node/14, http://code.google.com/p/haloop/ – Amazon's EC2
• Twister (Indiana University) – http://www.iterativemapreduce.org/ – private clusters
• PrIter (University of Massachusetts Amherst, Northeastern University – China) – http://code.google.com/p/priter/ – private cluster and Amazon EC2 cloud
• GPU-based platforms: Mars, Grex
• Spark (UC Berkeley) – http://spark.incubator.apache.org/research.html
SLIDE 17

Big data technologies

SLIDE 18

What is Spark?

Fast and expressive cluster computing engine, compatible with Apache Hadoop

Efficient:
• General execution graphs
• In-memory storage

Usable:
• Rich APIs in Java, Scala, Python
• Interactive shell

2-5× less code
Up to 10× faster on disk, 100× in memory
SLIDE 19

Spark Goal

• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
  – Fault tolerance (for crashes & stragglers)
  – Data locality
  – Scalability

Initial solution: augment the data flow model with “resilient distributed datasets” (RDDs)
SLIDE 20

RDDs in Detail

• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs:
  – Parallelizing an existing collection in your driver program
  – Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, or HBase.
• Can be cached for future reuse
SLIDE 21

Operations with RDDs

• Transformations (e.g. map, filter, groupBy, join)
  – Lazy operations that build RDDs from other RDDs
• Actions (e.g. count, collect, save)
  – Return a result or write it to storage

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
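The lazy-versus-eager split can be illustrated without Spark at all: the toy class below (purely illustrative, not the Spark API) records transformations without running them, and only executes the whole pipeline when an action is called:

```python
# A toy, pure-Python illustration of Spark's lazy-transformation idea
# (this is NOT the Spark API; it just mimics the pipeline behaviour).
class ToyRDD:
    def __init__(self, data):
        self._data = data          # source collection
        self._ops = []             # deferred transformations

    def map(self, f):              # transformation: recorded, not run
        new = ToyRDD(self._data)
        new._ops = self._ops + [("map", f)]
        return new

    def filter(self, p):           # transformation: recorded, not run
        new = ToyRDD(self._data)
        new._ops = self._ops + [("filter", p)]
        return new

    def collect(self):             # action: the pipeline actually executes here
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):               # action built on collect
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18] -- transformations run only now
print(rdd.count())    # 4
```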

SLIDE 22

Spark vs. Hadoop

K-Means iteration time (s) vs. number of machines [Zaharia et al., NSDI'12]:

  Machines       25    50    100
  Hadoop         274   157   106
  HadoopBinMem   197   121   87
  Spark          143   61    33

Lines of code for K-Means: Spark ~ 90 lines; Hadoop ~ 4 files, > 300 lines
SLIDE 23

Apache Spark – new collections

DataFrame (Spark 1.3+)
• Equivalent to a table in a relational database (or a data frame in R/Python)
• Avoids the Java serialization performed by RDDs.
• API natural for developers who are familiar with building query plans (e.g. SQL expressions).

Datasets (Spark 1.6+)
• Best of both DataFrames and RDDs.
• Functional transformations (map, flatMap, filter, etc.)
• Spark SQL's optimised execution engine.
SLIDE 24

Flink

https://flink.apache.org/
SLIDE 25

Big Data: Technology and Chronology

• 2001: 3V's – Gartner, Doug Laney
• 2004: MapReduce – Google, Jeffrey Dean
• 2008: Hadoop – Yahoo!, Doug Cutting
• 2009-2013: Flink – TU Berlin, Volker Markl (Apache Flink, Dec. 2014)
• 2010: Spark – UC Berkeley, Matei Zaharia (Apache Spark, Feb. 2014)
• 2010-2016: Big Data analytics (Mahout, MLlib, …), the Hadoop ecosystem, applications, new technology
SLIDE 26

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 27

Big Data Analytics

Potential scenarios:
• Clustering
• Recommendation systems
• Classification
• Association
• Real-time analytics / big data streams
• Social media mining
• Social big data
SLIDE 28

Big Data Analytics: A 3-generational view
SLIDE 29

Mahout (Samsara)

http://mahout.apache.org/

• First ML library, initially based on Hadoop MapReduce.
• Abandoned MapReduce implementations from version 0.9.
• Nowadays it is focused on a new math environment called Samsara.
• It is integrated with Spark, Flink and H2O.
• Main algorithms:
  – Stochastic Singular Value Decomposition (ssvd, dssvd)
  – Stochastic Principal Component Analysis (spca, dspca)
  – Distributed Cholesky QR (thinQR)
  – Distributed regularized Alternating Least Squares (dals)
  – Collaborative Filtering: Item and Row Similarity
  – Naive Bayes Classification
SLIDE 30

Spark Libraries

https://spark.apache.org/mllib/
SLIDE 31

As of Spark 2.0

SLIDE 32

FlinkML

https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/
SLIDE 33

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 34

Demo

• In this demo I will show two ways of working with Apache Spark:
  – Interactive mode with Spark Notebook.
  – Standalone mode with Scala IDE.
• All the code used in this presentation is available at: http://www.cs.nott.ac.uk/~pszit/benelearn.html
SLIDE 35

Demo with Spark Notebook (local mode)

http://spark-notebook.io/
SLIDE 36

Demo with Spark Notebook (local mode)
SLIDE 37

Demo with Spark Notebook (local mode)
SLIDE 38

Demo with Spark Notebook (local mode)
SLIDE 39

Demo with Spark Notebook (local mode)
SLIDE 40

Demo with Spark Notebook (local mode)
SLIDE 41

Demo with Spark Notebook (local mode)
SLIDE 42

Demo with Spark Notebook (local mode)
SLIDE 43

Demo with Spark Notebook (local mode)
SLIDE 44

Demo with Spark Notebook (local mode)

Advantages:
✓ Interactive.
✓ Automatic plots.
✓ It allows connection to a cluster.
✓ Tab completion.

Disadvantages:
✗ Built for specific Spark versions.
✗ Difficult to integrate your own code.
SLIDE 45

Demo with Scala IDE

http://scala-ide.org/
SLIDE 46

Example: An Imbalanced Big Data problem

• Two main approaches to tackle this problem:
  – Data sampling:
    • Undersampling
    • Oversampling
    • Hybrid approaches
  – Algorithmic modifications

I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
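Random undersampling (the “RUS” strategy invoked in the spark-submit example later in the deck) can be sketched in plain Python; this is a single-machine illustration with made-up data, not the distributed implementation from the paper:

```python
import random

random.seed(0)
# A toy imbalanced dataset: 90 majority-class (0) vs 10 minority-class (1) examples.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Undersample: keep only as many majority examples as there are minority ones.
balanced = minority + random.sample(majority, len(minority))
print(len(balanced))  # 20 examples, 10 per class
```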
SLIDE 47

Imbalanced Big Data Classification with Spark
SLIDE 48

Imbalanced Big Data Classification with Spark
SLIDE 49

Imbalanced Big Data Classification with Spark
SLIDE 50

Imbalanced Big Data Classification with Spark
SLIDE 51

Run examples from Scala IDE
SLIDE 52

Run examples from terminal

$ mvn package -Dmaven.test.skip=true
$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample \
    target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree
SLIDE 53

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 54

Conclusions

• We need new strategies to perform ML on big datasets.
  – Choosing the right technology is like choosing the right data structure in a program.
• The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
• Interactive notebooks are very useful for a quick start and standard experiments.
SLIDE 55

Acknowledgments

Thank you
SLIDE 56

Big Data Learning in Practice

12th September 2016

Isaac Triguero

School of Computer Science, University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/
SLIDE 57

Extra slides

SLIDE 58

What is Big Data? The 5V's definition

Volume: data at rest
• Vast amounts of data generated every second
• Data sets are becoming too large to store using traditional database technology
• Big data technology stores these data sets using distributed systems
SLIDE 59

What is Big Data? The 5V's definition

Velocity: data in motion
• Speed at which:
  – Data is generated
  – Data needs to be analyzed
• Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
• Late decisions imply missed opportunities
SLIDE 60

What is Big Data? The 5V's definition

Variety: data in many forms
• One application may generate many different kinds of data
• Several formats and structures:
  – Structured data: tables, relational databases
  – Unstructured data: text, images, audio, video
SLIDE 61

What is Big Data? The 5V's definition

Veracity: data in doubt
• Uncertainty about the quality of the data.
  – E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
• Data may be missing, ambiguous, or even completely wrong.
SLIDE 62
What is Big Data? The 5V's definition

Value: data in use
• Most important motivation for big data
• Big data may result in:
  – Better statistics/models
  – Novel insights
  – New opportunities for research and industry
SLIDE 63

Big Data: applications

• Science and research:
  – E.g. physics, bioinformatics, astronomy.
• Healthcare and public health:
  – Better personalized medicine.
• Business and e-commerce:
  – Personalized advertisement.
• Financial services:
  – Insurance, banks.
SLIDE 64

MapReduce

• Based on functional programming (e.g. Lisp)
  – Operates on <key, value> pairs
    • Web-based example: key = URL; value = webpage
    • Graph-based example: key = node; value = adjacency list
  – The user specifies two functions:
    map: (k1, v1) → list[(k2, v2)]
    reduce: (k2, list[v2]) → list[(k3, v3)]
  – Sorting of intermediate keys between the map and reduce phases
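The two user-supplied functions can be sketched in plain Python; this is a toy, single-machine model of the map → shuffle/sort → reduce dataflow, not Hadoop itself, and the example record set is invented for illustration:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-machine model of MapReduce: map -> shuffle/sort -> reduce."""
    # Map phase: each (k1, v1) record emits a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort phase: group all intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: each (k2, list[v2]) group emits (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Example: total word length per first letter.
records = [(0, "hello world"), (1, "happy wednesday")]
result = map_reduce(
    records,
    map_fn=lambda _, line: [(w[0], len(w)) for w in line.split()],
    reduce_fn=lambda k, vs: [(k, sum(vs))],
)
print(result)  # [('h', 10), ('w', 14)]
```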

SLIDE 65

MapReduce

The dataflow in MapReduce is transparent to the programmer.
SLIDE 66

WordCount using MapReduce

Input file:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop

Map (<key, value> pairs, one line per map task):
  Hello, 1   World, 1   Bye, 1   World, 1
  Hello, 1   Hadoop, 1   Goodbye, 1   Hadoop, 1

Sort and shuffle (group values by key):
  Bye, {1}   World, {1, 1}   Hello, {1, 1}   Hadoop, {1, 1}   Goodbye, {1}

Reduce (output):
  Hello, 2   World, 2   Bye, 1   Hadoop, 2   Goodbye, 1
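The WordCount trace above can be reproduced with a small pure-Python sketch of the map, shuffle and reduce steps (no Hadoop required):

```python
from collections import defaultdict

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Map: each line emits (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the 1s by word.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)

# Reduce: sum each word's list of 1s.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```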

SLIDE 67

Machine learning for Big Data

• Data mining techniques have demonstrated to be very useful tools to extract new valuable knowledge from data.
• The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
• The main challenges are to deal with:
  – The increasing scale of data
    • at the level of instances
    • at the level of features
  – The complexity of the problem
  – And many other points
SLIDE 68

MLlib: Spark Machine learning library

https://spark.apache.org/docs/latest/mllib-guide.html

• MLlib (2010): a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
• Includes:
  – Binary classification (SVMs and logistic regression)
  – Random Forest
  – Regression (Lasso, Ridge, etc.)
  – Clustering (k-means)
  – Collaborative filtering
  – Gradient descent optimization primitives