Big Data Learning in Practice
12th September 2016
Isaac Triguero
School of Computer Science, University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html
[Figure: compute-intensive cluster. Worker nodes (lots of them), each running its own OS, are linked by a communication network (Infiniband) and reach central storage over a separate network for I/O. Input data is relatively small: lots of computations, lots of communication, limited I/O.]
Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis
[Figure: the same cluster on a data-intensive workload. Worker nodes share a fast communication network (Infiniband) and a network for I/O to central storage, which holds the input data (lots of it, blocks a to j). Low compute intensity, limited communication, lots of I/O: doesn't scale.]
[Figure: input data (lots of it, blocks a to j) spread over the local disks of the worker nodes, each block stored twice; the nodes are linked by a communication network. Low compute intensity, limited communication.]
Solution: store data on local disks of the nodes that perform computations on that data ("data locality")
http://hadoop.apache.org/
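[Figure: data sharing in MapReduce. Iterative jobs: Input -> HDFS read -> HDFS write -> HDFS read -> HDFS write -> . . . Interactive queries: query 1, query 2 and query 3 each perform an HDFS read of the Input to produce result 1, result 2 and result 3.]
Slow due to replication, serialization, and disk I/O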
– The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
– More complex graph models that better represent the dataflow of the application.
– Cyclic models -> Iterativity.
– An extended programming model that supports iterative MapReduce computations efficiently.
– GIRAPH (Apache project), http://giraph.apache.org/. Iterative graph processing.
– GPS - A Graph Processing System (Stanford), http://infolab.stanford.edu/gps/. Amazon's EC2.
– Distributed GraphLab (Carnegie Mellon Univ.), https://github.com/graphlab-code/graphlab. Amazon's EC2.
– HaLoop (University of Washington), http://clue.cs.washington.edu/node/14, http://code.google.com/p/haloop/. Amazon's EC2.
– Twister (Indiana University), http://www.iterativemapreduce.org/. Private clusters.
– PrIter (University of Massachusetts Amherst, Northeastern University-China), http://code.google.com/p/priter/. Private cluster and Amazon EC2 cloud.
– GPU based platforms: Mars, Grex.
– Spark (UC Berkeley), http://spark.incubator.apache.org/research.html
Efficient
Usable
Python
Up to 10× faster on disk,
Initial Solution: augment data flow model with "resilient distributed datasets" (RDDs)
Transformations (define a new RDD)
map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations (return a result to driver)
reduce, collect, count, save, lookupKey, …
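The split between transformations and parallel operations can be illustrated with a toy, single-machine sketch in plain Python. `ToyRDD` and everything inside it are hypothetical names invented for this illustration; this is not Spark's API and it does no distribution or fault recovery. It only shows the idea that a transformation records work and returns a new dataset, while a parallel operation triggers the computation and returns a value to the caller.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, single machine, not the Spark API)."""

    def __init__(self, source, ops=()):
        self._source = source    # underlying data, a plain list here
        self._ops = tuple(ops)   # recorded transformations, not yet applied

    # Transformations: define a new ToyRDD without computing anything.
    def map(self, f):
        return ToyRDD(self._source, self._ops + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        # Replay the recorded transformations over the source data.
        items = iter(self._source)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    # Parallel operations: run the recorded work and return a result.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

    def reduce(self, f):
        return fold(f, self._iterate())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = evens_squared.collect()   # the action triggers the computation
```

Because `map` and `filter` only extend the recorded list of operations, `square` is not called until `collect()` runs.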
[Figure: K-Means, iteration time (s) vs. number of machines]
Number of machines   Hadoop   HadoopBinMem   Spark
25                   274      197            143
50                   157      121            61
100                  106      87             33
[Zaharia et al., NSDI'12]
Lines of code for K-Means: Spark ~ 90 lines – Hadoop ~ 4 files, > 300 lines
Python)
plans (e.g. SQL expressions).
https://flink.apache.org/
2001-2010
– 2001: 3V's (Gartner, Doug Laney)
– 2004: MapReduce (Google, Jeffrey Dean)
– 2008: Hadoop (Yahoo!, Doug Cutting)
2010-2016
– 2010: Spark (UC Berkeley, Apache Spark, Matei Zaharia)
– 2009-2013: Flink (TU Berlin, Apache Flink Dec. 2014, Volker Markl)
– 2010-2016: Big Data Analytics (Mahout, MLlib, …), Hadoop Ecosystem, Applications, New Technology
Clustering, Recommendation Systems, Classification, Association
Real Time Analytics / Big Data Streams, Social Media Mining, Social Big Data
http://mahout.apache.org/
Samsara.
– Stochastic Singular Value Decomposition (ssvd, dssvd)
– Stochastic Principal Component Analysis (spca, dspca)
– Distributed Cholesky QR (thinQR)
– Distributed regularized Alternating Least Squares (dals)
– Collaborative Filtering: Item and Row Similarity
– Naive Bayes Classification
https://spark.apache.org/mllib/
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/
http://spark-notebook.io/
http://scala-ide.org/
Two main approaches to
– Data sampling:
   – Undersampling
   – Oversampling
   – Hybrid approaches
– Algorithmic modifications
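As a concrete instance of the data sampling approach, the following is a minimal single-machine sketch of random undersampling in plain Python. The function name `random_undersample`, its parameters and the synthetic data are assumptions made for this illustration; it is not the implementation used in the slides, and a distributed version would sample per partition rather than in one list.

```python
import random
from collections import Counter


def random_undersample(rows, label_of, seed=0):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Synthetic, highly imbalanced data: 95 negative rows and 5 positive rows.
data = [((i,), "neg") for i in range(95)] + [((i,), "pos") for i in range(5)]
balanced = random_undersample(data, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced)
```

Oversampling would instead replicate or synthesize minority-class rows until the classes match, and hybrid approaches combine both.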
$ mvn package -Dmaven.test.skip=true
$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample \
    target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree
Volume: data at rest
Velocity: data in motion
Variety: data in many forms
Veracity: data in doubt
Value: data in use
The dataflow in MapReduce is transparent to the programmers
[Figure: word count in MapReduce]
Input file:
   Hello World Bye World
   Hello Hadoop Goodbye Hadoop
Splitting:
   split 1: Hello World Bye World
   split 2: Hello Hadoop Goodbye Hadoop
Map (key, value):
   split 1: Hello, 1  World, 1  Bye, 1  World, 1
   split 2: Hello, 1  Hadoop, 1  Goodbye, 1  Hadoop, 1
Sort and shuffle:
   Bye, {1}  World, {1,1}  Hello, {1,1}  Hadoop, {1,1}  Goodbye, {1}
Reduce (key, value pairs), output:
   Hello, 2  World, 2  Bye, 1  Hadoop, 2  Goodbye, 1
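The word count dataflow can be mimicked on a single machine in plain Python. This is only a sketch of the three phases: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are invented for the illustration, and none of this is Hadoop's API. In a real MapReduce job the framework performs the splitting, shuffling and scheduling, and the programmer writes only the map and reduce functions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort and shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: sum the list of counts collected for one key."""
    return key, sum(values)


# The two input splits from the example.
splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Here `grouped` holds the per-key value lists produced by the shuffle, and `counts` holds the final word totals.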
tools to extract new valuable knowledge from data.
very difficult task for most of the classical and advanced data mining tools.
– The increasing scale of data
– The complexity of the problem.
– And many other points
common machine learning functionality, as well as associated tests and data generators.
– Binary classification (SVMs and Logistic Regression)
– Random Forest
– Regression (Lasso, Ridge, etc.)
– Clustering (K-Means)
– Collaborative Filtering
– Gradient Descent Optimization Primitive
https://spark.apache.org/docs/latest/mllib-guide.html