SLIDE 1

Big Data Learning in Practice

12th September 2016

Isaac Triguero

School of Computer Science, University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html
SLIDE 2

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 3

What is Big Data?

There is no standard definition!

“Big Data” involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.

Data-intensive applications
SLIDE 4

What is Big Data? The 5V's definition
SLIDE 5

Big data has many faces

SLIDE 6

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 7
How to deal with data-intensive applications?

• Problem statement: scalability to big data sets.
• Example:
  – Exploring 100 TB with 1 node @ 50 MB/sec = 23 days
  – Exploration with a cluster of 1000 nodes = 33 minutes
• What happens if we have to manage 1000 or 10000 TB?
• Solution → divide and conquer
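The arithmetic behind these figures is worth a quick sanity check (assuming, as the slide does, a 50 MB/s sequential scan rate per node and perfect parallelism across 1000 nodes):

```python
volume = 100e12   # 100 TB, in bytes
rate = 50e6       # 50 MB/s scan rate for a single node

# One node: total scan time in seconds, converted to days.
one_node_days = volume / rate / 86400
# 1000 nodes scanning in parallel: seconds, converted to minutes.
cluster_minutes = volume / (rate * 1000) / 60

print(round(one_node_days, 1))    # ~23.1 days
print(round(cluster_minutes, 1))  # ~33.3 minutes
```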

SLIDE 8

MapReduce

• Parallel programming model
• Divide & conquer strategy:
  – divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
  – conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
• Based on simplicity and transparency for the programmer, and assumes data locality.
• Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
SLIDE 9

Traditional HPC way of doing things

[Figure: lots of worker nodes, each with its own OS, connected by a fast communication network (Infiniband) and a separate I/O network to central storage; the input data is relatively small, with lots of computation and lots of communication, but limited I/O.]

Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis
SLIDE 10

Data-intensive jobs

[Figure: low compute intensity and limited communication between nodes; lots of input data must be pulled from central storage over the I/O network to every node, so the approach is dominated by I/O and doesn't scale.]
SLIDE 11

Data-intensive jobs

[Figure: low compute intensity and limited communication; the input data chunks are spread across the local disks of the nodes.]

Solution: store data on the local disks of the nodes that perform computations on that data (“data locality”)
SLIDE 12

Hadoop

http://hadoop.apache.org/

• Hadoop is:
  – An open-source framework written in Java
  – Distributed storage of very large data sets (Big Data)
  – Distributed processing of very large data sets
• The framework consists of a number of modules:
  – Hadoop Common
  – Hadoop Distributed File System (HDFS)
  – Hadoop YARN – resource manager
  – Hadoop MapReduce – programming model
SLIDE 13
Hadoop MapReduce: Main Characteristics

• Automatic parallelization:
  – Depending on the size of the input data → there will be multiple MAP tasks!
  – Depending on the number of <key, value> pairs → there will be multiple REDUCE tasks!
• Scalability:
  – It may work on any data center or cluster of computers.
• Transparent for the programmer:
  – Fault-tolerance mechanism.
  – Automatic communication among computers.
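For instance, the number of map tasks follows directly from the input size: Hadoop creates roughly one map task per input split, which by default equals one HDFS block (128 MB in Hadoop 2.x; the 10 GB input below is a hypothetical example, not a figure from the slides):

```python
import math

block_size = 128 * 1024**2   # default HDFS block size in Hadoop 2.x (configurable)
input_size = 10 * 1024**3    # hypothetical 10 GB input file

# One map task per input split (= one HDFS block by default).
map_tasks = math.ceil(input_size / block_size)
print(map_tasks)  # 80 map tasks
```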

SLIDE 14

Data Sharing in Hadoop MapReduce

[Figure: an iterative job (iter. 1, iter. 2, …) writes its result to HDFS and reads it back between every step; likewise, each of several queries (query 1, query 2, query 3) over the same input repeats an HDFS read to produce its result.]

Slow due to replication, serialization, and disk IO
SLIDE 15

Paradigms that do not fit with Hadoop MapReduce

• Directed Acyclic Graph (DAG) model:
  – The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
• Graph model:
  – More complex graph models that better represent the dataflow of the application.
  – Cyclic models → iterativity.
• Iterative MapReduce model:
  – An extended programming model that supports iterative MapReduce computations efficiently.
SLIDE 16

New platforms to overcome Hadoop's limitations

• GIRAPH (Apache project) – iterative graph processing – http://giraph.apache.org/
• GPS – A Graph Processing System (Stanford) – http://infolab.stanford.edu/gps/ – Amazon's EC2
• Distributed GraphLab (Carnegie Mellon Univ.) – https://github.com/graphlab-code/graphlab – Amazon's EC2
• HaLoop (University of Washington) – http://clue.cs.washington.edu/node/14, http://code.google.com/p/haloop/ – Amazon's EC2
• Twister (Indiana University) – http://www.iterativemapreduce.org/ – private clusters
• PrIter (University of Massachusetts Amherst, Northeastern University – China) – http://code.google.com/p/priter/ – private cluster and Amazon EC2 cloud
• GPU-based platforms: Mars, Grex
• Spark (UC Berkeley) – http://spark.incubator.apache.org/research.html
SLIDE 17

Big data technologies

SLIDE 18

What is Spark?

Fast and expressive cluster computing engine, compatible with Apache Hadoop

Efficient:
• General execution graphs
• In-memory storage

Usable:
• Rich APIs in Java, Scala, Python
• Interactive shell

2-5× less code
Up to 10× faster on disk, 100× in memory
SLIDE 19

Spark Goal

• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
  – Fault tolerance (for crashes & stragglers)
  – Data locality
  – Scalability

Initial solution: augment the data flow model with “resilient distributed datasets” (RDDs)
SLIDE 20

RDDs in Detail

• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs:
  – Parallelizing an existing collection in your driver program
  – Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, or HBase.
• Can be cached for future reuse
SLIDE 21

Operations with RDDs

• Transformations (e.g. map, filter, groupBy, join)
  – Lazy operations that build RDDs from other RDDs
• Actions (e.g. count, collect, save)
  – Return a result or write it to storage

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
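The lazy-versus-eager split can be illustrated without Spark at all: the toy class below (purely illustrative, not the Spark API) records transformations without running them, and only executes the whole pipeline when an action is called:

```python
# A toy, pure-Python illustration of Spark's lazy-transformation idea
# (this is NOT the Spark API; it just mimics the pipeline behaviour).
class ToyRDD:
    def __init__(self, data):
        self._data = data          # source collection
        self._ops = []             # deferred transformations

    def map(self, f):              # transformation: recorded, not run
        new = ToyRDD(self._data)
        new._ops = self._ops + [("map", f)]
        return new

    def filter(self, p):           # transformation: recorded, not run
        new = ToyRDD(self._data)
        new._ops = self._ops + [("filter", p)]
        return new

    def collect(self):             # action: the pipeline actually executes here
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):               # action built on collect
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18] -- transformations run only now
print(rdd.count())    # 4
```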

SLIDE 22

Spark vs. Hadoop

K-Means iteration time (s) vs. number of machines [Zaharia et al., NSDI'12]:

  Machines       25    50    100
  Hadoop         274   157   106
  HadoopBinMem   197   121   87
  Spark          143   61    33

Lines of code for K-Means: Spark ~ 90 lines; Hadoop ~ 4 files, > 300 lines
SLIDE 23

Apache Spark – new collections

DataFrame (Spark 1.3+)
• Equivalent to a table in a relational database (or a data frame in R/Python)
• Avoids the Java serialization performed by RDDs.
• API natural for developers who are familiar with building query plans (e.g. SQL expressions).

Datasets (Spark 1.6+)
• Best of both DataFrames and RDDs.
• Functional transformations (map, flatMap, filter, etc.)
• Spark SQL's optimised execution engine.
SLIDE 24

Flink

https://flink.apache.org/
SLIDE 25

Big Data: Technology and Chronology

• 2001: 3V's – Gartner, Doug Laney
• 2004: MapReduce – Google, Jeffrey Dean
• 2008: Hadoop – Yahoo!, Doug Cutting
• 2009-2013: Flink – TU Berlin, Volker Markl (Apache Flink, Dec. 2014)
• 2010: Spark – UC Berkeley, Matei Zaharia (Apache Spark, Feb. 2014)
• 2010-2016: Big Data analytics (Mahout, MLlib, …), the Hadoop ecosystem, applications, new technology
SLIDE 26

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 27

Big Data Analytics

Potential scenarios:
• Clustering
• Recommendation systems
• Classification
• Association
• Real-time analytics / big data streams
• Social media mining
• Social big data
SLIDE 28

Big Data Analytics: A 3-generational view
SLIDE 29

Mahout (Samsara)

http://mahout.apache.org/

• First ML library, initially based on Hadoop MapReduce.
• Abandoned MapReduce implementations from version 0.9.
• Nowadays it is focused on a new math environment called Samsara.
• It is integrated with Spark, Flink and H2O.
• Main algorithms:
  – Stochastic Singular Value Decomposition (ssvd, dssvd)
  – Stochastic Principal Component Analysis (spca, dspca)
  – Distributed Cholesky QR (thinQR)
  – Distributed regularized Alternating Least Squares (dals)
  – Collaborative Filtering: Item and Row Similarity
  – Naive Bayes Classification
SLIDE 30

Spark Libraries

https://spark.apache.org/mllib/
SLIDE 31

As of Spark 2.0

SLIDE 32

FlinkML

https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/
SLIDE 33

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 34

Demo

• In this demo I will show two ways of working with Apache Spark:
  – Interactive mode with Spark Notebook.
  – Standalone mode with Scala IDE.
• All the code used in this presentation is available at: http://www.cs.nott.ac.uk/~pszit/benelearn.html
SLIDE 35

Demo with Spark Notebook (local mode)

http://spark-notebook.io/
SLIDE 36

Demo with Spark Notebook (local mode)
SLIDE 37

Demo with Spark Notebook (local mode)
SLIDE 38

Demo with Spark Notebook (local mode)
SLIDE 39

Demo with Spark Notebook (local mode)
SLIDE 40

Demo with Spark Notebook (local mode)
SLIDE 41

Demo with Spark Notebook (local mode)
SLIDE 42

Demo with Spark Notebook (local mode)
SLIDE 43

Demo with Spark Notebook (local mode)
SLIDE 44

Demo with Spark Notebook (local mode)

Advantages:
✓ Interactive.
✓ Automatic plots.
✓ It allows connection to a cluster.
✓ Tab completion.

Disadvantages:
✗ Built for specific Spark versions.
✗ Difficult to integrate your own code.
SLIDE 45

Demo with Scala IDE

http://scala-ide.org/
SLIDE 46

Example: An Imbalanced Big Data problem

• Two main approaches to tackle this problem:
  – Data sampling:
    • Undersampling
    • Oversampling
    • Hybrid approaches
  – Algorithmic modifications

I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
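Random undersampling (the “RUS” strategy invoked in the spark-submit example later in the deck) can be sketched in plain Python; this is a single-machine illustration with made-up data, not the distributed implementation from the paper:

```python
import random

random.seed(0)
# A toy imbalanced dataset: 90 majority-class (0) vs 10 minority-class (1) examples.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Undersample: keep only as many majority examples as there are minority ones.
balanced = minority + random.sample(majority, len(minority))
print(len(balanced))  # 20 examples, 10 per class
```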
SLIDE 47

Imbalanced Big Data Classification with Spark
SLIDE 48

Imbalanced Big Data Classification with Spark
SLIDE 49

Imbalanced Big Data Classification with Spark
SLIDE 50

Imbalanced Big Data Classification with Spark
SLIDE 51

Run examples from Scala IDE
SLIDE 52

Run examples from terminal

$ mvn package -Dmaven.test.skip=true
$ /opt/spark/bin/spark-submit --master local[*] \
    --class Undersampling.UndersamplingExample \
    target/EUS-0.0.1-BETA.jar \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data \
    hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data \
    4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree
SLIDE 53

Outline

• What is Big Data?
• How to deal with data-intensive applications?
• Big Data analytics
• A demo with MLlib
• Conclusions
SLIDE 54

Conclusions

• We need new strategies to perform ML on big datasets.
  – Choosing the right technology is like choosing the right data structure in a program.
• The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
• Interactive notebooks are very useful for a quick start and standard experiments.
SLIDE 55

Acknowledgments

Thank you
SLIDE 56

Big Data Learning in Practice

12th September 2016

Isaac Triguero

School of Computer Science, University of Nottingham, United Kingdom
Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/
SLIDE 57

Extra slides

SLIDE 58

What is Big Data? The 5V's definition

Volume: data at rest
• Vast amounts of data generated every second
• Data sets are becoming too large to store using traditional database technology
• Big data technology stores these data sets using distributed systems
SLIDE 59

What is Big Data? The 5V's definition

Velocity: data in motion
• Speed at which:
  – Data is generated
  – Data needs to be analyzed
• Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
• Late decisions imply missed opportunities
SLIDE 60

What is Big Data? The 5V's definition

Variety: data in many forms
• One application may generate many different kinds of data
• Several formats and structures:
  – Structured data: tables, relational databases
  – Unstructured data: text, images, audio, video
SLIDE 61

What is Big Data? The 5V's definition

Veracity: data in doubt
• Uncertainty about the quality of the data.
  – E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
• Data may be missing, ambiguous, or even completely wrong.
SLIDE 62
What is Big Data? The 5V's definition

Value: data in use
• Most important motivation for big data
• Big data may result in:
  – Better statistics/models
  – Novel insights
  – New opportunities for research and industry
SLIDE 63

Big Data: applications

• Science and research:
  – E.g. physics, bioinformatics, astronomy.
• Healthcare and public health:
  – Better personalized medicine.
• Business and e-commerce:
  – Personalized advertisement.
• Financial services:
  – Insurance, banks.
SLIDE 64

MapReduce

• Based on functional programming (e.g. Lisp)
  – Operates on <key, value> pairs
    • Web-based example: key = URL; value = webpage
    • Graph-based example: key = node; value = adjacency list
  – The user specifies two functions:
    map: (k1, v1) → list[(k2, v2)]
    reduce: (k2, list[v2]) → list[(k3, v3)]
  – Sorting of intermediate keys between the map and reduce phases
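The two user-supplied functions can be sketched in plain Python; this is a toy, single-machine model of the map → shuffle/sort → reduce dataflow, not Hadoop itself, and the example record set is invented for illustration:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-machine model of MapReduce: map -> shuffle/sort -> reduce."""
    # Map phase: each (k1, v1) record emits a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort phase: group all intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: each (k2, list[v2]) group emits (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Example: total word length per first letter.
records = [(0, "hello world"), (1, "happy wednesday")]
result = map_reduce(
    records,
    map_fn=lambda _, line: [(w[0], len(w)) for w in line.split()],
    reduce_fn=lambda k, vs: [(k, sum(vs))],
)
print(result)  # [('h', 10), ('w', 14)]
```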

SLIDE 65

MapReduce

The dataflow in MapReduce is transparent to the programmer.
SLIDE 66

WordCount using MapReduce

Input file:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop

Map (<key, value> pairs, one line per map task):
  Hello, 1   World, 1   Bye, 1   World, 1
  Hello, 1   Hadoop, 1   Goodbye, 1   Hadoop, 1

Sort and shuffle (group values by key):
  Bye, {1}   World, {1, 1}   Hello, {1, 1}   Hadoop, {1, 1}   Goodbye, {1}

Reduce (output):
  Hello, 2   World, 2   Bye, 1   Hadoop, 2   Goodbye, 1
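The WordCount trace above can be reproduced with a small pure-Python sketch of the map, shuffle and reduce steps (no Hadoop required):

```python
from collections import defaultdict

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

# Map: each line emits (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the 1s by word.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)

# Reduce: sum each word's list of 1s.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```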

SLIDE 67

Machine learning for Big Data

• Data mining techniques have demonstrated to be very useful tools to extract new valuable knowledge from data.
• The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
• The main challenges are to deal with:
  – The increasing scale of data
    • at the level of instances
    • at the level of features
  – The complexity of the problem
  – And many other points
SLIDE 68

MLlib: Spark Machine learning library

https://spark.apache.org/docs/latest/mllib-guide.html

• MLlib (2010): a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
• Includes:
  – Binary classification (SVMs and logistic regression)
  – Random Forest
  – Regression (Lasso, Ridge, etc.)
  – Clustering (k-means)
  – Collaborative filtering
  – Gradient descent optimization primitives