Big Data Programming: an Introduction Spring 2015, X. Zhang - PowerPoint PPT Presentation

Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ.

Outline • What is the course about? • Scope • Introduction to big data programming • Opportunity and challenge of big data • Origin of Hadoop • High-level overview: HDFS, MapReduce, YARN

Learning Goal • Understand concepts in distributed computing for big data • Be able to develop MapReduce programs to crunch big data • Be able to perform basic management, administration and troubleshooting of a Hadoop cluster • Be able to understand and use tools in the Hadoop ecosystem through self-learning • Final projects/presentations

Prerequisite • Proficiency in C++, Java or Python – and the ability to pick up a new language quickly • Familiarity with Unix/Linux systems – Understanding of Unix file systems, users and permissions… – Basic Unix commands – Shell scripting: to automate running your programs and collecting results…

What is Big Data • Data sets that grow so large that they become awkward to work with using on-hand database management tools. (Wikipedia)

Where do they come from? • New York Stock Exchange: one terabyte of new trade data per day • Facebook: 10 billion photos, one petabyte of storage • Data generated by machines: logs, sensor networks, GPS traces, electronic transactions, … • Have you collected data? – Network traces projects: Internet measurements…

Multiples of Bytes: decimal prefixes • 1000^1 kB kilobyte • 1000^2 MB megabyte • 1000^3 GB gigabyte • 1000^4 TB terabyte • 1000^5 PB petabyte • 1000^6 EB exabyte • 1000^7 ZB zettabyte • 1000^8 YB yottabyte
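
As a quick illustration of the decimal prefixes above, the following sketch converts a raw byte count into the largest fitting unit. The function name human_bytes is hypothetical and not part of the course material.

```python
# Decimal (SI) byte prefixes: each step up is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_bytes(n_bytes):
    """Express a byte count using the largest decimal prefix that fits."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop once the value drops below 1000, or when we run out of prefixes.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


if __name__ == "__main__":
    print(human_bytes(1000 ** 4))  # one terabyte
    print(human_bytes(10 ** 15))   # one petabyte
```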

Cost of Storage • 1991, consumer grade, 1 gigabyte (1/1000 TB) disk drives, US$2699 • 1995, 1 GB drives, US$849 • 2007: 1 terabyte hard disk, $375 • 2010: 2 terabyte hard disk costs US$200 • 2012: 4 terabyte hard disk US$450, 1 terabyte hard disk US$100 • 2013: 4 terabyte hard disk US$179, 3 terabyte hard disk $129, 2 terabyte hard disk $100, 1 terabyte hard disk US $80 • 2014: 4 terabyte hard disk US$150, 3 terabyte hard disk $129, 2 terabyte hard disk $90, 1 terabyte hard disk US $60

Challenges • General problems in the big data era: • How to process a very large volume of data in a reasonable amount of time? • It turns out that disk bandwidth has become the bottleneck, i.e., a hard disk cannot read data fast enough… • Solution: parallel processing • Google’s problem: to crawl, analyze and rank web pages into a giant inverted index (to support its search engine) • Google engineers went ahead and built their own systems: • Google File System, “exabyte-scale data management using commodity hardware” • Google MapReduce (GMR), “implementation of design pattern applied to massively parallel processing”
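
To make the disk-bandwidth bottleneck concrete, here is a rough back-of-the-envelope sketch. The 100 MB/s transfer rate and the 100-disk cluster are assumed, illustrative figures, not numbers from the slides, and the helper read_time_seconds is hypothetical.

```python
def read_time_seconds(data_bytes, disk_mb_per_s, n_disks=1):
    """Time to scan data_bytes spread evenly over n_disks that are read in parallel."""
    bytes_per_second = disk_mb_per_s * 1000 ** 2 * n_disks
    return data_bytes / bytes_per_second


if __name__ == "__main__":
    one_tb = 1000 ** 4
    # Assumed, illustrative transfer rate: 100 MB/s per disk.
    single = read_time_seconds(one_tb, 100)
    parallel = read_time_seconds(one_tb, 100, n_disks=100)
    print(f"1 disk:    {single / 3600:.1f} hours")
    print(f"100 disks: {parallel / 60:.1f} minutes")
```

Under these assumptions, reading one terabyte drops from hours to minutes once the data is spread over many disks, which is the motivation for parallel processing.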

Background: Inverted Index • Goal: to support search queries, where we need to locate the documents containing some given words and then rank these documents by relevance • Means: create an inverted index, which stores, for each word, a list of the documents containing it • Example:
  Word    Documents where the word appears
  the     Document 1, Document 3, Document 4, Document 5
  cow     Document 2, Document 3, Document 4
  says    Document 5
  moo     Document 7
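
A minimal single-machine sketch of building and querying such an index follows. The function names are hypothetical, words are split on whitespace only, and relevance ranking is left out.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it.

    documents: dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, *words):
    """Return the ids of documents that contain every query word."""
    postings = [set(index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    docs = {1: "the cow", 2: "cow says moo", 3: "the cow says"}
    index = build_inverted_index(docs)
    print(index["cow"])
    print(search(index, "the", "says"))
```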

Hadoop History • Originally part of the Nutch project (later developed heavily at Yahoo): crawl and index a large number of web pages • Idea: the program is distributed, and each copy processes the part of the data stored with it • Two Google papers => Hadoop project (an open source implementation of a distributed file system and the MapReduce framework) • Hadoop: scheduling and resource management framework for executing map and reduce jobs in a cluster environment • Now an open source project, Apache Hadoop • Hadoop ecosystem: various tools that make it easier to use • Hive, Pig: tools that translate a more abstract description of a workload into map-reduce pipelines.

High-level View: HDFS, MapReduce

HDFS (Hadoop Distributed File System) • A file system running on clusters of commodity hardware • Capable of storing very large files • Optimized for streaming data access (i.e., sequential reads) • Initial intent of Hadoop: large parallel batch processing jobs • Resilient to node failures through replication

HDFS as a file system • Command line operations: • hadoop fs -ls (-mkdir, …) • hadoop fs -copyFromLocal … • hadoop fs -copyToLocal … • Java Programming API: • open file, close file, read and write file, … from programs

MapReduce • End-user MapReduce API for programming MapReduce applications. • MapReduce framework: the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation and the reduce phase. • MapReduce system: the backend infrastructure required to run the user’s MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.

MapReduce Programming Model • Input: a set of [key,value] pairs • Split • Intermediate [key,value] pairs • Shuffle: [k1,v11,v12,…], [k2,v21,v22,…], … • Output: a set of [key,value] pairs

Word Count Example • Example: counting the number of occurrences of each word in a large collection of documents. • Pseudo-code:
  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
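
The pseudo-code above can be sketched as a runnable, single-process Python program. This is only an in-memory illustration of the map, group-by-key and reduce steps, not Hadoop's API. The names map_fn, reduce_fn and word_count are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit an intermediate (word, 1) pair for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count({"a.txt": "the cow says moo", "b.txt": "the cow"}))
```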

WeatherData Example • Problem: find the highest temperature for each year • Input: a single file containing multiple years of weather data • Output: [year, highest_temp] pairs • Data flow: input [k,v] pairs => intermediate [k,v] pairs => output [k,v] pairs
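
A minimal sketch of this job in plain Python follows. The record layout ("year,temperature" per line) is an assumption made for illustration, since the slides do not specify the file format, and the function names are hypothetical. Sorting followed by grouping stands in for the shuffle phase.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Parse one 'year,temperature' record into a (year, temperature) pair."""
    year, temp = line.strip().split(",")
    return year, int(temp)


def reduce_fn(year, temps):
    """Keep the highest temperature seen for one year."""
    return year, max(temps)


def max_temperature(lines):
    intermediate = [map_fn(line) for line in lines if line.strip()]
    # Shuffle/sort: bring all pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))
    return [
        reduce_fn(year, [temp for _, temp in group])
        for year, group in groupby(intermediate, key=itemgetter(0))
    ]


if __name__ == "__main__":
    records = ["1950,0", "1950,22", "1949,111", "1950,-11", "1949,78"]
    print(max_temperature(records))
```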

Parallel Execution: Scaling Out • A MapReduce job is a unit of work that the client/user wants performed, consisting of: • input data • MapReduce program • configuration information • The Hadoop system: • divides the job into map and reduce tasks • divides the input into fixed-size pieces called input splits, or splits • creates one map task for each split, which runs the user-defined map function for each record in the split
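
The split step can be sketched as follows. This is a simplified illustration that cuts purely on byte offsets. The helper input_splits is hypothetical, and the 300 MB file and 128 MB split size are assumed example values.

```python
def input_splits(file_size, split_size):
    """Return (offset, length) pairs covering file_size bytes in fixed-size pieces."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


if __name__ == "__main__":
    MB = 1000 ** 2
    splits = input_splits(file_size=300 * MB, split_size=128 * MB)
    # One map task is created per split.
    for task_id, (offset, length) in enumerate(splits):
        print(f"map task {task_id}: bytes {offset}..{offset + length - 1}")
```

Only the last split can be shorter than the configured size, so a 300 MB input with 128 MB splits yields three map tasks.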

MapReduce and HDFS • Parallelism of MapReduce + very high aggregate I/O bandwidth across a large cluster provided by HDFS => the economics of the system are extremely compelling, a key factor in the popularity of Hadoop. • Key: lack of data motion, i.e., move compute to the data rather than moving data to the compute node over the network. Specifically, MapReduce tasks can be scheduled on the same physical nodes on which the data resides in HDFS, which exposes the underlying storage layout across the cluster. • Benefits: reduces network I/O and keeps most of the I/O on the local disk or within the same rack.
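
The "move compute to data" preference can be sketched as a toy placement rule: try a node that holds the block, then a node in the same rack, then any free node. This is a hypothetical simplification for illustration, not Hadoop's actual scheduler. The function pick_node and its arguments are assumptions.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a node for a map task: data-local first, then rack-local, then any.

    block_replicas: set of nodes holding a replica of the task's input block.
    free_nodes: list of nodes that currently have a free task slot.
    rack_of: dict mapping node name -> rack name.
    Returns (node, locality), or (None, None) if no node is free.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node({"n1"}, ["n3", "n2"], racks))
```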

Hadoop 1.x • Two types of nodes control the job execution process: a jobtracker and a number of tasktrackers. • Jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers. • Tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
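
The rescheduling behaviour can be illustrated with a small simulation. Everything here is a hypothetical sketch: run_job, the attempt callback and the limit of four attempts are assumptions made for illustration, and real tasktrackers run tasks concurrently rather than one at a time.

```python
def run_job(tasks, trackers, attempt, max_attempts=4):
    """Assign each task to a tasktracker; on failure, retry on a different one.

    attempt(task, tracker) -> bool is a caller-supplied stand-in for running a task.
    Returns {task: tracker that completed it}; raises if a task keeps failing.
    """
    completed = {}
    for task in tasks:
        tried = []
        for _ in range(max_attempts):
            # Prefer trackers not yet tried for this task; fall back to all of them.
            candidates = [t for t in trackers if t not in tried] or trackers
            tracker = candidates[0]
            tried.append(tracker)
            if attempt(task, tracker):
                completed[task] = tracker
                break
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return completed


if __name__ == "__main__":
    # Simulate a broken tasktracker "tt1": every task attempted there fails.
    print(run_job(["m0", "m1"], ["tt1", "tt2"], lambda task, tt: tt != "tt1"))
```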

YARN: Yet Another Resource Negotiator • Resource management => a global ResourceManager • Per-node resource monitor => NodeManager • Job scheduling/monitoring => per-application ApplicationMaster (AM). Hadoop daemons are Java processes running in the background, talking to each other via RPC. (SSH is used by the cluster start-up scripts to launch daemons on remote nodes, not for the RPC traffic itself.)

YARN: • Master-slave system: the ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. • ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. • Pluggable Scheduler: allocates resources to the various running applications • based on the resource requirements of the applications • based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, network, etc. • Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor component tasks.
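
The idea of granting Resource Containers against per-node capacity can be sketched with a toy first-fit allocator. This is a hypothetical stand-in, far simpler than YARN's pluggable schedulers. The Node class and allocate function are assumptions, and only memory and CPU cores are modelled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free capacity reported by one NodeManager (illustrative fields only)."""
    name: str
    free_memory_mb: int
    free_vcores: int


def allocate(nodes, memory_mb, vcores):
    """Grant a container on the first node with enough free memory and cores.

    Returns the node name, or None if the request cannot be satisfied right now.
    """
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node.name
    return None


if __name__ == "__main__":
    cluster = [Node("nm1", 2048, 2), Node("nm2", 8192, 4)]
    # An ApplicationMaster asking for a 4 GB, 1-core container.
    print(allocate(cluster, memory_mb=4096, vcores=1))
```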
