
Apache Hadoop, Big Data, and You
Philip Zeyliger
philip@cloudera.com @philz42 @cloudera
Wednesday, November 18, 2009

Hi there!
Software Engineer
Worked at

I work on stuff...

Outline
Why should you care? (Intro)
Challenging yesteryear’s assumptions
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions

Data is everywhere. Data is important.

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”
Hal Varian (Google’s chief economist)

So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

Apache Hadoop is an open-source system (written in Java!) to store and process gobs of data across many commodity computers.
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

Two Big Components
HDFS: Self-healing high-bandwidth clustered storage.
Map/Reduce: Fault-tolerant distributed computing.

Challenging some of yesteryear’s assumptions...

Assumption 1: Machines can be reliable...
Image: MadMan the Mighty, CC BY-NC-SA

Hadoop Goal: Separate distributed system fault-tolerance code from application logic.
(Diagram labels: Systems Programmers, Statisticians)

Assumption 2: Machines have identities...
Image: Laughing Squid, CC BY-NC-SA

Hadoop Goal: Users should interact with clusters, not machines.

Assumption 3: A data set fits on one machine...
Image: Matthew J. Stinson, CC BY-NC

Hadoop Goal: System should scale linearly (or better) with data size.

The M/R Programming Model

You specify map() and reduce() functions. The framework does the rest.

map()
map: (K₁, V₁) → list(K₂, V₂)

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      /**
       * Called once for each key/value pair in the input split. Most applications
       * should override this, but the default is the identity function.
       */
      protected void map(KEYIN key, VALUEIN value, Context context)
          throws IOException, InterruptedException {
        // context.write() can be called many times
        // this is default "identity mapper" implementation
        context.write((KEYOUT) key, (VALUEOUT) value);
      }
    }

(the shuffle)
map output is assigned to a “reducer”
map output is sorted by key
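The two shuffle steps above can be sketched in a single JVM. This is a toy illustration, not the framework's implementation: the class `ShuffleSketch` and its methods are hypothetical names, and the real shuffle moves data between machines. The partition formula (non-negative hash modulo the reducer count) matches what Hadoop's default hash partitioner does; the rest is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the shuffle; not part of Hadoop.
class ShuffleSketch {
    /** Assigns a key to a reducer: non-negative hash modulo the reducer count. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Groups map output by reducer, then by key. Each reducer's TreeMap
     * keeps its keys in sorted order, as the slide describes.
     */
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int r = partition(pair.getKey(), numReducers);
            reducers.get(r)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("pig", 1), Map.entry("hive", 1), Map.entry("pig", 1));
        List<TreeMap<String, List<Integer>>> reducers = shuffle(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

Because the partition depends only on the key, every value for a given key lands on the same reducer.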

reduce()
reduce: (K₂, iter(V₂)) → list(K₃, V₃)

    public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      /**
       * This method is called once for each key. Most applications will define
       * their reduce class by overriding this method. The default implementation
       * is an identity function.
       */
      @SuppressWarnings("unchecked")
      protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
          throws IOException, InterruptedException {
        for (VALUEIN value : values) {
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
      }
    }

Putting it together...
(Figure: Logical view and Physical view)
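To make the logical view concrete, here is a word count run through map, shuffle, and reduce inside one process. It is a minimal sketch that deliberately avoids the Hadoop API so it runs anywhere; `WordCountSketch`, `run`, and the sample lines are assumptions made for illustration, and a real job would subclass Mapper and Reducer instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the M/R model; not the Hadoop API.
class WordCountSketch {
    /** map: (lineNumber, line) -> list of (word, 1). */
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** reduce: (word, iter(counts)) -> total count for that word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs map over every line, groups by key in sorted order, then reduces. */
    static TreeMap<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines.get(i))) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Data is everywhere", "Data is important")));
        // prints {data=2, everywhere=1, important=1, is=2}
    }
}
```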

Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCRing many images.
Learning ML models. (e.g., Naive Bayes for text classification)
Augment traditional BI/DW technologies (by archiving raw data).
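The first sample, an inverted index, fits the model like this: map emits (word, document ID) for every word, and reduce collects the document IDs seen for each word. The sketch below folds those phases into one in-memory loop; `InvertedIndexSketch`, `build`, and the document IDs are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index in the map/reduce style; not the Hadoop API.
class InvertedIndexSketch {
    /** Takes docId -> text and returns word -> sorted set of docIds containing it. */
    static TreeMap<String, TreeSet<String>> build(Map<String, String> docs) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "map": emit (word, docId) for each word in the document
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // "shuffle" and "reduce": gather docIds per word; the set drops repeats
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "d1", "Hadoop stores data",
                "d2", "Hadoop processes data");
        System.out.println(build(docs));
    }
}
```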

There’s more than the Java API
Streaming: perl, python, ruby, whatever. stdin/stdout/stderr.
Pig: Higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
Hive: SQL interface. Great for analysts. Developed at Facebook.
Friday, @10:10
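Streaming works with any executable that reads records on stdin and writes them on stdout. By default each output line is a key and a value separated by a tab. The sketch below is a word-count mapper in that shape. It is written in Java only to match the other examples here, since Streaming exists mainly for other languages; `StreamingMapperSketch` and `mapLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical Streaming-style mapper: lines in on stdin, "key<TAB>value" lines out.
class StreamingMapperSketch {
    /** Turns one input line into zero or more "word<TAB>1" output lines. */
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.append(word).append('\t').append(1).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(mapLine(line));
        }
    }
}
```

A matching reducer would read the sorted "word<TAB>1" lines on its own stdin and sum the values for each run of identical keys.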

A typical look...
Commodity servers (8-core, 8-16GB RAM, 4-12 TB, 2x1 GbE NIC)
2-level network architecture
20-40 nodes per rack

The cast...
Starring...
NameNode (metadata server and database)
SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)
The Chorus…
TaskTrackers (task execution)
DataNodes (block storage)
Thanks to Zak Stone for earmuff image!

HDFS
(Figure: a Namenode and Datanodes spread across "One Rack" and "A Different Rack", holding a 3x64MB file with 3 replicas, a 4x64MB file with 3 replicas, and a small file with 7 replicas)
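The arithmetic behind the figure is straightforward: a file is cut into fixed-size blocks, and each block is stored once per replica. Here is a small sketch of that calculation, assuming the 64MB block size shown on the slide; `BlockMathSketch` and its method names are hypothetical.

```java
// Hypothetical helper for HDFS block and replica arithmetic; not part of Hadoop.
class BlockMathSketch {
    static final long MB = 1024L * 1024L;

    /** Number of blocks needed for a file: size divided by block size, rounded up. */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Total block replicas stored across the cluster for one file. */
    static long blockReplicas(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    /** Raw bytes consumed across the cluster; a short last block uses only its own size. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        System.out.println(blockReplicas(3 * block, block, 3)); // 3 blocks x 3 replicas = 9
        System.out.println(blockReplicas(4 * block, block, 3)); // 4 blocks x 3 replicas = 12
        System.out.println(blockReplicas(1 * MB, block, 7));    // 1 block x 7 replicas = 7
    }
}
```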

HDFS Write Path

HDFS Failures?
Datanode crash?
  Clients read another copy
  Background rebalance
Namenode crash?
  uh-oh

M/R
Tasktrackers on the same machines as datanodes
(Figure: two racks, "One Rack" and "A Different Rack"; legend: Job on stars, Different job, Idle)

M/R

M/R Failures
Task fails
  Try again?
  Try again somewhere else?
  Report failure
Retries possible because of idempotence
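The retry policy above can be sketched as a loop that re-runs the same task on the next node until one attempt succeeds or the attempt budget runs out. This is only safe because the task is idempotent: running it again on the same input yields the same output. `TaskRetrySketch`, `runWithRetries`, and the node names are hypothetical, and the real scheduler is considerably more involved.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical retry loop illustrating "try again somewhere else"; not the JobTracker.
class TaskRetrySketch {
    /**
     * Runs the task on successive nodes (wrapping around) until an attempt
     * succeeds. Reports failure only after maxAttempts failed attempts.
     */
    static <T> T runWithRetries(List<String> nodes, int maxAttempts,
                                Function<String, T> task) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String node = nodes.get(attempt % nodes.size());
            try {
                return task.apply(node); // idempotent: safe to re-run after a failure
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next node
            }
        }
        throw new IllegalStateException(
                "task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        Integer result = TaskRetrySketch.<Integer>runWithRetries(
                List.of("node-a", "node-b"), 4, node -> {
                    if (node.equals("node-a")) {
                        throw new RuntimeException("node-a crashed");
                    }
                    return 42;
                });
        System.out.println(result); // prints 42 (second attempt, on node-b)
    }
}
```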

Hadoop in the Wild
Yahoo! Hadoop clusters: > 82 PB, > 25k machines (Eric14, HadoopWorld NYC ’09)
Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
Facebook: 4 TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)

The Hadoop Ecosystem
ETL Tools, BI Reporting, RDBMS
Hive (SQL)
Sqoop
Pig (Data Flow)
ZooKeeper (Coordination)
Avro (Serialization)
MapReduce (Job Scheduling/Execution System)
HBase (Key-Value store)
HDFS (Hadoop Distributed File System)

Ok, fine, what next?
Get Hadoop!
  http://hadoop.apache.org/
  Cloudera Distribution for Hadoop
Try it out! (Locally, or on EC2)

Just one slide...
Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more…
Training and certification…
Free on-line training materials (including video)
Support & Professional Services
@cloudera, blog, etc.

Questions?
philip@cloudera.com
