Hadoop @ SURFsara USING THE CLUSTER Jeroen Schot

Hadoop @ SURFsara USING THE CLUSTER Jeroen Schot <jeroen.schot@surfsara.nl>

Overview
- SURFsara in a nutshell
- The SURFsara Hadoop cluster
- How to use the cluster

About SURF

SARA  SURFsara -Founded in 1971 as SARA by UvA, VU and CWI as shared data-processing facility. -Starting 1984 took the role as national supercomputing center. -Became an independent foundation in 1995. -Joined the SURF foundation as SURFsara in 2013.

Hadoop cluster
- Started a test cluster in 2011 on six old machines.
- Real cluster in 2012: 60 machines running Hadoop 0.20, MapReduce, Pig.
- Now about 100 machines: Hadoop 2.6, MapReduce, Pig, Spark, Giraph, Tez, Cascading.

Hadoop 1.0

Hadoop 2.0: YARN
No longer just MapReduce:
- One ResourceManager (roughly the old JobTracker).
- Many NodeManagers (roughly the old TaskTrackers).
- Job coordination is done by an ApplicationMaster, one per job. This used to be done by the JobTracker.

Using the cluster
- Command line: from your own computer or our login node
- ResourceManager web interface
- Develop in your favorite IDE (Eclipse, IntelliJ)
- Package your jobs as jar files
- Submit the jar file using ‘hadoop jar’ or ‘yarn jar’

Dependency management
Your code probably depends on libraries. These libraries need to be available on the cluster machines. Multiple options:
1. Specify them on the command line: yarn jar myjar.jar -libjars foo.jar,bar.jar
2. Bundle the jars inside the lib folder of your jar.
3. Extract all dependency class files into your jar (Maven Shade plugin).
Build tools like Maven, Ivy and Ant can help you with this. Example Maven POM file (using method 2): http://beehub.nl/surfsara-hadoop/public/lsde-pom.xml
You don’t need to include the Hadoop/MapReduce dependencies.
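For method 1, the comma-separated jar list can be built from a directory instead of typed by hand. The following is a minimal sketch, not part of the original slides: the function name libjars_list, the lib directory and the jar name myjob.jar are hypothetical. Note that -libjars is one of Hadoop's generic options, so it only takes effect when the job's main class parses them (for example by running through ToolRunner).

```shell
# Join every *.jar file in a directory into the comma-separated form
# that the -libjars option expects.
# libjars_list is a hypothetical helper name, not a Hadoop command.
libjars_list() {
  local dir="$1" list="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue     # no jars present: the glob stayed literal
    list="${list:+$list,}$jar"    # add a comma only between entries
  done
  printf '%s\n' "$list"
}

# Example use on the login node (myjob.jar and lib/ are placeholders):
#   yarn jar myjob.jar -libjars "$(libjars_list lib)" input output
```

Files in the directory that do not end in .jar are ignored, and an empty directory yields an empty string.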

Step 1 – Login node
Access via SSH (replace lsdeXX with your username):
ssh lsdeXX@login.hathi.surfsara.nl
Optionally enable X forwarding for graphical applications:
ssh -X lsdeXX@login.hathi.surfsara.nl

Step 2 – Interacting with HDFS
Use the ‘hdfs dfs’ command to access the distributed filesystem. Some common commands include:
hdfs dfs -ls dir                      list contents of ‘dir’
hdfs dfs -rm file                     remove file
hdfs dfs -cat file                    print file
hdfs dfs -copyFromLocal src dest      copy src on login node to dest on HDFS
hdfs dfs -copyToLocal src dest        copy src on HDFS to dest on login node
The full list can be found at http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
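The commands above combine into a simple round trip: upload a file, check that it arrived, and copy it back. This is a sketch only. The function name hdfs_roundtrip and the example paths are hypothetical, and it must run somewhere the hdfs command is available, such as the login node. Relative HDFS paths are resolved against your HDFS home directory.

```shell
# Upload a local file to an HDFS directory, list the directory,
# then copy the file back to a local path.
# hdfs_roundtrip is a hypothetical helper name.
hdfs_roundtrip() {
  local src="$1" hdfs_dir="$2" out="$3"
  local name
  name=$(basename "$src")
  hdfs dfs -mkdir -p "$hdfs_dir" &&               # create target dir if missing
  hdfs dfs -copyFromLocal "$src" "$hdfs_dir/" &&  # login node -> HDFS
  hdfs dfs -ls "$hdfs_dir" &&                     # confirm the upload
  hdfs dfs -copyToLocal "$hdfs_dir/$name" "$out"  # HDFS -> login node
}

# Example use (placeholder paths):
#   hdfs_roundtrip data.txt demo data-copy.txt
```

Each step only runs if the previous one succeeded, so a failed upload does not lead to a confusing error from the download.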

Step 3 – Submitting jobs
(MapReduce) jobs can be submitted using the ‘yarn jar’ command. This runs one of the standard jobs bundled with the Hadoop framework:
yarn jar /usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 10
Generally, you build your jar file on your desktop, use scp to copy it to the login node, and run:
yarn jar JARFILE MAINCLASS ARGUMENTS
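The copy-and-submit routine can be wrapped in one function that you run on your desktop. This is a sketch under assumptions: the function name submit_job, the LOGIN_HOST variable, and the jar and class names in the example are hypothetical. Arguments are passed through ssh unquoted, so arguments containing spaces would need extra quoting.

```shell
# Login node address; lsdeXX is a placeholder for your own username.
LOGIN_HOST="${LOGIN_HOST:-lsdeXX@login.hathi.surfsara.nl}"

# Copy a jar to your home directory on the login node and submit it there.
# submit_job is a hypothetical helper name.
submit_job() {
  local jar="$1" main_class="$2"
  shift 2                          # remaining arguments go to the job
  scp "$jar" "$LOGIN_HOST:" &&
  ssh "$LOGIN_HOST" yarn jar "$(basename "$jar")" "$main_class" "$@"
}

# Example use (placeholder jar, class and paths):
#   submit_job target/wordcount.jar org.example.WordCount input output
```

The job is only submitted when the copy succeeds, which avoids accidentally running a stale jar left over from an earlier upload.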

Step 4 – ResourceManager web interface
You can follow the progress of your job and read the log files of individual processes on the web interface of the ResourceManager. It can be accessed via ‘firefox’ started on the login node (X forwarding needed). You will need to change one setting in Firefox, see https://surfsara.nl/systems/hadoop/usage
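The same information is also reachable from the command line with the standard yarn client, which can help when X forwarding is slow. This is a sketch, not part of the original slides: the function name job_report and the application ID are hypothetical, and ‘yarn logs’ only returns output once the application has finished and only if log aggregation is enabled on the cluster, which is assumed here.

```shell
# Show the status of an application, then print its aggregated logs.
# job_report is a hypothetical helper name. The application ID is
# printed by 'yarn jar' when the job is submitted.
job_report() {
  local app_id="$1"
  yarn application -status "$app_id" &&
  yarn logs -applicationId "$app_id"
}

# List the applications that are currently known to the ResourceManager:
#   yarn application -list
# Example use (placeholder ID):
#   job_report application_1400000000000_0001
```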

Need help? Problems using the SURFsara Hadoop cluster? Contact either your course instructors or hadoop.support@surfsara.nl
