slide-1
SLIDE 1

TIE-22306 Data-intensive Programming

  • Dr. Timo Aaltonen

Department of Pervasive Computing

slide-2
SLIDE 2

Data-Intensive Programming

  • Lecturer: Timo Aaltonen

– timo.aaltonen@tut.fi

  • Assistants

– Adnan Mushtaq
– MSc Antti Luoto
– MSc Antti Kallonen

slide-3
SLIDE 3

Lecturer

  • University Lecturer
  • Doctoral degree in Software Engineering,

TUT, 2005

  • Work history

– Various positions, TUT, 1995–2010
– Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012
– University lecturer, TUT

slide-4
SLIDE 4

Working at the course

  • Lectures on Fridays
  • Weekly exercises

– beginning from week #2

  • Course work

– announced next Friday

  • Communication

– http://www.cs.tut.fi/~dip/

  • Exam
slide-5
SLIDE 5

Weekly Exercises

  • Linux class TC217
  • In the beginning of the course: hands-on

training

  • In the end of the course: a reception for

problems with the course work

  • Enrolment is open
  • Not compulsory, no credit points
  • Two more instances will be added
slide-6
SLIDE 6

Course Work

  • Using Hadoop tools and framework to solve

typical Big Data problem (in Java)

  • Groups of three
  • Hardware

– Your own laptop with self-installed Hadoop
– Your own laptop with VirtualBox 5.1 and an Ubuntu VM
– A TUT virtual machine

slide-7
SLIDE 7

Exam

  • Electronic exam after the course
  • Tests understanding rather than exact syntax
  • ”Use pseudocode to write a MapReduce

program which …”

  • General questions on Hadoop and related

technologies

slide-8
SLIDE 8

Today

  • Big data
  • Data Science
  • Hadoop
  • HDFS
  • Apache Flume
slide-9
SLIDE 9

1: Big Data

  • World is drowning in data

– clickstream data is collected by web servers
– NYSE generates 1 TB of trade data every day
– MTC collects 5,000 attributes for each call
– smart marketers collect purchasing habits

  • “More data usually beats better algorithms”
slide-10
SLIDE 10

Three Vs of Big Data

  • Volume: amount of data

– Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to- machine data

  • Velocity: speed of data in and out

– streaming data from RFID, sensors, …

  • Variety: range of data types and sources

– structured, unstructured

slide-11
SLIDE 11

Big Data

  • Variability

– Data flows can be highly inconsistent with periodic peaks

  • Complexity

– Data comes from multiple sources
– linking, matching, cleansing and transforming data across systems is a complex task

slide-12
SLIDE 12

Data Science

  • Definition: Data science is an activity that

extracts insights from messy data

  • Facebook analyzes location data

– to identify global migration patterns
– to find out the fanbases of different sports teams

  • A retailer might track purchases both online

and in-store to target marketing

slide-13
SLIDE 13

Data Science

slide-14
SLIDE 14

New Challenges

  • Compute-intensiveness

– raw computing power

  • Challenges of data intensiveness

– amount of data
– complexity of data
– speed at which data is changing

slide-15
SLIDE 15

Data Storage and Analysis

  • Hard drive from 1990

– stores 1,370 MB
– transfer speed 4.4 MB/s

  • Hard drive from the 2010s

– stores 1 TB
– transfer speed 100 MB/s
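The gap between capacity and transfer speed is the point here: capacity grew roughly 1000×, transfer speed only about 25×, so reading a whole drive went from minutes to hours. A back-of-the-envelope check in Python (the figures are from the slide; the helper function is purely illustrative):

```python
def read_time_hours(capacity_mb: float, speed_mb_s: float) -> float:
    """Time to stream an entire drive end to end, in hours."""
    return capacity_mb / speed_mb_s / 3600

# 1990: 1,370 MB at 4.4 MB/s -> roughly 5 minutes to read the whole drive
t_1990_min = read_time_hours(1370, 4.4) * 60
# 2010s: 1 TB at 100 MB/s -> roughly 2.8 hours
t_2010_h = read_time_hours(1_000_000, 100)
print(round(t_1990_min, 1), "minutes vs", round(t_2010_h, 1), "hours")
```

This is exactly the motivation for the parallel, multi-disk approach a few slides later.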

slide-16
SLIDE 16

Scalability

  • Grows without requiring developers to

re-architect their algorithms/application

  • Horizontal scaling
  • Vertical scaling
slide-17
SLIDE 17

Parallel Approach

  • Reading from multiple disks in parallel

– 100 drives having 1/100 of the data => 1/100 reading time

  • Problem: Hardware failures

– replication

  • Problem: Most analysis tasks need to be able

to combine data in some way

– MapReduce

  • Hadoop
slide-18
SLIDE 18

2: Apache Hadoop

  • Hadoop is a framework of tools

– libraries and methodologies

  • Operates on large unstructured datasets
  • Open source (Apache License)
  • Simple programming model
  • Scalable
slide-19
SLIDE 19

Hadoop

  • A scalable fault-tolerant distributed system for

data storage and processing (open source under the Apache license)

  • Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

slide-20
SLIDE 20

Hadoop

  • Administrators

– Installation
– Monitor/Manage Systems
– Tune Systems

  • End Users

– Design MapReduce Applications
– Import and export data
– Work with various Hadoop Tools

slide-21
SLIDE 21

Hadoop

  • Developed by Doug Cutting and Michael J.

Cafarella

  • Based on Google MapReduce technology
  • Designed to handle large amounts of data and

be robust

  • Donated to Apache Foundation in 2006 by

Yahoo

slide-22
SLIDE 22

Hadoop Design Principles

  • Moving computation is cheaper than moving data
  • Hardware will fail
  • Hide execution details from the user
  • Use streaming data access
  • Use simple file system coherency model
  • Hadoop is not a replacement for SQL, not always fast

and efficient, and not meant for quick ad-hoc querying

slide-23
SLIDE 23

Hadoop MapReduce

  • MapReduce (MR) is the original programming

model for Hadoop

  • Collocate data with compute node

– data access is fast since it's local (data locality)

  • Network bandwidth is the most precious

resource in the data center

– MR implementations explicitly model the network topology

slide-24
SLIDE 24

Hadoop MapReduce

  • MR operates at a high level of abstraction

– programmer thinks in terms of functions of key and value pairs

  • MR is a shared-nothing architecture

– tasks do not depend on each other
– failed tasks can be rescheduled by the system

  • MR was introduced by Google

– used for producing search indexes
– applicable to many other problems too
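The key-value abstraction above can be sketched as a pure-Python word count that simulates the map, shuffle, and reduce phases in memory. This is a toy model of the programming idea, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(line):
    # map: one input line -> a list of (word, 1) pairs
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    # reduce: (word, [1, 1, ...]) -> (word, total count)
    return key, sum(values)

lines = ["Hadoop is a framework", "Hadoop is scalable"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # 'hadoop' and 'is' each appear twice
```

Because the map and reduce functions are side-effect free and tasks share nothing, a failed task can simply be re-run on another node — the shared-nothing property from the slide.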

slide-25
SLIDE 25

Hadoop Components

  • Hadoop Common

– A set of components and interfaces for distributed file systems and general I/O

  • Hadoop Distributed Filesystem (HDFS)
  • Hadoop YARN – a resource-management and

job-scheduling platform

  • Hadoop MapReduce

– Distributed programming model and execution environment

slide-26
SLIDE 26

Hadoop Stack Transition

slide-27
SLIDE 27

Hadoop Ecosystem

  • HBase – a scalable, distributed database that

supports structured data storage for large tables

  • Hive – a data warehouse infrastructure that

provides data summarization and ad hoc querying

  • Pig – a high-level data-flow language and

execution framework for parallel computation

  • Spark – a fast and general compute engine for

Hadoop data. Wide range of applications – ETL, Machine Learning, stream processing, and graph analytics

slide-28
SLIDE 28

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A workflow engine that enables creating a workflow of jobs composed of any of the above.

slide-29
SLIDE 29

3: Hadoop Distributed File System

  • Hadoop comes with distributed file system

called HDFS (Hadoop Distributed File System)

  • Based on Google’s GFS (Google File System)
  • HDFS provides redundant storage for massive

amounts of data

– using commodity hardware

  • Data in HDFS is distributed across all data

nodes

– Efficient for MapReduce processing

slide-30
SLIDE 30

HDFS Design

  • File system on commodity hardware

– Survives even with high failure rates of the components

  • Supports lots of large files

– File size hundreds GB or several TB

  • Main design principles

– Write once, read many times
– Streaming reads rather than frequent random access
– High throughput is more important than low latency

slide-31
SLIDE 31

HDFS Architecture

  • HDFS operates on top of existing file system
  • Files are stored as blocks (default size 128 MB,

different from file system blocks)

  • File reliability is based on block-based replication

– Each block of a file is typically replicated across several DataNodes (default replication is 3)

  • NameNode stores metadata, manages replication

and provides access to files

  • No data caching (because of large datasets), but

direct reading/streaming from DataNode to client
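A quick sketch of the block arithmetic implied above, assuming the 128 MB default block size and 3× replication (plain Python, not an HDFS API):

```python
import math

def block_count(file_size_mb: float, block_size_mb: int = 128) -> int:
    # a file occupies ceil(size / block_size) blocks;
    # the last block only takes up its actual size on disk
    return math.ceil(file_size_mb / block_size_mb)

def stored_copies(file_size_mb: float, replication: int = 3,
                  block_size_mb: int = 128) -> int:
    # total block copies held across the cluster
    return block_count(file_size_mb, block_size_mb) * replication

print(block_count(1024))    # a 1 GB file -> 8 blocks
print(stored_copies(1024))  # 24 block copies cluster-wide at replication 3
```

Note that even a 1 MB file still consumes one block's worth of NameNode metadata, which is why HDFS handles lots of small files poorly (see the Conclusions slide).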

slide-32
SLIDE 32

HDFS Architecture

  • NameNode stores HDFS metadata

– filenames, locations of blocks, file attributes
– Metadata is kept in RAM for fast lookups

  • The number of files in HDFS is limited by the

amount of available RAM in the NameNode

– HDFS NameNode federation can help in RAM issues: several NameNodes, each of which manages a portion of the file system namespace

slide-33
SLIDE 33

HDFS Architecture

  • DataNode stores file contents as blocks

– Different blocks of the same file are stored on different DataNodes
– Same block is typically replicated across several DataNodes for redundancy
– Periodically sends report of all existing blocks to the NameNode
– DataNodes exchange heartbeats with the NameNode

slide-34
SLIDE 34

HDFS Architecture

  • Built-in protection against DataNode failure
  • If NameNode does not receive any heartbeat

from a DataNode within certain time period, DataNode is assumed to be lost

  • In case of failing DataNode, block replication is

actively maintained

– NameNode determines which blocks were on the lost DataNode
– The NameNode finds other copies of these lost blocks and replicates them to other nodes
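The re-replication bookkeeping can be modeled as: given a map of blocks to the DataNodes holding them, find the blocks that drop below the replication target when a node is lost. A toy in-memory sketch — the node names and cluster state here are hypothetical, not real NameNode data structures:

```python
def under_replicated(block_locations, lost_node, target=3):
    """Blocks whose live replica count falls below `target`
    once `lost_node` stops sending heartbeats."""
    result = {}
    for block, nodes in block_locations.items():
        survivors = [n for n in nodes if n != lost_node]
        if len(survivors) < target:
            # the NameNode would pick a surviving copy and
            # schedule a new replica on another DataNode
            result[block] = survivors
    return result

# hypothetical cluster state: block id -> DataNodes holding a replica
state = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn4", "dn5"],
}
# dn2 misses its heartbeats and is declared lost:
print(under_replicated(state, "dn2"))  # both blocks need a new replica
```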

slide-35
SLIDE 35

HDFS

  • HDFS Federation

– Multiple NameNode servers
– Multiple namespaces

  • High Availability – redundant NameNodes
  • Heterogeneous Storage and Archival Storage

– storage types: ARCHIVE, DISK, SSD, RAM_DISK

slide-36
SLIDE 36

High-Availability (HA) Issues: NameNode Failure

  • NameNode failure corresponds to losing all

files on a file system

% sudo rm --dont-do-this /

  • For recovery, Hadoop provides two options

– Backup files that make up the persistent state of the file system
– Secondary NameNode

  • Also some more advanced techniques exist
slide-37
SLIDE 37

HA Issues: the secondary NameNode

  • The secondary NameNode is not a mirrored NameNode
  • Performs memory-intensive administrative functions

– NameNode keeps metadata in memory and writes changes to an edit log
– The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large

  • Keeps a copy of the merged namespace image, which

can be used in the event of the NameNode failure

slide-38
SLIDE 38

Network Topology

  • HDFS is aware how close two nodes are in the

network

  • From closer to further

0: Processes in the same node
2: Different nodes in the same rack
4: Nodes in different racks in the same data center
6: Nodes in different data centers
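The distance values above can be expressed as a small function. Representing a node as a (data center, rack, host) tuple is an assumption of this sketch, not Hadoop's actual topology API:

```python
def distance(node_a, node_b):
    """Topology distance between two nodes, as on the slide.
    Nodes are (datacenter, rack, host) tuples (illustrative model)."""
    dc_a, rack_a, host_a = node_a
    dc_b, rack_b, host_b = node_b
    if dc_a != dc_b:
        return 6  # different data centers
    if rack_a != rack_b:
        return 4  # different racks, same data center
    if host_a != host_b:
        return 2  # different hosts, same rack
    return 0      # processes on the same host

print(distance(("dc1", "r1", "h1"), ("dc1", "r2", "h9")))  # 4
```

The scheduler prefers the smallest distance, which is why placing computation on the node (or at least the rack) that already holds the data block saves the scarcest resource, network bandwidth.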

slide-39
SLIDE 39

Network Topology

slide-40
SLIDE 40

File Block Placement

  • Clients always read from the closest node
  • Default placement strategy

– One replica in the same local node as the client
– Second replica in a different rack
– Third replica in a different, randomly selected node in the same rack as the second replica

  • Additional replicas (beyond three) are placed on random nodes
slide-41
SLIDE 41

Balancing

  • Hadoop works best when blocks are evenly

spread out

  • Support for DataNodes of different size

– In the optimal case the disk usage percentage is at approximately the same level on all DataNodes

  • Hadoop provides balancer daemon

– Re-distributes blocks
– Should be run when new DataNodes are added

slide-42
SLIDE 42

Running Hadoop

  • Three configurations

– standalone
– pseudo-distributed
– fully-distributed
– https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

slide-43
SLIDE 43

Configuring HDFS

  • Variable HADOOP_CONF_DIR defines the

directory for the Hadoop configuration files

  • core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>

slide-44
SLIDE 44
  • hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/NN/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/NN/hadoop/datanode</value>
  </property>
</configuration>

slide-45
SLIDE 45

Accessing Data

  • Data can be accessed using various methods

– Java API
– C API
– Command line / POSIX (FUSE mount)
– Command line / HDFS client: Demo
– HTTP
– Various tools

slide-46
SLIDE 46

HDFS URI

  • All HDFS (CLI) commands take path URIs as

arguments

  • URI example

– hdfs://localhost:9000/user/hduser/log-data/file1.log

  • The scheme and authority are optional

– /user/hduser/log-data/file1.log

  • Home directory

– log-data/file1.log
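A sketch of how the three path forms above resolve to the same kind of full URI. The default filesystem and home directory below are assumptions matching the slide's example; the real HDFS client derives them from its configuration (`fs.defaultFS`) and the current user:

```python
def resolve(path, default_fs="hdfs://localhost:9000", home="/user/hduser"):
    """Expand an HDFS path the way the CLI does (illustrative sketch):
    full URIs pass through, absolute paths get the default
    scheme/authority, relative paths resolve against the home dir."""
    if "://" in path:
        return path                       # already a full URI
    if path.startswith("/"):
        return default_fs + path          # absolute: prepend scheme+authority
    return f"{default_fs}{home}/{path}"   # relative: resolve against home

print(resolve("log-data/file1.log"))
# hdfs://localhost:9000/user/hduser/log-data/file1.log
```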

slide-47
SLIDE 47

RDBMS vs HDFS

  • Schema-on-Write (RDBMS)

– Schema must be created before any data can be loaded
– An explicit load operation transforms data to the DB-internal structure
– New columns must be added explicitly before new data for such columns can be loaded into the DB

  • Schema-on-Read (HDFS)

– Data is simply copied to the file store, no transformation is needed
– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
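The schema-on-read idea can be illustrated with a toy "SerDe" that applies a column list only at read time; the file contents and column names below are invented for the example:

```python
def csv_serde(line, columns):
    """A toy 'SerDe': raw lines sit on disk untouched, and the
    schema is applied only when the line is read (late binding)."""
    return dict(zip(columns, line.split(",")))

raw = "2016-08-24,order1,42.50"       # stored as-is, no load step
schema_v1 = ["date", "order_id", "total"]
print(csv_serde(raw, schema_v1))

# adding a column later means updating only the reader, not the files:
raw2 = "2016-08-25,order2,13.00,web"
schema_v2 = schema_v1 + ["channel"]
print(csv_serde(raw2, schema_v2))
```

This is the contrast with schema-on-write: in an RDBMS the second record could not even be loaded until the table definition changed.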

slide-48
SLIDE 48

Conclusions

  • Pros

– Support for very large files
– Designed for streaming data
– Commodity hardware

  • Cons

– Not designed for low-latency data access
– Architecture does not support lots of small files
– No support for multiple writers / arbitrary file modifications (writes always at the end of the file)

slide-49
SLIDE 49

Reading data

slide-50
SLIDE 50

Flume

slide-51
SLIDE 51

4: Data Modeling

  • HDFS is a Schema-on-read system

– allows storing all of your raw data

  • Still, the following must be considered

– Data storage formats
– Multitenancy
– Schema design
– Metadata management

slide-52
SLIDE 52

Data Storage Options

  • No standard data storage format

– Hadoop allows storing of data in any format

  • Major considerations for data storage include

– File format (e.g. plain text, SequenceFile or more complex but more functionally rich options, such as Avro and Parquet)
– Compression (splittability)
– Data storage system (HDFS, HBase, Hive, Impala)

slide-53
SLIDE 53

File Formats: Text File

  • Common use case: web logs and server logs

– comes in many formats

  • Organization of the files in the filesystem
  • Text files consume space -> compression
  • Overhead for conversion (‘123’ ->123)
  • Structured text data

– XML and JSON present challenges to Hadoop

  • hard to split

– Dedicated libraries exist

slide-54
SLIDE 54

File Formats: Binary Data

  • Hadoop can be used to process binary files

– e.g. images

  • Container format is preferred

– e.g. SequenceFile

  • If the splittable unit of binary data is larger

than 64 MB, you may consider putting the data in its own file, without using a container format

slide-55
SLIDE 55

Hadoop File Types

  • Hadoop-specific file formats are specifically created to

work well with MapReduce

– file-based data structures such as sequence files,
– serialization formats like Avro, and
– columnar formats such as RCFile and Parquet

  • Splittable compression

– These formats support common compression formats and are also splittable

  • Agnostic compression

– the codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec

slide-56
SLIDE 56

File-Based Data Structures

  • SequenceFile format is one of the most

commonly used file-based formats in Hadoop

– other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …
– stores data as binary key-value pairs
– three formats available for records: uncompressed, record-compressed, block-compressed

slide-57
SLIDE 57

Sequence File

  • Header metadata

– compression codec, key and value class names, user-defined metadata, randomly generated sync marker

  • Often used as a container for

smaller files

slide-58
SLIDE 58

Compression

  • Also for speeding up MapReduce

– Not only for reducing storage requirements

  • Compression must be splittable

– MapReduce framework splits data for input to multiple tasks

slide-59
SLIDE 59

HDFS Schema Design

  • Hadoop is often a data hub for the entire

organization

– data is shared by many departments and teams

  • Carefully structured and organized repository has

several benefits

– standard directory structure makes it easier to share data between teams
– allows for enforcing access rights and quota
– conventions regarding e.g. staging data lead to fewer errors
– code reuse
– Hadoop tools make assumptions about the data placement

slide-60
SLIDE 60

Recommended Locations of Files

  • /user/<username>

– data, JARs, and config files of a specific user

  • /etl

– data in all phases of an ETL workflow
– /etl/<group>/<application>/<process>/{input, processing, output, bad}

  • /tmp

– temporary data

slide-61
SLIDE 61

Recommended Locations of Files

  • /data

– datasets shared across the organization
– data is written by automated ETL processes
– read-only for users
– subdirectories for each data set

  • /app

– JARs, Oozie workflow definitions, Hive HQL files, …
– /app/<group>/<application>/<version>/<artifact directory>/<artifact>

slide-62
SLIDE 62

Recommended Locations of Files

  • /metadata

– the metadata required by some tools

slide-63
SLIDE 63

Partitioning

  • HDFS has no indexes

– pro: fast to ingest data
– con: might lead to a full table scan (FTS), even when only a portion of the data is needed

  • Solution: break the data set into smaller subsets

(partitions)

– an HDFS subdirectory for each partition
– allows queries to read only the specific partitions

slide-64
SLIDE 64

Partitioning: Example

  • Assume data sets for all orders for various

pharmacies

  • Without partitioning, checking the order history

for just one physician over the past three months leads to a full table scan

  • medication_orders/date=20160824/{order1.csv, order2.csv}

– only 90 directories must be scanned
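The 90-directory claim can be checked with a small script that generates the partition paths a 90-day query would have to list; the directory naming follows the slide's example:

```python
from datetime import date, timedelta

def partitions_to_scan(end, days):
    """Date-partitioned layout medication_orders/date=YYYYMMDD/:
    a 90-day query lists only 90 directories instead of scanning
    the whole data set (illustrative sketch)."""
    return [f"medication_orders/date={(end - timedelta(d)):%Y%m%d}"
            for d in range(days)]

dirs = partitions_to_scan(date(2016, 8, 24), 90)
print(len(dirs), dirs[0])  # 90 medication_orders/date=20160824
```

A query engine that understands the layout prunes every partition outside the date range before it reads a single file.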

slide-65
SLIDE 65

5: Data Movement

  • File system client for simple usage
  • Common data sources for Hadoop include

– traditional data management systems such as relational databases and mainframes
– logs, machine-generated data, and other forms of event data
– files being imported from existing enterprise data storage systems

slide-66
SLIDE 66

Data Movement: Considerations

  • Timeliness of data ingestion and accessibility

– What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?

  • Incremental updates

– How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?

slide-67
SLIDE 67

Data Movement: Considerations

  • Data access and processing

– Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?

  • Source system and data structure

– Where is the data coming from? A relational database? Logs? Is it structured, semistructured, or unstructured data?
slide-68
SLIDE 68

Data Movement: Considerations

  • Partitioning and splitting of data

– How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?

  • Storage format

– What format will the data be stored in?

  • Data transformation

– Does the data need to be transformed in flight?

slide-69
SLIDE 69

Timeliness of Data Ingestion

  • Time lag from when data is available for ingestion to when it’s

accessible in Hadoop

  • Classifications of ingestion requirements:
  • Macro batch

– anything over 15 minutes to hours, or even a daily job.

  • Micro batch

– fired off every 2 minutes or so, but no more than 15 minutes in total.

  • Near-Real-Time Decision Support

– “immediately actionable” by the recipient of the information – delivered in less than 2 minutes but greater than 2 seconds.

  • Near-Real-Time Event Processing

– under 2 seconds, and can be as fast as a 100-millisecond range.

  • Real Time

– anything under 100 milliseconds.
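The latency classes above can be captured in one function; the exact boundary handling (open vs. closed intervals) is a judgment call in this sketch:

```python
def ingestion_class(latency_seconds):
    """Map an ingestion-latency requirement to the slide's
    categories (boundary choices are this sketch's assumption)."""
    if latency_seconds < 0.1:
        return "real time"                       # under 100 ms
    if latency_seconds < 2:
        return "near-real-time event processing"  # under 2 s
    if latency_seconds < 120:
        return "near-real-time decision support"  # 2 s to 2 min
    if latency_seconds <= 15 * 60:
        return "micro batch"                      # up to 15 min
    return "macro batch"                          # anything slower

print(ingestion_class(60))    # near-real-time decision support
print(ingestion_class(3600))  # macro batch
```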

slide-70
SLIDE 70

Incremental Updates

  • Data is either appended to an existing data set or

it is modified

– HDFS works fine for append only implementations.

  • The downside to HDFS is the inability to do

appends or random writes to files after they’re created

  • HDFS is optimized for large files

– If the requirements call for a two-minute append process that ends up producing lots of small files, then a periodic process to combine smaller files will be required to get the benefits from larger files

slide-71
SLIDE 71

Original Source System and Data Structure

  • Original file type

– any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …

  • Hadoop can accept any file format

– not all formats are optimal for particular use cases
– not all file formats can work with all tools in the Hadoop ecosystem, example: variable-length files

slide-72
SLIDE 72

Compression

  • Pro

– transferring a compressed file over the network requires less I/O and network bandwidth

  • Con

– most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)

slide-73
SLIDE 73

Misc

  • RDBMS

– Tool: Sqoop

  • Streaming Data

– Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server
– Tools: Flume or Kafka

  • Logfiles

– an anti-pattern is to read the logfiles from disk as they are written, because this is almost impossible to implement without losing data
– the correct way of ingesting logfiles is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead

slide-74
SLIDE 74

Transformations

  • modifications on incoming data, distributing

the data into partitions or buckets, sending the data to more than one store or location

– Transformation: XML or JSON is converted to delimited data
– Partitioning: incoming data is stock trade data and partitioning by ticker is required
– Splitting: the data needs to land in HDFS and HBase for different access patterns
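The stock-trade partitioning case can be sketched as routing incoming records into per-ticker buckets in memory; the record fields and ticker symbols below are invented for the example:

```python
def partition_by_ticker(trades):
    """Route incoming trade records into per-ticker buckets,
    a toy in-memory version of partitioning during ingest."""
    buckets = {}
    for trade in trades:
        buckets.setdefault(trade["ticker"], []).append(trade)
    return buckets

incoming = [
    {"ticker": "AAPL", "price": 108.0},
    {"ticker": "NOK",  "price": 5.1},
    {"ticker": "AAPL", "price": 108.2},
]
out = partition_by_ticker(incoming)
print(sorted(out))       # the bucket keys, one per ticker
print(len(out["AAPL"]))  # 2
```

In a real pipeline each bucket would map to a partition directory (or an HBase region), so downstream queries for one ticker touch only that partition.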

slide-75
SLIDE 75

Data Ingestion Options

  • File transfers
  • Tools like Flume, Sqoop, and Kafka