  1. TIE-22306 Data-intensive Programming Dr. Timo Aaltonen Department of Pervasive Computing

  2. Data-Intensive Programming • Lecturer: Timo Aaltonen – timo.aaltonen@tut.fi • Assistants – Adnan Mushtaq – MSc Antti Luoto – MSc Antti Kallonen

  3. Lecturer • University Lecturer • Doctoral degree in Software Engineering, TUT, 2005 • Work history – Various positions, TUT, 1995–2010 – Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012 – University Lecturer, TUT

  4. Working on the Course • Lectures on Fridays • Weekly exercises – beginning from week #2 • Course work – announced next Friday • Communication – http://www.cs.tut.fi/~dip/ • Exam

  5. Weekly Exercises • Linux class TC217 • At the beginning of the course: hands-on training • At the end of the course: office hours for problems with the course work • Enrolment is open • Not compulsory, no credit points • Two more instances will be added

  6. Course Work • Using Hadoop tools and frameworks to solve a typical Big Data problem (in Java) • Groups of three • Hardware – Your own laptop with self-installed Hadoop – Your own laptop with VirtualBox 5.1 and an Ubuntu VM – A TUT virtual machine

  7. Exam • Electronic exam after the course • Tests understanding rather than exact syntax • “Use pseudocode to write a MapReduce program which …” • General questions on Hadoop and related technologies

  8. Today • Big data • Data Science • Hadoop • HDFS • Apache Flume

  9. 1: Big Data • The world is drowning in data – click-stream data is collected by web servers – the NYSE generates 1 TB of trade data every day – MTC collects 5,000 attributes for each call – smart marketers collect purchasing habits • “More data usually beats better algorithms”

  10. Three Vs of Big Data • Volume: amount of data – Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data • Velocity: speed of data in and out – streaming data from RFID, sensors, … • Variety: range of data types and sources – structured, unstructured

  11. Big Data • Variability – Data flows can be highly inconsistent, with periodic peaks • Complexity – Data comes from multiple sources – Linking, matching, cleansing, and transforming data across systems is a complex task

  12. Data Science • Definition: Data science is an activity that extracts insights from messy data • Facebook analyzes location data – to identify global migration patterns – to find the fan bases of different sports teams • A retailer might track purchases both online and in-store to enable targeted marketing

  13. Data Science

  14. New Challenges • Compute-intensiveness – raw computing power • Challenges of data-intensiveness – amount of data – complexity of data – speed at which data is changing

  15. Data Storage Analysis • Hard drive from 1990 – stores 1,370 MB – speed 4.4 MB/s • Hard drive from the 2010s – stores 1 TB – speed 100 MB/s • Reading the whole drive thus took about five minutes in 1990 but takes almost three hours today: capacity has grown far faster than transfer speed

  16. Scalability • Grows without requiring developers to re-architect their algorithms/application • Horizontal scaling • Vertical scaling

  17. Parallel Approach • Reading from multiple disks in parallel – 100 drives having 1/100 of the data => 1/100 reading time • Problem: Hardware failures – replication • Problem: Most analysis tasks need to be able to combine data in some way – MapReduce • Hadoop

  18. 2: Apache Hadoop • Hadoop is a framework of tools – libraries and methodologies • Operates on large unstructured datasets • Open source (Apache License) • Simple programming model • Scalable

  19. Hadoop • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license) • Core Hadoop has two main systems: – Hadoop Distributed File System: self-healing, high-bandwidth clustered storage – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

  20. Hadoop • Administrators – Installation – Monitor/Manage Systems – Tune Systems • End Users – Design MapReduce Applications – Import and export data – Work with various Hadoop Tools

  21. Hadoop • Developed by Doug Cutting and Michael J. Cafarella • Based on Google's MapReduce technology • Designed to handle large amounts of data and to be robust • Donated to the Apache Software Foundation in 2006 by Yahoo

  22. Hadoop Design Principles • Moving computation is cheaper than moving data • Hardware will fail • Hide execution details from the user • Use streaming data access • Use a simple file-system coherency model • Hadoop is not a replacement for SQL, is not always fast and efficient, and is not meant for quick ad-hoc querying

  23. Hadoop MapReduce • MapReduce (MR) is the original programming model for Hadoop • Collocate data with the compute node – data access is fast since it is local (data locality) • Network bandwidth is the most precious resource in the data center – MR implementations explicitly model the network topology

  24. Hadoop MapReduce • MR operates at a high level of abstraction – the programmer thinks in terms of functions over key-value pairs (see the sketch below) • MR is a shared-nothing architecture – tasks do not depend on each other – failed tasks can be rescheduled by the system • MR was introduced by Google – used for producing search indexes – applicable to many other problems too
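
To make the key-value abstraction concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (the class name and the command-line input/output paths are illustrative, not part of the course material):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the combiner reuses the reducer class: map output is pre-aggregated on the map side, saving the network bandwidth that the previous slide identifies as the scarce resource.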

  25. Hadoop Components • Hadoop Common – a set of components and interfaces for distributed file systems and general I/O • Hadoop Distributed Filesystem (HDFS) • Hadoop YARN – a platform for resource management and scheduling • Hadoop MapReduce – a distributed programming model and execution environment

  26. Hadoop Stack Transition

  27. Hadoop Ecosystem • HBase – a scalable, distributed database that supports structured data storage for large tables • Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying • Pig – a high-level data-flow language and execution framework for parallel computation • Spark – a fast and general compute engine for Hadoop data with a wide range of applications – ETL, machine learning, stream processing, and graph analytics (see the sketch below)
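
For a feel of the Spark programming model, here is the same word count as in the MapReduce sketch above, written with Spark's Java API (a sketch assuming Spark 2.x or later; the application name and paths are placeholders):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);           // e.g. an HDFS path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);                          // same logic as the MR reducer
      counts.saveAsTextFile(args[1]);
    }
  }
}
```

The whole map-combine-reduce pipeline collapses into three chained transformations, which is one reason Spark has become a popular alternative to hand-written MapReduce.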

  28. Flexibility: Complex Data Processing 1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava). 4. Pig Latin: A high-level language out of Yahoo, suitable for batch data-flow workloads. 5. Hive: A SQL interpreter out of Facebook, which also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: A workflow engine that enables creating a workflow of jobs composed of any of the above.

  29. 3: Hadoop Distributed File System • Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System) • Based on Google's GFS (Google File System) • HDFS provides redundant storage for massive amounts of data – using commodity hardware • Data in HDFS is distributed across all data nodes – efficient for MapReduce processing

  30. HDFS Design • File system on commodity hardware – Survives even with high failure rates of the components • Supports lots of large files – File sizes of hundreds of GB or several TB • Main design principles – Write once, read many times – Streaming reads rather than frequent random access – High throughput is more important than low latency

  31. HDFS Architecture • HDFS operates on top of an existing file system • Files are stored as blocks (default size 128 MB, different from file-system blocks) • File reliability is based on block-based replication – Each block of a file is typically replicated across several DataNodes (default replication is 3) • The NameNode stores metadata, manages replication, and provides access to files • No data caching (because of large datasets); instead, direct reading/streaming from the DataNode to the client (see the sketch below)
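
A minimal sketch of this read path using the HDFS Java API (assuming a reachable cluster configured via the standard core-site.xml/hdfs-site.xml; the URI is a placeholder): open() asks the NameNode for the block locations, after which the bytes are streamed directly from the DataNodes.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                     // e.g. hdfs://namenode:8020/user/dip/input.txt
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // open() contacts the NameNode for block locations;
      // the data itself is streamed straight from the DataNodes
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```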

  32. HDFS Architecture • The NameNode stores HDFS metadata – filenames, locations of blocks, file attributes – Metadata is kept in RAM for fast lookups (a commonly cited rule of thumb is roughly 150 bytes of NameNode memory per file, directory, or block) • The number of files in HDFS is limited by the amount of RAM available on the NameNode – HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace

  33. HDFS Architecture • A DataNode stores file contents as blocks – Different blocks of the same file are stored on different DataNodes – The same block is typically replicated across several DataNodes for redundancy – Each DataNode periodically sends a report of all its blocks to the NameNode – DataNodes exchange heartbeats with the NameNode

  34. HDFS Architecture • Built-in protection against DataNode failure • If the NameNode does not receive a heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost • When a DataNode fails, block replication is actively maintained – The NameNode determines which blocks were on the lost DataNode – The NameNode finds other copies of these lost blocks and replicates them to other nodes
