slide-1
SLIDE 1

TIE-22306 Data-intensive Programming

  • Dr. Timo Aaltonen

Department of Pervasive Computing

slide-2
SLIDE 2

Data-Intensive Programming

  • Lecturer: Timo Aaltonen

– timo.aaltonen@tut.fi

  • Assistants

– Adnan Mushtaq
– MSc Antti Luoto
– MSc Antti Kallonen

slide-3
SLIDE 3

Lecturer

  • University Lecturer
  • Doctoral degree in Software Engineering,

TUT, 2005

  • Work history

– Various positions, TUT, 1995–2010
– Principal Researcher, System Software Engineering, Nokia Research Center, 2010–2012
– University lecturer, TUT

slide-4
SLIDE 4

Working at the course

  • Lectures on Fridays
  • Weekly exercises

– beginning from week #2

  • Course work

– announced next Friday

  • Communication

– http://www.cs.tut.fi/~dip/

  • Exam
slide-5
SLIDE 5

Weekly Exercises

  • Linux class TC217
  • In the beginning of the course: hands-on

training

  • In the end of the course: a reception for

problems with the course work

  • Enrolment is open
  • Not compulsory, no credit points
  • Two more instances will be added
slide-6
SLIDE 6

Course Work

  • Using Hadoop tools and framework to solve

typical Big Data problem (in Java)

  • Groups of three
  • Hardware

– Your own laptop with self-installed Hadoop
– Your own laptop with VirtualBox 5.1 and an Ubuntu VM
– A TUT virtual machine

slide-7
SLIDE 7

Exam

  • Electronic exam after the course
  • Tests understanding rather than exact syntax
  • ”Use pseudocode to write a MapReduce

program which …”

  • General questions on Hadoop and related

technologies

slide-8
SLIDE 8

Today

  • Big data
  • Data Science
  • Hadoop
  • HDFS
  • Apache Flume
slide-9
SLIDE 9

1: Big Data

  • World is drowning in data

– clickstream data is collected by web servers
– NYSE generates 1 TB of trade data every day
– MTC collects 5,000 attributes for each call
– smart marketers collect purchasing habits

  • “More data usually beats better algorithms”
slide-10
SLIDE 10

Three Vs of Big Data

  • Volume: amount of data

– Transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to- machine data

  • Velocity: speed of data in and out

– streaming data from RFID, sensors, …

  • Variety: range of data types and sources

– structured, unstructured

slide-11
SLIDE 11

Big Data

  • Variability

– Data flows can be highly inconsistent with periodic peaks

  • Complexity

– Data comes from multiple sources
– linking, matching, cleansing and transforming data across systems is a complex task

slide-12
SLIDE 12

Data Science

  • Definition: Data science is an activity that

extracts insights from messy data

  • Facebook analyzes location data

– to identify global migration patterns
– to find out the fanbases of different sports teams

  • A retailer might track purchases both online

and in-store to target marketing

slide-13
SLIDE 13

Data Science

slide-14
SLIDE 14

New Challenges

  • Compute-intensiveness

– raw computing power

  • Challenges of data intensiveness

– amount of data
– complexity of data
– speed at which data is changing

slide-15
SLIDE 15

Data Storage and Analysis

  • Hard drive from 1990

– stores 1,370 MB
– transfer speed 4.4 MB/s

  • Hard drive from the 2010s

– stores 1 TB
– transfer speed 100 MB/s
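The gap between capacity and transfer speed is the point here: capacity grew roughly 1000×, transfer speed only about 25×, so reading a whole drive went from minutes to hours. A back-of-the-envelope check in Python (the figures are from the slide; the helper function is purely illustrative):

```python
def read_time_hours(capacity_mb: float, speed_mb_s: float) -> float:
    """Time to stream an entire drive end to end, in hours."""
    return capacity_mb / speed_mb_s / 3600

# 1990: 1,370 MB at 4.4 MB/s -> roughly 5 minutes to read the whole drive
t_1990_min = read_time_hours(1370, 4.4) * 60
# 2010s: 1 TB at 100 MB/s -> roughly 2.8 hours
t_2010_h = read_time_hours(1_000_000, 100)
print(round(t_1990_min, 1), "minutes vs", round(t_2010_h, 1), "hours")
```

This is exactly the motivation for the parallel, multi-disk approach a few slides later.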

slide-16
SLIDE 16

Scalability

  • Grows without requiring developers to

re-architect their algorithms/application

  • Horizontal scaling
  • Vertical scaling
slide-17
SLIDE 17

Parallel Approach

  • Reading from multiple disks in parallel

– 100 drives having 1/100 of the data => 1/100 reading time

  • Problem: Hardware failures

– replication

  • Problem: Most analysis tasks need to be able

to combine data in some way

– MapReduce

  • Hadoop
slide-18
SLIDE 18

2: Apache Hadoop

  • Hadoop is a framework of tools

– libraries and methodologies

  • Operates on large unstructured datasets
  • Open source (Apache License)
  • Simple programming model
  • Scalable
slide-19
SLIDE 19

Hadoop

  • A scalable fault-tolerant distributed system for

data storage and processing (open source under the Apache license)

  • Core Hadoop has two main systems:

– Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
– MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

slide-20
SLIDE 20

Hadoop

  • Administrators

– Installation
– Monitor/Manage Systems
– Tune Systems

  • End Users

– Design MapReduce Applications
– Import and export data
– Work with various Hadoop Tools

slide-21
SLIDE 21

Hadoop

  • Developed by Doug Cutting and Michael J.

Cafarella

  • Based on Google MapReduce technology
  • Designed to handle large amounts of data and

be robust

  • Donated to Apache Foundation in 2006 by

Yahoo

slide-22
SLIDE 22

Hadoop Design Principles

  • Moving computation is cheaper than moving data
  • Hardware will fail
  • Hide execution details from the user
  • Use streaming data access
  • Use simple file system coherency model
  • Hadoop is not a replacement for SQL, not always fast

and efficient, and not meant for quick ad-hoc querying

slide-23
SLIDE 23

Hadoop MapReduce

  • MapReduce (MR) is the original programming

model for Hadoop

  • Collocate data with compute node

– data access is fast since it's local (data locality)

  • Network bandwidth is the most precious

resource in the data center

– MR implementations explicitly model the network topology

slide-24
SLIDE 24

Hadoop MapReduce

  • MR operates at a high level of abstraction

– programmer thinks in terms of functions of key and value pairs

  • MR is a shared-nothing architecture

– tasks do not depend on each other
– failed tasks can be rescheduled by the system

  • MR was introduced by Google

– used for producing search indexes
– applicable to many other problems too
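The key-value abstraction above can be sketched as a pure-Python word count that simulates the map, shuffle, and reduce phases in memory. This is a toy model of the programming idea, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(line):
    # map: one input line -> a list of (word, 1) pairs
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    # reduce: (word, [1, 1, ...]) -> (word, total count)
    return key, sum(values)

lines = ["Hadoop is a framework", "Hadoop is scalable"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # 'hadoop' and 'is' each appear twice
```

Because the map and reduce functions are side-effect free and tasks share nothing, a failed task can simply be re-run on another node — the shared-nothing property from the slide.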

slide-25
SLIDE 25

Hadoop Components

  • Hadoop Common

– A set of components and interfaces for distributed file systems and general I/O

  • Hadoop Distributed Filesystem (HDFS)
  • Hadoop YARN – a resource-management and

job-scheduling platform

  • Hadoop MapReduce

– Distributed programming model and execution environment

slide-26
SLIDE 26

Hadoop Stack Transition

slide-27
SLIDE 27

Hadoop Ecosystem

  • HBase – a scalable, distributed database that

supports structured data storage for large tables

  • Hive – a data warehouse infrastructure that

provides data summarization and ad hoc querying

  • Pig – a high-level data-flow language and

execution framework for parallel computation

  • Spark – a fast and general compute engine for

Hadoop data. Wide range of applications – ETL, Machine Learning, stream processing, and graph analytics

slide-28
SLIDE 28

Flexibility: Complex Data Processing

1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A workflow engine that enables creating a workflow of jobs composed of any of the above.

slide-29
SLIDE 29

3: Hadoop Distributed File System

  • Hadoop comes with distributed file system

called HDFS (Hadoop Distributed File System)

  • Based on Google’s GFS (Google File System)
  • HDFS provides redundant storage for massive

amounts of data

– using commodity hardware

  • Data in HDFS is distributed across all data

nodes

– Efficient for MapReduce processing

slide-30
SLIDE 30

HDFS Design

  • File system on commodity hardware

– Survives even with high failure rates of the components

  • Supports lots of large files

– File size hundreds GB or several TB

  • Main design principles

– Write once, read many times
– Streaming reads rather than frequent random access
– High throughput is more important than low latency

slide-31
SLIDE 31

HDFS Architecture

  • HDFS operates on top of existing file system
  • Files are stored as blocks (default size 128 MB,

different from file system blocks)

  • File reliability is based on block-based replication

– Each block of a file is typically replicated across several DataNodes (default replication is 3)

  • NameNode stores metadata, manages replication

and provides access to files

  • No data caching (because of large datasets), but

direct reading/streaming from DataNode to client
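A quick sketch of the block arithmetic implied above, assuming the 128 MB default block size and 3× replication (plain Python, not an HDFS API):

```python
import math

def block_count(file_size_mb: float, block_size_mb: int = 128) -> int:
    # a file occupies ceil(size / block_size) blocks;
    # the last block only takes up its actual size on disk
    return math.ceil(file_size_mb / block_size_mb)

def stored_copies(file_size_mb: float, replication: int = 3,
                  block_size_mb: int = 128) -> int:
    # total block copies held across the cluster
    return block_count(file_size_mb, block_size_mb) * replication

print(block_count(1024))    # a 1 GB file -> 8 blocks
print(stored_copies(1024))  # 24 block copies cluster-wide at replication 3
```

Note that even a 1 MB file still consumes one block's worth of NameNode metadata, which is why HDFS handles lots of small files poorly (see the Conclusions slide).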

slide-32
SLIDE 32

HDFS Architecture

  • NameNode stores HDFS metadata

– filenames, locations of blocks, file attributes
– Metadata is kept in RAM for fast lookups

  • The number of files in HDFS is limited by the

amount of available RAM in the NameNode

– HDFS NameNode federation can help in RAM issues: several NameNodes, each of which manages a portion of the file system namespace

slide-33
SLIDE 33

HDFS Architecture

  • DataNode stores file contents as blocks

– Different blocks of the same file are stored on different DataNodes
– Same block is typically replicated across several DataNodes for redundancy
– Periodically sends report of all existing blocks to the NameNode
– DataNodes exchange heartbeats with the NameNode

slide-34
SLIDE 34

HDFS Architecture

  • Built-in protection against DataNode failure
  • If NameNode does not receive any heartbeat

from a DataNode within certain time period, DataNode is assumed to be lost

  • In case of failing DataNode, block replication is

actively maintained

– NameNode determines which blocks were on the lost DataNode
– The NameNode finds other copies of these lost blocks and replicates them to other nodes
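The re-replication bookkeeping can be modeled as: given a map of blocks to the DataNodes holding them, find the blocks that drop below the replication target when a node is lost. A toy in-memory sketch — the node names and cluster state here are hypothetical, not real NameNode data structures:

```python
def under_replicated(block_locations, lost_node, target=3):
    """Blocks whose live replica count falls below `target`
    once `lost_node` stops sending heartbeats."""
    result = {}
    for block, nodes in block_locations.items():
        survivors = [n for n in nodes if n != lost_node]
        if len(survivors) < target:
            # the NameNode would pick a surviving copy and
            # schedule a new replica on another DataNode
            result[block] = survivors
    return result

# hypothetical cluster state: block id -> DataNodes holding a replica
state = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn4", "dn5"],
}
# dn2 misses its heartbeats and is declared lost:
print(under_replicated(state, "dn2"))  # both blocks need a new replica
```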

slide-35
SLIDE 35

HDFS

  • HDFS Federation

– Multiple NameNode servers
– Multiple namespaces

  • High Availability – redundant NameNodes
  • Heterogeneous Storage and Archival Storage

– storage types: ARCHIVE, DISK, SSD, RAM_DISK

slide-36
SLIDE 36

High-Availability (HA) Issues: NameNode Failure

  • NameNode failure corresponds to losing all

files on a file system

% sudo rm --dont-do-this /

  • For recovery, Hadoop provides two options

– Backup files that make up the persistent state of the file system
– Secondary NameNode

  • Also some more advanced techniques exist
slide-37
SLIDE 37

HA Issues: the secondary NameNode

  • The secondary NameNode is not a mirrored NameNode
  • Performs memory-intensive administrative functions

– NameNode keeps metadata in memory and writes changes to an edit log
– The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the log from becoming too large

  • Keeps a copy of the merged namespace image, which

can be used in the event of the NameNode failure

slide-38
SLIDE 38

Network Topology

  • HDFS is aware how close two nodes are in the

network

  • From closer to further

0: Processes in the same node
2: Different nodes in the same rack
4: Nodes in different racks in the same data center
6: Nodes in different data centers
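The distance values above can be expressed as a small function. Representing a node as a (data center, rack, host) tuple is an assumption of this sketch, not Hadoop's actual topology API:

```python
def distance(node_a, node_b):
    """Topology distance between two nodes, as on the slide.
    Nodes are (datacenter, rack, host) tuples (illustrative model)."""
    dc_a, rack_a, host_a = node_a
    dc_b, rack_b, host_b = node_b
    if dc_a != dc_b:
        return 6  # different data centers
    if rack_a != rack_b:
        return 4  # different racks, same data center
    if host_a != host_b:
        return 2  # different hosts, same rack
    return 0      # processes on the same host

print(distance(("dc1", "r1", "h1"), ("dc1", "r2", "h9")))  # 4
```

The scheduler prefers the smallest distance, which is why placing computation on the node (or at least the rack) that already holds the data block saves the scarcest resource, network bandwidth.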

slide-39
SLIDE 39

Network Topology

slide-40
SLIDE 40

File Block Placement

  • Clients always read from the closest node
  • Default placement strategy

– One replica in the same local node as the client
– Second replica in a different rack
– Third replica in a different, randomly selected node in the same rack as the second replica

  • Additional replicas (beyond three) are placed on random nodes
slide-41
SLIDE 41

Balancing

  • Hadoop works best when blocks are evenly

spread out

  • Support for DataNodes of different size

– In the optimal case the disk usage percentage is at approximately the same level on all DataNodes

  • Hadoop provides balancer daemon

– Re-distributes blocks
– Should be run when new DataNodes are added

slide-42
SLIDE 42

Running Hadoop

  • Three configurations

– standalone
– pseudo-distributed
– fully-distributed
– https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

slide-43
SLIDE 43

Configuring HDFS

  • Variable HADOOP_CONF_DIR defines the

directory for the Hadoop configuration files

  • core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>

slide-44
SLIDE 44
  • hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/NN/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/NN/hadoop/datanode</value>
  </property>
</configuration>

slide-45
SLIDE 45

Accessing Data

  • Data can be accessed using various methods

– Java API
– C API
– Command line / POSIX (FUSE mount)
– Command line / HDFS client: Demo
– HTTP
– Various tools

slide-46
SLIDE 46

HDFS URI

  • All HDFS (CLI) commands take path URIs as

arguments

  • URI example

– hdfs://localhost:9000/user/hduser/log-data/file1.log

  • The scheme and authority are optional

– /user/hduser/log-data/file1.log

  • Home directory

– log-data/file1.log
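A sketch of how the three path forms above resolve to the same kind of full URI. The default filesystem and home directory below are assumptions matching the slide's example; the real HDFS client derives them from its configuration (`fs.defaultFS`) and the current user:

```python
def resolve(path, default_fs="hdfs://localhost:9000", home="/user/hduser"):
    """Expand an HDFS path the way the CLI does (illustrative sketch):
    full URIs pass through, absolute paths get the default
    scheme/authority, relative paths resolve against the home dir."""
    if "://" in path:
        return path                       # already a full URI
    if path.startswith("/"):
        return default_fs + path          # absolute: prepend scheme+authority
    return f"{default_fs}{home}/{path}"   # relative: resolve against home

print(resolve("log-data/file1.log"))
# hdfs://localhost:9000/user/hduser/log-data/file1.log
```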

slide-47
SLIDE 47

RDBMS vs HDFS

  • Schema-on-Write (RDBMS)

– Schema must be created before any data can be loaded
– An explicit load operation transforms data to the DB-internal structure
– New columns must be added explicitly before new data for such columns can be loaded into the DB

  • Schema-on-Read (HDFS)

– Data is simply copied to the file store, no transformation is needed
– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
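The schema-on-read idea can be illustrated with a toy "SerDe" that applies a column list only at read time; the file contents and column names below are invented for the example:

```python
def csv_serde(line, columns):
    """A toy 'SerDe': raw lines sit on disk untouched, and the
    schema is applied only when the line is read (late binding)."""
    return dict(zip(columns, line.split(",")))

raw = "2016-08-24,order1,42.50"       # stored as-is, no load step
schema_v1 = ["date", "order_id", "total"]
print(csv_serde(raw, schema_v1))

# adding a column later means updating only the reader, not the files:
raw2 = "2016-08-25,order2,13.00,web"
schema_v2 = schema_v1 + ["channel"]
print(csv_serde(raw2, schema_v2))
```

This is the contrast with schema-on-write: in an RDBMS the second record could not even be loaded until the table definition changed.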

slide-48
SLIDE 48

Conclusions

  • Pros

– Support for very large files
– Designed for streaming data
– Commodity hardware

  • Cons

– Not designed for low-latency data access
– Architecture does not support lots of small files
– No support for multiple writers / arbitrary file modifications (writes always at the end of the file)

slide-49
SLIDE 49

Reading data

slide-50
SLIDE 50

Flume

slide-51
SLIDE 51

4: Data Modeling

  • HDFS is a Schema-on-read system

– allows storing all of your raw data

  • Still, the following must be considered

– Data storage formats
– Multitenancy
– Schema design
– Metadata management

slide-52
SLIDE 52

Data Storage Options

  • No standard data storage format

– Hadoop allows storing of data in any format

  • Major considerations for data storage include

– File format (e.g. plain text, SequenceFile or more complex but more functionally rich options, such as Avro and Parquet)
– Compression (splittability)
– Data storage system (HDFS, HBase, Hive, Impala)

slide-53
SLIDE 53

File Formats: Text File

  • Common use case: web logs and server logs

– comes in many formats

  • Organization of the files in the filesystem
  • Text files consume space -> compression
  • Overhead for conversion (‘123’ ->123)
  • Structured text data

– XML and JSON present challenges to Hadoop

  • hard to split

– Dedicated libraries exist

slide-54
SLIDE 54

File Formats: Binary Data

  • Hadoop can be used to process binary files

– e.g. images

  • Container format is preferred

– e.g. SequenceFile

  • If the splittable unit of binary data is larger

than 64 MB, you may consider putting the data in its own file, without using a container format

slide-55
SLIDE 55

Hadoop File Types

  • Hadoop-specific file formats are specifically created to

work well with MapReduce

– file-based data structures such as sequence files,
– serialization formats like Avro, and
– columnar formats such as RCFile and Parquet

  • Splittable compression

– These formats support common compression formats and are also splittable

  • Agnostic compression

– the codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec

slide-56
SLIDE 56

File-Based Data Structures

  • SequenceFile format is one of the most

commonly used file-based formats in Hadoop

– other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles, …
– stores data as binary key-value pairs
– three formats available for records: uncompressed, record-compressed, block-compressed

slide-57
SLIDE 57

Sequence File

  • Header metadata

– compression codec, key and value class names, user-defined metadata, randomly generated sync marker

  • Often used as a container for

smaller files

slide-58
SLIDE 58

Compression

  • Also for speeding up MapReduce

– Not only for reducing storage requirements

  • Compression must be splittable

– MapReduce framework splits data for input to multiple tasks

slide-59
SLIDE 59

HDFS Schema Design

  • Hadoop is often a data hub for the entire

organization

– data is shared by many departments and teams

  • Carefully structured and organized repository has

several benefits

– standard directory structure makes it easier to share data between teams
– allows for enforcing access rights and quota
– conventions regarding e.g. staging data lead to fewer errors
– code reuse
– Hadoop tools make assumptions about the data placement

slide-60
SLIDE 60

Recommended Locations of Files

  • /user/<username>

– data, JARs, and config files of a specific user

  • /etl

– data in all phases of an ETL workflow
– /etl/<group>/<application>/<process>/{input, processing, output, bad}

  • /tmp

– temporary data

slide-61
SLIDE 61

Recommended Locations of Files

  • /data

– datasets shared across the organization
– data is written by automated ETL processes
– read-only for users
– subdirectories for each data set

  • /app

– JARs, Oozie workflow definitions, Hive HQL files, …
– /app/<group>/<application>/<version>/<artifact directory>/<artifact>

slide-62
SLIDE 62

Recommended Locations of Files

  • /metadata

– the metadata required by some tools

slide-63
SLIDE 63

Partitioning

  • HDFS has no indexes

– pro: fast to ingest data
– con: might lead to a full table scan (FTS), even when only a portion of the data is needed

  • Solution: break the data set into smaller subsets

(partitions)

– an HDFS subdirectory for each partition
– allows queries to read only the specific partitions

slide-64
SLIDE 64

Partitioning: Example

  • Assume data sets for all orders for various

pharmacies

  • Without partitioning, checking the order history

for just one physician over the past three months leads to a full table scan

  • medication_orders/date=20160824/{order1.csv, order2.csv}

– only 90 directories must be scanned
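The 90-directory claim can be checked with a small script that generates the partition paths a 90-day query would have to list; the directory naming follows the slide's example:

```python
from datetime import date, timedelta

def partitions_to_scan(end, days):
    """Date-partitioned layout medication_orders/date=YYYYMMDD/:
    a 90-day query lists only 90 directories instead of scanning
    the whole data set (illustrative sketch)."""
    return [f"medication_orders/date={(end - timedelta(d)):%Y%m%d}"
            for d in range(days)]

dirs = partitions_to_scan(date(2016, 8, 24), 90)
print(len(dirs), dirs[0])  # 90 medication_orders/date=20160824
```

A query engine that understands the layout prunes every partition outside the date range before it reads a single file.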

slide-65
SLIDE 65

5: Data Movement

  • File system client for simple usage
  • Common data sources for Hadoop include

– traditional data management systems such as relational databases and mainframes
– logs, machine-generated data, and other forms of event data
– files being imported from existing enterprise data storage systems

slide-66
SLIDE 66

Data Movement: Considerations

  • Timeliness of data ingestion and accessibility

– What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing?

  • Incremental updates

– How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?

slide-67
SLIDE 67

Data Movement: Considerations

  • Data access and processing

– Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required?

  • Source system and data structure

– Where is the data coming from? A relational database? Logs? Is it structured, semistructured, or unstructured data?
slide-68
SLIDE 68

Data Movement: Considerations

  • Partitioning and splitting of data

– How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)?

  • Storage format

– What format will the data be stored in?

  • Data transformation

– Does the data need to be transformed in flight?

slide-69
SLIDE 69

Timeliness of Data Ingestion

  • Time lag from when data is available for ingestion to when it’s

accessible in Hadoop

  • Classifications of ingestion requirements:
  • Macro batch

– anything over 15 minutes to hours, or even a daily job.

  • Micro batch

– fired off every 2 minutes or so, but no more than 15 minutes in total.

  • Near-Real-Time Decision Support

– “immediately actionable” by the recipient of the information – delivered in less than 2 minutes but greater than 2 seconds.

  • Near-Real-Time Event Processing

– under 2 seconds, and can be as fast as a 100-millisecond range.

  • Real Time

– anything under 100 milliseconds.
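The latency classes above can be captured in one function; the exact boundary handling (open vs. closed intervals) is a judgment call in this sketch:

```python
def ingestion_class(latency_seconds):
    """Map an ingestion-latency requirement to the slide's
    categories (boundary choices are this sketch's assumption)."""
    if latency_seconds < 0.1:
        return "real time"                       # under 100 ms
    if latency_seconds < 2:
        return "near-real-time event processing"  # under 2 s
    if latency_seconds < 120:
        return "near-real-time decision support"  # 2 s to 2 min
    if latency_seconds <= 15 * 60:
        return "micro batch"                      # up to 15 min
    return "macro batch"                          # anything slower

print(ingestion_class(60))    # near-real-time decision support
print(ingestion_class(3600))  # macro batch
```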

slide-70
SLIDE 70

Incremental Updates

  • Data is either appended to an existing data set or

it is modified

– HDFS works fine for append only implementations.

  • The downside to HDFS is the inability to do

appends or random writes to files after they’re created

  • HDFS is optimized for large files

– If the requirements call for a two-minute append process that ends up producing lots of small files, then a periodic process to combine smaller files will be required to get the benefits from larger files

slide-71
SLIDE 71

Original Source System and Data Structure

  • Original file type

– any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, …

  • Hadoop can accept any file format

– not all formats are optimal for particular use cases
– not all file formats can work with all tools in the Hadoop ecosystem, example: variable-length files

slide-72
SLIDE 72

Compression

  • Pro

– transferring a compressed file over the network requires less I/O and network bandwidth

  • Con

– most compression codecs applied outside of Hadoop are not splittable (e.g., Gzip)

slide-73
SLIDE 73

Misc

  • RDBMS

– Tool: Sqoop

  • Streaming Data

– Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server
– Tools: Flume or Kafka

  • Logfiles

– an anti-pattern is to read the logfiles from disk as they are written, because this is almost impossible to implement without losing data
– the correct way of ingesting logfiles is to stream the logs directly to a tool like Flume or Kafka, which will write directly to Hadoop instead

slide-74
SLIDE 74

Transformations

  • modifications on incoming data, distributing

the data into partitions or buckets, sending the data to more than one store or location

– Transformation: XML or JSON is converted to delimited data
– Partitioning: incoming data is stock trade data and partitioning by ticker is required
– Splitting: the data needs to land in HDFS and HBase for different access patterns
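The stock-trade partitioning case can be sketched as routing incoming records into per-ticker buckets in memory; the record fields and ticker symbols below are invented for the example:

```python
def partition_by_ticker(trades):
    """Route incoming trade records into per-ticker buckets,
    a toy in-memory version of partitioning during ingest."""
    buckets = {}
    for trade in trades:
        buckets.setdefault(trade["ticker"], []).append(trade)
    return buckets

incoming = [
    {"ticker": "AAPL", "price": 108.0},
    {"ticker": "NOK",  "price": 5.1},
    {"ticker": "AAPL", "price": 108.2},
]
out = partition_by_ticker(incoming)
print(sorted(out))       # the bucket keys, one per ticker
print(len(out["AAPL"]))  # 2
```

In a real pipeline each bucket would map to a partition directory (or an HBase region), so downstream queries for one ticker touch only that partition.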

slide-75
SLIDE 75

Data Ingestion Options

  • File transfers
  • Tools like Flume, Sqoop, and Kafka