Big Data Processing Technologies

Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn

Schedule
• lec1: Introduction to big data and cloud computing
• lec2: Introduction to data storage
• lec3: Data reliability (Replication/Archive/EC)
• lec4: Data consistency problem
• lec5: Block level storage and file storage
• lec6: Object-based storage
• lec7: Distributed file system
• lec8: Metadata management

Collaborators

Contents 1 Introduction to Storage Devices

An Example Memory Hierarchy
Toward the top: smaller, faster, and costlier (per byte) storage devices. Toward the bottom: larger, slower, and cheaper (per byte) storage devices.
• L0: registers. CPU registers hold words retrieved from the L1 cache.
• L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
• L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
• L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
• L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
• L5: remote secondary storage (tapes, distributed file systems, Web servers).

Read Only Memory (ROM) When a computer is first switched on, it needs to load the BIOS (Basic Input/Output System) and basic instructions for the hardware. These are stored in ROM (Read Only Memory). This type of memory is called non-volatile because it retains its data: data stored in ROM remains there even when the computer is switched off. ROM can be found on the motherboard.

Random Access Memory (RAM) Computers store temporary data in RAM (Random Access Memory). This could be operating instructions, loose bits of data, or content from programs that are running. The contents of RAM are constantly rewritten as the data is processed. When the computer is switched off, all the data is cleared from the RAM. This type of memory is called volatile because it only stores the data while the computer is switched on. RAM sticks are found on the motherboard.

Secondary Storage/Backup Storage Computers need backing storage outside the CPU to store data and programs not currently in use. There are three main types of storage device:
• Magnetic devices store data by magnetizing a special material that coats the surface of a disk.
• Optical devices use optical technology to etch the data onto a plastic-coated metal disk. Laser beams are then passed over the surface to read the data.
• Flash drives use solid state technology and store data in a similar way to the BIOS chip.

Hard Disk Drives (HDDs) The hard disk of the computer stores the system information, programs and data that the computer uses every day. Servers often use RAID systems with many hard drives to provide large capacity and safer storage. The drives can be mirrored so that data written to one of them is also written to the others; if one drive fails, the others take over. Removable hard drives plug into the USB port and can be used for backup or for transferring data to another computer.

What’s Inside A Disk Drive? Labeled components: spindle, arm, platters, actuator, electronics, SCSI connector. Image courtesy of Seagate Technology

Disk Electronics Just like a small computer – processor, memory, network interface • Connect to disk • Control processor • Cache memory • Control ASIC • Connect to motor

Longitudinal Recording

How Bits Are Stored Magnetic Transition

Disk “Geometry” • Disks contain platters, each with two surfaces • Each surface is organized in concentric rings called tracks • Each track consists of sectors separated by gaps (Figure labels: surface, tracks, track k, sectors, gaps, spindle)

Disk Geometry (Multiple-Platter View) Aligned tracks form a cylinder. (Figure: platters 0 to 2 on a common spindle, with surfaces 0 and 1 on platter 0, surfaces 2 and 3 on platter 1, and surfaces 4 and 5 on platter 2; cylinder k is marked)

Disk Structure Labeled components: arm, read/write head, platter (upper surface, lower surface), cylinder, track, sector, actuator

Disk Structure - top view of single platter Surface organized into tracks Tracks divided into sectors

Disk Access Head in position above a track

Disk Access Rotation is counter-clockwise

Disk Access – Read
• About to read the blue sector
• After BLUE read: the blue sector has been read, and the red request is scheduled next
• Seek for RED: seek to red’s track
• Rotational latency: wait for the red sector to rotate around
• After RED read: the read of red is complete
• The red request therefore consists of three phases: seek, rotational latency, data transfer

Disk Access Time
Average time to access a specific sector is approximated by:
• Taccess = Tavg seek + Tavg rotation + Tavg transfer
Seek time (Tavg seek)
• Time to position heads over the cylinder containing the target sector
• Typical Tavg seek = 3-5 ms
Rotational latency (Tavg rotation)
• Time waiting for the first bit of the target sector to pass under the r/w head
• Tavg rotation = 1/2 x 1/RPM x 60 sec/1 min
• e.g., 3 ms for a 10,000 RPM disk
Transfer time (Tavg transfer)
• Time to read the bits in the target sector
• Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 sec/1 min
• e.g., 0.006 ms for a 10,000 RPM disk with 1,000 sectors/track
• given 512-byte sectors, ~85 MB/s data transfer rate
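The access-time formulas can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the 4 ms seek time in the usage note is an assumed midpoint of the 3-5 ms range.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * (60.0 / rpm) * 1000.0


def avg_transfer_time_ms(rpm: float, sectors_per_track: float) -> float:
    """Time for one sector to pass under the head, in milliseconds."""
    return (60.0 / rpm) / sectors_per_track * 1000.0


def avg_access_time_ms(seek_ms: float, rpm: float, sectors_per_track: float) -> float:
    """Taccess = Tavg seek + Tavg rotation + Tavg transfer."""
    return (seek_ms
            + avg_rotational_latency_ms(rpm)
            + avg_transfer_time_ms(rpm, sectors_per_track))


def transfer_rate_mb_per_s(rpm: float, sectors_per_track: float,
                           sector_bytes: int = 512) -> float:
    """Sustained rate while reading one track, in MB/s (1 MB = 10**6 bytes)."""
    seconds_per_sector = (60.0 / rpm) / sectors_per_track
    return sector_bytes / seconds_per_sector / 1e6


if __name__ == "__main__":
    # Slide example: 10,000 RPM, 1,000 sectors/track; 4 ms seek is an assumption.
    print(avg_rotational_latency_ms(10_000))
    print(avg_transfer_time_ms(10_000, 1_000))
    print(avg_access_time_ms(4.0, 10_000, 1_000))
    print(transfer_rate_mb_per_s(10_000, 1_000))
```

With the slide's numbers, seek time and rotational latency dominate: the transfer of a single sector contributes only a few microseconds to the total.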

Solid State Drives

Flash Memory Cell

NAND-Flash

SLC and MLC

Performance Comparison HDD vs SSD

Contents 2 Introduction to RAID

RAID Array Components Logical Array (RAID Sets) RAID Controller Hard Disks Host RAID Array

RAID Techniques • Three key techniques used for RAID are: • Striping • Mirroring • Parity

RAID Technique – Striping Strip RAID Stripe Controller Host

RAID Technique – Mirroring Block 0 RAID Controller Block 0 Block 0 Host

RAID Technique – Parity Data disks hold D1 = 4, D2 = 6, D3 = 1, D4 = 7; the parity disk holds P = 4 + 6 + 1 + 7 = 18. The sum is used for illustration; the actual parity calculation is a bitwise XOR operation.
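Since real parity is a bitwise XOR rather than a sum, a short sketch may help. The function name `xor_parity` is hypothetical, and each strip is assumed to be an equal-length `bytes` object.

```python
from functools import reduce


def xor_parity(strips: list[bytes]) -> bytes:
    """Return the byte-wise XOR of equal-length data strips."""
    if not strips:
        raise ValueError("need at least one strip")
    length = len(strips[0])
    if any(len(strip) != length for strip in strips):
        raise ValueError("all strips must have the same length")
    # zip(*strips) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


if __name__ == "__main__":
    # The slide's values, one byte per disk: D1..D4 = 4, 6, 1, 7.
    parity = xor_parity([bytes([4]), bytes([6]), bytes([1]), bytes([7])])
    print(parity[0])
```

For the slide's values the XOR parity is 4 (4 ^ 6 ^ 1 ^ 7), not the illustrative sum of 18.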

Data Recovery in Parity Technique Data disks hold D1 = 4, D2 = 6, D3 = ?, D4 = 7; the parity disk holds P = 18. Regeneration of data when drive D3 fails: 4 + 6 + ? + 7 = 18, so ? = 18 – 4 – 6 – 7 = 1
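With XOR parity the same regeneration works without subtraction, because XOR is its own inverse: XOR-ing the parity with every surviving strip leaves the missing strip. A minimal sketch, assuming equal-length `bytes` strips and a hypothetical helper name `reconstruct_missing`:

```python
def reconstruct_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing strip from the survivors and the XOR parity."""
    missing = bytearray(parity)
    for strip in surviving:
        if len(strip) != len(parity):
            raise ValueError("all strips must have the same length as the parity")
        for i, byte in enumerate(strip):
            missing[i] ^= byte
    return bytes(missing)


if __name__ == "__main__":
    # Slide values with XOR parity: D1 = 4, D2 = 6, D4 = 7, P = 4 ^ 6 ^ 1 ^ 7 = 4.
    rebuilt = reconstruct_missing([bytes([4]), bytes([6]), bytes([7])], bytes([4]))
    print(rebuilt[0])  # the lost D3 value
```

This only recovers one missing strip per stripe; losing two strips protected by a single parity leaves the data unrecoverable.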

RAID-0 • It splits data among two or more disks. • Provides good performance. • Lack of data redundancy means there is no failover support with this configuration. • In a two-disk array, the odd-numbered blocks (A1, A3, …) are written to disk 0 and the even-numbered blocks (A2, A4, …) to disk 1, so that A1, A2, A3, A4, … is the order of blocks read when reading sequentially from the beginning. • Used in read-only NFS systems and gaming systems.
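The round-robin placement described for RAID-0 can be sketched as a simple mapping from a logical block to a disk and a position on that disk. This is an illustration only: `locate_block` is a hypothetical name, blocks are numbered from 0 (so A1 is block 0), and each strip is assumed to be exactly one block.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a 0-based logical block to (disk index, block offset on that disk)."""
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return logical_block % num_disks, logical_block // num_disks


if __name__ == "__main__":
    # Two disks, blocks A1..A4: A1 and A3 land on disk 0, A2 and A4 on disk 1.
    for number in range(1, 5):
        disk, offset = locate_block(number - 1, 2)
        print(f"A{number} -> disk {disk}, offset {offset}")
```

Because consecutive blocks land on different disks, a sequential read can keep all disks busy at once, which is where the performance gain comes from.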

RAID-1 • RAID1 is ‘data mirroring’. • Two copies of the data are held on two physical disks, and the data is always identical. • Twice as many disks are required to store the same data when compared to RAID 0. • Array continues to operate so long as at least one drive is functioning.

RAID-5 • RAID 5 balances good performance, good fault tolerance, and high capacity and storage efficiency. • Data is striped across the drives together with parity, which is used to rebuild a drive's data after a disk failure. • “Distributed Parity” is the key term here: the parity blocks are spread across all drives rather than kept on one dedicated drive.
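To make "distributed parity" concrete, the sketch below rotates the parity block across the disks from one stripe to the next. The rotation order (starting on the last disk and moving toward disk 0) is an assumption chosen for illustration; real controllers use several different layouts, and the names `parity_disk` and `stripe_layout` are hypothetical.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk index holding the parity block for a given stripe (assumed rotation)."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    if stripe < 0:
        raise ValueError("stripe must be non-negative")
    return (num_disks - 1) - (stripe % num_disks)


def stripe_layout(stripe: int, num_disks: int) -> list[str]:
    """Per-disk labels for one stripe: 'P' for parity, 'D0', 'D1', ... for data."""
    p = parity_disk(stripe, num_disks)
    layout = []
    data_index = 0
    for disk in range(num_disks):
        if disk == p:
            layout.append("P")
        else:
            layout.append(f"D{data_index}")
            data_index += 1
    return layout


if __name__ == "__main__":
    for stripe in range(4):
        print(stripe, stripe_layout(stripe, 4))
```

Over any run of `num_disks` consecutive stripes each disk holds the parity block exactly once, so parity writes are not concentrated on a single drive.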

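RAID-6 • It uses double parity, so the array can survive the failure of any two drives; for this reason it is seen as a strong choice for protecting data integrity. • Lower MTBF compared to RAID5. • Its drawback is a longer write time, since two parity blocks must be updated on each write.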

RAID-10 • Combines RAID 1 and RAID 0. • This provides the benefits of both: good performance and good failover handling. • It is a form of ‘Nested RAID’.
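The capacity trade-offs among the levels above can be compared with a small calculator. This is a sketch under stated assumptions: all disks are the same size, RAID 1 keeps a full copy on every disk, RAID 10 uses two-way mirrors, and the function names are hypothetical.

```python
def usable_disks(level: str, num_disks: int) -> int:
    """Number of disks' worth of capacity available for data."""
    if level == "0":
        minimum, usable = 2, num_disks          # striping only, no redundancy
    elif level == "1":
        minimum, usable = 2, 1                  # every disk holds the same data
    elif level == "5":
        minimum, usable = 3, num_disks - 1      # one disk's worth of parity
    elif level == "6":
        minimum, usable = 4, num_disks - 2      # two disks' worth of parity
    elif level == "10":
        if num_disks % 2:
            raise ValueError("RAID 10 with two-way mirrors needs an even disk count")
        minimum, usable = 4, num_disks // 2     # striped two-way mirrors
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if num_disks < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} disks")
    return usable


def storage_efficiency(level: str, num_disks: int) -> float:
    """Fraction of raw capacity that stores data."""
    return usable_disks(level, num_disks) / num_disks


if __name__ == "__main__":
    for level in ("0", "5", "6", "10"):
        print(level, usable_disks(level, 4), storage_efficiency(level, 4))
```

For a four-disk array this makes the trade-off visible: RAID 0 uses all raw capacity, RAID 5 gives up one disk, and RAID 6 and RAID 10 each give up two.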

Implementations Software based RAID: • Software implementations are provided by many operating systems. • A software layer sits above the disk device drivers and provides an abstraction layer between the logical drives (RAIDs) and the physical drives. • The server's processor is used to run the RAID software. • Used for simpler configurations like RAID0 and RAID1.

Implementations (Contd.) Hardware based RAID: • A hardware implementation of RAID requires at least a special-purpose RAID controller. • On a desktop system this may be built into the motherboard. • The processor is not used for RAID calculations, as a separate controller is present. (Figure: a PCI-bus-based, IDE/ATA hard disk RAID controller, supporting levels 0, 1, and 01.)

Hot Spare Failed disk RAID Controller Replace failed disk Hot spare

Thank you!
