Big Data Processing Technologies Chentao Wu Associate Professor - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Big Data Processing Technologies

Chentao Wu Associate Professor

  • Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

slide-2
SLIDE 2

Schedule

  • lec1: Introduction on big data and cloud computing
  • lec2: Introduction on data storage
  • lec3: Data reliability (Replication/Archive/EC)
  • lec4: Data consistency problem
  • lec5: Block level storage and file storage
  • lec6: Object-based storage
  • lec7: Distributed file system
  • lec8: Metadata management
slide-3
SLIDE 3

Collaborators

slide-4
SLIDE 4

Contents

1. Introduction on Storage Devices

slide-5
SLIDE 5

An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices sit at the top of the hierarchy; larger, slower, and cheaper (per byte) storage devices sit at the bottom.

  • L0: registers. CPU registers hold words retrieved from L1 cache.
  • L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
  • L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
  • L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
  • L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
  • L5: remote secondary storage (tapes, distributed file systems, Web servers).

slide-6
SLIDE 6

Read Only Memory (ROM)

When a computer is first switched on, it needs to load up the BIOS (Basic Input/Output System) and basic instructions for the hardware. These are stored in ROM (Read Only Memory). This type of memory is called non-volatile because it retains the data. Data stored in ROM remains there even when the computer is switched off. ROM can be found on the motherboard.

slide-7
SLIDE 7

Random Access Memory (RAM)

Computers store temporary data in RAM (Random Access Memory). This could be operating instructions, loose bits of data, or content from programs that are running.

The contents of RAM are constantly rewritten as the data is processed. When the computer is switched off, all the data is cleared from the RAM. This type of memory is called volatile because it only stores the data whilst the computer is switched on. RAM sticks are found on the motherboard.

slide-8
SLIDE 8

Secondary Storage/Backup Storage

Computers need backing storage outside the CPU to store data and programs not currently in use. There are three main types of storage device:

  • Those that store data by magnetizing a special material that coats the surface of a disk.
  • Those that store data using optical technology to etch the data onto a plastic-coated metal disk. Laser beams are then passed over the surface to read the data.
  • Flash drives, which use solid-state technology and store data in a similar way to the BIOS chip.

slide-9
SLIDE 9

Hard Disk Drives (HDDs)

The hard disk of the computer stores the system information, programs and data that the computer uses every day. Computer servers will use RAID systems with many hard drives to provide huge capacity and safer storage. The drives can be mirrored so that data written to one of them is also written to the others, so if one drive fails, the others just take over.

Removable hard drives plug into the USB port and can be used for backup or transfer of data to another computer.

slide-10
SLIDE 10

What’s Inside A Disk Drive?

Figure labels: spindle, arm, actuator, platters, electronics, SCSI connector

Image courtesy of Seagate Technology

slide-11
SLIDE 11

Disk Electronics

  • Connect to disk
  • Control processor
  • Cache memory
  • Control ASIC
  • Connect to motor

Just like a small computer – processor, memory, network interface

slide-12
SLIDE 12

Longitudinal Recording

slide-13
SLIDE 13

How Bits Are Stored

Magnetic Transition

slide-14
SLIDE 14

Disk “Geometry”

  • Disks contain platters, each with two surfaces.
  • Each surface is organized in concentric rings called tracks.
  • Each track consists of sectors separated by gaps.

Figure labels: spindle, surface, tracks, track k, sectors, gaps

slide-15
SLIDE 15

Disk Geometry (Multiple-Platter View)

Figure labels: platters 0 to 2, surfaces 0 to 5, cylinder k, spindle

Aligned tracks form a cylinder

slide-16
SLIDE 16

Disk Structure

Figure labels: read/write head, upper surface, platter, lower surface, cylinder, track, sector, arm, actuator

slide-17
SLIDE 17

Disk Structure - top view of single platter

  • Surface organized into tracks
  • Tracks divided into sectors

slide-18
SLIDE 18

Disk Access

Head in position above a track

slide-19
SLIDE 19

Disk Access

Rotation is counter-clockwise

slide-20
SLIDE 20

Disk Access – Read

About to read blue sector

slide-21
SLIDE 21

Disk Access – Read

After BLUE read

After reading blue sector

slide-22
SLIDE 22

Disk Access – Read

After BLUE read

Red request scheduled next

slide-23
SLIDE 23

Disk Access – Read

After BLUE read Seek for RED

Seek to red’s track

slide-24
SLIDE 24

Disk Access – Read

After BLUE read Seek for RED Rotational latency

Wait for red sector to rotate around

slide-25
SLIDE 25

Disk Access – Read

After BLUE read Seek for RED Rotational latency After RED read

Complete read of red

slide-26
SLIDE 26

Disk Access – Read

After BLUE read Seek for RED Rotational latency After RED read

Seek Rotational Latency Data Transfer

slide-27
SLIDE 27

Disk Access Time

Average time to access a specific sector approximated by:

  • Taccess = Tavg seek + Tavg rotation + Tavg transfer

Seek time (Tavg seek)

  • Time to position heads over cylinder containing target sector
  • Typical Tavg seek = 3-5 ms

Rotational latency (Tavg rotation)

  • Time waiting for first bit of target sector to pass under r/w head
  • Tavg rotation = 1/2 x 1/RPM x 60 secs/1 min (on average, half a revolution)
  • e.g., 3ms for 10,000 RPM disk

Transfer time (Tavg transfer)

  • Time to read the bits in the target sector
  • Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 secs/1 min
  • e.g., 0.006ms for 10,000 RPM disk with 1,000 sectors/track
  • given 512-byte sectors, ~85 MB/s data transfer rate
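
The access-time formulas above can be checked with a short calculation. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the 4 ms seek time is an assumed value taken from the middle of the 3-5 ms range quoted above.

```python
def avg_access_time_ms(rpm, sectors_per_track, avg_seek_ms):
    """Return (rotation_ms, transfer_ms, total_ms) for reading one sector."""
    ms_per_rev = 60_000.0 / rpm                    # one full revolution, in ms
    rotation_ms = 0.5 * ms_per_rev                 # on average, wait half a revolution
    transfer_ms = ms_per_rev / sectors_per_track   # time for one sector to pass under the head
    total_ms = avg_seek_ms + rotation_ms + transfer_ms
    return rotation_ms, transfer_ms, total_ms


def transfer_rate_mb_per_s(rpm, sectors_per_track, sector_bytes=512):
    """Sustained rate while reading a whole track, in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_rev = sectors_per_track * sector_bytes
    revs_per_s = rpm / 60.0
    return bytes_per_rev * revs_per_s / 1e6


# Slide example: 10,000 RPM, 1,000 sectors/track; the 4 ms seek is an assumption.
rotation, transfer, total = avg_access_time_ms(10_000, 1_000, avg_seek_ms=4.0)
print(f"rotation={rotation:.3f} ms, transfer={transfer:.3f} ms, total={total:.3f} ms")
print(f"rate={transfer_rate_mb_per_s(10_000, 1_000):.1f} MB/s")
```

With these inputs the seek and rotational terms dominate: the transfer of a single sector contributes only a few microseconds to the total.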
slide-28
SLIDE 28

Solid State Drives

slide-29
SLIDE 29

Flash Memory Cell

slide-30
SLIDE 30

NAND-Flash

slide-31
SLIDE 31

SLC and MLC

slide-32
SLIDE 32

Performance Comparison HDD vs SSD

slide-33
SLIDE 33

Performance Comparison HDD vs SSD

slide-34
SLIDE 34

Contents

2. Introduction to RAID

slide-35
SLIDE 35

RAID Array Components

RAID Controller

Hard Disks Logical Array (RAID Sets) RAID Array Host

slide-36
SLIDE 36

RAID Techniques

  • Three key techniques used for RAID are:
  • Striping
  • Mirroring
  • Parity
slide-37
SLIDE 37

RAID Technique – Striping

RAID Controller

Host

Stripe Strip

slide-38
SLIDE 38

RAID Technique – Mirroring

Host

Block 0

RAID Controller

Block 0 Block 0

slide-39
SLIDE 39

RAID Technique – Parity

RAID Controller

D1 = 4, D2 = 6, D3 = 1, D4 = 7, P = 18 (4 + 6 + 1 + 7 = 18)

Host

The sum is used here only for illustration; the actual parity calculation is a bitwise XOR operation.
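
As a rough illustration of the XOR form of parity, the sketch below computes a parity strip over the same four data values. The helper name `xor_parity` and the one-byte strips are assumptions made for the example, not something taken from the slides.

```python
from functools import reduce


def xor_parity(strips):
    """Bitwise XOR of equal-length byte strings, one strip per data drive."""
    # zip(*strips) walks the strips column by column; each column is XORed together.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# One-byte strips holding the slide's values for D1..D4.
data = [bytes([4]), bytes([6]), bytes([1]), bytes([7])]
parity = xor_parity(data)
print(parity[0])  # XOR parity of 4, 6, 1, 7 (differs from the arithmetic sum 18)
```

A useful property of XOR parity is that XORing all data strips together with the parity strip yields all zero bytes, which is what makes single-drive reconstruction possible.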

slide-40
SLIDE 40

Data Recovery in Parity Technique

Host

Regeneration of data when drive D3 fails:

D1 = 4, D2 = 6, D3 = ?, D4 = 7, P = 18

4 + 6 + ? + 7 = 18
? = 18 - 4 - 6 - 7
? = 1
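
With real XOR parity, the subtraction step becomes another XOR: the missing strip is the XOR of every surviving strip, parity included. The following is a minimal sketch under that assumption; `xor_all` is a hypothetical helper and the one-byte strips simply reuse the slide's values.

```python
def xor_all(strips):
    """XOR together equal-length byte strings."""
    result = bytes(len(strips[0]))  # start from all-zero bytes
    for strip in strips:
        result = bytes(x ^ y for x, y in zip(result, strip))
    return result


d1, d2, d3, d4 = b"\x04", b"\x06", b"\x01", b"\x07"
parity = xor_all([d1, d2, d3, d4])       # parity strip written while all drives are healthy

# Drive D3 fails: rebuild its strip from the three surviving data strips plus parity.
recovered = xor_all([d1, d2, d4, parity])
print(recovered == d3)
```

The same routine serves both for computing parity and for rebuilding, because XOR is its own inverse.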

slide-41
SLIDE 41

RAID-0

  • It splits data among two or more disks.
  • Provides good performance.
  • Lack of data redundancy means there is no failover support with this configuration.
  • In the diagram to the right, the odd blocks are written to disk 0 and the even blocks to disk 1, so that A1, A2, A3, A4, … is the order of blocks read when reading sequentially from the beginning.
  • Used in read-only NFS systems and gaming systems.
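
The block placement described above can be sketched as a simple mapping from a logical block number to a disk and an offset on that disk. This is an illustrative sketch that assumes a strip size of one block and 0-based block numbers (so A1 is block 0); `locate_block` is a made-up name.

```python
def locate_block(logical_block, num_disks):
    """Map a 0-based logical block to (disk index, block offset on that disk)."""
    disk = logical_block % num_disks      # blocks rotate round-robin across the disks
    offset = logical_block // num_disks   # each full pass over the disks is one stripe
    return disk, offset


# Two-disk array: A1 and A3 land on disk 0, A2 and A4 on disk 1.
for i, name in enumerate(["A1", "A2", "A3", "A4"]):
    disk, offset = locate_block(i, num_disks=2)
    print(f"{name} -> disk {disk}, offset {offset}")
```

Because consecutive blocks sit on different disks, a sequential read can keep all disks busy at once, which is where the performance gain comes from.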

slide-42
SLIDE 42

RAID-1

  • RAID 1 is ‘data mirroring’.
  • Two copies of the data are held on two physical disks, and the data is always identical.
  • Twice as many disks are required to store the same data when compared to RAID 0.
  • Array continues to operate so long as at least one drive is functioning.
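
The mirroring behaviour above can be modelled in a few lines. This is a toy in-memory sketch, not a real driver: the `MirroredArray` class and its methods are hypothetical, and each drive is just a dictionary from block number to data.

```python
class MirroredArray:
    """Toy RAID 1 model: every write goes to all healthy drives."""

    def __init__(self, num_drives=2):
        self.drives = [dict() for _ in range(num_drives)]
        self.failed = set()

    def write(self, block, data):
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                drive[block] = data       # identical copy on each healthy drive

    def fail(self, index):
        self.failed.add(index)            # simulate a drive failure

    def read(self, block):
        for index, drive in enumerate(self.drives):
            if index not in self.failed:
                return drive[block]       # any surviving copy will do
        raise IOError("all drives have failed")


array = MirroredArray()
array.write(0, b"block 0")
array.fail(0)                             # lose one of the two drives
print(array.read(0))                      # still served from the surviving mirror
```

The model ignores rebuilding a replaced drive; it only shows why reads keep working while at least one mirror survives.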

slide-43
SLIDE 43

RAID-5

  • RAID 5 combines good performance, good fault tolerance, and high capacity and storage efficiency.
  • It uses an arrangement of parity and CRC to help rebuild drive data after a disk failure.
  • “Distributed parity” is the key term here.
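
To make "distributed parity" concrete, the sketch below prints which disk holds the parity strip in each stripe. The rotation rule used here (parity starts on the last disk and moves one disk to the left per stripe) is one possible layout chosen as an assumption for illustration; real implementations offer several layouts, and the slides do not say which one is meant.

```python
def raid5_stripe_layout(stripe, num_disks):
    """Return the strip labels for one stripe, with 'P' marking the parity strip."""
    # Assumed rotation: parity on the last disk for stripe 0, then shifting left.
    parity_disk = (num_disks - 1 - stripe) % num_disks
    layout = []
    data_index = stripe * (num_disks - 1)   # each stripe holds num_disks - 1 data strips
    for disk in range(num_disks):
        if disk == parity_disk:
            layout.append("P")
        else:
            layout.append(f"D{data_index}")
            data_index += 1
    return layout


for stripe in range(4):
    print(stripe, raid5_stripe_layout(stripe, num_disks=4))
```

Since the parity strip moves from stripe to stripe, no single disk absorbs all parity writes, unlike a layout with one dedicated parity disk.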

slide-44
SLIDE 44

RAID-6

  • It is seen as the best way to guarantee data integrity as it uses double parity.
  • Lower MTBF compared to RAID 5.
  • Its drawback is a longer write time.

slide-45
SLIDE 45

RAID-10

  • Combines RAID 1 and RAID 0.
  • This gives the benefits of both: good performance and good failover handling.
  • Also called ‘Nested RAID’.
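
One way to compare the RAID levels covered so far is by usable capacity. The sketch below is an illustrative calculator, assuming identical disks, RAID 1 as a single mirror set (every disk holds the same data), and RAID 10 as a stripe over two-disk mirrors; the function name and the string level codes are made up for the example.

```python
def usable_capacity(level, num_disks, disk_size):
    """Usable capacity of an array of identical disks, in the units of disk_size."""
    if level == "0":
        return num_disks * disk_size            # striping only, no redundancy
    if level == "1":
        return disk_size                        # every disk holds the same data
    if level == "5":
        if num_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (num_disks - 1) * disk_size      # one disk's worth of parity
    if level == "6":
        if num_disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return (num_disks - 2) * disk_size      # two disks' worth of parity
    if level == "10":
        if num_disks < 4 or num_disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return (num_disks // 2) * disk_size     # half the disks are mirror copies
    raise ValueError(f"unsupported RAID level: {level}")


# Four 1000 GB disks (two for RAID 1, matching the two-disk mirror described earlier).
for level, disks in [("0", 4), ("1", 2), ("5", 4), ("6", 4), ("10", 4)]:
    print(f"RAID {level}: {usable_capacity(level, disks, 1000)} GB usable")
```

The numbers show the trade-off the slides describe: the more redundancy a level provides, the smaller the share of raw capacity left for data.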
slide-46
SLIDE 46

Implementations

Software based RAID:

  • Software implementations are provided by many operating systems.
  • A software layer sits above the disk device drivers and provides an abstraction layer between the logical drives (RAIDs) and the physical drives.
  • The server's processor is used to run the RAID software.
  • Used for simpler configurations like RAID 0 and RAID 1.
slide-47
SLIDE 47

Implementations (Contd.)

Hardware based RAID:

  • A hardware implementation of RAID requires at least a special-purpose RAID controller.
  • On a desktop system this may be built into the motherboard.
  • The host processor is not used for RAID calculations because a separate controller is present.

Figure: a PCI-bus-based, IDE/ATA hard disk RAID controller, supporting levels 0, 1, and 01.

slide-48
SLIDE 48

Hot Spare

Hot spare Failed disk Replace failed disk

RAID Controller

slide-49
SLIDE 49

Thank you!