CS 644: Introduction to Big Data
Chapter 8. Enabling Big-data Scientific Workflows in High-performance Networks


SLIDE 1

CS 644: Introduction to Big Data Chapter 8. Enabling Big-data Scientific Workflows in High-performance Networks


Chase Wu

New Jersey Institute of Technology

SLIDE 2

Outline

  • Introduction
  • Challenges & Objectives
  • A Three-layer Architecture Solution
  • Enabling Technologies: networking and computing
    • Networking for Big Data
      • Software-defined Networking (SDN)
      • High-performance Networking (HPN)
    • Computing for Big Data
      • Workflow Management and Optimization
  • Simulation/Experimental Results
  • Conclusion

SLIDE 3

Introduction

  • Supercomputing for big-data science
    • Astrophysics, computational biology, climate research, flow dynamics, computational materials, fusion simulation, neutron sciences, nanoscience

SLIDE 4

Terascale Supernova Initiative (TSI)

  • Collaborative project
    • Supernova explosion
  • TSI simulation
    • 1 terabyte a day with a small portion of parameters
  • From TSI to PSI
  • Transfer to remote sites
  • Interactive distributed visualization
  • Collaborative data analysis
  • Computation monitoring
  • Computation steering

[Figure: a supercomputer or cluster connected to remote clients via a visualization channel, a visualization control channel, and a computation steering channel]

SLIDE 5

Challenges for Extreme-scale Scientific Applications

  • A typical way of conducting research
    • Run simulation code on a supercomputer in batch mode
      Ø One day: 1 terabyte datasets
    • Move datasets to HPSS
      Ø About 8 hours
    • Transfer datasets to remote sites over the Internet
      Ø TCP-based transfer tools: up to one week
    • Filter out data of interest
    • Partition the dataset for parallel processing
    • Extract geometry data
    • Generate images on a rendering engine
    • Display results on desktop, laptop, powerwall, etc.

Visualization (the last five steps): several hours or days

Start over if any parameter values are not set appropriately!

SLIDE 6

Challenges in Modern Sciences

  • BIG DATA: from terabytes (T) to petabytes (P), exabytes (E), zettabytes (Z), yottabytes (Y), and beyond…
  • Simulation
    • Astrophysics, climate modeling, combustion research, etc.
  • Experimental
    • Spallation Neutron Source, Large Hadron Collider, etc.
  • Observational
    • Large-scale sensor networks, astronomical image data (Dark Energy Camera), etc.

No matter which type of data is considered, we need an end-to-end workflow solution for data transfer, processing, and analysis!

SLIDE 7

Big-data Scientific Workflows

  • Require massively distributed resources
    • Hardware
      • Computing facilities, storage systems, special rendering engines, display devices (tiled display, powerwall, etc.), network infrastructure, etc.
    • Software
      • Domain-specific data analytics/processing tools, programs, etc.
  • Data types
    • Real-time, archival
  • Feature different complexities
    • Simple case: linear pipeline (a special case of DAG)
    • Complex case: DAG-structured graph
  • Support different application types
    • Interactive: minimize total end-to-end delay for fast response
    • Streaming: maximize frame rate to achieve smooth data flow

SLIDE 8

Ultimate Goals

  • Support distributed workflows in heterogeneous environments
  • Optimize workflow performance to meet various user requirements
    Ø Delay, throughput, reliability, etc.
    Ø Remote visualization, online computational monitoring and steering, etc.
  • Make the best use of computing and networking resources

SLIDE 9

Solution: A Three-layer Architecture


SLIDE 10

Enabling Technologies

  • Three layers
    • Top: abstract scientific workflow
    • Middle: virtual overlay network (grid, cloud)
    • Bottom: physical high-performance network
  • Top and bottom layers meet at the middle layer
    • From bottom to middle: resource abstraction
      • Bandwidth scheduling
      • Performance modeling and prediction
    • From top to middle: workflow mapping
      • Optimization: where to execute modules?
  • Workflow execution
    • Actual data transfer: transport control
    • Actual module running: job scheduling
SLIDE 11

Networking Requirements

  • Provision dedicated channels to meet different transport objectives
    • High bandwidths
      • Multiples of 10 Gbps toward terabit networking
      • Support bulk data transfers
    • Stable bandwidths
      • 100s of Mbps
      • Support interactive control operations
  • Why not the Internet?
    • High bandwidths only in the backbone (the last-mile problem)
    • Packet-level resource sharing
    • Best-effort IP routing
    • TCP: hard to sustain 10s of Gbps or to stabilize

SLIDE 12

An Overview of the TCP/IP Stack

SLIDE 13

Software-Defined Networking

  • The Concept of Virtualization
  • Virtualization of Computing
  • Virtualization of Networking
  • Software-Defined Network
  • Possible Directions


SLIDE 14

Concept of Virtualization

  • Decoupling HW/SW
  • Abstraction and layering
  • Using, demanding, but not owning or configuring
  • Resource pool: flexible to slice, resize, combine, and distribute
  • A degree of automation by software

SLIDE 15

Benefits of Virtualization

  • An analogy: owning a huge house
    • Real estate, immovable property
    • Does not generate cash or income
  • How to gain more profit?
    • Divide this huge house into suites, and RENT them to people!
  • Renting suites: using but not owning
  • Transform a static investment into a cash generator!!!

SLIDE 16

Virtualization of Computing

  • Partitioning one physical machine into virtual instances
    • Running concurrently, sharing resources
  • Hypervisor: Virtual Machine Monitor (VMM)
    • A software layer that presents an abstraction of the physical resources
    • The key factor of virtualization

SLIDE 17

Networks are Hard to Manage

  • Operating a network is expensive
    • More than half the cost of a network
    • Yet, operator error causes most outages
  • Buggy software in the equipment
    • Routers with 20+ million lines of code
    • Cascading failures, vulnerabilities, etc.
  • The network is “in the way”
    • Especially a problem in data centers
    • … and home networks

SLIDE 18

Traditional Computer Networks

Management plane: collects measurements and configures the equipment.

SLIDE 19

Software Defined Networking (SDN)

  • Logically-centralized control: smart, slow
  • Switches: dumb, fast
  • API to the data plane (e.g., OpenFlow)

SLIDE 20

A Unified Control Plane

[Figure: networking applications (bandwidth-on-demand, dynamic optical bypass, unified recovery, traffic engineering, application-aware QoS) run on a network operating system; the OpenFlow protocol, a switch abstraction, and a virtualization (slicing) plane provide choices over the underlying data-plane switching: packet switches, wavelength switches, time-slot switches, multi-layer switches, and packet & circuit switches]

SLIDE 21

Architecture

[Figure: SDN architecture: Onix / a network OS is a distributed system that maintains the Network Information Base, distributes and configures real network state, and provides abstractions (logical states) to the control plane / applications; a network hypervisor maps the logical forwarding plane onto the physical switches via OpenFlow and an API]

SLIDE 22

Switch Forwarding Pipeline

As packets/flows traverse the network, they move in both the logical and the physical forwarding plane; the logical context follows them.

SLIDE 23

Data-Plane: Simple Packet Handling

  • Simple packet-handling rules
    • Pattern: match packet header bits
    • Actions: drop, forward, modify, send to controller
    • Priority: disambiguate overlapping patterns
    • Counters: #bytes and #packets
  • Example rules (see the sketch below):
    1. src=1.2.*.*, dest=3.4.5.* → drop
    2. src=*.*.*.*, dest=3.4.*.* → forward(2)
    3. src=10.1.2.3, dest=*.*.*.* → send to controller
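A minimal sketch of how such a flow table behaves, assuming Python, the three example rules above, and a drop-on-miss default (an assumption); this is only an illustration of priority-based wildcard matching, not a real OpenFlow data path, and counters are omitted:

```python
def parse(pattern):
    """'1.2.*.*' -> (1, 2, None, None); '*' octets are wildcards."""
    return tuple(None if o == '*' else int(o) for o in pattern.split('.'))

def matches(pat, addr):
    """A pattern octet matches when it is a wildcard or equal."""
    return all(p is None or p == a for p, a in zip(pat, addr))

# (priority, src pattern, dst pattern, action); lower number = higher priority
flow_table = [
    (1, parse('1.2.*.*'),  parse('3.4.5.*'), 'drop'),
    (2, parse('*.*.*.*'),  parse('3.4.*.*'), 'forward(2)'),
    (3, parse('10.1.2.3'), parse('*.*.*.*'), 'send to controller'),
]

def handle(src, dst):
    """Return the action of the highest-priority matching rule."""
    for _, s, d, action in sorted(flow_table):
        if matches(s, parse(src)) and matches(d, parse(dst)):
            return action
    return 'drop'  # table miss; the default policy here is an assumption

print(handle('1.2.9.9', '3.4.5.6'))   # drop (rule 1 wins over rule 2)
print(handle('9.9.9.9', '3.4.7.7'))   # forward(2)
print(handle('10.1.2.3', '8.8.8.8'))  # send to controller
```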

SLIDE 24

Unifies Different Kinds of Boxes

  • Router
    • Match: longest destination IP prefix
    • Action: forward out a link
  • Switch
    • Match: destination MAC address
    • Action: forward or flood
  • Firewall
    • Match: IP addresses and TCP/UDP port numbers
    • Action: permit or deny
  • NAT
    • Match: IP address and port
    • Action: rewrite address and port

SLIDE 25

Controller: Programmability

[Figure: the controller application runs on a network OS; events from switches (topology changes, traffic statistics, arriving packets) flow up to it, and commands to switches ((un)install rules, query statistics, send packets) flow down]

SLIDE 26

Example OpenFlow Applications

  • Dynamic access control
  • Seamless mobility/migration
  • Server load balancing
  • Network virtualization
  • Using multiple wireless access points
  • Energy-efficient networking
  • Adaptive traffic monitoring
  • Denial-of-Service attack detection

See http://www.openflow.org/videos/


SLIDE 27

Example: Dynamic Access Control

  • Inspect the first packet of a connection
  • Consult the access control policy
  • Install rules to block or route traffic (see the sketch below)
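A toy end-to-end sketch of this reactive pattern, assuming Python and a made-up policy table (BLOCKED_SRCS is hypothetical); it simulates the switch/controller split rather than using any real controller framework:

```python
BLOCKED_SRCS = {'10.0.0.66'}   # hypothetical access-control policy
flow_table = {}                # (src, dst) -> action, i.e., installed rules

def controller_packet_in(src, dst):
    """Controller: consult the policy and decide the rule to install."""
    return 'drop' if src in BLOCKED_SRCS else 'forward'

def switch_handle(src, dst):
    """Switch: first packet of a connection goes to the controller (slow
    path); the installed rule handles all later packets (fast path)."""
    key = (src, dst)
    if key not in flow_table:
        flow_table[key] = controller_packet_in(src, dst)
    return flow_table[key]

print(switch_handle('10.0.0.5', '10.0.0.1'))   # forward (via controller)
print(switch_handle('10.0.0.5', '10.0.0.1'))   # forward (cached rule)
print(switch_handle('10.0.0.66', '10.0.0.1'))  # drop (policy)
```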

SLIDE 28

Seamless Mobility/Migration

  • See the host send traffic at its new location
  • Modify rules to reroute the traffic

SLIDE 29

Server Load Balancing

  • Pre-install the load-balancing policy
  • Split traffic based on source IP (e.g., rules matching src=0* and src=1* send clients to different server replicas)

SLIDE 30

Network Virtualization

Partition the space of packet headers so that each slice is managed by its own controller (Controller #1, #2, #3).

SLIDE 31

OpenFlow in the Wild

  • Open Networking Foundation
    • Google, Facebook, Microsoft, Yahoo, Verizon, Deutsche Telekom, and many other companies
  • Commercial OpenFlow switches
    • HP, NEC, Quanta, Dell, IBM, Juniper, …
  • Network operating systems
    • NOX, Beacon, Floodlight, Nettle, ONIX, POX, Frenetic
  • Network deployments
    • Eight campuses, and two research backbone networks
    • Commercial deployments (e.g., Google backbone)

SLIDE 32

Controller Delay and Overhead

  • Controller is much slower than the switch
  • Processing packets leads to delay and overhead
  • Need to keep most packets in the “fast path”


SLIDE 33

Distributed Controller

For scalability and reliability, run multiple controllers (each a controller application on its own network OS), and partition and replicate state across them.

SLIDE 34

High-performance Networks

  • Production and testbed networks
    • UltraScience Net
    • ESnet OSCARS
      • Offers MPLS tunnels and VLAN virtual circuits
    • Internet2 ION
      • Offers MPLS tunnels and VLAN virtual circuits
    • UCLP
      • User Controlled Light Paths
    • CHEETAH
      • Circuit-switched High-speed End-to-End Transport Architecture
    • DRAGON
      • Dynamic Resource Allocation via GMPLS Optical Networks

SLIDE 35

UltraScience Net – In a Nutshell

  • Experimental Network Research Testbed


SLIDE 36

Control Plane

Responsible for:
  1. Reserving link bandwidths
  2. Setting up end-to-end network paths
  3. Releasing resources when tasks are completed

SLIDE 37

Bandwidth Scheduling


SLIDE 38

Bandwidth Scheduler

  • Central component of the control plane
  • Computes paths and allocates link bandwidths
  • Scheduling in USN (a widest-path sketch follows)
    • Fixed slot: start time, end time, target BW
      • Extension of Dijkstra’s algorithm
    • All slots: duration, target BW
      • Extension of the Bellman-Ford algorithm
  • Both are solvable by polynomial-time algorithms
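For the fixed-slot case, the path computation can be sketched as a widest-path variant of Dijkstra's algorithm over each link's residual bandwidth within the requested slot; a minimal sketch follows, with a topology and numbers that are illustrative assumptions, not USN's:

```python
import heapq

def widest_path(graph, src, dst):
    """graph: {u: {v: residual_bw_in_slot}}. Returns (bottleneck_bw, path)
    maximizing the minimum residual bandwidth along the path."""
    best = {src: float('inf')}
    prev = {}
    heap = [(-float('inf'), src)]            # max-heap on bottleneck bandwidth
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw = -neg_bw
        if u == dst:                          # first pop of dst is optimal
            path = [u]
            while u != src:
                u = prev[u]
                path.append(u)
            return bw, path[::-1]
        for v, cap in graph.get(u, {}).items():
            b = min(bw, cap)                  # bottleneck through this link
            if b > best.get(v, 0):
                best[v], prev[v] = b, u
                heapq.heappush(heap, (-b, v))
    return 0, []

net = {'s': {'a': 10, 'b': 4}, 'a': {'d': 7}, 'b': {'d': 9}, 'd': {}}
bw, path = widest_path(net, 's', 'd')
print(bw, path)  # 7 ['s', 'a', 'd']: grant the request if 7 >= target BW
```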

SLIDE 39

Network Model for Bandwidth Scheduling

  • Graph: G = (V, E)
  • Link bandwidth
    • Piecewise-constant (segmented constant) functions of time
    • Time-bandwidth (TB) list: (t[i], t[i+1], b[i])
    • Aggregated TB (ATB) for all links

[Figure: residual bandwidth of a link as a step function over time slots t[0], t[1], t[2], t[3]]

SLIDE 40

4 Types of Scheduling Problems (TON’13)

  • Given: G = (V, E), ATB, source vs, destination vd, data size δ
  • Objective: minimize the data transfer end time
  • Fixed Path with Fixed Bandwidth (FPFB)
    • Compute a fixed path from vs to vd with a constant (fixed) bandwidth
  • Fixed Path with Variable Bandwidth (FPVB)
    • Compute a fixed path from vs to vd with varying bandwidths across multiple time slots (see the sketch below)
  • Variable Path with Fixed Bandwidth (VPFB)
    • Compute a set of paths from vs to vd at different time slots with the same (fixed) bandwidth
  • Variable Path with Variable Bandwidth (VPVB)
    • Compute a set of paths from vs to vd at different time slots with varying bandwidths across multiple time slots
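As a sketch of the FPVB idea on a single, already-chosen path (the full problem also selects the path): treat the path's aggregated TB list as a piecewise-constant bandwidth function and consume it slot by slot until δ units of data have moved. Numbers are illustrative assumptions:

```python
def fpvb_end_time(atb, delta):
    """atb: list of (t_start, t_end, bandwidth) slots in time order.
    Returns the earliest finish time for transferring delta, or None."""
    moved = 0.0
    for t0, t1, b in atb:
        can = b * (t1 - t0)             # data movable within this slot
        if can > 0 and moved + can >= delta:
            return t0 + (delta - moved) / b   # completes mid-slot
        moved += can
    return None                          # not enough capacity in the horizon

atb = [(0, 10, 2.0), (10, 20, 8.0), (20, 30, 4.0)]   # (t[i], t[i+1], b[i])
print(fpvb_end_time(atb, 60.0))  # 20 units by t=10, 40 more at rate 8 -> 15.0
```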

SLIDE 41

VPFB & VPVB

  • Multiple paths are used in sequential order
  • Path switching incurs a delay (overhead) τ
    • Negligible switching delay (τ = 0): VPFB-0, VPVB-0
    • Non-negligible switching delay (τ > 0): VPFB-1, VPVB-1
  • When τ is large enough, VPFB-1 reduces to FPFB and VPVB-1 to FPVB

SLIDE 42

Problem Features

  • FPFB is the most stringent; VPVB is the most flexible
  • FPFB and VPFB restrict the bandwidth
    • Not always optimal to start the data transfer immediately
    • Suited for transport methods with a fixed rate
      • FRTP, PLUT, Tsunami, Hurricane, RBUDP
  • FPVB and VPVB use variable bandwidth
    • Always start immediately
    • Suited for transport methods with a dynamically adapted rate
      • SABUL, RAPID, RUNAT, improved TCP

SLIDE 43

Complexity and Algorithm

  • An optimal algorithm for each problem
  • A heuristic for each NP-complete (NPC) problem

SLIDE 44

Workflow Mapping


SLIDE 45

Cost Models

A computing workflow consists of $m$ modules $w_0, w_1, \ldots, w_{m-1}$: module $w_j$ has computational complexity $\lambda(w_j)$, receives data of size $z_{j-1}$, and sends data of size $z_j$. The computer network is modeled as an arbitrary directed graph $G = (V, E)$ with $|V| = n$ computing nodes $v_0, v_1, \ldots, v_{n-1}$ interconnected by directed communication links.

  • Node $v_k$: processing power $p_k$, failure rate $f_k$
  • Link $l_{h,k}$ from $v_h$ to $v_k$: bandwidth $b_{h,k}$, minimum link delay $d_{h,k}$, failure rate $f_{h,k}$
  • Message transmission time of data $z$ over link $l_{i,j}$: $z / b_{i,j} + d_{i,j}$
  • Node computing time of module $w_i$ on node $v_j$: $\lambda(w_i) \cdot z_{i-1} / p_j$

Objectives: map modules to nodes to achieve minimum end-to-end delay (MED) or maximum frame rate (MFR).
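The two cost terms can be written directly as code. This is a small sketch with illustrative units (bytes, ops/byte, ops/s, bytes/s, seconds); A and B are the resource-sharing factors defined on the next slide, defaulting to 1 for dedicated resources:

```python
def t_comp(lam_w, z_in, p_v, A=1.0):
    """Computing time of a module: complexity x input size x share / power."""
    return lam_w * z_in * A / p_v

def t_tran(z, b_hk, d_hk, B=1.0):
    """Transmission time over a link: size x share / bandwidth + min delay."""
    return z * B / b_hk + d_hk

# A module with complexity 2.0 ops/byte on a 1e9 ops/s node, 1e8-byte input;
# then a 1e8-byte message over a 1e8 B/s link with 10 ms minimum delay.
print(t_comp(2.0, 1e8, 1e9))    # 0.2 s
print(t_tran(1e8, 1e8, 0.010))  # 1.01 s
```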

SLIDE 46
  • Workflow mapping and execution conditions
    • Single-node mapping: each module is mapped to exactly one node, with $x_{w,v} = 1$ if $w$ is mapped to $v$ and $x_{w,v} = 0$ otherwise:
      $\sum_{v \in V_c} x_{w,v} = 1, \quad \forall w \in V_w$
    • Module execution precedence: a module starts only after its incoming transfers finish:
      $t_s(w_j) \ge t_f(e_{i,j}), \quad \forall w_j \in V_w, \; e_{i,j} \in E_w$
    • Data transfer precedence: a transfer starts only after the sending module finishes:
      $t_s(e_{i,j}) \ge t_f(w_i), \quad \forall w_i \in V_w, \; e_{i,j} \in E_w$

SLIDE 47
  • Fair node sharing: the computing time of module $w$ on node $v$ is scaled by the node-sharing factor $A(w,v)$:
    $T_{\mathrm{comp}}(w, v) = \dfrac{\lambda(w) \cdot z_w \cdot A(w,v)}{p_v}, \quad \forall w \in V_w, \; v \in V_c$
    where $A(w,v) = \dfrac{1}{t_f(w) - t_s(w)} \displaystyle\int_{t_s(w)}^{t_f(w)} \alpha_v(t)\, dt$ and
    $\alpha_v(t) = \sum_{w' \in V_w : (t_f(w') - t)(t - t_s(w')) \ge 0} x_{w',v}$
    counts the modules concurrently running on $v$ at time $t$.
  • Fair link sharing: analogously, the transmission time of workflow edge $e_{i,j}$ over link $l_{h,k}$ is scaled by the link-sharing factor $B(e_{i,j}, l_{h,k})$:
    $T_{\mathrm{tran}}(e_{i,j}, l_{h,k}) = \dfrac{z_{i,j} \cdot B(e_{i,j}, l_{h,k})}{b_{h,k}} + d_{h,k}, \quad \forall e_{i,j} \in E_w, \; l_{h,k} \in E_c$
    where $B(e_{i,j}, l_{h,k}) = \dfrac{1}{t_f(e_{i,j}) - t_s(e_{i,j})} \displaystyle\int_{t_s(e_{i,j})}^{t_f(e_{i,j})} \beta_{h,k}(t)\, dt$ and
    $\beta_{h,k}(t) = \sum_{e_{i,j} \in E_w : (t_f(e_{i,j}) - t)(t - t_s(e_{i,j})) \ge 0} x_{w_i,v_h} \cdot x_{w_j,v_k}$
    counts the transfers concurrently using $l_{h,k}$ at time $t$.

SLIDE 48
  • Performance Metrics
    • End-to-end delay (ED): with the critical path (CP) of the workflow mapped to a path $P$ of $q$ nodes, sum the computing time at each node and the transmission time on each connecting link:
      $T_{\mathrm{ED}}(P) = \sum_{i=1}^{q} \dfrac{\lambda(w_{g[i]}) \cdot z_{g[i]-1} \cdot A(w_{g[i]}, v_{P[i]})}{p_{P[i]}} + \sum_{i=1}^{q-1} \left( \dfrac{z_{g[i]} \cdot B(e_{g[i],g[i+1]}, l_{P[i],P[i+1]})}{b_{P[i],P[i+1]}} + d_{P[i],P[i+1]} \right)$
      where $g[i]$ indexes the module placed on the $i$-th node of $P$.
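A small sketch evaluating $T_{\mathrm{ED}}$ for a concrete linear-pipeline mapping, assuming dedicated resources (A = B = 1) and illustrative values:

```python
def end_to_end_delay(mods, powers, links):
    """mods: [(lam, z_in, z_out)] per module in pipeline order.
    powers: [p_v] of the node hosting each module (same length as mods).
    links: [(bandwidth, min_delay)] between consecutive hosts (one fewer)."""
    comp = sum(lam * z_in / p for (lam, z_in, _), p in zip(mods, powers))
    tran = sum(z_out / b + d for (_, _, z_out), (b, d) in zip(mods, links))
    return comp + tran

mods = [(1.0, 8e8, 4e8), (2.0, 4e8, 1e8), (0.5, 1e8, 1e8)]  # lam, z_in, z_out
powers = [1e9, 2e9, 1e9]               # ops/s of the three hosting nodes
links = [(1e9, 0.01), (1e8, 0.05)]     # B/s and seconds for the two hops
print(end_to_end_delay(mods, powers, links))  # 1.25 comp + 1.46 tran = 2.71 s
```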

SLIDE 49
  • Frame rate: the inverse of the global bottleneck (BN), i.e., the slowest computing or transfer stage over the whole mapping:
    $T_{\mathrm{BN}} = \max\!\left( \max_{w_i \in V_w,\, v_j \in V_c} T_{\mathrm{comp}}(w_i, v_j),\; \max_{e_{i,i'} \in E_w,\, l_{j,k} \in E_c} T_{\mathrm{tran}}(e_{i,i'}, l_{j,k}) \right)$
    where the maxima range over the nodes and links actually used by the mapping.
  • Overall failure rate: with independent node and link failures,
    $F = 1 - \prod_{w_i \text{ mapped on } v_h} (1 - f_h) \cdot \prod_{e_{i,j} \text{ mapped on } l_{h,k}} (1 - f_{h,k})$
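A sketch of both streaming metrics on illustrative numbers: the sustainable frame rate is the inverse of the slowest stage, and the overall failure rate combines independent node and link failure probabilities:

```python
stage_times = [0.20, 0.41, 0.05, 1.05]  # per-stage comp/tran times (s)
node_f = [0.01, 0.02, 0.01]             # failure rates of the nodes used
link_f = [0.001, 0.005]                 # failure rates of the links used

t_bn = max(stage_times)                 # global bottleneck time
frame_rate = 1.0 / t_bn

ok = 1.0
for f in node_f + link_f:               # probability everything survives
    ok *= (1.0 - f)
overall_failure = 1.0 - ok              # F = 1 - prod(1 - f)

print(frame_rate)       # ~0.95 frames/s
print(overall_failure)  # ~0.045; the mapping is feasible only if F <= F_bar
```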

SLIDE 50
  • Objective Functions
    • Minimum End-to-end Delay (MED):
      $\min_{\text{all possible mappings}} T_{\mathrm{ED}}, \quad \text{such that } F \le \bar{F}$
    • Maximum Frame Rate (MFR), i.e., minimize the bottleneck time:
      $\min_{\text{all possible mappings}} T_{\mathrm{BN}}, \quad \text{such that } F \le \bar{F}$

SLIDE 51

Pipeline Mapping

  • A special case of DAG mapping
  • Problem categories
    • MED
      Ø No Node Reuse (MED-NNR)
      Ø Contiguous Node Reuse (MED-CNR)
      Ø Arbitrary Node Reuse (MED-ANR)
    • MFR
      Ø No Node Reuse or Share (MFR-NNR)
      Ø Contiguous Node Reuse and Share (MFR-CNR)
      Ø Arbitrary Node Reuse and Share (MFR-ANR)

[Figure: a pipeline M0, M1, …, Mn-1 mapped from source vs to destination vd over nodes v1, …, vn-2 under no node reuse, contiguous node reuse, and arbitrary node reuse]

SLIDE 52

Problem Category and Complexity

  Objective Function         | No Node Reuse | Contiguous Node Reuse | Arbitrary Node Reuse
  ---------------------------|---------------|-----------------------|------------------------
  Minimum End-to-end Delay   | NP-complete   | NP-complete           | Polynomial (Dyn. Prog.)
  Maximum Frame Rate         | NP-complete   | NP-complete           | NP-complete

  • MED-ANR is polynomially solvable
    • Dynamic programming-based solution (a sketch follows)
  • MED/MFR-NNR/CNR are NP-complete
    • Reduce from DISJOINT-CONNECTING-PATH (DCP)
  • MFR-ANR is NP-complete
    • Reduce from Widest-path with Linear Capacity Constraints (WLCC)
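A sketch of the dynamic program behind MED-ANR for a linear pipeline: T[v] holds the best delay with the current module finishing on node v, and arbitrary node reuse lets consecutive modules share a node at zero transfer cost. The graph, complexities, and sizes below are illustrative assumptions:

```python
def med_anr(m, nodes, links, lam, z, p):
    """m pipeline modules; links[(u, v)] = (bandwidth, min_delay);
    lam[j]: complexity of module j; z[j]: data size entering module j
    (the j-1 -> j transfer carries z[j]); p[v]: node processing power."""
    T = {v: lam[0] * z[0] / p[v] for v in nodes}    # place module 0 anywhere
    for j in range(1, m):
        T2 = {}
        for v in nodes:
            cands = [T[v]]                          # arbitrary reuse: stay put
            for u in nodes:
                if (u, v) in links:
                    b, d = links[(u, v)]
                    cands.append(T[u] + z[j] / b + d)   # move data u -> v
            T2[v] = min(cands) + lam[j] * z[j] / p[v]   # then compute on v
        T = T2
    return min(T.values())                          # best final placement

nodes = ['a', 'b']
links = {('a', 'b'): (1e8, 0.01), ('b', 'a'): (1e8, 0.01)}
print(med_anr(3, nodes, links, lam=[1, 4, 1], z=[1e8, 1e8, 1e7],
              p={'a': 1e9, 'b': 4e9}))  # ~0.1275 s, all modules on fast node b
```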

SLIDE 53

From special to general: DAG Mapping (TC’14)

  • Mapping algorithm for MED: RCP-F
    • Recursive Critical Path for MED with a Failure Rate Constraint

SLIDE 54

  • Mapping algorithm for MFR: LDP-F
    • Layer-oriented DP algorithm for MFR with a Failure Rate Constraint
    • Sort the DAG-structured workflow in a topological order
    • Map computing modules to network nodes on a layer-by-layer basis, considering:
      • Module dependency in the workflow
      • Network connectivity
      • Node and link failure rates
    • The DP table entry $T_{i,j}$ (module $w_j$ mapped to node $v_i$) is filled layer by layer: it takes the best bottleneck over the mappings of $w_j$'s predecessors $pre(w_j)$, where each candidate bottleneck is the maximum of the predecessor's table entry $T_{h,u}$, the incoming transfer time $z_{u,j} \cdot B(e_{u,j}, l_{h,i}) / b_{h,i} + d_{h,i}$, and the computing time $\lambda(w_j) \cdot z_j \cdot A(w_j, v_i) / p_i$, subject to the accumulated failure rate $1 - \prod (1 - f_h) \cdot \prod (1 - f_{h,i})$ staying within the constraint $\bar{F}$
SLIDE 55

[Figure: LDP-F illustration: a layered workflow (modules w0, …, wm-1 grouped into layers 0 through l-1 by topological sorting) mapped onto a computer network (vs = v0 through vd = vn-1), and the corresponding DP table T with one column of entries per layer]

SLIDE 56

A Prototype System: Distributed Remote Intelligent Visualization Environment (DRIVE)

Two examples in the visualization of large-scale scientific applications:
  • Jet air flow dynamics (pressure, raycasting)
  • TSI explosion (density, raycasting)

SLIDE 57

A Production System (JOGC’13): Scientific Workflow Automation and Management Platform (SWAMP)


SLIDE 58

Use Case: Spallation Neutron Source (SNS)

SLIDE 59

SNS Workflow (abstract)


SLIDE 60

SNS Workflow (concrete)


SLIDE 61


SLIDE 62

SNS Workflow Results
