

SLIDE 1

Towards a Smart Data Transfer Node

Zhengchun Liu, Rajkumar Kettimuthu, Ian Foster, Peter H. Beckman

Presented by: Zhengchun Liu
November 12, 2017, Denver, CO
The 4th Innovating the Network for Data Intensive Science (INDIS) workshop

SLIDE 2

Motivation

SLIDE 3-6

Motivation

Computer systems are getting ever more sophisticated, and human-led, empirical approaches to system optimization are not the most efficient way to realize the full potential of these modern, complex high-performance computing systems:

๏ The effect of each parameter is neither straightforward nor intuitive.
๏ The system is dynamic; it is practically impossible to design a one-size-fits-all rule.
๏ The parameter space is very large and very time-consuming to explore.
๏ Environments and platforms differ.

Data transfer nodes (DTNs) are compute systems dedicated to wide-area data transfers in distributed science environments.

Inspired by work from Google DeepMind on using reinforcement learning to play games (e.g., AlphaGo, Atari), we use reinforcement learning methods to discover the "just right" control parameters for data transfer nodes in a dynamic environment.

SLIDE 7-8

Motivation

[Figure] Aggregate incoming transfer rate vs. total concurrency (i.e., the instantaneous number of GridFTP server instances) at two heavily used endpoints, with a Weibull curve fitted.*

* Z. Liu, P. Balaprakash, R. Kettimuthu, I. Foster. Explaining wide area data transfer performance. HPDC'17

Luckily, the optimal operating point of these two endpoints is almost fixed. However, the optimal operating point of most endpoints is dynamic because of continuously changing external load.
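To make the curve-fitting idea concrete, here is a minimal sketch, assuming NumPy and SciPy, that fits a Weibull-shaped curve to (concurrency, aggregate rate) samples and reads off the peak. The sample numbers and the exact functional form are illustrative assumptions, not the measurements or the fit from the HPDC'17 paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_rate(c, k, lam, a):
    # Weibull-PDF-shaped curve: rises to a peak, then decays as concurrency grows
    return a * (k / lam) * (c / lam) ** (k - 1) * np.exp(-((c / lam) ** k))

# Illustrative (concurrency, aggregate Gbps) samples; not measured values
conc = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
rate = np.array([0.6, 1.2, 2.1, 3.0, 3.3, 2.7, 1.8])

(k, lam, a), _ = curve_fit(weibull_rate, conc, rate, p0=(2.0, 16.0, 50.0))
grid = np.linspace(1, 64, 256)
optimum = grid[np.argmax(weibull_rate(grid, k, lam, a))]
print(f"estimated optimal concurrency ~ {optimum:.1f}")
```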

SLIDE 9-21

Reinforcement Learning

[Idea] An agent interacts with an environment, which provides its current state and a numeric reward signal after each action the agent takes.
[Goal] Learn how to take actions in order to maximize reward.

[Diagram: the Agent (Controller) observes State St from the Environment / Object, takes Action At, receives Reward Rt, and learns from the outcome.]

St: the state of the environment (control object) at any given time t
At: the corresponding optimal action at any given time t
Rt: the actual reward from At, i.e., what we want to optimize

Methods: Q-learning, Policy Gradient

https://en.wikipedia.org/wiki/Q-learning
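Since the slide points to Q-learning, a minimal tabular Q-learning sketch in Python follows. The environment interface, its states and actions, and the hyperparameters are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration

def q_learning(env, actions, episodes=1000):
    """env is any object exposing reset() -> state and step(action) -> (state, reward, done)."""
    Q = defaultdict(float)               # Q[(state, action)] -> expected return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore occasionally, otherwise act greedily
            if random.random() < EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r if done else r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s = s2
    return Q
```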

SLIDE 22

Smart Data Transfer Node

[Architecture diagram: a user submits a pool of files to a file transfer tool (e.g., GridFTP). The tool queries the Knowledge Engine (request/response), which observes the environment's current state (CPU load, RAM load, network load, storage load, ...), chooses an action (chunk size, concurrency, parallelism, ...), and learns from the achieved performance (the reward) as data flows from the file system over the network to the destination DTN.]

Workflow

1. A file transfer tool requests a file to transfer from the KE.
2. The KE checks the current DTN state and
3. responds to the transfer tool with a file chunk and the corresponding optimal transfer parameters (the steering action).
4. The transfer tool transfers the associated chunk with those parameters and monitors the aggregate DTN throughput during the transfer.
5. Once the transfer completes, the DTN's average aggregate throughput is reported to the KE as the reward for its action.
6. Based on the reward (encouraging or discouraging), the KE updates its internal model parameters to improve its decision policy.
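A minimal sketch of this six-step control loop in Python; the `tool` and `ke` interfaces are hypothetical names invented for illustration, not the actual GridFTP or Knowledge Engine APIs.

```python
def transfer_pool(tool, ke, pool):
    """Drive a pool of files through the six-step workflow above (hypothetical API)."""
    while pool:
        # (1) the transfer tool requests the next file chunk from the KE
        state = ke.check_dtn_state()              # (2) CPU / RAM / network / storage load
        chunk, action = ke.respond(pool, state)   # (3) file chunk + optimal parameters
        tool.transfer(chunk, action)              # (4) transfer and monitor throughput
        reward = tool.aggregate_throughput()      # (5) average aggregate throughput as reward
        ke.learn(state, action, reward)           # (6) update the decision policy
        pool.remove(chunk)
```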

SLIDE 23

State, Action and Reward

Context: a high-performance wide-area data transfer scheme

SLIDE 24

State, Action and Reward

State (St):
๏ CPU usage (number of GridFTP instances here);
๏ Total number of TCP streams on the DTN;
๏ The aggregate ingress and egress throughput of the DTN's network interface card;
๏ The aggregate disk read and write throughput.

Action (At):
๏ Whether to start transferring a new file chunk (True/False); this controls the total concurrency.
๏ The parallelism used to transfer the file chunk.
๏ The size of the file chunk to transfer; this controls the transfer duration (i.e., command frequency).

Reward (Rt):
๏ The aggregated transfer throughput (of all transfers).
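To make these spaces concrete, a small Python sketch follows; the field names are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class DTNState:
    gridftp_instances: int      # CPU usage proxy: number of GridFTP instances
    tcp_streams: int            # total number of TCP streams on the DTN
    nic_ingress_gbps: float     # aggregate NIC ingress throughput
    nic_egress_gbps: float      # aggregate NIC egress throughput
    disk_read_gbps: float       # aggregate disk read throughput
    disk_write_gbps: float      # aggregate disk write throughput

@dataclass
class TransferAction:
    start_new_chunk: bool       # True/False; controls the total concurrency
    parallelism: int            # TCP streams used for this chunk
    chunk_size_bytes: int       # controls transfer duration / command frequency

Reward = float                  # aggregated transfer throughput of all transfers
```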

SLIDE 25

Knowledge Engine

Reinforcement learning model architecture

[Diagram: an actor neural network maps the state St to an action At (concurrency, parallelism, chunk size); the file chunk is transferred with that action, and the reward is the average total outgoing throughput. A critic neural network estimates Q(St, At); target actor and critic networks provide Q(St+1, At+1) for the learning target.]

The critic weights are updated by minimizing the loss $L(\theta^Q)$ via $\nabla_{\theta^Q} L$:

$L(\theta^Q) = \frac{1}{N}\sum_t \left[ Q(S_t, A_t) - y_t \right]^2, \qquad y_t = r(S_t, A_t) + \gamma\, Q(S_{t+1}, A_{t+1})$

The actor weights are updated using the policy gradient:

$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_t \nabla_{A_t} Q(S_t, A_t)\, \nabla_{\theta^\mu} \mu(S_t)$

The target networks are updated softly:

$\theta^{\mu T} \leftarrow \tau\,\theta^\mu + (1 - \tau)\,\theta^{\mu T}, \qquad \theta^{Q T} \leftarrow \tau\,\theta^Q + (1 - \tau)\,\theta^{Q T}$

* D. Silver et al. Deterministic Policy Gradient Algorithms. ICML'14
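A minimal sketch of this actor-critic update, assuming PyTorch; the layer sizes, optimizers, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 6, 3     # e.g., DTN load metrics -> (concurrency, parallelism, chunk size)
GAMMA, TAU = 0.99, 0.001

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(S, A, R, S_next):
    """One update from a batch: S (N,STATE_DIM), A (N,ACTION_DIM), R (N,1), S_next (N,STATE_DIM)."""
    # Critic: minimize L(theta_Q) = mean[(Q(S_t,A_t) - y_t)^2],
    # with y_t = r + gamma * Q_target(S_{t+1}, mu_target(S_{t+1}))
    with torch.no_grad():
        y = R + GAMMA * critic_tgt(torch.cat([S_next, actor_tgt(S_next)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([S, A], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e., ascend Q(S, mu(S))
    actor_loss = -critic(torch.cat([S, actor(S)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta_T <- tau*theta + (1 - tau)*theta_T
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - TAU).add_(TAU * p.data)
```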

SLIDE 26-27

Results and discussion

Reinforcement learning model accuracy versus the DTN's aggregated throughput (i.e., the reward) in a dedicated environment.

Effectiveness of the knowledge engine (KE) in a dedicated environment: DTN performance increases as the KE's prediction accuracy improves (64 iterations per epoch). Model accuracy is measured by the critic loss $L(\theta^Q) = \frac{1}{N}\sum_t \left[ Q(S_t, A_t) - y_t \right]^2$.

It works! The knowledge engine is able to find the optimal operating point and keep the DTN working in the optimal operating region.

SLIDE 28

Results and discussion

Heuristic configuration (2.040 Gbps) Knowledge Engine configuration (2.043 Gbps)

Experiment in shared environment (adding artificial, reproducible external load to storage)

SLIDE 29-32

Results and discussion

Overhead issue

๏ GridFTP does not support dynamic concurrency and parallelism.
๏ We have to restart GridFTP to apply new parameters.
๏ There is an overhead to changing parameters.

[Figure: throughput over time, showing a start-up ramp between steady-state periods after each parameter change.]

Knowledge Engine configuration, adjusted to remove overheads: 2.273 Gbps

With the knowledge engine, we get about an 11.3% improvement compared with the heuristic configuration.

[Figure: cumulative distribution of throughput.]

SLIDE 33

Conclusion

The knowledge engine that powers a conventional data transfer node with smartness is:

๏ Fully unsupervised; it does not need labeled historical data.
๏ Automatic; it changes parameters according to the state of the environment.
๏ Trained online; it is self-optimizing.
๏ Suitable for any deployment, without a specialist.

SLIDE 34-35

Future work

๏ Tuning more parameters;
๏ Testing in a practical environment;
๏ Embedding in distributed workflows;
๏ A smart autonomous science ecosystem.

SLIDE 36

Future work

[Diagram: source and destination DTNs, each paired with its own Knowledge Engine (KE) and storage, exchange control and data traffic across the network. Candidate KE actions include compress/uncompress, pacing, and checksumming, plus adaptive routing in the network, chosen per scenario: a network bottleneck, a destination bottleneck, or a high chance of corruption.]

SLIDE 37

Future work

[Diagram: a smart autonomous science ecosystem organized into Fabric, Perception, Processing, Collective, and Application layers. Each fabric resource (storage, data transfer node, compute (HPC, cloud), network, instrument) is monitored and steered by its own Knowledge Engine (lightweight machine learning) running on edge computing. The knowledge engines exchange data, control, status, and intent over P2P communication, feeding science workflows at the application layer.]

SLIDE 38

Thank you for your attention!

We also want to THANK:

The U.S. Department of Energy, Office of Science, ASCR, and the program manager Richard Carlson; the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory and the Chameleon project <www.chameleoncloud.org> for providing testbed resources.

Q & A