DtCraft: A High-performance Distributed Execution Engine at Scale
- Dr. Tsung-Wei Huang
Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign, IL, USA
Outline
Streamline the cluster programming
DtCraft system
Leverage your time to produce promising results
Hands-on examples
Q&A
Motivation: A “Hard-coded” Distributed Timer
General design partitions
- Logical, physical, or hierarchical partitions
- Design data are stored in shared storage (e.g., NFS, GPFS)
Single-server multiple-client model
- Server is the centralized coordinator
- Clients exchange boundary timing with the server
[Figure: two-level hierarchical design. The top level has ports PI1, PI2, PI3, and PO1 and instantiates hierarchies M1 and M2, whose boundary pins are M1:PI1, M1:PI2, M1:PO1, M2:PI1, M2:PI2, and M2:PO1.]
Three partitions: top-level, M1, and M2 (given by design teams)
Non-blocking IO
Event-driven programming
Serialization/Deserialization
Huang et al., “A Distributed Timing Analysis Framework for Large Designs,” IEEE/ACM DAC, 2016
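To make the serialization chore in a hard-coded design concrete, here is a minimal sketch of a hand-written, length-prefixed round trip for a record holding a string and an integer. The `Hello` struct and the `serialize`/`deserialize` helpers are hypothetical names chosen for illustration; they are not taken from the cited work or from DtCraft, and the format assumes sender and receiver share the same byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical message type, for illustration only.
struct Hello {
  std::string str;
  std::int32_t id;
};

// Layout: [uint32 length][string bytes][int32 id], all in host byte order.
std::vector<unsigned char> serialize(const Hello& h) {
  const std::uint32_t len = static_cast<std::uint32_t>(h.str.size());
  std::vector<unsigned char> buf(sizeof(len) + len + sizeof(h.id));
  unsigned char* p = buf.data();
  std::memcpy(p, &len, sizeof(len));
  p += sizeof(len);
  std::memcpy(p, h.str.data(), len);
  p += len;
  std::memcpy(p, &h.id, sizeof(h.id));
  return buf;
}

// Returns false if the buffer does not hold exactly one well-formed record.
bool deserialize(const std::vector<unsigned char>& buf, Hello& out) {
  std::uint32_t len = 0;
  if (buf.size() < sizeof(len)) return false;
  std::memcpy(&len, buf.data(), sizeof(len));
  if (buf.size() != sizeof(len) + len + sizeof(out.id)) return false;
  out.str.assign(reinterpret_cast<const char*>(buf.data()) + sizeof(len), len);
  std::memcpy(&out.id, buf.data() + sizeof(len) + len, sizeof(out.id));
  return true;
}
```

Every streamed data type needs a pair like this, plus framing logic on top of non-blocking sockets, which is the kind of boilerplate the hard-coded approach accumulates.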
Programming language
Transparency
Performance
A unified engine to streamline cluster programming
Completely built from the ground up using C++17
Spare yourself the pain of DevOps
High-level C++17-based Stream Graph API
DtCraft Kernel (master, agents, executors)
Network programming, I/O stream, event-driven reactor, resource control, serialization, …
T.-W. Huang, C.-X. Lin, and M. D. F. Wong, “DtCraft: A distributed execution engine for compute-intensive applications,” IEEE/ACM ICCAD, 2017
Express your parallelism in our stream graph model
Generic dataflow at any granularity
Deliver transparent concurrency through the kernel
Automatic workload distribution and message passing
DtCraft website: http://dtcraft.web.engr.illinois.edu/
A general representation of a dataflow
Abstraction over computation and communication
Analogous to the assembly line model
Vertex: storage (goods store)
Stream: processing unit (independent workers)
[Figure: stream graph with two vertices, A and B. Each vertex generates data into a buffer; each stream between A and B is a compute unit that reads the incoming data stream through an istream.]
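The stream graph model can be sketched as a small data structure in plain C++17. This is an illustrative, single-process toy under stated assumptions: `Vertex`, `Stream`, `StreamGraph`, `add_vertex`, `add_stream`, and `send` are hypothetical names, not the DtCraft API, and items are plain `int` values delivered synchronously rather than over a network.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the DtCraft API.
struct Vertex {
  std::string name;
  std::vector<int> storage;  // the "goods store" of the assembly line
};

struct Stream {
  std::size_t from;  // index of the producing vertex
  std::size_t to;    // index of the consuming vertex
  // The "worker": called with the consuming vertex and one incoming item.
  std::function<void(Vertex&, int)> on_data;
};

struct StreamGraph {
  std::vector<Vertex> vertices;
  std::vector<Stream> streams;

  // Adds a vertex with empty storage and returns its index.
  std::size_t add_vertex(std::string name) {
    vertices.push_back(Vertex{std::move(name), {}});
    return vertices.size() - 1;
  }

  // Adds a directed stream between two existing vertices.
  void add_stream(std::size_t from, std::size_t to,
                  std::function<void(Vertex&, int)> on_data) {
    assert(from < vertices.size() && to < vertices.size());
    streams.push_back(Stream{from, to, std::move(on_data)});
  }

  // Delivers one item along every stream leaving vertex `from`.
  void send(std::size_t from, int item) {
    for (auto& s : streams) {
      if (s.from == from) s.on_data(vertices[s.to], item);
    }
  }
};
```

Building the two-vertex graph from the figure amounts to two `add_vertex` calls and two `add_stream` calls, one per direction. In a real engine the callbacks would run in separate processes or containers; here they run inline to keep the sketch short.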
Step 1: Decide the stream graph of your application
Step 2: Specify the data types to stream
Step 3: Define the stream computation callback
Step 4: Attach resources on vertices (optional)
Step 5: Submit
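[Figure: vertices A and B connected by streams, each read through an istream. A runs in Container 1 (1 CPU / 4 GB RAM); B runs in Container 2 (1 CPU / 8 GB RAM).]
Data streamed in each direction: String str; int id;
~$ ./submit --master=host hello-world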
Concurrent Ping-pong
Each end keeps sending a binary value to the other end
Iteration stops once one end has received one hundred 1s
Step 1: stream graph
- Vertices A and B, connected by one stream in each direction
- Each end sends ‘1’ or ‘0’ (random); break at counter ≥ 100
Step 2: bool flag; (same data type in both directions)
Step 3: A→B callback (runs at B, reads from its istream)
[=] (auto& B, auto& is) { Extract bool from is; if received 100: close; else send bool to A; }
Step 3: B→A callback (runs at A, reads from its istream)
[=] (auto& A, auto& is) { Extract bool from is; if received 100: close; else send bool to B; }
Step 4: A’s resource: 1 CPU / 1 GB RAM
Step 4: B’s resource: 1 CPU / 1 GB RAM
Step 5: ./submit --master=127.0.0.1 ping-pong
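The ping-pong logic above can be simulated in a single process with standard C++17. This is a sketch under stated assumptions, not DtCraft code: the in-flight messages live in a `std::queue` instead of network streams, and `Message`, `PingPongResult`, and `run_ping_pong` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <queue>
#include <random>

// One in-flight flag; dest is 0 for end A and 1 for end B.
struct Message {
  int dest;
  bool flag;
};

struct PingPongResult {
  int ones_at_a = 0;  // number of 1s received by A
  int ones_at_b = 0;  // number of 1s received by B
  long messages = 0;  // total messages delivered
};

// Each end answers every received flag with a fresh random flag, and the
// exchange stops as soon as one end has received `threshold` 1s.
PingPongResult run_ping_pong(unsigned seed, int threshold = 100) {
  assert(threshold > 0);
  std::mt19937 rng(seed);
  std::bernoulli_distribution coin(0.5);

  std::queue<Message> inflight;
  inflight.push({1, coin(rng)});  // A's first message, headed to B
  inflight.push({0, coin(rng)});  // B's first message, headed to A

  int ones[2] = {0, 0};
  PingPongResult result;
  while (!inflight.empty()) {
    const Message m = inflight.front();
    inflight.pop();
    ++result.messages;
    if (m.flag) ++ones[m.dest];
    if (ones[m.dest] >= threshold) break;    // "if received 100: close"
    inflight.push({1 - m.dest, coin(rng)});  // reply to the other end
  }
  result.ones_at_a = ones[0];
  result.ones_at_b = ones[1];
  return result;
}
```

Calling `run_ping_pong(42)` returns the per-end counts and the number of delivered messages; exactly one end reaches the threshold, because the loop breaks on the first delivery that does.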
… and more
~$ ./ping-pong
~$ ./submit ping-pong
Two-level hierarchical design (three partitions)
[Figure: three Timer instances, each exposing an API that is driven by an optimization program]
Timer API: report_at, report_slew, report_rat, remove_gate, insert_gate, power_gate, insert_net, connect_pin, ...
[Figure: the two-level hierarchical design (top level with ports PI1, PI2, PI3, PO1; hierarchies M1 and M2) mapped to a stream graph of a User vertex and three timer vertices: Top-level, M1, and M2]
Three timer vertices
One user vertex
Four Linux containers
Six input/output streams (boundary timing and timing commands)
Each container has one OpenTimer
DtCraft: in-context streaming with < 30 lines
Existing framework: out-of-context streaming takes > 300 lines, plus many extra files (Extra.pb.h, Extra.pb.cpp, …, Source.cpp)
Existing framework
- Duplicate the code for each partition (Top.cpp, M1.cpp, M2.cpp)
- One container per partition (Container 1, Container 2, Container 3)
- Wrap up with submission scripts
DtCraft
- Only three lines for resource control in Linux containers
- ~$ ./submit --master=127.0.0.1 binary
17× fewer lines of code
- 33% from message passing
- 67% from boilerplate code
7-11% performance loss
- Transparent concurrency
- API cost
[Figure: runtime on 40 AWS nodes for small, medium, and large designs, DtCraft vs. hard-coded]
The potential productivity gain is tremendous!
[Figure: development time in weeks, DtCraft vs. hard-coded]
GitHub: https://github.com/twhuang-uiuc/DtCraft
Star our project to receive updates
MIT license
Open to collaboration!
Cluster computing
Scalability
Security
Productivity
Groovy API
One image stream generator
One image label classifier/trainer
[Figure: a stream generator sends an image stream to an online image label classifier]
Only 60 lines of code to create distributed ML with streaming
Tsung-Wei Huang
twh760812@gmail.com
http://web.engr.illinois.edu/~thuang19/