Towards Relay: a New IR for Machine Learning Frameworks - Jared Roesch (PowerPoint presentation)


  1. Towards Relay: a New IR for Machine Learning Frameworks
     Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, Zachary Tatlock

  2. (image-only slide)

  3. Tension between performance and flexibility 🐈 🧙

  4–6. Tension between performance and flexibility 🐈 🧙 (animation builds of slide 3)

  7. From OpenAI’s recent blog post: https://blog.openai.com/ai-and-compute/

  8. “We believe the largest training runs today employ hardware that cost in the single digit millions of dollars to purchase (although the amortized cost is much lower).” – OpenAI Blog

  9. Growing compute
     • The community is addressing the need for cost-effective compute with new hardware designs.
     • TPU, Trillium, A10 Bionic, Brainwave, etc.
     • The hardware landscape is becoming very heterogeneous: a mix of CPUs, GPUs, and custom accelerators.

  10. Growing compute
      • Different operating environments; e.g., a model can be memory-hungry in the cloud, but not on edge devices.
      • Introducing new compute may increase runtime efficiency.
      • But this doesn’t account for programming and porting costs.
      • For example, cloud FPGAs.

  11. Leveraging diversity
      • The current state of the art is to port and tweak models by hand for each hardware platform until they work.
      • How do we write programs for many different devices and optimize for:
        • Memory
        • Quantization
        • New numeric representations
        • Model transforms
        • Layout change
        • Device scheduling

  12. VTA
      • Take our friend Thierry, who has been building new hardware accelerators for ML.
      • How do we program the hardware?
      • How do we port existing models?
      • How do we adapt software for different HW designs?

  13. Portability + Flexibility
      • We need models that can be effectively optimized and run on a variety of devices.
      • We want generic models, but tuned implementations.
      • Can we build custom hardware directly from model descriptions?
      • “Write once, run everywhere”

  14. TVM
      • An end-to-end compiler stack for deep learning.
      • Hierarchical intermediate representations, tightly integrated for tuning models for specific hardware targets.
      • TVM is currently focused on producing high-performance operator implementations.
      • TVM is bottom-up.
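To make “operator implementations” and the “Schedule” box in the later diagrams concrete, here is a minimal sketch (not from the talk) of a TVM tensor-expression operator description plus its default schedule. It uses the current tvm.te API; the talk-era API spelled these tvm.placeholder / tvm.compute / tvm.create_schedule.

    # Sketch: a vector-add operator description and a default schedule in TVM.
    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    # The operator description says what to compute, with no commitment to how.
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")

    # The schedule says how to execute it (here, the unoptimized default).
    s = te.create_schedule(C.op)
    print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the lowered IR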

  15. Relay
      • We contribute a new high-level IR for TVM named Relay.
      • Generalize computation graphs to differentiable programs.
      • Write Python (in the style of PyTorch) but apply end-to-end optimizations.
      • Composed of a new front-end, IR, auto-diff, optimizer, backend, and runtime.
      • Relay is top-down.

  16. (TVM stack diagram: Frameworks (CoreML, CNTK, …) → Computational Graph → High-level Data-flow Rewriting → Tensor Operator Description → Schedule → LLVM / CUDA/Metal/OpenCL / Accelerators → Deployable Module; example: input image → prediction “tabby, tabby cat”)

      graph, lib, params = t.compiler.build(graph, target, params)
      module = runtime.create(graph, lib, t.cuda(0))
      module.set_input(**params)
      module.run(data=data_array)
      output = t.nd.empty(out_shape, ctx=t.cuda(0))
      module.get_output(0, output)

  17. What Relay will replace (same TVM stack diagram and deployment snippet as slide 16)
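The snippet on these slides is shorthand; a fuller sketch of the same compile-and-deploy flow, assuming the NNVM/TVM API of that era (sym, data_shape, params, data_array, and out_shape are assumed to be provided by the model importer):

    # Sketch of the NNVM-era compile/deploy flow the diagram describes (assumed API).
    import nnvm.compiler
    import tvm
    from tvm.contrib import graph_runtime

    # Compile the imported graph `sym` for a CUDA target.
    graph, lib, params = nnvm.compiler.build(
        sym, target="cuda", shape={"data": data_shape}, params=params)

    # Create a runtime module on GPU 0, feed inputs, run, and read back the prediction.
    module = graph_runtime.create(graph, lib, tvm.gpu(0))
    module.set_input(**params)
    module.run(data=data_array)
    output = module.get_output(0, tvm.nd.empty(out_shape, ctx=tvm.gpu(0)))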

  18. Why not current frameworks or IRs?
      • We believe the key to optimizing programs effectively is a typed, whole-program representation of machine learning models.
      • We will show how current frameworks’ IRs are lacking, then examine how Relay addresses these challenges.

  19. DL Framework Compilers
      • We are at the dawn of the compiler age for deep learning.
      • Framework designers realize performance is being left on the table, and frameworks are converging on compilation pipelines: XLA for TF, Glow for PyTorch, NNVM/TVM for MXNet.
      • Other IRs are framework-first; we want to be IR-first!
      • We need the “whole model” to do certain classes of optimization, analogous to “whole program” in traditional compilers.
      • But we want flexibility, portability, and performance!

  20. Advantages:
      + Embedded domain-specific language
      + Dataflow graph gives rise to straightforward execution and scheduling.
      + The graph is easy to optimize and compile, for example static memory planning.
      + XLA-style compilation is straightforward.
      Disadvantages:
      - Embedded domain-specific language
      - Users write programs to build the graph and later execute it.
      - Staging can be complex and confusing.
      - The IR is a computation graph (i.e. a dataflow graph) with embedded control and mutation.
      - Ex. What does a gradient of an impure function mean?

  21.
      x = tf.placeholder(tf.float32, shape=(None, D_in))
      y = tf.placeholder(tf.float32, shape=(None, D_out))
      w1 = tf.Variable(tf.random_normal((D_in, H)))
      w2 = tf.Variable(tf.random_normal((H, D_out)))
      h = tf.matmul(x, w1)
      h_relu = tf.maximum(h, tf.zeros(1))
      y_pred = tf.matmul(h_relu, w2)
      loss = tf.reduce_sum((y - y_pred) ** 2.0)
      grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
      new_w1 = w1.assign(w1 - learning_rate * grad_w1)
      new_w2 = w2.assign(w2 - learning_rate * grad_w2)

      # Need to evaluate loss; sess.run executes the graph.
      for _ in range(500):
          loss_value, _, _ = sess.run(
              [loss, new_w1, new_w2],
              feed_dict={x: x_value, y: y_value})

      Adapted from: https://github.com/jcjohnson/pytorch-examples/blob/master/autograd/two_layer_net_autograd.py

  22. Advantages:
      + Shallow embedding; users just interact with normal Python APIs.
      + Expressive: can use all of Python to interact with PyTorch, as it is the execution layer up to tensors.
      + Trace-based auto-diff over a subset of Python; can handle arbitrary control flow.
      + Can accelerate pieces using Glow and Tensor Comprehensions.
      Disadvantages:
      - Trace-based JIT and exporting only capture specific execution traces.
      - Not “whole model”.
      - Python is the “control plane”; C extensions are the “data plane”, which requires C extensions.
      - Incredibly limited and brittle export functionality.

  23. Tracing-based tools fail if traces change at all (i.e. essentially a static graph).
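A minimal sketch (not from the slides) of how a trace bakes in one execution path; torch.jit.trace is used here as a stand-in for the tracing-based exporters the slide refers to:

    import torch

    def f(x):
        # Data-dependent control flow: which branch runs depends on the values in x.
        if x.sum() > 0:
            return x * 2
        return x * -1

    # Tracing records only the branch taken for this particular example input
    # (and warns that the trace may not generalize).
    traced = torch.jit.trace(f, torch.ones(3))

    print(traced(torch.ones(3)))   # matches f: tensor([2., 2., 2.])
    print(traced(-torch.ones(3)))  # trace reuses the `* 2` branch: tensor([-2., -2., -2.]),
                                   # whereas f would return tensor([1., 1., 1.])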

  24.
      x = torch.randn(N, D_in)
      y = torch.randn(N, D_out)
      w1 = torch.randn(D_in, H, requires_grad=True)
      w2 = torch.randn(H, D_out, requires_grad=True)

      for t in range(500):
          y_pred = x.mm(w1).clamp(min=0).mm(w2)
          loss = (y_pred - y).pow(2).sum()
          print(t, loss.item())
          loss.backward()
          # Updates can be implemented in vanilla Python.
          with torch.no_grad():
              w1 -= learning_rate * w1.grad
              w2 -= learning_rate * w2.grad
              w1.grad.zero_()
              w2.grad.zero_()

      Adapted from: https://github.com/jcjohnson/pytorch-examples/blob/master/autograd/two_layer_net_autograd.py

  25–26. (image-only slides)

  27. System Design (TVM stack diagram: Frameworks (CoreML, CNTK, …) → Computational Graph → High-level Data-flow Rewriting → Tensor Operator Description → Schedule → LLVM / CUDA/Metal/OpenCL / Accelerators)

  28. System Design (diagram with Relay added: Frameworks (CoreML, CNTK, …), Relay Python Decorator, Relay with Fusion, Layout Change, Partial Eval, Traditional Optimizations, and Control Operators, Tensor Operator Description, Schedule, Relay runtime system, Hardware Implementation)

  29. Language
      • Functional higher-order language
      • Closures
      • Tensors
      • Control flow
      • References
      • Shape-dependent type system
      • Differentiable

  30. Language (same list as slide 29, with the callout “Old PL you know and love”)

  31. Language (same list as slide 29, with the callouts “Old PL you know and love” and “New challenges”)

  32. Frontend 🧙
      • Our current frontend is a subset of Python.
      • We use AST rewriting to transform the Python program into our IR directly.
      • We can statically analyze this subset, and type check it.
      • We rely on MyPy’s infrastructure (annotations, and typed_ast).

      @relay
      def linear_loss(a, b, x, y):
          y_hat = a * x + b
          return (y - y_hat)**2
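A rough sketch (not from the talk) of the kind of AST-rewriting entry point such a decorator could use; relay_sketch and its printout are hypothetical, and the real frontend relies on MyPy’s typed_ast and a full visitor that emits Relay IR rather than the standard ast module used here:

    import ast
    import inspect
    import textwrap

    def relay_sketch(fn):
        # Hypothetical decorator: grab the Python source of `fn` and walk its AST.
        source = textwrap.dedent(inspect.getsource(fn))
        func_def = ast.parse(source).body[0]  # the FunctionDef node for `fn`
        for node in ast.walk(func_def):
            if isinstance(node, ast.BinOp):
                # A real translator would map +, *, **, ... to Relay operators
                # (broadcast_add, broadcast_mul, ...) and build IR here.
                print("would lower:", ast.dump(node.op))
        return fn  # this sketch leaves the original function callable as plain Python

    @relay_sketch
    def linear_loss(a, b, x, y):
        y_hat = a * x + b
        return (y - y_hat) ** 2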

  33. If we remove all the syntactic sugar, we can see a little more of what’s going on:

      @relay
      def linear_loss(
              a: Tensor[Float, (1, 1)],
              b: Tensor[Float, (1, 1)],
              x: Tensor[Float, (1, 1)],
              y: Tensor[Float, (1, 1)]) -> Tensor[Float, (1, 1)]:
          y_hat = relay.broadcast_add(relay.broadcast_mul(a, x), b)
          diff = relay.broadcast_sub(y, y_hat)
          return relay.broadcast_mul(diff, diff)

  34. (same code as slide 33, with the callout: We can use Python’s type annotations to provide type info.)
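For reference, a plain NumPy version (not from the slides) of the same squared loss that the broadcast_* calls express; the sample values below are made up:

    import numpy as np

    def linear_loss_numpy(a, b, x, y):
        # Same computation as the desugared Relay version:
        # broadcast_mul(a, x) + b, then the squared difference from y.
        y_hat = a * x + b
        diff = y - y_hat
        return diff * diff

    a = np.array([[2.0]]); b = np.array([[0.5]])
    x = np.array([[3.0]]); y = np.array([[7.0]])
    print(linear_loss_numpy(a, b, x, y))  # [[0.25]]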
