Streaming Data And Concurrency In R
Rory Winston
rory@theresearchkitchen.com
About Me
Independent Software Consultant
M.Sc. Applied Computing, 2000
M.Sc. Finance, 2008
Apache Committer
Interested in practical applications of functional languages and machine learning
Really interested in seeing R usage grow in finance
1. A Short Rant
2. Why We Need Concurrency
3. Motivating Example
4. Conclusion
5. References and Further Reading
A Short Rant
Parallelization vs. Concurrency in R
Multithreading vs. parallelization, i.e. fork() vs. pthread_create()
The R interpreter is single-threaded
Some historical context for this (e.g. non-threadsafe BLAS implementations)
Multithreading can be complex and problematic
Instead, the focus has been on parallelization:
- Distributed computation: gridR, nws, snow
- Multicore/multi-CPU scaling: Rmpi, Romp, pnmath
- Interfaces to PBLAS/Hadoop/OpenMP/MPI/Globus/etc.
Parallelization suits large CPU-bound processing applications (a short sketch follows below)
So do we really need concurrency at all then?
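To make the parallelization point concrete, here is a minimal sketch of the explicit-parallelism style these packages offer, using snow (the cluster size and the toy workload are placeholders for illustration):

library(snow)

# Start four local worker processes (socket-based cluster)
cl <- makeCluster(4, type = "SOCK")

# Farm a CPU-bound toy workload out across the workers
res <- parSapply(cl, 1:8, function(i) sum(rnorm(1e6)))

stopCluster(cl)

This style scales batch, CPU-bound work well, but it does nothing for a single session that needs to keep reacting to incoming data while the user works at the prompt.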
Why We Need Concurrency
Multithreading Is A Valuable Tool
I say, "yes" For general real-time (streaming to be more precise) data analysis (Growing interest in using R for streaming data, not just
GUI toolkit integration Fine-grained control over independent task execution Fine-grained control over CPU-bound and I/O-bound task management
"I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." -
Luke Tierney, 2001
Why We Need Concurrency
Will There Be A Multithreaded R?
Short answer: most likely not
At least not in its current incarnation
The internal workings of the interpreter are not particularly amenable to concurrency (a short sketch of the first two points follows below):
- Functions can manipulate caller state (<<- vs. <-)
- Lazy evaluation machinery (promises)
- Dynamic state, garbage collection, etc.
- Scoping: global environments
- Management of resources: streams, I/O, connections, sinks
Implications for current code
Possibly in the next language evolution (cf. Ihaka?)
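A minimal sketch of the first two points (plain R, nothing package-specific): <<- lets a function mutate a binding outside its own frame, and promises mean an argument may never be evaluated at all; both make it hard to reason about shared state across threads.

# Functions can reach out of their own frame via <<-
counter <- 0
bump <- function() counter <<- counter + 1   # mutates the enclosing binding
bump(); bump()
counter                                      # 2

# Arguments are promises: evaluated lazily, and possibly never
noisy  <- function() { cat("evaluated\n"); 1 }
ignore <- function(x) 42                     # never touches x
ignore(noisy())                              # returns 42; "evaluated" never prints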
Motivating Example
Based on work I did last year and presented at UseR! 2008
Wrote a real-time and historical market data service from Reuters/R
The real-time interface used the Reuters C++ API
R extension that spawned a listening thread and handled market updates
New version also does publishing as well as subscribing
Motivating Example
The (real-world) example involves building a new high-frequency trading system
Step 1 is handling market prices (in this case interbank currency prices)
Need to ensure that the new system's prices are:
- Correct
- Fast
Motivating Example
[Architecture diagram: RMDS Message Bus → C++ RMDS API → R Analytics]
Motivating Example
Issues With This Approach
As the R interpreter is single-threaded, we cannot spawn a thread for callbacks
Thus, the interpreter thread is locked for the duration of the subscription
Not a great user experience
Need to find an alternative mechanism (a small simulation of the blocking behaviour follows below)
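A small simulation of the problem (this is not the Reuters extension; subscribe_blocking and its fake ticks are hypothetical stand-ins): while the listener loop runs on the single interpreter thread, the R prompt is unavailable.

# Hypothetical blocking subscription: everything runs on the one interpreter thread
subscribe_blocking <- function(handler, n = 10) {
  for (i in seq_len(n)) {
    tick <- list(symbol = "EURUSD",
                 bid    = 1.2500 + rnorm(1, sd = 1e-4),
                 time   = Sys.time())
    handler(tick)        # user callback
    Sys.sleep(1)         # stands in for waiting on the market data feed
  }
}

# The prompt is locked until all n "updates" have been delivered
subscribe_blocking(function(tick) cat(format(tick$time), tick$symbol, tick$bid, "\n"))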
Motivating Example
Alternative Approach
If we cannot run subscriber threads in-process, we need to decouple
Standard approach: add an extra layer and use some form of IPC
For instance, we could:
- Subscribe in a dedicated R process (A)
- Push incoming data onto a socket
- R process (B) reads from a listening socket
(A sketch of this socket-based variant follows below)
Sockets could also be replaced by another IPC primitive, e.g. pipes or shared memory
We will use the bigmemoRy package to leverage the latter
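For reference, the socket variant might look like this with base R connections (a sketch only; the port number and the comma-separated tick format are assumptions for illustration):

# R process (B): listen on a local port and consume ticks as lines of text
con <- socketConnection(port = 6011, server = TRUE, blocking = TRUE, open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",")[[1]]
  cat("got tick:", fields[1], "bid =", fields[2], "\n")
}
close(con)

# R process (A): subscribe to the feed and push each tick onto the socket
con <- socketConnection(host = "localhost", port = 6011, blocking = TRUE, open = "w")
writeLines("EURUSD,1.2501", con)     # one line per incoming update
close(con)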
Motivating Example
The bigmemoRy package
From the description: "Use C++ to create, store, access, and manipulate massive matrices"
Allows creation of large (≥ RAM) matrices
These matrices can be mapped to files/shared memory
It is the shared memory functionality that we will use
- big.matrix(nrow, ncol, type = "integer", ...)
- shared.big.matrix(nrow, ncol, type = "integer", ...)
- filebacked.big.matrix(nrow, ncol, type = "integer", ...)
- read.big.matrix(file, sep = ",", ...)
Motivating Example
Sample Usage

> library(bigmemory)
> X <- shared.big.matrix(type = "double", ncol = 1000, nrow = 1000)
> X
An object of class "big.matrix"
Slot "address":
<pointer: 0x7378a0>
Motivating Example
Create Shared Memory Descriptor
> desc <- describe(X)
> desc
$sharedType
[1] "SharedMemory"

$sharedName
[1] "53f14925-dca1-42a8-a547-e1bccae999ce"

$nrow
[1] 1000

$ncol
[1] 1000

$rowNames
NULL

$colNames
NULL

$type
[1] "double"
Motivating Example
Export the Descriptor
In R session 1:
> dput(desc, file = "/tmp/matrix.desc")

In R session 2:
> library(bigmemory)
> desc <- dget("/tmp/matrix.desc")
> X <- attach.big.matrix(desc)

Now R sessions 1 and 2 share the same big.matrix instance
Motivating Example
Share Data Between Sessions
R session 1:
> X[1,1] <- 1.2345

R session 2:
> X[1,1]
[1] 1.2345

Thus, streaming data can be continuously fed into session 1
And concurrently processed in session 2 (a fuller producer/consumer sketch follows below)
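Putting it together, a sketch of the streaming pattern on top of the shared matrix created above. The layout convention is an assumption for illustration, not part of bigmemoRy: X[1,1] holds a write counter, and each tick occupies one subsequent row as (bid, ask, timestamp); the simulated ticks stand in for the real C++ RMDS feed.

# Session 1 (producer): feed incoming ticks into the shared matrix
library(bigmemory)
X <- attach.big.matrix(dget("/tmp/matrix.desc"))
for (i in 1:100) {
  X[i + 1, 1:3] <- c(1.2500 + rnorm(1, sd = 1e-4),   # bid
                     1.2502 + rnorm(1, sd = 1e-4),   # ask
                     as.numeric(Sys.time()))         # timestamp
  X[1, 1] <- i          # publish the new write position last
  Sys.sleep(0.1)        # stands in for waiting on the market data feed
}

# Session 2 (consumer): poll the counter and process any unseen rows
library(bigmemory)
X <- attach.big.matrix(dget("/tmp/matrix.desc"))
seen <- 0
repeat {
  n <- X[1, 1]
  if (n > seen) {
    for (r in (seen + 1):n) {
      cat(sprintf("tick %d: mid = %.5f\n", r, (X[r + 1, 1] + X[r + 1, 2]) / 2))
    }
    seen <- n
  }
  Sys.sleep(0.1)
}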
Motivating Example
[Architecture diagram: RMDS Message Bus ↔ C++ RMDS API ↔ two R / bigmemoRy processes, with a mirrored publishing path back through the C++ RMDS API to the RMDS Message Bus]
Conclusion
Summary
Lack of threads is not necessarily a barrier to concurrent analysis
Packages like bigmemoRy, nws, etc. facilitate decoupling via IPC
Could potentially take this further (using e.g. nws)
References and Further Reading
References

bigmemoRy: http://cran.r-project.org/web/packages/bigmemory/
Luke Tierney's original threading paper: http://www.cs.uiowa.edu/~luke/R/thrgui/
HPC and R Survey: http://epub.ub.uni-muenchen.de/8991/
Inside The Python GIL: www.dabeaz.com/python/GIL.pdf