R, SciDB, Julia
Mert Terzihan, Zhixiong Chen
R
- 1. What is R
- In the 1970s, at Bell Labs, John Chambers developed a statistical programming language, S
○ The aim was to turn ideas into software, quickly and faithfully
○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993
- R is a language and environment for statistical computing and graphics
- 2. Features
- Object Oriented
○ similar to Python
- Optimized for vector/matrix operations
○ similar to Matlab
- Full support for statistical analysis
- Part of the GNU free software project
- Over 4300 user-contributed packages
- 3. Study Plan
- Scalar
- Vector
- Matrix
- Data Frame
- The apply Function
- Statistics
- Plot
Scalar
- Use R as a calculator
> 4+6
[1] 10
> x <- 6    # '<-' assigns the value 6 to the object x
> y <- 4
> x+y
[1] 10
> x <- "Hello world"    # strings are supported
> x
[1] "Hello world"
Vector
- Create a vector
> x <- c(5,9,1,0)    # c() concatenates individual elements
> x
[1] 5 9 1 0
> x <- 1:10    # generate the numbers from 1 to 10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,9,by=2)    # generate the numbers from 1 to 9 in steps of 2
[1] 1 3 5 7 9
> seq(8,20,length=6)    # generate 6 evenly spaced numbers from 8 to 20, inclusive
[1]  8.0 10.4 12.8 15.2 17.6 20.0
Vector
- Access a vector, indexing from 1 and using []
> x <- rep(1:3,6)    # repeat the numbers 1 to 3 six times
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> x[1:9]    # get the elements at indices 1 to 9
[1] 1 2 3 1 2 3 1 2 3
> x[c(3,6,9)]    # get the elements at indices 3, 6 and 9
[1] 3 3 3
> x[-c(3,6,9)]    # '-' excludes the given elements
 [1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3
Vector
- Access a vector, masking
> x
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> mask = x == 3    # create a mask
> mask    # the mask is stored as a vector of logical (boolean) values
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[16] FALSE FALSE  TRUE
> x[mask]
[1] 3 3 3 3 3 3
> x[!mask]    # '!' negates each logical value in the mask vector
 [1] 1 2 1 2 1 2 1 2 1 2 1 2
Matrix
- Create a matrix
> x <- c(5,7,9)
> y <- c(6,3,4)
> z <- cbind(x,y)    # bind two vectors as the columns of a matrix
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> matrix(c(5,7,9,6,3,4),nrow=3)    # create a 3-row matrix from the vector
     [,1] [,2]
[1,]    5    6
[2,]    7    3
[3,]    9    4
> diag(3)    # identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Matrix
- Matrix Operations, component-wise
> z <- matrix(c(5,7,9,6,3,4),nrow=3,byrow=T)
> z
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y <- matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y+z
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3
> y*z
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4
Matrix
- Matrix Operations, based on definition
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> z <- matrix(c(3,4,-2,6),nrow=2,byrow=T)
> z
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> x <- c(5,7)    # a length-2 vector, treated as a 2x1 column in the product below
> y%*%x    # matrix multiplication
     [,1]
[1,]   26
[2,]   63
[3,]   18
> t(z)    # transpose
     [,1] [,2]
[1,]    3   -2
[2,]    4    6
> solve(z)    # inverse
           [,1]       [,2]
[1,] 0.23076923 -0.1538462
[2,] 0.07692308  0.1153846
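As a quick sanity check on the operations above, the following sketch rebuilds z and confirms that solve(z) behaves as its inverse. The tolerance-based comparison with all.equal() and the variable names z_inv, product and is_identity are our own choices for illustration, not part of the slides.

```r
# Rebuild the 2x2 matrix z from the slide (filled row by row).
z <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)

# solve(z) returns the inverse of z.
z_inv <- solve(z)

# Multiplying a matrix by its inverse should give the identity matrix.
# all.equal() compares with a small numerical tolerance.
product <- z_inv %*% z
is_identity <- isTRUE(all.equal(product, diag(2)))
print(is_identity)

# Transposing twice gives back the original matrix.
print(identical(t(t(z)), z))
```

Comparing with a tolerance rather than with == is the safer habit here, because the inverse is computed in floating point.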
Matrix
- Access a matrix, indexing
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y[1,2]    # fetch a specific value
[1] 3
> y[1:2,]    # fetch rows
     [,1] [,2]
[1,]    1    3
[2,]    0    9
> y[,2]    # fetch a column
[1]  3  9 -1
> y[c(1,2),]    # use vectors as indices
     [,1] [,2]
[1,]    1    3
[2,]    0    9
Matrix
- Access a matrix, masking
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> mask <- y>0
> mask
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
> y[mask]
[1] 1 5 3 9
Data Frame
- Create a data frame, like a table in a database
mydata <- data.frame(col1, col2, col3,...)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
Data Frame
- Access a data frame
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> patientdata[1:3,]    # treat it as a special matrix
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
> patientdata$patientID    # access a column by name
[1] 1 2 3 4
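The masking idea shown earlier for vectors and matrices carries over to data frames. The sketch below rebuilds the patient table and selects rows with a logical mask; the threshold of 30 and the names older and subset_cols are illustrative assumptions.

```r
# Rebuild the data frame from the slides.
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# A logical mask over rows, built from one column.
older <- patientdata$age > 30

# Rows where the mask is TRUE, all columns.
print(patientdata[older, ])

# Rows where the mask is TRUE, two columns selected by name.
subset_cols <- patientdata[older, c("patientID", "status")]
print(subset_cols)
```

The row index comes before the comma and the column index after it, exactly as with a matrix.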
The apply Function
- Apply a function to data structure elements
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> func <- function(x){    # define a function func: 1+0.1*x
+   x = x+10
+   return (x/10)
+ }
> apply(y,c(1,2),func)    # apply func to every element of matrix y
     [,1] [,2]
[1,]  1.1  1.3
[2,]  1.0  1.9
[3,]  1.5  0.9
Statistics
- Some handy distributions
> dnorm(c(3,2),0,1)    # density of the standard normal distribution
[1] 0.004431848 0.053990967
> x <- seq(-5,10,by=.1)
> dnorm(x,3,2)
[1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
[7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04
…...
d*: density function
p*: distribution function
q*: quantile function (the inverse distribution function)
dnorm, pnorm, qnorm
dt, pt, qt
binomial, exponential, Poisson, gamma
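The d*/p*/q* naming scheme can be checked numerically. The sketch below uses the N(3, sd = 2) distribution from the slide; the test point x = 4.2 and the binomial parameters are arbitrary values picked for illustration.

```r
# Density (d*), distribution (p*) and quantile (q*) functions for N(3, sd = 2).
mu <- 3
sigma <- 2

# The density of a normal distribution peaks at its mean.
peak <- dnorm(mu, mean = mu, sd = sigma)

# Half of the probability mass lies below the mean.
below_mean <- pnorm(mu, mean = mu, sd = sigma)

# qnorm is the inverse of pnorm: q(p(x)) gives x back.
x <- 4.2
round_trip <- qnorm(pnorm(x, mean = mu, sd = sigma), mean = mu, sd = sigma)

# The same naming scheme applies to other families, e.g. the binomial:
# P(X <= 2) for X ~ Binomial(10, 0.5) equals the sum of the point probabilities.
cdf_direct <- pbinom(2, size = 10, prob = 0.5)
cdf_summed <- sum(dbinom(0:2, size = 10, prob = 0.5))

print(c(peak = peak, below_mean = below_mean, round_trip = round_trip))
print(c(cdf_direct = cdf_direct, cdf_summed = cdf_summed))
```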
Statistics
- Simulations
To randomly simulate 100 observations from N(3,4), i.e. mean 3 and variance 4 (standard deviation 2):
> rnorm(100,3,2)
[1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
[8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165
……
rnorm, rt, rpois
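Note that the third argument of rnorm is the standard deviation, not the variance. A minimal sketch of how one might check a simulation against the distribution it was drawn from; the seed and the sample size of 10000 are arbitrary assumptions.

```r
# Fix the seed so the simulation is reproducible.
set.seed(42)

# Draw from N(3, 4): mean 3, standard deviation 2 (variance 4).
obs <- rnorm(10000, mean = 3, sd = 2)

# With this many observations the sample moments should be close
# to the theoretical ones.
sample_mean <- mean(obs)
sample_var  <- var(obs)

print(c(mean = sample_mean, variance = sample_var))
```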
Plot
- Plotting x*sin(x)
> f <- function(x) {    # define the function f(x) = x*sin(x)
+   return (x*sin(x))
+ }
> plot(f,-20*pi,20*pi)    # plot f between -20*pi and 20*pi
> abline(0,1,lty=2)    # add a dashed line with intercept 0 and slope 1 (lty = 2 means dashed)
> abline(0,-1,lty=2)    # add a dashed line with intercept 0 and slope -1
More?
- The help() function
- Refer to the official manual
○ http://cran.r-project.org/manuals.html
- A wonderful 4-week online course
○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
- A good book
○ ‘R in Action’ by Robert Kabacoff
- 4. Bonus
- Installation
○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
○ Ignore errors such as “Unknown media type in type 'all/all'”
- RStudio
○ a wonderful IDE for R programmers
○ http://www.rstudio.com/
Ricardo
Integrating R and Hadoop
Motivation
- Statistical software, such as R, provides rich functionality for data analysis and modeling, but can handle only limited amounts of data
- Data management systems, such as Hadoop, can handle large data, but provide insufficient analytical functionality
Union is strength!
Solution
- Ricardo decomposes data-analysis algorithms into
○ parts executed by the R statistical analysis system
○ parts handled by the Hadoop data management system
Components
- R
○ The core of statistical analysis
- Large-Scale Data Management Systems
○ HDFS
○ Work with dirty, semi-structured or unstructured data
○ Massive data storage, manipulation and parallel processing
- Jaql
○ A JSON query language
○ The declarative interface to Hadoop for Ricardo
○ Like Pig, Hive
Architecture
Conclusion
- The current version has poor performance
Overview of SciDB
Large Scale Array Storage, Processing and Analysis
Context
- 1. Background and Motivation
- 2. Features and Functionality
- 3. Data Definition
- 4. Data Manipulation
- 5. Architecture
What is SciDB?
- Massively parallel storage manager
- Able to parallelize large-scale array processing algorithms
- 1. Background and Motivation
- Modern scientific data differs from business data in three important respects:
○ Sensor arrays consist of rectangular ‘arrays’ of individual sensors
○ Scientific analysis requires sophisticated data processing methods
■ Ex: Noisy data needs to be ‘cleaned’
○ Data generated by modern scientific instruments is extremely large
- The array data model is more desirable in scientific domains
○ With notions of adjacency or neighborhood
○ Ordering is fundamental
- The complexity of data processing calls for a much more flexible data management platform
○ A different kind of DBMS
- 2. Features and Functionality
- Collections of n-dimensional arrays
- Cells in arrays contain a tuple of values
- Each value is associated with a distinguishing attribute name
- 3. Data Definition
- Create an array:
- Output:
3.1 Sparse Arrays
- Arrays in SciDB may be sparse
- Two ways to handle missing information:
○ Ignore it
○ Treat it depending on the operation
- Sparse array with jagged edges and holes:
- 4. Data Manipulation in SciDB
4.1 Slice()
- Projects an array along a particular index value in a single dimension
4.2 Subsample()
- Extracts a region of the array
- Generalization of Slice()
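To make the two operators concrete, the sketch below mimics them on an ordinary in-memory R array. This is only an analogy using base R indexing, not SciDB syntax; the array shape and the names slice, region and thin_region are assumptions made for illustration.

```r
# A small 3-dimensional array: 4 x 3 x 2, filled with 1..24.
a <- array(1:24, dim = c(4, 3, 2))

# Slice analogue: fix one index value in a single dimension.
# Fixing the third dimension at 2 leaves a 4 x 3 array.
slice <- a[, , 2]
print(dim(slice))

# Subsample analogue: keep a rectangular region, here rows 2..3 and
# columns 1..2 across both layers. drop = FALSE keeps all three dimensions.
region <- a[2:3, 1:2, , drop = FALSE]
print(dim(region))

# A slice is the special case of a region that is one cell thick
# in the chosen dimension.
thin_region <- a[, , 2, drop = FALSE]
print(all(thin_region[, , 1] == slice))
```

Seen this way, a slice reduces the number of dimensions by one, while a subsample keeps the dimensionality and only shrinks the extent.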
4.3 SJoin()
- Combines attributes from two input arrays
○ Combines cells with the same index value
○ Input arrays need not have identical dimensions
4.4 Filter()
- Applies a predicate to the attribute values of the input
○ Produces an array of the same size
○ Cells where the predicate is false are set to empty
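The behaviour of SJoin() and Filter() can be imitated on small in-memory R matrices. Again this is an analogy in base R rather than SciDB syntax: the attribute names temp and humidity, the sample values, and the use of NA to stand in for an empty cell are all assumptions. The sketch also covers only the simple case where both inputs share the same dimensions.

```r
# Two arrays over the same 2 x 3 index space, each holding one attribute.
temp     <- matrix(c(21.5, 19.0, 23.1, 18.2, 22.4, 20.0), nrow = 2)
humidity <- matrix(c(40, 55, 38, 60, 42, 50), nrow = 2)

# Filter analogue: same shape as the input; cells failing the predicate
# become empty (NA here).
warm <- ifelse(temp > 20, temp, NA)
print(warm)

# SJoin analogue: pair up the attributes of cells that share an index,
# giving one array whose third dimension indexes the attribute.
joined <- array(c(temp, humidity), dim = c(2, 3, 2),
                dimnames = list(NULL, NULL, c("temp", "humidity")))
print(joined[1, 2, ])   # both attributes of cell (1, 2)
```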
4.5 Extensibility
- Provides Postgres style UDT and UDF extensibility
○ New types will inter-operate with SciDB’s own types and array operators
- Supports operator extensibility
○ Gaussian smoothing
○ Weighted average of the cell’s neighborhood
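A weighted average over a cell's neighborhood can be sketched in a few lines of plain R. This is not a SciDB user-defined operator, only an illustration of the computation such an operator would perform; the 3 x 3 binomial kernel (a common discrete approximation to a Gaussian), the function name smooth and the choice to leave border cells unchanged are assumptions.

```r
# Weights for a 3 x 3 neighborhood; they sum to 1.
kernel <- matrix(c(1, 2, 1,
                   2, 4, 2,
                   1, 2, 1), nrow = 3, byrow = TRUE) / 16

# Replace every interior cell by the weighted average of its neighborhood.
# The input must be at least 3 x 3; border cells are left unchanged.
smooth <- function(m, k) {
  out <- m
  for (i in 2:(nrow(m) - 1)) {
    for (j in 2:(ncol(m) - 1)) {
      window <- m[(i - 1):(i + 1), (j - 1):(j + 1)]
      out[i, j] <- sum(window * k)
    }
  }
  out
}

# A flat 5 x 5 field with a single noisy spike in the middle.
field <- matrix(0, nrow = 5, ncol = 5)
field[3, 3] <- 16
smoothed <- smooth(field, kernel)
print(smoothed)
```

The spike is spread over its neighbors while the total of the field stays the same, which is the "cleaning" effect mentioned for noisy sensor data.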
- 5. Architecture
- Shared nothing design
- Centralized system catalog database
○ Information about nodes, data distribution and user-defined extensions
- Influenced by MapReduce
- Implements only the ‘A’ and ‘D’ of ACID
○ Atomicity
○ Durability
References
- SciDB web-site: http://www.scidb.org
- Publications on SciDB: http://www.scidb.org/about/publications.php
- Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA: http://www.scidb.org/Documents/sigmod691-brown.pdf
SciDB-R
Best Database for R
The R Programmers
- want their analytics to just work, on extremely large datasets as nimbly as on small ones.
- want to concentrate on the analytics, not parallelism, data formatting, and memory management.
Benefits
- Use SciDB to manage large data sets
○ a storage backend
○ filter and join data before performing analytics
- Use SciDB to share the load of intensive computations
○ offload large computations to a cluster
○ run some analytical tasks
- Use SciDB to share data among multiple users
Example
- R code using SciDB to perform calculations
> library("scidb")    # load the scidb package in the current R session
> scidbconnect()    # may require host, port, username, password
# Get array Z from SciDB as the SciDB array object U.
# U is an R representation of SciDB array Z, pretty much a data frame.
> U <- scidb("Z")
> set.seed(1)    # set a seed for randomization
> x = cbind(rnorm(5))    # create a column vector with 5 rows
> y = U %*% x    # computed by SciDB, returning a SciDB array object
# Bring the computed result back into R.
# drop: if y has only one column and drop is TRUE,
# y is reduced to a plain vector without labels.
> y[, drop = FALSE]
References
- Official Website
○ http://www.paradigm4.com/scidb-r/
- SciDB-R package
○ https://github.com/Paradigm4
- Instructions and Manuals
○ http://cran.r-project.org/web/packages/scidb/
Julia Language
A Fast Dynamic Language for Technical Computing
Context
- 1. Motivation
- 2. Features
- 3. JIT Compiler and Performance Benchmarks
- 4. Example Codes
- 5. IJulia
- 6. Issue Tracking
- 1. Motivation
- Why do we need more?
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
- 1. What is Julia?
- High-level, high-performance dynamic programming language
○ Syntax familiar to Matlab users
- The library is written mostly in Julia
- 2. Some Features
- Open source with an MIT licensed core
- Easy installation
- Dynamically typed with fast user-defined types
- JIT compiler
- Distributed memory parallelism
- Call C, Fortran and Python libraries
- Unicode support
- Metaprogramming with Lisp-like macros
- 3. High-Performance JIT Compiler
- LLVM-based Just in Time Compiler
- Often matches the performance of C
Benchmark times relative to C (smaller is better)
- 3. Log-scale of Benchmark
Execution time relative to C++
Benchmarks: fib, parse_int, quicksort, mandel, pi_sum, rand_mat_stat, and rand_mat_mul
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
- 4. Example Julia Code
4.1 Arrays and Vectors
4.2 Matrix Operations
4.3 Ternary Operators
4.4 Packages
- Pkg.add("Package_Name")
○ Cpp for calling C++ from Julia
○ Curl for the Julia HTTP Curl library
○ Winston, Gadfly, Gaston or PyPlot for graphics and plotting
○ HDFS for a wrapper over the Hadoop HDFS library
○ LIBSVM for LIBSVM bindings for Julia
○ and many more
- http://docs.julialang.org/en/latest/packages/packagelist
4.5 Plotting
- Graphics in Julia are available through external packages
○ Use Winston.jl for Matlab-style plots
○ Use Gadfly for the Wickham-Wilkinson style of graphics
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
4.6 Sequential Buffon’s Needle
- We have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor.
○ What is the probability that the needle will lie across a line between two strips?
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
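The Julia code from the slide is not reproduced in this transcript. As a stand-in, here is our own R sketch of the same Monte Carlo idea: drop many needles at random and count how often one crosses a line. For a needle no longer than the strip width, the exact probability is 2 * needle / (pi * width). The function name buffon, its parameters, the seed and the number of drops are all assumptions.

```r
# Estimate the crossing probability by simulating n needle drops.
buffon <- function(n, needle = 1, width = 1) {
  # Distance from the needle's centre to the nearest line.
  d <- runif(n, min = 0, max = width / 2)
  # Acute angle between the needle and the lines.
  theta <- runif(n, min = 0, max = pi / 2)
  # The needle crosses a line when its half-projection reaches the line.
  mean(d <= (needle / 2) * sin(theta))
}

set.seed(1)
estimate <- buffon(200000)
exact <- 2 / pi          # 2 * needle / (pi * width) with needle = width = 1
print(c(estimate = estimate, exact = exact))
```

Because the draws are vectorized, no explicit loop is needed in R; a Julia version would typically use a plain for loop instead.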
4.7 Parallel Buffon’s Needle
- @parallel (+) for loop
○ Assign iterations to multiple processes
○ Combine them with a specified reduction (+)
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
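The same split-then-reduce pattern can be sketched in R with the parallel package that ships with R. This is an analogy to the Julia @parallel (+) loop, not a translation of the slide's code: the chunk sizes, the seed and the function name count_crossings are assumptions, and mclapply relies on forking, so on Windows mc.cores must be set to 1.

```r
library(parallel)

# Count line crossings in one chunk of needle drops (needle = width = 1).
count_crossings <- function(n) {
  d <- runif(n, min = 0, max = 0.5)
  theta <- runif(n, min = 0, max = pi / 2)
  sum(d <= 0.5 * sin(theta))
}

# Use a random number generator that gives each child process its own stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

# Assign the iterations to worker processes, one chunk at a time.
chunks <- rep(50000, 4)
counts <- mclapply(chunks, count_crossings, mc.cores = 2)

# Combine the per-chunk results with the reduction (+).
total <- Reduce(`+`, counts)
estimate <- total / sum(chunks)
print(estimate)
```

Each chunk returns only a single count, so very little data has to travel back from the workers before the reduction.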
4.8 Writing Low-Level Code
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
4.9 Using Python Libraries
- Pkg.add("PyCall")
○ using PyCall
*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
- 5. IJulia
- By default, Julia is used from the command line
- IJulia combines Julia with IPython
- IPython provides a rich architecture for interactive computing
- Available on GitHub: https://github.com/JuliaLang/IJulia.jl
- 6. Issue Tracking
- Source code is on GitHub
○ https://github.com/JuliaLang/julia
- Issues can be opened from Julia GitHub Repository
○ https://github.com/JuliaLang/julia/issues
- Easy and quick bug fixes
○ No need to wait for another release
References
- Julia web-site: http://julialang.org
- C, Fortran, Julia, Python, Matlab, R and JavaScript code used in benchmarking: https://github.com/JuliaLang/julia/tree/master/test/perf/micro
- Publications on Julia: http://julialang.org/publications/
- Julia source code: https://github.com/JuliaLang/julia
- Viral B. Shah’s introductory slides, July 13, 2013: https://github.com/ViralBShah/julia-presentations/raw/master/Fifth-Elephant-2013/Fifth-Elephant-2013.pdf
- MIT IAP Julia Tutorial: http://www.youtube.com/user/JuliaLanguage
- Winston, 2D plotting for Julia: https://github.com/nolta/Winston.jl
Advantages of R over Julia
- Backed by GNU
- More mature and older
- Large collection of libraries, i.e. CRAN
- Rich development environment, i.e. RStudio
- Graphics and plotting
- Many great tutorials and books
Advantages of Julia over R
- Performance of Julia that is close to C
○ JIT LLVM Compiler
- Supports low-level programming, e.g. functions that modify their arguments
- Active community promises a bright future for Julia
- R is single threaded, whereas Julia supports parallelism
○ R has techniques for large datasets, but they are not easy to use
- Julia provides fast development and fast execution
Syntax Differences
Operator                      R            Julia
Assignment                    <-           =
Element-wise multiplication   *            .*
Element-wise addition         +            .+
Modulo                        %%           mod
Creating a vector             c(1,2,3,4)   [1:4]
Size of the array             dim          size
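The R column of the comparison can be exercised directly. The values and variable names in this short sketch are arbitrary and serve only as illustration.

```r
x <- c(1, 2, 3, 4)        # assignment with <- and vector creation with c()
y <- 1:4                  # the colon operator builds the same sequence

prod_xy <- x * y          # element-wise multiplication
sum_xy  <- x + y          # element-wise addition
rem     <- x %% 3         # modulo

m <- matrix(1:6, nrow = 2)
shape <- dim(m)           # size of an array; dim() of a plain vector is NULL

print(prod_xy)
print(sum_xy)
print(rem)
print(shape)
```

In R the arithmetic operators are element-wise by default, and matrix multiplication needs the separate %*% operator.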
References
- Matlab, R, and Julia: Languages for Data Analysis, October 15 2012: http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html
- Julia for R Programmers, Douglas Bates, July 18 2013: http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf
- An R Programmer Looks at Julia, Douglas Bates, April 7 2012: http://dmbates.blogspot.com/2012/04/r-programmer-looks-at-julia.html
- http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/
Questions?