R, SciDB, Julia: PowerPoint presentation by Mert Terzihan and Zhixiong Chen



slide-1
SLIDE 1

R SciDB Julia

Mert Terzihan Zhixiong Chen

slide-2
SLIDE 2

R

slide-3
SLIDE 3
  • 1. What is R
  • In the 1970s, at Bell Labs, John Chambers developed a statistical programming language, S

○ The aim was to turn ideas into software, quickly and faithfully
○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993.

  • R is a language and environment for statistical computing and graphics

slide-4
SLIDE 4
  • 2. Features
  • Object Oriented

○ similar to Python

  • Optimized for vector/matrix operations

○ similar to Matlab

  • Full statistical analysis support
  • Part of the GNU free software project
  • Over 4300 user contributed packages
slide-5
SLIDE 5
  • 3. Study Plan
  • Scalar
  • Vector
  • Matrix
  • Data Frame
  • The apply Function
  • Statistics
  • Plot
slide-6
SLIDE 6

Scalar

  • Use R as a calculator

> 4+6
[1] 10
> x<-6                 # '<-' assigns the value 6 to object x
> y<-4
> x+y
[1] 10
> x<-"Hello world"     # string support
> x
[1] "Hello world"

slide-7
SLIDE 7

Vector

  • Create a vector

> x<-c(5,9,1,0)        # c() concatenates individual elements
> x
[1] 5 9 1 0
> x<-1:10              # generate the numbers from 1 to 10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,9,by=2)        # generate the numbers from 1 to 9, stepping by 2
[1] 1 3 5 7 9
> seq(8,20,length=6)   # generate 6 evenly spaced numbers from 8 to 20 inclusive
[1] 8.0 10.4 12.8 15.2 17.6 20.0

slide-8
SLIDE 8

Vector

  • Access a vector, indexing from 1 and using []

> x<-rep(1:3,6)        # repeat the numbers 1 to 3 six times
> x
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> x[1:9]               # get the elements indexed from 1 to 9
[1] 1 2 3 1 2 3 1 2 3
> x[c(3,6,9)]          # get the elements at indices 3, 6, and 9
[1] 3 3 3
> x[-c(3,6,9)]         # '-' excludes the given indices
[1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3

slide-9
SLIDE 9

Vector

  • Access a vector, masking

> x
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> mask = x == 3        # create a mask
> mask                 # the mask is stored as a vector of logical (boolean) values
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
[16] FALSE FALSE  TRUE
> x[mask]
[1] 3 3 3 3 3 3
> x[!mask]             # '!' negates each logical value in the mask
[1] 1 2 1 2 1 2 1 2 1 2 1 2

slide-10
SLIDE 10

Matrix

  • Create a matrix

> x<-c(5,7,9)
> y<-c(6,3,4)
> z<-cbind(x,y)                      # bind two vectors as the columns of a matrix
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> matrix(c(5,7,9,6,3,4),nrow=3)      # create a 3-row matrix from the vector (filled column by column)
     [,1] [,2]
[1,]    5    6
[2,]    7    3
[3,]    9    4
> diag(3)                            # identity matrix
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

slide-11
SLIDE 11

Matrix

  • Matrix Operations, component-wise

> z<-matrix(c(5,7,9,6,3,4),nrow=3,byrow=T)
> z
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y<-matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y+z
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3
> y*z
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4

slide-12
SLIDE 12

Matrix

  • Matrix Operations, based on definition

> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> z<-matrix(c(3,4,-2,6),nrow=2,byrow=T)
> z
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> x<-c(5,7)            # a length-2 vector; this value reproduces the output below
> y%*%x                # matrix multiplication
     [,1]
[1,]   26
[2,]   63
[3,]   18
> t(z)                 # transpose
     [,1] [,2]
[1,]    3   -2
[2,]    4    6
> solve(z)             # inverse
           [,1]       [,2]
[1,] 0.23076923 -0.1538462
[2,] 0.07692308  0.1153846

slide-13
SLIDE 13

Matrix

  • Access a matrix, indexing

> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> y[1,2]               # fetch a specific value
[1] 3
> y[1:2,]              # fetch rows
     [,1] [,2]
[1,]    1    3
[2,]    0    9
> y[,2]                # fetch a column
[1]  3  9 -1
> y[c(1,2),]           # use a vector of indices
     [,1] [,2]
[1,]    1    3
[2,]    0    9

slide-14
SLIDE 14

Matrix

  • Access a matrix, masking

> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> mask<-y>0
> mask
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
> y[mask]
[1] 1 5 3 9

slide-15
SLIDE 15

Data Frame

  • Create, like a table in database

mydata <- data.frame(col1, col2, col3,...)

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

slide-16
SLIDE 16

Data Frame

  • Access a data frame

> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
> patientdata[1:3,]          # treat it as a special matrix
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
> patientdata$patientID      # access a column by name
[1] 1 2 3 4
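The masking idea shown earlier for vectors and matrices carries over to data frames. The following is a small sketch that rebuilds the slide's table so it runs on its own; the variable names `older`, `ids_and_status`, and `type1_ids` are illustrative and not part of the original deck.

```r
# Rebuild the example table from the slide so this sketch is self-contained.
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# A logical mask built from one column selects whole rows.
older <- patientdata[patientdata$age > 30, ]

# Columns can be selected by name.
ids_and_status <- patientdata[, c("patientID", "status")]

# Both ideas combined: the IDs of all Type1 patients.
type1_ids <- patientdata$patientID[patientdata$diabetes == "Type1"]
```

Here `older` keeps patients 2 and 4, and `type1_ids` holds 1, 3, and 4.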

slide-17
SLIDE 17
The apply Function

  • Apply a function to data structure elements

> y
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> func <- function(x){       # define a function func: 1+0.1*x
+   x = x+10
+   return (x/10)
+ }
> apply(y,c(1,2),func)       # apply func to every element of matrix y
     [,1] [,2]
[1,]  1.1  1.3
[2,]  1.0  1.9
[3,]  1.5  0.9
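The second argument of `apply` (the margin) decides what the function receives: `1` passes each row, `2` passes each column, and `c(1,2)` passes each element. A short sketch using the same matrix `y` as the slide; the names `row_sums`, `col_max`, and `scaled` are illustrative.

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)

# Margin 1: the function receives one row at a time.
row_sums <- apply(y, 1, sum)   # 4 9 4

# Margin 2: the function receives one column at a time.
col_max <- apply(y, 2, max)    # 5 9

# Margin c(1, 2): the function receives one element at a time.
scaled <- apply(y, c(1, 2), function(v) (v + 10) / 10)

# For simple arithmetic, the vectorized form gives the same result without apply.
same_as_vectorized <- isTRUE(all.equal(scaled, 1 + 0.1 * y))
```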

slide-18
SLIDE 18
Statistics

  • Some handy distributions

> dnorm(c(3,2),0,1)          # density of the normal distribution
[1] 0.004431848 0.053990967
> x<-seq(-5,10,by=.1)
> dnorm(x,3,2)
[1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
[7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04 …...

d*: density function
p*: distribution function
q*: quantile function (the inverse distribution function)

dnorm, pnorm, qnorm
dt, pt, qt
binomial, exponential, Poisson, gamma
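The d*/p*/q* naming can be checked directly: `q*` undoes `p*`, and `d*` evaluates the density. A minimal sketch with the normal and Poisson families; the particular input values are arbitrary choices for illustration.

```r
# p*: cumulative probability P(Z <= 1.96) for a standard normal, roughly 0.975.
p <- pnorm(1.96)

# q*: the quantile function inverts p*, so this is roughly 1.96.
q <- qnorm(0.975)

# Round trip with a non-standard normal (mean 3, sd 2): q*(p*(x)) gives back x.
round_trip <- qnorm(pnorm(0.5, mean = 3, sd = 2), mean = 3, sd = 2)

# d*: the standard normal density at 0 equals 1/sqrt(2*pi).
d0 <- dnorm(0)

# The same prefixes work for other families, e.g. Poisson: P(X = 0) = exp(-lambda).
pois0 <- dpois(0, lambda = 2)
```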

slide-19
SLIDE 19
Statistics

  • Simulations

To randomly simulate 100 observations from N(3,4), i.e. mean 3 and standard deviation 2:

> rnorm(100,3,2)
[1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
[8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165
……

rnorm, rt, rpois
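Note that `rnorm` takes the standard deviation, not the variance, as its third argument. A quick sanity check is to simulate a larger sample and compare its mean and variance with the target values. The seed and sample size below are arbitrary choices for this sketch.

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
draws <- rnorm(10000, mean = 3, sd = 2)

sample_mean <- mean(draws)         # should be close to 3
sample_var  <- var(draws)          # should be close to 4 (= sd^2)
```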

slide-20
SLIDE 20
Plot

  • Plotting x*sin(x)

> f <- function(x) {         # define the function f(x) = x*sin(x)
+   return (x*sin(x))
+ }
> plot(f,-20*pi,20*pi)       # plot f between -20*pi and 20*pi
> abline(0,1,lty=2)          # add a dashed line (lty = 2) with intercept 0 and slope 1
> abline(0,-1,lty=2)         # add a dashed line with intercept 0 and slope -1

slide-21
SLIDE 21

More?

  • The help() function
  • Refer to the official manual

○ http://cran.r-project.org/manuals.html

  • A wonderful 4-week long online course

○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html

  • A good book

○ ‘R in Action’ by Robert Kabacoff

  • Google
slide-22
SLIDE 22
  • 4. Bonus
  • Installation

○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
○ Ignore errors such as “Unknown media type in type 'all/all'”

  • RStudio

○ a wonderful IDE for R programmers
○ http://www.rstudio.com/

slide-23
SLIDE 23

Ricardo

Integrating R and Hadoop

slide-24
SLIDE 24

Motivation

  • Statistical software, such as R, provides rich functionality for data analysis and modeling, but can handle only limited amounts of data
  • Data management systems, such as Hadoop, can handle large data, but provide insufficient analytical functionality

Union is strength!

slide-25
SLIDE 25

Solution

  • Ricardo decomposes data-analysis algorithms into

○ parts executed by the R statistical analysis system
○ parts handled by the Hadoop data management system

slide-26
SLIDE 26

Components

  • R

○ The core of statistical analysis

  • Large-Scale Data Management Systems

○ HDFS
○ Work with dirty, semi/un-structured data
○ Massive data storage, manipulation and parallel processing

  • Jaql

○ A JSON Query Language
○ The declarative interface to Hadoop for Ricardo
○ Like Pig, Hive

slide-27
SLIDE 27

Architecture

slide-28
SLIDE 28

Conclusion

  • The current version has poor performance
slide-29
SLIDE 29

Overview of SciDB

Large Scale Array Storage, Processing and Analysis

slide-30
SLIDE 30

Contents

  • 1. Background and Motivation
  • 2. Features and Functionality
  • 3. Data Definition
  • 4. Data Manipulation
  • 5. Architecture
slide-31
SLIDE 31

What is SciDB?

  • Massively parallel storage manager
  • Able to parallelize large scale array processing algorithms

slide-32
SLIDE 32
  • 1. Background and Motivation
  • Modern scientific data differs from business data in three important respects:

○ Sensor arrays consist of rectangular ‘arrays’ of individual sensors
○ Scientific analysis requires sophisticated data processing methods
■ Ex: Noisy data needs to be ‘cleaned’
○ Data generated by modern scientific instruments is extremely large

  • Array Data Model is more desirable in scientific domains

○ With notions of adjacency or neighborhood
○ Ordering is fundamental

  • Complexity of data processing needs a much more flexible data management platform

○ A different kind of DBMS

slide-33
SLIDE 33
  • 2. Features and Functionality
  • Collections of n-dimensional arrays
  • Cells in arrays contain a tuple of values
  • Values are associated with a distinguishing attribute name

slide-34
SLIDE 34
  • 3. Data Definition
  • Create an array:
  • Output:
slide-35
SLIDE 35

3.1 Sparse Arrays

  • Arrays in SciDB may be sparse
  • Two ways to handle missing information:

○ Ignore it
○ Treat it depending on the operation

  • Sparse array with jagged edges and holes:
slide-36
SLIDE 36
  • 4. Data Manipulation in SciDB

slide-37
SLIDE 37

4.1 Slice()

  • Projects an array along a particular index value in a single dimension
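As a rough analogy only, fixing one index of a base R array has the same effect: the result has one dimension fewer. This sketch uses plain R indexing, not SciDB's Slice() operator or its query syntax; the array `a` is made up for illustration.

```r
# A 2 x 3 x 4 array filled with 1..24 (column-major order).
a <- array(1:24, dim = c(2, 3, 4))

# Fix the third dimension at index 2: the result is a 2 x 3 matrix.
s <- a[, , 2]

# Fix the first dimension at index 1: the result is a 3 x 4 matrix.
s_first <- a[1, , ]
```

`s` holds the values 7 to 12, the second "layer" of `a`.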

slide-38
SLIDE 38

4.2 Subsample()

  • Extracts a region of the array
  • Generalization of Slice()
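In base R terms, and again only as an analogy rather than SciDB syntax, extracting a region corresponds to indexing with ranges, and a slice is the special case where one range has length one. The matrix `a` below is an illustrative assumption.

```r
# A 4 x 5 matrix filled with 1..20 (column-major order).
a <- matrix(1:20, nrow = 4)

# Extract the region covering rows 2..3 and columns 2..4.
region <- a[2:3, 2:4]

# A slice is the degenerate case: a single index in one dimension.
# drop = FALSE keeps the result as a 1 x 5 matrix instead of a plain vector.
one_row <- a[2, , drop = FALSE]
```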
slide-39
SLIDE 39

4.3 SJoin()

  • Combines attributes from two input arrays

○ Combines cells with the same index value
○ Input arrays need not have identical dimensions
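The idea of combining attributes from cells that share an index can be sketched in base R by storing each array as a table of (index, attribute) rows and merging on the index columns. This is an analogy with `merge`, not SciDB's SJoin() operator; the tables `temp` and `pres` and their values are invented for illustration.

```r
# Two "arrays" stored as tables: index columns (i, j) plus one attribute each.
temp <- data.frame(i = c(1, 1, 2), j = c(1, 2, 1),
                   temperature = c(20.5, 21.0, 19.8))
pres <- data.frame(i = c(1, 2, 2), j = c(1, 1, 2),
                   pressure = c(1013, 1009, 1011))

# Keep only cells present in both inputs and carry both attributes along.
joined <- merge(temp, pres, by = c("i", "j"))
```

Only the cells (1,1) and (2,1) appear in both inputs, so `joined` has two rows, each with a temperature and a pressure.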

slide-40
SLIDE 40

4.4 Filter()

  • Applies a predicate to the attribute values of input

○ Produces an array with same size
○ Cells where the predicate is found false are set to empty
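This differs from the R masking shown earlier, where `y[mask]` returns a flattened vector. A closer base R analogy keeps the shape and blanks out the failing cells, here using `NA` as a stand-in for SciDB's empty cells (an assumption of this sketch, not SciDB behavior).

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)

# Keep the shape of y; cells where the predicate (y > 0) is false become NA.
filtered <- y
filtered[!(y > 0)] <- NA

# Later operations can then decide how to treat the empty cells.
total <- sum(filtered, na.rm = TRUE)   # 1 + 3 + 9 + 5 = 18
```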

slide-41
SLIDE 41

4.5 Extensibility

  • Provides Postgres style UDT and UDF extensibility

○ New types will inter-operate with SciDB’s own types and array operators

  • Supports operator extensibility

○ Gaussian Smoothing
○ Weighted average of the cell’s neighborhood
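To make "weighted average of the cell's neighborhood" concrete, here is a one-dimensional sketch in base R using `stats::filter`. The three-point kernel is a hand-picked, coarse stand-in for a Gaussian, and nothing here reflects SciDB's operator-extension API.

```r
# A single spike in an otherwise flat signal.
signal <- c(0, 0, 4, 0, 0)

# Symmetric weights that sum to 1: the centre cell counts most.
weights <- c(0.25, 0.5, 0.25)

# Each output cell is the weighted average of itself and its two neighbours.
# The end cells lack a full neighbourhood, so they come back as NA.
smoothed <- as.numeric(stats::filter(signal, weights, sides = 2))
```

The spike is spread over its neighbours: the interior of `smoothed` is 1, 2, 1.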

slide-42
SLIDE 42
  • 5. Architecture
  • Shared nothing design
  • Centralized system catalog database

○ Information about nodes, data distribution and user-defined extensions

  • Influenced by MapReduce
  • Implements only ‘A’ and ‘D’ of ACIDity

○ Atomicity
○ Durability

slide-43
SLIDE 43

References

  • SciDB web-site: http://www.scidb.org
  • Publications on SciDB: http://www.scidb.org/about/publications.php
  • Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA: http://www.scidb.org/Documents/sigmod691-brown.pdf
slide-44
SLIDE 44

SciDB-R

Best Database for R

slide-45
SLIDE 45

Data Frame

  • Create, like a table in database

mydata <- data.frame(col1, col2, col3,...)

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

slide-46
SLIDE 46

The R Programmers

  • want their analytics to just work on extremely large datasets as nimbly as on small ones.
  • want to concentrate on the analytics, not parallelism, data formatting, and memory management.

slide-47
SLIDE 47

Benefits

  • Use SciDB to manage large data sets

○ a storage backend
○ filter and join data before performing analytics

  • Use SciDB to share intensive computing load

○ offload large computations to a cluster
○ do some analytical tasks

  • Use SciDB to share data among multiple users

slide-51
SLIDE 51

Example

  • R code using SciDB to perform calculations

> library("scidb")           # load the scidb package in the current R session
> scidbconnect()             # may require host, port, username, password
> U <- scidb("Z")            # get array Z from SciDB as a SciDB array object U;
                             # U is an R representation of SciDB array Z, pretty much a data frame
> set.seed(1)                # set a seed for randomization
> x = cbind(rnorm(5))        # create a column vector with 5 rows
> y = U %*% x                # computed by SciDB, returning a SciDB array object
> y[, drop = FALSE]          # return the computed result (stored in SciDB) to R;
                             # drop: if y has only one column and drop is TRUE,
                             # y is reduced to a plain vector without labels

slide-52
SLIDE 52

References

  • Official Website

○ http://www.paradigm4.com/scidb-r/

  • SciDB-R package

○ https://github.com/Paradigm4

  • Instructions and Manuals

○ http://cran.r-project.org/web/packages/scidb/

slide-53
SLIDE 53

Julia Language

A Fast Dynamic Language for Technical Computing

slide-54
SLIDE 54

Contents

  • 1. Motivation
  • 2. Features
  • 3. JIT Compiler and Performance Benchmarks
  • 4. Example Codes
  • 5. IJulia
  • 6. Issue Tracking
slide-55
SLIDE 55
  • 1. Motivation
  • Why do we need more?

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

slide-56
SLIDE 56
  • 1. What is Julia?
  • High level, high performance dynamic programming language

○ Syntax familiar to Matlab users

  • The library is written mostly in Julia
slide-57
SLIDE 57
  • 2. Some Features
  • Open source with an MIT licensed core
  • Easy installation
  • Dynamically typed with fast user-defined types
  • JIT compiler
  • Distributed memory parallelism
  • Call C, Fortran and Python libraries
  • Unicode support
  • Metaprogramming with Lisp-like macros
slide-58
SLIDE 58
  • 3. High-Performance JIT Compiler
  • LLVM-based Just in Time Compiler
  • Often matches the performance of C

Benchmark times relative to C (smaller is better)

slide-59
SLIDE 59
  • 3. Log-scale of Benchmark

Execution time relative to C++

Benchmarks: fib, parse_int, quicksort, mandel, pi_sum, rand_mat_stat, and rand_mat_mul

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

slide-60
SLIDE 60
  • 4. Example Julia Codes

slide-61
SLIDE 61

4.1 Arrays and Vectors

slide-62
SLIDE 62

4.1 Arrays and Vectors

slide-63
SLIDE 63

4.2 Matrix Operations

slide-64
SLIDE 64

4.3 Ternary Operators

slide-65
SLIDE 65

4.4 Packages

  • Pkg.add("Package_Name")

○ Cpp for calling C++ from Julia
○ Curl for Julia HTTP Curl library
○ Winston, Gadfly, Gaston or PyPlot for graphics and plotting
○ HDFS for a wrapper over Hadoop HDFS library
○ LIBSVM for LIBSVM bindings for Julia
○ and many more

  • http://docs.julialang.org/en/latest/packages/packagelist
slide-66
SLIDE 66

4.5 Plotting

  • Graphics in Julia are available through external packages

○ Use Winston.jl for Matlab-style plots
○ Use Gadfly for Wickham-Wilkinson style of graphics

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

slide-67
SLIDE 67

4.6 Sequential Buffon’s Needle

  • We have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor.

○ What is the probability that the needle will lie across a line between two strips?

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
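The slide's Julia code is not preserved in this transcript. As a stand-in, here is a minimal Monte Carlo sketch of the same problem in R. It assumes a needle no longer than the strip width, in which case the known answer is 2*needle/(pi*width). The function name `buffon`, the seed, and the number of throws are all illustrative choices.

```r
# Estimate the crossing probability by simulating n needle throws.
buffon <- function(n, needle = 1, width = 1) {
  # Distance from the needle's centre to the nearest line: uniform on [0, width/2].
  center <- runif(n, 0, width / 2)
  # Acute angle between the needle and the lines: uniform on [0, pi/2].
  angle <- runif(n, 0, pi / 2)
  # The needle crosses a line when its half-length projection reaches that line.
  mean(center <= (needle / 2) * sin(angle))
}

set.seed(1)                  # arbitrary seed, only for reproducibility
estimate <- buffon(100000)
theory <- 2 / pi             # exact value for needle = width = 1
```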

slide-68
SLIDE 68

4.7 Parallel Buffon’s Needle

  • @parallel (+) for loop

○ Assign iterations to multiple processes
○ Combine them with a specified reduction (+)

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013
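The split-then-reduce pattern can be sketched in R as well: divide the throws into chunks, count hits per chunk, and add the partial counts. The sketch below runs the chunks one after another with `lapply` so that it works anywhere; on a Unix-like system, swapping in `parallel::mclapply(chunks, count_hits, mc.cores = 2)` would be one way to spread the chunks over processes. The chunk sizes and the helper name `count_hits` are assumptions for illustration, and this is not the Julia `@parallel` code from the slide.

```r
# Count needle crossings in n throws (needle no longer than the strip width).
count_hits <- function(n, needle = 1, width = 1) {
  center <- runif(n, 0, width / 2)
  angle <- runif(n, 0, pi / 2)
  sum(center <= (needle / 2) * sin(angle))
}

set.seed(1)                             # arbitrary seed
chunks <- rep(25000, 4)                 # four equal shares of 100000 throws

# Each chunk is independent of the others, which is what makes
# the loop easy to distribute.
partial <- lapply(chunks, count_hits)

# The reduction step: combine the partial counts with +.
total_hits <- Reduce(`+`, partial)
estimate <- total_hits / sum(chunks)
```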

slide-69
SLIDE 69

4.8 Writing Low-Level Code

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

slide-70
SLIDE 70

4.9 Using Python Libraries

  • Pkg.add("PyCall")

○ using PyCall

*Viral B. Shah, Fifth Elephant Presentation, July 13 2013

slide-71
SLIDE 71
  • 5. IJulia
  • Julia is used from the command line by default
  • IJulia combines Julia with IPython
  • IPython provides a rich architecture for interactive computing
  • Available on GitHub: https://github.com/JuliaLang/IJulia.jl
slide-72
SLIDE 72
  • 6. Issue Tracking
  • Source code is on GitHub

○ https://github.com/JuliaLang/julia

  • Issues can be opened from Julia GitHub Repository

○ https://github.com/JuliaLang/julia/issues

  • Easy and quick bug fixes

○ No need to wait for another release

slide-73
SLIDE 73

References

  • Julia web-site: http://julialang.org
  • C, Fortran, Julia, Python, Matlab, R and JavaScript codes used in benchmarking: https://github.com/JuliaLang/julia/tree/master/test/perf/micro
  • Publications on Julia: http://julialang.org/publications/
  • Julia Source Code: https://github.com/JuliaLang/julia
  • Viral B. Shah’s introductory slides on July 13, 2013: https://github.com/ViralBShah/julia-presentations/raw/master/Fifth-Elephant-2013/Fifth-Elephant-2013.pdf
  • MIT IAP Julia Tutorial: http://www.youtube.com/user/JuliaLanguage
  • Winston, 2D plotting for Julia: https://github.com/nolta/Winston.jl

slide-74
SLIDE 74

Advantages of R over Julia

  • Backed by GNU
  • More mature and older
  • Large collection of libraries, i.e. CRAN
  • Rich development environment, i.e. RStudio
  • Graphics and plotting
  • Many great tutorials and books
slide-75
SLIDE 75

Advantages of Julia over R

  • Performance of Julia that is close to C

○ JIT LLVM Compiler

  • Supports low-level programming; functions can modify their arguments
  • Active community promises a bright future for Julia
  • R is single-threaded, whereas Julia supports parallelism

○ R has techniques for large datasets, but they are not easy to use

  • Julia provides fast development and fast execution
slide-76
SLIDE 76

Syntax Differences

Operator                      R             Julia
Assignment                    <-            =
Element-wise multiplication   *             .*
Element-wise addition         +             + (or .+)
Modulo                        %%            mod
Creating a vector             c(1,2,3,4)    [1,2,3,4]
Size of the array             dim           size
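The R column of the comparison can be tried directly. A short sketch with arbitrary illustrative values:

```r
x <- c(1, 2, 3, 4)            # assignment with <- and vector creation with c()

prod_elem <- x * x            # element-wise multiplication: 1 4 9 16
sum_elem  <- x + x            # element-wise addition: 2 4 6 8
remainder <- 7 %% 3           # modulo: 1

m <- matrix(1:6, nrow = 2)
size_m <- dim(m)              # size of the array: 2 3
```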

slide-77
SLIDE 77

References

  • Matlab, R, and Julia: Languages for Data Analysis, October 15 2012: http://strata.oreilly.com/2012/10/matlab-r-julia-languages-for-data-analysis.html
  • Julia for R Programmers, Douglas Bates, July 18 2013: http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf
  • An R Programmer Looks at Julia, Douglas Bates, April 7 2012: http://dmbates.blogspot.com/2012/04/r-programmer-looks-at-julia.html
  • http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/

slide-78
SLIDE 78

Questions?