An Interactive Introduction to R for Actuaries (CAS Conference)




slide-1
SLIDE 1

An Interactive Introduction to R for Actuaries

CAS Conference November 2009

Michael E. Driscoll, Ph.D. Daniel Murphy FCAS, MAAA

slide-2
SLIDE 2

January 6, 2009

slide-3
SLIDE 3
slide-4
SLIDE 4

R is a tool for…

Data Manipulation

  • connecting to data sources
  • slicing & dicing data

Modeling & Computation

  • statistical modeling
  • numerical simulation

Data Visualization

  • visualizing fit of models
  • composing statistical graphics
slide-5
SLIDE 5

R is an environment

slide-6
SLIDE 6

Its interface is simple

slide-7
SLIDE 7

Let’s take a tour of some claim data in R

slide-8
SLIDE 8

R is “an overgrown calculator”

  • simple math

> 2+2
4

  • storing results in variables

> x <- 2+2   ## '<-' is R syntax for '=' or assignment
> x^2
16

  • vectorized math

> weight <- c(110, 180, 240)      ## three weights
> height <- c(5.5, 6.1, 6.2)      ## three heights
> bmi <- (weight*4.88)/height^2   ## divides element-wise
17.7 23.6 30.4

slide-9
SLIDE 9

R is “an overgrown calculator”

  • basic statistics

> mean(weight)
176.6
> sd(weight)
65.0
> sqrt(var(weight))
65.0   # same as sd

  • set functions

union intersect setdiff

  • advanced statistics

> pbinom(40, 100, 0.5)   ## P of 40 or fewer heads in
0.028                    ## 100 tosses of a fair coin
> pshare <- pbirthday(23, 365, coincident=2)
0.530   ## probability that among 23 people, two share a birthday

slide-10
SLIDE 10

Try It! #1 Overgrown Calculator

  • basic calculations

> 2 + 2 [Hit ENTER]
> log(100) [Hit ENTER]

  • calculate the value of $100 after 10 years at 5%

> 100 * exp(0.05*10) [Hit ENTER]

  • construct a vector & do a vectorized calculation

> year <- (1,2,5,10,25) [Hit ENTER] this returns an error. why?
> year <- c(1,2,5,10,25) [Hit ENTER]
> 100 * exp(0.05*year) [Hit ENTER]
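The exercise above can be sketched as a short script. Parentheses on their own only group an expression, so `(1,2,5,10,25)` is a syntax error; `c()` is the function that combines values into a vector. The names `value` and `value_annual` are illustrative, and the comparison with annual compounding is an addition for context rather than part of the exercise.

```r
# c() combines its arguments into a numeric vector
year <- c(1, 2, 5, 10, 25)

# continuous compounding at 5%: one expression, five results
value <- 100 * exp(0.05 * year)
print(round(value, 2))

# for comparison (an addition, not from the slide): annual compounding
value_annual <- 100 * (1 + 0.05)^year
print(round(value_annual, 2))

# the version without c() fails at parse time, before anything is evaluated
bad <- try(parse(text = "year <- (1,2,5,10,25)"), silent = TRUE)
print(inherits(bad, "try-error"))
```

Because `exp()` and `*` are vectorized, no loop over the years is needed.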

slide-11
SLIDE 11

R is a numerical simulator

  • built-in functions for classical probability distributions
  • let’s simulate 100,000 trials of 100 coin flips. what’s the distribution of heads?

> heads <- rbinom(10^5,100,0.50)
> hist(heads)
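One way to check a simulation like this is to compare it with the theoretical binomial values. The sketch below uses the 10^5 trials from the code above; the seed is arbitrary and only makes the run repeatable.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
heads <- rbinom(10^5, size = 100, prob = 0.5)

# theoretical values for Binomial(100, 0.5)
theo_mean <- 100 * 0.5                # n * p
theo_sd   <- sqrt(100 * 0.5 * 0.5)    # sqrt(n * p * (1 - p))

cat("simulated mean:", mean(heads), " theoretical:", theo_mean, "\n")
cat("simulated sd:  ", sd(heads),   " theoretical:", theo_sd, "\n")

# share of trials with 40 or fewer heads, next to the exact probability
cat("simulated P(heads <= 40):", mean(heads <= 40), "\n")
cat("exact     P(heads <= 40):", pbinom(40, 100, 0.5), "\n")
```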

slide-12
SLIDE 12

Functions for Probability Distributions

ddist( )   density function (pdf)
pdist( )   cumulative distribution function (cdf)
qdist( )   quantile function
rdist( )   random deviates

Examples
Normal     dnorm, pnorm, qnorm, rnorm
Binomial   dbinom, pbinom, …
Poisson    dpois, …

> pnorm(0)
0.5
> qnorm(0.9)
1.28
> rnorm(100)
vector of length 100
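The four prefixes fit together: `p` and `q` are inverses, `d` integrates to `p`, and `r` draws samples whose empirical quantiles approach `q`. A minimal sketch for the normal distribution, with an arbitrary seed and sample size:

```r
# p and q are inverses of each other
p <- pnorm(1.28)          # P(Z <= 1.28), close to 0.9
q <- qnorm(p)             # recovers 1.28
print(c(p = p, q = q))

# d is the density: integrating it up to 0 reproduces pnorm(0)
area <- integrate(dnorm, lower = -Inf, upper = 0)$value
print(area)

# r draws random deviates; their empirical quantile approximates qnorm
set.seed(1)               # arbitrary seed
z <- rnorm(10000)
print(quantile(z, 0.9))   # should be near qnorm(0.9)
print(qnorm(0.9))
```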

slide-13
SLIDE 13

Functions for Probability Distributions

distribution         dist suffix in R
Beta                 beta
Binomial             binom
Cauchy               cauchy
Chisquare            chisq
Exponential          exp
F                    f
Gamma                gamma
Geometric            geom
Hypergeometric       hyper
Logistic             logis
Lognormal            lnorm
Negative Binomial    nbinom
Normal               norm
Poisson              pois
Student t            t
Uniform              unif
Tukey                tukey
Weibull              weibull
Wilcoxon             wilcox

How to find the functions for lognormal distribution?

1) Use the double question mark ‘??’ to search

> ??lognormal

2) Then identify the package

> ?Lognormal

3) Discover the dist functions

dlnorm, plnorm, qlnorm, rlnorm
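Once found, the lognormal functions behave like their normal counterparts applied on the log scale. A small sketch, assuming the default parameters `meanlog = 0` and `sdlog = 1` and an arbitrary seed:

```r
meanlog <- 0; sdlog <- 1      # illustrative parameters (the R defaults)

# rlnorm(n) matches exp(rnorm(n)) when both start from the same seed
set.seed(2)
x <- rlnorm(5, meanlog, sdlog)
set.seed(2)
y <- exp(rnorm(5, meanlog, sdlog))
print(all.equal(x, y))

# plnorm agrees with pnorm applied to log(x)
print(plnorm(2, meanlog, sdlog))
print(pnorm(log(2), meanlog, sdlog))

# theoretical mean of a lognormal: exp(meanlog + sdlog^2 / 2)
print(exp(meanlog + sdlog^2 / 2))
```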

slide-14
SLIDE 14

Try It! #2 Numerical Simulation

  • simulate 1m policy holders from which we expect 4 claims

> numclaims <- rpois(n, lambda)
(hint: use ?rpois to understand the parameters)

  • verify the mean & variance are reasonable

> mean(numclaims)
> var(numclaims)

  • visualize the distribution of claim counts

> hist(numclaims)
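One possible solution is sketched below. The exercise wording leaves the parameters open, so this assumes `n` is one million policyholders and `lambda` is an expected 4 claims per policyholder; both values, and the seed, are assumptions rather than the presenters' answer.

```r
set.seed(2009)                    # arbitrary seed
n      <- 1e6                     # assumed: one million policyholders
lambda <- 4                       # assumed: expected claims per policyholder

numclaims <- rpois(n, lambda)

# for a Poisson distribution, mean and variance both equal lambda
print(mean(numclaims))
print(var(numclaims))

# tabulate the claim counts (a text alternative to hist(numclaims))
print(table(numclaims))
```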

slide-15
SLIDE 15

Getting Data In

  • from Files

> Insurance <- read.csv("Insurance.csv", header=TRUE)

  • from Databases

> con <- dbConnect(driver,user,password,host,dbname)
> Insurance <- dbGetQuery(con, "SELECT * FROM claims")   ## dbSendQuery alone returns a result handle, not a data frame

  • from the Web

> con <- url('http://labs.dataspora.com/test.txt')
> Insurance <- read.csv(con, header=TRUE)

  • from R objects

> load('Insurance.RData')

slide-16
SLIDE 16

Getting Data Out

  • to Files

write.csv(Insurance, file="Insurance.csv")

  • to Databases

con <- dbConnect(dbdriver,user,password,host,dbname)
dbWriteTable(con, "Insurance", Insurance)

  • to R Objects

save(Insurance, file="Insurance.RData")
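A round trip through both file formats can be sketched with temporary files. The data frame `claims` below is made up for illustration and stands in for the Insurance data; the temporary paths keep the working directory untouched.

```r
# a small made-up data frame standing in for the Insurance data
claims <- data.frame(District = c(1, 1, 2),
                     Age      = c("<25", "25-29", "<25"),
                     Claims   = c(38, 35, 22))

csv_file   <- tempfile(fileext = ".csv")     # temporary paths, so nothing
rdata_file <- tempfile(fileext = ".RData")   # in the working directory changes

# CSV round trip: row.names = FALSE avoids an extra index column
write.csv(claims, file = csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file, header = TRUE)
print(from_csv)

# RData round trip: load() restores the object under its original name
save(claims, file = rdata_file)
original <- claims
rm(claims)
load(rdata_file)
print(identical(claims, original))
```

A CSV file keeps only the values and column names, while an RData file also preserves column types and the object name.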

slide-17
SLIDE 17

Navigating within the R environment

  • listing all variables

> ls()

  • examining a variable ‘x’

> str(x)
> head(x)
> tail(x)
> class(x)

  • removing variables

> rm(x)

slide-18
SLIDE 18

Try It! #3 Data Processing

  • load data & view it

library(MASS)
head(Insurance)   ## the first 6 rows
dim(Insurance)    ## number of rows & columns

  • write it out

write.csv(Insurance, file="Insurance.csv", row.names=FALSE)
getwd()   ## where am I?

  • view it in Excel, make a change, save it

remove the first district

  • load it back in to R & plot it

Insurance <- read.csv(file="Insurance.csv", stringsAsFactors=TRUE)   ## keep Age as a factor
plot(Claims/Holders ~ Age, data=Insurance)

slide-19
SLIDE 19

A Swiss-Army Knife for Data

  • Indexing
  • Three ways to index into a data frame

– array of integer indices
– array of character names
– array of logical Booleans

  • Examples:

df[1:3,]
df[c("New York", "Chicago"),]
df[c(TRUE,FALSE,TRUE,TRUE),]
df[df$city == "New York",]
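The indexing styles above can be tried on a small data frame. The data here is made up for illustration; the row names are set to the city names so that indexing by character name works.

```r
# made-up data; row names let us index by character name
df <- data.frame(city = c("New York", "Chicago", "Boston", "Denver"),
                 claims = c(120, 80, 45, 30),
                 row.names = c("New York", "Chicago", "Boston", "Denver"))

by_position <- df[1:3, ]                          # integer indices
by_name     <- df[c("New York", "Chicago"), ]     # character row names
by_logical  <- df[c(TRUE, FALSE, TRUE, TRUE), ]   # logical vector, one per row
by_test     <- df[df$city == "New York", ]        # logical vector from a comparison

print(by_position)
print(by_name)
print(by_logical)
print(by_test)
```

The last form is the most common in practice: the comparison produces a logical vector, which then selects the matching rows.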

slide-20
SLIDE 20

A Swiss-Army Knife for Data

  • Subset

subset()

  • Reshape

reshape()

  • Transform

transform()
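A minimal sketch of the three functions on made-up exposure data follows. The column names mirror the Insurance data, but the values and the `young`/`old` age bands are invented for illustration.

```r
# made-up exposure data in "long" form: one row per district and age band
d <- data.frame(District = c(1, 1, 2, 2),
                Age      = c("young", "old", "young", "old"),
                Holders  = c(200, 500, 100, 400),
                Claims   = c(40, 50, 15, 30))

# transform(): add a derived column
d <- transform(d, Rate = Claims / Holders)

# subset(): keep rows that satisfy a condition, and only some columns
high <- subset(d, Rate > 0.1, select = c(District, Age, Rate))
print(high)

# reshape(): long to wide, one row per district, one Rate column per age band
wide <- reshape(d[, c("District", "Age", "Rate")],
                idvar = "District", timevar = "Age", direction = "wide")
print(wide)
```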

slide-21
SLIDE 21

A Statistical Modeler

  • R has a powerful modeling syntax
  • Models are specified with formulae, like

y ~ x
growth ~ sun + water

which model relationships between continuous and categorical variables.

  • Models also guide the visualization of relationships in graphical form

slide-22
SLIDE 22

A Statistical Modeler

  • Linear model

m <- lm(Claims ~ Age, data=Insurance)

  • Examine it

summary(m)

  • Plot it

plot(m)
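The linear model above can be run as is once MASS is loaded. The sketch below adds one check that is not on the slide: with a single factor predictor such as Age, the fitted values are simply the group means of Claims.

```r
library(MASS)                     # provides the Insurance data frame

m <- lm(Claims ~ Age, data = Insurance)

print(coef(m))                    # intercept plus one coefficient per Age contrast
print(summary(m)$r.squared)       # share of variance explained

# fitted values for a factor predictor are the group means
print(tapply(fitted(m), Insurance$Age, mean))
print(tapply(Insurance$Claims, Insurance$Age, mean))
```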

slide-23
SLIDE 23

A Statistical Modeler

  • Logistic model

m <- glm(cbind(Claims, Holders - Claims) ~ Age, family=binomial, data=Insurance)

  • Examine it

summary(m)

  • Plot it

plot(m)
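In base R a logistic model is typically fitted with `glm()` and `family = binomial`. The sketch below assumes the Insurance data from MASS and treats Claims out of Holders as the binomial response, which in turn assumes at most one claim per policyholder; that modeling choice is an assumption made here, not something the slide states.

```r
library(MASS)                     # provides the Insurance data frame

# binomial response: claims ("successes") and holders without a claim
m <- glm(cbind(Claims, Holders - Claims) ~ Age,
         family = binomial, data = Insurance)

print(summary(m)$coefficients)

# fitted claim probability for each row, on the 0-1 scale
p_hat <- predict(m, type = "response")
print(range(p_hat))
```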

slide-24
SLIDE 24

Try It! #4 Statistical Modeling

  • fit a linear model

m <- lm(Claims/Holders ~ Age + 0, data=Insurance)

  • examine it

summary(m)

  • plot it

plot(m)

slide-25
SLIDE 25

Visualization: Multivariate Barplot

library(ggplot2)
qplot(Group, Claims/Holders, data=Insurance, geom="bar", stat='identity', position="dodge", facets=District ~ ., fill=Age)

slide-26
SLIDE 26

Visualization: Boxplots

library(ggplot2)
qplot(Age, Claims/Holders, data=Insurance, geom="boxplot")

library(lattice)
bwplot(Claims/Holders ~ Age, data=Insurance)

slide-27
SLIDE 27

Visualization: Histograms

library(lattice)
densityplot(~ Claims/Holders | Age, data=Insurance, layout=c(4,1))

library(ggplot2)
qplot(Claims/Holders, data=Insurance, facets=Age ~ ., geom="density")

slide-28
SLIDE 28

Try It! #5 Data Visualization

  • simple line chart

> x <- 1:10
> y <- x^2
> plot(y ~ x)

  • box plot

> library(lattice)
> boxplot(Claims/Holders ~ Age, data=Insurance)

  • visualize a linear fit

> abline()
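For the last step, `abline()` needs something to draw. One option, sketched here, is to pass it a fitted `lm` object. The `pdf(NULL)` line is an addition that keeps the plot off-screen so the script also runs without a display; remove it to see the chart.

```r
x <- 1:10
y <- x^2

fit <- lm(y ~ x)                  # straight-line fit to a curved relationship

pdf(NULL)                         # draw off-screen; drop this line to see the plot
plot(y ~ x)
abline(fit)                       # adds the fitted line using coef(fit)
invisible(dev.off())

print(coef(fit))                  # intercept and slope of the line just drawn
```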

slide-29
SLIDE 29

Getting Help with R

Help within R itself

For a function

> help(func)
> ?func

For a topic

> help.search(topic)
> ??topic

  • search.r-project.org
  • Google Code Search www.google.com/codesearch
  • Stack Overflow http://stackoverflow.com/tags/R
  • R-help list http://www.r-project.org/posting-guide.html
slide-30
SLIDE 30

Final Try It! Simulate a Tweedie

  • Simulate the number of claims from a Poisson distribution with λ=2 (NB: mean poisson = λ, variance poisson = λ)
  • For as many claims as were randomly simulated, simulate a severity from a gamma distribution with shape α=49 and scale θ=0.2 (NB: mean gamma = αθ, variance gamma = αθ²)
  • Is the total simulated claim amount close to expected?
  • Calculate usual parameterization (μ,p,φ) of this Tweedie distribution
  • Extra credit:
  • Repeat the above 10000 times.
  • Does your histogram look like Glenn Meyers’?

http://www.casact.org/newsletter/index.cfm?fa=viewart&id=5756
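One way to work through the exercise is sketched below. The parameter formulas in the code are the standard mapping from a compound Poisson-gamma sum to the Tweedie (μ, p, φ) form; the helper `sim_total` and the seed are illustrative choices, not the presenters' solution.

```r
set.seed(123)                       # arbitrary seed
lambda <- 2                         # Poisson mean claim count
alpha  <- 49                        # gamma shape
theta  <- 0.2                       # gamma scale

# one simulated total: draw a claim count, then that many severities
n_claims <- rpois(1, lambda)
total    <- sum(rgamma(n_claims, shape = alpha, scale = theta))
cat("claims:", n_claims, " total:", total, "\n")

# expected total and its variance for a compound Poisson-gamma sum
exp_total <- lambda * alpha * theta                   # lambda * mean severity
var_total <- lambda * alpha * theta^2 * (1 + alpha)   # lambda * E[severity^2]

# Tweedie parameters implied by lambda, alpha, theta
p   <- (alpha + 2) / (alpha + 1)
mu  <- lambda * alpha * theta
phi <- lambda^(1 - p) * (alpha * theta)^(2 - p) / (2 - p)
cat("mu:", mu, " p:", p, " phi:", phi, "\n")

# extra credit: repeat 10000 times (hypothetical helper function)
sim_total <- function() {
  n <- rpois(1, lambda)
  sum(rgamma(n, shape = alpha, scale = theta))        # sum of zero draws is 0
}
totals <- replicate(10000, sim_total())
cat("simulated mean:", mean(totals), " variance:", var(totals), "\n")
cat("phi * mu^p:    ", phi * mu^p, "\n")              # Tweedie variance, equals var_total
```

Calling `hist(totals)` afterwards gives the histogram to compare with the article linked above.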

$$p = \frac{\alpha + 2}{\alpha + 1}, \qquad \mu = \lambda \alpha \theta, \qquad \phi = \frac{\lambda^{1-p} (\alpha\theta)^{2-p}}{2 - p}$$

slide-31
SLIDE 31

Six Indispensable Books on R

Visualization Learning R Statistical Modeling Data Manipulation

slide-32
SLIDE 32

Contact Us

Daniel Murphy, FCAS, MAAA
dmurphy@trinostics.com
925.381.9869
P&C Actuarial Models
Design • Construction
Collaboration • Education
Valuable • Transparent

Michael E. Driscoll, Ph.D.
www.dataspora.com
San Francisco, CA
415.860.4347

slide-33
SLIDE 33

Appendices

  • R as a Programming Language
  • Advanced Visualization
  • Embedding R in a Server Environment
slide-34
SLIDE 34

R as a Programming Language

slide-35
SLIDE 35

Assignment

x <- c(1,2,6)

x    a variable x
<-   R’s assignment operator, equivalent to ‘=’
c(   a function c which combines its arguments into a vector

y <- c('apples','oranges')
z <- c(TRUE,FALSE)
c(TRUE,FALSE) -> z

These are also valid assignment statements.

slide-36
SLIDE 36

Function Calls

  • There are ~1100 built-in commands in the R “base” package, which can be executed on the command line. The basic structure of a call is thus:

output <- function(arg1, arg2, …)

  • Arithmetic Operations

+ - * / ^

  • R functions are typically vectorized

x <- x/3    works whether x is a one- or many-valued vector

slide-37
SLIDE 37

Data Structures in R

vectors: numeric, character, logical

x <- c(0,2:4)
y <- c("alpha", "b", "c3", "4")
z <- c(1, 0, TRUE, FALSE)

> class(x)
[1] "numeric"
> x2 <- as.logical(x)
> class(x2)
[1] "logical"
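A vector holds a single type, so mixing types triggers coercion. The sketch below reuses the three vectors from this slide to show which type wins; the `suppressWarnings()` call only hides the expected coercion warning.

```r
x <- c(0, 2:4)
y <- c("alpha", "b", "c3", "4")
z <- c(1, 0, TRUE, FALSE)        # mixing numbers and logicals

print(class(x))                  # "numeric"
print(class(y))                  # "character"
print(class(z))                  # "numeric": TRUE/FALSE were coerced to 1/0

# explicit coercion
print(as.logical(x))             # 0 becomes FALSE, non-zero becomes TRUE
print(suppressWarnings(as.numeric(y)))   # non-numeric strings become NA
print(class(c(x, y)))            # "character": the most general type wins
```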

slide-38
SLIDE 38

Data Structures in R

objects: lists, matrices, data frames*

lst <- list(x,y,z)
M <- matrix(rep(x,3),ncol=3)
df <- data.frame(x,y,z)

> class(df)
[1] "data.frame"

slide-39
SLIDE 39

Summary of Data Structures

                Linear     Rectangular
Homogeneous     vectors    matrices
Heterogeneous   lists      data frames*

slide-40
SLIDE 40

Advanced Visualization

lattice, ggplot2, and colorspace

slide-41
SLIDE 41

ggplot2 = grammar of graphics

slide-42
SLIDE 42

ggplot2 = grammar of graphics

slide-43
SLIDE 43

Achieving small multiples with “facets”

qplot(log(carat), log(price), data = diamonds, alpha=I(1/20)) + facet_grid(. ~ color)

slide-44
SLIDE 44

lattice = trellis

(source: http://lmdvr.r-forge.r-project.org )

slide-45
SLIDE 45

densityplot(~ speed | type, data=pitch)

list of lattice functions

slide-46
SLIDE 46

visualizing six dimensions of MLB pitches with lattice
slide-47
SLIDE 47

xyplot(x ~ y, data=pitch)

slide-48
SLIDE 48

xyplot(x ~ y, groups=type, data=pitch)

slide-49
SLIDE 49

xyplot(x ~ y | type, data=pitch)

slide-50
SLIDE 50

xyplot(x ~ y | type, data=pitch,
       fill.color = pitch$color,
       panel = function(x, y, fill.color, ..., subscripts) {
         fill <- fill.color[subscripts]
         panel.xyplot(x, y, fill=fill, ...)
       })

slide-51
SLIDE 51

xyplot(x ~ y | type, data=pitch,
       fill.color = pitch$color,
       panel = function(x, y, fill.color, ..., subscripts) {
         fill <- fill.color[subscripts]
         panel.xyplot(x, y, fill=fill, ...)
       })

slide-52
SLIDE 52

Beautiful Colors with Colorspace

library(colorspace)   ## package names are case-sensitive
red <- LAB(50,64,64)
blue <- LAB(50,-48,-48)
mixcolor(10, red, blue)

slide-53
SLIDE 53

efficient plotting with hexbinplot

hexbinplot(log(price)~log(carat),data=diamonds,xbins=40)

slide-54
SLIDE 54

Embedding R in a Web Server

Using Packages & R in a Server Environment

slide-55
SLIDE 55
slide-56
SLIDE 56
slide-57
SLIDE 57

Linux Apache MySQL R

http://labs.dataspora.com/gameday

slide-58
SLIDE 58

Coding vs Clicking

slide-59
SLIDE 59