SLIDE 1

Analysing data with GNU R

Nils Kammenhuber

Technische Universität München

SLIDE 2

Sources for some of these slides

 Gilberto Câmara

 Instituto Nacional de Pesquisas Espaciais

 Manfred Jobmann

 Technische Universität München

 Johannes Freudenberg

 Cincinnati Children’s Hospital Medical Center

 Marcel Baumgartner

 Nestec S.A.

 Jaeyong Lee

 Penn State University

 Jennifer Urbano Blackford, Ph.D

 Department of Psychiatry, Kennedy Center

 Wolfgang Huber

SLIDE 3

Why this talk on statistics?

 “My new protocol increases performance.”  How to prove that?

 Analytical expressions  usually cumbersome, and the model usually is way too simple

 Simulations  Testbed

 Usually, you have measurements that you need to analyse

 How to prove the measurements really show the intended effect – i.e., they don’t just show some random artefact?

 Via more measurements  Statistics

SLIDE 4

Performance

 “My new routing protocol increases network performance.” – Yeah, but what's “performance”?

 Base unit:

 Throughput (IP)
 Throughput (TCP, HTTP)
 Delay (IP)
 Delay (HTTP)
 Packet drop rate (IP)
 …and many more!

 Aggregation type:

 Average … but what type of average?
 Does the throughput for a 10GB object count more than for a 2kB object?
 If not, should the weight be linear with the file size? Or log(size)?
 Should we use the arithmetic, harmonic or geometric mean?
 Minimum
 Variation across time
 Variation across different hosts

SLIDE 5

Performance: Take-home messages

 Always think about the metric very intensely

 Don't just pick the first performance metric that comes to your mind – be imaginative and creative!

 Try to find multiple metrics

 Play around with them, compare them
 Insight evolves  experiments evolve, metrics evolve
 One metric may reveal some particular effect, another one may hide it. But both can be useful.

 Log tons of data. All you can get hold of easily.

 You never know what metrics you will come up with later
 There's time for rm and drop table later
 Warning: Document that data! Otherwise you won't have tons of data, but tons of meaningless numbers.

SLIDE 6

Take-home message (2)

 In my opinion, intelligent behaviour consists of

 20% clever answers
 80% clever questions (which usually have trivial or obvious answers)

 So always try to find a useful question! Then...

 Refine the question
 Find sub-questions and implied questions

 Examples for rather ubiquitous “good” questions:

 How do I measure network performance?
 From whose standpoint? End user? Network operator?
 Is a one-dimensional metric enough for me?

 How do I do a good simulation?
 Am I using a realistic (or otherwise meaningful) workload?
 Am I using a topology that adequately reflects reality?
 Which aspects are unimportant and don't need to be included in the simulation?
 Can I analyse different aspects in different experiments?

SLIDE 7

Back to statistics! – Why GNU R?

 Powerful tool

 for plotting (gnuplot can do that, too)
 for statistical analyses (gnuplot can't)

 Open source

 Has become the de-facto standard for statistical analyses in many fields of science

 Huge library: 3700 additional packages at CRAN

SLIDE 8

Why not GNU R?

 Unintuitive as soon as you want to do more than plotting data points from a file (IMO)

 Strange syntax
 Confusing language elements: vector vs. list, matrix vs. dataframe vs. list of vectors, NA vs. NaN vs. NULL, etc.

 Terribly slow if you program in idioms from traditional languages

 Iterating over elements of a list or vector with for(): slooooooow
 But applying an operator or function to an entire vector or matrix: remarkably fast

 Aims at interactivity, not scripting.

 Reproducing a plot 3 years later: almost impossible
 ... unless you cared for reproducibility right from the start, but it's somewhat cumbersome and requires discipline on your side
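The for() vs. whole-vector point above is easy to try out yourself; a minimal sketch (absolute timings depend on your machine, but the gap is usually dramatic):

```r
# Summing a large vector: element-wise for() loop vs. vectorised sum()
x <- runif(1e6)

system.time({
  s <- 0
  for (v in x) s <- s + v   # one interpreter round-trip per element: slow
})

system.time(sum(x))          # single call on the whole vector: fast
```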

SLIDE 9

History of R

 Statistical programming language “S” developed at Bell Labs since 1976 (at the same time as UNIX)

 Intended to interactively support research and data analysis projects

 Exclusively licensed to Insightful (“S-Plus”)

 “R”: Open source platform similar to S

 Developed by R. Gentleman and R. Ihaka (University of Auckland, NZ) during the 1990s
 Most S-Plus programs will run on R without modification!

SLIDE 10

What R is and what it is not

 R is

 a programming language
 a statistical package
 an interpreter
 Open Source
 fast

 R is not

 a database
 a collection of “black boxes”
 a spreadsheet software package
 commercially supported
 fast

SLIDE 11

What R is

 A domain-specific language: Powerful tool for data analysis and statistics

 Data handling and storage: numeric, textual
 Powerful vector algebra, matrix algebra
 High-level data analytic and statistical functions
 Graphics, plotting

 A programming language

 Language “built to deal with numbers and statistics”
 Loops (slow!), branching, subroutines
 Hash tables (strange) and regular expressions
 Classes (“OO”)

 Has become a standard in many areas of research that involve statistics

 Many, many, many, many additional modules

 http://cran.r-project.org/
 apt-cache search r-cran

SLIDE 12

What R is not

 is not a database, but connects to DBMSs

 no spreadsheet view of data, but can import .csv files in various flavours

 has no default point-and-click user interfaces, but connects to Java, Tcl/Tk (and there’s “Rcommander”, which I haven’t tried out)

 language interpreter can be very slow, but allows to call own C/C++ code

 no professional / commercial support

SLIDE 13

R and statistics

 Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors

 Statistics: most packages deal with statistics and data analysis

 State of the art: many statistical researchers provide their methods as R packages

SLIDE 14

Installation

 To obtain and install R on your computer:

 Ubuntu, Debian, etc.:

 apt-get install r-base

 Download:

 Go to http://cran.r-project.org/mirrors.html to choose a mirror near you
 Click on your favourite operating system (Linux, Mac, or Windows)
 Download and install the “base”

SLIDE 15

Getting started

 Call R from the shell:

user@host$ R

 Leave R, go back to shell:

> q()
Save workspace image? [y/n/c]: y

SLIDE 16

R: session management

 Your R objects are stored in a workspace

 To list the objects in your workspace (may be a lot):

> ls()

 To remove objects which you don’t need any more:

> rm(weight, height, bmi)

 To remove ALL objects in your workspace:

> rm(list=ls())

 To save your workspace to a file:

> save.image()

 The default workspace file is ./.RData

 If you don't like that, then you should have called R from a different directory ...

 Or try: > setwd("../../new_directory/blabla/")

SLIDE 17

First steps: R as a calculator

Exponentiation is ^ and not **:
> 5 + (6 + 7) * pi^2
[1] 133.3049

Values up to about ±1.7 ∙ 10^308:
> 2.8e+90 * 6.3e+217
[1] 1.764e+308

By default, log() is the natural logarithm (base = e):
> log(exp(1))
[1] 1

Named arguments; default values/optional arguments:
> log(1000, base=10)
[1] 3

R is case sensitive:
> Sin(pi/3)^2 + cos(pi/3)^2
Error: couldn't find function "Sin"
> sin(pi/3)^2 + cos(pi/3)^2
[1] 1

SLIDE 18

R as a calculator and function plotter

> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> curve(sin(x)*sin(cos(x/2)) + x/12, from=0, to=4*pi)
> savePlot("/tmp/weird-function.png")

SLIDE 19

Side remark: Three ways of doing plots in R

  • Standard R way
  • plot()
  • also: hist(), curve(), boxplot(), barplot(), …
  • I’ll be talking about this way of plotting in this talk
  • [Almost] same syntax as S-Plus
  • Trellis / Lattice graphs
  • More powerful
  • … but I have no experience with it, so it’s not in this talk
  • Problems using additional packages with own plot routines
  • ggplot2 package
  • Very powerful, very user-friendly, good for explorative analysis
  • … but I have no experience with it, so it’s not in this talk
  • Not much online documentation – so far you basically need their book to use it

SLIDE 20

Help and other resources

 Starting the R installation help pages in the browser (if this doesn’t work, don’t worry, then you’ll only have them in a manpage-like way):

> help.start()

 In general:

> help(functionname)
> ?functionname

 If you don’t know the function you’re looking for:

> help.search("quantile")
> ??"quantile"

 “What’s in this variable?”

> class(variableInQuestion)
[1] "integer"
> summary(variableInQuestion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.250   8.500   9.833  13.250  19.000

 www.r-project.org
 CRAN.r-project.org: Additional packages, like www.CPAN.org for Perl

SLIDE 21

Installing additional R packages from CRAN

 To install additional packages:

 Ubuntu, Debian, etc.:
 First, try the shell command apt-cache search pkgname

 Others / if there’s no Ubuntu/Debian package:
 You might need to have C, C++, Fortran (!) compilers etc. installed on your machine
 Start R on your computer.
 At the R prompt, type:

> chooseCRANmirror()
> install.packages(c("pkg1", "pkg2"), dependencies=TRUE)

 After the package has been installed, make R use it:

> library(pkg1)

 Now you can use the new functions etc.
 If it didn’t work, try re-starting R

SLIDE 22

Remainder of this talk: Typical workflow

  • Read in the file
  • Explorative analysis
  • Some basic statistics
  • Some basic plot types
  • Save the plots
  • Advanced topics
  • More complex analyses and plots
  • More on R syntax
SLIDE 23

Very simple input

 Task: Read a file into a vector

 Input file looks like this:

1 2 17.5 99

 Read this into vector x:

x <- scan("inputfile.txt")

 There are more options  help(scan)

SLIDE 24

Some examples

> length(x)
[1] 100
> min(x)
[1] -2.4923
> max(x)
[1] 2.969

SLIDE 25

“What values does x typically have?” (1)

SLIDE 26

If there’s heavy variation in the data…

  • What’s the typical income for persons in this room?
  • Now Bill Gates walks in. What’s the typical income now?
  • Mode (“Modalwert”): The value that occurs most often
  • Of course, usually only defined for categorical or discrete values
  • And what about local maxima? Or multiple maxima?
   Use with care
  • Unfortunately, not a built-in function in R. Instead use:
> tab.x <- table(x); names(tab.x[which.max(tab.x)])
  • median(x)  the 50% quantile
  • 50% of values are greater than the median, 50% are lesser
  • Way less sensitive to outliers than the mean (cf. Bill Gates example)
  • Estimator for the mean if the x are symmetrically distributed
  • mean(x, trim=0.05)  trimmed mean
  • Disregard the 5% smallest and 5% largest values of x
  • Idea: Outliers are not representative / not important for the system
  • Dangerous! Very often, that is a wrong assumption!
SLIDE 27

“Unfortunately, this is not a built-in function”  Writing your own functions in R

> harmean <- function(x) {
+   return(length(x) / sum(x))
+ }
Fix an error using an external text editor (N.B. should be 1/x, not x):
> fix(harmean)
Syntax error […blah…] in line 2. Use x <- edit()
> fix(harmean)
Baaaad! Now all of your changes are lost! Instead you should have done what R told you:
> harmean <- edit()
Now that we’re at it, let’s also fix our geometric mean:
> geomean <- edit()
Baaaad again! Now geomean becomes a copy of whatever you last edit()ed.
 should have used fix(geomean) here

SLIDE 28

Saving your own functions in R

  • Method 1: Automatic saving

> quit()
Save workspace image? [y/n/c]:

  • If you answer “y” now, your own functions will re-appear the next time that you start R
  • …but only if you start it from the same working directory!
  • And where is that?

> getwd()
[1] "/home/blabla/research/mydata"

  • Method 2:
  • Edit an external text file with, e.g., Emacs
  • > source("my-functions.R")
  • What about the other way round? (i.e., R  text file)

> dump(list=c("fnc1", "fnc2"), file="my-fun.R")

SLIDE 29

Back to the data. What about other statistics?

  • quantile(x, probs=c(0.05, 0.1, 0.9, 0.95))
  • Values that are 5% or 10% smaller or greater than all others
  • > summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -9.38   -3.51   -0.03    0.54    3.96   11.67
  • The 25% and 75% quantiles are called the quartiles
  • Help! summary() behaves strangely and doesn’t calculate quantiles and mean!
  • Probably some non-numerical values slipped into your data
  •  R will not view the data as numbers
  • Typical culprits: “-”, “N/A”, “*”, “unknown” etc. in your file

SLIDE 30

Measuring dispersion
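The usual dispersion measures are directly available in R; a minimal sketch:

```r
x <- c(4, 7, 8, 10, 19)

var(x)     # sample variance: sum((x - mean(x))^2) / (n - 1)
sd(x)      # standard deviation: sqrt(var(x))
IQR(x)     # interquartile range: 75% quantile minus 25% quantile
range(x)   # smallest and largest value
```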

SLIDE 31

Why is there also the standard deviation if we already have the variance?

SLIDE 32

Standard deviation, standard error
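The standard error of the mean is sd(x)/sqrt(n); base R has no single built-in function for it, so a tiny helper is commonly defined (a sketch; the name `se` is my own choice):

```r
# Standard error of the mean: sd(x) / sqrt(n)
se <- function(x) sd(x) / sqrt(length(x))

x <- rnorm(100, mean=5, sd=2)
sd(x)   # dispersion of the individual samples
se(x)   # uncertainty of the estimated mean; shrinks as n grows
```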

SLIDE 33

Comparing dispersions

SLIDE 34

And if I want to compare the standard errors, not the standard deviations?

SLIDE 35

What about Bill Gates and dispersion…?

Remove the “constant=1” to get an estimate for the std.dev.

(then you don’t have to multiply with 0.6745)
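In R, the median absolute deviation (MAD) behaves exactly as described; a minimal sketch of the Bill Gates example (the income values are made up for illustration):

```r
income <- c(30e3, 35e3, 40e3, 42e3, 45e3)   # incomes in the room
income.bg <- c(income, 1e9)                 # now Bill Gates walks in

sd(income);  sd(income.bg)    # explodes once the outlier appears
mad(income); mad(income.bg)   # barely changes

mad(income, constant=1)       # raw median(|x - median(x)|)
mad(income)                   # default constant 1.4826 = 1/0.6745,
                              # i.e. an estimate of the std.dev.
```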

SLIDE 36

Robust statistics

  • Classical statistics: mean, variance, …
  • Very sensitive to outliers
  • More exact if there are no outliers
  • Readers (and paper reviewers!) are accustomed to it
  • Robust statistics: median, MAD, …
  • Not sensitive to outliers
  • Less exact: One single datapoint represents the entire data set!
  • Readers and reviewers may not know them

(“The median, ah, yes, what’s that again… and WTF is MAD?”)

  • When in doubt, use robust statistics
  • But that’s just my personal opinion!
SLIDE 37
  • Enough. Let’s plot!
  • Obvious approach: Scatter plot

> plot(x)
> plot(x, type="l")

SLIDE 38

Plotting (2)

Combination of type="p" and type="l":
> plot(x, type="b")

SLIDE 39

But does this plot type actually reveal what we want to see?

  • Always ask this question!
  • So, are you really interested in the temporal development? Apparently, the values fluctuate randomly and are not really correlated…

  • Better alternatives:
  • Box–whisker plot
  • Histogram plot
  • Density plot
  • CDF plot
  • CCDF plot
  • Scatterplot against other data
  • QQ plot against postulated theoretical distribution
  • QQ plot against other data
SLIDE 40

What you should not plot

  • A real researcher never, ever uses pie charts!
  • There’s a good reason why R has no method for drawing them
  • More on this in the slideset “How to lie with statistics”
  • Also, you should try avoiding 3D charts at all costs
  • 3D plot is plotted on 2D paper or 2D screen  Often very difficult to make out the 3D structure
  • Much more difficult to understand (“so what does it actually mean if a data point is in the upper left rear corner?”)
  • Only use 3D plots as a very last resort
  • Always remember:
  • You don’t make the plot just for you
  • You don’t make the plot to impress the reader
  • You make the plot to help the reader (and the reviewer!) understand your point

SLIDE 41

Box–whisker plot

Tersely shows the distribution of your data:
> boxplot(x)

Labels in the plot: 75% quantile, 25% quantile, 50% quantile (i.e., median), outlier points.

Boundaries calculated from quantile ranges (last data points that are still within median ± 1.5 ∙ IQR)

SLIDE 42

Histogram plot: This is what you usually want

> hist(x)

> hist(x, breaks=20)

SLIDE 43

Density plots (PDF)

  • “A histogram with an infinite number of (interpolated) bars”
  • Interpolation using a density kernel function

> plot(density(x))

SLIDE 44

Cumulative distribution function plots (CDF)

  • Basically, it’s just the integral of the density function
  • But without the interpolation part
  • So it’s actually a sum, not an integral

> plot(ecdf(x))

  • Why CDF instead of density or histogram?
  • Can immediately see quantiles (e.g., the 20% quantile)
  • More expressive if points spread out very wide

SLIDE 45

Logarithmic axes

  • Use logarithmic scaling if the data is very asymmetric
  • Always a good idea to play around with axis scaling!

> plot(ecdf(x), log="x")
Error: logarithmic axis must have positive boundaries
> plot(ecdf(x), log="x", xlim=range(x))

SLIDE 46

atanh, asinh axes (Ralph: “The Kammenhuber scale” – but I was not the first to discover it ;-))

  • Properties of logarithmic scaling
  • Visual magnification towards 0
  • Visual compression towards +∞
  • Cannot display the value 0.0 or negative values
  • But what if we wanted logarithm-like scaling for negative and for positive values in one plot?
  • Two types of hyperbolic scaling
  • asinh axis scaling:
  • Visual compression towards −∞
  • Visual magnification towards 0
  • Visual compression towards +∞
  • [shifted and scaled] atanh axis scaling:
  • Visual magnification towards 0.0
  • Visual compression around 0.5
  • Visual magnification towards 1.0
SLIDE 47

Linear vs. asinh X axis

  • Linear scale hides the fact that we have two peaks!
  • N.B. Interval boundaries are different
SLIDE 48

Linear vs. atanh Y axis

  • Linear scale hides the fact that the temporal development towards larger key lengths also took place at very short and very large key lengths, not only for popular key lengths
SLIDE 49

Why does it work?

asinh scale vs. atanh scale (shifted)

SLIDE 50

How to do it in R

  • Rather tricky!
  • R has no builtin support for this
  • Need to manually fiddle around with it
  • Basic idea for asinh-scaled X axis:

> plot(x=asinh(data), y=…, xaxt="n", …)
> axis(side=1, at=asinh(c(-100, -10, -1, 0, 1, 10, 100)),
+      labels=c(-100, -10, -1, 0, 1, 10, 100))

  • Basic idea for atanh-scaled Y axis:

> atanhS <- function(x) atanh(2*x - 1)
> plot(x=…, y=atanhS(data), yaxt="n", …)
> axis(side=2, at=atanhS(c(0.01, 0.1, 0.5, 0.9, 0.99)),
+      labels=c(0.01, 0.1, 0.5, 0.9, 0.99))

  • Cumbersome  I wrote some R scripts that are about to be released

SLIDE 51

Nomenclature

  • Apparently, these axis scales have been known for

decades or centuries

  • Apparently, no consistent nomenclature for them
  • I call them “two-ended pseudo-logarithmic scale”,

since that describes what they look like

  • Hard to understand if you explain it purely orally:
  • ‘But hey, a double-logarithmic plot is easy to do in R, just say plot(…, log="xy")’
  • That’s true, but a double-logarithmic plot is a plot with logarithmic X and logarithmic Y axis – that’s an entirely different thing!

SLIDE 52

Further information

  • Nils Kammenhuber: Two-ended pseudo-logarithmic scaling. Technical Report, NET 2012, to appear

SLIDE 53

Combining plots

  • Now I have multiple measurements that I want to compare in one single plot

> plot(first.data, type="l", col="black", lty=1)
> lines(second.data, col="blue", lty=2)

  • Or with points instead of lines:

> plot(first.data, type="p", col="black", pch=1)
> points(second.data, col="blue", pch=2)

SLIDE 54

Adjusting axis limits

  • Problem: The axis limits are not readjusted after the first call to plot()

> plot(first.data, type="l", col="black", lty=1,
+      xlim=range(c(first.data, second.data)),
+      ylim=range(c(first.data, second.data)))
> lines(second.data, col="blue", lty=2)

  • Alternatively, can manually specify axis limits, e.g., for probabilities:

> plot(first.data, …, xlim=range(c(first.data, second.data)), ylim=c(0.0, 1.0))

SLIDE 55

“I just ran the experiment for half an hour and I feel that I have enough data now.”

  • Maybe… maybe not.
  • The readers expect error bars or something like that
  • How do I get error bars?
  • Some researchers just use arith. mean ± 2 standard errors (or ±3)
  • About 95.5% of all data falls within ±2σ of the “real” mean
  • About 99.7% of all data falls within ±3σ of the “real” mean
  • … but only if the data is actually normally distributed!
  • Better: calculate a confidence interval on the mean of your data
  • Semantics: “With a probability of 99% (confidence level; adjustable), the output of the process that generated this data has an arithmetic mean that lies within this interval”
  • Higher confidence  interval gets larger
  • Higher variability  interval gets larger
  • More data points  interval gets smaller
SLIDE 56

How to calculate confidence intervals
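A sketch of the calculation in R: the built-in t.test() reports the confidence interval for the mean directly, and the same interval can be computed by hand from the t quantile (conf.level is adjustable, as described on the previous slide):

```r
x <- rnorm(50, mean=10, sd=2)

# 99% confidence interval for the mean, via the built-in t test
t.test(x, conf.level=0.99)$conf.int

# The same interval by hand: mean ± t-quantile * standard error
n <- length(x)
mean(x) + qt(c(0.005, 0.995), df=n-1) * sd(x)/sqrt(n)
```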

SLIDE 57

Confidence intervals: Be careful!

  • 1. Your data should be (roughly) normally distributed
  • …at least, not very asymmetric
  • Check it  QQ plots, and maybe statistical tests
  • If not the case  group the data and apply the Central Limit Theorem
  • 2. Your data should be independently distributed
  • Often not the case!
  • Check it  (1) Plot the data and look for a trend, (2) ACF plots
  • If not the case  group the data into larger groups and check ACF again
  • 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
  • More on this later
SLIDE 58

“Here, the error roughly looks like a Gauß curve, so it must be normally distributed”

  • How do I prove that my data is normally distributed?
  • Naïve approach:
  • Plot histogram and density of your data
  • Plot the density function of the normal distribution into the same diagram
  • “Here, they look the same”

> hist(mydata, freq=FALSE, xlim=c(0.0, 0.3))
> curve(dnorm(x, mean=mean(mydata), sd=sd(mydata)),
+       col="blue", lty=2, add=TRUE)
> legend(… more about this later …)

It looks normally distributed, but this data actually comes from the sum of a Student-t and an exponential distribution…
SLIDE 59

What’s wrong with that approach

  • 1. You can never “prove” that data is normally (or exponentially, t, Pareto, …) distributed
  • You only can present graphs, calculations, statistical tests etc. that do not contradict that assumption
  • 2. The eye is easily fooled – many symmetrical distributions look like Gauß’ bell curve
  • 3. There are way better ways to analyse this:
  • QQ plots
  • PP plots (not shown)
  • Statistical tests: χ² test, Kolmogorov–Smirnov test, … (not shown – they usually are “too picky” ;-)

SLIDE 60

Digression #1: Theoretical probability distributions in R

  • A lot of distributions are built into R
  • Naming convention: For every probability distribution xy, there are the following four functions
  • dxy() – the density function (I used that in the histogram)
  • pxy() – the probability function (≈ cumulative distribution) (you could use that for comparison in a CDF plot of real data)
  • qxy() – the quantile function (useful for QQ plots and some statistical tests)
  • rxy() – draw random numbers from that distribution
  • Example:
  • rt(50, df=3) + rexp(50) - rnorm(50, sd=3): Create 50 random variables from a combination of t, exponential and normal distribution
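For the normal distribution, the naming convention gives dnorm(), pnorm(), qnorm() and rnorm(); a minimal sketch:

```r
dnorm(0)                 # density of N(0,1) at 0: 1/sqrt(2*pi) ≈ 0.3989
pnorm(1.96)              # P(X <= 1.96) ≈ 0.975
qnorm(0.975)             # inverse of pnorm: ≈ 1.96
rnorm(5, mean=0, sd=1)   # five random draws from N(0,1)
```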

SLIDE 61

Digression #2: Random number generators
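One basic point about R’s random number generator (a Mersenne Twister by default) that is easy to demonstrate: the sequence is reproducible if you set the seed. A minimal sketch:

```r
set.seed(42)          # fix the RNG state for reproducibility
a <- runif(3)
set.seed(42)
b <- runif(3)
identical(a, b)       # TRUE: same seed, same sequence

RNGkind()             # shows the generators in use ("Mersenne-Twister", …)
```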

SLIDE 62

Random number generators: Literature

  • I–8 Discrete Event Simulation slideset on random number generation
  • L’Ecuyer, Simard: TestU01: a C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, Volume 33, No. 4, 2007

SLIDE 63

Better alternative instead of comparing histogram vs. density curve: QQ plots

  • Short for quantile–quantile plot
  • Compare the quantiles of two distributions
  • Imagine you have two distributions (e.g., #1 – empirical observations; #2 – postulated theoretical distribution, e.g., normal distribution)
  • Calculate the 1%, 2%, … 100% quantile for each distribution
  • Take the quantiles of the 1st distribution as Y coordinates and the quantiles of the 2nd distribution as X coordinates
  • How to read them:
  • A straight (i.e., unbent) line means that the two distributions are similar

“In a sense, the only function that the human mind really understands well is a straight line. One good strategy for making a graph easy to understand is to make it as linear as possible.” – John P. Boyd

  • Be careful about the ends
  • Also note the axis labels – they might be shifted
SLIDE 64

QQ plot example

Histogram comparison: looks ~normal
QQ plot: visible differences – not a straight line

SLIDE 65

How to do a QQ plot in R

  • QQ plot against the normal distribution:

> qqnorm(mydata)
> qqline(mydata)  # auxiliary straight line

  • QQ plot against some other distribution xy:

> qqplot(qxy(ppoints(data), …), data)

  • Nicer:

> qqplot(qxy(ppoints(data), …), data,
+        xlab="Theoretical quantiles", ylab="Sample quantiles",
+        main="Comparison with xy distribution")

  • QQ plots can be used to compare different data sets:

> qqplot(data.internet, data.mysimulation)

SLIDE 66

QQ plots with type="l" or type="b"

> qqnorm(data, type="b")

  • Huh!??

> qqnorm(sort(data), type="b")

SLIDE 67

Be careful when reading QQ plots!

Don’t let yourselves be fooled! DAX revenues vs. normal distribution:
Good model for normal trading days
Bad model for exceptional trading days

SLIDE 68

Confidence intervals: Be careful!

  • 1. Your data should be (roughly) normally distributed
  • …at least, not very asymmetric
  • Check it  QQ plots, and maybe statistical tests
  • If not the case  group the data and apply the Central Limit Theorem
  • 2. Your data should be independently distributed
  • Often not the case!
  • Check it  (1) Plot the data and look for a trend, (2) ACF plots
  • If not the case  group the data into larger groups and check ACF again
  • 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
  • More on this later
SLIDE 69

Statistical tests

  • If you really want to be correct, then use a statistical test once you’re satisfied with the QQ plot
  • Millions of statistical tests that can be used to check “if the data follows some given theoretical distribution” (e.g., Gauß’ normal distribution)
  • χ² test: Very universal (i.e., can be used with just about any distribution), but less powerful (i.e., needs more data)
  • Kolmogorov–Smirnov test (only for a number of popular distributions), more powerful (also works with less data)
  • Anderson–Darling test, …
  • Most of them are built into R
  • Check the literature to see how to use them
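As a sketch of two such tests in R (both in the base stats package; note that estimating the mean and sd from the same data makes the KS p-value optimistic):

```r
x <- rnorm(200, mean=5, sd=2)

# Kolmogorov–Smirnov test against a normal distribution.
# N.B. plugging in mean(x) and sd(x) estimated from the same
# data makes the p-value only approximate.
ks.test(x, "pnorm", mean(x), sd(x))

# Shapiro–Wilk normality test, also built in
shapiro.test(x)
```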
SLIDE 70

Statistical tests: General observations

  • A statistical test never can prove that your data actually follows some distribution XY!
  • Rather, it checks if it finds evidence against (!) this assumption.
  • If no such evidence found: “acquitted due to lack of evidence”, i.e., “could be XY distributed” (but not: “acquitted due to obvious innocence”!)
  • If such evidence found: “seems guilty from my point of view”, i.e., “to me, it doesn’t look XY distributed” – but in fact you might simply not have collected enough data yet, or you should have used a more powerful test:
  • The more restricted your test (the fewer distributions you can use it for etc.), the better its power
  • As with confidence intervals, you select a confidence level
  • Semantics: “With α = 5% error rate, no evidence could be found that contradicts the assumption that the data is XY-distributed”
  • In fact, confidence intervals are a special way to formulate a Student-t test

SLIDE 71

Confidence intervals: Be careful!

  • 1. Your data should be (roughly) normally distributed
  • …at least, not very asymmetric
  • Check it  QQ plots, and maybe statistical tests
  • If not the case  group the data and apply the Central Limit Theorem
  • 2. Your data should be independently distributed
  • Often not the case!
  • Check it  (1) Plot the data and look for a trend, (2) ACF plots
  • If not the case  group the data into larger groups and check ACF again
  • 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
  • More on this later
SLIDE 72

Forcing our data to be normally distributed: Making use of the Central Limit Theorem

  • Often, your data is not normally distributed
  • Central Limit Theorem, pragmatic version “for dummies”:
  • You have a large number of samples that do not at all look normally distributed
  • Group them together into chunks of about 30–100 samples
  • Use groups of 30 if the data looks symmetric and roughly similar to a Gauß bell curve
  • Use groups of 100 if the data looks very asymmetric
  • For each group, calculate the arithmetic mean of that group
  • Now a miracle happens: The group means [roughly] follow a normal distribution!
  • Try it out in R yourselves!
  • Now you can use the group means as input for your statistical tool that expects normally distributed data (e.g., confidence intervals)
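The grouping recipe above can be sketched in R like this (group size 30 chosen for illustration; exponential data as a clearly non-normal example):

```r
x <- rexp(3000)                          # clearly non-normal raw data

groups <- split(x, rep(1:100, each=30))  # chunks of 30 samples each
group.means <- sapply(groups, mean)      # one arithmetic mean per group

hist(group.means)                        # looks much more bell-shaped
qqnorm(group.means); qqline(group.means) # close to a straight line
```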

SLIDE 73

Confidence intervals: Be careful!

  • 1. Your data should be (roughly) normally distributed
  • …at least, not very asymmetric
  • Check it  QQ plots, and maybe statistical tests
  • If not the case  group the data and apply the Central Limit Theorem
  • 2. Your data should be independently distributed
  • Often not the case!
  • Check it  (1) Plot the data and look for a trend, (2) ACF plots
  • If not the case  group the data into larger groups and check ACF again
  • 3. The Central Limit Theorem only works if the system that creates your samples has finite variance

SLIDE 74

Central limit theorem: limits

  • The central limit theorem only holds if the process that creates your data has finite variance!
  • Of course, the sample variance is always finite: sd(mydata) or var(mydata) will always give a finite value (…unless mydata contains a +Inf value)
  • But the process that created your samples can have an infinite variance
  • “Infinite variance? Come on, that’s an esoteric special case!”
  • No, it’s not – it can easily happen with power law distributions…
  • …and power law distributions are ubiquitous!
  • More on this on later slides
SLIDE 75

Confidence intervals: Be careful!

  • 1. Your data should be (roughly) normally distributed
  • …at least, not very asymmetric
  • Check it  QQ plots, and maybe statistical tests
  • If not the case  group the data and apply the Central Limit Theorem
  • 2. Your data should be independently distributed
  • Often not the case!
  • Check it  (1) Plot the data and look for a trend, (2) ACF plots
  • If not the case  group the data into larger groups and check ACF again
  • 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
  • More on this later
SLIDE 76

Data must not contain significant trends

  • Simply do plot(data) and visually check if there’s a trend
  • e.g., values seem to grow over time…
  • …or seem to decay…
  • …or you see a curved shape (e.g., first upwards, then downwards)
  • …etc.
  • AFAIK no really good way to check all these with a statistical test
  • Even if you can’t make out a trend, you’ll have to check independence using an ACF plot

slide-77
SLIDE 77

ACF plots: Are your samples really independent?

  • Consider the temporal development of the queue length at a router:
  • If large now → cannot change much in the next milliseconds
  • If small now → unlikely that it changes dramatically
  • If your data shows such autocorrelation
  → confidence intervals will be too small → if you don’t check this, you’re a cheater!
  • Check autocorrelation using ACF plots (cf. next slides)
  • If the plots suggest unacceptable autocorrelation, then…
  • Group the samples (as with the central limit theorem)
  • If they are grouped already, make the groups larger
  • “Should I include these ACF plots into my paper?”
  • No – just say that “using ACF plots, we ensured that our samples (or sample means) were sufficiently uncorrelated”

slide-78
SLIDE 78

Autocorrelation

slide-79
SLIDE 79

How to do and how to read ACF (autocorrelation function) plots

  • How to do them in R:

> acf(mydata)

  • What does it show:
  • Calculate the autocorrelation for lag = 0, 1, 2, 3, …
  • Show boundaries for “random” autocorrelation
  • “Random autocorrelation”: an autocorrelation within these boundaries is very likely to be caused by random fluctuations, not by an actual systemic autocorrelation
  • How to read it:
  • GOOD: If all bars (except the first one – the lag=0 bar is always 1, which is IMO rather silly, but R plots it) stay within these boundaries, then we can be happy and assume the sample does not contain unacceptably large autocorrelation
  • BAD (i.e., we have an indication of an unacceptable level of autocorrelation): If further bars at the beginning cross the boundaries (erratic single bars further towards the right might be okay)
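A quick way to see the difference between a good and a bad ACF plot is to compare independent samples with an autocorrelated series. This is a sketch; the AR(1) coefficient 0.9 is just an arbitrary example of strong autocorrelation:

```r
# Compare the ACF of independent samples with that of an autocorrelated series.
set.seed(1)
iid <- rnorm(500)                          # independent samples
ar1 <- arima.sim(list(ar = 0.9), n = 500)  # strongly autocorrelated AR(1) series

par(mfrow = c(1, 2))
acf(iid, main = "good: bars stay within the boundaries")
acf(ar1, main = "bad: bars cross the boundaries")
par(mfrow = c(1, 1))
```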

slide-80
SLIDE 80

The data


slide-81
SLIDE 81

Good ACF plot

slide-82
SLIDE 82

The data


slide-83
SLIDE 83

Bad ACF plot


With a grain of salt, we might tolerate this… … but not this

slide-84
SLIDE 84

The data

T extmasterformat bearbeiten

Zweite Ebene Dritte Ebene

Vierte Ebene

Fünfte Ebene

slide-85
SLIDE 85

Terrible ACF plot


slide-86
SLIDE 86

This number shuffling is not only good for confidence intervals…

  • Many statistical tools require that your data be
  • uncorrelated (“iid”)
  • and some require it to be normally distributed
  • You can apply the same tools in these cases (ACF plots, grouping, QQ plots, central limit theorem, etc.)

slide-87
SLIDE 87

Digression: How to present a plot in a presentation

I am a bad example, since I don’t adhere to these steps in this talk, but…:

1. What does this plot talk about? Why do you show it?
   “This plot compares the development of the gnaricrosticity in relation to the bullawomptleness and will show this paper is great.”
2. Explain the axes. (People tend to forget this, but it’s important!)
  • “The X axis displays the bullawomptleness in square Hertz per packet”
  • “The Y axis displays the gnaricrosticity in cubic degrees per nanobyte”
3. How do you read your plot? (Don’t interpret the data points yet!)
  • “Data points in the upper right corner are good, because blah.”
  • “Data points in the lower right corner are unfavourable, ….”
4. Why are there multiple lines/colours/… in your plot?
   “The blue dots represent measurements in Munich, the red dots represent those in Shanghai.”
5. Only now can you start interpreting what the plot shows:
   “As you can see, we only have data points in the upper left corner, and only one in the lower right. Therefore I deserve the Fields medal.”
slide-88
SLIDE 88

CCDF plots

  • CDF plot: shows the cumulative probability on the Y axis
  • CCDF plot: shows 1 − cumulative probability on the Y axis
  • What’s the point?
  • If we use a logarithmic Y axis, it makes a huge difference whether we do a CDF plot or a CCDF plot
  • Almost always plotted with logscale X and logscale Y axes
  • Can be used to spot Power Laws (cf. later slides)
  • Alternative names: Survival plot, survival function plot, …
  • How to do them in R?
  • Unfortunately, not a built-in function (maybe I shouldn’t have mentioned this fact about the pie charts…)
  • Use my crude plot.ccdf.R
  • or download Fabian Schneider’s plot.ccdf.R
  • or try to find a package at CRAN
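In the absence of a built-in, a minimal CCDF plotting function might look like this. It is a crude sketch, not the plot.ccdf.R scripts mentioned above:

```r
# Plot the empirical CCDF P(X > x) on doubly logarithmic axes.
ccdf.plot <- function(x, ...) {
  x <- sort(x)
  p <- 1 - (seq_along(x) - 1) / length(x)  # P(X >= x_(i)), stays > 0
  plot(x, p, type = "s", log = "xy",
       xlab = "x", ylab = "P(X > x)", ...)
}

ccdf.plot(rexp(1000) + 1)  # +1 keeps all values positive for the log X axis
```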
slide-89
SLIDE 89

A CDF plot with linear scale

slide-90
SLIDE 90

CDF plot, logarithmic X axis (same data)

slide-91
SLIDE 91

CDF plot, log X and Y axes (same data)

slide-92
SLIDE 92

CCDF plot, log X and Y axes (same data!!)

slide-93
SLIDE 93

Why should we care about Power Laws?

  • Not everything out there is normally or exponentially distributed
  • Actually, Power Laws and similar distributions are rather widespread (more on this later)
  • Surprising and nasty properties! (see next slides)
  • En vogue in the scientific community since about the 1980s, 1990s (depending on subject)

Every time you see something resembling a straight line on a plot with doubly logarithmic axes, you are expected to instantly yell “A power law! A power law!”, regardless of whether that actually makes sense. The more log-log-straight lines in your paper, the merrier.

slide-94
SLIDE 94

Power Laws: Surprising properties

  • Depending on the exponent parameter, a power-law distribution may have…
  • … infinite variance → very high variability
  • Cannot just measure, e.g., 100 values, take their mean, and assume that the sample mean is a good estimator for the actual mean of the generating process – and the median is only a good estimator for the mean of a symmetric distribution!
  • The Central Limit Theorem will not work → cannot simply group a couple of values and trust they will be normally distributed around their mean
  • … or even an infinite mean!
  • Warning: Of course, the sample variance and sample mean are always defined! (unless you have Inf or NaN values in your data)
  • Other surprising properties (more on this later)
  • Self-similarity, fractal character
  • Long-range dependence
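The infinite-variance effect is easy to see in a simulation. This sketch draws Pareto samples with α = 1.5 (finite mean α/(α−1) = 3, but infinite variance) via inverse-transform sampling and watches the running sample mean stumble:

```r
set.seed(7)
n <- 1e5
pareto <- runif(n)^(-1 / 1.5)            # Pareto(alpha = 1.5), x_min = 1
running.mean <- cumsum(pareto) / seq_len(n)

# The running mean jumps every time a huge sample arrives,
# instead of settling down quickly as it would with finite variance:
plot(running.mean, type = "l", log = "x",
     xlab = "number of samples", ylab = "running mean")
abline(h = 3, lty = 2)                   # true mean alpha/(alpha-1) = 3
```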
slide-95
SLIDE 95

Nomenclature confusion

slide-96
SLIDE 96

Some notions in the power law context

  • Used almost interchangeably:
  • Power Law, Power Tail, Long Tail, Fat Tail, Heavy Tail
  • Pareto distribution
  • Zipf’s Law
  • 80 : 20 rule, Pareto principle
  • (Pseudo|Asymptotic|Stochastic) Self-similarity (of 1st, 2nd degree)
  • Fractal
  • Long-Range Dependency (LRD)
  • Slowly decaying variance
  • 1/f noise
  • Hurst parameter > 0.5
  • …but of course most of them do not exactly describe the same systems – there sometimes are subtle differences!

N.B. We won’t describe them with much mathematical rigour…

slide-97
SLIDE 97

Self-similarity, fractals

Scale-free distribution: Not just a curiosity, but generates samples that are self-similar or fractal, i.e., more or less look the same on all kinds of scales:

Real world / Power law traffic: Variability remains the same across many time scales. Unrealistic / Non-Power-Law traffic: Variability is quickly averaged out on larger time scales

slide-98
SLIDE 98

80 : 20 rule (a.k.a. “Pareto principle”)

 The fraction W of the wealth in the hands of the richest P of the population is given by

W = P^((α−2)/(α−1))

 Example: US wealth: α = 2.1
 The richest 20% of the population holds 86% of the wealth
 Of course, not only restricted to wealth, but a fundamental property of Power-Law-distributed variables
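The slide’s numbers are easy to reproduce in R (wealth.share is a made-up helper name, not a built-in):

```r
# W = P^((alpha - 2) / (alpha - 1)): wealth share of the richest fraction P
wealth.share <- function(P, alpha) P^((alpha - 2) / (alpha - 1))

wealth.share(0.20, 2.1)  # about 0.86: the richest 20% hold ~86% of the wealth
```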

slide-99
SLIDE 99

Power laws are seemingly everywhere (1)

[Figure panels: Moby Dick · scientific papers 1981–1997 · AOL users visiting sites ’97 · bestsellers 1895–1965 · AT&T customers on 1 day · California 1910–1992]

slide-100
SLIDE 100

Power Laws are seemingly everywhere (2)

[Figure panels: Moon · Solar flares · wars (1816–1980) · richest individuals 2003 · US family names 1990 · US cities 2003]

slide-101
SLIDE 101

Power Laws that “fall off at the end”

  • Most (all?) real-world data faces some fundamental limit:
  • Total disk space world wide
  • Maximum energy the earth’s crust can store, etc.
  • More realistic: Truncated/Tamed Pareto distribution
  • Data follows a Power Law for several magnitudes (straight line on CCDF). Then, an exponential part kicks in (tail falls down in CCDF):

slide-102
SLIDE 102

Are there Power Laws in all those data sets? Or limited Power Laws with exponential tail?

  • A lot of ongoing arguments in the scientific community
  • Popular alternative explanations:
  • Truncated/Tamed Pareto distribution (we just saw that)
  • Lognormal distribution (cf. Gong et al. 2005, Mitzenmacher 2004)
  • Combined distributions: Head is lognormal, tail is Pareto / Power-Law [and the last portion of the tail may have a fall-off]
  • Double-Pareto distribution (cf. Mitzenmacher 2004)
  • Weibull distribution
  • Hyperexponential distribution
slide-103
SLIDE 103

More on Power Laws

  • …in my lecture (“Analyse von Systemperformanz”) in 03.07.023 @ 14:00h

slide-104
SLIDE 104

Extreme value theory

Just to mention it:

  • Suppose you are interested in the maximum of a sample
  • e.g., maximum water height
  • The longer the sample, the higher the maximum
  • Actually, the maximum is a random variable itself
  • How is this maximum distributed?
  • In practice: At what probability can we expect a +10m water level during 1,000 years?
  • Extreme value theory: Regardless of the underlying distribution, there are only three ways how the maxima can be distributed.
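A small simulation illustrates that the maximum is itself a random variable. All numbers here (1000 “years”, normal daily levels around 5) are made up purely for illustration:

```r
# 1000 "years", each with 365 daily water levels; look at the yearly maxima.
set.seed(3)
daily <- matrix(rnorm(1000 * 365, mean = 5, sd = 1), nrow = 365)
yearly.max <- apply(daily, 2, max)   # one maximum per "year"

hist(yearly.max, breaks = 30)        # skewed, not normal
mean(yearly.max > 8)                 # empirical probability of exceeding level 8
```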

slide-105
SLIDE 105

R syntax peculiarities

slide-106
SLIDE 106

More complicated: Reading / writing tables

 Write a table into a file:

> x <- rnorm(100, 1, 1)
> write.table(x, file="numbers.txt")
# There are more options => help(write.table)

 Read a table from a file:

> x <- read.table("in.txt", header=FALSE)
# There are more options => help(read.table)

 Read a table from the Web:

> x <- read.table("http://www.net.in.tum.de/ …"…)

 Read a table with pre-processing from a pipe (use Perl to print the 2nd column, but only if the 1st column is >0):

> x <- read.table(pipe("perl -nae 'if($F[0] > 0){ print $F[1]; }'"))

slide-107
SLIDE 107

Universal: Using file handles

 File handles are about as universal as in Perl
 Write two lines into a file:

> fh <- file("output.txt", "w")   # w = write
> cat("blah", "blubb", sep="\n", file=fh)
> close(fh)

 Write into a file and compress it using gzip:

> fh <- gzfile("output.txt.gz", "w")
> cat("blah blah blah", … , file=fh)

 More examples: help(file)
 Also try “filenames” like http://www.blabla.bla/data.gz

slide-108
SLIDE 108

Basic data types

slide-109
SLIDE 109

Objects

 Containers that contain data
 Types of objects: vector, factor, array, matrix, dataframe, list, function
 Attributes
 mode: numeric, character (=string!), complex, logical
 length: number of elements in object
 Creation
 assign a value
 create a blank object

slide-110
SLIDE 110

Identifiers (object names)

 must start with a letter (A-Z or a-z)
 can contain letters, digits (0-9), periods (“.”)
 Periods have no special meaning (i.e., unlike C or Java!)
 case-sensitive: e.g., mydata different from MyData
 Do not use the underscore “_”!
 People mostly use “.” instead of “_”

slide-111
SLIDE 111

Assignment

 “<-” used to indicate assignment

x <- 4711
x <- "hello world!"
x <- c(1,2,3,4,5,6,7)
x <- c(1:7)
x <- 1:4

 note: as of version 1.4, “=” is also a valid assignment operator, but its use is discouraged

slide-112
SLIDE 112

Basic (atomic) data types

 Logical (may be abbreviated to T and F)
> x <- TRUE; y <- F
> x; y
[1] TRUE
[1] FALSE

 Numerical
> a <- 5; b <- sqrt(2)
> a; b
[1] 5
[1] 1.414214

 Strings (called “characters”!)
> a <- "1"; b <- 1
> a; b
[1] "1"
[1] 1
> a <- "string"
> b <- "a"; c <- a
> a; b; c
[1] "string"
[1] "a"
[1] "string"

slide-113
SLIDE 113

But there is more!

R can handle “big chunks of numbers” in elegant ways:

 Vector
 Ordered collection of data of the same data type
 Examples: download timestamps; last names of all students in lecture
 In R, a single number is a vector of length 1
 Matrix
 Rectangular table of data of the same data type
 Example: Round-trip times measured at different times from different vantage points
 Array
 Higher-dimensional matrix of data of the same data type
 (Lists, data frames, factors, function objects, … → later)

slide-114
SLIDE 114

Vectors

> Mydata <- c(2,3.5,-0.2)                # Vector (c = “concatenate”)
> colours <- c("Black","Red","Yellow")   # String vector
> x1 <- 25:30                            # Number sequence
> x1
[1] 25 26 27 28 29 30
> colours[1]                             # Addressing one element…
[1] "Black"        # In R, the indices start with 1, not with 0!
> x1[3:5]                                # …and multiple elements
[1] 27 28 29

slide-115
SLIDE 115

Vectors (continued)

 More examples with vectors:

> x <- c(5.2, 1.7, 6.3)
> log(x)
[1] 1.6486586 0.5306283 1.8405496
> y <- 1:5
> z <- seq(1, 1.4, by = 0.1)
> y + z
[1] 2.0 3.1 4.2 5.3 6.4
> length(y)
[1] 5
> mean(y + z)
[1] 4.2

slide-116
SLIDE 116

Subsetting

 Often necessary to extract a subset of a vector or matrix
 R offers a couple of neat ways to do that:

> x <- c("a","b","c","d","e","f","g","a")
> x[1]                   # first (!) element
> x[3:5]                 # elements 3..5
> x[-(3:5)]              # everything except elements 3..5
> x[c(T,F,T,F,T,F,T,F)]  # odd-index elements
> x[x <= "d"]            # elements "a"..."d", "a"

slide-117
SLIDE 117

Typical operations on vector elements

 Test on the elements → yields a new boolean vector
 Extract the positive elements
 Remove the given elements

> Mydata
[1]  2.0  3.5 -0.2
> Mydata > 0
[1]  TRUE  TRUE FALSE
> Mydata[Mydata>0]
[1] 2.0 3.5
> Mydata[-c(1,3)]
[1] 3.5

slide-118
SLIDE 118

More vector operations

> x <- c(5,-2,3,-7)
> y <- c(1,2,3,4)*10    # multiplication on all the elements
> y
[1] 10 20 30 40
> sort(x)               # sorting a vector
[1] -7 -2  3  5
> order(x)              # element order for sorting
[1] 4 2 3 1
> y[order(x)]           # operation on all the components
[1] 40 20 30 10
> rev(x)                # reverse a vector
[1] -7  3 -2  5

slide-119
SLIDE 119

Matrices

 Matrix: Rectangular table of data of the same type

> m <- matrix(1:12, 4, byrow = T); m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
> y <- -1:2
> m.new <- m + y
> t(m.new)
     [,1] [,2] [,3] [,4]
[1,]    0    4    8   12
[2,]    1    5    9   13
[3,]    2    6   10   14
> dim(m)
[1] 4 3
> dim(t(m.new))
[1] 3 4

slide-120
SLIDE 120

Matrices

Matrix: Rectangular table of data of the same type

> x <- c(3,-1,2,0,-3,6)
> x.mat <- matrix(x, ncol=2)            # matrix with 2 cols
> x.mat
     [,1] [,2]
[1,]    3    0
[2,]   -1   -3
[3,]    2    6
> x.matB <- matrix(x, ncol=2, byrow=T)  # by-row creation
> x.matB
     [,1] [,2]
[1,]    3   -1
[2,]    2    0
[3,]   -3    6

slide-121
SLIDE 121

Building subvectors and submatrices

> x.matB[,2]        # 2nd column
[1] -1  0  6
> x.matB[c(1,3),]   # 1st and 3rd rows
     [,1] [,2]
[1,]    3   -1
[2,]   -3    6
> x.mat[-2,]        # everything but the 2nd row
     [,1] [,2]
[1,]    3    0
[2,]    2    6

slide-122
SLIDE 122

Dealing with matrices

> dim(x.matB)            # dimension (i.e., size)
[1] 3 2
> t(x.matB)              # transposition
     [,1] [,2] [,3]
[1,]    3    2   -3
[2,]   -1    0    6
> x.matB %*% t(x.matB)   # matrix multiplication; also see %o%
     [,1] [,2] [,3]
[1,]   10    6  -15
[2,]    6    4   -6
[3,]  -15   -6   45
> solve()                # inverse of a square matrix
> eigen()                # eigenvectors and eigenvalues

slide-123
SLIDE 123

Special values (1/3)

 R is designed to handle statistical data
 → Has to deal with missing / undefined / special values
 Multiple kinds of missing values:
 NA: not available (i.e., this data is simply missing)
 NaN: not a number (i.e., undefined, e.g., 0/0 or log(−1))
 Inf, -Inf: infinity (e.g., 1/0 or log(0))
 NULL
 NaN ≠ Inf ≠ NA ≠ NULL ≠ FALSE ≠ “” ≠ 0 (pairwise)
 NA also may appear as a Boolean value!
 I.e., a boolean value in R ∈ {TRUE, FALSE, NA}

slide-124
SLIDE 124

Special values (2/3)

 NA: Numbers that are “not available” (missing)

> x <- c(1, 2, 3, NA)
> x + 3
[1]  4  5  6 NA

 NULL: special NULL object (rarely of importance)
 Can be returned by functions whose value is undefined

 NaN: “Not a number”

> 0/0
[1] NaN

 Inf, -Inf: infinite

> log(0)
[1] -Inf

slide-125
SLIDE 125

Special values (3/3): Weird comparisons

Odd (but logical) interactions with equality tests, etc.:

> 3 == 3
[1] TRUE
> 3 == NA
[1] NA      # but not “FALSE”!
> NA == NA
[1] NA      # but not “TRUE”!
> NaN == NaN
[1] NA      # but not “TRUE”!
> 99999 >= Inf
[1] FALSE
> Inf == Inf
[1] TRUE

slide-126
SLIDE 126

Filtering those nasty NaNs etc.

> x <- c(1, 2, 3, NA, NaN, Inf, NULL)
> x[is.na(x)]
[1]  NA NaN
> x[! is.na(x)]
[1]   1   2   3 Inf
> x[! is.null(x)]
[1]   1   2   3  NA NaN Inf
> x[! is.nan(x)]
[1]   1   2   3  NA Inf
> x[is.finite(x)]
[1] 1 2 3

slide-127
SLIDE 127

Lists

slide-128
SLIDE 128

Lists (1/4)

vector: an ordered collection of data of the same type.

> a <- c(7, 5, 1)
> a[2]
[1] 5

list: an ordered collection of data of arbitrary types.

> doe <- list(name="john", age=28, married=F)
> doe$name
[1] "john"
> doe$age
[1] 28

Typically, vector/matrix elements are accessed by their index (= integer starting with 1), list elements by their name (= a string). But both types support both access methods.

slide-129
SLIDE 129

Lists (2/4)

 A list is an object consisting of objects called components.
 Components of a list don’t need to be of the same mode or type:
 list1 <- list(1, 2, TRUE, "a string", 17)
 list2 <- list(list1, 23, list1)   # lists within lists: possible
 A component of a list can be referred to
 as: listname[[index]]
 or as: listname$componentname, listname$`bla bla` (backticks!)

slide-130
SLIDE 130

Lists (3/4)

 The names of components may be abbreviated down to the minimum number of letters needed to identify them uniquely.
 Example:

> l <- list(blablabla=1, blubb=2, foo=3, bar=4)
> l$bla
[1] 1
> l$blu
[1] 2

 Syntactic quicksand:
 listname[[index]] is what you normally want
 listname[index] returns a sub-list containing the element at position index, i.e., it is the same as writing list(listname[[index]])
 There are functions whose return value is a list (and not a vector / matrix / array)

slide-131
SLIDE 131

Lists are very flexible

> my.list <- list(c(5,4,-1), c("X1","X2","X3"))
> my.list
[[1]]:
[1]  5  4 -1
[[2]]:
[1] "X1" "X2" "X3"
> my.list[[1]]
[1]  5  4 -1
> my.list <- list(component1=c(5,4,-1), component2=c("X1","X2","X3"))
> my.list$component2[2:3]
[1] "X2" "X3"

slide-132
SLIDE 132

Lists: Session

> Empl <- list(employee="Anna", spouse="Fred", children=3, child.ages=c(3,7,9))
> Empl[[1]]        # You’d achieve the same with: Empl$employee
[1] "Anna"
> Empl[[4]][2]     # You’d achieve the same with: Empl$child.ages[2]
[1] 7
> Empl$child.a     # You can shortcut child.ages as child.a
[1] 3 7 9
> Empl[4]          # a sublist consisting of the 4th component of Empl
$child.ages
[1] 3 7 9
> names(Empl)
[1] "employee"   "spouse"     "children"   "child.ages"
> unlist(Empl)     # converts it to a vector. Mixed types will be
                   # converted to strings, giving a string vector.

slide-133
SLIDE 133

Back to matrices: Naming elements of a matrix

> x.mat
     [,1] [,2]
[1,]    3   -1
[2,]    2    0
[3,]   -3    6
> dimnames(x.mat) <- list(c("Line1","Line2","xyz"), c("col1","col2"))
# assign names to rows/columns of the matrix
> x.mat
      col1 col2
Line1    3   -1
Line2    2    0
xyz     -3    6

slide-134
SLIDE 134

R as a “better gnuplot”: Graphics in R

slide-135
SLIDE 135

plot(): Scatterplots

 A scatterplot is a standard two-dimensional (X,Y) plot
 Used to examine the relationship between two (continuous) variables
 If x and y are vectors, then plot(x, y) produces a scatterplot of x against y
 I.e., draw a point at coordinates (x[1], y[1]), then (x[2], y[2]), etc.
 plot(y) produces a time series plot if y is a numeric vector or time series object.
 I.e., draw a point at coordinates (1, y[1]), then (2, y[2]), etc.
 plot() takes lots of arguments to make it look fancier => help(plot)

slide-136
SLIDE 136

Example: Graphics with plot()

> plot(rnorm(100), rnorm(100))

The function rnorm() simulates a random normal distribution. Help: ?rnorm, and also ?runif, ?rexp, ?rbinom, ...

[Figure: scatterplot of the two rnorm(100) vectors]

slide-137
SLIDE 137

Line plots

 Sometimes you don’t want just points → solution:

> plot(dataX, dataY, type="l")

 Or, points and lines between them:

> plot(dataX, dataY, type="b")

 Beware: If dataX is not nicely sorted, the lines will jump erroneously across the coordinate system
 Try plot(rnorm(100,1,1), rnorm(100,1,1), type="l") and see what happens

slide-138
SLIDE 138

Graphical Parameters of plot()

plot(x, y, …
  type = "c",     # c may be p (default), l, b, s, o, h, n. Try it.
  pch = "+",      # point type. Use a character or numbers 1–18
  lty = 1,        # line type (for type="l"). Use numbers.
  lwd = 2,        # line width (for type="l"). Use numbers.
  axes = "L",     # L = F, T
  xlab = "string", ylab = "string",  # labels on the axes
  main = "string", sub = "string",   # title, subtitle for the plot
  xlim = c(lo,hi), ylim = c(lo,hi)   # ranges for the axes
)

And some more. Try it out, play around, read help(plot) and help(par)

slide-139
SLIDE 139

More example graphics with plot()

> x <- seq(-2*pi, 2*pi, length=100)
> y <- sin(x)
> par(mfrow=c(2,2))   # multi-plot
> plot(x, y, xlab="x", ylab="Sin x")
> plot(x, y, type="l", main="A Line")
> plot(x[seq(5,100,by=5)], y[seq(5,100,by=5)], type="b", axes=F)
> plot(x, y, type="n", ylim=c(-2,1))
> par(mfrow=c(1,1))

[Figure: the four resulting plots]

slide-140
SLIDE 140

Multiple data in one plot

Scatter plot

  • 1. > plot(firstdataX, firstdataY, col="red", pch=1, …)
  • 2. > points(seconddataX, seconddataY, col="blue", pch=2)
  • 3. > points(thirddataX, thirddataY, col="green", pch=3)

Line plot

  • 1. > plot(firstdataX, firstdataY, col="red", lty=1, …)
  • 2. > lines(seconddataX, seconddataY, col="blue", lty=2, …)

Caution:

  • Only the plot() command sets the limits for the axes!
  • Avoid using plot(…, xlim=c(bla,blubb), ylim=c(laber,rhabarber)) (there are other ways to achieve this)

slide-141
SLIDE 141

Logarithmic scaling

 plot() can do logarithmic scaling
 plot(…, log="x")
 plot(…, log="y")
 plot(…, log="xy")
 Double-log scaling can help you to see more. Try:

> x <- 1:10
> x.rand <- 1.2^x + rexp(10,1)
> y <- 10*(21:30)
> y.rand <- 1.15^y + rexp(10, 20000)
> plot(x.rand, y.rand)
> plot(x.rand, y.rand, log="xy")

slide-142
SLIDE 142

More nicing up your graph

> axis(1, at=c(2,4,5),          # axis details (“ticks”, legend, …);
       labels=c("A","B","C"))   # use xaxt="n" or yaxt="n" inside plot()
> abline(lsfit(x,y))            # add a least-squares fit line
> abline(0,1)                   # add a line of slope 1 and intercept 0
> legend(locator(1), …)         # legends: very flexible

slide-143
SLIDE 143

Histogram

 A histogram is a special kind of bar plot
 It allows you to visualize the distribution of values for a numerical variable. Naïvely:
 Divide the range of measurement values into, say, 10 so-called “bins”
 Put all values from, say, 1-10 into bin 1, from 11-20 into bin 2, etc.
 Count: how many values in bin 1? In bin 2? …
 Then draw these counters
 When drawn with a density scale:
 the AREA (NOT height) of each bar is the proportion of observations in the interval
 the TOTAL AREA is 100% (or 1)
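The density-scale property can be checked directly in R. This is a small sketch; freq = FALSE switches hist() to the density scale:

```r
set.seed(5)
x <- rnorm(1000)
h <- hist(x, breaks = 20, freq = FALSE)  # bar heights are densities
sum(h$density * diff(h$breaks))          # total bar area: 1
```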

slide-144
SLIDE 144

R: making a histogram

 Type ?hist to view the help file
 Note some important arguments, esp. breaks
 Simulate some data, make histograms varying the number of bars (also called ‘bins’ or ‘cells’), e.g.

> par(mfrow=c(2,2))         # set up multiple plots
> simdata <- rchisq(100,8)  # some random numbers
> hist(simdata)             # default number of bins
> hist(simdata, breaks=2)   # etc.: 4, 20

slide-145
SLIDE 145
slide-146
SLIDE 146

R: setting your own breakpoints

> bps <- c(0,2,4,6,8,10,15,25) > hist(simdata,breaks=bps)

slide-147
SLIDE 147

Density plots

 Density: probability distribution
 Naïve view of density:
 A “continuous”, “unbroken” histogram
 “infinite number of bins”, a bin is “infinitesimally small”
 Analogy: Histogram ~ sum, density ~ integral
 Calculate the density and plot it:

> x <- rnorm(200,0,1)   # create random numbers
> plot(density(x))      # compare this to:
> hist(x)

slide-148
SLIDE 148

Other graphical functions

[Figure: example plots (speed vs. dist, and series X, Y, Z)]

See also: barplot(), image(), pairs(), persp(), piechart(), polygon(), library(modreg), scatter.smooth()

slide-149
SLIDE 149

Interactive Graphics Functions

 locator(n, type="p"): Waits for the user to select locations on the current plot using the left mouse button. This continues until n (default = 500) points have been selected.
 identify(x, y, labels): Allows the user to highlight any of the points defined by x and y.
 text(x, y, "Hey"): Writes text at coordinates (x, y).

slide-150
SLIDE 150

Input / output

slide-151
SLIDE 151

Simple input

 Task: Read a file into a vector
 Input file looks like this:

1 2 17.5 99

 Read this into vector x:

x <- scan("inputfile.txt")

 There are more options => help(scan)

slide-152
SLIDE 152

More complicated: Reading / writing tables

 Write a table into a file:

> x <- rnorm(100, 1, 1)
> write.table(x, file="numbers.txt")
# There are more options => help(write.table)

 Read a table from a file:

> x <- read.table("in.txt", header=FALSE)
# There are more options => help(read.table)

 Read a table from the Web:

> x <- read.table("http://www.net.in.tum.de/ …"…)

 Read a table with pre-processing from a pipe (use Perl to print the 2nd column, but only if the 1st column is >0):

> x <- read.table(pipe("perl -nae 'if($F[0] > 0){ print $F[1]; }'"))

slide-153
SLIDE 153

Universal: Using file handles

 File handles are about as universal as in Perl
 Write two lines into a file:

> fh <- file("output.txt", "w")   # w = write
> cat("blah", "blubb", sep="\n", file=fh)
> close(fh)

 Write into a file and compress it using gzip:

> fh <- gzfile("output.txt.gz", "w")
> cat("blah blah blah", … , file=fh)

 More examples: help(file)
 Also try “filenames” like http://www.blabla.bla/data.gz

slide-154
SLIDE 154

Graphical output: Saving your plots

 Output as (Encapsulated) PostScript:

> postscript("outputfile.eps")
> plot(data)   # You will not see this on screen!
> …            # do some more graphics
> dev.off()    # write into file

 There are many more options => help(postscript)
 View the file using, e.g., the gv program

 Output as PNG (bitmap): Simply replace postscript() above by png():

> png("outputfile.png", width=800, height=600, pointsize=12, bg="white")

slide-155
SLIDE 155

Useful built-in functions

slide-156
SLIDE 156

Useful functions

> seq(2,12,by=2)
[1]  2  4  6  8 10 12
> seq(4,5,length=5)
[1] 4.00 4.25 4.50 4.75 5.00
> rep(4,10)
[1] 4 4 4 4 4 4 4 4 4 4
> paste("V",1:5,sep="")
[1] "V1" "V2" "V3" "V4" "V5"
> LETTERS[1:7]
[1] "A" "B" "C" "D" "E" "F" "G"

slide-157
SLIDE 157

Mathematical operations

Normal calculations: + - * /
Powers: 2^5 or as well 2**5
Integer division: %/%
Modulus: %% (7%%5 gives 2)
Standard functions: abs(), sign(), log(), log10(), sqrt(), exp(), sin(), cos(), tan()
To round: round(x,3) rounds to 3 figures after the point
And also: floor(2.5) gives 2, ceiling(2.5) gives 3
All this works for matrices, vectors, arrays etc. as well!

slide-158
SLIDE 158

Vector functions

> vec <- c(5,4,6,11,14,19)
> sum(vec)
[1] 59
> prod(vec)
[1] 351120
> mean(vec)
[1] 9.833333
> var(vec)
[1] 34.96667
> sd(vec)
[1] 5.913262

And also: min(), max(), cummin(), cummax(), range()

slide-159
SLIDE 159

Logical functions

R knows two logical values: TRUE (short T) and FALSE (short F). And NA. Example:

> 3 == 4
[1] FALSE
> 4 > 3
[1] TRUE
> x <- -4:3
> x > 1
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> sum(x[x>1])
[1] 5
> sum(x>1)
[1] 2

Note the difference!

== equals, < less than, > greater than, <= less or equal, >= greater or equal, != not equal, & and, | or
slide-160
SLIDE 160

Programming: Control structures and functions

slide-161
SLIDE 161

Grouped expressions in R

x <- 1:9
if (length(x) <= 10) {
  x <- c(x, 10:20)   # append 10…20 to vector x
  print(x)
} else {
  print(x[1])
}

slide-162
SLIDE 162

Loops in R

list <- c(1,2,3,4,5,6,7,8,9,10)
for (i in list) {
  x[i] <- rnorm(1)
}

j <- 1
while (j < 10) {
  print(j)
  j <- j + 2
}

slide-163
SLIDE 163

Functions

 Functions do things with data
 “Input”: function arguments (0,1,2,…)
 “Output”: function result (exactly one)
 Example:

> pleaseadd <- function(a,b) {
+   result <- a+b
+   return(result)
+ }

 Editing of functions:

> fix(pleaseadd)   # opens pleaseadd() in an editor
The editor to be used is determined by the shell variable $EDITOR

slide-164
SLIDE 164

Calling Conventions for Functions

 Two ways of submitting parameters:
 Arguments may be specified in the same order in which they occur in the function definition
 Arguments may be specified as name=value. Here, the ordering is irrelevant.
 The above two rules can be mixed!

> t.test(x1, y1, var.equal=F, conf.level=.99)
> t.test(var.equal=F, conf.level=.99, x1, y1)

slide-165
SLIDE 165

Missing Arguments

 R functions can handle missing arguments in two ways:
 either by providing a default expression in the argument list of the definition
 or
 by testing explicitly for missing arguments

> add <- function(x, y=0) { x + y }
> add(4)
> add <- function(x, y) { if (missing(y)) x else x+y }
> add(4)

slide-166
SLIDE 166

Variable Number of Arguments

 The special argument name “...” in the function definition will match any number of arguments in the call.
 nargs() returns the number of arguments in the current call.

slide-167
SLIDE 167

Variable Number of Arguments

> mean.of.all <- function(...) mean(c(...))
> mean.of.all(1:10, 20:100, 12:14)
> mean.of.means <- function(...) {
    means <- numeric()
    for (x in list(...))
      means <- c(means, mean(x))
    mean(means)
  }

slide-168
SLIDE 168

Variable Number of Arguments

mean.of.means <- function(...) {
  n <- nargs()
  means <- numeric(n)
  all.x <- list(...)
  for (j in 1:n)
    means[j] <- mean(all.x[[j]])
  mean(means)
}

mean.of.means(1:10, 10:100)

slide-169
SLIDE 169

Even more datatypes: Data frames and factors

slide-170
SLIDE 170

Data Frames (1/2)

 Vector: All components must be of the same type
 List: Components may have different types
 Matrix: All components must be of the same type
 => Is there an equivalent to a List?
 Data frame:
 Data within each column must be of the same type, but
 different columns may have different types (e.g., numbers, booleans, …)
 Like a spreadsheet

Example:
> cw <- chickwts
> cw
   weight      feed
11    309   linseed
23    243   soybean
37    423 sunflower
…

slide-171
SLIDE 171

Data Frames (2/2)

 Data frame = special list with class “data.frame”.

 But: there are restrictions on which lists may be made into data frames.

 Components must be

 vectors (numeric, character, or logical)
 factors
 numeric matrices
 lists
 other data frames

 Matrices, lists, and data frames provide as many variables to

the new data frame as they have columns, elements, or variables, respectively.

 Numeric vectors and factors are included as-is
 Non-numeric vectors are coerced to factors, whose levels
are the unique values appearing in the vector.

 Vector structures appearing as variables of the data frame

must all have the same length, and matrix structures must all have the same row size.
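The coercion rule above can be seen directly. Note one caveat not on the slide: since R 4.0 the default is stringsAsFactors = FALSE, so the behaviour described here must be requested explicitly:

```r
# A character vector coerced to a factor on data-frame construction;
# stringsAsFactors = TRUE forces the pre-R-4.0 default the slide assumes
v <- c("soybean", "linseed", "soybean")
d <- data.frame(feed = v, stringsAsFactors = TRUE)

class(d$feed)    # -> "factor"
levels(d$feed)   # the unique values: "linseed" "soybean"
```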

slide-172
SLIDE 172

Subsetting in data frames (1/2)

Individual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index or their name.
> cw = chickwts
> cw
   weight      feed
1     179 horsebean
11    309   linseed
23    243   soybean
...
> cw[3,2]
[1] horsebean
6 Levels: casein horsebean linseed ... sunflower
> cw[37,]
   weight      feed
37    423 sunflower

slide-173
SLIDE 173

Subsetting in data frames (2/2)

> an = Animals
> an
                   body  brain
Mountain beaver   1.350    8.1
Cow             465.000  423.0
Grey wolf        36.330  119.5
...
> an[3,]
           body brain
Grey wolf 36.33 119.5
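Rows and columns can also be selected by name instead of by index. A sketch using a small hand-made frame (so it runs without the MASS package that provides Animals):

```r
# Rebuild the first three rows of the Animals-style data by hand
an <- data.frame(body  = c(1.35, 465, 36.33),
                 brain = c(8.1, 423, 119.5),
                 row.names = c("Mountain beaver", "Cow", "Grey wolf"))

an["Grey wolf", "brain"]  # by row and column name -> 119.5
an$body[2]                # whole column, then by index -> 465
```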

slide-174
SLIDE 174

Labels in data frames

> labels(an)
[[1]]
 [1] "Mountain beaver"  "Cow"
 [3] "Grey wolf"        "Goat"
 [5] "Guinea pig"       "Dipliodocus"
 [7] "Asian elephant"   "Donkey"
 [9] "Horse"            "Potar monkey"
[11] "Cat"              "Giraffe"
[13] "Gorilla"          "Human"
[15] "African elephant" "Triceratops"
[17] "Rhesus monkey"    "Kangaroo"
[19] "Golden hamster"   "Mouse"
[21] "Rabbit"           "Sheep"
[23] "Jaguar"           "Chimpanzee"
[25] "Rat"              "Brachiosaurus"
[27] "Mole"             "Pig"

[[2]]
[1] "body"  "brain"

slide-175
SLIDE 175

Factors

 A normal character string may contain arbitrary

text

 A factor may only take pre-defined values

 “Factor”: also called “category” or “enumerated type”
 Similar to enum in C, C++, or Java 1.5

 help(factor)
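A short sketch of defining a factor with a fixed set of allowed values (the level names are made up):

```r
# A factor may only take the pre-defined levels
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)   # the allowed values
table(sizes)    # counts per level, including the unused "medium"
```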

slide-176
SLIDE 176

Hash tables

slide-177
SLIDE 177

Hash Tables

 In vectors, lists, dataframes, arrays:

 elements are stored one after another
 accessed in that order by their index (an integer)
 or by the name of their row / column

 Now think of Perl’s hash tables, or

java.util.HashMap

 R has hash tables, too

slide-178
SLIDE 178

Hash Tables in R

In R, a hash table is the same as a workspace for variables, which is the same as an environment.
> tab = new.env(hash=T)
> assign("btk", list(cloneid=682638,
+        fullname="Bruton agammaglobulinemia tyrosine kinase"), env=tab)
> ls(env=tab)
[1] "btk"
> get("btk", env=tab)
$cloneid
[1] 682638

$fullname
[1] "Bruton agammaglobulinemia tyrosine kinase"
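Beyond assign() and get(), a few other environment operations map onto the usual hash-table vocabulary. A sketch:

```r
# An environment used as a hash table
tab <- new.env(hash = TRUE)
assign("x", 42, envir = tab)

exists("x", envir = tab, inherits = FALSE)  # "contains key?" -> TRUE
get("x", envir = tab)                       # lookup          -> 42
rm("x", envir = tab)                        # delete the entry
exists("x", envir = tab, inherits = FALSE)  # -> FALSE
```

inherits = FALSE restricts the lookup to the table itself, rather than searching enclosing environments as exists() does by default.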

slide-179
SLIDE 179

Object orientation

slide-180
SLIDE 180

Object orientation

 primitive (or: atomic) data types in R are:
 numeric (integer, double, complex)
 character
 logical
 function
 out of these, vectors, matrices, arrays, and lists can
be built.

slide-181
SLIDE 181

Object orientation

 Object: a collection of atomic variables and/or
other objects that belong together

 Similar to the previous list examples,

but there’s more to it.

 Parlance:

 class: the “abstract” definition of it
 object: a concrete instance
 method: another word for ‘function’
 slot: a component of an object (i.e., an object variable)

slide-182
SLIDE 182

Object orientation advantages

The usual suspects:

 Encapsulation (can use the objects and methods

someone else has written without having to care about the internals)

 Generic functions (e.g., plot, print)
 Inheritance (hierarchical organization of
complexity)

slide-183
SLIDE 183

Object orientation in R

  • Don’t be confused:

There are two different ways of object orientation in R!

  • Old system: S3 classes
  • Still very widespread – many modules still use them
  • New system: S4 classes
  • A bit safer…
  • …but more boilerplate code to type
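For contrast with the S4 code on the following slide, here is a minimal S3 sketch (the class name and fields are made up): an S3 class is nothing more than a class attribute plus a naming convention for methods.

```r
# S3: tag any object with a class attribute...
acct <- list(owner = "alice", balance = 100)
class(acct) <- "account"

# ...and dispatch happens purely by the name "generic.class"
print.account <- function(x, ...)
  cat("Account of", x$owner, "- balance:", x$balance, "\n")

print(acct)   # dispatches to print.account
```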
slide-184
SLIDE 184

Object orientation

library('methods')
setClass('microarray',                  ## the class definition
  representation(                       ## its slots
    qua     = 'matrix',
    samples = 'character',
    probes  = 'vector'),
  prototype = list(                     ## and default values
    qua     = matrix(nrow=0, ncol=0),
    samples = character(0),
    probes  = character(0)))

dat = read.delim('../data/alizadeh/lc7b017rex.DAT')
z   = cbind(dat$CH1I, dat$CH2I)

setMethod('plot',                       ## overload generic function ‘plot’
  signature(x='microarray'),            ## for this new class
  function(x, ...)
    plot(x@qua, xlab=x@samples[1], ylab=x@samples[2],
         pch='.', log='xy'))

ma = new('microarray',                  ## instantiate (construct)
  qua = z,
  samples = c('brain', 'foot'))
plot(ma)

slide-185
SLIDE 185

Object orientation in R

The plot(pisa.linearmodel) command is different from plot(year, inclin).
For plot(pisa.linearmodel), R recognizes that pisa.linearmodel is an “lm”
object. Thus it uses plot.lm().
Most R functions are object-oriented. For more details, see ?methods and ?class.
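The same dispatch can be reproduced with any linear model; a sketch using the cars dataset that ships with base R (pisa.linearmodel itself is not available here):

```r
# Fit a toy linear model; the result carries the class "lm"
m <- lm(dist ~ speed, data = cars)

class(m)           # -> "lm"
inherits(m, "lm")  # -> TRUE
# plot(m) would therefore dispatch to plot.lm(),
# just as print(m) dispatches to print.lm()
```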