Analysing data with GNU R
Nils Kammenhuber, Technische Universität München

Sources for some of these slides
Gilberto Câmara
Instituto Nacional de Pesquisas Espaciais
Manfred Jobmann
Technische Universität München
Johannes Freudenberg
Cincinnati Children’s Hospital Medical Center
Marcel Baumgartner
Nestec S.A.
Jaeyong Lee
Penn State University
Jennifer Urbano Blackford, Ph.D
Department of Psychiatry, Kennedy Center
Wolfgang Huber
Why this talk on statistics?
“My new protocol increases performance.” How to prove that?
Analytical expressions are usually cumbersome, and the model usually is way too simple
Simulations? Testbed?
Usually, you have measurements that you need to analyse
How to prove that the measurements really show the intended effect – i.e., that they don’t just show some random artefact?
Via more measurements → Statistics
Performance
“My new routing protocol increases network
performance.” – Yeah, but what's “performance”?
Base unit:
Throughput (IP), throughput (TCP, HTTP), delay (IP), delay (HTTP), packet drop rate (IP), …and many more!
Aggregation type:
Average … but what type of average?
Does the throughput for a 10GB object count more than for a 2kB object?
If not, should the weight be linear with the file size? Or log(size)?
Should we use the arithmetic, harmonic or geometric mean? (A quick comparison in R follows below.)
Minimum, variation across time, variation across different hosts
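As a quick illustration (not from the original slides; the throughput numbers are made up), the three classical means can disagree badly on the same data:
> thr <- c(0.8, 1.2, 95, 3.4, 2.1)   # hypothetical throughputs in Mbit/s
> mean(thr)                # arithmetic mean, about 20.5: dominated by the one fast transfer
> 1 / mean(1/thr)          # harmonic mean, about 1.75: dominated by the slow transfers
> exp(mean(log(thr)))      # geometric mean, about 3.65: a compromise between the two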
Performance: Take-home messages
Always think about the metric very intensely
Don’t just pick the first performance metric that comes to your mind – be imaginative and creative!
Try to find multiple metrics
Play around with them, compare them. Insight evolves → experiments evolve, metrics evolve. One metric may reveal some particular effect, another one may hide it. But both can be useful.
Log tons of data – all you can get hold of easily.
You never know what metrics you will come up with later. There’s time for rm and DROP TABLE later. Warning: Document that data! Otherwise you won’t have tons of data, but tons of meaningless numbers.
Take-home message (2)
In my opinion, intelligent behaviour consists of
20% clever answers
80% clever questions (which usually have trivial or obvious answers)
So always try to find a useful question! Then…
Refine the question
Find sub-questions and implied questions
Examples for rather ubiquitous “good” questions:
How do I measure network performance?
From whose standpoint? End user? Network operator? Is a one-dimensional metric enough for me?
How do I do a good simulation?
Am I using a realistic (or otherwise meaningful) workload? Am I using a topology that adequately reflects reality? Which aspects are unimportant and don't need to be included
in the simulation?
Can I analyse different aspects in different experiments?
Back to statistics! – Why GNU R?
Powerful tool
for plotting (gnuplot can do that, too)
for statistical analyses (gnuplot can’t)
Open source
Has become the de-facto standard for statistical analyses in many fields of science
Huge library: 3700 additional packages at CRAN
Why not GNU R?
Unintuitive as soon as you want to do more than plotting data points from a file (IMO)
Strange syntax
Confusing language elements: vector vs. list, matrix vs. dataframe vs. list of vectors, NA vs. NaN vs. NULL, etc.
Terribly slow if you program in idioms from traditional languages
Iterating over elements of a list or vector with for(): slooooooow
But applying an operator or function to an entire vector or matrix: remarkably fast
Aims at interactivity, not scripting.
Reproducing a plot 3 years later: almost impossible
… unless you cared for reproducibility right from the start, but it’s somewhat cumbersome and requires discipline on your side
History of R
Statistical programming language “S” developed at Bell Labs since 1976 (at the same time as UNIX)
Intended to interactively support research and data analysis projects
Exclusively licensed to Insightful (“S-Plus”)
“R”: Open-source platform similar to S
Developed by R. Gentleman and R. Ihaka (University of Auckland, NZ) during the 1990s
Most S-Plus programs will run on R without modification!
What R is and what it is not
R is
a programming language
a statistical package
an interpreter
Open Source
fast
R is not
a database
a collection of “black boxes”
a spreadsheet software package
commercially supported
fast
What R is
A domain-specific language:
Powerful tool for data analysis and statistics
Data handling and storage: numeric, textual
Powerful vector algebra, matrix algebra
High-level data-analytic and statistical functions
Graphics, plotting
A programming language
Language “built to deal with numbers and statistics”
Loops (slow!), branching, subroutines
Hash tables (strange) and regular expressions
Classes (“OO”)
Has become a standard in many areas of research that involve statistics
Many, many, many, many additional modules
http://cran.r-project.org/
apt-cache search r-cran
What R is not
is not a database, but connects to DBMSs
no spreadsheet view of data, but can import .csv files in various flavours
has no default point-and-click user interface, but connects to Java, Tcl/Tk (and there’s “Rcommander”, which I haven’t tried out)
language interpreter can be very slow, but allows you to call your own C/C++ code
no professional / commercial support
R and statistics
Packaging: a crucial infrastructure to efficiently
produce, load and keep consistent software libraries from (many) different sources / authors
Statistics: most packages deal with statistics and
data analysis
State of the art: many statistical researchers
provide their methods as R packages
Installation
To obtain and install R on your computer:
Ubuntu, Debian, etc.:
apt-get install r-base
Download:
Go to http://cran.r-project.org/mirrors.html to choose a mirror near you
Click on your favourite operating system (Linux, Mac, or Windows)
Download and install the “base”
Getting started
Call R from the shell:
user@host$ R
Leave R, go back to the shell:
> q()
Save information (y/n/q)? y
R: session management
Your R objects are stored in a workspace
To list the objects in your workspace (may be a lot):
> ls()
To remove objects which you don’t need any more:
> rm(weight, height, bmi)
To remove ALL objects in your workspace:
> rm(list=ls())
To save your workspace to a file:
> save.image()
The default workspace file is ./.RData
If you don’t like that, then you should have called R from a different directory …
Or try: > setwd("../../new_directory/blabla/")
First steps: R as a calculator
Exponentiation is ^ and not **:
> 5 + (6 + 7) * pi^2
[1] 133.3049
Values up to about ±1.7 ∙ 10^308:
> 2.8e+90 * 6.3e+217
[1] 1.764e+308
By default, log() is the natural logarithm (base = e):
> log(exp(1))
[1] 1
Named arguments; default values/optional arguments:
> log(1000, base=10)
[1] 3
R is case sensitive:
> Sin(pi/3)^2 + cos(pi/3)^2
Error: couldn't find function "Sin"
> sin(pi/3)^2 + cos(pi/3)^2
[1] 1
R as a calculator and function plotter
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> curve(sin(x)*sin(cos(x/2)) + x/12, from=0, to=4*pi)
> savePlot("/tmp/weird-function.png")
Side remark: Three ways of doing plots in R
- Standard R way
- plot()
- also: hist(), curve(), boxplot(), barplot(),…
- I’ll be talking about this way of plotting in this talk
- [Almost] same syntax as S-plus
- Trellis / Lattice graphs
- More powerful
- … but I have no experience with it, so it’s not in this talk
- Problems using additional packages with own plot routines
- ggplot2 package
- Very powerful, very user-friendly, good for explorative
analysis
- … but I have no experience with it, so it’s not in this talk
- Not much online documentation – so far you basically need
their book to use it
Help and other resources
Starting the R installation help pages in the browser (if this doesn’t work, don’t worry, then you’ll only have them in a manpage-like way):
> help.start()
In general:
> help(functionname)
> ?functionname
If you don’t know the function you’re looking for:
> help.search("quantile")
> ??"quantile"
“What’s in this variable?”
> class(variableInQuestion)
[1] "integer"
> summary(variableInQuestion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.000   5.250   8.500   9.833  13.250  19.000
www.r-project.org
CRAN.r-project.org: Additional packages, like www.CPAN.org for Perl
Installing additional R packages from CRAN
To install additional packages:
Ubuntu, Debian, etc.: First, try the shell command apt-cache search pkgname
Others / if there’s no Ubuntu/Debian package:
You might need to have C, C++, Fortran (!) compilers etc. installed on your machine
Start R on your computer. At the R prompt, type:
> chooseCRANmirror()
> install.packages(c("pkg1", "pkg2"), dependencies=TRUE)
After the package has been installed, make R use it:
> library(pkg1)
Now you can use the new functions etc.
If it didn’t work, try re-starting R
Remainder of this talk: Typical workflow
- Read in the file
- Explorative analysis
- Some basic statistics
- Some basic plot types
- Save the plots
- Advanced topics
- More complex analyses and plots
- More on R syntax
Very simple input
Task: Read a file into a vector
Input file looks like this:
1 2 17.5 99
Read this into vector x:
x <- scan("inputfile.txt")
There are more options => help(scan)
Some examples
> length(x)
[1] 100
> min(x)
[1] -2.4923
> max(x)
[1] 2.969
“What values does x typically have?” (1)
If there’s heavy variation in the data…
- What’s the typical income for persons in this room?
- Now Bill Gates walks in. What’s the typical income now?
- Mode (“Modalwert”): The value that occurs most often
- Of course, usually only defined for categorical or discrete values
- And what about local maxima? Or multiple maxima? → Use with care
- Unfortunately, not a built-in function in R. Instead use:
> tab.x <- table(x); names(tab.x)[which.max(tab.x)]
> median(x) – the 50% quantile
- 50% of values are greater than the median, 50% are smaller
- Way less sensitive to outliers than the mean (cf. Bill Gates example)
- Estimator for the mean if the x are symmetrically distributed
> mean(x, trim=0.05) – trimmed mean
- Disregard the 5% smallest and 5% largest values of x
- Idea: Outliers are not representative / not important for the system
- Dangerous! Very often, that is a wrong assumption! (A small demonstration follows below.)
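A tiny demonstration of the Bill Gates effect (the numbers are made up, not from the original slides):
> income <- c(rep(30000, 49), 1e9)   # 49 ordinary incomes plus one Bill Gates
> mean(income)                       # about 20 million – hardly “typical”
> median(income)                     # 30000 – robust against the outlier
> mean(income, trim=0.05)            # 30000 – but only because the outlier was thrown away; cf. the warning above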
“Unfortunately, this is not a built-in function” Writing your own functions in R
> harmean <- function(x) {
+   return(length(x) / sum(x))
+ }
Fix an error using an external text editor (N.B. should be 1/x, not x):
> fix(harmean)
Syntax error […blah…] in line 2. Use x <- edit()
> fix(harmean)
Baaaad! Now all of your changes are lost! Instead you should have done what R told you:
> harmean <- edit()
Now that we’re at it, let’s also fix our geometric mean:
> geomean <- edit()
Baaaad again! Now geomean becomes a copy of whatever you last edit()ed. → Should have used fix(geomean) here
(Corrected versions of both functions follow below.)
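For reference, a sketch of how the two helper functions could look once fixed (the names harmean and geomean come from the slide above):
> harmean <- function(x) { length(x) / sum(1/x) }   # harmonic mean
> geomean <- function(x) { exp(mean(log(x))) }      # geometric mean; only for positive x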
Saving your own functions in R
- Method 1: Automatic saving
- > quit()
Save workspace image? [y/n/c]:
- If you answer “y” now, your own functions will re-appear
the next time that you start R
- …but only if you start it from the same working directory!
- And where is that?
> getwd() [1] “/home/blabla/research/mydata”
- Method 2:
- Edit external text file with, e.g., Emacs
- > source("my-functions.R")
- What about the other way round? (i.e., R → text file)
> dump(list=c("fnc1", "fnc2"), file="my-fun.R")
Back to the data. What about other statistics?
- quantile(x, probs=c(0.05, 0.1, 0.9, 0.95))
- The values below which 5%, 10%, 90% or 95% of the data lie
- > summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -9.38   -3.51   -0.03    0.54    3.96   11.67
The 25% and 75% quantiles are called the quartiles
- Help! summary() behaves strangely and doesn’t calculate quantiles and mean!
- Probably some non-numerical values slipped into your data
- → R will not view the data as numbers
- Typical culprits: “-”, “N/A”, “*”, “unknown” etc. in your file (see the read.table() sketch below)
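A hedged sketch of how to keep such culprits out when reading the file (the file name is made up; na.strings is a standard read.table() argument):
> x <- read.table("mydata.txt", na.strings=c("-", "N/A", "*", "unknown"))
> summary(x)   # the column stays numeric; the culprits are counted as NAs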
Measuring dispersion
Why is there also the standard deviation if we already have the variance?
Standard deviation, standard error
Comparing dispersions
And if I want to compare the standard errors, not the standard deviations?
What about Bill Gates and dispersion…?
Remove the “constant=1” to get an estimate for the std. dev. (then you don’t have to multiply with 0.6745)
A small sketch of the usual dispersion measures follows below.
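Since the plots for these slides are not reproduced here, a minimal sketch of the dispersion measures mentioned (all standard R functions):
> var(x)                   # variance
> sd(x)                    # standard deviation = sqrt(var(x)), same unit as x
> sd(x) / sqrt(length(x))  # standard error of the mean
> mad(x)                   # median absolute deviation, scaled to estimate the sd
> mad(x, constant=1)       # raw MAD, as mentioned above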
Robust statistics
- Classical statistics: mean, variance, …
- Very sensitive to outliers
- More exact if there are no outliers
- Readers (and paper reviewers!) are accustomed to it
- Robust statistics: median, MAD, …
- Not sensitive to outliers
- Less exact: One single datapoint represents the entire data set!
- Readers and reviewers may not know them
(“The median, ah, yes, what’s that again… and WTF is MAD?”)
- When in doubt, use robust statistics
- But that’s just my personal opinion!
- Enough. Let’s plot!
- Obvious approach: Scatter plot
> plot(x)
> plot(x, type="l")
Plotting (2)
Combination of type="p" and type="l":
> plot(x, type="b")
But does this plot type actually reveal what we want to see?
- Always ask this question!
- So, are you really interested in the temporal
development? Apparently, the values fluctuate randomly and are not really correlated…
- Better alternatives:
- Box–whisker plot
- Histogram plot
- Density plot
- CDF plot
- CCDF plot
- Scatterplot against other data
- QQ plot against postulated theoretical distribution
- QQ plot against other data
What you should not plot
- A real researcher never, ever uses pie charts!
- There’s a good reason why R has no method for drawing them
- More on this in the slideset “How to lie with statistics”
- Also, you should try to avoid 3D charts at all costs
- A 3D plot is plotted on 2D paper or a 2D screen → often very difficult to make out the 3D structure
- Much more difficult to understand (“so what does it actually mean if a data point is in the upper left rear corner?”)
- Only use 3D plots as a very last resort
- Always remember:
- You don’t make the plot just for you
- You don’t make the plot to impress the reader
- You make the plot to help the reader (and the reviewer!)
understand your point
Box–whisker plot
Tersely shows the distribution of your data:
> boxplot(x)
75% quantile, 25% quantile, 50% quantile (i.e., median)
Outlier points
Boundaries calculated from quantile ranges (last data points that are still within 1.5 ∙ IQR of the quartiles)
Histogram plot: This is what you usually want
> hist(x)
> hist(x, breaks=20)
Density plots (PDF)
- “A histogram with an infinite number of (interpolated) bars”
- Interpolation using a density kernel function
> plot(density(x))
Cumulative distribution function plots (CDF)
- Basically, it’s just the integral of the density function
- But without the interpolation part
- So it’s actually a sum, not an integral
> plot(ecdf(x))
- Why CDF instead of density or histogram?
- Can immediately see quantiles (e.g., the 20% quantile)
- More expressive if points spread out very wide
Logarithmic axes
- Use logarithmic scaling if the data is very asymmetric
- Always a good idea to play around with axis scaling!
> plot(ecdf(x), log="x")
Error: logarithmic axis must have positive boundaries
> plot(ecdf(x), log="x", xlim=range(x))
atanh, asinh axes (Ralph: “The Kammenhuber scale” – but I was not the first to discover it ;-))
- Properties of logarithmic scaling
- Visual magnification towards 0
- Visual compression towards +∞
- Cannot display the value 0.0 or negative values
- But what if we wanted logarithm-like scaling for negative and for positive values in one plot?
- Two types of hyperbolic scaling
- asinh axis scaling:
- Visual compression towards −∞
- Visual magnification towards 0
- Visual compression towards +∞
- [shifted and scaled] atanh axis scaling:
- Visual magnification towards 0.0
- Visual compression around 0.5
- Visual magnification towards 1.0
Linear vs. asinh X axis
- Linear scale hides the fact that we have two peaks!
- N.B. Interval boundaries are different
Linear vs. atanh Y axis
- Linear scale hides the fact that the temporal development towards larger key lengths also took place at very short and very large key lengths, not only for popular key lengths
Why does it work?
asinh scale; atanh scale (shifted)
How to do it in R
- Rather tricky!
- R has no builtin support for this
- Need to manually fiddle around with it
- Basic idea for asinh-scaled X axis:
> plot(x=asinh(data), y=…, xaxt="n", …)
> axis(side=1, at=asinh(c(-100, -10, -1, 0, 1, 10, 100)), labels=c(-100, -10, -1, 0, 1, 10, 100))
- Basic idea for atanh-scaled Y axis:
> atanhS <- function(x) atanh(2*x - 1)
> plot(x=…, y=atanhS(data), yaxt="n", …)
> axis(side=2, at=atanhS(c(0.01, 0.1, 0.5, 0.9, 0.99)), labels=c(0.01, 0.1, 0.5, 0.9, 0.99))
- Cumbersome → I wrote some R scripts that are about to be released
Nomenclature
- Apparently, these axis scales have been known for
decades or centuries
- Apparently, no consistent nomenclature for them
- I call them “two-ended pseudo-logarithmic scale”,
since that describes what they look like
- Hard to understand if you explain it purely orally:
- ‘But hey, a double-logarithmic plot is easy to do in R, just
say plot(…, log=“xy”)’
- That’s true, but a double-logarithmic plot is a plot with
logarithmic X and logarithmic Y axis – that’s an entirely different thing!
Further information
- Nils Kammenhuber: Two-ended pseudo-logarithmic scaling. Technical Report, NET 2012, to appear
Combining plots
- Now I have multiple measurements that I want to
compare in one single plot
> plot(first.data, type="l", col="black", lty=1)
> lines(second.data, col="blue", lty=2)
- Or with points instead of lines:
> plot(first.data, type="p", col="black", pch=1)
> points(second.data, col="blue", pch=2)
Adjusting axis limits
- Problem: The axis limits are not readjusted after the
first call to plot()
> plot(first.data, type="l", col="black", lty=1, xlim=range(c(first.data, second.data)), ylim=range(c(first.data, second.data)))
> lines(second.data, col="blue", lty=2)
- Alternatively, can manually specify axis limits, e.g.,
for probabilities:
> plot(first.data, …, xlim=range(c(first.data, second.data)), ylim=c(0.0, 1.0))
“I just ran the experiment for half an hour and I feel that I have enough data now.”
- Maybe… maybe not.
- The readers expect error bars or something like that
- How do I get error bars?
- Some researchers just use arith. mean ± 2 standard errors (or ±3)
- About 95.5% of all data falls within ±2σ of the “real” mean
- About 99.7% of all data falls within ±3σ of the “real” mean
- … but only if the data is actually normally distributed!
- Better: calculate a confidence interval on the mean of your data
- Semantics: “With a probability of 99% (confidence level; adjustable), the output of the process that generated this data has an arithmetic mean that lies within this interval”
- Higher confidence → interval gets larger
- Higher variability → interval gets larger
- More data points → interval gets smaller
How to calculate confidence intervals
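The slide contents are not reproduced here; as a minimal sketch, R computes a confidence interval on the mean via a one-sample t test:
> t.test(x, conf.level=0.99)$conf.int   # 99% confidence interval for mean(x)
Or by hand, mean ± t quantile ∙ standard error:
> n <- length(x)
> mean(x) + qt(c(0.005, 0.995), df=n-1) * sd(x)/sqrt(n)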
Confidence intervals: Be careful!
- 1. Your data should be (roughly) normally distributed
- … at least, not very asymmetric
- Check it → QQ plots, and maybe statistical tests
- If not the case → group the data and apply the Central Limit Theorem
- 2. Your data should be independently distributed
- Often not the case!
- Check it → (1) Plot the data and look for a trend, (2) ACF plots
- If not the case → group the data into larger groups and check the ACF again
- 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
“Here, the error roughly looks like a Gauß curve, so it must be normally distributed”
- How do I prove that my data is normally distributed?
- Naïve approach:
- Plot histogram and density of your data
- Plot the density function of the normal distribution into the same diagram
- “Here, they look the same”
> hist(mydata, freq=FALSE, xlim=c(0.0, 0.3))
> curve(dnorm(x, mean=mean(mydata), sd=sd(mydata)), col="blue", lty=2, add=TRUE)
> legend(… more about this later …)
It looks normally distributed, but this data actually comes from the sum of a Student-t and an exponential distribution…
What’s wrong with that approach
- 1. You can never “prove” that data is normally (or exponentially, t-, Pareto-, …) distributed
- You can only present graphs, calculations, statistical tests etc. that do not contradict that assumption
- 2. The eye is easily fooled – many symmetrical distributions look like Gauß’ bell curve
- 3. There are way better ways to analyse this:
- QQ plots
- PP plots (not shown)
- Statistical tests: χ² test, Kolmogorov–Smirnov test, … (not shown – they usually are “too picky” ;-))
Digression #1: Theoretical probability distributions in R
- A lot of distributions are built into R
- Naming convention: For every probability
distribution XY, there are the following four functions
- dxy() – the density function (I used that in the histogram)
- pxy() – the distribution function (≈ cumulative probability) (you could use that for comparison in a CDF plot of real data)
- qxy() – the quantile function (useful for QQ plots and some statistical tests)
- rxy() – draw random numbers from that distribution
- Example:
- rt(50, df=3) + rexp(50) - rnorm(50, sd=3):
Create 50 random variables from a combination of t, exponential and normal distributions
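A quick sketch of the naming convention for the normal distribution (all four are built-in functions):
> dnorm(0)      # density of N(0,1) at 0: about 0.399
> pnorm(1.96)   # cumulative probability: about 0.975
> qnorm(0.975)  # quantile function: about 1.96
> rnorm(5, mean=1, sd=3)   # five random draws from N(1, 3²)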
Digression #2: Random number generators
Random number generators: Literature
- The I–8 Discrete Event Simulation slideset on random number generation
- L’Ecuyer, Simard: TestU01: a C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, Volume 33, No. 4, 2007
Better alternative instead of comparing histogram vs. density curve: QQ plots
- Short for quantile–quantile plot
- Compare the quantiles of two distributions
- Imagine you have two distributions (e.g., #1 – empirical observations; #2 – postulated theoretical distribution, e.g., normal distribution)
- Calculate the 1%, 2%, … 100% quantiles for each distribution
- Take the quantiles of the 1st distribution as Y coordinates and the quantiles of the 2nd distribution as X coordinates
- How to read them:
- A straight (i.e., unbent) line means that the two distributions are similar
“In a sense, the only function that the human mind really understands well is a straight line. One good strategy for making a graph easy to understand is to make it as linear as possible.” – John P. Boyd
- Be careful about the ends
- Also note the axis labels – they might be shifted
QQ plot example
Histogram comparison: looks ~normal. QQ plot: visible differences – not a straight line
How to do a QQ plot in R
- QQ plot against the normal distribution:
> qqnorm(mydata)
> qqline(mydata)   # auxiliary straight line
- QQ plot against some other distribution xy:
> qqplot(qxy(ppoints(data), …), data)
- Nicer:
> qqplot(qxy(ppoints(data), …), data, xlab="Theoretical quantiles", ylab="Sample quantiles", main="Comparison with xy distribution")
- QQ plots can be used to compare different data sets:
> qqplot(data.internet, data.mysimulation)
QQ plots with type="l" or type="b"
> qqnorm(data, type="b")
- Huh!??
> qqnorm(sort(data), type="b")
Be careful when reading QQ plots!
Don’t let yourselves be fooled! DAX returns vs. normal distribution:
Good model for normal trading days; bad model for exceptional trading days
Confidence intervals: Be careful!
- 1. Your data should be (roughly) normally distributed
- … at least, not very asymmetric
- Check it → QQ plots, and maybe statistical tests
- If not the case → group the data and apply the Central Limit Theorem
- 2. Your data should be independently distributed
- Often not the case!
- Check it → (1) Plot the data and look for a trend, (2) ACF plots
- If not the case → group the data into larger groups and check the ACF again
- 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
Statistical tests
- If you really want to be correct, then use a statistical test once you’re satisfied with the QQ plot
- Millions of statistical tests can be used to check “if the data follows some given theoretical distribution” (e.g., Gauß’ normal distribution)
- χ² test: Very universal (i.e., can be used with just about any distribution), but less powerful (i.e., needs more data)
- Kolmogorov–Smirnov test: only for a number of popular distributions, but more powerful (also works with less data)
- Anderson–Darling test, …
- Most of them are built into R (a small sketch follows below)
- Check the literature to see how to use them
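Two examples (both built into R; note that, strictly speaking, estimating the parameters from the same data biases the Kolmogorov–Smirnov test):
> shapiro.test(mydata)   # Shapiro–Wilk test for normality
> ks.test(mydata, "pnorm", mean(mydata), sd(mydata))   # KS test against a fitted normal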
Statistical tests: General observations
- A statistical test can never prove that your data actually follows some distribution XY!
- Rather, it checks if it finds evidence against (!) this assumption.
- If no such evidence is found: “acquitted due to lack of evidence”, i.e., “could be XY distributed” (but not: “acquitted due to obvious innocence”!)
- If such evidence is found: “seems guilty from my point of view”, i.e., “to me, it doesn’t look XY distributed” – but in fact you might simply not have collected enough data yet, or you should have used a more powerful test:
- The more restricted your test (the fewer distributions you can use it for etc.), the better its power
- As with confidence intervals, you select a confidence level
- Semantics: “With α = 5% error rate, no evidence could be found that contradicts the assumption that the data is XY-distributed”
- In fact, confidence intervals are a special way to formulate a Student t test
Confidence intervals: Be careful!
- 1. Your data should be (roughly) normally distributed
- … at least, not very asymmetric
- Check it → QQ plots, and maybe statistical tests
- If not the case → group the data and apply the Central Limit Theorem
- 2. Your data should be independently distributed
- Often not the case!
- Check it → (1) Plot the data and look for a trend, (2) ACF plots
- If not the case → group the data into larger groups and check the ACF again
- 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
Forcing our data to be normally distributed: Making use of the Central Limit Theorem
- Oftentimes, your data is not normally distributed
- Central Limit Theorem, pragmatic version “for dummies”:
- You have a large number of samples that do not at all look normally distributed
- Group them together into chunks of about 30–100 samples
- Use groups of 30 if the data looks symmetric and roughly similar to a Gauß bell curve
- Use groups of 100 if the data looks very asymmetric
- For each group, calculate the arithmetic mean of that group
- Now a miracle happens: The group means [roughly] follow a normal distribution!
- Try it out in R yourselves! (A sketch follows below.)
- Now you can use the group means as input for your statistical tool that expects normally distributed data (e.g., confidence intervals)
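A minimal sketch of this “miracle” (made-up exponential data):
> x <- rexp(3000)                 # 3000 clearly non-normal samples
> hist(x)                         # very asymmetric
> group.means <- colMeans(matrix(x, nrow=30))   # 100 groups of 30
> hist(group.means)               # looks much more like a bell curve
> qqnorm(group.means); qqline(group.means)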
Confidence intervals: Be careful!
- 1. Your data should be (roughly) normally distributed
- … at least, not very asymmetric
- Check it → QQ plots, and maybe statistical tests
- If not the case → group the data and apply the Central Limit Theorem
- 2. Your data should be independently distributed
- Often not the case!
- Check it → (1) Plot the data and look for a trend, (2) ACF plots
- If not the case → group the data into larger groups and check the ACF again
- 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
Central limit theorem: limits
- The central limit theorem only holds if the process that creates your data has finite variance!
- Of course, the sample variance is always finite: sd(mydata) or var(mydata) will always give a finite value (… unless mydata contains an Inf value)
- But the process that created your samples can have an infinite variance
- “Infinite variance? Come on, that’s an esoteric special case!”
- No, it’s not – it can easily happen with power-law distributions… (see the sketch below)
- … and power-law distributions are ubiquitous!
- More on this on later slides
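A sketch of how unstable the sample mean becomes (Pareto samples with α = 1.5, i.e., finite mean but infinite variance, drawn via the inverse-transform trick):
> alpha <- 1.5
> x <- runif(10000)^(-1/alpha)   # P(X > t) = t^(-alpha) for t >= 1
> sapply(1:10, function(i) mean(x[((i-1)*1000+1):(i*1000)]))
# the ten chunk means keep jumping around instead of settling near the true mean (3)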
Confidence intervals: Be careful!
- 1. Your data should be (roughly) normally distributed
- … at least, not very asymmetric
- Check it → QQ plots, and maybe statistical tests
- If not the case → group the data and apply the Central Limit Theorem
- 2. Your data should be independently distributed
- Often not the case!
- Check it → (1) Plot the data and look for a trend, (2) ACF plots
- If not the case → group the data into larger groups and check the ACF again
- 3. The Central Limit Theorem only works if the system that creates your samples has finite variance
Data must not contain significant trends
- Simply do plot(data) and visually check if there’s a trend
- e.g., values seem to grow over time…
- … or seem to decay…
- … or you see a curved shape (e.g., first upwards, then downwards)
- … etc.
- AFAIK there is no really good way to check all these with a statistical test
- Even if you can’t make out a trend, you’ll have to check independence using an ACF plot
ACF plots: Are your samples really independent?
- Consider the temporal development of the queue length at a router:
- If large now → cannot change much in the next milliseconds
- If small now → unlikely that it changes dramatically
- If your data shows such autocorrelation → confidence intervals will be too small. If you don’t check this, you’re a cheater!
- Check autocorrelation using ACF plots (cf. next slides)
- If the plots suggest unacceptable autocorrelation, then … (see the grouping sketch below)
- Group the samples (as with the central limit theorem)
- If they are grouped already, make the groups larger
- “Should I include these ACF plots in my paper?”
- No – just say that “using ACF plots, we ensured that our samples (or sample means) were sufficiently uncorrelated”
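A sketch of the grouping idea (assuming length(x) is a multiple of 50):
> acf(x)                                        # strong autocorrelation?
> group.means <- colMeans(matrix(x, nrow=50))   # chunks of 50
> acf(group.means)                              # hopefully inside the noise boundaries now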
Autocorrelation
How to do and how to read ACF (autocorrelation function) plots
- How to do them in R:
> acf(mydata)
- What it shows:
- Calculates the autocorrelation for lag = 0, 1, 2, 3, …
- Shows boundaries for “random” autocorrelation
- “random autocorrelation”: an autocorrelation within these boundaries is very likely to be caused by random fluctuations, not by an actual systemic autocorrelation
- How to read it:
- GOOD: If all bars (except the silly first one) stay within these boundaries, then we can be happy and assume the sample does not contain unacceptably large autocorrelation
- BAD (i.e., we have an indication of an unacceptable level of autocorrelation): If further bars at the beginning cross the boundaries (erratic single bars further towards the right might be okay)
- IMO rather silly for lag=0 (the autocorrelation is 1 by definition), but R plots it…
The data → good ACF plot [figures not reproduced]
The data → bad ACF plot [figures not reproduced] – with a grain of salt, we might tolerate this… … but not this
The data → terrible ACF plot [figures not reproduced]
This number shuffling is not only good for confidence intervals…
- Many statistical tools require that your data be uncorrelated (“iid”)
- and some require it to be normally distributed
- You can apply the same tools in these cases (ACF plots, grouping, QQ plots, central limit theorem, etc.)
Digression: How to present a plot in a presentation
I am a bad example, since I don’t adhere to these steps in this talk, but…:
1. What figures does this plot talk about? Why do you show it?
“This plot compares the development of the gnaricrosticity in relation to the bullawomptleness and will show this paper is great”
2. Explain the axes. (People tend to forget this, but it’s important!)
- “The X axis displays the bullawomptleness in square Hertz per packet”
- “The Y axis displays the gnaricrosticity in cubic degrees per nanobyte”
3. How do you read your plot? (Don’t interpret the data points yet!)
- “Data points in the upper right corner are good, because blah.”
- “Data points in the lower right corner are unfavourable, ….”
4. Why are there multiple lines/colours/… in your plot?
“The blue dots represent measurements in Munich, the red dots represent those in Shanghai.”
5. Only now can you start interpreting what the plot shows:
“As you can see, we only have data points in the upper left corner, and only one in the lower right. Therefore I deserve the Fields medal.”
CCDF plots
- CDF plot: shows the cumulative probability on the Y axis
- CCDF plot: shows 1 − cumulative probability on the Y axis
- What’s the point?
- If we use a logarithmic Y axis, it makes a huge difference whether we do a CDF plot or a CCDF plot
- Almost always plotted with logscale X and logscale Y axes
- Can be used to spot power laws (cf. later slides)
- Alternative names: Survival plot, survival function plot, …
- How to do them in R?
- Unfortunately, not a builtin function (maybe I shouldn’t have mentioned that fact about the pie charts…)
- Use my crude plot.ccdf.R
- or download Fabian Schneider’s plot.ccdf.R
- or try to find a package at CRAN – or see the minimal sketch below
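A minimal CCDF sketch (assuming all values in x are positive, as required by the log axes):
> x.sorted <- sort(x)
> ccdf <- 1 - (seq_along(x.sorted) - 1) / length(x.sorted)   # fraction of points >= x
> plot(x.sorted, ccdf, log="xy", type="s", xlab="x", ylab="P(X > x)")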
A CDF plot with linear scale [figure not reproduced]
CDF plot, logarithmic X axis (same data) [figure not reproduced]
CDF plot, log X and Y axes (same data) [figure not reproduced]
CCDF plot, log X and Y axes (same data!!) [figure not reproduced]
Why should we care about power laws?
- Not everything out there is normally or exponentially
distributed
- Actually, Power Laws and similar distributions are
rather widespread (more on this later)
- Surprising and nasty properties! (see next slides)
- En vogue in the scientific community since about the
1980s, 1990s (depending on subject)
Every time you see something resembling a straight line on a plot with doubly logarithmic axes, you are expected to instantly yell “A power law! A power law!”, regardless of whether that actually makes sense. The more log-log-straight lines in your paper, the merrier.
Power laws: Surprising properties
- Depending on the exponent parameter, a power-law distribution may have…
- … infinite variance → very high variability
- Cannot just measure, e.g., 100 values, take their mean, and assume that the sample mean is a good estimator for the actual mean of the generating process – and the median is only a good estimator for the mean of a symmetric distribution!
- Central Limit Theorem will not work → cannot simply group a couple of values and trust they will be normally distributed around their mean
- … or even an infinite mean!
- Warning: Of course, the sample variance and sample mean are always defined! (unless you have Inf or NaN values in your data)
- Other surprising properties (more on this later):
- Self-similarity, fractal character
- Long-range dependence
Nomenclature confusion
Some notions in the power-law context
- Used almost interchangeably:
- Power Law, Power Tail, Long Tail, Fat Tail, Heavy Tail
- Pareto distribution
- Zipf’s Law
- 80:20 rule, Pareto principle
- (Pseudo|Asymptotic|Stochastic) self-similarity (of 1st, 2nd degree)
- Fractal
- Long-Range Dependence (LRD)
- Slowly decaying variance
- 1/f noise
- Hurst parameter > 0.5
- … but of course most of them do not describe exactly the same systems – there sometimes are subtle differences!
N.B. We won’t describe them with much mathematical rigour…
Self-similarity, fractals
Scale-free distribution: Not just a curiosity, but generates samples that are self-similar or fractal, i.e., more or less look the same on all kinds of scales:
Real-world / power-law traffic: Variability remains the same across many time scales
Unrealistic / non-power-law traffic: Variability is quickly averaged out on larger time scales
80:20 rule (a.k.a. “Pareto principle”)
The fraction W of the wealth in the hands of the richest P of the population is given by
W = P^((α−2)/(α−1))
Example: US wealth: α = 2.1
→ The richest 20% of the population hold 86% of the wealth (see the R one-liner below)
Of course, not only restricted to wealth, but a fundamental property of power-law-distributed variables
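The example as an R one-liner:
> P <- 0.2; alpha <- 2.1
> P^((alpha-2)/(alpha-1))   # about 0.864, i.e., the richest 20% hold ~86% of the wealth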
Power laws are seemingly everywhere (1)
[Figure panels: … in Moby Dick; scientific papers 1981–1997; AOL users visiting sites ’97; bestsellers 1895–1965; AT&T customers on 1 day; California 1910–1992]
Power laws are seemingly everywhere (2)
[Figure panels: the Moon; solar flares; wars (1816–1980); richest individuals 2003; US family names 1990; US cities 2003]
Power laws that “fall off at the end”
- Most (all?) real-world data faces some fundamental limit:
- Total disk space world-wide
- Maximum energy the earth’s crust can store, etc.
- More realistic: Truncated/Tamed Pareto distribution
- Data follows a power law for several magnitudes (straight line on the CCDF). Then, an exponential part kicks in (the tail falls down in the CCDF):
Are there power laws in all those data sets? Or limited power laws with exponential tails?
- A lot of ongoing arguments in the scientific community
- Popular alternative explanations:
- Truncated/Tamed Pareto distribution (we just saw that)
- Lognormal distribution (cf. Gong et al. 2005, Mitzenmacher 2004)
- Combined distributions: Head is lognormal, tail is Pareto / power law [and the last portion of the tail may have a fall-off]
- Double-Pareto distribution (cf. Mitzenmacher 2004)
- Weibull distribution
- Hyperexponential distribution
- …
More on power laws
- … in my lecture (“Analyse von Systemperformanz”), in 03.07.023 @ 14:00h
Extreme value theory
Just to mention it:
- Suppose you are interested in the maximum of a sample
- e.g., maximum water height
- The longer the sample, the higher the maximum
- Actually, the maximum is a random variable itself
- How is this maximum distributed?
- In practice: With what probability can we expect a +10m water level during 1,000 years?
- Extreme value theory: Regardless of the underlying distribution, there are only three ways the maxima can be distributed.
R syntax peculiarities
Basic data types
Objects
Containers that contain data
Types of objects: vector, factor, array, matrix, dataframe, list, function
Attributes
mode: numeric, character (= string!), complex, logical
length: number of elements in the object
Creation
assign a value
create a blank object
Identifiers (object names)
must start with a letter (A–Z or a–z)
can contain letters, digits (0–9), periods (“.”)
Periods have no special meaning (i.e., unlike C or Java!)
case-sensitive: e.g., mydata is different from MyData
Do not use the underscore “_”!
People mostly use “.” instead of “_”
Assignment
“<-” is used to indicate assignment:
x <- 4711
x <- "hello world!"
x <- c(1,2,3,4,5,6,7)
x <- c(1:7)
x <- 1:4
note: as of version 1.4, “=” is also a valid assignment operator, but its use is discouraged
Basic (atomic) data types
Logical (may be abbreviated to T and F):
> x <- TRUE; y <- F
> x; y
[1] TRUE
[1] FALSE
Numerical:
> a <- 5; b <- sqrt(2)
> a; b
[1] 5
[1] 1.414214
Strings (called “characters”!):
> a <- "1"; b <- 1
> a; b
[1] "1"
[1] 1
> a <- "string"
> b <- "a"; c <- a
> a; b; c
[1] "string"
[1] "a"
[1] "string"
But there is more!
R can handle “big chunks of numbers” in elegant ways:
Vector
Ordered collection of data of the same data type
Examples: download timestamps; last names of all students in the lecture
In R, a single number is a vector of length 1
Matrix
Rectangular table of data of the same data type
Example: round-trip times measured at different times from different vantage points
Array
Higher-dimensional matrix of data of the same data type
(Lists, data frames, factors, function objects, … → later)
Vectors
> Mydata <- c(2, 3.5, -0.2)               # vector (c = "concatenate")
> colours <- c("Black", "Red", "Yellow")  # string vector
> x1 <- 25:30                             # number sequence
> x1
[1] 25 26 27 28 29 30
> colours[1]                              # addressing one element
[1] "Black"                               # N.B.: in R, the indices start with 1, not with 0!
> x1[3:5]                                 # … and multiple elements
[1] 27 28 29
Vectors (continued)
More examples with vectors:
> x <- c(5.2, 1.7, 6.3)
> log(x)
[1] 1.6486586 0.5306283 1.8405496
> y <- 1:5
> z <- seq(1, 1.4, by = 0.1)
> y + z
[1] 2.0 3.1 4.2 5.3 6.4
> length(y)
[1] 5
> mean(y + z)
[1] 4.2
Subsetting
Often necessary to extract a subset of a vector or matrix
R offers a couple of neat ways to do that:
> x <- c("a", "b", "c", "d", "e", "f", "g", "a")
> x[1]                    # first (!) element
> x[3:5]                  # elements 3..5
> x[-(3:5)]               # all elements except 3..5
> x[c(T,F,T,F,T,F,T,F)]   # odd-index elements
> x[x <= "d"]             # elements "a"..."d", "a"
Typical operations on vector elements
Test the elements: yields a new boolean vector
Extract the positive elements
Remove the given elements
> Mydata
[1]  2.0  3.5 -0.2
> Mydata > 0
[1]  TRUE  TRUE FALSE
> Mydata[Mydata > 0]
[1] 2.0 3.5
> Mydata[-c(1,3)]
[1] 3.5
More vector operations
> x <- c(5,-2,3,-7)
> y <- c(1,2,3,4)*10   # multiplication applied to all the elements
> y
[1] 10 20 30 40
> sort(x)              # sorting a vector
[1] -7 -2  3  5
> order(x)             # element order for sorting
[1] 4 2 3 1
> y[order(x)]          # operation on all the components
[1] 40 20 30 10
> rev(x)               # reverse a vector
[1] -7  3 -2  5
Matrices
Matrix: Rectangular table of data of the same type
> m <- matrix(1:12, 4, byrow = T); m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
> y <- -1:2
> m.new <- m + y
> t(m.new)
     [,1] [,2] [,3] [,4]
[1,]    0    4    8   12
[2,]    1    5    9   13
[3,]    2    6   10   14
> dim(m)
[1] 4 3
> dim(t(m.new))
[1] 3 4
Matrices (continued)
> x <- c(3,-1,2,0,-3,6)
> x.mat <- matrix(x, ncol=2)   # matrix with 2 columns
> x.mat
     [,1] [,2]
[1,]    3    0
[2,]   -1   -3
[3,]    2    6
> x.matB <- matrix(x, ncol=2, byrow=T)   # by-row creation
> x.matB
     [,1] [,2]
[1,]    3   -1
[2,]    2    0
[3,]   -3    6
Building subvectors and submatrices
> x.matB[,2]         # 2nd column
[1] -1  0  6
> x.matB[c(1,3),]    # 1st and 3rd lines
     [,1] [,2]
[1,]    3   -1
[2,]   -3    6
> x.mat[-2,]         # everything but the 2nd line
     [,1] [,2]
[1,]    3    0
[2,]    2    6
Dealing with matrices
> dim(x.matB)             # dimension (i.e., size)
[1] 3 2
> t(x.matB)               # transposition
     [,1] [,2] [,3]
[1,]    3    2   -3
[2,]   -1    0    6
> x.matB %*% t(x.matB)    # matrix multiplication; also see %o%
     [,1] [,2] [,3]
[1,]   10    6  -15
[2,]    6    4   -6
[3,]  -15   -6   45
> solve()                 # inverse of a square matrix
> eigen()                 # eigenvectors and eigenvalues
Special values (1/3)
R is designed to handle statistical data
→ has to deal with missing / undefined / special values
Multiple kinds of missing values:
NA: not available (i.e., this data is simply missing)
NaN: not a number (i.e., undefined, e.g., 0/0 or log(−1))
Inf, -Inf: infinity (e.g., log(0) gives -Inf)
NULL
NaN ≠ Inf ≠ NA ≠ NULL ≠ FALSE ≠ "" ≠ 0 (pairwise)
NA may also appear as a Boolean value!
I.e., a boolean value in R ∈ {TRUE, FALSE, NA}
Special values (2/3)
NA: numbers that are “not available” (missing)
> x <- c(1, 2, 3, NA)
> x + 3
[1]  4  5  6 NA
NULL: special NULL object (rarely of importance)
Can be returned by functions whose value is undefined
NaN: “not a number”
> 0/0
[1] NaN
Inf, -Inf: infinite
> log(0)
[1] -Inf
Special values (3/3): Weird comparisons
Odd (but logical) interactions with equality tests, etc.:
> 3 == 3
[1] TRUE
> 3 == NA
[1] NA     # but not “FALSE”!
> NA == NA
[1] NA     # but not “TRUE”!
> NaN == NaN
[1] NA     # but not “TRUE”!
> 99999 >= Inf
[1] FALSE
> Inf == Inf
[1] TRUE
Filtering those nasty NaNs etc.
> x <- c(1, 2, 3, NA, NaN, Inf, NULL)   # the NULL vanishes: x has 6 elements
> x[is.na(x)]
[1]  NA NaN
> x[!is.na(x)]
[1]   1   2   3 Inf
> x[!is.null(x)]
[1]   1   2   3  NA NaN Inf
> x[!is.nan(x)]
[1]   1   2   3  NA Inf
> x[is.finite(x)]
[1] 1 2 3
Lists
Lists (1/4)
vector: an ordered collection of data of the same type.
> a <- c(7, 5, 1)
> a[2]
[1] 5
list: an ordered collection of data of arbitrary types.
> doe <- list(name="john", age=28, married=F)
> doe$name
[1] "john"
> doe$age
[1] 28
Typically, vector/matrix elements are accessed by their index (= an integer starting with 1), list elements by their name (= a string). But both types support both access methods.
Lists (2/4)
A list is an object consisting of objects called components.
Components of a list don’t need to be of the same mode or type:
list1 <- list(1, 2, TRUE, "a string", 17)
list2 <- list(list1, 23, list1)   # lists within lists: possible
A component of a list can be referred to as listname[[index]]
or as listname$componentname, listname$`bla bla` (backticks!)
Lists (3/4)
The names of components may be abbreviated down to the minimum number of letters needed to identify them uniquely.
Example:
> l <- list(blablabla=1, blubb=2, foo=3, bar=4)
> l$bla
[1] 1
> l$blu
[1] 2
Syntactic quicksand:
listname[[index]] is what you normally want
listname[index] returns a sub-list containing the element at position index, i.e., it is the same as writing list(listname[[index]])
There are functions whose return value is a list (and not a vector / matrix / array)
Lis ts are very flexible
> m
- y. l i st <
- l i st ( c( 5, 4, - 1) , c( "X1", "X2", "X3") )
> m
- y. l i st
[ [ 1] ] : [ 1] 5 4 - 1 [ [ 2] ] : [ 1] "X1" "X2" "X3" > m
- y. l i st [ [ 1] ]
[ 1] 5 4 - 1 > m
- y. l i st <
- l i st ( com
ponent 1= c( 5, 4, - 1) , com ponent 2= c( "X1", "X2", "X3") ) > m
- y. l i st $com
ponent 2[ 2: 3] [ 1] "X2" "X3"
Lis ts : Ses s ion
> Empl <- list(employee=“Anna”, spouse=“Fred”, children=3, child.ages=c(3,7,9)) > Empl[[1]]
# You’d achieve the same with: Empl$employee
“Anna” > Empl[[4]][2] 7
# You’d achieve the same with: Empl$child.ages[2]
> Empl$child.a [1] 3 7 9
# You can shortcut child.ages as child.a
> Empl[4]
# a subl i st consi st i ng of t he 4t h com ponent of Em pl
$child.ages [1] 3 7 9 > nam es( Em pl ) [ 1] “em pl oyee” “spouse” “chi l dr en” “chi l d. ages” > unl i st ( Em pl ) #
conver t s i t t o a vect or . M i xed t ypes w i l l be conver t ed t o st r i ngs, gi vi ng a st r i ng vect or .
Back to matrices: Naming elements of a matrix
> x.mat
     [,1] [,2]
[1,]    3   -1
[2,]    2    0
[3,]   -3    6
> dimnames(x.mat) <- list(c("Line1","Line2","xyz"), c("col1","col2"))   # assign names to the rows/columns of a matrix
> x.mat
      col1 col2
Line1    3   -1
Line2    2    0
xyz     -3    6
R as a “better gnuplot”: Graphics in R
plot(): Scatterplots
A scatterplot is a standard two-dimensional (X,Y) plot
Used to examine the relationship between two (continuous) variables
If x and y are vectors, then plot(x, y) produces a scatterplot of x against y
I.e., draw a point at coordinates (x[1], y[1]), then (x[2], y[2]), etc.
plot(y) produces a time-series plot if y is a numeric vector or time-series object.
I.e., draw a point at coordinates (1, y[1]), then (2, y[2]), etc.
plot() takes lots of arguments to make it look fancier => help(plot)
Example: Graphics with plot()
> plot(rnorm(100), rnorm(100))
The function rnorm() simulates draws from a normal distribution. Help: ?rnorm; see also ?runif, ?rexp, ?rbinom, …
[Figure: scatterplot of rnorm(100) against rnorm(100), not reproduced]
Line plots
Sometimes you don’t want just points → solution:
> plot(dataX, dataY, type="l")
Or, points and lines between them:
> plot(dataX, dataY, type="b")
Beware: If dataX is not nicely sorted, the lines will jump erratically across the coordinate system → try
> plot(rnorm(100,1,1), rnorm(100,1,1), type="l")
and see what happens
Graphical parameters of plot()
plot(x, y,
  type = "c",      # c may be p (default), l, b, s, o, h, n. Try it.
  pch = "+",       # point type: use a character or numbers 1–18
  lty = 1,         # line type (for type="l"): use numbers
  lwd = 2,         # line width (for type="l"): use numbers
  axes = L,        # L = F or T
  xlab = "string", ylab = "string",   # labels on the axes
  main = "string", sub = "string",    # title, subtitle for the plot
  xlim = c(lo,hi), ylim = c(lo,hi)    # ranges for the axes
)
And some more. Try it out, play around, read help(plot) and help(par)
More example graphics with plot()
> x <- seq(-2*pi, 2*pi, length=100)
> y <- sin(x)
> par(mfrow=c(2,2))   # multi-plot: 2×2 panels
> plot(x, y, xlab="x", ylab="Sin x")
> plot(x, y, type="l", main="A Line")
> plot(x[seq(5, 100, by=5)], y[seq(5, 100, by=5)], type="b", axes=F)
> plot(x, y, type="n", ylim=c(-2,1))
> par(mfrow=c(1,1))
[Figure: the resulting 2×2 panel of example plots, not reproduced]
Multiple data in one plot
Scatter plot:
1. > plot(firstdataX, firstdataY, col="red", pch=1, …)
2. > points(seconddataX, seconddataY, col="blue", pch=2)
3. > points(thirddataX, thirddataY, col="green", pch=3)
Line plot:
1. > plot(firstdataX, firstdataY, col="red", lty=1, …)
2. > lines(seconddataX, seconddataY, col="blue", lty=2, …)
Caution:
Only the plot() command sets the limits for the axes!
Avoid using plot(…, xlim=c(bla,blubb), ylim=c(laber,rhabarber)) (there are other ways to achieve this)
Logarithmic scaling
plot() can do logarithmic scaling:
plot(…, log="x")
plot(…, log="y")
plot(…, log="xy")
Double-log scaling can help you to see more. Try:
> x <- 1:10
> x.rand <- 1.2^x + rexp(10, 1)
> y <- 10*(21:30)
> y.rand <- 1.15^y + rexp(10, 20000)
> plot(x.rand, y.rand)
> plot(x.rand, y.rand, log="xy")
More nicing up your graph
> axis(1, at=c(2,4,5), labels=c("A","B","C"))   # axis details (“ticks”, labels, …); use xaxt="n" or yaxt="n" inside plot()
> abline(lsfit(x,y))      # add a least-squares fit line
> abline(0,1)             # add a line of slope 1 and intercept 0
> legend(locator(1), …)   # legends: very flexible
Histogram
A histogram is a special kind of bar plot.
It allows you to visualize the distribution of values for a numerical variable. Naïvely:
Divide the range of measurement values into, say, 10 so-called “bins”
Put all values from, say, 1–10 into bin 1, from 11–20 into bin 2, etc.
Count: how many values in bin 1? In bin 2? …
Then draw these counters
When drawn with a density scale:
the AREA (NOT the height) of each bar is the proportion of observations in the interval
the TOTAL AREA is 100% (or 1)
R: making a his togram
T
ype ?hist to view the help file
Note some important arguments, esp breaks
Simulate some data, make histograms varying the number of
bars (also called ‘bins’ or ‘cells’), e.g. > par(mfrow=c(2,2)) # set up multiple plots > simdata <-rchisq(100,8) # some random numbers > hist(simdata) # default number of bins > hist(simdata,breaks=2) # etc,4,20
R: setting your own breakpoints
> bps <- c(0,2,4,6,8,10,15,25)
> hist(simdata, breaks=bps)
Density plots
Density: probability distribution. Naïve view of density:
A “continuous”, “unbroken” histogram
“infinite number of bins”, each bin is “infinitesimally small”
Analogy: histogram ~ sum, density ~ integral
Calculate the density and plot it:
> x <- rnorm(200,0,1)   # create random numbers
> plot(density(x))      # compare this to:
> hist(x)
[Figure: example density plots, not reproduced]
Other graphical functions
See also: barplot(), image(), pairs(), persp(), piechart(), polygon(), scatter.smooth() (in library(modreg))
Interactive graphics functions
locator(n, type="p"): Waits for the user to select locations on the current plot using the left mouse button. This continues until n (default 500) points have been selected.
identify(x, y, labels): Allows the user to highlight any of the points defined by x and y.
text(x, y, "Hey"): Writes text at coordinates (x, y).
Input / output
Simple input
Task: Read a file into a vector
Input file looks like this:
1 2 17.5 99
Read this into vector x:
x <- scan("inputfile.txt")
There are more options => help(scan)
More complicated: Reading / writing tables
Write a table into a file:
> x <- rnorm(100, 1, 1)
> write.table(x, file="numbers.txt")
# There are more options => help(write.table)
Read a table from a file:
> x <- read.table("in.txt", header=FALSE)
# There are more options => help(read.table)
Read a table from the Web:
> x <- read.table("http://www.net.in.tum.de/ …" …)
Read a table with pre-processing from a pipe (use Perl to print the 2nd column, but only if the 1st column is > 0):
> x <- read.table(pipe("perl -lnae 'if($F[0] > 0){ print $F[1]; }'"))   # perl’s -l adds the newline after each print
Universal: Using file handles
File handles are about as universal as in Perl. Write two lines into a file:
> fh <- file("output.txt", "w")   # w = write
> cat("blah", "blubb", sep="\n", file=fh)
> close(fh)
Write into a file and compress it using gzip:
> fh <- gzfile("output.txt.gz", "w")
> cat("blah blah blah", …, file=fh)
More examples: help(file)
Also try “filenames” like http://www.blabla.bla/data.gz
Graphical output: Saving your plots
Output as (Encapsulated) PostScript:
> postscript("outputfile.eps")
> plot(data)   # you will not see this on screen!
> …            # do some more graphics
> dev.off()    # write into the file
There are many more options => help(postscript)
View the file using, e.g., the gv program
Output as PNG (bitmap): simply replace postscript() above by png():
> png("outputfile.png", width=800, height=600, pointsize=12, bg="white")
Useful built-in functions
Useful functions
> seq(2,12,by=2)
[1]  2  4  6  8 10 12
> seq(4,5,length=5)
[1] 4.00 4.25 4.50 4.75 5.00
> rep(4,10)
[1] 4 4 4 4 4 4 4 4 4 4
> paste("V",1:5,sep="")
[1] "V1" "V2" "V3" "V4" "V5"
> LETTERS[1:7]
[1] "A" "B" "C" "D" "E" "F" "G"
Mathematical operations
Normal calculations: + - * /
Powers: 2^5 or equivalently 2**5
Integer division: %/%
Modulus: %% (7 %% 5 gives 2)
Standard functions: abs(), sign(), log(), log10(), sqrt(), exp(), sin(), cos(), tan()
To round: round(x,3) rounds to 3 digits after the decimal point
And also: floor(2.5) gives 2, ceiling(2.5) gives 3
All this works for matrices, vectors, arrays etc. as well!
Vector functions
> vec <- c(5,4,6,11,14,19)
> sum(vec)
[1] 59
> prod(vec)
[1] 351120
> mean(vec)
[1] 9.833333
> var(vec)
[1] 34.96667
> sd(vec)
[1] 5.913262
And also: min(), max(), cummin(), cummax(), range()
Logical functions
R knows two logical values: TRUE (short: T) and FALSE (short: F). And NA.
Example:
> 3 == 4
[1] FALSE
> 4 > 3
[1] TRUE
> x <- -4:3
> x > 1
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
> sum(x[x>1])
[1] 5
> sum(x>1)
[1] 2
Note the difference!
== equals, < less than, > greater than, <= less or equal, >= greater or equal, != not equal, & and, | or
Programming: Control structures and functions
Grouped expressions in R
x = 1:9
if (length(x) <= 10) {
  x <- c(x, 10:20)   # append 10…20 to vector x
  print(x)
} else {
  print(x[1])
}
Loops in R
l i st <
- c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
f or ( i i n l i st ) { x[ i ] <
- r nor m
( 1) } j = 1 w hi l e( j < 10) { pr i nt ( j ) j <
- j +
2 }
Functions
Functions do things with data
“Input”: function arguments (0, 1, 2, …)
“Output”: function result (exactly one)
Example:
> pleaseadd <- function(a,b) {
+   result <- a+b
+   return(result)
+ }
Editing of functions:
> fix(pleaseadd)   # opens pleaseadd() in an editor
The editor to be used is determined by the shell variable $EDITOR
Calling Conventions for Functions
T
wo ways of submitting parameters
Arguments may be specified in the same order in which
they occur in function definition
Arguments may be specified as name=value.
Here, the ordering is irrelevant.
Above two rules can be mixed!
> t.test(x1, y1, var.equal=F , conf.level=.99) > t.test(var.equal=F , conf.level=.99, x1, y1)
Missing arguments
R functions can handle missing arguments in two ways:
either by providing a default expression in the argument list of the definition,
or by testing explicitly for missing arguments:
> add <- function(x, y=0) { x + y }
> add(4)
> add <- function(x, y) { if(missing(y)) x else x+y }
> add(4)
Variable number of arguments
The special argument name “...” in the function definition will match any number of arguments in the call.
nargs() returns the number of arguments in the current call.
> mean.of.all <- function(...) mean(c(...))
> mean.of.all(1:10, 20:100, 12:14)
> mean.of.means <- function(...) {
+   means <- numeric()
+   for (x in list(...)) means <- c(means, mean(x))
+   mean(means)
+ }
Variable number of arguments (continued)
mean.of.means <- function(...) {
  n <- nargs()
  means <- numeric(n)
  all.x <- list(...)
  for (j in 1:n) means[j] <- mean(all.x[[j]])
  mean(means)
}
mean.of.means(1:10, 10:100)
Even more datatypes: Data frames and factors
Data Frames (1/2)
Vector: All components must be of the same type
List: Components may have different types
Matrix: All components must be of the same type
=> Is there an equivalent to a list?
Data frame:
Data within each column must be of the same type, but different columns may have different types (e.g., numbers, booleans, …)
Like a spreadsheet
Example:
> cw <- chickwts
> cw
   weight      feed
11    309   linseed
23    243   soybean
37    423 sunflower
…
Data Frames (2/2)
Data frame = special list with class “data.frame”.
But: there are restrictions on lists that may be made into data frames. Components must be:
vectors (numeric, character, or logical)
factors
numeric matrices
lists
other data frames
Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively.
Numeric vectors and factors are included as-is.
Non-numeric vectors are coerced to be factors, whose levels are the unique values appearing in the vector.
Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.
Subsetting in data frames (1/2)
Individual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name:
> cw = chickwts
> cw
   weight      feed
1     179 horsebean
11    309   linseed
23    243   soybean
…
> cw[3,2]
[1] horsebean
6 Levels: casein horsebean linseed ... sunflower
> cw[3,]
   weight      feed
37    423 sunflower
Subsetting in data frames (2/2)
> an = Animals
> an
                  body brain
Mountain beaver  1.350   8.1
Cow            465.000 423.0
Grey wolf       36.330 119.5
…
> an[3,]
           body brain
Grey wolf 36.33 119.5
Labels in data frames
> labels(an)
[[1]]
 [1] "Mountain beaver"  "Cow"
 [3] "Grey wolf"        "Goat"
 [5] "Guinea pig"       "Dipliodocus"
 [7] "Asian elephant"   "Donkey"
 [9] "Horse"            "Potar monkey"
[11] "Cat"              "Giraffe"
[13] "Gorilla"          "Human"
[15] "African elephant" "Triceratops"
[17] "Rhesus monkey"    "Kangaroo"
[19] "Golden hamster"   "Mouse"
[21] "Rabbit"           "Sheep"
[23] "Jaguar"           "Chimpanzee"
[25] "Rat"              "Brachiosaurus"
[27] "Mole"             "Pig"
[[2]]
[1] "body"  "brain"
Factors
A normal character string may contain arbitrary text
A factor may only take pre-defined values
“Factor”: also called “category” or “enumerated type”
Similar to enum in C, C++ or Java 1.5
help(factor) – a small example follows below
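A small example (made up, not from the slides):
> sizes <- factor(c("small","large","large","small","medium"), levels=c("small","medium","large"))
> sizes
[1] small  large  large  small  medium
Levels: small medium large
> table(sizes)
sizes
 small medium  large
     2      1      2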
Hash tables
Hash Tables
In vectors, lists, dataframes, arrays:
elements are stored one after another
accessed in that order by their index (an integer) or by the name of their row / column
Now think of Perl’s hash tables, or java.util.HashMap
→ R has hash tables, too
Hash Tables in R
In R, a hash table is the same as a workspace for variables, which is the same as an environment.
> tab = new.env(hash=T)
> assign("btk", list(cloneid=682638, fullname="Bruton agammaglobulinemia tyrosine kinase"), env=tab)
> ls(env=tab)
[1] "btk"
> get("btk", env=tab)
$cloneid
[1] 682638
$fullname
[1] "Bruton agammaglobulinemia tyrosine kinase"
Object orientation
Object orientation
Primitive (or: atomic) data types in R are:
numeric (integer, double, complex)
character
logical
function
Out of these, vectors, matrices, arrays, and lists can be built.
Object orientation
Object: a collection of atomic variables and/or other objects that belong together
Similar to the previous list examples, but there’s more to it.
Parlance:
class: the “abstract” definition of it
object: a concrete instance
method: another word for ‘function’
slot: a component of an object (i.e., an object variable)
Object orientation advantages
The usual suspects:
Encapsulation (can use the objects and methods someone else has written without having to care about the internals)
Generic functions (e.g. plot, print)
Inheritance (hierarchical organization of complexity)
Object orientation in R
- Don’t be confused: There are two different ways of object orientation in R!
- Old system: S3 classes
- Still very widespread – many modules still use them
- New system: S4 classes
- A bit safer…
- … but more boilerplate code to type
(A tiny S3 sketch follows below; the next slide shows an S4 example.)
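A tiny S3 sketch (hypothetical class name; S3 dispatch is just the naming convention generic.class):
> p <- list(x=1, y=2)
> class(p) <- "point"                 # S3: the class is just an attribute
> print.point <- function(p, ...) cat("Point at (", p$x, ",", p$y, ")\n")
> print(p)                            # dispatches to print.point()
Point at ( 1 , 2 )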
Object orientation
library('methods')
setClass('microarray',                ## the class definition
  representation(                     ## its slots
    qua = 'matrix',
    samples = 'character',
    probes = 'vector'),
  prototype = list(                   ## and default values
    qua = matrix(nrow=0, ncol=0),
    samples = character(0),
    probes = character(0)))
dat = read.delim('../data/alizadeh/lc7b017rex.DAT')
z = cbind(dat$CH1I, dat$CH2I)
setMethod('plot',                     ## overload generic function 'plot'
  signature(x='microarray'),          ## for this new class
  function(x, ...)
    plot(x@qua, xlab=x@samples[1], ylab=x@samples[2], pch='.', log='xy'))
ma = new('microarray',                ## instantiate (construct)
  qua = z,
  samples = c('brain', 'foot'))
plot(ma)
Object orientation in R
The plot(pisa.linearmodel) command is different from plot(year, inclin).
→ With plot(pisa.linearmodel), R recognizes that pisa.linearmodel is an “lm” object.