Learning R
Professional Development Opportunity for the
Flow Cytometry Core Facility
July 27th & August 24th, 2018
LKG Consulting
Email: consulting.lkg@gmail.com Website: www.consultinglkg.com
An Introduction for New Users & Advanced Techniques
The goal of this workshop is to introduce (or re-introduce) you to the R programming language through reference tools and interactive examples. At the end of this workshop you will NOT be an R master (sorry), but you will have a bunch of knowledge and tricks to help you on your ongoing R adventures.
Laura Gray-Steinhauer
www.ualberta.ca/~lkgray
BSc in Mathematics, Statistics and Environmental Studies (UVIC, 2005)
MSc in Forest Biology and Management (UofA, 2008)
PhD in Forest Biology and Management (UofA, 2011)
Designated Professional Statistician with The Statistical Society of Canada (2014)
Research: Climate Change, Policy Evaluation, Adaptation, Mitigation, Risk management for forest resources, Conservation…
Learning R: An Introduction for New Users & Advanced Techniques
Workshop Schedule
8:15 – 8:30 Arrive to the Lab & Start up the computers
8:30 – 8:45 Welcome to the Workshop (housekeeping & today’s goals)
Introduction to R
8:45 – 9:05 Unit 1: Getting started in R (script files, working directories, RStudio, etc.)
9:05 – 9:30 Unit 2: Data Preparation in R (import/export, missing values, modes, classes, etc.)
9:30 – 10:30 Work period (questions are welcome)
10:30 – 10:45 Break
10:45 – 11:15 Unit 3: Data Management in R (tidyr and dplyr packages, etc.)
11:15 – 12:15 Work period (questions are welcome)
12:15 – 1:00 Lunch
Advanced R
1:00 – 1:30 Unit 4: Control Structures (loops, apply functions, etc.)
1:30 – 1:45 Unit 5: Functions (build your own)
1:45 – 2:45 Work period (questions are welcome)
2:45 – 3:00 Break
3:00 – 3:30 Unit 6: Graphics (ggplot2 package)
3:30 – 5:00 Work period (questions are welcome)
After 5:00 Enjoy your weekend!
than we will have time to go through in this course.
Gothic font (everything else is Arial)
indicate these could change depending on what you name your variables.
personal website:
www.ualberta.ca/~lkgray
permission to redistribute content
consulting.lkg@gmail.com
Introduction to R
https://cran.r-project.org/index.html
Working with numbers:
2+3
A=2+3
A
a (oops, R is case sensitive)
B=7
A+B
C=A+B
C
Working with vectors:
X=c(1,4,3,5,7)
Y=c(5,7,9,4,8)
mean(X)
sd(X)
X*10
Z=Y+3
Z
boxplot(X,Y,Z)
t.test(X,Y)
t.test(X,Z)
Working with tables & matrices:
K=as.data.frame(cbind(X,Y,Z))
X=X*10
K (oops, nothing happened?)
K$X=K$X*10
t(K)
plot(K)
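The "oops, nothing happened?" step is worth a closer look. A minimal sketch, reusing the X, Y and Z vectors from the example: the table K holds its own copy of the values, so changing X afterwards leaves K untouched.

```r
# Vectors from the example above
X <- c(1, 4, 3, 5, 7)
Y <- c(5, 7, 9, 4, 8)
Z <- Y + 3

# cbind() copies the values into a new object
K <- as.data.frame(cbind(X, Y, Z))

# Changing X afterwards does not touch the copy stored in K
X <- X * 10
K$X        # still 1 4 3 5 7

# To change the table, modify the column inside K
K$X <- K$X * 10
K$X        # now 10 40 30 50 70
```

The standalone vector X and the column K$X are two separate objects that only happen to share a name.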
Console window – What happens?
re-run your code without having to type.
being the first thing you do when you open R.
Recommended for this workshop
Tell R where to look
setwd("File path")
models to be incorporated into the software you purchased – BUT with R most new techniques and updates are generally available within months
(see Appendix 1 in your workbook for flow packages)
execute many basic statistical tests and graphic commands, but packages are where you get the power
today)
The package the function is housed in { }
Usage: Default settings for the function options
Description: What the function does
Arguments: Details of what the options control
Value: Details of what information is created when the command is executed OR further details of the function options
References: Where you can find more information about the function
See Also: Hyperlinks for alternative functions that may be useful
Examples: Code you can copy/paste into your R Console and step through line-by-line to get a better idea of what the command is doing
the same as Windows
available in drop-down menu
highlighting code and pressing Command R
function names
Text Editor
https://www.rstudio.com/
Preferred among programmers, we will use it in this workshop
Script Panel Environment/History Panel Console Plots/Packages/Help Files
Introduction to R
Saves information so you can use it later
e.g. scalar, vector, data table, matrix, ANOVA table, etc.
You create this! You name it!
everything is called data you might get confused).
What you are asking R to do for you
e.g. create a new variable, calculate a mean, etc.
accept other arguments to complete the desired action
function(data, argument1, argument2)
Vector – the basic data object, a list of information
e.g.
data.v=c(1,2,3,4)
data.v2=c("A","B","C")
Scalar – a vector of length=1
e.g. A=2
Matrix – a 2-dimensional vector. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length
e.g. data.m=matrix(data.v, nrow=2, ncol=2, byrow=FALSE, dimnames=list(c("Row1","Row2"), c("Col1","Col2")))
Data Frame – is a general form of a matrix, but in a data frame different columns can have different modes.
e.g.
x=c("Adam","Beth","Chris","Danielle") # Student names
y=c(78,90,56,49) # Test scores
z=c(TRUE,TRUE,TRUE,FALSE) # Passed?
mydata=data.frame(x,y,z) # make a data table
colnames(mydata)=c("student","testScore","Passed") # assign column names
mydata # view data table
List – An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name
e.g.
# example of a list with 4 components – a string, a numeric vector, a matrix, and a scalar
W=list(name="RcourseExamples", mynumbers=data.v, mymatrix=data.m, myscaler=A)
W # view list
Ask R:
is.vector() is.matrix() is.data.frame() is.list()
Convert:
as.vector() as.matrix() as.data.frame() as.list()
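As a rough illustration of the "Ask R" and "Convert" functions, the sketch below checks and converts the data.v vector from the earlier example. The names data.df and back.to.v are made up for this illustration.

```r
data.v <- c(1, 2, 3, 4)
data.m <- matrix(data.v, nrow = 2, ncol = 2)

is.vector(data.v)      # TRUE
is.matrix(data.v)      # FALSE
is.matrix(data.m)      # TRUE

data.df <- as.data.frame(data.m)   # columns are named V1 and V2 by default
is.data.frame(data.df) # TRUE
is.list(data.df)       # TRUE: a data frame is stored as a list of columns

back.to.v <- as.vector(data.m)     # the matrix is unrolled column by column
back.to.v              # 1 2 3 4
```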
a mutually exclusive classification according to the object’s basic structure.
value present
factor which include an imbedded order known as levels.
actions) have modes such as list or function.
Ask R:
is.numeric() is.character() is.factor() is.logical()
Convert:
as.numeric() as.character() as.factor() as.logical()
Investigate data structure:
str()
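A common reason to check the mode is a column of numbers that was read in as text. The sketch below uses made-up vectors (scores, passed) to show the checks and conversions listed above.

```r
scores <- c("78", "90", "56", "n/a")   # numbers stored as text, with one bad entry
is.numeric(scores)     # FALSE
is.character(scores)   # TRUE

# Converting to numeric turns anything that is not a number into NA (with a warning)
scores.num <- suppressWarnings(as.numeric(scores))
scores.num             # 78 90 56 NA

passed <- as.factor(c("yes", "no", "yes"))
levels(passed)         # "no" "yes" (levels are sorted alphabetically)

str(scores.num)        # compact summary of the mode and the first values
```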
imaginary numbers)
(Section 2.2)
Import/Export based on file type:
read.csv("File name", …)
write.csv(objectName, "File name", …)
Metadata that is not required in R: we need to "skip" these 6 lines when we import (skip=6)
Column names are present (header=T)
Missing values are represented by a blank cell (na.strings=" ")
row.names=F
specify attributes in your data (…)
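The import options above can be tried without a real data file. This is a sketch only: the file contents, the column names and the object names (example.file, flow, out.file) are invented for illustration, and the file is written to a temporary location.

```r
# Build a small example file: 6 lines of metadata, a header row, then data
example.file <- tempfile(fileext = ".csv")
writeLines(c(
  "Instrument: example", "Operator: example", "Date: example",
  "Run: example", "Notes: example", "Units: example",
  "Sample,Count,Viability",
  "A,1200,0.95",
  "B,,0.88",
  "C,950,"
), example.file)

# skip = 6 ignores the metadata, header = TRUE reads the column names,
# na.strings lists the cell contents that should be treated as missing
flow <- read.csv(example.file, skip = 6, header = TRUE, na.strings = c("", " "))
flow

# row.names = FALSE stops R from writing the row numbers as an extra column
out.file <- tempfile(fileext = ".csv")
write.csv(flow, out.file, row.names = FALSE)
```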
Missing Values:
is.na() na.rm=T (parameter in other functions)
na.strings=" " or "NA" or 0, etc.
are missing values (based on what R perceives as missing value – see first bullet point)
the missing values are located.
you can use the na.rm=TRUE parameter to remove missing values in the calculation (but they remain in the dataset).
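A short sketch of how is.na() and na.rm behave, using a made-up testScore vector with one missing value:

```r
testScore <- c(78, 90, NA, 49)

is.na(testScore)              # FALSE FALSE TRUE FALSE
which(is.na(testScore))       # 3: position of the missing value
sum(is.na(testScore))         # 1: how many values are missing

mean(testScore)               # NA: one missing value makes the result missing
mean(testScore, na.rm = TRUE) # 72.33333: mean of the 3 values that are present

length(testScore)             # 4: na.rm did not remove anything from the data
```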
Dates:
install.packages("lubridate")
library(lubridate)
ymd() mdy() dmy()
year() month() day()
dmy_hms()
hour() minute() second()
assign dates by specifying formats (clunky and sometimes a bit tricky)
because it simplifies working with dates.
year, month, day and hour, minute and second (if applicable)
reset)
spans between points, accounting for time zones, leap years, and daylight savings time.
Follow the Workbook Examples Any questions?
Get yourself a coffee… Next we’re learning how to manage data
Introduction to R
R Datasets:
library(datasets)
data(datasetName)
measurements for 50 flowers from each of Iris setosa, versicolor, and virginica. https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
variables, merge, and append data.
combined they allow you to simply “tidy” your data with a few commands.
and link outputs to one another in a “chain”
Data Management:
install.packages(c("tidyr","dplyr"))
library(tidyr)
library(dplyr)
Without Pipelines:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
With Pipelines:
data %>%
  filter(variable == "value") %>%
  summarise(Total = sum(variable)) %>%
  arrange(desc(Total))
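To see both styles run on real data, here is a sketch using the built-in iris dataset. It assumes the dplyr package is installed; the object names (step1, step2, step3, without.pipe, with.pipe) and the added group_by() step are illustrative choices, not part of the slide.

```r
library(dplyr)   # assumes the dplyr package is installed

# Without pipelines: every step needs its own temporary object
step1 <- filter(iris, Species != "setosa")
step2 <- group_by(step1, Species)
step3 <- summarise(step2, Total = sum(Petal.Length))
without.pipe <- arrange(step3, desc(Total))

# With pipelines: the output of each step is passed to the next one
with.pipe <- iris %>%
  filter(Species != "setosa") %>%
  group_by(Species) %>%
  summarise(Total = sum(Petal.Length)) %>%
  arrange(desc(Total))

with.pipe
```

Both versions give the same table; the pipeline simply avoids the intermediate objects.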
for data science (managing Big Data)
Data Management:
install.packages("tidyverse")
library(tidyverse)
and a value that contains the actual information.
function will transform the wide form of data to long form.
columns.
Wide form:
ID  Species     Sepal Length  Sepal Width  Petal Length  Petal Width
1   setosa      5.2           4.2          7.8           4.1
2   setosa      6.1           5.2          8.9           3.9
3   versicolor  4.7           3.6          5.8           3.8
4   virginica   2.1           5.7          6.3           2.4
…   …           …             …            …             …

Long form:
ID  Species  Flower_att    Measurement
1   setosa   Sepal.Length  5.2
1   setosa   Sepal.Width   4.2
1   setosa   Petal.Length  7.8
1   setosa   Petal.Width   4.1
…   …        …             …
gather() spread()
names) and sep arguments (position where to split the column)
with the into and sep arguments. Shape Data Table:
gather(data, key, value)
spread(data, key, value)
Shape Columns:
separate(data, column, into, sep)
unite(data, column, into, sep)
Syntax    Meaning
==        Equal to
!=        Not equal to
<         Less than
<=        Less than or equal to
>         Greater than
>=        Greater than or equal to
%in%      Group membership
is.na     Is NA
!is.na    Is not NA
&         Boolean “AND”
|         Boolean “OR”
!         Boolean “NOT”
any       Boolean “do any match criteria?”
all       Boolean “do all match criteria?”
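The operators in the table can be tried directly on vectors before using them as filter criteria. A small sketch with made-up species and petal vectors:

```r
species <- c("setosa", "versicolor", "virginica", "setosa")
petal   <- c(1.4, 4.7, NA, 1.3)

species == "setosa"                        # TRUE FALSE FALSE TRUE
species != "setosa"                        # FALSE TRUE TRUE FALSE
species %in% c("versicolor", "virginica")  # FALSE TRUE TRUE FALSE

petal >= 1.4                               # TRUE TRUE NA FALSE (a comparison with NA stays NA)
!is.na(petal) & petal < 2                  # TRUE FALSE FALSE TRUE
species == "setosa" | is.na(petal)         # TRUE FALSE TRUE TRUE

any(is.na(petal))                          # TRUE: at least one value is missing
all(species == "setosa")                   # FALSE: not every value matches
```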
numerical, or alphabetical order – whatever is appropriate for the data class.
Subset:
select(data, c(columnNames))
filter(data, criteria)
Order:
arrange(data, orderCriteria)
existing variables.
summarize data by group
iris2<-group_by(iris2,Species) #group iris2 dataset by Species
iris3<-summarize(iris2,
  Petal.Length.avg=mean(Petal.Length, na.rm=T),
  Petal.Width.avg=mean(Petal.Width, na.rm=T),
  Sepal.Length.avg=mean(Sepal.Length, na.rm=T),
  Sepal.Width.avg=mean(Sepal.Width, na.rm=T))
iris3 #view new dataset
page 28) New variables:
mutate(data, NewVariable=formula)
summarize(GroupedData, functions)
values in y, and all columns from x and y.
returned.
returned.
returned.
in y, keeping just columns from x.
for each matching row of y, where a semi join will never duplicate rows of x.
values in y, keeping just columns from x.
y to merge by
Merge:
joinOption(data1, data2, by)
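The join options are easiest to compare side by side on two tiny tables. This sketch assumes the dplyr package is installed; the tables x and y and their values are invented for illustration.

```r
library(dplyr)   # assumes the dplyr package is installed

# Two hypothetical tables that share an ID column
x <- data.frame(ID = c(1, 2, 3), Species = c("setosa", "setosa", "versicolor"))
y <- data.frame(ID = c(2, 3, 4), Petal.Length = c(8.9, 5.8, 6.3))

inner_join(x, y, by = "ID")  # IDs 2 and 3: rows with a match in both tables
left_join(x, y, by = "ID")   # IDs 1, 2, 3: all rows of x, NA where y has no match
full_join(x, y, by = "ID")   # IDs 1, 2, 3, 4: all rows of x and y
semi_join(x, y, by = "ID")   # IDs 2 and 3, but only the columns of x
anti_join(x, y, by = "ID")   # ID 1: rows of x with no match in y
```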
Append:
install.packages("plyr")
library(plyr)
rbind.fill(data1, data2)

ID  Species  Sepal Length  Sepal Width  Petal Length  Petal Width
1   setosa   5.2           4.2          7.8           4.1
2   setosa   6.1           5.2          8.9           3.9
function
the column names) it returns an error
which will “stack” the data table and input NA in cells where the column names do not match.
ID  Species  Sepal Length  Sepal Width  Stem Length  Stem Width
5   setosa   8.7           2.2          11.8         1.1
6   setosa   7.2           4.2          12.9         2.9

ID  Species  Sepal Length  Sepal Width  Petal Length  Petal Width  Stem Length  Stem Width
1   setosa   5.2           4.2          7.8           4.1          NA           NA
2   setosa   6.1           5.2          8.9           3.9          NA           NA
5   setosa   8.7           2.2          NA            NA           11.8         1.1
6   setosa   7.2           4.2          NA            NA           12.9         2.9
Follow the Workbook Examples Any questions?
Take a break & get some brain food… you’ll need it ready for this afternoon!
Advanced R
made a decision (referred to as “choosing a direction”) based
times, or if you want to run a piece of code if a certain condition is met.
(Section 4.1)
want to create conditional statements within other functions (like the mutate() function we previously learned) it is better to use the vectorized form of this function, ifelse()
If and else statements:
if(condition){ACTION} else {ACTION}
ifelse(condition, TRUE ACTION, ELSE ACTION)
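A minimal sketch of the two forms, using made-up test scores and a pass mark of 50 chosen only for illustration:

```r
testScore <- 56

# if/else evaluates a single condition and runs one of the two blocks
if (testScore >= 50) {
  result <- "Passed"
} else {
  result <- "Failed"
}
result                                   # "Passed"

# ifelse() is vectorized: it checks every element and returns a vector
scores <- c(78, 90, 56, 49)
ifelse(scores >= 50, "Passed", "Failed") # "Passed" "Passed" "Passed" "Failed"
```

Because ifelse() returns one value per element, it is the form that fits inside column-wise functions such as mutate().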
block of code.
commands until the end of a sequence
for(i in 1:10){
  Action #1
  Action #2
  Action #3
}
For loops:
for(i in sequence){ACTIONS}
Iterative variable (i) will start at 1 and increase every run until it reaches 10
There will be 10 occurrences of this loop
First bracket opens the loop
Second bracket closes the loop
Actions to complete in the loop
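Here is a small runnable for loop along the same lines. The scores vector and the running total are made up for illustration.

```r
scores  <- c(78, 90, 56, 49)
running <- numeric(length(scores))   # empty vector to hold the results
total   <- 0

for (i in 1:length(scores)) {        # i takes the values 1, 2, 3, 4 in turn
  total      <- total + scores[i]    # Action #1: update the running total
  running[i] <- total                # Action #2: store it in position i
}

running                              # 78 168 224 273
```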
are different than for() statements.
condition being TRUE to stay within the loop.
loop is exited.
way of failing the condition – else the loop will continue running indefinitely.
loop.
count<-1
while(count<10){
  Action #1
  Action #2
  count<-count+1
}
while loops:
while(condition){ACTIONS}
First bracket opens the loop
Second bracket closes the loop
Condition to allow for eventual loop exit
Set up condition (example: lots of ways to do this)
Actions to complete in the loop
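A runnable version of the same pattern, with a made-up action (adding up the counter values) so the result can be checked:

```r
count <- 1        # set up the condition before the loop starts
total <- 0

while (count < 10) {        # the loop keeps running while this is TRUE
  total <- total + count    # action to complete in the loop
  count <- count + 1        # moves count toward 10 so the loop can exit
}

count   # 10: the first value that fails the condition
total   # 45: the sum of 1 to 9
```

Leaving out the line that increases count would keep the condition TRUE forever, which is the indefinite loop to watch out for.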
want to apply to the supplied data.
new variables.
(i.e. list, vector, table, matrix, etc.)
MARGIN (1-rows, 2-columns, c(1,2)-rows and columns) or the type of function used (vector, matrix, etc.)
Apply functions:
apply(data, MARGIN, FUN)
lapply(data, FUN)
sapply(data, FUN)
vapply(data, FUN, format)
tapply(data, INDEX, FUN)
mapply(FUN, data1, data2)
loops – which can be tedious to construct and debug
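Each member of the apply family can be tried on a few small objects. The matrix m, the list mylist and the score values below are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)             # 2 rows, 3 columns, filled by column

apply(m, 1, sum)                       # row sums: 9 12
apply(m, 2, sum)                       # column sums: 3 7 11

mylist <- list(a = 1:3, b = c(10, 20))
lapply(mylist, mean)                   # returns a list: a = 2, b = 15
sapply(mylist, mean)                   # simplifies to a named vector: a = 2, b = 15
vapply(mylist, mean, numeric(1))       # like sapply, but checks each result is one number

# tapply applies a function to groups defined by a second vector
tapply(c(78, 90, 56, 49), c("A", "A", "B", "B"), mean)   # A = 84, B = 52.5

# mapply steps through several vectors in parallel
mapply(function(x, y) x + y, 1:3, 4:6)                   # 5 7 9
```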
Advanced R
(Sections 5.1 to 5.3)
Steps to create a function:
1. Name your function (allows you to source this function later)
2. Set the arguments you want to be input into the function
3. Open the function with a bracket {
4. Write the set of commands you need to complete the purpose of your function
5. Return a value, statement, something…
6. Close the function with a bracket }
the environment of the function – they don’t exist outside of the function.
function.name <- function(argument1, argument2, …){
  statements to complete the desired actions
  return(value)
}
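The template can be filled in as follows. This is only a sketch: percent.score, its arguments and the internal variable score.fraction are hypothetical names chosen for illustration.

```r
# A hypothetical function that converts a score to a percentage
percent.score <- function(score, max.score = 100) {
  score.fraction <- score / max.score      # exists only inside the function
  result <- round(score.fraction * 100, 1)
  return(result)
}

percent.score(45, max.score = 60)      # 75
percent.score(c(78, 90))               # 78 90 (max.score defaults to 100)

exists("score.fraction")               # FALSE: the internal variable is not in the workspace
```

The last line shows the point about environments: variables created inside the function disappear once it returns.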
couple of tricks you can use.
value for each step so you can see where your function is making an error.
function’s name, then execute your function.
will step through each line of the function.
there is an error it will abort and give you the error message.
likely have to change one of the “chunks” and all the functions that reference the piece are automatically updated.
you know what is going on.
going on.
entire function is completed.
the entire dataset because if there is an error it can be hard to find it if you have to wade through a bunch of extra data.
Follow the Workbook Examples Any questions?
Up next…. Fun with Graphics.
Advanced R
Reset graphics window (Section 6.1)
Initiate a plot
Specify where data for plot will come from
Define the global plot aesthetics:
graphics.off()
ggplot(data=data4, aes(x=Variable1, y=Variable2, fill=colour, …)) +
(Section 6.2)
Use “+” to “join” graphing pieces
Define the “geom” or type of graph you want to plot (Section 6.2):
graphics.off()
ggplot(data=data4, aes(x=Variable1, y=Variable2, fill=colour, …)) +
  geom_point()
As an alternative to global parameters, you can define local plot aesthetics (Section 6.2):
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …)
You can “layer” multiple graphs on top of one another using the “+”
The order that you layer in will result in the final image (Section 6.2)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1)
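For a version of the layering template that runs as written, the sketch below swaps in the built-in mtcars dataset and its wt and mpg columns for the workshop's data4, Variable1 and Variable2. It assumes the ggplot2 package is installed.

```r
library(ggplot2)   # assumes the ggplot2 package is installed

# mtcars ships with R and stands in here for the workshop data
p <- ggplot(data = mtcars) +
  geom_point(aes(x = wt, y = mpg), colour = "red", size = 3) +               # first layer, drawn at the bottom
  geom_line(aes(x = wt, y = mpg, group = 1), colour = "blue", linetype = 1)  # second layer, drawn on top

p   # printing the object draws the plot
```

Swapping the order of the two geoms would draw the points over the line instead.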
You can customize the x and y axes with the scale functions (Section 6.6)
The type of scale you use depends on the type of data you are plotting:
You can scale the axes by transformations (square root, logarithmic, etc.)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue))
Add a title and/or subtitle to the plot (Section 6.7)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  ggtitle(label="Plot Title", subtitle="Plot Year")
Add text to the plot (either data labels specified in the global aes or an individual text string)
Able to specify font family, and nudge labels to avoid overlaps with clustered points (Section 6.7)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  ggtitle(label="Plot Title", subtitle="Plot Year") +
  geom_text(aes(x=value, y=value), label="Text", …)
Add shapes to highlight areas within your plot (Section 6.8)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  ggtitle(label="Plot Title", subtitle="Plot Year") +
  geom_text(aes(x=value, y=value), label="Text", …) +
  geom_rect(xmin=value, xmax=value, …) +
  stat_ellipse()
Add vertical, horizontal or diagonal reference lines to your plot (Section 6.9)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  ggtitle(label="Plot Title", subtitle="Plot Year") +
  geom_text(aes(x=value, y=value), label="Text", …) +
  geom_rect(xmin=value, xmax=value, …) +
  stat_ellipse() +
  geom_abline(slope=value, intercept=value, linetype=1, …)
Add a legend depending on what type of legend you need: colour, size of points, or shape of points (Section 6.10)
graphics.off()
ggplot(data=data4) +
  geom_point(aes(x=Variable1, y=Variable2), colour="red", size=value, …) +
  geom_line(aes(x=Variable1, y=Variable2, group=1), colour="blue", size=value, linetype=1) +
  scale_x_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  scale_y_continuous(breaks=c(), labels=c("Group1", …), limits=range(minValue, maxValue)) +
  ggtitle(label="Plot Title", subtitle="Plot Year") +
  geom_text(aes(x=value, y=value), label="Text", …) +
  geom_rect(xmin=value, xmax=value, …) +
  stat_ellipse() +
  geom_abline(slope=value, intercept=value, linetype=1, …) +
  scale_colour_manual(values=col.spectrum, breaks=c(), ...) +
  scale_shape_manual(name="Legend Name", values=c(), ...)
built-in functions to summarize data within the plot
Stat Function: Description (Use in geom function)
stat_bin(), stat_count(): Counts the number of observations in bins. Plots the count. (stat="bin")
stat_bin_2d(): Divides plane into rectangles, counts the number of cases in each rectangle then (by default) maps the number of cases to the rectangle. Also known as a heat map. (stat="bin2d")
stat_smooth(): Creates a smooth line. Plots the smoothed line. (stat="smooth")
stat_sum(): Adds data values. Plots result. (stat="sum")
stat_identity(): No summary. Plot data as is. (stat="identity")
ggplot(data6, aes(x=Obs,y=Species)) + geom_point(stat="sum") #Summed data
ggplot(data6, aes(x=Obs,y=Species)) + stat_sum(geom="point") #Summed data
Transparency, and Colour within geom and aes() functions
multiple panels to show corrected relationships between variables.
more variables and plot the subsets of data together. The function works in rows and columns where individual plots are placed in “cells”.
Evenly scaled Scaled based on data range
Allows you to control:
size, colour, etc.)
marks, labels, etc.)
angle of text, etc.)
position, spacing between elements, position, etc.)
Use the modification on the JapserMigration dataset (Section 6.2, data6) and the code in Section 6.12 to build your own theme and create the plots above.
Follow the Workbook Examples Any questions?
If you have any further R questions, please feel free to contact me
Flow Cytometry Core Facility
LKG Consulting
Email: consulting.lkg@gmail.com Website: www.consultinglkg.com