plyr: split-apply-combine for mortals (Sean Anderson)

Dec 04, 2023

plyr: split-apply-combine for mortals
Sean Anderson
sean_anderson@sfu.ca

why?
1. it’s everywhere
2. less code, simple syntax
3. it runs faster

look familiar?
> d
  year count
1 2000    16
2 2000     4
3 2000    12
4 2001    15
5 2001     7
6 2001    12
7 2002    20
...

why apply > for loop?
less code
subsetting
saving results
faster

> d
  year count
1 2000    16
2 2000     4
3 2000    12
4 2001    15
5 2001     7
6 2001    12
7 2002    20
...

  year     mean
1 2000 10.66667
2 2001 11.33333
3 2002 13.66667
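To follow along, the example data frame can be rebuilt as sketched below. The printout above elides rows 8 and 9 with `...`; the values used here (15 and 6) are taken from the `transform` output shown later in the deck, so treat the reconstruction as an assumption rather than the author's original script.

```r
# Rebuild the example data frame `d` used throughout the slides.
# Rows 8 and 9 are assumed from the transform output later in the deck.
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)
print(d)
```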
d.split <- split(d, d$year)
results <- vector("list", length = length(d.split))
for(i in 1:length(d.split)) {
  temp <- d.split[[i]]
  temp.mean <- mean(temp$count)
  results[[i]] <- data.frame(
    year = unique(temp$year),
    mean = temp.mean)
}
do.call("rbind", results)

inspired by Hadley Wickham: http://had.co.nz/plyr/

apply(array, 1 or 2, func)
sapply(vector, func)
lapply(list, func)
tapply(vector, index, func)
aggregate(object, by, func)
...

d.split <- split(d, d$year)
result <- lapply(d.split, function(x) mean(x$count))
result <- unlist(result)
result <- data.frame(year = unique(d$year), mean = result)
row.names(result) <- NULL
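For comparison, two of the base R functions listed above can produce the same per-year means in one call each. This is a minimal sketch; the data frame is the reconstructed `d` (an assumption for the elided rows), and the variable names `by_year_tapply` and `by_year_agg` are made up for illustration.

```r
# Reconstructed example data (rows 8 and 9 are assumed, see the deck's later output)
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)

# tapply: returns a named vector, one mean per year
by_year_tapply <- tapply(d$count, d$year, mean)
print(by_year_tapply)

# aggregate (formula interface): returns a data frame with columns year and count,
# where count now holds the per-year mean
by_year_agg <- aggregate(count ~ year, data = d, FUN = mean)
print(by_year_agg)
```

Note that each function returns a different shape (named vector versus data frame), which is part of what makes the base apply family feel inconsistent.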
enter plyr

ddply(d, "year", summarize, mean = mean(count))

d.split <- split(d, d$year)
results <- vector("list", length = length(d.split))
for(i in 1:length(d.split)) {
  temp <- d.split[[i]]
  temp.mean <- mean(temp$count)
  results[[i]] <- data.frame(
    year = unique(temp$year),
    mean = temp.mean)
}
do.call("rbind", results)

ddply(): the first letter names the input type, the second letter names the output type

d - data frame
l - list
a - array
_ - discard
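The naming scheme can be seen by running the same split on the same data with different output letters. A rough sketch, assuming the plyr package is installed and using the reconstructed `d`; the result names (`means_df`, `means_list`, `means_arr`) are illustrative only.

```r
library(plyr)

# Reconstructed example data (rows 8 and 9 are assumed)
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)

# d -> d: data frame in, data frame out
means_df <- ddply(d, "year", summarise, mean.count = mean(count))

# d -> l: data frame in, list out (one element per year)
means_list <- dlply(d, "year", function(x) mean(x$count))

# d -> a: data frame in, array out (here a 1-d array named by year)
means_arr <- daply(d, "year", function(x) mean(x$count))

# d -> _: data frame in, output discarded; useful for side effects such as printing
d_ply(d, "year", function(x) cat(x$year[1], nrow(x), "\n"))
```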
ddply(data, "split", function)
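The function argument in the general form above does not have to be `summarise` or `transform`. As a hedged sketch, any function that takes one piece of the data frame and returns a data frame should work; the column names `n`, `min.count`, and `max.count` below are invented for illustration, and `d` is the reconstructed example data.

```r
library(plyr)

# Reconstructed example data (rows 8 and 9 are assumed)
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)

# The anonymous function receives the rows for one year as `x`
# and returns a one-row data frame; ddply binds the rows and adds `year`.
range_by_year <- ddply(d, "year", function(x) {
  data.frame(
    n         = nrow(x),
    min.count = min(x$count),
    max.count = max(x$count)
  )
})
print(range_by_year)
```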
ddply(d, "year", summarise, mean.count = mean(count))

  year mean.count
1 2000   10.66667
2 2001   11.33333
3 2002   13.66667

ddply(d, "year", transform, total.count = sum(count))

  year count total.count
1 2000    16          32
2 2000     4          32
3 2000    12          32
4 2001    15          34
5 2001     7          34
6 2001    12          34
7 2002    20          41
8 2002    15          41
9 2002     6          41
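Because `transform` keeps every original row, it suits per-group calculations that need both the row value and a group summary. The sketch below computes each row's share of its year's total; the column name `prop` is an assumption made for illustration, and `d` is the reconstructed example data.

```r
library(plyr)

# Reconstructed example data, matching the nine rows printed above
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)

# Within each year, divide every count by that year's total
with_prop <- ddply(d, "year", transform, prop = count / sum(count))
print(with_prop)
```

The result still has nine rows, whereas the `summarise` call collapses each year to a single row.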
ddply(d, "year", function(x) {
  browser()
})
Browse[1]> x
  year count
1 2000    16
2 2000     4
3 2000    12
Browse[1]> Q
>

library(doMC)
registerDoMC(2) # 2 cores
ddply(d, "year", f, .parallel = TRUE)

# fail gracefully:
failwith(default, f)
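A possible use of `failwith()` is sketched below: one group deliberately raises an error, and the wrapped function returns a default value instead of aborting the whole `ddply()` call. The functions `risky_mean` and `safe_mean` and the simulated failure for 2001 are hypothetical, introduced only for this example; `d` is the reconstructed example data.

```r
library(plyr)

# Reconstructed example data (rows 8 and 9 are assumed)
d <- data.frame(
  year  = rep(2000:2002, each = 3),
  count = c(16, 4, 12, 15, 7, 12, 20, 15, 6)
)

# Hypothetical function that fails for one group
risky_mean <- function(x) {
  if (x$year[1] == 2001) stop("simulated failure")
  data.frame(mean.count = mean(x$count))
}

# Wrap it: on error, return a one-row data frame holding NA instead of stopping
safe_mean <- failwith(data.frame(mean.count = NA_real_), risky_mean, quiet = TRUE)

res <- ddply(d, "year", safe_mean)
print(res)
```

Returning a default with the same columns as a successful result keeps the combined output rectangular, so failed groups show up as `NA` rows.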
remember
1. it’s everywhere
2. less code, simple syntax
3. it runs faster (sometimes)

use it.
