
Experimental Evaluation in Computer Science: A Quantitative Study

Paul Lukowicz, Ernst A. Heinz, Lutz Prechelt and Walter F. Tichy

Journal of Systems and Software, January 1995

Outline

  • Motivation
  • Related Work
  • Methodology
  • Observations
  • Accuracy
  • Conclusions
  • Future work!

Introduction

  • Large part of CS research proposes new designs

– systems, algorithms, models

  • Objective study needs experiments
  • Hypothesis

– Experimental study often neglected in CS

  • If accepted, CS is inferior to natural sciences, engineering, and applied math

  • Paper ‘scientifically’ tests hypothesis

Related Work

  • 1979 surveys say experiments lacking

– 1994 surveys say experimental CS underfunded

  • 1980, Denning defines experimental CS

– “Measuring an apparatus in order to test a hypothesis”

– “If we do not live up to traditional science standards, no one will take us seriously”

  • Articles on role of experiments in various CS disciplines

  • 1990: experimental CS seen as growing, but by 1994

– “Falls short of science on all levels”

  • No systematic attempt to assess research

Methodology

  • Select Papers
  • Classify
  • Results
  • Analysis
  • Dissemination (this paper)

Select CS Papers

  • Sample broad set of CS publications (200 papers)

– ACM Transactions on Computer Systems (TOCS), volumes 9-11
– ACM Transactions on Programming Languages and Systems (TOPLAS), volumes 14-15
– IEEE Transactions on Software Engineering (TSE), volume 19
– Proceedings of 1993 Conference on Programming Language Design and Implementation (PLDI)

  • Random sample (50 papers; sampling sketch below)

– 74 titles by ACM via INSPEC (24 discarded)

+ 30 refereed
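
As a rough illustration of this selection step (not the authors' actual procedure; the title pool and the qualification test here are hypothetical), drawing random titles and discarding those that do not qualify might look like:

```python
import random

def sample_papers(titles, qualifies, drawn=74, target=50, seed=1993):
    """Draw a random sample of qualifying papers.

    titles:    candidate titles (e.g., an INSPEC query result)
    qualifies: predicate rejecting out-of-scope titles (hypothetical)
    drawn:     number of titles pulled at random (the study drew 74)
    target:    desired sample size (the study kept 50)
    """
    rng = random.Random(seed)
    candidates = rng.sample(titles, k=min(drawn, len(titles)))
    kept = [t for t in candidates if qualifies(t)]  # 24 of 74 were discarded
    return kept[:target]
```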


Select Comparison Papers

  • Neural Computing (72 papers)

– Neural Computation, volume 5
– Interdisciplinary: bio, CS, math, medicine …
– Neural networks, neural modeling …
– Young field (1990) and CS overlap

  • Optical Engineering (75 papers)

– Optical Engineering, volume 33, nos. 1 and 3
– Applied optics, opto-mech, image proc.
– Contributors from: EE, astronomy, optics …
– Applied, like CS, but longer history

Classify

  • Same person read most papers
  • Two readers read all, except NC

Major Categories

  • Formal Theory

– Formally tractable: theorems and proofs

  • Design and Modeling

– Systems, techniques, models
– Cannot be formally proven → require experiments

  • Empirical Work

– Analyze performance of known objects

  • Hypothesis Testing

– Describe hypotheses and test

  • Other

– Ex: surveys
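
For bookkeeping, the five-way taxonomy above could be represented as a simple enumeration (a sketch; only the names come from the slide):

```python
from enum import Enum

class Category(Enum):
    FORMAL_THEORY = "formal theory"          # formally tractable: theorems and proofs
    DESIGN_AND_MODELING = "design/modeling"  # systems, techniques, models
    EMPIRICAL_WORK = "empirical"             # performance of known objects
    HYPOTHESIS_TESTING = "hypothesis"        # stated hypotheses, then tests
    OTHER = "other"                          # e.g., surveys
```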

Subclasses of Design and Modeling

  • Amount of physical space for experiments

– Setups, Results, Analysis

  • 0-10%, 11-20%, 21-50%, 51%+
  • Too shallow? Assumptions:

– Amount of space proportional to importance by authors and reviewers
– Amount of space correlated to importance to research

  • Also concerned with those that had no experimental evaluation at all (see the sketch below)
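
A minimal sketch of this space-based subclassification (the bucket boundaries come from the slide; the page counts are whatever a reader tallies for setups, results, and analysis):

```python
def space_subclass(experiment_pages, total_pages):
    """Bucket a design/modeling paper by the share of space devoted
    to experimental setups, results, and analysis."""
    share = 100.0 * experiment_pages / total_pages
    if share == 0:
        return "no evaluation"
    if share <= 10:
        return "0-10%"
    if share <= 20:
        return "11-20%"
    if share <= 50:
        return "21-50%"
    return "51%+"

print(space_subclass(2, 12))  # ~17% of the paper -> "11-20%"
```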

Assessing Experimental Evaluation

  • Look for execution of apparatus, techniques, or methods; models validated
  • Tables, graphs, section headings…
  • No assessment of quality
  • But count only ‘true’ experimental work (sketch below)

– Repeatable
– Objective (ex: benchmark)

  • No demonstrations, no examples
  • Some simulations

– Supplies data for other experiments
– Trace driven
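
A sketch of this counting rule (the attribute names are invented for illustration): an evaluation counts only if it is repeatable and objective; demonstrations and one-off examples never count, and simulations count only in the cases listed above:

```python
def counts_as_experiment(ev):
    """Apply the counting rule to one reported evaluation (dict of booleans)."""
    if ev.get("is_demo") or ev.get("is_example"):
        return False  # demonstrations and examples never count
    if ev.get("is_simulation"):
        # simulations count only when they supply data for other
        # experiments or are trace driven
        if not (ev.get("supplies_data") or ev.get("trace_driven")):
            return False
    return bool(ev.get("repeatable") and ev.get("objective"))
```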

Outline

  • Motivation
  • Related Work
  • Methodology
  • Observations
  • Accuracy
  • Conclusions
  • Future work!

Observation of Major Categories

  • Majority is design and modeling
  • The CS samples have a lower percentage of empirical work than OE and NC

  • Hypothesis testing is rare (4 articles out of 403!)

Observation of Major Categories

  • Combine hypothesis testing with empirical work

Observation of Design Subclasses

  • Higher percentage with no evaluation for CS vs. NC+OE (43% vs. 14%)

Observation of Design Subclasses

  • Many more NC+OE papers with 20%+ than in CS
  • Software engineering (TSE and TOPLAS) worse than the random sample

Observation of Design Subclasses

  • Shows percentage that devote 20% or more of their space to experimental evaluation

Groupwork: How Experimental is WPI CS?

  • Take 2 papers: KDDRG, PEDS, SERG, DSRG, AIDG, GTRG

  • Read abstract, flip through
  • Categorize:

– Formal Theory
– Design and Modeling

+ Count pages for experiments

– Empirical
– Hypothesis Testing
– Other

  • Swap with another group

Outline

  • Motivation
  • Related Work
  • Methodology
  • Observations
  • Accuracy
  • Conclusions
  • Future work

Accuracy of Study

  • Deals with humans, so subjective
  • Psychology techniques to get objective measure

– Large number of users → beyond resources (and a lot of work!)

– Provide papers, so others can provide data

  • Systematic errors

– Classification errors
– Paper selection bias

Systematic Error: Classification

  • Classification differences between 468 article classification pairs

Systematic Error: Classification

  • Classification ambiguity

– Large between Theory and Design-0% (26%)
– Design-0% and Other (10%)
– Design-0% with simulations (20%)

  • Counting inaccuracy

– 15% from counting experiment space differently
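
One way to quantify such disagreement (a sketch, not the paper's analysis; the example labels are invented) is the raw agreement rate over paired classifications:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of papers on which two readers assigned the same category."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Illustrative only: categories for the same five papers from two readers.
reader1 = ["design-0%", "theory", "empirical", "design-0%", "other"]
reader2 = ["theory",    "theory", "empirical", "other",     "other"]
print(f"agreement: {agreement_rate(reader1, reader2):.0%}")  # agreement: 60%
```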

Systematic Error: Paper Selection

  • Journals may not be representative of CS

– PLDI proceedings are a ‘case study’ of conferences

  • Random sample may not be “random”

– Influenced by INSPEC database holdings
– Further influenced by library holdings

  • Statistical error if selection within journals does not represent the journals

Overall Accuracy (Maximize Distortion)

[Chart: worst-case bounds for “No Experimental Evaluation” and “20%+ Space for Experiments” under maximum distortion]
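
A sketch of the maximize-distortion idea (using the ambiguity rate from the classification slide; treating it as a symmetric worst-case shift is my reading, not the paper's exact computation):

```python
def worst_case_bounds(observed_pct, ambiguity_pct):
    """Bound a reported percentage by assuming every ambiguous
    classification was resolved against / in favor of the finding."""
    low = max(0.0, observed_pct - ambiguity_pct)
    high = min(100.0, observed_pct + ambiguity_pct)
    return low, high

# E.g., ~40% of CS design articles had no experimental evaluation;
# a 26% Theory/Design-0% ambiguity widens that figure to:
print(worst_case_bounds(40.0, 26.0))  # (14.0, 66.0)
```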


Conclusion

  • 40% of CS design articles lack experiments

– Non-CS around 10%

  • 70% of CS papers have less than 20% space for experiments

– NC and OE around 40%

  • CS conferences no worse than journals!
  • Youth of CS is not to blame
  • Experiment difficulty not to blame

– Harder in physics
– Psychology methods can help

  • Field as a whole neglects importance of experimentation

Guidelines

  • Higher standards for design papers
  • Recognize empirical as first class science
  • Need more publicly available benchmarks
  • Need rules for how to conduct repeatable experiments

  • Tenure committees and funding orgs need to recognize work involved in experimental CS

  • Look in the mirror