SLIDE 1

Assessing Migration Risk for Scientific Formats

Chris Frisz, Sam Waggoner, and Geoffrey Brown

Indiana University Bloomington

Presented 7 December 2011

Chris Frisz, Sam Waggoner, and Geoffrey Brown Indiana University Bloomington Assessing Migration Risk for Scientific Formats

SLIDE 2

Overview

Introduction
  • Motivation
  • Hypothesis
  • Approach

Background
  • Data set used
  • Formats studied
  • Conversion issues encountered

Tools written

Results and discussion

Conclusions

SLIDE 5

Motivation

Many migration tools exist for converting from obsolete to standard data formats. Mismatches in source and target formats introduce risk for migration. Automatic tools often fail silently when converting inconsistent features.

SLIDE 8

Motivation (cont.)

Creating migration tools is hard. Development often requires large programs written over a long time. Migration is easier using existing tools.

SLIDE 11

Hypothesis

Where migration tools already exist, they work well on the majority of data files despite differences in formats. The remaining files can be identified by their use of rarely used, risky features. Data files thus separate into many that are “safe” to migrate and a few that are “risky.”

SLIDE 12

Hypothesis (in visual form)

SLIDE 16

Approach

Wrote simple and fast analysis tools to categorize files by migration risk through deep inspection. Identified 4 scientific formats with migration risks from a data set of U.S. Government documents. Found that the vast majority of files show few to no migration risks. This comes with some caveats.

SLIDE 19

Format Overview

Lotus 1-2-3

A formerly popular spreadsheet program migratable to Excel with some calculation differences.

CDF and netCDF

Array-based data formats with common roots that evolved different data representation and encoding features.

HDF

Hierarchical format for relating data artifacts that underwent significant changes from version 4 to 5.

SLIDE 21

Data Set

Set of 2,747 CD-ROM images from the United States Government Printing Office.

  • Thirty-six (36) images contained 14,022 Lotus 1-2-3, version 1 files.
  • Sixty-eight (68) images contained 61,247 CDF files.
  • Four (4) images contained 3,162 netCDF files.
  • Two (2) images contained 2,213 HDF files.

SLIDE 22

Data Set (cont.)

Lotus 1-2-3 files were published by many different U.S. agencies:

  • CDC
  • Census Bureau
  • Dept. of Education
  • Office of Business and Management

CDF and HDF files came primarily from NASA.

NetCDF files came from University of Maine, Dept. of Climatology.

SLIDE 25

Formats – Lotus 1-2-3

The primary spreadsheet application of the 1980s and early 1990s, later supplanted by Microsoft Excel. Microsoft provided conversion from 1-2-3 to Excel through 2003. Microsoft documented the differences between the formats in knowledge base articles, from which we retrieved them.

SLIDE 26

Formats – Lotus 1-2-3 – Conversion issues

Operations calculated differently:

  • @MOD
  • @VLOOKUP
  • @HLOOKUP

SLIDE 29

Formats – Lotus 1-2-3 – Conversion issues (cont.)

Exponentiation (^) and unary negation (-) differ in order of operations.

  • Exponentiation was evaluated first in Lotus 1-2-3.
  • Negation was evaluated first in Excel.

SLIDE 36

Formats – Lotus 1-2-3 – Example

In Lotus 1-2-3: -4^2 = -(4^2) = -16
In Excel: -4^2 = (-4)^2 = 16

Traditional mathematical order of operations favors Lotus.
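
The two readings of the example can be reproduced with a minimal Python sketch. The function names `lotus_negate_power` and `excel_negate_power` are illustrative only and belong to neither product; each one simply hard-codes the grouping that the slides attribute to that program.

```python
def lotus_negate_power(base, exponent):
    """Lotus 1-2-3 reading: exponentiation binds tighter than unary minus."""
    return -(base ** exponent)


def excel_negate_power(base, exponent):
    """Excel reading: unary minus binds tighter than exponentiation."""
    return (-base) ** exponent


print(lotus_negate_power(4, 2))  # -16
print(excel_negate_power(4, 2))  # 16
```

For odd integer exponents the two groupings agree, so only some of the formulas that mix the two operators would actually change value on conversion.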

SLIDE 39

Formats – Lotus 1-2-3 – Conversion issues (cont.)

Comparison/logical operators (e.g. = or #and#) and string concatenation (&) also differ in order of operations.

  • Comparison and logical operators were evaluated first in Lotus 1-2-3.
  • Concatenation was evaluated first in Excel.

SLIDE 45

Formats – Lotus 1-2-3 – Conversion Issues – Example

In Lotus 1-2-3: “Fo”&“o” = “Foo” → False
In Excel: “Fo”&“o” = “Foo” → True

SLIDE 47

Formats – CDF and netCDF

CDF and netCDF are both file formats utilized for multidimensional data. Often used to represent image, climate, and elevation data.

SLIDE 48

Formats – CDF/netCDF Layout

[Figure: CDF/netCDF layout, shown as a grid with rows for record numbers 1, 2, 3 and columns for rVariable 1, rVariable 2, ..., rVariable n.]

Image courtesy of NASA/Goddard Space Flight Center

SLIDE 54

Formats – CDF/netCDF – Background

CDF was originally developed by NASA. NetCDF was developed later by NCAR, based on CDF. Both formats are still supported.

SLIDE 57

Formats – CDF/netCDF – Background (cont.)

Separate development allowed for evolution of different features. Overall functionality remained similar. Primary conversion path between CDF and netCDF was through NASA’s Data Translation Web Service (DTWS).

SLIDE 63

Formats – CDF – Conversion Issues

Features present in CDF, not in netCDF:

  • Multi-file format for organizing variables into different files.
  • Native-mode encoding for faster data access on particular system architectures.
  • Epoch data type for high-resolution time data.

Multi-file and native-mode differences were identified in CDF documentation. Epoch data type mismatch was discovered through DTWS source code review.

SLIDE 68

Formats – netCDF – Conversion Issues

Features present in netCDF, not in CDF:

  • Descriptive named dimensions usable for data access.
  • Support for up to 32 dimensions per variable (versus CDF’s 10).

Named dimensions mismatch was documented in NASA’s CDF FAQ. Maximum dimension mismatch was discovered through netCDF API code review.
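
The dimension limits quoted above (32 for netCDF, 10 for CDF) suggest a simple per-variable check. The following is a minimal sketch, not the authors' tool: the constant names, the function name, and the plain dictionary of dimension counts are all assumptions standing in for whatever a real netCDF reader would report.

```python
# Per-variable dimension limits as stated on the slide (assumed constant names).
CDF_MAX_DIMS = 10
NETCDF_MAX_DIMS = 32


def variables_exceeding_cdf_limit(dimension_counts):
    """Return the names of variables that have more dimensions than CDF allows.

    dimension_counts: hypothetical dict mapping variable name -> number of
    dimensions, e.g. gathered beforehand from a netCDF file.
    """
    return sorted(
        name
        for name, ndims in dimension_counts.items()
        if ndims > CDF_MAX_DIMS
    )


# Illustrative input only; these variables are made up.
print(variables_exceeding_cdf_limit({"temperature": 4, "tensor": 12}))
```

A file for which this list is empty carries no dimension-count risk; named dimensions would need a separate check.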

SLIDE 70

Formats – HDF

Hierarchical data format for relating and interacting with heterogeneous data sets. Organized similarly to a Unix file system, with Vgroups acting like directories and Vdata like files.

SLIDE 71

Formats – HDF layout

[Figure: HDF layout diagram]

Image courtesy of the HDF Group.

SLIDE 74

Formats – HDF – Background

Developed by the National Center for Supercomputing Applications. Support provided by the HDF Group. Most recent version was HDF5.

SLIDE 77

Formats – HDF – Background (cont.)

Previous versions were backwards compatible. HDF5 drastically changed the data model and broke backwards compatibility. The HDF Group provided both a conversion API and an automatic tool.

SLIDE 80

Formats – HDF – Conversion Issues

Merging Vgroups with elements sharing the same name resulted in renaming of one element.

This was only relevant for manual conversion.

Data objects shared between Vgroups were copied on conversion.

SLIDE 84

Formats – HDF – Conversion Issues – Example

SLIDE 86

Formats – HDF – Conversion Issues

Merging Vgroups with elements sharing the same name resulted in renaming of one element.

This was only relevant for manual conversion.

Data objects shared between Vgroups were copied on conversion.

Unnamed data objects were given default names.

The HDF Group documented all of these issues for the HDF4-to-HDF5 conversion API and automated tool.
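
The three issues listed above can each be phrased as a check over a file's Vgroup membership. The sketch below is only an illustration under stated assumptions: the input layout (a dict of Vgroup name to `(object_id, object_name)` pairs) and both function names are hypothetical and do not come from the HDF library or from the authors' tool.

```python
from collections import Counter


def hdf4_conversion_flags(vgroups):
    """Find objects shared between Vgroups and objects without a name.

    vgroups: hypothetical dict mapping Vgroup name -> list of
    (object_id, object_name) pairs; an empty name means "unnamed".
    """
    membership = Counter()
    names_by_object = {}
    for members in vgroups.values():
        for object_id, name in members:
            names_by_object[object_id] = name
        # Count each object at most once per Vgroup.
        membership.update({object_id for object_id, _ in members})
    shared = sorted(obj for obj, count in membership.items() if count > 1)
    unnamed = sorted(obj for obj, name in names_by_object.items() if not name)
    return {"shared_objects": shared, "unnamed_objects": unnamed}


def merge_name_collisions(members_a, members_b):
    """Names used by different objects in two Vgroups that are to be merged."""
    names_a = {name: obj for obj, name in members_a if name}
    return sorted(
        name
        for obj, name in members_b
        if name in names_a and names_a[name] != obj
    )


# Made-up example: object 1 is shared, object 2 is unnamed, and objects 1 and 3
# would collide on the name "temp" if Vgroups A and B were merged.
example = {"A": [(1, "temp"), (2, "")], "B": [(1, "temp"), (3, "temp")]}
print(hdf4_conversion_flags(example))
print(merge_name_collisions(example["A"], example["B"]))
```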

SLIDE 89

Tools – Lotus 1-2-3

We wrote a C program to traverse 1-2-3 files and parse formulas. It identified the presence of @MOD, @VLOOKUP, or @HLOOKUP in formulas. The program also conservatively reported any formula containing both exponentiation and negation, or both logical/comparison operators and string concatenation.
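
The conservative check described above can be sketched in a few lines. This is not the authors' C program: it assumes each formula has already been decoded to text (the real tool traversed 1-2-3 files directly), and the names `RISKY_FUNCTIONS` and `formula_risks` are invented for illustration.

```python
import re

# Functions the slides list as calculated differently in Excel.
RISKY_FUNCTIONS = ("@MOD", "@VLOOKUP", "@HLOOKUP")
# Comparison operators and the 1-2-3 logical operators #AND#, #OR#, #NOT#.
COMPARISON_OR_LOGICAL = re.compile(
    r"<>|<=|>=|[=<>]|#AND#|#OR#|#NOT#", re.IGNORECASE
)
STRING_LITERAL = re.compile(r'"[^"]*"')


def formula_risks(formula):
    """Conservatively flag a 1-2-3 formula given as text."""
    # Blank out string literals so operator characters inside them are ignored.
    code = STRING_LITERAL.sub('""', formula)
    upper = code.upper()
    risks = [name for name in RISKY_FUNCTIONS if name + "(" in upper]
    # Conservative: any "-" counts, even a binary minus, so false positives
    # are possible but a real mismatch is not missed.
    if "^" in code and "-" in code:
        risks.append("exponentiation with negation")
    if "&" in code and COMPARISON_OR_LOGICAL.search(code):
        risks.append("comparison/logical with concatenation")
    return risks


print(formula_risks("-A1^2"))
print(formula_risks("@MOD(A1,3)+@SUM(B1..B9)"))
```

Reporting co-occurrence rather than proving a conflict keeps the scan simple and fast, at the cost of flagging some formulas that a closer look shows to be harmless.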

SLIDE 91

Tools – Lotus 1-2-3 (cont.)

Tool consisted of approximately 500 lines. Processed our entire data set in less than 15 minutes.

SLIDE 96

Tools – CDF and netCDF

We wrote a C program for each of CDF and netCDF. The CDF program consisted of 300 lines using the version 3.3.0 API from NASA. The netCDF program was 150 lines using the version 4.1.3 API from Unidata. The CDF tool processed the entire 61,000-file data set in 55 minutes. The netCDF tool exhibited similar performance.

SLIDE 100

Tools – HDF

Yet again, we wrote a C program. Written in 900 lines using the 4.2.6 API from the HDF Group. This tool was longer because of the large number of interfaces. Processed all HDF files in our data set within 1.5 minutes.

SLIDE 102

Results – Lotus 1-2-3

We ran our analysis tool on 14,022 version 1 files. It detected a single file containing 7 formulas with possible order-of-operations mismatches between 1-2-3 and Excel.

slide-105
SLIDE 105

Results – Lotus 1-2-3 (cont.)

Example formula from the file: @IF($EJ$85="NA", +" "&$EJ$85,+$EJ$85) The other six also followed this form. Logical comparison and string concatenation appeared in the same formula, but in separate arguments of @IF, so they would not conflict if converted to Excel.
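The manual verification described above can be sketched as a small check. This is only an illustration, assuming the risk arises when a comparison operator and the concatenation operator (&) meet in one unparenthesized expression, where the two programs may rank them differently. The function names are hypothetical and this is not the authors' analysis tool.

```python
def strip_strings(formula):
    """Drop double-quoted literals so operators inside them are ignored."""
    out, in_string = [], False
    for ch in formula:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            out.append(ch)
    return "".join(out)


def mentions_both(formula):
    """Coarse check: a comparison and & appear anywhere in the formula."""
    text = strip_strings(formula)
    return "&" in text and any(op in text for op in "<>=")


def share_one_expression(formula):
    """Finer check: a comparison and & meet at the same nesting level
    of the same function argument."""
    stack = [set()]  # operator kinds seen at each open parenthesis level
    found = False
    for ch in strip_strings(formula):
        if ch == "(":
            stack.append(set())
        elif ch in "),":
            found = found or {"cmp", "cat"} <= stack[-1]
            if ch == ")" and len(stack) > 1:
                stack.pop()          # leave the parenthesized group
            else:
                stack[-1] = set()    # a comma starts a fresh argument
        elif ch == "&":
            stack[-1].add("cat")
        elif ch in "<>=":
            stack[-1].add("cmp")
    return found or {"cmp", "cat"} <= stack[-1]


example = '@IF($EJ$85="NA", +" "&$EJ$85,+$EJ$85)'
print(mentions_both(example))         # coarse check flags the formula
print(share_one_expression(example))  # finer check finds no shared expression
```

The coarse check corresponds to flagging a formula for review, and the finer check corresponds to the by-hand conclusion that the two operators never compete for precedence in this formula.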

slide-108
SLIDE 108

Discussion – Lotus 1-2-3

The vast majority of files can be converted conventionally without risk. Only a few files may require a more robust conversion process or by-hand translation.

All 14,022 files in our data set could have been converted without risk after manually verifying a single file.

slide-113
SLIDE 113

Results – CDF

Our tool ran on 61,247 CDF version 2 files. 14,574 files (23.8%) had no potential conversion risk to netCDF. 46,669 files (76.2%) used the Epoch data type. 4 files used the multi-file format. No files used native encoding.
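As a quick cross-check of the figures on this slide, the following sketch recomputes the percentages from the reported counts. The variable names are ours; only the four counts come from the slide.

```python
# Counts reported on the slide for the CDF version 2 files.
total_files = 61_247
no_risk = 14_574
uses_epoch = 46_669
multi_file = 4


def percent(count, total):
    """Share of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# The three counts add up to the total, which is consistent with
# the categories being disjoint.
categories_sum = no_risk + uses_epoch + multi_file

print(categories_sum == total_files)
print(percent(no_risk, total_files), percent(uses_epoch, total_files))
```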

slide-118
SLIDE 118

Discussion – CDF

Use of the Epoch data type was prevalent (76.2%). The CDF API included functions to convert Epochs to strings.

The DTWS tool used this method during conversion. Tools for converting date string formats are widely available (e.g. on Unix).

The multi-file format was handled by the DTWS tools, despite its rare appearance.
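To illustrate why the Epoch type poses little risk, here is a minimal sketch of an Epoch-to-string conversion. It assumes the classic CDF_EPOCH representation (milliseconds since 0000-01-01 00:00:00, with no leap seconds) and treats the result as UTC. It does not call the CDF library, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# CDF_EPOCH value of 1970-01-01T00:00:00: 719,528 days after
# 0000-01-01, expressed in milliseconds.
CDF_EPOCH_AT_UNIX_ZERO_MS = 719_528 * 86_400 * 1000.0


def cdf_epoch_to_iso(epoch_ms):
    """Render a CDF_EPOCH value as an ISO 8601 string with milliseconds."""
    unix_ms = epoch_ms - CDF_EPOCH_AT_UNIX_ZERO_MS
    moment = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=unix_ms)
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


print(cdf_epoch_to_iso(CDF_EPOCH_AT_UNIX_ZERO_MS))
```

Once the value is a standard date string, any common date utility can reformat it for the target format.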

slide-119
SLIDE 119

Discussion – CDF (cont.)

The results indicated a minimal migration risk for converting CDF to netCDF, which supported our hypothesis.

slide-124
SLIDE 124

Results – netCDF

We ran our tool on 3,162 netCDF files. All files included named dimensions.

We expected this result.

No files included variables with more than 10 dimensions, the maximum that CDF allows.

This indicated that variables with more than 10 dimensions were rare.

slide-127
SLIDE 127

Discussion – netCDF

Dimension names (present in all netCDF datasets) were not saved in conversion. This represented actual metadata loss. Though raw data was preserved in conversion, this conflicted with our hypothesis.

slide-129
SLIDE 129

Discussion – netCDF (cont.)

One possible solution was to save names in a separate metadata file. We were not aware of an existing tool to do this.
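One way such a separate metadata file might look is sketched below. This is a hypothetical illustration, not an existing tool: the JSON layout, the function names, and the example variables are all assumptions, and in practice the dimension names would be read from each file through the netCDF API before conversion.

```python
import json


def dimension_sidecar(variables):
    """Serialize {variable name: [dimension names]} as JSON text."""
    return json.dumps({"dimension_names": variables}, indent=2, sort_keys=True)


def read_sidecar(text):
    """Recover the {variable name: [dimension names]} mapping."""
    return json.loads(text)["dimension_names"]


# Hypothetical variables; real names would come from the source file.
example = {
    "temperature": ["time", "lat", "lon"],
    "pressure": ["time", "level"],
}

sidecar_text = dimension_sidecar(example)
print(sidecar_text)
```

Keeping the sidecar next to the converted CDF file would let a later reader restore the names that the conversion dropped.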

slide-134
SLIDE 134

Results – HDF

Our tool ran on 352 HDF3 and 1,861 HDF4 files (2,213 total). 324 files (14.6%) had no conversion risks. 1,891 files (85.4%) had multiple Vgroups containing objects with the same name. 1,889 files (85.4%) had data objects shared between Vgroups. No files contained unnamed data objects.

slide-136
SLIDE 136

Discussion – HDF

Duplicate Vdata object names were irrelevant for automatic conversion. Copying shared objects during conversion broke the data relationships present in the source files.
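The effect of copying shared objects can be shown with a toy model. This sketch uses plain Python dictionaries as a stand-in for Vgroups and data objects. It does not use the HDF API, and the object and group names are hypothetical.

```python
import copy

# Toy stand-in for an HDF4 file: two groups refer to one shared object.
shared = {"name": "calibration", "values": [1.0, 2.0, 3.0]}
source = {"group_a": [shared], "group_b": [shared]}


def convert_by_copying(groups):
    """Give every group its own copy of each member object."""
    return {name: [copy.deepcopy(obj) for obj in members]
            for name, members in groups.items()}


converted = convert_by_copying(source)

# Right after conversion the two copies hold identical data ...
same_data = converted["group_a"][0] == converted["group_b"][0]
# ... but they are no longer one object, so the relationship is gone.
still_shared = converted["group_a"][0] is converted["group_b"][0]

# An update made through one group is now invisible to the other.
converted["group_a"][0]["values"].append(4.0)

print(same_data, still_shared)
```

For a read-only archive the identical copies are enough, which is why the issue only matters when the converted files are edited afterwards.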

slide-138
SLIDE 138

Discussion – HDF (cont.)

Issues would not manifest when converting for purely archival reasons. This overall supported our hypothesis with a caveat.

slide-142
SLIDE 142

Conclusions

Existing conversion tools could safely convert the vast majority of files in general. Caveats:

NetCDF-to-CDF conversion loses metadata and requires a separate solution. HDF4-to-HDF5 conversion breaks data relationships and is only completely safe for archival purposes.

slide-145
SLIDE 145

Conclusions (cont.)

The results for our data set overall supported our hypothesis. Our findings supported the use of simple and fast tools for migration risk analysis. Open formats (e.g. CDF, netCDF, HDF) are easier to analyze than proprietary ones (e.g. Lotus 1-2-3).

slide-146
SLIDE 146

Acknowledgements

The authors would like to gratefully acknowledge the support of the Data to Insight Center, a partnership of the School of Informatics and Computing, Digital Libraries and Pervasive Technology Institute at Indiana University. This research was funded in part by a grant provided by the Lilly Endowment Inc.

slide-147
SLIDE 147

Time for questions and comments
