Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment (PowerPoint Presentation)


SLIDE 1

Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment

Klara Nahrstedt, University of Illinois at Urbana-Champaign. Joint work with Phuong Nguyen, Steve Konstanty, Todd Nicholson, Roy Campbell, Indy Gupta, Tim Spila, Michael Chan, Kenton McHenry, Tommy O’Brien, Aaron Schwartz-Duval

Project funded by NSF ACI DIBBs grant.

SLIDE 2

Outline

  • Motivation
  • Problem Description and Challenges
  • 4CeeD Approach
  • Lessons Learned So Far
  • Best Practices So Far
SLIDE 3

Motivation

  • Consideration of National Academy studies -> 20 years from discovery of new materials to fabrication of next-generation devices
  • Need for REAL-TIME and TRUSTED Capture, Curation, Correlation, and Coordination of materials-to-devices digital data before full archiving and publishing

[Image: integrated circuit, source: http://www.build-electronic-circuits.com/integrated-circuit/]

SLIDE 4

Current State of Data Collection at Microscopes

  • Current situation for experimental data involves manual processes for data capture and storage, leading to poor documentation of results
  • Data transfer is often done via “sneaker-net” techniques using flash drives or email
  • “Best” results and images are kept, but what is “best” is determined by a narrow, specific scientific objective. “Imperfect” data is often discarded or not available for others to review.

SLIDE 5

Effects of Current State

  • Measurements on multiple instruments for a new material may not be well correlated due to a lack of mechanisms to encode the linkages between measurements.
  • Novel device prototypes can be difficult to reproduce due to a lack of proper capture of the “recipes” used.
  • In addition, previous experiments in the deposition systems may affect subsequent experiments.
  • Curation of system information can greatly improve the reproducibility and understanding of results.

SLIDE 6

Steps towards Problem Solution

  • Determine Physical Environments for Data Collection
  • Understand Physical and Digital Processes that are going on during materials and semiconductor fabrication research
  • Determine Instruments for Investigation
  • Determine Cyber-and-Data Infrastructure for Real-Time and Trusted Data Collection from Instruments
  • Design and Develop Distributed Data Collection Tool
  • Identify Test Users, Test Tool and Extract Feedback
  • Feed Feedback to Distributed Data Collection Tool
SLIDE 7

Materials and Semiconductor Fabrication Cyber-Physical Environments

Micro-Nano Technology Laboratory

  • Growth and characterization of photonics, microelectronics, nanotechnology and biotechnology

Materials Research Laboratory

  • Research in condensed matter physics, materials chemistry and materials science
  • Facilities for nanostructural and nanochemical analysis

SLIDE 8

Microscope Data: Development Process (Example)

[Process flow diagram: fabrication steps (SiO2 Mask Deposition, Lithography, SiNx Deposition, Plasma Etching, SiNx Removal, Oxidation, Diffusion, Metallization, Device Characterization), each followed by characterization steps such as Profilometry, Ellipsometry, SIMS, SEM, Optical microscopy, and SPA]

SLIDE 9

Collected Data from Microscope (Oxidation Step)

An example of the result from an experiment at MNTL

SLIDE 10

Challenges for Real-Time and Trusted Data Collection

  • Understanding user requirements for data curation
  • Development of policies for protecting data during a research project and making data available after the project is completed
  • Creating a system that is able to handle many different types of work processes
  • Ability to read and display images and data from many different sources, many of which are proprietary
  • Networking challenges for collecting data from networked microscopes

SLIDE 11

Our Approach 4CeeD: Timely and Trusted Capture, Curation, Correlation, Coordination and Distribution

SLIDE 12

4CeeD Approach: Cyber-infrastructure

[Architecture diagram: Clients → Curators (Edge Computing) → Coordinator (Cloud)]

SLIDE 13

[Deployment diagram: Users at MNTL and MRL upload DM3 files, images, metadata, and text to Curators running on cloudlets; users view, edit, and share data via a Webapp and perform bulk data transfer via an API; the cloud-based Coordinator processes, coordinates, and correlates data from multiple sources.]

SLIDE 14

4CeeD Curator

SLIDE 15

4CeeD Curator Goals at Microscope

  • Enable researchers to have a Digital Logbook System
  • Data is organized by researcher and by sample name
  • Recipes are collected and related to the deposition equipment used
  • Analytical data is collected as it is created and contains the metadata needed to reproduce measurements

SLIDE 16

4CeeD Curator – Input Data Collection

1. Create or select a Collection → 2. Create or select a Dataset → 3. Upload files → 4. Optional: choose a template and enter metadata
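The four-step flow above can be sketched as a small client plan against a Clowder-style REST API. The base URL, endpoint paths, and field names below are illustrative assumptions, not the documented 4CeeD or Clowder API.

```python
# Hypothetical sketch of the Curator upload flow (collection -> dataset ->
# files -> optional metadata); endpoints and fields are assumptions.
BASE = "https://curator.example.edu/api"  # hypothetical server

def build_upload_plan(collection, dataset, files, metadata=None):
    """Return the ordered (method, url, payload) requests the flow implies."""
    plan = [
        ("POST", f"{BASE}/collections", {"name": collection}),     # step 1
        ("POST", f"{BASE}/datasets",
         {"name": dataset, "collection": collection}),             # step 2
    ]
    for path in files:                                             # step 3
        plan.append(("POST", f"{BASE}/uploadToDataset/{dataset}", {"file": path}))
    if metadata:                                                   # step 4 (optional)
        plan.append(("POST", f"{BASE}/datasets/{dataset}/metadata", metadata))
    return plan

plan = build_upload_plan("Oxidation", "sample-42", ["sem_image.dm3"],
                         {"template": "SEM", "voltage_kV": 10})
```

Each entry in the plan could then be issued with an ordinary HTTP client; separating plan construction from transport keeps the flow easy to inspect and test.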

SLIDE 17

4CeeD Curator (Modified Clowder)

Architecture

[Architecture diagram — Client side: Web Browser, Custom Clients, External Software. Server side (Clowder): Load balancer (nginx) in front of replicated Webapp instances (Scala/Play); Event Bus (RabbitMQ); Extractors (Java, Python); Text Search (Elasticsearch); Multimedia Search (Versus); Data/Metadata store (MongoDB).]
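In this architecture, extractors consume file events from the RabbitMQ event bus. A minimal sketch in that style follows; the queue name and message fields are assumptions, and real Clowder extractors typically use the pyclowder helper library rather than raw pika.

```python
import json

# Hypothetical metadata extractor: the queue name, event fields, and the
# DM3 -> TEM rule are illustrative assumptions, not the Clowder protocol.

def extract_metadata(event):
    """Pure handler: turn a file-created event into a metadata record."""
    name = event["filename"]
    record = {"file_id": event["id"], "extension": name.rsplit(".", 1)[-1]}
    if record["extension"] == "dm3":
        record["instrument"] = "TEM"  # DM3 files come from electron microscopes
    return record

def on_message(ch, method, properties, body):
    record = extract_metadata(json.loads(body))
    # In a real extractor this record would be POSTed back to the Webapp API.
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Wiring (requires a running broker):
#   import pika
#   conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   ch = conn.channel()
#   ch.basic_consume(queue="curator.file.created", on_message_callback=on_message)
#   ch.start_consuming()
```

Keeping the handler pure (no broker access inside `extract_metadata`) is what lets different extractors be swapped behind the same event bus.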

SLIDE 18

4CeeD Coordinator

SLIDE 19

Data Infrastructure’s Challenges (1)

  • Heterogeneity of the types of jobs and input data

[Workflow diagrams — TEM data processing workflow: DM3 parsing → Extract metadata → Analyze image → Index; SEM data processing workflow: Extract structured information → Classify → Index]

  • How to model complex interactions between jobs’ tasks?

SLIDE 20

Data Infrastructure’s Challenges (2)

  • Changing workload
  • Static resource allocation and rule-based provisioning are not suitable
  • Flexible provisioning
  • QoS-based, cost-based provisioning
SLIDE 21

Coordinator Data Processing Flow

  • Coordinator models jobs’ tasks on data as a task workflow over incoming data
  • A data processing job is abstracted as a workflow to support flexibility & applicability

[Example of a data processing workflow: Start → Extract metadata → (Classify sentiment | Analyze image) → Index → End]
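The workflow abstraction above can be sketched as a directed acyclic graph of tasks executed in dependency order. This is an illustrative model, not 4CeeD's actual scheduler; task names follow the example workflow.

```python
from collections import deque

# Illustrative sketch: a data-processing job modeled as a DAG of
# task -> downstream-task edges, executed in dependency order.
workflow = {
    "Start": ["Extract metadata"],
    "Extract metadata": ["Classify sentiment", "Analyze image"],
    "Classify sentiment": ["Index"],
    "Analyze image": ["Index"],
    "Index": ["End"],
    "End": [],
}

def topological_order(edges):
    """Kahn's algorithm: return tasks so every task follows its inputs."""
    indegree = {t: 0 for t in edges}
    for outs in edges.values():
        for t in outs:
            indegree[t] += 1
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for t in edges[task]:
            indegree[t] -= 1
            if indegree[t] == 0:  # all of t's inputs have now run
                ready.append(t)
    return order

order = topological_order(workflow)
```

Note that "Index" only becomes ready after both "Classify sentiment" and "Analyze image" finish, which is exactly the fan-in the diagram implies.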

SLIDE 22

[System diagram — Control plane: Front-end with a job invoker, routing table, and Resource manager. Compute plane: Broker(s) with Pub/Sub topics feeding A’s, B’s, and C’s consumers, plus a Database / File system.]

Routing table for job type 1 (workflow Start → A → B → C → End):

  Job type | From  | To
  1        | Start | A
  1        | A     | B
  1        | B     | C
  1        | C     | End

Example instantiation: the TEM data processing workflow Extract → Classify → Index.
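The routing table above can be derived mechanically from a linear workflow definition. The sketch below is a hypothetical helper in that spirit; the row format is taken from the diagram.

```python
# Illustrative sketch: derive the pub/sub routing table (From -> To rows)
# from a linear workflow, as the Coordinator's control plane might.
def routing_table(job_type, tasks):
    """Chain Start -> tasks... -> End into (job_type, frm, to) rows."""
    chain = ["Start"] + list(tasks) + ["End"]
    return [(job_type, frm, to) for frm, to in zip(chain, chain[1:])]

# TEM data processing workflow: Extract -> Classify -> Index
table = routing_table(1, ["Extract", "Classify", "Index"])
# table == [(1, "Start", "Extract"), (1, "Extract", "Classify"),
#           (1, "Classify", "Index"), (1, "Index", "End")]
```

Each row tells the front-end which topic a task's output should be published to, so new job types only require new table rows, not new broker wiring.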

SLIDE 23

4CeeD Pub-Sub Subsystem

  • New publish-subscribe-based system to support executing heterogeneous workflows
  • Leverages the flexibility of the asynchronous message passing mechanism of pub/sub systems
  • However:
  • Out-of-the-box pub/sub systems (e.g., Apache Kafka) do not support executing workflows
  • Resource management is done manually by the user

SLIDE 24

4CeeD Resource Management

[Resource manager diagram: Resource monitor → Resource scheduler → Resource allocator, managing A’s, B’s, and C’s consumers]

The resource monitor tracks:
  • Job request rates
  • Average response time
  • Topics’ message queues and consumer statistics
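A resource allocator driven by these metrics might size each task's consumer pool so the backlog drains within a response-time target. The rule and parameter names below are assumptions for illustration, not 4CeeD's published provisioning policy.

```python
import math

# Hypothetical queue-depth-based allocation rule: size the consumer pool
# to drain the backlog within target_secs while keeping up with arrivals.
def consumers_needed(queue_len, arrival_rate, service_time, target_secs):
    """Consumers required to clear the backlog plus absorb new arrivals."""
    drain = queue_len / target_secs               # msgs/sec we must clear
    load = (drain + arrival_rate) * service_time  # required parallel capacity
    return max(1, math.ceil(load))

# Bursty topic: 200 queued messages, 10 msg/s arriving, 0.5 s per message,
# 20 s response-time target
n = consumers_needed(200, 10, 0.5, 20)  # -> 10 consumers
```

Re-evaluating this rule as the monitored rates change is what distinguishes flexible provisioning from the static, rule-based allocation criticized on the previous slides.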

SLIDE 25

4CeeD Coordinator System Implementation

  • Front-end leverages Clowder’s Webapp & APIs (modified)
  • Resource managers & other control plane programs implemented in Python
  • RabbitMQ as message queue
  • Consumers implemented as Docker containers
  • Kubernetes is used for container orchestration

[Implementation diagram: Control plane, Compute plane, and Front-end components]

SLIDE 26

Evaluation

  • Case study: Executing scientific workflows
SLIDE 27

Efficient real-time resource provisioning

Our proposed approach efficiently provisions resources to cope with bursty workloads

[Evaluation plots comparing two consumer allocations: m = (1, 1, 1, 1) vs. m = (2, 2, 1, 4)]

SLIDE 28

Lessons Learned of 4CeeD so far

  • Huge inefficiencies exist at microscopes: (1) users spend time deciding which data to delete; (2) users spend time on data conversions to view data, instead of on data collection
  • There are security concerns: (1) users want to keep data secure and private until published; (2) instruments run on old, unpatched Windows software
  • Metadata related to data is lost: (1) some metadata is not properly extracted from images; (2) some metadata is not even captured
  • Current cloud solutions are not all suitable for backend storage and processing of microscope data

SLIDE 29

Best Practices of 4CeeD so far

  • Talk to users and introduce them to data tools
  • CyberFab 2016 Workshop, May 24, 2016, Urbana
  • Consider cloud solutions and do not reinvent everything
  • Develop open frameworks to enable integration
  • Do integration with other tools towards a sustainable tool suite
  • Talk and collaborate with other developers of data infrastructures