Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment

  1. Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment
  Klara Nahrstedt, University of Illinois at Urbana-Champaign
  Joint work with Phuong Nguyen, Steve Konstanty, Todd Nicholson, Roy Campbell, Indy Gupta, Tim Spila, Michael Chan, Kenton McHenry, Tommy O’Brien, Aaron Schwartz-Duval
  Project funded by NSF ACI DIBBs grant.

  2. Outline
  • Motivation
  • Problem Description and Challenges
  • 4CeeD Approach
  • Lessons Learned So Far
  • Best Practices So Far

  3. Motivation
  • Consideration of National Academy studies -> 20 years from discovery of new materials to fabrication of next-generation devices
  • Need for REAL-TIME and TRUSTED Capture, Curation, Correlation, and Coordination of materials-to-devices digital data before full archiving and publishing
  (Image: http://www.build-electronic-circuits.com/integrated-circuit/)

  4. Current State of Data Collection at Microscopes
  • The current situation for experimental data involves manual processes for data capture and storage, leading to poor documentation of results
  • Data transfer is often done via “sneaker-net” techniques using flash drives or email
  • “Best” results and images are kept, but what is “best” is determined by a narrow, specific scientific objective. “Imperfect” data is often discarded or not available for others to review.

  5. Effects of Current State
  • Measurements on multiple instruments for a new material may not be well correlated due to a lack of mechanisms to encode the linkages between measurements.
  • Novel device prototypes can be difficult to reproduce due to a lack of proper capture of the “recipes” used.
  • In addition, previous experiments in the deposition systems may affect subsequent experiments.
  • Curation of system information can greatly improve the reproducibility and understanding of results.

  6. Steps towards Problem Solution
  • Determine Physical Environments for Data Collection
  • Understand Physical and Digital Processes that are going on during material and semiconductor fabrication research
  • Determine Instruments for Investigation
  • Determine Cyber-and-Data Infrastructure for Real-time and Trusted Data Collection from Instruments
  • Design and Develop Distributed Data Collection Tool
  • Identify Test Users, Test Tool and Extract Feedback
  • Feed Feedback to Distributed Data Collection Tool

  7. Materials and Semiconductor Fabrication Cyber-Physical Environments
  Micro-Nano Technology Laboratory
  • Growth and characterization of photonics, microelectronics, nanotechnology and biotechnology
  Materials Research Laboratory
  • Research in condensed matter physics, materials chemistry and materials science
  • Facilities for nanostructural and nanochemical analysis

  8. Microscope Data: Development Process (Example)
  [Process-flow diagram: fabrication steps such as SiO2 mask deposition, diffusion, SiNx deposition, lithography, plasma etching, oxidation, SiNx removal, metallization, and device characterization, each followed by characterization steps such as profilometry, SIMS, ellipsometry, optical microscopy, SEM, and SPA.]

  9. Collected Data from Microscope (Oxidation Step) An example of the result from an experiment at MNTL

  10. Challenges for Real-Time and Trusted Data Collection
  • Understanding user requirements for data curation
  • Development of policies for protecting data during a research project and making data available after the research project is completed
  • Creating a system that is able to handle many different types of work processes
  • Ability to read and display images and data from many different sources, many of which are proprietary
  • Networking challenges for collecting data from networked microscopes

  11. Our Approach 4CeeD: Timely and Trusted Capture, Curation, Correlation, Coordination and Distribution

  12. 4CeeD Approach: Cyber-infrastructure
  [Architecture diagram: Clients -> Curators (edge computing) -> Coordinator (cloud)]

  13. [System diagram: users at MRL and MNTL upload DM3 files, images, metadata, and text to Curators running on cloudlets; Curators perform bulk data transfer (via API) to the Coordinator, which processes, coordinates, and correlates data from multiple sources; users view, edit, and share data via the Webapp.]

  14. 4CeeD Curator

  15. 4CeeD Curator Goals at Microscope
  Enable researchers to have a Digital Logbook System:
  • Data is organized by researcher and by sample name
  • Recipes are collected and related to the deposition equipment used
  • Analytical data is collected as it is created and contains metadata needed to reproduce measurements

  16. 4CeeD Curator – Input Data Collection
  1) Create or select a collection
  2) Create or select a dataset
  3) Upload files
  4) Optional: choose a template and enter metadata
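
  The same four-step flow could also be driven programmatically. Below is a minimal sketch assuming a Clowder-style REST API with API-key authentication; the endpoint paths, field names, base URL, and payloads are illustrative assumptions, not the documented 4CeeD interface.

```python
# Hypothetical sketch of the Curator upload flow against a Clowder-style
# REST API. Endpoints, field names, and the API-key header are assumptions.
import requests

BASE = "https://curator.example.edu"      # assumed Curator URL
HEADERS = {"X-API-Key": "YOUR_KEY"}       # assumed authentication header

# 1) Create (or select) a collection for the sample being studied.
coll = requests.post(f"{BASE}/api/collections",
                     json={"name": "Sample-42", "description": "Oxidation run"},
                     headers=HEADERS).json()

# 2) Create a dataset inside that collection for this measurement session.
ds = requests.post(f"{BASE}/api/datasets/createempty",
                   json={"name": "SEM session", "collection": [coll["id"]]},
                   headers=HEADERS).json()

# 3) Upload the raw instrument files (e.g. DM3, TIFF) into the dataset.
with open("image_001.dm3", "rb") as f:
    requests.post(f"{BASE}/api/uploadToDataset/{ds['id']}",
                  files={"File": f}, headers=HEADERS)

# 4) Optionally attach template-driven metadata (recipe, equipment, settings).
requests.post(f"{BASE}/api/datasets/{ds['id']}/metadata",
              json={"recipe": "SiNx PECVD", "temperature_C": 300},
              headers=HEADERS)
```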

  17. 4CeeD Curator (Modified Clowder) Architecture
  [Architecture diagram: web browsers and custom clients reach the server tier through an nginx load balancer; multiple Clowder Webapp instances (Scala/Play) sit behind it, also serving external software, and connect via an event bus (RabbitMQ) to backend services: data/metadata storage (MongoDB), text search (Elasticsearch), multimedia search (Versus), and extractors (Java, Python).]
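
  An extractor in this architecture is essentially a worker that consumes file events from the RabbitMQ event bus, derives metadata, and reports it back through the API. The following is a minimal consumption skeleton using pika; the queue name and message fields are assumptions, and a production Clowder extractor would normally be built on Clowder's extractor framework rather than raw pika.

```python
# Minimal sketch of an extractor worker consuming file events from the
# RabbitMQ event bus. Queue name and message fields are assumptions.
import json
import pika

def on_message(channel, method, properties, body):
    event = json.loads(body)
    file_id = event.get("fileId")      # assumed field name
    # ... fetch the file, extract DM3/TIFF metadata, post it back via the API ...
    print(f"extracted metadata for file {file_id}")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="dm3.extract", durable=True)   # assumed queue name
channel.basic_consume(queue="dm3.extract", on_message_callback=on_message)
channel.start_consuming()
```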

  18. 4CeeD Coordinator

  19. Data Infrastructure’s Challenges (1)
  • Heterogeneity of the types of jobs and input data
  [Diagram: example TEM and SEM data processing workflows built from tasks such as DM3 parsing, extract metadata, extract structured information, classify, analyze image, and index.]
  • How to model complex interactions between jobs’ tasks?

  20. Data Infrastructure’s Challenges (2)
  • Changing workload: static resource allocation and rule-based provisioning are not suitable
  • Flexible provisioning: QoS-based, cost-based provisioning

  21. Coordinator Data Processing Flow
  • The Coordinator models a job’s tasks as a task workflow over the incoming data
  • A data processing job is abstracted as a workflow to support flexibility & applicability
  [Diagram: example data processing workflow: Start -> Extract metadata -> {Classify sentiment, Analyze image} -> Index -> End]
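
  As an illustration of this abstraction, the sketch below represents a data-processing job as a small DAG of named tasks. The task names mirror the example workflow above; the class and method names are assumptions for illustration, not 4CeeD code.

```python
# Sketch of the workflow abstraction: a job is a DAG of named tasks.
# Class and field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Workflow:
    name: str
    # edges map each task to the tasks that consume its output
    edges: dict[str, list[str]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.setdefault(src, []).append(dst)

    def successors(self, task: str) -> list[str]:
        return self.edges.get(task, [])

# Example workflow from the slide:
# Start -> Extract metadata -> {Classify, Analyze image} -> Index -> End
wf = Workflow("example")
wf.add_edge("Start", "Extract metadata")
wf.add_edge("Extract metadata", "Classify")
wf.add_edge("Extract metadata", "Analyze image")
wf.add_edge("Classify", "Index")
wf.add_edge("Analyze image", "Index")
wf.add_edge("Index", "End")
```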

  22. [Architecture diagram: the Coordinator’s front-end (database / file system) stores job definitions and a workflow edge table; the control plane (job invoker, broker(s), resource manager) uses the table to route tasks; the compute plane runs each task’s consumers (A’s, B’s, C’s consumers), which subscribe to their input topic and publish results downstream, e.g. the TEM data processing workflow Start -> A (Extract) -> B (Classify) -> C (Index) -> End.]
  Workflow edge table:
    Job type | From  | To
    1        | Start | A
    1        | A     | B
    1        | B     | C
    1        | C     | End

  23. 4CeeD Pub-Sub Subsystem
  • New publish/subscribe-based system to support executing heterogeneous workflows
  • Leverages the flexibility of the asynchronous message-passing mechanism of pub/sub systems such as Apache Kafka
  • However:
    - Out-of-the-box pub/sub systems do not support executing workflows
    - Resource management is done manually by the user
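
  The gap is that a plain broker only delivers messages; the routing between workflow stages has to be added on top of it. The hedged sketch below shows one worker for a single stage using kafka-python: it consumes from its task's topic and republishes results to the successor topics taken from the workflow edge table. Topic names, message format, and the broker address are assumptions (and 4CeeD's own compute plane uses RabbitMQ rather than Kafka).

```python
# Sketch: executing one workflow stage on top of a plain pub/sub broker.
# Topic names, message format, and broker address are assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer

# Successor topics per task, derived from the workflow edge table (slide 22).
NEXT = {"A": ["B"], "B": ["C"], "C": []}

MY_TASK = "A"
consumer = KafkaConsumer(MY_TASK, bootstrap_servers="broker:9092",
                         group_id=f"{MY_TASK}-consumers")
producer = KafkaProducer(bootstrap_servers="broker:9092")

def run_task(payload: dict) -> dict:
    # ... real work: parse DM3, extract metadata, classify, index, etc. ...
    return payload

for msg in consumer:
    result = run_task(json.loads(msg.value))
    # The broker itself knows nothing about the workflow; the worker must
    # forward results to the downstream task's topic explicitly.
    for topic in NEXT[MY_TASK]:
        producer.send(topic, json.dumps(result).encode("utf-8"))
```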

  24. 4CeeD Resource Management
  [Diagram: a resource monitor collects job request rates, average response times, and per-topic message-queue and consumer statistics; the resource scheduler and resource allocator inside the resource manager use them to size the consumer pools for each task (A’s, B’s, C’s consumers).]
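
  One plausible reading of this control loop is a QoS-driven allocator that sizes each topic's consumer pool from the monitored arrival rate, response time, and backlog. The thresholds and scaling rule below are invented for illustration and are not the actual 4CeeD policy.

```python
# Sketch of a QoS-driven allocator: scale consumers per topic from the
# monitored backlog and response time. Thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TopicStats:
    queue_length: int        # messages waiting on the topic
    arrival_rate: float      # jobs per second
    avg_response_s: float    # observed end-to-end latency
    consumers: int           # currently running consumers

def desired_consumers(stats: TopicStats, target_response_s: float = 5.0,
                      per_consumer_rate: float = 2.0) -> int:
    # Enough consumers to keep up with arrivals...
    needed = stats.arrival_rate / per_consumer_rate
    # ...plus extra capacity while the latency target is missed or backlog grows.
    if stats.avg_response_s > target_response_s or stats.queue_length > 100:
        needed *= 2
    return max(1, round(needed))

# Example: topic "A" is bursty, so the allocator doubles its consumers.
print(desired_consumers(TopicStats(250, 4.0, 8.2, 2)))   # -> 4
```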

  25. 4CeeD Coordinator System Implementation
  • Front-end: leverages the modified Clowder Webapp & APIs
  • Control plane: resource managers & other control-plane programs implemented in Python
  • Compute plane: RabbitMQ as message queue; consumers implemented as Docker containers; Kubernetes is used for container orchestration
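
  Because consumers run as containers under Kubernetes, an allocation decision can be applied by resizing the corresponding Deployment. A minimal sketch with the official kubernetes Python client follows; the deployment and namespace names are assumptions.

```python
# Sketch: apply the allocator's decision by scaling a consumer Deployment.
# Deployment/namespace names are assumptions; requires the `kubernetes` client.
from kubernetes import client, config

def scale_consumers(deployment: str, replicas: int, namespace: str = "4ceed") -> None:
    config.load_kube_config()      # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Example: run 4 consumers for task A's topic.
scale_consumers("task-a-consumer", 4)
```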

  26. Evaluation  Case study: Executing scientific workflows

  27. Efficient real-time resource provisioning
  [Evaluation plots for resource configurations m = (1, 1, 1, 1) and m = (2, 2, 1, 4).]
  Our proposed approach efficiently provisions resources to cope with bursty workloads.

  28. Lessons Learned of 4CeeD so far
  • Huge inefficiencies exist at microscopes: (1) users spend time deciding which data to delete; (2) users spend time on data conversions to view data, instead of on data collection
  • There are security concerns: (1) users want to keep data secure and private until published; (2) instruments run on old, unpatched Windows software
  • Metadata related to data is lost: (1) some metadata is not properly extracted from images; (2) some metadata is not even captured
  • Current cloud solutions are not all suitable for backend storage and processing of microscope data

  29. Best Practices of 4CeeD so far
  • Talk to users and introduce them to data tools (e.g., CyberFab 2016 Workshop, May 24, 2016, Urbana)
  • Consider cloud solutions and do not reinvent everything
  • Develop open frameworks to enable integration
  • Integrate with other tools towards a sustainable tool suite
  • Talk and collaborate with other developers of data infrastructures
