Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment (PowerPoint Presentation)


SLIDE 1

Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment

Klara Nahrstedt, University of Illinois at Urbana-Champaign. Joint work with Phuong Nguyen, Steve Konstanty, Todd Nicholson, Roy Campbell, Indy Gupta, Tim Spila, Michael Chan, Kenton McHenry, Tommy O’Brien, Aaron Schwartz-Duval

Project funded by NSF ACI DIBBs grant.

SLIDE 2

Outline

  • Motivation
  • Problem Description and Challenges
  • 4CeeD Approach
  • Lessons Learned So Far
  • Best Practices So Far
SLIDE 3

Motivation

  • Consideration of National Academy studies -> 20 years from discovery of new materials to fabrication of next-generation devices
  • Need for REAL-TIME and TRUSTED Capture, Curation, Correlation, and Coordination of materials-to-devices digital data before full archiving and publishing

[Image: integrated circuit, source: http://www.build-electronic-circuits.com/integrated-circuit/]

SLIDE 4

Current State of Data Collection at Microscopes

  • Current situation for experimental data involves manual processes for data capture and storage, leading to poor documentation of results
  • Data transfer is often done via “sneaker-net” techniques using flash drives or email
  • “Best” results and images are kept, but what is “best” is determined by a narrow, specific scientific objective. “Imperfect” data is often discarded or not available for others to review.

SLIDE 5

Effects of Current State

  • Measurements on multiple instruments for a new material may not be well correlated due to a lack of mechanisms to encode the linkages between measurements.
  • Novel device prototypes can be difficult to reproduce due to a lack of proper capture of the “recipes” used.
  • In addition, previous experiments in the deposition systems may affect subsequent experiments.
  • Curation of system information can greatly improve the reproducibility and understanding of results.

SLIDE 6

Steps towards Problem Solution

  • Determine Physical Environments for Data Collection
  • Understand Physical and Digital Processes that are going on during materials and semiconductor fabrication research
  • Determine Instruments for Investigation
  • Determine Cyber-and-Data Infrastructure for Real-Time and Trusted Data Collection from Instruments
  • Design and Develop Distributed Data Collection Tool
  • Identify Test Users, Test Tool and Extract Feedback
  • Feed Feedback to Distributed Data Collection Tool
SLIDE 7

Materials and Semiconductor Fabrication Cyber-Physical Environments

Micro-Nano Technology Laboratory

  • Growth and characterization of photonics, microelectronics, nanotechnology and biotechnology

Materials Research Laboratory

  • Research in condensed matter physics, materials chemistry and materials science
  • Facilities for nanostructural and nanochemical analysis

SLIDE 8

Microscope Data: Development Process (Example)

[Process flow diagram: fabrication steps (SiO2 Mask Deposition, Lithography, SiNx Deposition, Plasma Etching, SiNx Removal, Oxidation, Diffusion, Metallization, Device Characterization), each followed by characterization steps such as Profilometry, Ellipsometry, SIMS, SEM, Optical microscopy, and SPA]

SLIDE 9

Collected Data from Microscope (Oxidation Step)

An example of the result from an experiment at MNTL

SLIDE 10

Challenges for Real-Time and Trusted Data Collection

  • Understanding user requirements for data curation
  • Development of policies for protecting data during a research project and making data available after the project is completed
  • Creating a system that is able to handle many different types of work processes
  • Ability to read and display images and data from many different sources, many of which are proprietary
  • Networking challenges for collecting data from networked microscopes

SLIDE 11

Our Approach 4CeeD: Timely and Trusted Capture, Curation, Correlation, Coordination and Distribution

SLIDE 12

4CeeD Approach: Cyber-infrastructure

[Architecture diagram: Clients → Curators (Edge Computing) → Coordinator (Cloud)]

SLIDE 13

[Deployment diagram: Users at MNTL and MRL upload DM3 files, images, metadata, and text to Curators running on cloudlets; users view, edit, and share data via a Webapp and perform bulk data transfer via an API; the cloud-based Coordinator processes, coordinates, and correlates data from multiple sources.]

SLIDE 14

4CeeD Curator

SLIDE 15

4CeeD Curator Goals at Microscope

  • Enable researchers to have a Digital Logbook System
  • Data is organized by researcher and by sample name
  • Recipes are collected and related to the deposition equipment used
  • Analytical data is collected as it is created and contains the metadata needed to reproduce measurements

SLIDE 16

4CeeD Curator – Input Data Collection

1. Create or select a Collection → 2. Create or select a Dataset → 3. Upload files → 4. Optional: choose a template and enter metadata
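The four-step flow above can be sketched as a small client plan against a Clowder-style REST API. The base URL, endpoint paths, and field names below are illustrative assumptions, not the documented 4CeeD or Clowder API.

```python
# Hypothetical sketch of the Curator upload flow (collection -> dataset ->
# files -> optional metadata); endpoints and fields are assumptions.
BASE = "https://curator.example.edu/api"  # hypothetical server

def build_upload_plan(collection, dataset, files, metadata=None):
    """Return the ordered (method, url, payload) requests the flow implies."""
    plan = [
        ("POST", f"{BASE}/collections", {"name": collection}),     # step 1
        ("POST", f"{BASE}/datasets",
         {"name": dataset, "collection": collection}),             # step 2
    ]
    for path in files:                                             # step 3
        plan.append(("POST", f"{BASE}/uploadToDataset/{dataset}", {"file": path}))
    if metadata:                                                   # step 4 (optional)
        plan.append(("POST", f"{BASE}/datasets/{dataset}/metadata", metadata))
    return plan

plan = build_upload_plan("Oxidation", "sample-42", ["sem_image.dm3"],
                         {"template": "SEM", "voltage_kV": 10})
```

Each entry in the plan could then be issued with an ordinary HTTP client; separating plan construction from transport keeps the flow easy to inspect and test.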

SLIDE 17

4CeeD Curator (Modified Clowder)

Architecture

[Architecture diagram — Client side: Web Browser, Custom Clients, External Software. Server side (Clowder): Load balancer (nginx) in front of replicated Webapp instances (Scala/Play); Event Bus (RabbitMQ); Extractors (Java, Python); Text Search (Elasticsearch); Multimedia Search (Versus); Data/Metadata store (MongoDB).]
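In this architecture, extractors consume file events from the RabbitMQ event bus. A minimal sketch in that style follows; the queue name and message fields are assumptions, and real Clowder extractors typically use the pyclowder helper library rather than raw pika.

```python
import json

# Hypothetical metadata extractor: the queue name, event fields, and the
# DM3 -> TEM rule are illustrative assumptions, not the Clowder protocol.

def extract_metadata(event):
    """Pure handler: turn a file-created event into a metadata record."""
    name = event["filename"]
    record = {"file_id": event["id"], "extension": name.rsplit(".", 1)[-1]}
    if record["extension"] == "dm3":
        record["instrument"] = "TEM"  # DM3 files come from electron microscopes
    return record

def on_message(ch, method, properties, body):
    record = extract_metadata(json.loads(body))
    # In a real extractor this record would be POSTed back to the Webapp API.
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Wiring (requires a running broker):
#   import pika
#   conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   ch = conn.channel()
#   ch.basic_consume(queue="curator.file.created", on_message_callback=on_message)
#   ch.start_consuming()
```

Keeping the handler pure (no broker access inside `extract_metadata`) is what lets different extractors be swapped behind the same event bus.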

SLIDE 18

4CeeD Coordinator

SLIDE 19

Data Infrastructure’s Challenges (1)

  • Heterogeneity of the types of jobs and input data

[Workflow diagrams — TEM data processing workflow: DM3 parsing → Extract metadata → Analyze image → Index; SEM data processing workflow: Extract structured information → Classify → Index]

  • How to model complex interactions between jobs’ tasks?

SLIDE 20

Data Infrastructure’s Challenges (2)

  • Changing workload
  • Static resource allocation and rule-based provisioning are not suitable
  • Flexible provisioning
  • QoS-based, cost-based provisioning
SLIDE 21

Coordinator Data Processing Flow

  • Coordinator models jobs’ tasks on data as a task workflow over incoming data
  • A data processing job is abstracted as a workflow to support flexibility & applicability

[Example of a data processing workflow: Start → Extract metadata → (Classify sentiment | Analyze image) → Index → End]
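The workflow abstraction above can be sketched as a directed acyclic graph of tasks executed in dependency order. This is an illustrative model, not 4CeeD's actual scheduler; task names follow the example workflow.

```python
from collections import deque

# Illustrative sketch: a data-processing job modeled as a DAG of
# task -> downstream-task edges, executed in dependency order.
workflow = {
    "Start": ["Extract metadata"],
    "Extract metadata": ["Classify sentiment", "Analyze image"],
    "Classify sentiment": ["Index"],
    "Analyze image": ["Index"],
    "Index": ["End"],
    "End": [],
}

def topological_order(edges):
    """Kahn's algorithm: return tasks so every task follows its inputs."""
    indegree = {t: 0 for t in edges}
    for outs in edges.values():
        for t in outs:
            indegree[t] += 1
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for t in edges[task]:
            indegree[t] -= 1
            if indegree[t] == 0:  # all of t's inputs have now run
                ready.append(t)
    return order

order = topological_order(workflow)
```

Note that "Index" only becomes ready after both "Classify sentiment" and "Analyze image" finish, which is exactly the fan-in the diagram implies.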

SLIDE 22

[System diagram — Control plane: Front-end with a job invoker, routing table, and Resource manager. Compute plane: Broker(s) with Pub/Sub topics feeding A’s, B’s, and C’s consumers, plus a Database / File system.]

Routing table for job type 1 (workflow Start → A → B → C → End):

  Job type | From  | To
  1        | Start | A
  1        | A     | B
  1        | B     | C
  1        | C     | End

Example instantiation: the TEM data processing workflow Extract → Classify → Index.
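The routing table above can be derived mechanically from a linear workflow definition. The sketch below is a hypothetical helper in that spirit; the row format is taken from the diagram.

```python
# Illustrative sketch: derive the pub/sub routing table (From -> To rows)
# from a linear workflow, as the Coordinator's control plane might.
def routing_table(job_type, tasks):
    """Chain Start -> tasks... -> End into (job_type, frm, to) rows."""
    chain = ["Start"] + list(tasks) + ["End"]
    return [(job_type, frm, to) for frm, to in zip(chain, chain[1:])]

# TEM data processing workflow: Extract -> Classify -> Index
table = routing_table(1, ["Extract", "Classify", "Index"])
# table == [(1, "Start", "Extract"), (1, "Extract", "Classify"),
#           (1, "Classify", "Index"), (1, "Index", "End")]
```

Each row tells the front-end which topic a task's output should be published to, so new job types only require new table rows, not new broker wiring.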

SLIDE 23

4CeeD Pub-Sub Subsystem

  • New publish-subscribe-based system to support executing heterogeneous workflows
  • Leverages the flexibility of the asynchronous message passing mechanism of pub/sub systems
  • However:
  • Out-of-the-box pub/sub systems (e.g., Apache Kafka) do not support executing workflows
  • Resource management is done manually by the user

SLIDE 24

4CeeD Resource Management

[Resource manager diagram: Resource monitor → Resource scheduler → Resource allocator, managing A’s, B’s, and C’s consumers]

The resource monitor tracks:
  • Job request rates
  • Average response time
  • Topics’ message queues and consumer statistics
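A resource allocator driven by these metrics might size each task's consumer pool so the backlog drains within a response-time target. The rule and parameter names below are assumptions for illustration, not 4CeeD's published provisioning policy.

```python
import math

# Hypothetical queue-depth-based allocation rule: size the consumer pool
# to drain the backlog within target_secs while keeping up with arrivals.
def consumers_needed(queue_len, arrival_rate, service_time, target_secs):
    """Consumers required to clear the backlog plus absorb new arrivals."""
    drain = queue_len / target_secs               # msgs/sec we must clear
    load = (drain + arrival_rate) * service_time  # required parallel capacity
    return max(1, math.ceil(load))

# Bursty topic: 200 queued messages, 10 msg/s arriving, 0.5 s per message,
# 20 s response-time target
n = consumers_needed(200, 10, 0.5, 20)  # -> 10 consumers
```

Re-evaluating this rule as the monitored rates change is what distinguishes flexible provisioning from the static, rule-based allocation criticized on the previous slides.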

SLIDE 25

4CeeD Coordinator System Implementation

  • Front-end leverages Clowder’s Webapp & APIs (modified)
  • Resource managers & other control plane programs implemented in Python
  • RabbitMQ as message queue
  • Consumers implemented as Docker containers
  • Kubernetes is used for container orchestration

[Implementation diagram: Control plane, Compute plane, and Front-end components]

SLIDE 26

Evaluation

  • Case study: Executing scientific workflows
SLIDE 27

Efficient real-time resource provisioning

Our proposed approach efficiently provisions resources to cope with bursty workloads

[Evaluation plots comparing two consumer allocations: m = (1, 1, 1, 1) vs. m = (2, 2, 1, 4)]

SLIDE 28

Lessons Learned of 4CeeD so far

  • Huge inefficiencies exist at microscopes: (1) users spend time deciding which data to delete; (2) users spend time on data conversions to view data, instead of on data collection
  • There are security concerns: (1) users want to keep data secure and private until published; (2) instruments run on old, unpatched Windows software
  • Metadata related to data is lost: (1) some metadata is not properly extracted from images; (2) some metadata is not even captured
  • Current cloud solutions are not all suitable for backend storage and processing of microscope data

SLIDE 29

Best Practices of 4CeeD so far

  • Talk to users and introduce them to data tools
  • CyberFab 2016 Workshop, May 24, 2016, Urbana
  • Consider cloud solutions and do not reinvent everything
  • Develop open frameworks to enable integration
  • Do integration with other tools towards a sustainable tool suite
  • Talk and collaborate with other developers of data infrastructures