SLIDE 1

The Revolution in Experimental and Observational Science

The Convergence of Data-Intensive and Compute-Intensive Infrastructure

Professor Tony Hey, Chief Data Scientist, STFC (tony.hey@stfc.ac.uk)

SLIDE 2

The Background

SLIDE 3

X-Info

  • The evolution of X-Info and Comp-X for each discipline X
  • How to codify and represent our knowledge
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to share with others
  • Query and Vis tools
  • Building and executing models
  • Integrating data and Literature
  • Documenting experiments
  • Curation and long-term preservation

The Generic Problems

Diagram: facts flow in from experiments & instruments, simulations, the literature and other archives; questions go in and answers come out.

Slide thanks to Jim Gray

SLIDE 4

What X-info Needs from Computer Science

(not drawn to scale)

  • Science data & questions (scientists)
  • Database systems to store data and execute queries
  • Data mining algorithms (miners)
  • Question & answer and visualization tools

Slide thanks to Jim Gray

SLIDE 5

Thousand years ago – Experimental Science

  • Description of natural phenomena

Last few hundred years – Theoretical Science

  • Newton’s Laws, Maxwell’s Equations…

Last few decades – Computational Science

  • Simulation of complex phenomena

Today – Data-Intensive Science

  • Scientists overwhelmed with data sets from many different sources

  • Data captured by instruments
  • Data generated by simulations
  • Data generated by sensor networks

e-Science and the Fourth Paradigm


eScience is the set of tools and technologies to support data federation and collaboration

  • For analysis and data mining
  • For data visualization and exploration
  • For scholarly communication and dissemination

With thanks to Jim Gray

SLIDE 6

Artificial Neural Networks

Diagram: input layer, hidden layer, output layer
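The slide shows only the layer diagram; as an illustrative sketch (not from the deck), a NumPy forward pass through a single hidden layer. The layer sizes, random weights and sigmoid activation are assumptions made for the example:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Assumed toy sizes: 4 inputs, 8 hidden units, 2 outputs.
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # input layer -> hidden layer
    W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)    # hidden layer -> output layer

    def forward(x):
        """One forward pass: input -> hidden -> output."""
        h = sigmoid(W1 @ x + b1)       # hidden-layer activations
        return sigmoid(W2 @ h + b2)    # output-layer activations

    print(forward(np.array([0.1, 0.5, -0.2, 1.0])))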

SLIDE 7

Machine Learning

  • Neural networks are one example of a Machine Learning (ML) algorithm
  • Deep Neural Networks are now exciting the whole of the IT industry since they enable us to:
  • Build computing systems that improve with experience
  • Solve extremely hard problems
  • Extract more value from Big Data
  • Approach human intelligence, e.g. natural language processing
  • The change in the Word Error Rate (WER) with time for the NIST “Switchboard” data shows the dramatic improvement made in the last few years using Deep Neural Networks

SLIDE 8
SLIDE 9
SLIDE 10
SLIDE 11

Data Science and the UK Science and Technology Facilities Council

SLIDE 12

UK Science and Technology Facilities Council (STFC)

Daresbury Laboratory, Sci-Tech Daresbury Campus, Warrington, Cheshire

SLIDE 13

Big Data and Cognitive Computing: Hartree Centre collaboration with IBM Research

SLIDE 14

Central Laser Facility, ISIS (Spallation Neutron Source), Diamond Light Source, LHC Tier 1 computing, JASMIN Super-Data-Cluster

Rutherford Appleton Lab and the Harwell Campus

SLIDE 15
SLIDE 16
  • Assist universities in developing, maintaining and distributing computer programs
  • Promote the best computational methods
  • Each focuses on a specific area of research
  • Funded by the UK's EPSRC, PPARC and BBSRC Research Councils

Collaborative Computational Projects: the CCPs

SLIDE 17
SLIDE 18
SLIDE 19

The Diamond Synchrotron

SLIDE 20

Diamond Light Source

SLIDE 21

Science Examples

  • Pharmaceutical manufacture & processing
  • Casting aluminium
  • Structure of the histamine H1 receptor
  • Non-destructive imaging of fossils

SLIDE 22
  • 2007: no detector faster than ~10 MB/s
  • 2009: Pilatus 6M system, 60 MB/s
  • 2011: 25 Hz Pilatus 6M, 150 MB/s
  • 2013: 100 Hz Pilatus 6M, 600 MB/s
  • 2013: ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge)
  • 2016: Percival detector, 6 GB/s
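To put these rates in context, a quick back-of-the-envelope conversion from sustained detector rate to daily data volume (continuous 24-hour collection is assumed here, which real beamlines will not sustain):

    # Sustained detector rate -> data volume per 24-hour day (continuous collection assumed).
    rates_mb_per_s = {
        "2009 Pilatus 6M": 60,
        "2011 25 Hz Pilatus 6M": 150,
        "2013 100 Hz Pilatus 6M": 600,
        "2016 Percival": 6000,
    }
    for name, rate in rates_mb_per_s.items():
        tb_per_day = rate * 86400 / 1e6      # MB/s x seconds per day -> TB
        print(f"{name}: {tb_per_day:.0f} TB/day")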

Chart: detector performance (MB/s), 2007-2012, on a logarithmic scale

Data Rates

Thanks to Mark Heron

SLIDE 23

Thanks to Mark Heron

Chart: cumulative amount of data generated by Diamond, Jan 2007 to Jan 2016 (data size in PB)

SLIDE 24

Nucleus

Cryo-SXT Data

  • Noisy data, missing-wedge artifacts, missing boundaries

  • Tens to hundreds of organelles per dataset
  • Tedious to manually annotate
  • Cell types can look different
  • Few previous annotations available
  • Automated techniques usually fail

Figure: segmentation of a neuronal-like mammalian cell line (single slice), showing nucleus and cytoplasm

Challenges: Data

  • B24: Cryo Transmission X-ray Microscopy beamline at DLS
  • Data Collection: tilt series from ±65° with 0.5° step size
  • Reconstructed volumes up to 1000 × 1000 × 600 voxels
  • Voxel resolution: ~40 nm currently
  • Total depth: up to 10 µm
  • GOAL: Study structure and morphological changes of whole cells
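A quick check of the data scale these collection parameters imply (float32 voxels are an assumption made for the size estimate):

    # Tilt series from -65 deg to +65 deg in 0.5 deg steps:
    n_projections = int((65 - (-65)) / 0.5) + 1        # 261 projections per dataset
    # Reconstructed volume of up to 1000 x 1000 x 600 voxels, assuming float32 (4 bytes/voxel):
    volume_gb = 1000 * 1000 * 600 * 4 / 1e9            # ~2.4 GB per reconstructed volume
    print(n_projections, "projections,", volume_gb, "GB per volume")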

3D Volume Data

Segmentation of Cryo-soft X-ray Tomography (Cryo-SXT) data

Computer Vision Laboratory B24 beamline Data Analysis Software Group

scientificsoftware@diamond.ac.uk

SLIDE 25

Data Preprocessing

Figure: raw slice → Gaussian filter → total variation denoising

Data Representation

Figure: SuperVoxels (SV) and SV boundaries

SuperVoxels:

  • Groups of similar and adjacent voxels in 3D
  • Preserve volume boundaries
  • Reduce noise when representing data
  • Reduce problem complexity by several orders of magnitude
  • Use local clustering in {x, y, z, λ · intensity} space (a minimal sketch follows below)
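The deck contains no code; as a hedged stand-in for this supervoxel step, scikit-image's SLIC (version 0.19 or later assumed) performs exactly this kind of local k-means over position and intensity, with its compactness parameter playing the role of λ. The volume and parameter values below are illustrative, not the SuRVoS settings:

    import numpy as np
    from skimage.segmentation import slic        # scikit-image >= 0.19 assumed

    # Placeholder volume standing in for a reconstructed cryo-SXT dataset
    # (real volumes are e.g. 946 x 946 x 200 float32 intensities).
    vol = np.random.rand(50, 64, 64).astype(np.float32)

    # Local k-means over {z, y, x, intensity}; 'compactness' plays the role of lambda.
    supervoxels = slic(vol,
                       n_segments=vol.size // 1000,   # roughly 10 x 10 x 10 voxels each
                       compactness=0.1,
                       channel_axis=None)             # grayscale 3D volume
    print(len(np.unique(supervoxels)), "supervoxels")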

Workflow: data preprocessing → data representation → feature extraction → user’s manual segmentations → classification → refinement

scientificsoftware@diamond.ac.uk

SLIDE 26

Data Representation

Figure: voxel grid vs supervoxel graph

946 × 946 × 200 ≈ 180M voxels; 180M / (10 × 10 × 10) ≈ 180K supervoxels. An initial grid of uniformly sampled seeds is refined by local k-means in a small window around each seed.

Workflow: data preprocessing → data representation → feature extraction → user’s manual segmentations → classification → refinement

scientificsoftware@diamond.ac.uk

SLIDE 27

Workflow: data preprocessing → data representation → feature extraction → user’s manual segmentations → classification

Feature Extraction

Features are extracted from voxels to represent their appearance (a small sketch follows below):

  • Intensity-based filters (Gaussian convolutions)
  • Textural filters (eigenvalues of the Hessian and structure tensor)
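A minimal sketch of such a voxel-wise feature stack using SciPy; the filter scales are assumptions and the real SuRVoS feature set is richer:

    import numpy as np
    from scipy import ndimage as ndi

    def voxel_features(vol, sigmas=(1.0, 2.0, 4.0)):
        """Stack simple intensity and texture descriptors per voxel."""
        feats = [vol]
        for s in sigmas:
            feats.append(ndi.gaussian_filter(vol, s))              # smoothed intensity
            feats.append(ndi.gaussian_gradient_magnitude(vol, s))  # edge strength
            feats.append(ndi.gaussian_laplace(vol, s))             # blob/ridge response
        return np.stack(feats, axis=-1)                            # vol.shape + (n_features,)

    vol = np.random.rand(32, 64, 64).astype(np.float32)
    print(voxel_features(vol).shape)                               # (32, 64, 64, 10)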

User Annotation + Machine Learning

Refinement

Figure: user annotations → predictions → refinement

Using a few user annotations along the volume as input:

  • A machine learning classifier (e.g. a Random Forest) is trained to discriminate between the different classes (e.g. nucleus and cytoplasm) and to predict the class of every SuperVoxel in the volume (see the sketch below).
  • A Markov Random Field (MRF) is then used to refine the predictions.
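A hedged sketch of the classification step with scikit-learn; the feature matrix, label encoding and the omitted MRF refinement are simplified placeholders, not the SuRVoS implementation:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # X: one feature vector per supervoxel (e.g. the mean of the voxel features above);
    # y: labels from the user's brush strokes, -1 where unannotated. All values illustrative.
    X = np.random.rand(180_000, 10)
    y = np.full(180_000, -1)
    y[:500] = np.random.randint(0, 2, 500)      # a few annotated supervoxels (0 = cytoplasm, 1 = nucleus)

    labelled = y >= 0
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X[labelled], y[labelled])
    proba = clf.predict_proba(X)                # class probabilities for every supervoxel
    pred = proba.argmax(axis=1)
    # An MRF / graph cut over the supervoxel adjacency graph would then smooth these predictions.
    print(pred.shape)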

scientificsoftware@diamond.ac.uk

SLIDE 28

SuRVoS Workbench

(Su)per-(R)egion (Vo)lume (S)egmentation. Coming soon: https://github.com/DiamondLightSource/SuRVoS

Imanol Luengo <imanol.luengo@nottingham.ac.uk>, Michele C. Darrow, Matthew C. Spink, Ying Sun, Wei Dai, Cynthia Y. He, Wah Chiu, Elizabeth Duke, Mark Basham, Andrew P. French, Alun W. Ashton

scientificsoftware@diamond.ac.uk

SLIDE 29

The ISIS Neutron and Muon Facility

SLIDE 30

ISIS

SLIDE 31
  • 30 neutron instruments
  • 3 muon instruments
  • 1400 individual users per year making 3000 visits
  • 800 experiments per year resulting in 450 publications
  • Diverse science
  • Fundamental condensed matter physics
  • Functional materials e.g. multiferroics, spintronics
  • Chemical spectroscopy e.g. catalysis and hydrogen storage
  • Engineering e.g. stress and fatigue in power plants and transportation
  • Solvents in industry
  • Structure of pharmaceutical compounds, biological membranes

ISIS

SLIDE 32

Peak Assignment in Inelastic Neutron Scattering

  • Vibrational motion of atoms is crucial for many properties of a material, e.g. how well it conducts electricity or heat
  • Peaks in an INS spectrum correspond to specific atomic vibrations
  • Peak assignment: which specific vibrational motions of atoms give rise to which peaks?

INS Spectrum of crystalline benzene

  • S. Parker and S. Mukhopadhyay (ISIS)
SLIDE 33

Modelling & Simulation for INS Peak Assignment

Calculated INS Spectrum of crystalline benzene

  • INS spectra can be computed for a given atomic structure
  • Calculations allow us to see which specific vibrational motions of atoms occur, and at what frequency
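As an illustration of the underlying calculation (not the ISIS codes themselves), vibrational frequencies follow from diagonalising the mass-weighted Hessian of the potential energy; a toy two-atom example with assumed masses and force constant:

    import numpy as np

    # Toy 1D "molecule": two masses joined by a spring of force constant k.
    # Vibrational frequencies come from omega^2 = eigvals(M^{-1/2} H M^{-1/2}).
    m1, m2, k = 1.0, 12.0, 500.0                 # assumed masses and force constant (arbitrary units)
    H = np.array([[ k, -k],
                  [-k,  k]])                     # Hessian of 0.5 * k * (x1 - x2)^2
    M_inv_sqrt = np.diag([m1 ** -0.5, m2 ** -0.5])
    omega2 = np.linalg.eigvalsh(M_inv_sqrt @ H @ M_inv_sqrt)
    freqs = np.sqrt(np.clip(omega2, 0.0, None))  # one zero (translation) and one vibrational mode
    print(freqs)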

  • L. Liborio
SLIDE 34

Materials Workbench

  • K. Dymkowski
SLIDE 35

The Central Laser Facility

SLIDE 36
  • National imaging facility with peer-reviewed, funded access
  • Located in the Research Complex at Harwell
  • Cluster of microscopes and lasers, with expert end-to-end multidisciplinary support
  • Operations and some development funded by STFC
  • Key developments funded through external grants (BBSRC, MRC)

OCTOPUS Facility in the CLF

With thanks to Dan Rolfe

SLIDE 37

Example: EGFR cell signalling in cancer

  • Has driven OCTOPUS single-molecule developments
  • Users in plant cell imaging are now catching up in scale of challenge
  • Part of a PhD project:
  • 1 experimental technique
  • 50 experimental conditions
  • 30 datasets for each condition
  • 1000 single-molecule tracks for each condition
  • Multiple properties & events of interest in each track
  • Comparison of just one property…

With thanks to Dan Rolfe

SLIDE 38
SLIDE 39

Large scale comparisons

With thanks to Dan Rolfe

SLIDE 40

Multidimensional single molecule tracking

  • Automated registration & tracking in multiple channels
  • Computer vision
  • Bayesian feature detection from astronomical galaxy detection
  • Instrumental metadata from acquisition
  • Flexible specification of many instrument configurations

Rolfe et al., Eur Biophys J, 2011

With thanks to Dan Rolfe

SLIDE 41

The JASMIN Environmental Science Super Data Cluster

SLIDE 42
SLIDE 43
SLIDE 44

Large data sets: satellite observations

SLIDE 45
SLIDE 46

Why JASMIN?

  • Urgency to provide better environmental predictions

  • Need for higher-resolution models
  • HPC to perform the computation
  • Huge increase in observational capability/capacity

But…

  • Massive storage requirement: observational data transfer, storage, processing
  • Massive raw data output from prediction models
  • Huge requirement to process raw model output into usable predictions (post-processing)

Hence JASMIN…

Images: ARCHER supercomputer (EPSRC/NERC); JASMIN (STFC/Stephen Kill)

SLIDE 47

JASMIN infrastructure

Part data store, part HPC cluster, part private cloud…

SLIDE 48

Non-blocking, low-latency CLOS tree network:

  • 16 leaf switches (MSX1024B-1BFS: 48 × 10 GbE + 12 × 40 GbE each), giving 48 × 16 = 768 non-blocking 10 GbE ports
  • 16 × 12 = 192 × 40 GbE uplinks into spine switches (S1036: 32 × 40 GbE each), i.e. 192 / 32 = 6 spines
  • 1,104 × 10 GbE ports in total; CLOS L3 ECMP OSPF routing (954 routes)
  • ~1,200 ports of expansion; max 36 leaf switches: 1,728 ports @ 10 GbE
  • Non-blocking, zero contention (48 × 10 Gb = 12 × 40 Gb uplinks)
  • Low latency (250 ns L3 per switch/router; 7-10 µs MPI)
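A quick sanity check of the port arithmetic above (leaf and spine counts are taken from the figures on this slide):

    leaves, spines = 16, 6
    leaf_10g, leaf_40g = 48, 12                   # MSX1024B-1BFS: 48 x 10 GbE down, 12 x 40 GbE up
    spine_40g = 32                                # S1036: 32 x 40 GbE

    edge_ports = leaves * leaf_10g                # 768 x 10 GbE host-facing ports
    uplinks = leaves * leaf_40g                   # 192 x 40 GbE leaf-to-spine links
    spines_needed = uplinks / spine_40g           # 6 spine switches
    oversubscription = (leaf_10g * 10) / (leaf_40g * 40)   # 1.0 means non-blocking
    print(edge_ports, uplinks, spines_needed, oversubscription)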

SLIDE 49

JASMIN “Science DMZ” Architecture

Diagram: supercomputer center and simple Science DMZ architectures

http://fasterdata.es.net/science-dmz-architecture

SLIDE 50

The UK Met Office UPSCALE campaign

Workflow: UPSCALE output of 5 TB per day is generated on HERMIT @ HLRS; data conversion & compression reduces this to 2.5 TB, which is transferred to JASMIN; an automation controller clears data from the HPC once it has been successfully transferred and validated.
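A hedged sketch of what such an automation controller does; the function, the checksum-based validation and the local copy standing in for the wide-area transfer are illustrative assumptions, not the Met Office implementation:

    import hashlib
    import shutil
    from pathlib import Path

    def transfer_and_clear(src: Path, dst_dir: Path) -> bool:
        """Move one compressed archive off the HPC system, validate it, then free the HPC disk."""
        dst = dst_dir / src.name
        shutil.copy2(src, dst)                       # stand-in for the wide-area transfer to JASMIN
        ok = (hashlib.md5(src.read_bytes()).hexdigest()
              == hashlib.md5(dst.read_bytes()).hexdigest())   # validate the copy
        if ok:
            src.unlink()                             # clear data from HPC only after validation
        return ok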

SLIDE 51

Example Data Analysis

  • Tropical cyclone tracking has become routine; 50 years of N512 data can be processed in 50 jobs in one day
  • Eddy vectors: an analysis we would not attempt on a server/workstation (a total of 3 months of processor time and ~40 GB of memory needed) completed in 24 hours in 1,600 batch jobs
  • The JASMIN/LOTUS combination has clearly demonstrated the value of cluster computing for data processing and analysis

M. Roberts et al., Journal of Climate 28(2), 574-596

SLIDE 52

The Ada Lovelace Center

SLIDE 53

The Experimental Data Challenge?

  • Data rates are increasing; facilities science is becoming more data intensive
  • Handling and processing data has become a bottleneck to producing science
  • Need to compare with complex models and simulations to interpret the data
  • Computing provision at users’ home institutions is highly variable
  • Consistent access to HTC/HPC is needed to process and interpret experimental data
  • Computational algorithms are becoming more specialised
  • More users without a facilities-science background
  • Need access to data, compute and software services
  • Allow more timely processing of data
  • Make the use of HPC routine, not a “tour de force”
  • Generate more and better science
SLIDE 54

Ada Lovelace Centre

The ALC will significantly enhance our capability to support the Facilities’ science programme:

  • Theme 1: Increase capacity in advanced software development for data analysis and interpretation
  • Theme 2: Develop a new generation of scientific data experts and scientific software engineers who can interact with science domain experts
  • Theme 3: Provide significant compute infrastructure for managing, analysing and simulating the data generated by the facilities and for designing next-generation Big-Science experiments
  • The focus is the science drivers and computational needs of the Facilities
SLIDE 55

ALC Pathfinder: Tomographic Reconstruction

  • Support in-experiment and post-experiment tomographic reconstruction
  • Round-trip the data to HPC CPU/GPU clusters in experiment time
  • Tomographic image reconstruction toolbox with different algorithms (a minimal sketch follows after this list)
  • High-throughput image reconstruction framework – time scheduled
  • Visualisation on the beamline or remote
  • An integral component of IMAT’s in-experiment data analysis capability through the ISIS Mantid software suite
  • Goal is to maximise the science from data collected on facility instruments
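The toolbox itself is not reproduced here; as a minimal hedged stand-in for one of its algorithms, filtered back-projection of a single simulated slice with scikit-image (version 0.19 or later assumed):

    import numpy as np
    from skimage.data import shepp_logan_phantom
    from skimage.transform import radon, iradon, rescale

    # Simulate one detector slice: project a test phantom over 180 degrees, then reconstruct it.
    image = rescale(shepp_logan_phantom(), 0.5)                 # 200 x 200 test slice
    angles = np.linspace(0.0, 180.0, 180, endpoint=False)
    sinogram = radon(image, theta=angles)                       # forward projection (what the detector records)
    recon = iradon(sinogram, theta=angles, filter_name="ramp")  # filtered back-projection
    print(float(np.abs(recon - image).mean()))                  # reconstruction error on the test slice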

Experiment, image processing and visualisation. STFC Scientific Computing: Erica Yang, Srikanth Nagella, Martin Turner, Derek Ross; STFC ISIS: Winfried Kockelmann, Genoveva Burca, Federico Montesino Pouzols; DLS: Mark Basham

IMAT SCARF MANTID

SLIDE 56

ALC Pathfinder: CCP4-DAaaS

Diagram: SCD Cloud stack (network, storage and servers; virtualisation; OS, run-time and data; applications), delivered as SaaS, PaaS and IaaS, and connected to NFS, SCARF, CCP4 and DLS

CCP4 – Macromolecular Crystallography suite

  • proteins, viruses and nucleic acids
  • determine macromolecular structures by X-ray crystallography
  • Used by DLS users
  • But need post-experimental access

Data Analysis as a Service

  • Remote access to data and compute via the SCD Cloud
  • CCP4 software maintained on the Cloud via VM packaging and distribution (CVMFS)
  • User Portal provides access to the right data, compute and workflows

Frazer Barnsley, Shirley Crompton, CCP4, et al

SLIDE 57

The ALC - Towards a “Super-facility”?

“A network of connected facilities, software and expertise to enable new modes of discovery”

Katie Antypas, Inder Monga, Lawrence Berkeley National Laboratory

Infrastructure + Software + Expertise With Common Interfaces and Transparent Access

Diagram: two connected facilities, each with data acquisition, a data catalogue, petabyte data storage, a parallel file system, HPC (CPU+GPU), visualisation and software

SLIDE 58

New Opportunities: Reproducible Science

  • Traceable science
  • Preservation
  • Provenance
  • Publishing
  • A tool for the user
  • Tracking progress
  • ‘RARE’ research
  • Robust
  • Accountable
  • Reproducible
  • Explainable
  • ALC can build in support for reproducible science

SXD single-crystal diffractometer

ISIS beamtime application (SXD, round 2010-1, RB1010274): “Reversible B-H Bond Activation at Cationic Rh(III) Centres: Structural Characterization of Key Hydrogen-containing Intermediates”. PI: Dr S. Aldridge, Department of Chemistry, University of Oxford

Data DOIs

Research lifecycle: crystallise research idea → seek & gain funding → perform research & gather data → analyse collected data → manage & curate research data → publish results

SLIDE 59

Jim Gray’s Vision: All Scientific Data Online

  • Many disciplines overlap and use data from other sciences
  • The Internet can unify all literature and data
  • Go from literature to computation to data and back to literature
  • Information at your fingertips – for everyone, everywhere
  • Increase Scientific Information Velocity
  • Huge increase in Science Productivity

(From Jim Gray’s last talk.) Diagram: raw data, derived and recombined data, and literature

SLIDE 60

Acknowledgements:

With thanks to Mark Basham, David Corney, Jonathan Churchill, Imanol Luengo, Barbara Montanari, Brian Matthews and Dan Rolfe