Big ideas + big data = real life benefits, Thursday 27 October 2016




slide-1
SLIDE 1

synchrotron.org.au

Big ideas + big data = real life benefits

Thursday 27 October 2016

slide-2
SLIDE 2


Big Data at the Australian Synchrotron

Professor Andrew Peele

Director Australian Synchrotron and ANSTO Representative in Victoria

slide-3
SLIDE 3

Australian Nuclear Science and Technology Organisation

ANSTO is a public research organisation with a variety of roles for the nation. ANSTO operates Australia’s multipurpose nuclear reactor.

  • Research and Innovation
  • Science and Engineering
  • Commercial Businesses
  • Expert advice and support to Government and international agencies

slide-4
SLIDE 4

Australia’s National Research Priorities

Landmark and National Research Infrastructure

ANSTO Research Infrastructure

  • OPAL multi-purpose reactor
  • Australian Centre for Neutron Scattering
  • Australian Synchrotron
  • Centre for Accelerator Science

  • Radiobiology & Bioimaging
  • Isotope Tracing in Natural Systems
  • Radiotracers & Radioisotopes
  • Materials Development & Characterisation
  • Nuclear Stewardship
  • National Deuteration Facility

  • Soil and water
  • Environmental change and health
  • Food
  • Resources
  • Advanced manufacturing
  • Cyber security
  • Transport
  • Energy

slide-5
SLIDE 5

Multi-site organisation

  • CLAYTON VIC
  • LUCAS HEIGHTS NSW
  • CAMPERDOWN NSW

slide-6
SLIDE 6

Life-changing pharmaceutical breakthroughs

Several drugs have been developed following structural studies and target screening at the Australian Synchrotron and are now in clinical trials:

  • Venetoclax: developed by WEHI, Genentech & Abbott, for treatment of chronic lymphocytic leukaemia
  • CSL362: developed by St Vincent’s Institute of Medical Research & CSL, for treatment of acute myeloid leukaemia cancer cells
  • Momelotinib: developed by Gilead Sciences, for treatment of myelofibrosis and pancreatic cancer
  • Nexvax2: developed by Monash University with ImmusanT, for treatment of celiac disease
  • Solanezumab: developed by St Vincent’s Institute, for treatment of Alzheimer’s disease
  • PRMT5 inhibitors: developed by Cancer Therapeutics CRC with Merck, for treatment of melanoma and breast cancer

slide-7
SLIDE 7

Infrastructure for researchers

[Bar chart: shifts requested vs. shifts awarded for each beamline (Far-IR, IMBL, IRM, MX1/MX2, PD, SAXS, SXR, XAS, XFM); vertical axis 150 to 900 shifts]

[Pie chart: beamtime split between merit beamtime and facility time; segments labelled 20% and 80%]

  • Free of charge to users
  • Travel and accommodation paid
  • Expectation to publish

Including commercial access

slide-8
SLIDE 8

Infrastructure for researchers

Access is peer reviewed based on merit, consistent with international best practice:

  • Quality of the proposal
  • National benefit and applications
  • Track record
  • The need for synchrotron radiation

Weightings: 40%, 30%, 30%

  • Three application rounds per year
  • Operates 24/7 (apart from maintenance periods)
  • More than 5600 researcher visits per year
  • Around 1000 experiments

All facilities are oversubscribed. The success rate for applications is about 60%. About right for competition to breed excellence.

slide-9
SLIDE 9

Our current 10 operational beamlines

(Capacity for 30+ beamlines)

  • IRM: Infrared Microscope
  • Far-IR: Terahertz / Far-IR Spectroscopy
  • MX2: Micro-focused Crystallography
  • MX1: Macromolecular Crystallography
  • XFM: X-ray Fluorescence Microscopy (4–25 keV)
  • IMBL: Imaging and Medical Beamline (30–120 keV)
  • PD: Powder Diffraction (4–37 keV)
  • XAS: X-ray Absorption Spectroscopy (4–50 keV)
  • SAXS / WAXS: Small Angle X-ray Scattering / Wide Angle X-ray Scattering (6–20 keV)
  • SXR: Soft X-ray Spectroscopy (90–2500 eV)
  • Soft X-ray Imaging

slide-10
SLIDE 10


Managing Big Data at the Australian Synchrotron

Dr Andreas Moll

Senior Scientific Software Engineer

slide-11
SLIDE 11

Flavours of Big Data: Data volume

  • Imaging and Medical Beamline: ~270 TB
  • X-ray Fluorescence Microscopy beamline: ~146 TB

slide-12
SLIDE 12

Flavours of Big Data: Single images

1 Gigapixel image: 40 × 9 mm = 66667 × 15000 pixels (600 nm each), raw data 250 GB, scan time 38 hrs.

Petrographic section of high grade ore from western shear zone of the Sunrise Dam gold deposit, WA

Sr:Fe:Rb map

Fisher et al., Miner. Deposita 50, 665-674 (2015)
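
The pixel count follows directly from the scan area and the 600 nm pixel size. A minimal sketch of that arithmetic, using only the figures quoted for this image; the variable names are illustrative, and reading "250 GB" as decimal gigabytes is an assumption:

```python
# Scan geometry quoted for the gigapixel image.
width_mm, height_mm = 40, 9
pixel_nm = 600

# 1 mm = 1,000,000 nm
pixels_x = round(width_mm * 1_000_000 / pixel_nm)
pixels_y = round(height_mm * 1_000_000 / pixel_nm)
total_pixels = pixels_x * pixels_y

# Raw data volume, read as decimal gigabytes (an assumption).
raw_bytes = 250 * 10**9
bytes_per_pixel = raw_bytes / total_pixels

print(pixels_x, pixels_y, total_pixels)
print(f"about {bytes_per_pixel:.0f} raw bytes per image pixel")
```

On these numbers the raw data averages roughly 250 bytes per image pixel, far more than a single intensity value would need.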

slide-13
SLIDE 13

Flavours of Big Data: Data rate

Micro Crystallography (MX2) beamline

[Figure labels: sample orientation; diffraction pattern]

  • Data acquisition took 15 minutes
  • With the next iteration of the detector it will take 18 seconds and can create raw data at ~4 GB / s!
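
To put the MX2 data rate in context, here is a rough sketch of the volume one acquisition could produce. It assumes, purely for illustration, that the detector sustains the quoted ~4 GB/s for the full 18 seconds:

```python
rate_gb_per_s = 4          # approximate raw data rate quoted for the next detector
acquisition_s = 18         # quoted acquisition time with the next detector
previous_acquisition_s = 15 * 60

# Upper bound: assumes the peak rate is sustained for the whole acquisition.
dataset_gb = rate_gb_per_s * acquisition_s
speedup = previous_acquisition_s / acquisition_s

print(f"up to ~{dataset_gb} GB per dataset, {speedup:.0f}x faster acquisition")
```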

slide-14
SLIDE 14

Dealing with Big Data


Scientific software

  • Data management
  • Workflows
  • Real time analysis
  • Distributed computing
  • Automatic workflows for data reduction and processing
  • Remote analysis tools for users

Infrastructure

  • Storage
  • Compute (CPU + GPU)
  • Network

Big Data definition: A volume of data that is too large or too complex to process by simple means, hence requiring significant investments in IT infrastructure, workflows and tools to capture, store, transfer, analyse and visualise datasets.

slide-15
SLIDE 15

Infrastructure at the Australian Synchrotron

Storage:

  • Central storage: 650 TB
  • Additional storage at RDS: 440 TB
  • We still keep all historic user data (except IMBL)
  • Official data retention period: 6–12 months

HPC: MASSIVE (operated by Monash University)

  • Batch system (based on SLURM)
  • Remote Desktop environment
  • Realtime visualisation

42 nodes, each with

  • 2x6 core X5650 CPUs
  • 48 GB RAM
  • 2 NVIDIA M2070 GPUs
  • 58 TB GPFS file system
slide-16
SLIDE 16

Data collection and processing


Imaging and Medical Beamline

  • Three experimental enclosures for various resolutions and image modalities
  • Largest beam in the world, up to 540 x 48 mm in 3B
  • High-flux from the superconducting multipole wiggler
  • Dedicated near-beam surgery and animal holding and preparation facilities.
  • All with Computed Tomography (CT) capabilities
slide-17
SLIDE 17

Computed Tomography

[Diagram: X-ray beam → Sample → Detector → (capture) Projections (individual TIF files) → (reconstruction) Slices → Visualisation and Analysis]
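
The step from projections to slices can be illustrated with a small sketch. It assumes a parallel-beam geometry (not stated on the slide), in which each reconstructed slice depends only on one detector row taken from every projection angle; that stack of rows is commonly called a sinogram. The function name and toy data are hypothetical, and the reconstruction itself is not shown.

```python
def sinogram_for_slice(projections, row):
    """Collect detector row `row` from every projection (one projection per angle)."""
    return [projection[row] for projection in projections]

# Toy data: 3 projection angles, each a 2-row x 4-column image.
projections = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[9, 10, 11, 12], [13, 14, 15, 16]],
    [[17, 18, 19, 20], [21, 22, 23, 24]],
]

# The sinogram for slice 1 has one line per angle.
sino = sinogram_for_slice(projections, row=1)
print(sino)
```

Because every slice only needs its own sinogram, slices can be reconstructed independently of one another.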

slide-18
SLIDE 18

Computed Tomography

Detector parameters (2B):

  • X pixels: 2560
  • Y pixels: 600
  • Bit depth (Ruby): 16
  • Single image size (MB): 2.9
  • Acquisition time* (s): 0.05

Raw data size:

  • Projections: 1800
  • Slices: 25
  • Total dataset size (GB): 132
  • Time (min): 38

  • ~3 - 5 GB per minute
  • ~3 samples / 2 hours
  • ~12 samples / shift
  • ~36 samples per day
  • ~14 TB raw data in a 3 day experiment
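
The raw data figures can be checked from the detector parameters. The sketch below is a back-of-the-envelope calculation, not facility code. It assumes two bytes per 16-bit pixel and treats the 25 "slices" as 25 serial scans of 1800 projections each. The exact byte count comes out between 129 GiB and 138 GB, so the quoted 132 GB depends on the rounding convention used.

```python
x_pixels, y_pixels = 2560, 600
bytes_per_pixel = 16 // 8          # 16-bit detector
projections = 1800
serial_scans = 25                  # listed as "Slices"
exposure_s = 0.05

image_bytes = x_pixels * y_pixels * bytes_per_pixel
dataset_bytes = image_bytes * projections * serial_scans
scan_minutes = exposure_s * projections * serial_scans / 60

print(f"single image: {image_bytes / 1024**2:.1f} MiB")
print(f"dataset: {dataset_bytes / 1e9:.0f} GB = {dataset_bytes / 1024**3:.0f} GiB")
print(f"scan time: {scan_minutes:.1f} min")

# Throughput, using the rounded 132 GB per sample and 36 samples per day.
samples_per_day = 36
experiment_tb = 132 * samples_per_day * 3 / 1000
print(f"3-day experiment: {experiment_tb:.1f} TB")
```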

slide-19
SLIDE 19

Computed Tomography

1) Stitching: stitches together serial scans into a single projection image at each angle

  • 2560 x 600 px x 25 slices with 10% overlap
  • 1800 projections
  • 116 GB per sample

2) Reconstruction with X-tract: uses projections to reconstruct tomographic slices of the sample

  • 1 slice (2560 x 2560 px), now 32 bit! 25 MB per slice
  • Full sample (13620 slices): 332 GB per sample (plus 8 bit (83 GB))

~60 TB total data potential for 1 experiment (3 days)!
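
The per-sample sizes for stitching and reconstruction can be reproduced to within rounding if "MB" and "GB" are read as binary units (MiB, GiB). That reading is an assumption, as is the 16-bit depth of the stitched projections. A minimal sketch:

```python
width = 2560
stitched_rows = 13620       # full-sample height quoted for the stitched data
projections = 1800
GIB = 1024**3

# 1) Stitched 16-bit projections: one (width x stitched_rows) image per angle.
stitched_gib = width * stitched_rows * 2 * projections / GIB

# 2) Reconstructed slices: width x width pixels, 32-bit values, one slice per row.
slice_mib = width * width * 4 / 1024**2
recon32_gib = slice_mib * stitched_rows / 1024
recon8_gib = recon32_gib / 4   # the same slices stored as 8-bit values

print(f"stitched projections: {stitched_gib:.1f} GiB")
print(f"one slice: {slice_mib:.0f} MiB")
print(f"32-bit slices: {recon32_gib:.1f} GiB, 8-bit slices: {recon8_gib:.1f} GiB")
```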

slide-20
SLIDE 20

Computed Tomography

2) Reconstruction with X-tract: uses projections to reconstruct tomographic slices of the sample

  • 1 slice (2560 x 2560 px), now 32 bit! 25 MB per slice
  • Full sample (13620 slices): 332 GB per sample (plus 8 bit (83 GB))
  • 22 TB

~60 TB total data potential for 1 experiment (3 days)!

slide-21
SLIDE 21

Online vs offline

How to handle Big Data: Remote analysis instead of data transfer (sftp, hard drives etc.)

Paradigm shift: bring the users to the data and not the data to the users

[Diagram: the IMBL detector collects data to local storage and compute (imblcompute: 48 CPUs, 2 GPUs, 512 GB RAM, 60 TB storage); the user at the beamline connects via VNC. Online = during the beamtime; offline = post beamtime.]

  • Run for each projection in parallel
  • X-tract uses CUDA for GPU acceleration
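
"Run for each projection in parallel" works because projections are independent of each other. The sketch below shows the pattern with a flat-field correction as a stand-in processing step. That step is an assumption chosen for illustration, not a description of what X-tract does, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def flat_field_correct(raw, flat, dark):
    """Normalise one projection: (raw - dark) / (flat - dark), pixel by pixel."""
    return [
        [(r - d) / (f - d) for r, f, d in zip(raw_row, flat_row, dark_row)]
        for raw_row, flat_row, dark_row in zip(raw, flat, dark)
    ]

def correct_all(projections, flat, dark, workers=4):
    # Each projection is independent, so the work can be handed to any worker;
    # map() still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: flat_field_correct(p, flat, dark), projections))

flat = [[10.0, 10.0], [10.0, 10.0]]
dark = [[2.0, 2.0], [2.0, 2.0]]
projections = [[[6.0, 10.0], [2.0, 4.0]] for _ in range(8)]
corrected = correct_all(projections, flat, dark)
print(corrected[0])
```

Threads keep the example short; for CPU-bound pure Python work a process pool or GPU kernels would be the more realistic choice.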

slide-22
SLIDE 22

Remote access with Strudel


slide-23
SLIDE 23

Gigapixel image on MASSIVE

How to handle Big Data: Cluster mode

  • Gigapixel image = 2,505 files, each 100 MB
  • Analysed using GeoPIXE software
  • Can run in cluster mode for data sorting and extraction: partition data, parallelise sorting through data

Each MASSIVE remote access session provides:

  • 12 CPUs
  • 1 GPU
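
The "partition data" step can be pictured as splitting the 2,505 files into one chunk per CPU. This is a minimal sketch under that assumption; the file names are hypothetical and GeoPIXE's own cluster mode may divide the work differently.

```python
def partition(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

files = [f"scan_{i:04d}.dat" for i in range(2505)]   # hypothetical file names
chunks = partition(files, 12)                        # one chunk per CPU in a session
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Each worker would then sort its own chunk, and the partial results would be combined at the end.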
slide-24
SLIDE 24

‘Realtime’ processing and data reduction


Automatic workflows

  • reduce data by averaging data, removing unwanted data, etc.
  • first, quick reconstruction of ‘live data’ for quick user feedback
  • full processing of the data where possible

Example MX2 beamline: Workflow for automatic data processing and protein structure determination from MX diffraction images (close to real-time)

  • 1. single shot assessment of space group and quality metric
  • 2. data reduction of datasets with special care for the type of experiment (chemical or protein crystallography)
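
As one concrete reading of "reduce data by averaging data", the sketch below averages consecutive groups of frames pixel by pixel. The frame layout (a flat list of pixel values per frame) and the group size are assumptions for illustration only.

```python
def average_frames(frames, group):
    """Average consecutive groups of `group` frames, pixel by pixel.

    A trailing group with fewer than `group` frames is dropped.
    """
    reduced = []
    for start in range(0, len(frames) - group + 1, group):
        block = frames[start:start + group]
        reduced.append([sum(pixel_values) / group for pixel_values in zip(*block)])
    return reduced

frames = [[1, 2, 3], [3, 4, 5], [10, 20, 30], [30, 40, 50], [7, 7, 7]]
reduced = average_frames(frames, group=2)
print(reduced)
```

Averaging in pairs halves the stored volume here; larger groups trade time resolution for further reduction.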
slide-25
SLIDE 25

What we have learned

Design and implementation of all workflows were driven by the available infrastructure

e.g. MASSIVE and RDS services existed before the workflow

  • Workflows are custom built and can’t be re-used
  • Depend on external service provider

Next iteration:

  • Decouple workflow and infrastructure
  • Generic workflow software
  • Microservice architecture

ASCI – Australian Synchrotron Computing Infrastructure

Realtime diffraction spot finding at MX2

  • Uses newly developed workflow software
  • Check quality of recorded data live
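
The slides do not say how the MX2 spot finder works. As a generic illustration of the idea only, the sketch below counts connected groups of pixels brighter than mean + k·sigma. The threshold rule, the 4-neighbour connectivity and every name are assumptions.

```python
from statistics import mean, pstdev

def find_spots(image, k=3.0):
    """Count connected bright regions above mean + k * sigma (4-neighbour connectivity)."""
    pixels = [value for row in image for value in row]
    threshold = mean(pixels) + k * pstdev(pixels)
    rows, cols = len(image), len(image[0])
    seen = set()
    spots = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or (r, c) in seen:
                continue
            spots += 1                      # new region found; flood-fill it
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and image[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return spots

# Toy frame: flat background with one two-pixel spot and one single-pixel spot.
image = [[1] * 10 for _ in range(10)]
image[2][2] = image[2][3] = 100
image[7][7] = 100
n_spots = find_spots(image)
print(n_spots)
```

A spot count per frame is one simple live quality signal: a sudden drop suggests the crystal has moved out of the beam or stopped diffracting.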
slide-26
SLIDE 26

ASCI - Australian Synchrotron Computing Infrastructure


6 nodes, each with

  • 48 CPUs
  • 2 GPUs (NVIDIA GeForce GTX 1080)
  • 512 GB RAM

2PB (raw) of Ceph storage

[Architecture diagram components:]

  • Firewall, nginx: routing + security
  • HTML5 based VNC connection
  • Analysis sessions
  • Infrastructure service and workflow service
  • Automatic load balancing of docker containers
  • docker images (IMBL, XFM, SAXS/WAXS) used to create instances
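
How ASCI balances containers is not described in the slides. One simple policy, shown here only as a hypothetical sketch, places each new analysis session on the node with the most free CPUs. The node names and the dictionary layout are invented for the example.

```python
def place_session(nodes, cpus_needed):
    """Place a session on the node with the most free CPUs; return its name or None."""
    def free(node):
        return node["total_cpus"] - node["used_cpus"]

    candidates = [node for node in nodes if free(node) >= cpus_needed]
    if not candidates:
        return None                      # no node can fit the request
    best = max(candidates, key=free)     # ties go to the first node listed
    best["used_cpus"] += cpus_needed
    return best["name"]

# Six nodes with 48 CPUs each, two of them already partly busy.
nodes = [{"name": f"asci-{i}", "total_cpus": 48, "used_cpus": 0} for i in range(6)]
nodes[0]["used_cpus"] = 40
nodes[1]["used_cpus"] = 12

first = place_session(nodes, 8)
print(first)
```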

slide-27
SLIDE 27

The future of data processing

  • Streaming of data instead of writing (intermediate) files to disk
  • Clever file formats (structure the data in an optimal way). So far: TIF, text, proprietary binary files. Next: HDF5
  • Distributed computing
  • Common workflow system (graph based, distributed)
  • Microservice architecture: split monolithic applications into independent services; allows for more flexibility and scalability
  • Automated metadata capture and data curation / preservation

[Diagram: graph of connected tasks]
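
A graph-based workflow can be sketched with Python's standard `graphlib` module (Python 3.9+): tasks are nodes, dependencies are edges, and a task becomes ready once everything it depends on is done. The task names below are hypothetical and the tasks do nothing; a distributed runner would dispatch each batch of ready tasks to different workers.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key depends on the tasks in its set.
graph = {
    "load": set(),
    "flat_field": {"load"},
    "stitch": {"flat_field"},
    "reconstruct": {"stitch"},
    "preview": {"flat_field"},
    "archive": {"reconstruct", "preview"},
}

def run(graph, tasks):
    """Run every task once, never before its dependencies; return the execution order."""
    order = []
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    while sorter.is_active():
        ready = sorter.get_ready()       # tasks whose dependencies are all done
        for name in sorted(ready):       # these could run concurrently
            tasks[name]()
            order.append(name)
            sorter.done(name)
    return order

tasks = {name: (lambda: None) for name in graph}
order = run(graph, tasks)
print(order)
```

Keeping the graph as plain data separates the workflow description from the infrastructure that executes it.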

slide-28
SLIDE 28

Summary


  • Big Data requires clever storage, file formats and processing algorithms
  • Bring the users to the data and not the data to the users
  • The facility that provides users with the best computing environment will have a competitive edge

Send user home with information not with data

slide-29
SLIDE 29

The Australian Synchrotron is an information pipeline

slide-30
SLIDE 30

XFM - ideally suited to study bio-metals

  • Simultaneous access to 10+ elements; Z > 14 ~ Si
  • High sensitivity: sub-ppm; sub-mM; 1e-12 g / s
  • Native contrast: no dyes or contrast agents necessary, but possible!
  • Quantitative
  • Non-destructive / minor damage
  • Extended penetration & DoF: study intact cells & sections
  • Sensitive to chemical speciation via XANES spectroscopy

[Chart: sensitivity (ppt, ppb, ppm) vs. spatial resolution (0.1 μm to 100 μm) for XFM, LA-ICP-MS, LMD-LA-ICP-MS, SEM-EDX and PIXE]

EJ New, Dalton Trans (2013), 42(9), pp 3210

slide-31
SLIDE 31

Data

Antony van der Ent, Hugh Harris, Martin de Jonge, Peter Erskine, Rachel Mak, Jolanta Mesjasz-Przybylowicz, Wojciech Przybylowicz, Emmanuelle Montargès-Pelletier, Alban Barnabas, Guillaume Echevarria, David Paterson and Daryl Howard

University of Adelaide, Australian Synchrotron

slide-33
SLIDE 33

The Maia Detector

  • 1. Form a spot on a specimen
  • 2. Collect fluorescence + scatter in 384 detector pixels and stage position signals while scanning sample

XFM @ AS: ~2 µm FWHM, ~1e10 ph / s

[Figure labels: sample position; fluorescence spectrum; fitted spectrum (integrated)]

slide-34
SLIDE 34

Naïve Data Storage

1 Gpix image = 10^9 (pixels in image) x 2048 (spectral channels) x 384 (detector pixels) = 786 TB for one image (at one byte per value)!

[Image: Sr:Fe:Rb map]
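
The naive estimate is a straight product of image pixels, spectral channels and detector elements. The sketch below reproduces it under the assumption of one byte per stored value (the slide gives no value size), and compares it with the 250 GB of raw data quoted earlier for the same gigapixel image.

```python
pixels = 10**9              # 1 Gpix image
channels = 2048             # spectral channels per detector element
detector_elements = 384     # Maia detector pixels
bytes_per_value = 1         # assumption: one byte per stored count

naive_bytes = pixels * channels * detector_elements * bytes_per_value
naive_tb = naive_bytes / 10**12

recorded_tb = 0.25          # 250 GB raw data quoted for the gigapixel image
ratio = naive_tb / recorded_tb

print(f"naive storage: {naive_tb:.0f} TB, about {ratio:.0f}x the recorded raw data")
```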

slide-35
SLIDE 35

Event Mode Data Storage

[Plot: fitted spectrum, energy axis 5 to 20 keV]

After “training”, elemental maps are determined by performing a fit of the elemental & scatter intensities in each low-statistics single-pixel spectrum.

THIS FIT CAN BE LINEAR (but often isn’t)

Many empty channels suggest event mode data storage
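
When most channels of a single-pixel spectrum are empty, storing only what was detected is far more compact than storing every channel. The sketch below is a simplified illustration of that idea: it keeps (channel, count) pairs for non-empty channels. A real event-mode stream would instead record each photon event as it arrives, and the channel numbers here are made up.

```python
def to_events(spectrum):
    """Keep only the non-empty channels as (channel, count) pairs."""
    return [(channel, count) for channel, count in enumerate(spectrum) if count]

def to_spectrum(events, n_channels):
    """Rebuild the dense spectrum from the stored pairs."""
    spectrum = [0] * n_channels
    for channel, count in events:
        spectrum[channel] += count
    return spectrum

# A low-statistics single-pixel spectrum: 2048 channels, only three hit.
spectrum = [0] * 2048
spectrum[640] = 3
spectrum[641] = 1
spectrum[1290] = 2

events = to_events(spectrum)
print(events, f"({len(events)} entries instead of {len(spectrum)})")
```

The dense spectrum can be rebuilt at any time, so nothing is lost by storing only the events.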

slide-36
SLIDE 36

How many events are there?

AS brilliance: 10^19 ph/s/mrad^2/mm^2/0.1% bw

~10^15 ph/s/0.1% bw at AS front end

slide-37
SLIDE 37

Event Mode Data Storage

Photons/s:

  • Storage ring: 10^15
  • Beamline/Mono: 10^10
  • Sample: 10^7
  • Detector: 10^6

1 MB/s, 86 GB/day

1 TB/day for all AS
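
The step from photon rate to data rate is simple arithmetic. This sketch assumes one byte per detected event, which is what 1 MB/s at 10^6 detected photons per second implies; the actual event size is not given.

```python
events_per_s = 10**6        # photons/s reaching the detector
bytes_per_event = 1         # assumption implied by the 1 MB/s figure
seconds_per_day = 86_400

rate_mb_per_s = events_per_s * bytes_per_event / 10**6
per_day_gb = events_per_s * bytes_per_event * seconds_per_day / 10**9

print(f"{rate_mb_per_s:.0f} MB/s, {per_day_gb:.1f} GB/day per beamline")
```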

slide-38
SLIDE 38

What next?

  • AS: 1 TB/day for all AS
  • New beamlines & new detection systems: 10 TB/day for all AS
  • AS Future: 10 EB/day for all??

slide-39
SLIDE 39

XFM is being used to study the sub-micron metal distribution in grains such as wheat, barley and rice.

Big Data = Supercharging food

More than two billion people are micronutrient deficient

First international field trials in Philippines & Colombia

(µg g-1 of rice)    Iron    Zinc
Natural             2       16
Target              13      28
This Study          15      45

[Images: Wild Type; Johnson Strain]

  • B. K. R. Trijatmiko, et al. Scientific Reports, 6, 19792 (2016).
  • B. Kyriacou, et al., J. Cereal Science, 59, 173 (2014).
slide-40
SLIDE 40

Big Data = Benefits to industry

1 % 11.6 %

  • Through research programs
  • > 200 companies interacting with University and research institutions

  • Access to researchers
  • Access to Grant funding
  • Access to facilities
  • Internal Beamline-Industry Group
slide-41
SLIDE 41

Big Data = Real-life benefits

  • De-clogging inkjet printer heads for MemJet
  • Materials for improved solar cell efficiency
  • Gold in Gum Leaves
  • Facilitating approval of generic oncology medication for Hospira
  • Testing safety of zinc nanoparticles in sunscreen
  • Venetoclax approved by FDA to combat chronic lymphocytic leukemia
  • Strengthening sheep leather
  • Over 1,284 protein structures solved
  • Cultural Heritage – finding hidden artworks
  • Iron enriched rice variants
  • Over 2,800 peer reviewed papers
  • Over 620 student theses
  • Zeobond green cement
  • Stainless magnesium

slide-42
SLIDE 42

Making the Big Data challenge even bigger

slide-43
SLIDE 43

New beamlines

  • BioSAXS
  • MX3
  • Micro materials characterisation
  • Advanced diffraction and scattering
  • Medium Energy XAS
  • Micro-CT
  • X-ray fluorescence nanoprobe

slide-44
SLIDE 44

New beam lines = Meet demand, fill gaps

  • Geosciences
  • Health / Medical
  • Advanced materials

  • High energy 3D Imaging
  • High throughput protein structure
  • Small crystal capacity
  • Residual stress analysis
  • Combined spectroscopy, diffraction and imaging

slide-45
SLIDE 45

New beam lines = More real-life benefits

  • Geosciences: Better use of resources
  • Health / Medical: Better drugs
  • Advanced materials: Better materials

slide-46
SLIDE 46

Questions at end
