synchrotron.org.au
Big ideas + big data = real life benefits
Thursday 27 October 2016
Big Data at the Australian Synchrotron
Professor Andrew Peele
Director, Australian Synchrotron and ANSTO Representative in Victoria
ANSTO is a public research organisation with a variety of roles for the nation.
ANSTO operates Australia’s multipurpose nuclear reactor.
Research and Innovation
Science and Engineering
Commercial Businesses
Expert advice and support to Government and international agencies
Landmark and National Research Infrastructure
ANSTO Research Infrastructure:
Neutron Scattering
Radiobiology & Bioimaging
Isotope Tracing in Natural Systems
Radiotracers & Radioisotopes
Materials Development & Characterisation
Nuclear Stewardship
National Deuteration Facility
Australia’s National Research Priorities:
Soil and water
Environmental change and health
Food
Resources
Advanced manufacturing
Cyber security
Transport
Energy
CLAYTON VIC
LUCAS HEIGHTS NSW
CAMPERDOWN NSW
Several drugs have been developed following structural studies and target screening at the Australian Synchrotron and are now in clinical trials.
Venetoclax
DEVELOPED BY
WEHI, Genentech & Abbott
FOR TREATMENT OF
Chronic Lymphocytic Leukaemia
CSL362
DEVELOPED BY
St Vincent’s Institute of Medical Research & CSL
FOR TREATMENT OF
Acute Myeloid Leukaemia cancer cells
Momelotinib
DEVELOPED BY
Gilead Sciences
FOR TREATMENT OF
Myelofibrosis and Pancreatic Cancer
Nexvax2
DEVELOPED BY
Monash University with ImmusanT
FOR TREATMENT OF
Celiac Disease
Solanezumab
DEVELOPED BY
St Vincent’s Institute
FOR TREATMENT OF
Alzheimer’s Disease
PRMT5 inhibitors
DEVELOPED BY
Cancer Therapeutics CRC with Merck
FOR TREATMENT OF
Melanoma, Breast Cancer
[Chart: shifts requested versus shifts awarded per beamline (Far-IR, IMBL, IRM, MX1/MX2, PD, SAXS, SXR, XAS, XFM), including commercial access]
Access is peer reviewed based on merit consistent with international best-practice:
Quality of the proposal
National benefit and applications
Track record
The need for synchrotron radiation
Three application rounds per year
Operates 24/7 (apart from maintenance periods)
More than 5600 researcher visits per year
Around 1000 experiments
All facilities are oversubscribed. The success rate for applications is about 60%, about right for competition to breed excellence.
(Capacity for 30+ beamlines)
IRM: Infrared Microscope
Far-IR: Terahertz / Far-IR Spectroscopy
MX2: Micro-focused Crystallography
MX1: Macromolecular Crystallography
XFM: X-ray Fluorescence Microscopy (4–25 keV)
IMBL: Imaging and Medical Beamline (30–120 keV)
PD: Powder Diffraction (4–37 keV)
XAS: X-ray Absorption Spectroscopy (4–50 keV)
SAXS / WAXS: Small Angle X-ray Scattering / Wide Angle X-ray Scattering (6–20 keV)
SXR: Soft X-ray Spectroscopy (90–2500 eV)
Soft X-ray Imaging
Senior Scientific Software Engineer
Imaging and Medical Beamline: ~270 TB
X-ray Fluorescence Microscopy beamline: ~146 TB
1 Gigapixel image 40 × 9 mm = 66667 × 15000 (600 nm) pixels, raw data 250 GB, scan time 38 hrs.
Petrographic section of high grade ore from western shear zone of the Sunrise Dam gold deposit, WA
Sr:Fe:Rb map
Fisher et al., Miner. Deposita 50, 665-674 (2015)
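The gigapixel figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes square 600 nm pixels and decimal gigabytes, and the derived per-pixel values are estimates, not numbers from the talk.

```python
# Scan geometry and totals as quoted on the slide.
width_mm, height_mm = 40, 9
pixel_nm = 600
raw_gb = 250
scan_hours = 38

NM_PER_MM = 1_000_000

# Pixel counts along each axis (assumes square pixels of 600 nm).
cols = round(width_mm * NM_PER_MM / pixel_nm)
rows = round(height_mm * NM_PER_MM / pixel_nm)
pixels = cols * rows

# Derived estimates (assumption: 1 GB = 1e9 bytes).
bytes_per_pixel = raw_gb * 1e9 / pixels
pixels_per_second = pixels / (scan_hours * 3600)

print(f"{cols} x {rows} = {pixels:,} pixels")
print(f"about {bytes_per_pixel:.0f} bytes of raw data per pixel")
print(f"about {pixels_per_second:.0f} pixels per second")
```

The pixel count comes out just over one billion, which matches the "1 Gigapixel" label.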
Micro Crystallography (MX2) beamline
Sample Orientation / Diffraction Pattern
Data acquisition took 15 minutes.
With the next iteration of the detector it will take 18 seconds and can create raw data at ~4 GB/s!
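A rough sketch of what the MX2 detector upgrade means in numbers. The per-dataset volume is an upper bound that assumes the quoted ~4 GB/s peak rate is sustained for the whole 18 second acquisition; the talk does not state this.

```python
# Figures quoted on the slide.
old_acquisition_s = 15 * 60   # 15 minutes
new_acquisition_s = 18        # next detector iteration
peak_rate_gb_s = 4            # approximate peak raw data rate

speedup = old_acquisition_s / new_acquisition_s

# Assumption: the peak rate holds for the full acquisition (upper bound).
max_dataset_gb = peak_rate_gb_s * new_acquisition_s

print(f"speed-up: {speedup:.0f}x")
print(f"at most {max_dataset_gb} GB of raw data per acquisition")
```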
Scientific software
Infrastructure
Big Data definition
A volume of data that is too large or too complex to process by simple means, hence requiring significant investments in IT infrastructure, workflows and tools to capture, store, transfer, analyse and visualise datasets.
Storage:
Central storage: 650 TB
Additional storage at RDS: 440 TB
We still keep all historic user data (except IMBL)
Official data retention period: 6 – 12 months
HPC: MASSIVE (operated by Monash University)
42 nodes, each with
Imaging and Medical Beamline
X-ray Beam → Sample → Detector → (capture) Projections (individual TIF files) → (reconstruction) Slices → Visualisation and Analysis
Detector parameters (2B)
X Pixels: 2560
Y Pixels: 600
Bit Depth (Ruby): 16
Single Image size (MB): 2.9
Acquisition Time* (s): 0.05
Projections: 1800
Slices: 25
Total Dataset Size (GB): 132
Time (min): 38
Raw data size
~3 - 5 GB per minute
~3 samples / 2 hours
~12 samples / shift
~36 samples per day
~14 TB raw data in a 3 day experiment
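The raw data figures above follow from the detector parameters. The sketch below recomputes them, assuming that "Slices" means 25 serial scans per sample and that every projection is stored uncompressed. The quoted 132 GB falls between the decimal and binary totals, so the exact unit convention on the slide is a guess.

```python
# Detector parameters as quoted on the slide.
x_pixels, y_pixels = 2560, 600
bit_depth = 16
acquisition_s = 0.05
projections = 1800
serial_scans = 25  # listed as "Slices" on the slide (assumed meaning)

MIB = 2**20
GIB = 2**30

image_bytes = x_pixels * y_pixels * bit_depth // 8   # 3_072_000 bytes
dataset_bytes = image_bytes * projections * serial_scans

image_mib = image_bytes / MIB
dataset_gb = dataset_bytes / 1e9    # decimal gigabytes
dataset_gib = dataset_bytes / GIB   # binary gibibytes

# Pure exposure time, ignoring stage moves and readout overhead.
exposure_min = acquisition_s * projections * serial_scans / 60

print(f"single image: {image_mib:.2f} MiB")
print(f"dataset: {dataset_gb:.1f} GB ({dataset_gib:.1f} GiB)")
print(f"exposure time: {exposure_min:.1f} min")
```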
1) Stitching:
Stitches together serial scans into a single projection image at each angle
2560 x 600 px x 25 slices with 10% overlap
1800 projections
Full Sample (13620 slices): 116 GB per sample
2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
1 Slice (2560 x 2560 px), now 32 bit!
25 MB per slice
332 GB per sample (plus 8 bit (83 GB))
~ 60 TB total data potential for 1 experiment (3 days)!
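The per-slice and per-sample sizes quoted above can be reproduced under a few assumptions that are not stated in the talk: stitched projections are 2560 px wide, 13620 rows tall and 16 bit; reconstructed slices are 2560 x 2560 px; and sizes are reported in binary units (MiB, GiB).

```python
slice_px = 2560        # reconstructed slice is slice_px x slice_px
n_slices = 13620       # full sample
projections = 1800

MIB = 2**20
GIB = 2**30

# One reconstructed slice at 32 bit (4 bytes per pixel).
slice_bytes_32 = slice_px * slice_px * 4
slice_mib = slice_bytes_32 / MIB

# Full reconstructed volume at 32 bit and at 8 bit.
recon_32_gib = slice_bytes_32 * n_slices / GIB
recon_8_gib = slice_px * slice_px * 1 * n_slices / GIB

# Stitched projections (assumed 16 bit, one image per angle).
stitched_gib = slice_px * n_slices * 2 * projections / GIB

print(f"slice: {slice_mib:.0f} MiB")
print(f"reconstruction: {recon_32_gib:.1f} GiB (32 bit), {recon_8_gib:.1f} GiB (8 bit)")
print(f"stitched projections: {stitched_gib:.1f} GiB")
```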
2) Reconstruction with X-tract:
Uses projections to reconstruct tomographic slices of the sample
1 Slice (2560 x 2560 px), now 32 bit!
25 MB per slice
Full Sample (13620 slices) 22 TB
332 GB per sample
(plus 8 bit (83 GB))
~ 60 TB total data potential for 1 experiment (3 days)!
How to handle Big Data: Remote analysis instead of data transfer (sftp, hard drives etc.)
Paradigm shift: bring the users to the data and not the data to the users
Online (during the beamtime): user at beamline, IMBL detector → collect → local storage
Compute: imblcompute (48 CPUs, 2 GPUs, 512 GB RAM, 60 TB storage), accessed via VNC
Offline (post beamtime)
Run for each projection in parallel
X-tract uses CUDA for GPU acceleration
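"Run for each projection in parallel" works because each projection can be processed independently. The sketch below shows that pattern with Python's standard library only. It is not X-tract and does not use CUDA. The function `process_projection` and its dark-level subtraction are hypothetical stand-ins for a real per-projection step.

```python
from concurrent.futures import ThreadPoolExecutor


def process_projection(projection, dark_level=10):
    """Hypothetical per-projection step: subtract a dark level, clip at zero."""
    return [max(value - dark_level, 0) for value in projection]


def process_all(projections, workers=4):
    # Executor.map returns results in input order, so output i
    # still corresponds to projection i.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_projection, projections))


# Toy data: 8 "projections" of 8 pixels each.
projections = [[angle + pixel for pixel in range(8)] for angle in range(0, 40, 5)]
results = process_all(projections)
print(results[0])
```

Threads keep the example simple. For CPU-bound work in CPython, a process pool or a GPU kernel would be the more realistic choice.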
Cluster mode
Gigapixel image = 2,505 files, each 100 MB
Analysed using GeoPIXE software
Can run in Cluster mode for data sorting and extraction
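The file count above is consistent with the ~250 GB raw data size of the gigapixel scan. The sketch also shows one generic way to spread files across workers. The round-robin split and the worker count of 8 are illustrative assumptions, not a description of how GeoPIXE's cluster mode distributes work.

```python
n_files = 2505
file_mb = 100

total_gb = n_files * file_mb / 1000   # decimal gigabytes


def split_round_robin(items, n_workers):
    """Deal items out to workers one at a time, like cards."""
    return [items[i::n_workers] for i in range(n_workers)]


file_ids = list(range(n_files))               # stand-ins for file names
batches = split_round_robin(file_ids, 8)      # 8 workers is an assumption

print(f"total: {total_gb} GB")
print("files per worker:", [len(batch) for batch in batches])
```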
Each MASSIVE remote access session provides:
How to handle Big Data:
Automatic workflows
Example MX2 beamline: Workflow for automatic data processing and protein structure determination from MX diffraction images (close to real-time)
Design and implementation of all workflows were driven by the available infrastructure
e.g. MASSIVE and RDS services existed before the workflow
Workflows are custom built and can’t be re-used
Depend on external service provider
Next iteration:
ASCI – Australian Synchrotron Computing Infrastructure
Realtime diffraction spot finding at MX2
6 nodes, each with
2PB (raw) of Ceph storage
[Architecture diagram]
Routing + Security: Firewall, nginx
HTML5 based VNC connection
Automatic load balancing of docker containers
Infrastructure Service, Workflow Service
docker images (IMBL, XFM, SAXS/WAXS, …) → create instance → Analysis Session (one per user session)
Streaming of data instead of writing (intermediate) files to disk
Clever file formats (structure the data in an optimal way)
So far: TIF, text, proprietary binary files
Next: HDF5
Distributed computing
Common workflow system (graph based, distributed)
Microservice architecture
Automated metadata capture and data curation / preservation
[Diagram: graph of connected tasks]
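A graph-based workflow, as listed above, can be sketched with the standard-library `graphlib` module (Python 3.9+). The task names and dependencies here are hypothetical, and the "waves" only show which tasks could run concurrently. A real distributed system would dispatch each wave to workers.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
workflow = {
    "stitch": {"capture"},
    "reconstruct": {"stitch"},
    "metadata": {"capture"},
    "visualise": {"reconstruct"},
    "archive": {"reconstruct", "metadata"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()

# Collect tasks in "waves": everything in one wave has all its
# dependencies satisfied and could be executed in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```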
Send user home with information not with data
Simultaneous access to 10+ elements; Z > 14 ~ Si
High sensitivity - sub-ppm; sub-mM; 1e-12 g / s
Native contrast - no dyes or contrast agents necessary - but possible!
Quantitative
Non-destructive / minor damage
Extended penetration & DoF
Sensitive to chemical speciation via XANES spectroscopy
[Figure: sensitivity (ppt, ppb, ppm) versus spatial resolution (0.1 μm to 100 μm) for XFM, LA-ICP-MS, LMD-LA-ICP-MS, SEM-EDX and PIXE]
E. J. New, Dalton Trans. (2013), 42(9), p. 3210
Antony van der Ent, Hugh Harris, Martin de Jonge, Peter Erskine, Rachel Mak, Jolanta Mesjasz-Przybylowicz, Wojciech Przybylowicz, Emmanuelle Montargès-Pelletier, Alban Barnabas, Guillaume Echevarria, David Paterson and Daryl Howard
University of Adelaide
Australian Synchrotron
XFM @ AS: ~2 µm FWHM ~1e10 ph / s Sample position
1 Gpix image = 1 GB (pixels in image) x 2048 (spectral channels) x 384 detector pixels =
[Figure: Sr, Fe, Rb map and single-pixel spectrum, counts versus Energy [keV] (5 to 20 keV)]
After “training”, elemental maps are determined by performing a fit of the elemental and scatter intensities in each low-statistics single-pixel spectrum.
THIS FIT CAN BE LINEAR (but often isn’t)
Many empty channels suggest event mode data storage
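To illustrate why many empty channels favour event mode storage, the sketch below compares a dense 2048-channel spectrum with a list of (channel, count) pairs. The photon channels are made-up toy values, and the pair layout is only one possible event-style encoding, not the beamline's actual file format.

```python
CHANNELS = 2048  # spectral channels per detector pixel, as quoted on the slide

# Hypothetical low-statistics spectrum: only a handful of photons arrive.
photon_channels = [412, 412, 655, 901, 901, 901, 1300]

# Dense storage: one counter per channel, mostly zeros.
dense = [0] * CHANNELS
for channel in photon_channels:
    dense[channel] += 1

# Event-style storage: keep only the non-empty channels.
events = [(channel, count) for channel, count in enumerate(dense) if count]


def to_dense(events, channels=CHANNELS):
    """Rebuild the dense spectrum from (channel, count) pairs."""
    spectrum = [0] * channels
    for channel, count in events:
        spectrum[channel] = count
    return spectrum


print(f"dense: {len(dense)} values, event mode: {2 * len(events)} values")
```

The round trip through `to_dense` loses nothing, so the compact form is a pure storage saving whenever most channels are empty.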
AS brilliance – 10^19 ph/s/mrad^2/mm^2/0.1% bw
~10^15 ph/s/0.1% bw at AS front end
AS, photons/s:
Storage Ring: 10^15
Beamline/Mono: 10^10
Sample: 10^7
Detector: 10^6
1 MB/s = 86 GB/day
1 TB/day for all AS
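The data-rate conversion above is simple unit arithmetic. The sketch assumes decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB). The facility-wide average rate is derived here and is not a figure from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# One detector stream at 1 MB/s, as quoted.
stream_mb_s = 1
stream_gb_day = stream_mb_s * SECONDS_PER_DAY / 1000

# Facility total of 1 TB/day expressed as an average rate.
facility_tb_day = 1
facility_mb_s = facility_tb_day * 1_000_000 / SECONDS_PER_DAY

print(f"1 MB/s = {stream_gb_day} GB/day")
print(f"1 TB/day is about {facility_mb_s:.1f} MB/s on average")
```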
AS: 1 TB/day for all AS
New beamlines & new detection systems: 10 TB/day for all AS
AS Future: 10 EB/day for all??
XFM is being used to study the sub-micron metal distribution in grains such as wheat, barley and rice.
More than two billion people are micronutrient deficient
First international field trials in Philippines & Colombia
(µg g^-1 of rice)
             Iron   Zinc
Natural         2     16
Target         13     28
This Study     15     45
[Images: Wild Type, Johnson Strain]
research institutions
De-clogging ink-jet printer heads for MemJet
Materials for improved solar cell efficiency
Gold in Gum Leaves
Facilitating approval of generic oncology medication for Hospira
Testing safety of zinc nanoparticles in sunscreen
Venetoclax approved by FDA to combat chronic lymphocytic leukemia
Strengthening sheep leather
Over 1,284 protein structures solved
Cultural Heritage – finding hidden artworks
Iron enriched rice variants
Over 2,800 peer reviewed papers
Over 620 student theses
Zeobond green cement
Stainless magnesium
BioSAXS
MX3
Micro materials characterisation
Advanced diffraction and scattering
Medium Energy XAS
Micro-CT
X-ray fluorescence nanoprobe