Data, Data everywhere with …
French N+N meeting, DTI, London
Prof. Malcolm Atkinson
Director www.nesc.ac.uk
www.ogsadai.org.uk
3rd November 2003
Contents
Data: The Lingua Franca of e-Science
Data: The Challenge for e-Science
OGSA-DAI
Closing the information loop – between lab and computational model.
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
Harnessing Genomics Programme
Slide from Professor Muffy Calder, Glasgow
Shared data Public curated data
Enormous quantities of data: Petabytes
For an increasing number of communities, the gating step is not collection but analysis
Ubiquitous Internet: >100 million hosts
Collaboration & resource sharing the norm
Security and Trust are crucial issues
Ultra-high-speed networks: >10 Gb/s
Global optical networks
Bottlenecks: last kilometre & firewalls
Huge quantities of computing: >100 Top/s
Moore’s law gives us all supercomputers
Organising their effective use is the challenge
Moore’s law everywhere
Instruments, detectors, sensors, scanners, …
Organising their effective use is the challenge
Derived from Ian Foster’s slide at SSDBM, July 2003
grouped by wavelength
areas of the sky
Data and images courtesy Alex Szalay, Johns Hopkins
Slide from Ian Foster’s SSDBM 2003 keynote
PDB Content Growth Bases 45,356,382,990
15 minutes
10 hours ($1000)
7 disks = $5000 (SCSI)
100 Watts
5.6 Kg
Inside machine
2 months
14 months ($1 million)
6800 Disks + 490 units + 32 racks = $7 million
100 Kilowatts
33 Tonnes
60 m²
May 2003, approximately correct
See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
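As a rough, independent illustration of why moving data at this scale is costly, the sketch below computes ideal transfer times from size and link speed alone. The chosen sizes and link speeds are assumptions made for the example, not figures taken from the slide, and real transfers are slower because of protocol overhead and the bottlenecks noted earlier.

```python
def transfer_time_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Decimal units, assumed for this illustration.
TERABYTE = 10**12
PETABYTE = 10**15
GBIT_PER_S = 10**9

# One terabyte over a 1 Gb/s link, and one petabyte over a 10 Gb/s link.
tb_hours = transfer_time_seconds(TERABYTE, 1 * GBIT_PER_S) / 3600
pb_days = transfer_time_seconds(PETABYTE, 10 * GBIT_PER_S) / 86400

print(f"1 TB over 1 Gb/s: {tb_hours:.1f} hours")
print(f"1 PB over 10 Gb/s: {pb_days:.1f} days")
```

These are lower bounds only; they say nothing about cost, power or footprint, which the slide also lists.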
Combining approaches Combining skills Sharing resources
Data Access & Integration a Ubiquitous Requirement
Scale, heterogeneity, distribution, dynamic variation
Unpredictable (autonomous) development of both
Global Production of Published Data Volume↑ Diversity↑ Combination ⇒ Analysis ⇒ Discovery
Data Huggers; Meagre metadata; Ease of Use; Optimised integration; Dependability
Specialised Indexing; New Data Organisation; New Algorithms; Varied Replication; Shared Annotation; Intensive Data & Computation
Fundamental Principles; Approximate Matching; Multi-scale optimisation; Autonomous Change; Legacy structures; Scale and Longevity; Privacy and Mobility
Data Intensive Users Data Intensive Applications for Science X Simulation, Analysis & Integration Technology for Science X
OGSI: Interface to Grid Infrastructure Compute, Data & Storage Resources Distributed Generic Virtual Data Access and Integration Layer
Structured Data Integration Structured Data Access
Structured Data
Relational XML Semi-structured
Registry Job Submission Data Transport Resource Usage Banking Brokering Workflow Authorisation
30% of Application Requirements
Virtual Integration Architecture
Copies of data in multiple locations
Composition of multiple sources
DataDescription, DataAccess, DataFactory, DataManagement
E.g. DAIS
OGSA, Query languages, Java, data transport
controlled exposure of heterogeneous data resources via an OGSI-compliant grid
access to these resources via common interfaces using existing underlying query mechanisms
(ultimately) data integration across distributed data resources
Reference implementation of GGF DAIS WG standard
Balance standard tracking & testing with stability for application and product developers
Registry for sources
responds with Factory handle
access to database
handle of GDS to client
with XPath, SQL, etc
client as XML
SOAP/HTTP service creation API interactions
Registry Factory
GridDataService to manage access Grid Data Service Client XML / Relational database
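The interaction pattern named in this diagram (a Registry that yields a Factory handle, a Factory that creates a Grid Data Service, and a client that queries the service and receives XML) can be sketched in a few lines. This is a minimal in-process illustration only: the classes `Registry`, `Factory` and `GridDataService`, their method names, and the `genes` table are all hypothetical and are not the OGSA-DAI API; an in-memory SQLite database stands in for the relational resource, and no SOAP/HTTP transport is modelled.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def perform(self, sql: str) -> str:
        """Run a SQL query and return the rows as an XML string."""
        cursor = self._connection.execute(sql)
        columns = [description[0] for description in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates a GDS for the database it fronts."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def create_grid_data_service(self) -> GridDataService:
        return GridDataService(self._connection)


class Registry:
    """Hypothetical registry: maps a topic to the Factory for its data."""

    def __init__(self) -> None:
        self._factories: dict[str, Factory] = {}

    def register(self, topic: str, factory: Factory) -> None:
        self._factories[topic] = factory

    def find_factory(self, topic: str) -> Factory:
        return self._factories[topic]


# Provider side: publish one database under the topic "x" (made-up data).
database = sqlite3.connect(":memory:")
database.execute("CREATE TABLE genes (name TEXT, length INTEGER)")
database.execute("INSERT INTO genes VALUES ('x1', 1200)")
registry = Registry()
registry.register("x", Factory(database))

# Client side: registry lookup, service creation, query, XML result.
factory = registry.find_factory("x")
gds = factory.create_grid_data_service()
result_xml = gds.perform("SELECT name, length FROM genes")
print(result_xml)
```

In a real deployment each of the three steps would be a remote call and the handles would be service references rather than Python objects.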
[Figure: a Requestor stub and a Consumer stub, each behind a Client API, exchanging Data Sets in numbered steps 1 to 4]
Established
OGSA-DAI: 1183 downloads
461 R3 & R3.0.2 >379 in UK
50 downloads of R3.0.0 of R3.0.2 within a week
Recent performance analysis ⇒ R3.0.3 Nov 03
DQP prototype: 77 downloads
Since 1st September 2003
471 registered users
[Chart: Cumulative Downloads By Time. x-axis: Date, 15/01/2003 to 15/10/2003; y-axis: Number of Downloads, up to 1400; series: R1, R1.5, R2, R2.5, R3, R3.0.2, Courses]
[Chart: Downloads By Country - Release 3. Labelled values: 128, 78, 83, 79, 30. Countries: United Kingdom, United States, China, Japan, Germany, Unknown, Austria, Korea (Republic of), Brazil, India, Canada, Hong Kong, Hungary, Sweden, Australia, Switzerland, Italy, Taiwan, France, Poland, Netherlands, Romania, Russian Federation, Singapore, Ireland]
Performance & monitoring
Additional DBMSs supported
Additional SQL supported
DBMS management operations: archive, restore, bulk load
File access
Client libraries
Installation wizard
User support, courses, training material, performance report
Compliance with DAIS standards proposal
Distributed Relational Query Processing
Improved dependability and security integration
Extended & integrated XML and relational facilities
Distributed transaction participation
Coordinated OGSA-DAI contributor community
Integrated with GT3
New facilities depend on user priorities, context and research
OGSA-DAI components from contributor community
Maintainable release for the user community
GDTS2 GDS3 GDS2 GDTS1 Sx Sy
sources of data about “x” & “y”
responds with Factory handle
integration from resources Sx and Sy
GridDataServices network
returns handle of GDS to client
scripts each has a set of queries to GDS with XPath, SQL, etc
analyst as formatted binary described in a standard XML notation SOAP/HTTP service creation API interactions
Data Registry Data Access & Integration master Client
Analyst
XML database Relational database GDS GDS GDS GDTS GDTS
tells analyst GDS1
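The integration step in this scenario, combining data about "x" and "y" drawn from two separate resources Sx and Sy, can be illustrated with a small client-side join. This is a sketch under stated assumptions: the table names `x_data` and `y_data`, the `sample_id` key, the sample rows and the helper functions are all invented for the example, two in-memory SQLite databases stand in for Sx and Sy, and the data transport services (GDTS) shown in the diagram are not modelled.

```python
import sqlite3


def make_source(table: str, rows: list[tuple[str, float]]) -> sqlite3.Connection:
    """Create an in-memory database standing in for one data resource."""
    connection = sqlite3.connect(":memory:")
    connection.execute(f"CREATE TABLE {table} (sample_id TEXT, value REAL)")
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return connection


def integrate(source_x: sqlite3.Connection,
              source_y: sqlite3.Connection) -> list[tuple[str, float, float]]:
    """Query each source separately, then join the results on sample_id."""
    x_rows = dict(source_x.execute("SELECT sample_id, value FROM x_data"))
    y_rows = dict(source_y.execute("SELECT sample_id, value FROM y_data"))
    shared_ids = sorted(x_rows.keys() & y_rows.keys())
    return [(key, x_rows[key], y_rows[key]) for key in shared_ids]


# Hypothetical contents for the two resources.
sx = make_source("x_data", [("a", 1.0), ("b", 2.0)])
sy = make_source("y_data", [("b", 20.0), ("c", 30.0)])

combined = integrate(sx, sy)
print(combined)
```

Joining at the client is the simplest choice for a sketch; moving the join closer to the data, as distributed query processing does, avoids shipping rows that will be discarded.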
“scientific” Application coding scientific insights Problem Solving Environment Semantic Meta data
New Science, Engineering, Medicine, Planning, …
Many opportunities for International collaboration