  1. Scientific data management and analysis – Data acquired from the Large Synoptic Survey Telescope (LSST) – Amin Mesmoudi

  2. Context
  • CNRS-Mastodons 2012-
  • LSST Project

  3. LSST needs in storage and data access (1/2)
  • Storage:

    Table                    Size     #Records  #Attributes
    Object                   109 TB   38 B      470
    Moving Object            5 GB     6 M       100
    Source                   3.6 PB   5 T       125
    Forced Source            1.1 PB   32 T      7
    Difference Image Source  71 TB    200 B     65
    CCD Exposure             0.6 TB   17 B      45

    (1 trillion = 10^18)
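A quick sanity check on the implied record sizes in the table above can be sketched as follows. This is illustrative only: it assumes "5 T" and "32 T" mean 5×10^12 and 32×10^12 records (the slide's trillion footnote leaves the scale ambiguous) and decimal PB = 10^15 bytes.

```python
# Average bytes per record for the two largest tables in the slide,
# assuming "T" = 10^12 records and 1 PB = 10^15 bytes (assumptions,
# not confirmed by the slide's footnote).
PB = 10**15

source_bytes_per_record = 3.6 * PB / (5 * 10**12)    # Source: 3.6 PB, 5 T records
forced_bytes_per_record = 1.1 * PB / (32 * 10**12)   # Forced Source: 1.1 PB, 32 T records

print(round(source_bytes_per_record))   # 720
print(round(forced_bytes_per_record))   # 34
```

The wide Source table (125 attributes) yields a plausibly large record, while the narrow Forced Source table (7 attributes) yields a small one, which is consistent with the attribute counts in the table.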

  4. LSST needs in storage and data access (2/2)
  • Access
  • Declarative queries (SQL):

    SELECT objectId, taiMidPoint, fluxToAbMag(psfMag)
    FROM Source
    JOIN Object USING (objectId)
    JOIN Filter USING (filterId)
    WHERE areaSpec_box(:raMin, :declMin, :raMax, :declMax)
      AND filterName = 'u'
      AND variability BETWEEN :varMin AND :varMax
    ORDER BY objectId, taiMidPoint ASC

  • With the possibility to define new ad hoc functions (UDFs)
  • Examples: areaSpec_box, angDist < dist
  • 500,000 queries per day

  5. Project Objectives
  • Propose a distributed architecture able to store 100+ PB of data
    • Open source
    • Shared-nothing
  • Support both simple queries (a few seconds of computation) and complex queries (days of computation)
  • Access objects through indexes or by a full scan of large tables (>> 1 PB)
  • Study the capacity of existing systems to meet the needs of the LSST project
    • Benchmark

  6. Outline
  • Context
  • MapReduce
  • SQL on MapReduce
  • Benchmark
    • Data sets
    • Queries
    • Experiments
  • Summary

  7. MapReduce (1/2)
  • “MapReduce is a simplified parallel data processing approach for execution on a computer cluster” [Dean and Ghemawat 2004].
  • The programmer specifies only two functions:
    map(in_key, in_value) -> list(out_key, intermediate_value)
      • Processes a <key, value> pair as input
      • Produces a set of <key2, value> pairs as intermediate results
    reduce(out_key, list(intermediate_value)) -> list(out_value)
      • Combines the intermediate values sharing a key
      • Produces a set of values as output
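The two signatures above can be sketched as a minimal single-process driver. This is an illustration of the programming model only (a word-count job), not Hadoop's actual API; the function and driver names are made up here.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(in_key, in_value):
    # map(in_key, in_value) -> list(out_key, intermediate_value):
    # emit one (word, 1) pair per word in the input line.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    # reduce(out_key, list(intermediate_value)) -> list(out_value):
    # combine all intermediate values sharing one key.
    return [sum(intermediate_values)]

def run_job(records):
    intermediate = []
    for key, value in records:                     # map phase
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=itemgetter(0))           # shuffle and sort
    output = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output[key] = reduce_fn(key, [v for _, v in group])  # reduce phase
    return output

print(run_job([(1, "a b a"), (2, "b c")]))  # {'a': [2], 'b': [2], 'c': [1]}
```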

  8. MapReduce (2/2)
  • Example: counting, for each ObjectID, the number of sources in the Source table.

    SourceID  ObjectID  RA  DECL  ExposureID
    1         1         10  -3    1
    5         2         15  -4    1
    9         2         20  -5    5
    13        4         25  -7    1
    2         1         40  -7    2
    6         2         45  -8    2
    10        3         50  -9    2
    14        4         55  -13   3
    3         1         60  -10   3
    7         2         65  -11   3
    11        3         70  -12   3
    15        3         75  -15   4
    4         1         80  -12   4
    8         2         85  -13   4
    12        3         90  -14   5
    16        4         95  -20   6

    ObjectID  freq
    1         4
    2         5
    3         4
    4         3

  9. [Diagram: MapReduce dataflow for the example on the previous slide. Each of four nodes runs a Mapper, a Combiner and a Partitioner over its block of the Source table; after the shuffle-and-sort phase, the Reducers emit the per-ObjectID frequencies (1 → 4, 2 → 5, 3 → 4, 4 → 3).]
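The mapper/combiner/partitioner/reducer pipeline sketched in the diagram can be simulated in a single process. This is a sketch under assumptions: four nodes each hold four consecutive rows of the Source table from slide 8, and the partitioner routes keys to reducers by a simple modulo hash (the actual assignment in the slide may differ).

```python
from collections import Counter, defaultdict

# The (SourceID, ObjectID) pairs from the Source table on slide 8,
# laid out as four blocks of four rows, one block per node.
SOURCE = [(1, 1), (5, 2), (9, 2), (13, 4),    # node 1's block
          (2, 1), (6, 2), (10, 3), (14, 4),   # node 2's block
          (3, 1), (7, 2), (11, 3), (15, 3),   # node 3's block
          (4, 1), (8, 2), (12, 3), (16, 4)]   # node 4's block

def count_sources_per_object(rows, num_nodes=4, num_reducers=2):
    size = len(rows) // num_nodes
    reducer_input = defaultdict(list)                  # shuffle target
    for n in range(num_nodes):
        block = rows[n * size:(n + 1) * size]
        # Mapper: emit (ObjectID, 1); Combiner: pre-sum per key locally.
        combined = Counter(object_id for _, object_id in block)
        # Partitioner: route each key to a reducer (modulo hash here).
        for key, partial in combined.items():
            reducer_input[key % num_reducers].append((key, partial))
    result = {}
    for pairs in reducer_input.values():               # Reducers: final sums
        for key, partial in pairs:
            result[key] = result.get(key, 0) + partial
    return result

print(sorted(count_sources_per_object(SOURCE).items()))  # [(1, 4), (2, 5), (3, 4), (4, 3)]
```

The combiner matters here: each node sends at most one partial count per ObjectID across the network instead of one pair per source row.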

  10. SQL-On-MapReduce
  [Diagram: generic SQL-on-MapReduce architecture. A query enters the query-processing layer, is compiled into MapReduce jobs by the jobs executor, and runs on Workers 1–5 over data placed in the DFS by the data loader.]

  11. SQL-On-MapReduce
  • SQL completeness (L, M, H suggests the range of support):

    SQL-on-Hadoop Technology  SQL DDL/DML  Packaged analytic functions  UDFs/Custom functions  MapReduce operations
    Hive                      M            M                            L                      H
    HadoopDB/Hadapt           M            M                            L                      L
    Drill                     M            L                            -                      L
    Impala                    M            L                            -                      L
    Presto                    M            -                            L                      -
    Spark/Shark               M            M                            -                      H

  12. Benchmark – Why?
  • http://spark.apache.org/
  • http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

  13. Benchmark – What?
  • A set of resources and methods (situations) that allow comparing the efficiency of systems and approaches offering the same functionality
  • Resources: data sets, queries, parameters, machines, …
  • Methods: indexed vs. non-indexed data, smart vs. simple partitioning, selective vs. non-selective queries, increasing the number of machines, compressed data, …

  14. Data sets

    DataSet   Size
    PT 1.1    90 GB
    PT 1.2    250 GB
    Winter13  3 TB
    SDSS      3 TB

  15. Data sets

    Table                          #attributes  #records  size
    Source                         107          162 (m)   118 GB
    Object                         227          4 (m)     7 GB
    RefSrcMatch                    10           189 (m)   1.7 GB
    Science_Ccd_Exposure_Metadata  6            41 (m)    16 GB

  16. Data sets
  • Scaled PT 1.2 data sets:

    DataSet        Source #records  Source size (GB)  Object #records  Object size (GB)
    PT 1.2 250 GB  325 (m)          236               9 (m)            14
    PT 1.2 500 GB  650 (m)          472               18 (m)           28
    PT 1.2 1 TB    1.3 (b)          944               36 (m)           56
    PT 1.2 2 TB    2.6 (b)          1888              72 (m)           112

  • Distinct values per attribute (D250GB–D2TB denote the scaled PT 1.2 data sets):

    DataSet  SourceID  ObjectID  DECL     RA       scienceCcdExposureId
    D250GB   325 (m)   9 (m)     325 (m)  162 (m)  84 (k)
    D500GB   650 (m)   18 (m)    650 (m)  162 (m)  84 (k)
    D1TB     1.3 (b)   36 (m)    1.1 (b)  162 (m)  84 (k)
    D2TB     2.6 (b)   72 (m)    2.3 (b)  162 (m)  84 (k)

  17. Queries (1/2)

  Selection
  Q1: select * from source where sourceid=29785473054213321;
  Q2: select sourceid, ra, decl from source where objectid=402386896042823;
  Q3: select sourceid, objectid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q4: select sourceid, ra, decl from source where scienceccdexposureid=454490250461;

  Group By
  Q5: select objectid, count(sourceid) from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 group by objectid;
  Q6: select objectid, count(sourceid) from source group by objectid;

  Join
  Q7: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q8: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96;
  Q9: SELECT s.psfFlux, s.psfFluxSigma, sce.exposureType FROM Source s JOIN RefSrcMatch rsm ON (s.sourceId = rsm.sourceId) JOIN Science_Ccd_Exposure_Metadata sce ON (s.scienceCcdExposureId = sce.scienceCcdExposureId) WHERE s.ra > 359.959 and s.ra < 359.96 and s.decl < 2.05 and s.decl > 2 and s.filterId = 2 and rsm.refObjectId is not NULL;

  18. Queries (2/2)

  Order By
  Q10: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 order by objectid;
  Q11: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 order by objectid;

  UDF
  Q12: select id from rundeepforcedsource where areaspec_box(coord_ra, coord_decl, -55, -2, 55, 2)=1;
  Q13: select fluxToAbMag(flux_naive) from rundeepforcedsource where objectid=1398583353936135;
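The semantics of the two UDFs in Q12 and Q13 can be sketched as follows. These are assumptions, not the actual LSST/Qserv implementations: the argument order of areaspec_box is inferred from how Q12 calls it, and fluxToAbMag is written with the standard AB zero point for a flux in erg s⁻¹ cm⁻² Hz⁻¹, which may not match the schema's flux units.

```python
import math

def areaspec_box(ra, decl, ra_min, decl_min, ra_max, decl_max):
    # Return 1 if (ra, decl) lies inside the bounding box, else 0.
    # Argument order inferred from Q12; no wrap-around handling at ra = 360.
    return 1 if ra_min <= ra <= ra_max and decl_min <= decl <= decl_max else 0

def flux_to_ab_mag(flux):
    # AB magnitude, assuming flux in erg s^-1 cm^-2 Hz^-1 (zero point 48.6).
    return -2.5 * math.log10(flux) - 48.6

print(areaspec_box(10.0, 1.5, -55, -2, 55, 2))  # 1
print(areaspec_box(60.0, 1.5, -55, -2, 55, 2))  # 0
```

Pushing such predicates into the engine as UDFs is what the access requirement on slide 4 asks for: the box test runs next to the data rather than after shipping 5 trillion Source rows to the client.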

  19. Predicates selectivity

    DataSet  SourceID=id  ObjectID=id  2<decl<2.05  359.959<ra<359.96  scienceCcdExposureId=id  359.959<ra<359.96 and 2<decl<2.05
    D250GB   1            43           1.6 (m)      14 (k)             21                       3.6 (k)
    D500GB   1            43           3.3 (m)      28 (k)             43                       7.3 (k)
    D1TB     1            43           6.6 (m)      57 (k)             86                       14.6 (k)
    D2TB     1            43           13.2 (m)     127 (k)            172                      29.2 (k)
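The absolute match counts above can be read as fractions of the Source table (row totals taken from slide 16). A small sketch makes the point that the combined ra/decl box stays equally selective at every scale:

```python
# Row totals (slide 16) and combined-box match counts (table above).
TOTAL_ROWS  = {"D250GB": 325e6, "D500GB": 650e6, "D1TB": 1.3e9, "D2TB": 2.6e9}
BOX_MATCHES = {"D250GB": 3.6e3, "D500GB": 7.3e3, "D1TB": 14.6e3, "D2TB": 29.2e3}

for dataset, total in TOTAL_ROWS.items():
    fraction = BOX_MATCHES[dataset] / total
    print(f"{dataset}: {fraction:.2e}")   # ≈ 1.1e-05 for every dataset
```

This constant selectivity is what makes the scaled data sets comparable: queries Q3, Q5, Q7 and Q10 touch the same fraction of the table regardless of volume, so runtime differences isolate the systems' scaling behaviour.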

  20. Experiments’ environment
  • Hardware:
    • 10 DELL C6220 machines
    • 2 clusters of 25 and 50 virtual machines (8 GB RAM, 2 cores and 300 GB of disk space each)
  • Data: PT 1.2 at 250 GB, 500 GB, 1 TB and 2 TB
  • Systems: Hive and HadoopDB
  • Available resources:

    RAM          1 TB
    Disk space   52 TB
    #Processors  240

  21. Hive and HadoopDB
  [Diagram: Hive runs MapReduce jobs directly over HDFS, while HadoopDB couples a modified Hive with a per-node RDBMS beneath the MapReduce layer. A query is compiled into a chain of jobs (Job 1 … Job 6), and the result of the whole job is stored.]

  22. Parameters to be measured
  • Tuning: data loading time (indexing, partitioning, …)
  • Performance: total query execution time
  • Fault tolerance: number of faults handled by the tool
  • Latency: time needed to get the first response
  • Situations:
    • Scalability (data volume): 250 GB → 500 GB → 1 TB → 2 TB
    • Hardware evolution: 25 machines → 50 machines
    • Indexed vs. non-indexed data
    • Different partitioning schemas

  23. Lessons learned (1/8)
  [Chart: data-loading time in minutes (0–2000) for 250 GB, 500 GB and 1 TB on 25 and 50 machines, comparing Hive (HDFS), global hash partitioning and local hash partitioning.]
  • Hive vs. HadoopDB tuning
  • HadoopDB offers custom partitioning
  • More data => more time to change the data partitioning schema (scalability)
  • More hardware (machines) => less time to partition the data (hardware speed-up)
