  1. Scientific data management and analysis – Data acquired from the Large Synoptic Survey Telescope (LSST) – Amin Mesmoudi

  2. Context
  • CNRS-Mastodons 2012-
  • LSST Project

  3. LSST needs in storage and data access (1/2)
  • Storage:

    Table                    Size     #Records  #Attributes
    Object                   109 TB   38 B      470
    Moving Object            5 GB     6 M       100
    Source                   3.6 PB   5 T       125
    Forced Source            1.1 PB   32 T      7
    Difference Image Source  71 TB    200 B     65
    CCD Exposure             0.6 TB   17 B      45

    (1 trillion = 10^18)
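A quick sanity check on the implied record sizes in the table above can be sketched as follows. This is illustrative only: it assumes "5 T" and "32 T" mean 5×10^12 and 32×10^12 records (the slide's trillion footnote leaves the scale ambiguous) and decimal PB = 10^15 bytes.

```python
# Average bytes per record for the two largest tables in the slide,
# assuming "T" = 10^12 records and 1 PB = 10^15 bytes (assumptions,
# not confirmed by the slide's footnote).
PB = 10**15

source_bytes_per_record = 3.6 * PB / (5 * 10**12)    # Source: 3.6 PB, 5 T records
forced_bytes_per_record = 1.1 * PB / (32 * 10**12)   # Forced Source: 1.1 PB, 32 T records

print(round(source_bytes_per_record))   # 720
print(round(forced_bytes_per_record))   # 34
```

The wide Source table (125 attributes) yields a plausibly large record, while the narrow Forced Source table (7 attributes) yields a small one, which is consistent with the attribute counts in the table.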

  4. LSST needs in storage and data access (2/2)
  • Access
  • Declarative queries (SQL):

    SELECT objectId, taiMidPoint, fluxToAbMag(psfMag)
    FROM Source
    JOIN Object USING (objectId)
    JOIN Filter USING (filterId)
    WHERE areaSpec_box(:raMin, :declMin, :raMax, :declMax)
      AND filterName = 'u'
      AND variability BETWEEN :varMin AND :varMax
    ORDER BY objectId, taiMidPoint ASC

  • With the possibility to define new ad hoc functions (UDFs)
  • Examples: areaSpec_box, angDist < dist
  • 500,000 queries per day

  5. Project Objectives
  • Propose a distributed architecture able to store 100+ PB of data
    • Open source
    • Shared-nothing
  • Support both simple queries (a few seconds of computation) and complex queries (days of computation)
  • Access objects through indexes or by a full scan of large tables (>> 1 PB)
  • Study the capacity of existing systems to meet the needs of the LSST project
    • Benchmark

  6. Outline
  • Context
  • MapReduce
  • SQL on MapReduce
  • Benchmark
    • Data sets
    • Queries
    • Experiments
  • Summary

  7. MapReduce (1/2)
  • “MapReduce is a simplified parallel data processing approach for execution on a computer cluster” [Dean and Ghemawat 2004].
  • The programmer specifies only two functions:
    map(in_key, in_value) -> list(out_key, intermediate_value)
      • Processes a <key, value> pair as input
      • Produces a set of <key2, value> pairs as intermediate results
    reduce(out_key, list(intermediate_value)) -> list(out_value)
      • Combines the intermediate values sharing a key
      • Produces a set of values as output
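The two signatures above can be sketched as a minimal single-process driver. This is an illustration of the programming model only (a word-count job), not Hadoop's actual API; the function and driver names are made up here.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(in_key, in_value):
    # map(in_key, in_value) -> list(out_key, intermediate_value):
    # emit one (word, 1) pair per word in the input line.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    # reduce(out_key, list(intermediate_value)) -> list(out_value):
    # combine all intermediate values sharing one key.
    return [sum(intermediate_values)]

def run_job(records):
    intermediate = []
    for key, value in records:                     # map phase
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=itemgetter(0))           # shuffle and sort
    output = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output[key] = reduce_fn(key, [v for _, v in group])  # reduce phase
    return output

print(run_job([(1, "a b a"), (2, "b c")]))  # {'a': [2], 'b': [2], 'c': [1]}
```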

  8. MapReduce (2/2)
  • Example: counting, for each ObjectID, the number of sources in the Source table.

    SourceID  ObjectID  RA  DECL  ExposureID
    1         1         10  -3    1
    5         2         15  -4    1
    9         2         20  -5    5
    13        4         25  -7    1
    2         1         40  -7    2
    6         2         45  -8    2
    10        3         50  -9    2
    14        4         55  -13   3
    3         1         60  -10   3
    7         2         65  -11   3
    11        3         70  -12   3
    15        3         75  -15   4
    4         1         80  -12   4
    8         2         85  -13   4
    12        3         90  -14   5
    16        4         95  -20   6

    ObjectID  freq
    1         4
    2         5
    3         4
    4         3

  9. [Diagram: MapReduce dataflow for the example on the previous slide. Each of four nodes runs a Mapper, a Combiner and a Partitioner over its block of the Source table; after the shuffle-and-sort phase, the Reducers emit the per-ObjectID frequencies (1 → 4, 2 → 5, 3 → 4, 4 → 3).]
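The mapper/combiner/partitioner/reducer pipeline sketched in the diagram can be simulated in a single process. This is a sketch under assumptions: four nodes each hold four consecutive rows of the Source table from slide 8, and the partitioner routes keys to reducers by a simple modulo hash (the actual assignment in the slide may differ).

```python
from collections import Counter, defaultdict

# The (SourceID, ObjectID) pairs from the Source table on slide 8,
# laid out as four blocks of four rows, one block per node.
SOURCE = [(1, 1), (5, 2), (9, 2), (13, 4),    # node 1's block
          (2, 1), (6, 2), (10, 3), (14, 4),   # node 2's block
          (3, 1), (7, 2), (11, 3), (15, 3),   # node 3's block
          (4, 1), (8, 2), (12, 3), (16, 4)]   # node 4's block

def count_sources_per_object(rows, num_nodes=4, num_reducers=2):
    size = len(rows) // num_nodes
    reducer_input = defaultdict(list)                  # shuffle target
    for n in range(num_nodes):
        block = rows[n * size:(n + 1) * size]
        # Mapper: emit (ObjectID, 1); Combiner: pre-sum per key locally.
        combined = Counter(object_id for _, object_id in block)
        # Partitioner: route each key to a reducer (modulo hash here).
        for key, partial in combined.items():
            reducer_input[key % num_reducers].append((key, partial))
    result = {}
    for pairs in reducer_input.values():               # Reducers: final sums
        for key, partial in pairs:
            result[key] = result.get(key, 0) + partial
    return result

print(sorted(count_sources_per_object(SOURCE).items()))  # [(1, 4), (2, 5), (3, 4), (4, 3)]
```

The combiner matters here: each node sends at most one partial count per ObjectID across the network instead of one pair per source row.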

  10. SQL-On-MapReduce
  [Diagram: generic SQL-on-MapReduce architecture. A query enters the query-processing layer, is compiled into MapReduce jobs by the jobs executor, and runs on Workers 1–5 over data placed in the DFS by the data loader.]

  11. SQL-On-MapReduce
  • SQL completeness (L, M, H suggests the range of support):

    SQL-on-Hadoop Technology  SQL DDL/DML  Packaged analytic functions  UDFs/Custom functions  MapReduce operations
    Hive                      M            M                            L                      H
    HadoopDB/Hadapt           M            M                            L                      L
    Drill                     M            L                            -                      L
    Impala                    M            L                            -                      L
    Presto                    M            -                            L                      -
    Spark/Shark               M            M                            -                      H

  12. Benchmark – Why?
  • http://spark.apache.org/
  • http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

  13. Benchmark – What?
  • A set of resources and methods (situations) that allow comparing the efficiency of systems and approaches offering the same functionality
  • Resources: data sets, queries, parameters, machines, …
  • Methods: indexed vs. non-indexed data, smart vs. simple partitioning, selective vs. non-selective queries, increasing the number of machines, compressed data, …

  14. Data sets

    DataSet   Size
    PT 1.1    90 GB
    PT 1.2    250 GB
    Winter13  3 TB
    SDSS      3 TB

  15. Data sets

    Table                          #attributes  #records  size
    Source                         107          162 (m)   118 GB
    Object                         227          4 (m)     7 GB
    RefSrcMatch                    10           189 (m)   1.7 GB
    Science_Ccd_Exposure_Metadata  6            41 (m)    16 GB

  16. Data sets
  • Scaled PT 1.2 data sets:

    DataSet        Source #records  Source size (GB)  Object #records  Object size (GB)
    PT 1.2 250 GB  325 (m)          236               9 (m)            14
    PT 1.2 500 GB  650 (m)          472               18 (m)           28
    PT 1.2 1 TB    1.3 (b)          944               36 (m)           56
    PT 1.2 2 TB    2.6 (b)          1888              72 (m)           112

  • Distinct values per attribute (D250GB–D2TB denote the scaled PT 1.2 data sets):

    DataSet  SourceID  ObjectID  DECL     RA       scienceCcdExposureId
    D250GB   325 (m)   9 (m)     325 (m)  162 (m)  84 (k)
    D500GB   650 (m)   18 (m)    650 (m)  162 (m)  84 (k)
    D1TB     1.3 (b)   36 (m)    1.1 (b)  162 (m)  84 (k)
    D2TB     2.6 (b)   72 (m)    2.3 (b)  162 (m)  84 (k)

  17. Queries (1/2)

  Selection
  Q1: select * from source where sourceid=29785473054213321;
  Q2: select sourceid, ra, decl from source where objectid=402386896042823;
  Q3: select sourceid, objectid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q4: select sourceid, ra, decl from source where scienceccdexposureid=454490250461;

  Group By
  Q5: select objectid, count(sourceid) from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 group by objectid;
  Q6: select objectid, count(sourceid) from source group by objectid;

  Join
  Q7: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2;
  Q8: select * from source join object on (source.objectid=object.objectid) where ra > 359.959 and ra < 359.96;
  Q9: SELECT s.psfFlux, s.psfFluxSigma, sce.exposureType FROM Source s JOIN RefSrcMatch rsm ON (s.sourceId = rsm.sourceId) JOIN Science_Ccd_Exposure_Metadata sce ON (s.scienceCcdExposureId = sce.scienceCcdExposureId) WHERE s.ra > 359.959 and s.ra < 359.96 and s.decl < 2.05 and s.decl > 2 and s.filterId = 2 and rsm.refObjectId is not NULL;

  18. Queries (2/2)

  Order By
  Q10: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 and decl < 2.05 and decl > 2 order by objectid;
  Q11: select objectid, sourceid from source where ra > 359.959 and ra < 359.96 order by objectid;

  UDF
  Q12: select id from rundeepforcedsource where areaspec_box(coord_ra, coord_decl, -55, -2, 55, 2)=1;
  Q13: select fluxToAbMag(flux_naive) from rundeepforcedsource where objectid=1398583353936135;
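The semantics of the two UDFs in Q12 and Q13 can be sketched as follows. These are assumptions, not the actual LSST/Qserv implementations: the argument order of areaspec_box is inferred from how Q12 calls it, and fluxToAbMag is written with the standard AB zero point for a flux in erg s⁻¹ cm⁻² Hz⁻¹, which may not match the schema's flux units.

```python
import math

def areaspec_box(ra, decl, ra_min, decl_min, ra_max, decl_max):
    # Return 1 if (ra, decl) lies inside the bounding box, else 0.
    # Argument order inferred from Q12; no wrap-around handling at ra = 360.
    return 1 if ra_min <= ra <= ra_max and decl_min <= decl <= decl_max else 0

def flux_to_ab_mag(flux):
    # AB magnitude, assuming flux in erg s^-1 cm^-2 Hz^-1 (zero point 48.6).
    return -2.5 * math.log10(flux) - 48.6

print(areaspec_box(10.0, 1.5, -55, -2, 55, 2))  # 1
print(areaspec_box(60.0, 1.5, -55, -2, 55, 2))  # 0
```

Pushing such predicates into the engine as UDFs is what the access requirement on slide 4 asks for: the box test runs next to the data rather than after shipping 5 trillion Source rows to the client.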

  19. Predicates selectivity

    DataSet  SourceID=id  ObjectID=id  2<decl<2.05  359.959<ra<359.96  scienceCcdExposureId=id  359.959<ra<359.96 and 2<decl<2.05
    D250GB   1            43           1.6 (m)      14 (k)             21                       3.6 (k)
    D500GB   1            43           3.3 (m)      28 (k)             43                       7.3 (k)
    D1TB     1            43           6.6 (m)      57 (k)             86                       14.6 (k)
    D2TB     1            43           13.2 (m)     127 (k)            172                      29.2 (k)
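The absolute match counts above can be read as fractions of the Source table (row totals taken from slide 16). A small sketch makes the point that the combined ra/decl box stays equally selective at every scale:

```python
# Row totals (slide 16) and combined-box match counts (table above).
TOTAL_ROWS  = {"D250GB": 325e6, "D500GB": 650e6, "D1TB": 1.3e9, "D2TB": 2.6e9}
BOX_MATCHES = {"D250GB": 3.6e3, "D500GB": 7.3e3, "D1TB": 14.6e3, "D2TB": 29.2e3}

for dataset, total in TOTAL_ROWS.items():
    fraction = BOX_MATCHES[dataset] / total
    print(f"{dataset}: {fraction:.2e}")   # ≈ 1.1e-05 for every dataset
```

This constant selectivity is what makes the scaled data sets comparable: queries Q3, Q5, Q7 and Q10 touch the same fraction of the table regardless of volume, so runtime differences isolate the systems' scaling behaviour.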

  20. Experiments’ environment
  • Hardware:
    • 10 DELL C6220 machines
    • 2 clusters of 25 and 50 virtual machines (8 GB RAM, 2 cores and 300 GB of disk space each)
  • Data: PT 1.2 at 250 GB, 500 GB, 1 TB and 2 TB
  • Systems: Hive and HadoopDB
  • Available resources:

    RAM          1 TB
    Disk space   52 TB
    #Processors  240

  21. Hive and HadoopDB
  [Diagram: Hive runs MapReduce jobs directly over HDFS, while HadoopDB couples a modified Hive with a per-node RDBMS beneath the MapReduce layer. A query is compiled into a chain of jobs (Job 1 … Job 6), and the result of the whole job is stored.]

  22. Parameters to be measured
  • Tuning: data loading time (indexing, partitioning, …)
  • Performance: total query execution time
  • Fault tolerance: number of faults handled by the tool
  • Latency: time needed to get the first response
  • Situations:
    • Scalability (data volume): 250 GB → 500 GB → 1 TB → 2 TB
    • Hardware evolution: 25 machines → 50 machines
    • Indexed vs. non-indexed data
    • Different partitioning schemas

  23. Lessons learned (1/8)
  [Chart: data-loading time in minutes (0–2000) for 250 GB, 500 GB and 1 TB on 25 and 50 machines, comparing Hive (HDFS), global hash partitioning and local hash partitioning.]
  • Hive vs. HadoopDB tuning
  • HadoopDB offers custom partitioning
  • More data => more time to change the data partitioning schema (scalability)
  • More hardware (machines) => less time to partition the data (hardware speed-up)
