

SLIDE 1

Qserv: parallel, distributed SQL query service for the LSST sky catalog

Daniel L. Wang

SLAC National Accelerator Laboratory, Menlo Park, CA, USA

27 January 2015

D.L.Wang (SLAC) 27 January 2015 1 / 27

SLIDE 2

Outline

1

Background

2

Qserv Overview

3

The guts

4

Other notes

SLIDE 3

What is LSST?

LSST = Large Synoptic Survey Telescope (under construction)

LSST DM Database team

SLAC: Jacek Becla, John Gates, Andrew Hanushevsky, Kian-Tat Lim, Fritz Mueller, Andrei Salnikov, Daniel Wang
IPAC/Caltech: Serge Monkewitz

Other contributors

IN2P3: Fabrice Jammes
Texas A&M: Vaikunth Thukral
More coming soon...

Astronomy objectives

Nature of Dark Energy
Solar System
Optical transients
Galactic structure

Image credit: LSST Corp / NOAO

SLIDE 4

Purpose: A baseline architecture

a catalog of celestial bodies for astronomers

A workable solution
Scales in CPU and data
Affordable
Reusable by science
No off-the-shelf solutions priced for science (none on the horizon)

SLIDE 5

Problem: volume and complexity

Why is this hard?

LSST’s final data release

Table name                    # rows        row size  footprint
Object                        38 × 10^9     1.7 kB    64 TB
Object Extra                  38 × 10^9     26 kB     1 PB
Source (detections)           6.3 × 10^12   0.56 kB   3.5 PB
ForcedSource (expected det.)  38 × 10^12    40 B      1.5 PB

Requirements

Lowest latency (< 10 seconds) for object (or small-area) retrieval
“Fast” efficient table scans (< 1 hour)
Efficient spatial joins* (< 24 hours)
High concurrency

* Brute force: Object ⋈ Object → O(1.4 × 10^21) [1.4 zetta!]; Object ⋈ ForcedSource → O(1.4 × 10^24) [1.4 yotta!]
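The footprints and join cardinalities above follow from simple arithmetic (rows × row size ≈ footprint; brute-force join cardinality = product of the row counts). A quick sketch to cross-check the slide's numbers:

```python
# Cross-check the data-release table sizes: rows * row_size ~ footprint.
tables = {
    "Object":       (38e9,   1.7e3),   # 38 x 10^9 rows, 1.7 kB/row
    "ObjectExtra":  (38e9,   26e3),    # 26 kB/row
    "Source":       (6.3e12, 0.56e3),  # 0.56 kB/row
    "ForcedSource": (38e12,  40.0),    # 40 B/row
}
footprints = {name: rows * size for name, (rows, size) in tables.items()}

# Brute-force self-join / cross-join cardinalities (row pairs):
obj_x_obj = 38e9 * 38e9        # ~1.4e21 pairs: "1.4 zetta"
obj_x_forced = 38e9 * 38e12    # ~1.4e24 pairs: "1.4 yotta"
```

The rounding on the slide (64 TB, 3.5 PB, 1.5 PB) matches these products to two significant figures.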

SLIDE 6

Block diagram

(Block diagram.) Components: mysql client → czar (mysqlproxy + Lua script, mysqld, daemon, executor, xrootd client (ssi)); XRootD manager (xrootd, cmsd); worker (mysqld, ssi/ofs, xrootd, cmsd, path-publish service); central state system (ZooKeeper); supervisors.

SLIDE 7

Czar processing flow

SLIDE 8

Worker processing flow

SLIDE 9

Partitioning

The secret sauce

Idea: break up queries into independently-processed pieces.
But keep small queries small: optimize small area-selections → spatially-contiguous tiles, “chunks” of the sky.

Drivers

Region queries

Small area → interactive, avoid involving many workers
Large area → spread load over many workers

Near neighbor queries

SELECT * FROM Object o1, Object o2
WHERE scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) < R;

Avoid all-to-all communication and quadratic scaling.
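To make the chunking idea concrete, here is a toy scheme that maps a sky position to a chunk ID: declination stripes, each split into RA chunks that shrink toward the poles. This is an illustration only; Qserv's actual partitioner, stripe counts, and ID encoding differ.

```python
import math

def chunk_id(ra_deg, decl_deg, n_stripes=85):
    """Toy spatial chunking (NOT Qserv's real scheme): assign a
    (ra, decl) position to a declination stripe, then to an RA chunk
    sized so chunks stay roughly equal-area across the sphere."""
    stripe_h = 180.0 / n_stripes
    stripe = min(int((decl_deg + 90.0) / stripe_h), n_stripes - 1)
    # The stripe's mid-declination sets how many RA chunks fit in it.
    mid = -90.0 + (stripe + 0.5) * stripe_h
    n_ra = max(1, int(round(360.0 * math.cos(math.radians(mid)) / stripe_h)))
    ra_chunk = int(ra_deg % 360.0 / (360.0 / n_ra))
    return stripe * 1000 + ra_chunk  # assumes < 1000 RA chunks per stripe
```

Nearby positions land in the same chunk (small queries stay on one worker), while a large region covers many chunks and therefore spreads across many workers.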

SLIDE 10

XrootD as RPC

Idea

XrootD already handles node join/leave and auto-failover among replicas.
Workers publish their owned spatial regions as XrootD “files”.

Implementation

Encode the data dependency as a path: /<group>/<chunkId>
Workers scan their storage upon startup and publish using a cmsd plugin
Czar dispatches using the experimental XrdSsi API.
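The path scheme amounts to a pair of trivial encode/parse helpers, sketched here in Python (illustrative only; the real group names and layout are Qserv internals):

```python
def chunk_path(group, chunk_id):
    """Encode a chunk as an XrootD-style resource path: /<group>/<chunkId>."""
    return f"/{group}/{chunk_id}"

def parse_chunk_path(path):
    """Invert chunk_path; returns (group, chunk_id)."""
    _, group, chunk = path.split("/")
    return group, int(chunk)
```

Because the path names the data dependency, XrootD's ordinary file-lookup machinery doubles as a data-addressed dispatch table.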

SLIDE 11

What is XrdSsi?

A new, fully-asynchronous RPC-ish API to XrootD (instead of file access).

open → new resource
write → send request
read → receive result
close → release resource

Assumptions that would break a filesystem

Opening a non-existent file for writing should always fail.

(invalid resource)

Writes to the same file can happen among multiple replicas without any consistency checking.

(different queries on one resource)

It’s okay if a file takes a day or week to read after opening.

(slow queries)

If I write to a file on a server, it is not okay to read the same path from another server (holding a replica).

(results only on “write” server)

Client might have 1M files open. This is totally fine and within the expected load.

SLIDE 12

XrdSsi API abstractions

XrdSsiService: service instance (RPC server), provides resources
XrdSsiService::Resource: path in a service (function name), receives requests
XrdSsiRequest: request buffer and response stream (arguments, return value), mediates input/output transfer

Similar abstractions on both client and server: → Considering short-circuit XrdSsi for single-process app

Client-side

XrdSsiService::Resource: request to a service
XrdSsiRequest: request payload and result callback

Server-side

XrdSsiService: answers resource requests over the wire
XrdSsiSession: request processor for requests, callbacks
XrdSsiResponder: transport abstraction
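As a rough mental model of how these roles fit together (a plain Python toy, not the real C++ API or its signatures): a resource names an endpoint, a request carries a payload plus a result callback, and the service routes one to the other.

```python
class Resource:
    """Names a server-side endpoint (think: path / function name)."""
    def __init__(self, name):
        self.name = name

class Request:
    """Carries the request payload; the service delivers the response
    by invoking on_response (asynchronously, in the real API)."""
    def __init__(self, payload, on_response):
        self.payload, self.on_response = payload, on_response

class Service:
    """Toy in-process 'server': maps resource names to handlers."""
    def __init__(self):
        self.handlers = {}
    def provision(self, resource, request):
        handler = self.handlers[resource.name]
        request.on_response(handler(request.payload))

# Hypothetical resource name and handler, for illustration only:
svc = Service()
svc.handlers["/q/LSST/42"] = lambda payload: f"rows for chunk 42 ({payload})"
results = []
svc.provision(Resource("/q/LSST/42"), Request("chunk query", results.append))
```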

SLIDE 13

XrdSsi Client-Server flow

SLIDE 14

Why use XrdSsi?

Compared to XrootD files

Working around file-based assumptions was error-prone, and the abstractions were often leaky.
It was difficult to predict when a file-based assumption would cause problems; this was usually trial and error with help from the XrootD developers.

Compared to other RPC

Most RPC mechanisms don’t expect up to 1M function names.
RPC interfaces buffer parameters and results.
RPC interfaces provide type marshalling/encoding.

SLIDE 15

Status

Overall: working prototype (→ 300 nodes)
LSST: under construction; first light 2019
Query execution: OK (stability, scalability, throughput, latency)
SQL support: limited (no subqueries, no BLOB/BIT)
Testability: single-node installation, data loading

SLIDE 16

Lessons

Need to parse? Use a parser.
Need to manipulate? Use an IR.
Need scalability? You need:

reliability/fault-recovery
parallelism, distribution
data: sequential access
asynchronous operations

Working with astronomers? Understand spherical geometry.

SLIDE 17

Thank you!

Questions?

Code: https://github.com/LSST/qserv/
Mailing list, wiki online
Ask me: danielw@slac.stanford.edu
Looking for new users and devs (we are hiring: http://ls.st/dr3).

SLIDE 18

Partitioning: Sizing considerations

Full-scan dispatch overhead: 10k chunks is all right (fewer chunks = faster)
Self-join: subchunk size < 5k rows (more subchunks = faster)
Load balancing: “many” chunks per node (more chunks = better balance)
Subchunk size: edge length ≫ overlap distance (larger = bigger relative to overlap → less wasted space)
Memory footprint: 2 × [sizeof(Object_x) + sizeof(Source_x)] < sizeof(DRAM per spindle), so chunk y can be read in while chunk x is being operated on.
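The memory-footprint rule is just a double-buffering constraint: two chunks' worth of working set must fit in RAM so the next chunk can stream in while the current one is scanned. A sketch with entirely made-up sizes:

```python
def fits_double_buffer(object_chunk_bytes, source_chunk_bytes,
                       dram_per_spindle_bytes):
    """Check 2 x [sizeof(Object_x) + sizeof(Source_x)] < DRAM per spindle,
    i.e. chunk y can be read in while chunk x is being operated on."""
    return 2 * (object_chunk_bytes + source_chunk_bytes) < dram_per_spindle_bytes

# Hypothetical numbers: 4 GB Object chunk + 20 GB Source chunk, 64 GB DRAM.
ok = fits_double_buffer(4e9, 20e9, 64e9)   # 48 GB working set vs 64 GB DRAM
```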

See Shared Scans

SLIDE 19

HOWTO: Execute a query

echo "SELECT * FROM Object WHERE areaspec_circle(3,4,0.02) AND rFlux < 0.03 LIMIT 2;" | mysql --port=4040 --protocol=TCP LSST

Proxy: conn.submitQuery(queryStr)
Daemon: admit query for processing, return result handle
Executor:
Parse/analyze/generate chunk queries, dispatch queries
Accumulate results in mysqld as they are ready
Perform aggregation (as appropriate)
Signal result ready
Proxy: fetch results from mysqld and return to user

SLIDE 20

Details: Analyze and prep

SELECT * FROM Object WHERE areaspec_circle(3,4,0.02) AND rFlux < 0.03 LIMIT 2;

Pipeline: SQL query string → 1. Parse SQL → parsed query structure → 2. Logical modules (1, 2, ..., n) → 3. Physical modules (1, 2, ..., n) → 4. Split query → parallel/merge query structures → 5. Perform partitioned table substitution → partitioned query strings

Parse and build IR. Infer db from context (LSST); look up LSST.Object
Compute chunk coverage from circle(3,4,0.02)
Replace the restrictor call IR with IR for scisql_ptInSphCircle(ra_PS, decl_PS, 3, 4, 0.02) = 1
Replace Object with LSST.Object_CC in the IR
Generate aggregation: SELECT * FROM result_14121 LIMIT 2
Generate query template
For each chunk# in coverage, substitute and dispatch
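The last two steps (generate a query template, then substitute per chunk) amount to string templating over the computed coverage; a minimal sketch, with the chunk numbers chosen arbitrarily:

```python
# Hypothetical physical-query template after restrictor rewriting;
# {chunk} stands in for the partitioned-table suffix.
TEMPLATE = ("SELECT * FROM LSST.Object_{chunk} "
            "WHERE scisql_ptInSphCircle(ra_PS, decl_PS, 3, 4, 0.02) = 1 "
            "AND rFlux < 0.03 LIMIT 2;")

def chunk_queries(coverage):
    """One physical query per chunk in the computed spatial coverage."""
    return [TEMPLATE.format(chunk=c) for c in coverage]

queries = chunk_queries([6630, 6631])  # made-up chunk numbers
```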

SLIDE 21

Details: Dispatch

For each chunk-query to be dispatched:
Generate dispatch msg
Open XrdSsi request for /q/<db>/<chunk#> (CB)
Write dispatch msg (CB)
Read response (CB)
Accumulate result rows

XrdSsi request/response

Request path from the XRootD manager
Manager redirects to a server holding /q/<db>/<chunk#>
Request path from the server (redirect?)
Channel opens with the server
Trivia: the XRootD client multiplexes logical connections over a single TCP connection

SLIDE 22

Details: XRootD name abstraction

Stateless, cached → fast restarts
Automagically forms connection topology for lookups
Enables data-addressed dispatch
Chunk query connection monitoring

SLIDE 23

HOWTO: Table scan

SELECT * FROM Object WHERE rFlux < 0.03 LIMIT 2;

No area restriction; Object = partitioned; mark as a scan of LSST.Object
Set chunk coverage = everything
Repeat dispatch for all chunks in the db; chunk query:
SELECT * FROM LSST.Object_<chunk#> WHERE rFlux < 0.03 LIMIT 2;
Accumulate rows
Merge query: SELECT * FROM merge_34235 LIMIT 2

SLIDE 24

HOWTO: Area query

SELECT * FROM Object WHERE areaspec_poly(2,2,3,5,4,10,4,0,3,1);

Detect polygon with vertices (2,2),(3,5),(4,10),(4,0),(3,1); Object = partitioned; if area beyond threshold, mark as scan
Infer db=LSST, look up partitioning for LSST.Object, look up partitioning columns
Compute chunk coverage using the geometry library
Repeat dispatch for covered chunks; chunk query:
SELECT * FROM LSST.Object_<chunk#> WHERE scisql_ptInConvexPoly(ra_PS, decl_PS, 2,2,3,5,4,10,4,0,3,1) = 1;
Accumulate rows
Return table as-is, no merging necessary

SLIDE 25

HOWTO: point query

SELECT * FROM Object WHERE objectId IN (2112525,123125);

Infer db=LSST, look up partitioning for LSST.Object, look up partitioning columns
Check the IN predicate; objectId = keyColumn for LSST.Object
Look up chunk# for objectId 2112525 and for 123125
Dispatch for the two chunks of interest; chunk query:
SELECT * FROM LSST.Object_<chunk#> WHERE objectId IN (2112525,123125)
Accumulate rows
Return table as-is, no merging necessary
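The point-query shortcut hinges on a key-to-chunk lookup for objectId, so only the chunks that can contain the requested rows are dispatched. A sketch in which a plain dict stands in for that index (the chunk numbers are made up):

```python
# Hypothetical key index: objectId -> chunk#. In a real deployment this
# mapping would be a persistent secondary index, not an in-memory dict.
key_index = {2112525: 1318, 123125: 742}

def chunks_for_keys(object_ids):
    """Resolve an IN-list of key values to the set of chunks to query."""
    return sorted({key_index[oid] for oid in object_ids})

targets = chunks_for_keys([2112525, 123125])  # only two chunks dispatched
```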

SLIDE 26

HOWTO: near-neighbor

SELECT * FROM Object o1, Object o2 WHERE o1.objectId <> o2.objectId AND scisql_angSep(o1.ra_PS, o1.decl_PS, o2.ra_PS, o2.decl_PS) < 0.000001;

Infer db=LSST, look up partitioning for LSST.Object
No spatial restriction: full-sky, mark as scan
Dispatch for the full sky, generate subchunk lists for all chunks
Subchunk creation (per subchunk):
CREATE TABLE Object_<chunk#>_<subchunk#> SELECT * FROM LSST.Object_<chunk#> WHERE subChunkId = <subchunk#>;
Chunk query sequence:
SELECT * FROM Object_<chunk#>_<subchunk#> o1, Object_<chunk#>_<subchunk#> o2 WHERE ...
SELECT * FROM Object_<chunk#>_<subchunk#> o1, ObjectFullOverlap_<chunk#>_<subchunk#> o2 WHERE ...
... (each subchunk)
Accumulate rows (wait a long time for the N×k join)
Return table as-is, no merging necessary
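The per-subchunk query pair (the subchunk joined against itself, then against the overlap table holding neighbors just across its boundary) can be generated mechanically. A sketch following the slide's naming pattern, with made-up chunk and subchunk numbers:

```python
def near_neighbor_queries(chunk, subchunks, predicate="..."):
    """For each subchunk, emit two joins: self x self, and
    self x overlap (neighbors just outside the subchunk boundary)."""
    out = []
    for s in subchunks:
        self_t = f"Object_{chunk}_{s}"
        overlap_t = f"ObjectFullOverlap_{chunk}_{s}"
        out.append(f"SELECT * FROM {self_t} o1, {self_t} o2 WHERE {predicate}")
        out.append(f"SELECT * FROM {self_t} o1, {overlap_t} o2 WHERE {predicate}")
    return out

qs = near_neighbor_queries(6630, [1, 2])  # 2 subchunks -> 4 queries
```

The overlap tables are what let each subchunk be joined independently: no subchunk ever needs rows from another worker, avoiding all-to-all communication.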

SLIDE 27

HOWTO: reference match

SELECT * FROM Object o, RefMatch m, RefObject r WHERE o.objectId = m.objectId AND m.refObjectId = r.refObjectId;

Infer db=LSST, look up partitioning for LSST.Object
No spatial restriction: full-sky, mark as scan
Dispatch for the full sky, generate subchunk lists for all chunks
Create subchunk tables as in the near-neighbor case
Per-subchunk query sequence:
SELECT * FROM Object_<chunk#>_<subchunk#> o, RefMatch_<chunk#>_<subchunk#> m, RefObject_<chunk#>_<subchunk#> r WHERE o.objectId = m.objectId AND m.refObjectId = r.refObjectId;
SELECT * FROM Object_<chunk#>_<subchunk#> o, RefMatch_<chunk#>_<subchunk#> m, RefObjectFullOverlap_<chunk#>_<subchunk#> r WHERE o.objectId = m.objectId AND m.refObjectId = r.refObjectId;
Accumulate rows
Return table as-is, no merging necessary
