SLIDE 1

Alpenhorn: Managing Data Products for the Canadian Hydrogen Intensity Mapping Experiment

Davor Cubranic, University of British Columbia

SLIDE 2

CHIME: Canadian Hydrogen Intensity Mapping Experiment

  • Novel Canadian radio telescope
  • Designed as a cosmology experiment: map redshifted hydrogen gas as a measure of dark energy
  • Large field of view, bandwidth, and processing power enable additional experiments:

  • Pulsar timing survey
  • Fast radio burst search
SLIDE 3

Participating Institutions

  • NRC — Dominion Radio Astrophysical Observatory, Kaleden, BC
  • University of British Columbia
  • Perimeter Institute, Waterloo
  • University of Toronto
  • Canadian Institute for Theoretical Astrophysics, Toronto
  • McGill University
  • National Radio Astronomy Observatory, Charlottesville, VA
  • West Virginia University, Morgantown, WV
SLIDE 4

CHIME @ DRAO

SLIDE 5

[Photo: the CHIME telescope]

SLIDE 6

Pipeline

  • 4x256 dual-polarization antennas
  • Analog signal: amplification & filtering
  • FPGA: digitization & FFT
  • GPU: cross-antenna signal correlation
  • Project-specific downstream processing

[Pipeline diagram: F-engines → X-engines → Cosmology / Pulsar / FRB backends]

SLIDE 7

Data Rates

  • FPGA: 6.5 Tb/s output
  • GPU: 256 x 25.6 Gb/s input
  • Cosmology: 2-3 TB/day ≈ 0.2 Gb/s
  • Pulsar: 256 x 0.25 Gb/s → ~0.6 Gb/s
  • FRB: 256 x 0.55 Gb/s → ~0.2 Gb/s

[Pipeline diagram annotated with these rates at each stage: F-engines → X-engines → Cosmology / Pulsar / FRB]
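As a rough check on the cosmology figure, a representative 2.5 TB/day (the middle of the quoted range, decimal units) converts to about the stated sustained rate:

    2.5 TB/day x 8 bit/byte / 86,400 s/day ≈ 0.23 Gb/s ≈ 0.2 Gb/s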

SLIDE 8

1 Gb/s >>> 100 Mb/s — the data rate far exceeds the available network bandwidth off the site.

SLIDE 9

Processing Sites

[Map of Canada showing the processing sites: Dominion Radio Astrophysical Observatory; SciNet, Toronto; UBC-V; Westgrid, Burnaby]

SLIDE 10

Managing Data Products

  • Wide range of data files produced daily
  • Move data off the collection site to the researchers’ analysis site(s) safely and reliably:

  • Replication
  • Data integrity checks

Make things findable:

  • Where are the copies of this file located?
  • Which files have data for X?

Keep it simple??

SLIDE 11

Alpenhorn

  • Set of tools for data management and replication
  • Developed incrementally by CHIME since ~2013
  • Used for the past five years on the CHIME Pathfinder
  • Recently extended and generalized to accommodate CHIME FRB and Pulsar projects’ data needs

SLIDE 12

System Architecture

[Architecture diagram: a shared state database coordinates every location. At each location (A, B, C), a file server runs an Alpenhorn service alongside a cron job and local users. The service monitors local changes and updates the database; users and cron jobs request copying through the database; each service checks it for copy requests and copies files accordingly.]
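Everything coordinates through the shared state database: users and cron jobs record copy requests in it, and whichever location’s Alpenhorn service can satisfy a request picks it up. A rough sketch of that hand-off (all function and table names are hypothetical, not Alpenhorn’s actual schema or API):

    # Hypothetical sketch of coordination via the shared database.
    def request_copy(db, archive_file, destination_group):
        # A user or cron job at any location records only the intent to copy.
        db.insert("copy_requests", file=archive_file, group=destination_group)

    def poll_copy_requests(db, local_nodes, copy_files):
        # Each location's Alpenhorn service periodically checks for requests
        # it can satisfy (source and destination both locally reachable),
        # copies the files, and marks the request complete.
        for request in db.pending_copy_requests():
            if db.is_reachable(request, local_nodes):
                copy_files(request)
                db.mark_complete(request)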

SLIDE 13

Data Model

Storage:

  • Storage node: directory on a host
  • Storage group: group of nodes ≈ location

Data products:

  • Acquisition: uninterrupted collection of data from a single instrument
  • Archive file: acquisition component containing data

Data replicas:

  • Archive file copy: physical instance of an archive file at a specific location
  • Copy request: action of copying an archive file copy to another location
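A minimal sketch of this data model in Python (class and field names are illustrative assumptions, not Alpenhorn’s actual schema):

    # Illustrative only: names and fields are assumptions, not Alpenhorn's schema.
    from dataclasses import dataclass, field

    @dataclass
    class StorageNode:                 # a directory on a host
        host: str
        root: str                      # e.g. "/mnt/archive"

    @dataclass
    class StorageGroup:                # a group of nodes ~ one location
        name: str
        nodes: list[StorageNode] = field(default_factory=list)

    @dataclass
    class Acquisition:                 # uninterrupted data collection from one instrument
        name: str
        instrument: str

    @dataclass
    class ArchiveFile:                 # acquisition component containing data
        acquisition: Acquisition
        name: str
        size_bytes: int
        checksum: str                  # used for integrity checks

    @dataclass
    class ArchiveFileCopy:             # physical instance of a file on a node
        file: ArchiveFile
        node: StorageNode
        present: bool

    @dataclass
    class ArchiveFileCopyRequest:      # "copy this file to that group"
        file: ArchiveFile
        source_node: StorageNode
        destination_group: StorageGroup
        completed: bool = False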

SLIDE 14

Service

Watches every storage node available on the system for new files matching a registered name pattern

  • If new/moved, add an archive file copy (plus archive file and acquisition records, if needed) to the database

  • If deleted, mark in the database as absent
  • If a lock file is deleted, process the locked file as if new

Periodically:

  • Execute archive file copy requests
  • Check integrity of suspect files
  • Delete unwanted files (only if they are also marked as not needed)
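Taken together, this amounts to a watch-and-maintain loop. A rough sketch of its shape (the event interface and all names are illustrative assumptions, not Alpenhorn’s actual code):

    # Hypothetical sketch of the per-location service loop.
    import time
    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class FileEvent:
        kind: str                      # "new", "moved", "deleted", "lock_released"
        path: str

    def service_loop(
        local_nodes: list[str],
        scan_node: Callable[[str], Iterable[FileEvent]],
        db,                            # shared state database
        run_maintenance: Callable[[], None],
        poll_interval: float = 60.0,
    ) -> None:
        while True:
            for node in local_nodes:
                for event in scan_node(node):
                    if event.kind in ("new", "moved"):
                        # Register the copy (plus archive file and acquisition
                        # records, if they don't exist yet).
                        db.add_file_copy(node, event.path)
                    elif event.kind == "deleted":
                        db.mark_copy_absent(node, event.path)
                    elif event.kind == "lock_released":
                        # The lock file is gone: treat the guarded file as new.
                        db.add_file_copy(node, event.path)
            # Periodically: execute copy requests, re-verify suspect files,
            # and delete unwanted (and not-needed) files.
            run_maintenance()
            time.sleep(poll_interval)
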
SLIDE 15

Transfer Jobs

Moving data between two sites is done with regularly scheduled “sync” jobs: request a copy, from one storage node, of all files not available in the destination storage group.

  • The request is executed by the Alpenhorn service that has both the source and the destination locally reachable

  • Copy method is configurable (rsync, bbcp, Globus)

In “target” mode, sync copies to a local destination, but deciding what to copy (the “target”) is based on a group that doesn’t have to be local
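A sketch of the set of files a sync job effectively asks for, namely the difference between what the source node has and what the destination group has (all names here are hypothetical, not Alpenhorn’s API):

    # Illustrative only: which files a "sync" asks to copy.
    def files_to_sync(db, source_node, destination_group):
        on_source = {c.file for c in db.copies_on(source_node) if c.present}
        at_destination = {
            c.file
            for node in destination_group.nodes
            for c in db.copies_on(node)
            if c.present
        }
        # Everything on the source that the destination group doesn't have yet.
        return on_source - at_destination

In “target” mode the same difference is taken against the target group rather than the destination, so a local disk can be filled with exactly the files still missing from a remote archive.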

SLIDE 16

Demo

SLIDE 17

Transport Disks

How does Alpenhorn help CHIME manage data offload? Hot-swap 4-disk enclosures at DRAO and UBC

  • Enclosure ≈ transport storage group
  • Individual drive ≈ storage node

A cron job at DRAO syncs to the transport group the files that are in the local source and not in the remote target group (a “target”-mode sync).
SLIDE 18

The Human Interface

The actual workflow for a transport disk:

  • Site operator at DRAO inserts an empty hard disk into the enclosure and “alpenhorn mounts” it as part of a storage group
  • Alpenhorn service at DRAO will automatically use this disk if its group is the destination of a copy request (e.g., issued as part of a cron job’s “sync”)
  • When the disk is full, alpenhorn will stop copying to it, and the operator runs “alpenhorn unmount”
  • Filled data disk(s) are shipped to UBC

[Diagram (DRAO side): insert disk → “alpenhorn mount” → Alpenhorn uses disk as copy destination → “alpenhorn unmount”]

SLIDE 19

The Human Interface (2)

At the other end…

  • UBC operator inserts the full data disks into the enclosure and mounts them as part of the UBC storage group
  • Alpenhorn service at UBC registers those files as locally available, and copies them to the local destination if any request is outstanding
  • When all files are copied, the UBC operator can “alpenhorn clean” the transport disk and “alpenhorn unmount” it
  • Cleaned (empty) data disk(s) are shipped back to DRAO and the process repeats

[Diagram (UBC side): insert disk → “alpenhorn mount” → Alpenhorn uses disk as copy source → “alpenhorn clean” → “alpenhorn unmount”]

SLIDE 20

Demo part 2

SLIDE 21

Customizing

  • Acquisitions and archive files have a type
  • The Alpenhorn configuration file specifies the mapping between pathname patterns and matching type
  • Built-in “generic” types match using the configured patterns, but don’t keep track of any metadata
  • Types are dynamically extensible using user-contributed classes (a sketch follows after this list):

  • Must provide a few required callbacks and properties
  • Can perform arbitrary processing to extract metadata when called back on new archive file events

  • This metadata usually goes into type-owned tables in the DB
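A minimal sketch of what such a user-contributed type could look like (the class, method names, and pattern below are assumptions for illustration, not Alpenhorn’s actual plugin interface):

    # Hypothetical user-contributed file type; names are illustrative.
    import re

    class CorrFileType:
        """A made-up type for correlator archive files."""

        # Pathname pattern this type claims; in the real system the mapping
        # from patterns to types comes from the Alpenhorn configuration file.
        pattern = re.compile(r".*_corr\.h5$")

        @classmethod
        def matches(cls, path: str) -> bool:
            return bool(cls.pattern.match(path))

        @classmethod
        def on_new_file(cls, path: str, db) -> None:
            # Callback on new-archive-file events: arbitrary processing can
            # run here to extract metadata, which usually goes into
            # type-owned tables in the database.
            metadata = {"source_path": path}   # real code would read the file
            db.insert_metadata("corr_file_metadata", metadata)
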
SLIDE 22

Summary

Alpenhorn is a set of tools for managing an archive of scientific data across multiple sites. Automatically:

  • tracks all copies of a single file,
  • handles available disk storage on the destination, and
  • ensures file integrity and sufficient replication

CLI for cron scripts and interactive use. Written for the CHIME radio telescope, but includes a framework for user-provided customization.

SLIDE 23

github.com/radiocosmology/alpenhorn