SLIDE 1

How to make a petabyte ROOT file: proposal for managing data with columnar granularity

Jim Pivarski

Princeton University – DIANA

October 11, 2017

SLIDE 2

Motivation: start by stating the obvious

ROOT’s selective reading is very important for analysis. Datasets have about a thousand branches,¹ so if you want to plot a quantity from a terabyte dataset with TTree::Draw, you only have to read a few gigabytes from disk. Same for reading over a network (XRootD):

auto file = TFile::Open("root://very.far.away/mydata.root");

This is GREAT.

¹ Branch counts: 3116 ATLAS MC, 1717 ATLAS data, 2151 CMS MiniAOD, 675+ CMS NanoAOD, 560 LHCb.
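A minimal sketch of what that selective read looks like in a session; the tree name "events" and branch name "Muon_pt" are placeholders, not names from the slides.

// Open a remote file over XRootD; only the branches actually touched get read.
auto file = TFile::Open("root://very.far.away/mydata.root");
TTree* tree = nullptr;
file->GetObject("events", tree);   // "events" is a placeholder tree name
tree->Draw("Muon_pt");             // fetches only Muon_pt's TBaskets, not the other ~1000 branches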

SLIDE 3

Conversation with computer scientist

“So it sounds like you already have a columnar database.”

“Not exactly: we still have to manage data as files, rather than columns.”

“What? Why? Couldn’t you just use XRootD to manage (move, backup, cache) columns directly? Why does it matter that they’re inside of files?”

“Because... because...”

SLIDE 4

Evidence that it matters: the CMS NanoAOD project

Stated goal: to serve 30–50% of CMS analyses with a single selection of columns. Need to make hard decisions about which columns to keep: reducing more makes data access easier for 50% of analyses while completely excluding the rest.

If we really had columnar data management, the problem would be moot: we’d just let the most frequently used 1–2 kB of each event migrate to warm storage while the rest cools. Instead, we’ll probably put the whole small copy (NanoAOD) in warm storage and the whole large copy (MiniAOD) in colder storage.

This is artificial. There’s a steep popularity distribution across columns, but we cut it abruptly with file schemas (data tiers).

SLIDE 5

Except for the simplest TTree structures, we can’t pull individual branches out of a file and manage them on their own.

“But you have XRootD!”

Yes, but only ROOT knows how to interpret a branch’s relationship with other branches (for example, which branches belong to the same split object, or how their baskets line up entry by entry).

SLIDE 6

What would it look like if we could?

CREATE TABLE derived_data AS
    SELECT pt, eta, phi, deltaphi**2 + deltaeta**2 AS deltaR
    FROM original_data WHERE deltaR < 0.2;

creates a new derived_data table from original_data, but links, rather than copies, pt, eta, and phi.² If original_data is deleted, the database would not delete pt, eta, and phi, as they’re in use by derived_data.

For data management, this is a very flexible system, as columns are a more granular unit for caching and replication. For users, there is much less cost to creating derived datasets: many versions of corrections and cuts.

² Implementation dependent, but common. “WHERE” selection may be implemented with a stencil.
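A minimal sketch of the bookkeeping this implies, in plain C++ and independent of any real database: each column is stored once and reference-counted, a derived dataset links to existing columns, and dropping a dataset only deletes columns that nothing else uses. ColumnStore and Dataset are made-up names for illustration.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping: column name -> reference count.
// In the proposal, this role would be played by the controlling database.
struct ColumnStore {
    std::map<std::string, int> refcount;

    void link(const std::string& column) { ++refcount[column]; }

    void unlink(const std::string& column) {
        if (--refcount[column] == 0) {
            refcount.erase(column);
            std::cout << "column " << column << " deleted (no longer in use)\n";
        }
    }
};

// A dataset is just a list of column names it links to.
struct Dataset {
    std::vector<std::string> columns;
};

int main() {
    ColumnStore store;

    Dataset original{{"pt", "eta", "phi"}};
    for (auto& c : original.columns) store.link(c);

    // The derived dataset links pt, eta, phi (shared) and adds a new column deltaR.
    Dataset derived{{"pt", "eta", "phi", "deltaR"}};
    for (auto& c : derived.columns) store.link(c);

    // Deleting original_data does not delete pt, eta, phi: derived_data still uses them.
    for (auto& c : original.columns) store.unlink(c);

    // Only when derived_data is also dropped do the shared columns go away.
    for (auto& c : derived.columns) store.unlink(c);
}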

SLIDE 7

Idea #1. Cast data from ROOT files into a well-known standard for columnar, hierarchical data; manage those columns individually in an object store like Ceph.

1. Apache Arrow is one such standard. It’s similar to ROOT’s splitting format but permits O(1) random access and splits down to all levels of depth.

2. PLUR or PLURP is my subset of the above with looser rules about how data may be referenced. Acronym for the minimum data model needed for physics: Primitives, Lists, Unions, Records, and maybe Pointers (beyond Arrow). (A minimal layout sketch follows below.)

(the “standard database” approach)
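To make “splits down to all levels of depth” and “O(1) random access” concrete, here is a minimal sketch of an Arrow/PLUR-style layout for a jagged quantity (muon pt per event): one flat content array plus one offsets array per list level. The variable names are illustrative only, not part of Arrow or PLUR.

#include <cstdio>
#include <vector>

int main() {
    // Two events: event 0 has muons with pt {50.0, 31.5}; event 1 has {24.2}.
    // Columnar layout: one flat content array and one offsets array per list level.
    std::vector<double> muon_pt_content{50.0, 31.5, 24.2};
    std::vector<int>    muon_offsets{0, 2, 3};   // event i spans [offsets[i], offsets[i+1])

    // O(1) random access to event 1 without touching event 0:
    int event = 1;
    for (int j = muon_offsets[event]; j < muon_offsets[event + 1]; ++j)
        std::printf("event %d muon pt = %g\n", event, muon_pt_content[j]);
}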

SLIDE 8

Idea #2 (this talk). Keep ROOT data as they are, but put individual TBaskets in the object store. TFile/TTree subclasses fetch data from the object store instead of seeking to file positions.

1. Presents the same TFile/TTree interface to users; old scripts still work.
2. But data replication, storage class, and caching are handled by the object store with columnar granularity.
3. Branches are shared transparently across derived datasets: all trees are friends.
4. The logic of sharing, reference-counting branches, managing datasets, etc. must all be implemented in ROOT; only ROOT understands how to combine branches.

(the “ROOT becomes the database” approach)
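A minimal sketch, and an assumption rather than existing ROOT code, of what the read path of such a TFile subclass could look like. The class name TObjectStoreFile, the httpGet() helper, and the key scheme are invented for illustration; the ReadBuffer override and the “WEB” constructor option are the hooks that existing remote-file subclasses of TFile typically use.

// Sketch only: a TFile subclass whose byte reads come from an object store.
#include "TFile.h"
#include <string>

// Placeholder for an HTTP GET against the warehouse database; a real version
// might use libcurl, and would fetch the whole TBasket object covering
// [pos, pos+len) by key so that a web cache can hold popular baskets.
static bool httpGet(const std::string& baseUrl, Long64_t pos, Int_t len, char* out) {
    (void)baseUrl; (void)pos; (void)len; (void)out;
    return false;  // stub
}

class TObjectStoreFile : public TFile {
public:
    // "WEB" asks the TFile base class not to open a local file (the usual setup for
    // web/remote TFile subclasses); metadata would come from the controlling database.
    explicit TObjectStoreFile(const char* url) : TFile(url, "WEB"), fBaseUrl(url) {}

    using TFile::ReadBuffer;

    // ROOT calls ReadBuffer whenever a TKey/TBasket needs bytes [pos, pos+len);
    // here the range is served by the object store instead of a local seek+read.
    Bool_t ReadBuffer(char* buf, Long64_t pos, Int_t len) override {
        return httpGet(fBaseUrl, pos, len, buf) ? kFALSE   // kFALSE = success (TFile convention)
                                                : kTRUE;   // kTRUE  = error
    }

private:
    std::string fBaseUrl;
};

In the actual proposal the subclass would first consult the controlling database for the TTree and TKey metadata and then fetch TBaskets by key; the byte-range form above is just the smallest hook that shows where the object store plugs in.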

SLIDE 9

How it could be done

◮ Subclass of TFile initializes itself by getting data from a “controlling” database (a document store like MongoDB might be best).

◮ Reference counts for objects referenced by TKeys (including TBaskets and user objects like histograms) are maintained by this controlling database.

◮ Bulk data, the contents of TKeys, are in a “warehouse” database (object store; might be the same database). Optimal basket size may be big, like megabytes.

◮ REST APIs for flexibility; TBaskets fetched by HTTP GET, may be web-cached. No new ROOT dependencies.

◮ Methods for deriving new TTrees from old TTrees:
  ◮ share common TBranch data by default;
  ◮ “soft skim” by stencil (event list/event bitmap); “hard skim” only if re-basketization is needed to compactify results (keeping fewer than ∼10% of original). A TEntryList-based sketch of the stencil idea follows below;
  ◮ save all provenance and use git-like versioning to determine whether two branches are related/may be combined (for a join by index position, rather than a mutual column).

◮ No user-facing partition boundaries: a huge dataset appears as one TTree.

◮ Users work in a shared TFile: home TDirectories; permissions managed by the database.
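The “soft skim by stencil” item can be pictured with ROOT’s existing TEntryList: the skim is just a list of selected entry numbers attached to the tree, so no branch data are rewritten. The file name, tree name, and cut below are placeholders; in the proposed system the stencil would live in the controlling database alongside the shared TBaskets.

// Sketch: a "soft skim" as a stencil (entry list) rather than a copied dataset.
// "mydata.root", "events", and "deltaR" are placeholder names for illustration.
#include "TFile.h"
#include "TTree.h"
#include "TEntryList.h"
#include "TDirectory.h"

void soft_skim() {
    TFile* file = TFile::Open("mydata.root");
    TTree* tree = nullptr;
    file->GetObject("events", tree);

    // Build the stencil: entry numbers passing the cut, collected into a TEntryList.
    tree->Draw(">>selected", "deltaR < 0.2", "entrylist");
    auto* selected = static_cast<TEntryList*>(gDirectory->Get("selected"));

    // Attach the stencil; subsequent draws see only the selected entries, while the
    // underlying branch data (TBaskets) are untouched and can still be shared.
    tree->SetEntryList(selected);
    tree->Draw("pt");
}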

SLIDE 10

Two modes of use

Direct connection

User launches ROOT, does TFile::Open("rootdb://data.cern/cms"), and extracts objects for analysis: Get("home/username/myhist")->Draw().

Job submission

User passes a macro, TTree::Draw request, or TDataFrame to a service that parallelizes it and puts results in the user’s home TDirectory.

◮ compute nodes use this same interface to communicate with storage;
◮ but a scheduler attempts to maximize shared cache locality on the compute nodes.

This is the “query server” idea I’ve been exploring for some time now, except that all of the interface is ROOT.
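A sketch of what a submitted payload could be: a declarative TDataFrame chain that the service can parallelize over TBaskets. Only the TDataFrame calls are real ROOT (as of 2017, in ROOT::Experimental); the dataset name "derived_data", the branch names, and the rootdb:// URL are the hypothetical ones from this talk.

// What a user might hand to the submission service.
#include "ROOT/TDataFrame.hxx"

void submit_me() {
    ROOT::Experimental::TDataFrame df("derived_data", "rootdb://data.cern/cms");
    auto hist = df.Filter("deltaR < 0.2").Histo1D("pt");
    hist->Draw();   // in the proposal, the filled histogram lands in the user's home TDirectory
}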


SLIDE 11

auto file = TFile::Open("rootdb://data.cern/cms");
file->Get("home/username")->cd();
file->Get("derived_data")->Draw("x >> hist");
file->Get("hist")->Fit("gaus");

[Architecture diagram: the user’s laptop and compute nodes talk to a control database and a warehouse database over HTTP/REST, with a cache on the compute nodes and job dispatch coordinated by Zookeeper. Jobs get TBasket data, perform the calculation, and save the result to “hist” in the database; the dispatcher preferentially sends jobs to compute nodes that already have the needed TBaskets in cache.]

SLIDE 12

Questions for you

Question: How would you feel if I developed this kind of service within ROOT (idea #2), rather than outside of ROOT (idea #1)? I’d want to sketch it out in Python (my uproot project) to figure out the architecture before committing to the ROOT codebase: ∼year timescale.

Question: Deeply nested columnar splitting, zero-copy structure manipulations, and many database indexing techniques are not possible with today’s ROOT serialization. Are you interested in forward-incompatible changes to ROOT serialization that would make these things possible? I could propose them as a ROOT 7 serialization format. (Subject of another talk.)
