SLIDE 1

Scalable Distributed Lineage Authentication

Ashish Gehani

Scalable Distributed Lineage Authentication – p. 1/59

SLIDE 2

What is data lineage?

[Figure: (a) Primitive operation: Input 1 ... Input n → Operation → Output. (b) Compound operation tree.]

SLIDE 3

Why track lineage?

GIS - Data origins
Material science - Component pedigree
Biology - Experiment reproducibility
Grid - Debugging

SLIDE 4

Why certify lineage?

Reproduction costly
  PDB - $200,000 / protein
  Fermilab Collision Detector - 1 month, multiple TB / datum
Reliability
Accreditation
Ownership
Auditability

SLIDE 5

What’s been done?

LFS - Inputs, Outputs, Options → SQL
PASS - Runtime environs → Berkeley DB
Trio - Tracks data accuracy using lineage
CMCS - Chemistry toolkit → WebDAV
Chimera - Workflow scripts
myGrid - Biology Grid workflows
Vesta - Incremental builds
ESSW - Earth Science data management

SLIDE 6

What’s the problem?

Single trust domain
  Chimera, myGrid, Vesta, ESSW
Centralized service
  LFS, PASS, Trio, CMCS
No assurance
  Unsigned
  Incomplete

SLIDE 7

What granularity?

What to audit? Processes, System calls, File system?
  Fine grain → High overhead
  Coarse grain → False positives
File system:
  Pro - Intermediate complexity
  Pro - Captures most persistent change
  Con - Can’t track data from: Network, Keyboard, Pipes, Memory maps

SLIDE 8

Certification approach

[Figure: Producer's Output passed to Consumer's Input. Question posed: Input = Output?]

No global TCB
Require commitments
Check agreement of:
  Producer output
  Consumer input
Trusted user in subtree / path → Tampering detectable
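
The agreement check can be illustrated with a minimal sketch. It assumes that each party commits to a SHA-1 digest of the file it handled; the slides do not say what form the commitments take, and the helper names `commit` and `agree` are hypothetical.

```python
import hashlib
import hmac


def commit(data: bytes) -> str:
    """Commitment to file contents (assumed here to be a SHA-1 digest)."""
    return hashlib.sha1(data).hexdigest()


def agree(producer_commitment: str, consumer_commitment: str) -> bool:
    """True when the producer's claimed output matches the consumer's claimed input."""
    return hmac.compare_digest(producer_commitment, consumer_commitment)


produced = commit(b"intermediate result")
consumed = commit(b"intermediate result")
tampered = commit(b"intermediate result (modified)")
```

A mismatch between the two commitments is what makes tampering detectable without a global trusted computing base.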

SLIDE 9

Metadata generation

Maintain process table entries for accessed, modified files

[Figure: process execution timeline. File 1 and File 2 are each opened with open(), read, and closed with close(); File 3 is opened, written, and closed. The resulting lineage links File 3 to File 1, File 2, and the Owner.]
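
The bookkeeping described on this slide might look like the sketch below. It is an illustration only: the `ProcessTable` class, its `on_open` / `on_close` hooks, and the record layout are hypothetical names, not the author's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessEntry:
    owner: str
    read_files: set = field(default_factory=set)     # files closed after a read
    written_files: set = field(default_factory=set)  # files closed after a write


class ProcessTable:
    def __init__(self):
        self.entries = {}     # pid -> ProcessEntry
        self.open_modes = {}  # (pid, path) -> "r" or "w"

    def on_open(self, pid, owner, path, mode):
        if pid not in self.entries:
            self.entries[pid] = ProcessEntry(owner)
        self.open_modes[(pid, path)] = mode

    def on_close(self, pid, path):
        """Record the access; for a written file, return its lineage record."""
        entry = self.entries[pid]
        mode = self.open_modes.pop((pid, path))
        if mode == "r":
            entry.read_files.add(path)
            return None
        entry.written_files.add(path)
        return {"output": path, "owner": entry.owner,
                "inputs": sorted(entry.read_files)}


table = ProcessTable()
for path in ("File 1", "File 2"):
    table.on_open(42, "alice", path, "r")
    table.on_close(42, path)
table.on_open(42, "alice", "File 3", "w")
record = table.on_close(42, "File 3")
```

Here `record` ties File 3 to the two files the process read before writing it, plus the process owner.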

SLIDE 10

Minimal representation

[Figure: lineage record layout. Fields: Executor, Signature, Output, Input 1 ... Input n; each file is a Net Address, Inode, Time triple.]

Executor: 32 bit IPv4 address, 32 bit user ID
Signature: 160 bits [ SIGN_{K_E}(E, O, I_1, ..., I_n) ]
Input / Output File:
  32 bit IPv4 address
  32 bit inode
  32 bit time (Seconds since 1/1/70)
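
As a rough sketch of this encoding, the field widths above can be packed with Python's `struct` module. The field order inside the record and the helper names (`pack_file`, `pack_record`, `record_size`) are assumptions made for illustration; only the bit widths come from the slide.

```python
import socket
import struct

FILE_FMT = "!4sII"     # IPv4 address, inode, time: 3 x 32 bits
EXECUTOR_FMT = "!4sI"  # IPv4 address, user ID: 2 x 32 bits
SIGNATURE_BYTES = 20   # 160 bits


def pack_file(ip, inode, seconds):
    return struct.pack(FILE_FMT, socket.inet_aton(ip), inode, seconds)


def pack_record(executor, signature, output, inputs):
    """executor: (ip, uid); output and each input: (ip, inode, seconds)."""
    if len(signature) != SIGNATURE_BYTES:
        raise ValueError("signature must be 160 bits")
    body = struct.pack(EXECUTOR_FMT, socket.inet_aton(executor[0]), executor[1])
    body += signature
    body += pack_file(*output)
    for file_id in inputs:
        body += pack_file(*file_id)
    return body


def record_size(n_inputs):
    """Bytes for one record: executor + signature + (1 output + n inputs)."""
    return 8 + SIGNATURE_BYTES + 12 * (1 + n_inputs)


rec = pack_record(("10.0.0.1", 1000), b"\x00" * 20,
                  ("10.0.0.1", 77, 1200000000),
                  [("10.0.0.2", 5, 1199999000), ("10.0.0.3", 9, 1199999500)])
```

Under these assumptions a record with n inputs occupies 40 + 12n bytes.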

SLIDE 11

Workload

Berkeley NOW file system traces
  Month of activity
  Access patterns stable
Instruction - 20 workstations in teaching lab
Research - 13 desktops of research group
Web - 1 web server running Postgres
Windows - 8 Windows desktops

SLIDE 12

Cumulative lineage

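Current paradigm
  Entire tree migrates with data
  Metadata grows rapidly:

+-------------+---------+---------+---------+---------+---------+
| Workload    | 1 step  | 2 steps | 3 steps | 4 steps | 5 steps |
+-------------+---------+---------+---------+---------+---------+
| Instruction | 0.4 KB  | 3 KB    | 31 KB   | 253 KB  | 2 MB    |
| Research    | 0.2 KB  | 0.8 KB  | 2 KB    | 8 KB    | 29 KB   |
| Web         | 1 KB    | 39 KB   | 1 MB    | 29 MB   | 813 MB  |
| Windows     | 0.2 KB  | 0.8 KB  | 2 KB    | 9 KB    | 30 KB   |
+-------------+---------+---------+---------+---------+---------+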
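
To make the growth rate concrete, the short calculation below derives per-step growth factors from the figures in the table. It assumes 1 MB = 1024 KB, which the slide does not state; the `growth_factors` helper is a hypothetical name.

```python
# Metadata size in KB after 1..5 steps, copied from the table above.
sizes_kb = {
    "Instruction": [0.4, 3, 31, 253, 2 * 1024],
    "Research": [0.2, 0.8, 2, 8, 29],
    "Web": [1, 39, 1 * 1024, 29 * 1024, 813 * 1024],
    "Windows": [0.2, 0.8, 2, 9, 30],
}


def growth_factors(sizes):
    """Ratio of each step's metadata size to the previous step's."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


for workload, sizes in sizes_kb.items():
    print(workload, [round(g, 1) for g in growth_factors(sizes)])
```

Every workload grows at each step, and the Web workload grows by well over an order of magnitude per step, which is why carrying the whole tree with the data does not scale.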

SLIDE 13

Operational impact

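Time (in ms) to read tree in open():

+-------------+--------+---------+---------+---------+
| Workload    | 1 step | 2 steps | 3 steps | 4 steps |
+-------------+--------+---------+---------+---------+
| Instruction | 0.04   | 0.05    | 0.11    | 1.72    |
| Research    | 0.05   | 0.05    | 0.04    | 0.04    |
| Web         | 0.06   | 0.13    | 6.42    | 997.5   |
| Windows     | 0.07   | 0.04    | 0.04    | 0.04    |
+-------------+--------+---------+---------+---------+

Time (in ms) to write tree in close():

+-------------+--------+---------+---------+---------+
| Workload    | 1 step | 2 steps | 3 steps | 4 steps |
+-------------+--------+---------+---------+---------+
| Instruction | 0.20   | 0.28    | 0.32    | 0.84    |
| Research    | 0.16   | 0.19    | 2.39    | 3.1     |
| Web         | 0.16   | 0.24    | 4.82    | 579.14  |
| Windows     | 0.16   | 0.50    | 5.34    | 3.17    |
+-------------+--------+---------+---------+---------+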

SLIDE 14

In actu

Larger representation
Unless certification available for:
  DHCP bindings
  inode mappings
  Clock synchronization

SLIDE 15

Decentralized lineage

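Proposed paradigm
  Remote pointers replace branches
  Metadata remains small:

+-------------+---------+
| Workload    | Storage |
+-------------+---------+
| Instruction | 0.4 KB  |
| Research    | 0.2 KB  |
| Web         | 1 KB    |
| Windows     | 0.2 KB  |
+-------------+---------+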

SLIDE 16

Verifying lineage

Algorithm: CHECKLINEAGE(D)
  {E, S, O, I1, ..., In} ← GETROOT(D)
  OUTPUT(E)
  PE ← PKILOOKUP(E)
  if I1, ..., In = {} then
    Result ← VERIFY(PE, S, E, O)
    if Result = FALSE then
      CheckFailed
  else
    Result ← VERIFY(PE, S, E, O|I1|...|In)
    if Result = TRUE then
      for i ← 1 to n do
        CHECKLINEAGE(Ii)    (annotation: Reliability drops)
    else
      CheckFailed
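
The recursion in CHECKLINEAGE can be sketched in Python as follows. This is not the author's code: HMAC-SHA1 with a per-executor key stands in for the public-key SIGN / VERIFY pair (it happens to produce a 160-bit tag), and the dictionaries `records` and `keys` are hypothetical stand-ins for GETROOT and PKILOOKUP.

```python
import hashlib
import hmac


def sign(key, executor, output, inputs=()):
    """Stand-in for SIGN: HMAC-SHA1 over E and O|I1|...|In."""
    message = "|".join([executor, output, *inputs]).encode()
    return hmac.new(key, message, hashlib.sha1).digest()


def check_lineage(data_id, records, keys, executors=None):
    """Return the executors on data_id's lineage, or raise on a bad signature.

    records: data id -> (executor, signature, inputs)   # plays GETROOT
    keys:    executor -> verification key               # plays PKILOOKUP
    """
    executors = [] if executors is None else executors
    executor, signature, inputs = records[data_id]
    executors.append(executor)  # OUTPUT(E)
    expected = sign(keys[executor], executor, data_id, inputs)
    if not hmac.compare_digest(signature, expected):
        raise ValueError(f"CheckFailed at {data_id}")
    for parent in inputs:  # in a decentralized store each lookup may be remote
        check_lineage(parent, records, keys, executors)
    return executors


keys = {"alice": b"key-a", "bob": b"key-b"}
records = {
    "raw": ("alice", sign(keys["alice"], "alice", "raw"), ()),
    "out": ("bob", sign(keys["bob"], "bob", "out", ("raw",)), ("raw",)),
}
verified = check_lineage("out", records, keys)
```

Each recursive call corresponds to fetching a parent record, which is the point where reliability drops if that record lives on an unreachable node.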

SLIDE 17

Increasing availability

Traditional strategy:
  Form virtual topology
  Flood neighbors
Inefficient use of storage

SLIDE 18

Bonsai

Prune lineage tree

[Figure: lineage tree with pruned levels marked λ. Legend: stored locally; pruned (must be recovered from remote node).]
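
A minimal sketch of pruning, under assumptions: the tree is a nested dictionary, a pruned subtree is replaced by a `(host, id)` pointer to the node that holds it, and the cut-off is expressed as the number of levels kept locally. The slide's exact definition of λ is not recoverable from the transcript, so `levels_kept` is a hypothetical parameter.

```python
def prune(tree, levels_kept):
    """Keep `levels_kept` levels locally; replace deeper subtrees with remote pointers.

    tree: {"id": ..., "host": ..., "inputs": [subtrees]}
    """
    if levels_kept <= 0:
        return {"remote": (tree["host"], tree["id"])}
    return {
        "id": tree["id"],
        "host": tree["host"],
        "inputs": [prune(child, levels_kept - 1) for child in tree["inputs"]],
    }


tree = {"id": "c", "host": "h3", "inputs": [
    {"id": "b", "host": "h2", "inputs": [
        {"id": "a", "host": "h1", "inputs": []},
    ]},
]}
pruned = prune(tree, 2)
```

Keeping more levels locally costs storage but reduces the number of remote lookups needed during verification.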

SLIDE 19

Simplest pruning

Trade verification reliability for storage

SLIDE 20

Scaling to the Grid

SLIDE 21

Grid computation

Compute nodes partially trusted

[Figure: GADU Server with Pegasus Planner dispatching to Globus nodes (300 Nodes). Labels: New Data, Query; NCBI, TIGR, PDB, Swiss-Prot, JGI; PFAM, BLOCKS, BLAST, THMM; Comparative Analysis Database. Legend: user’s administrative domain; external trusted database; grid nodes (trust but verify).]

SLIDE 22

New problems

Long running jobs
  Simple pruning insufficient
Lineage generated in many trust domains
  Cryptographic key retrieval slow
Trusted timestamps require data upload
  Grid data sets too large

SLIDE 23

Grid properties

Nodes have a stake in cooperating
Nontrivial quality of service
Low churn rates
Common software subset
Small number of domains

SLIDE 24

Exploratory strategies

Leverage transitive intra-domain trust
  Long paths in single domain
Perform greedy verification
  Few malicious nodes
Embed forward-secure temporal witnesses
  Common stake

SLIDE 25

Managing Lineage Metadata

SLIDE 26

"Grid" semantics

Not middleware-specific
Distributed system
Large files
Multiple administrative domains
Range of data sources
Loose collaboration
Non-interactive workloads

SLIDE 27

Motivation

Low latency lineage queries
Enables use for:
  Quality
    GALE
    Dynamic toolchain selection
  Safety / Reliability
    Check tool dependencies / versions
  Trust
    Limit sources

SLIDE 28

Auditing

Baseline
  Userspace filesystem (FUSE)
  Hooks in Linux kernels > 2.6.14
  Intercede on: open(), close(), read(), write()
Exploring
  LibAudit, BSM
  Inter-process: fork(), exec(), clone()
  Network: connect(), accept()

SLIDE 29

Process Table

+-----------+--------------+
| Field     | Type         |
+-----------+--------------+
| LPID      | int(11)      |
| Host      | varchar(256) |
| IP        | char(16)     |
| Time      | datetime     |
| PID       | int(11)      |
| PID_Name  | varchar(256) |
| PPID      | int(11)      |
| PPID_Name | varchar(256) |
| UID       | int(11)      |
| UID_Name  | char(32)     |
| GID       | int(11)      |
| GID_Name  | char(32)     |
| CmdLine   | varchar(256) |
| Environ   | text         |
+-----------+--------------+

SLIDE 30

File Table

+------------+--------------+
| Field      | Type         |
+------------+--------------+
| LFID       | int(11)      |
| FileName   | varchar(256) |
| Time       | datetime     |
| NewTime    | datetime     |
| RdWt       | int(11)      |
| LPID       | int(11)      |
| Hash       | varchar(256) |
| Signature  | varchar(256) |
+------------+--------------+
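
The two tables join on LPID, so a file's direct inputs can be found with a self-join on the file table. The sketch below uses SQLite in place of MySQL, keeps only a subset of the columns, and assumes that `RdWt` is 0 for a read and 1 for a write; that encoding and the `inputs_of` helper are hypothetical, not documented in the slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Process (LPID INTEGER PRIMARY KEY, Host TEXT, PID INTEGER,
                      UID_Name TEXT, CmdLine TEXT);
CREATE TABLE File (LFID INTEGER PRIMARY KEY, FileName TEXT, RdWt INTEGER,
                   LPID INTEGER REFERENCES Process(LPID));
""")
db.execute("INSERT INTO Process VALUES (1, 'node1', 4242, 'alice', 'sort in1 in2')")
db.executemany("INSERT INTO File VALUES (?, ?, ?, ?)", [
    (1, "in1", 0, 1),  # read by process 1
    (2, "in2", 0, 1),  # read by process 1
    (3, "out", 1, 1),  # written by process 1
])


def inputs_of(db, file_name):
    """Files read by any process that wrote `file_name`."""
    rows = db.execute("""
        SELECT DISTINCT r.FileName
        FROM File AS w JOIN File AS r ON r.LPID = w.LPID
        WHERE w.FileName = ? AND w.RdWt = 1 AND r.RdWt = 0
        ORDER BY r.FileName""", (file_name,))
    return [name for (name,) in rows]


direct_inputs = inputs_of(db, "out")
```

Repeating the query on each returned file walks the lineage one level at a time.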

SLIDE 31

Auxiliary Files

Application not lineage-aware
Synchronization error
  Black node "forgets"
  Gray subtree "lost"

SLIDE 32

Initial Approaches

Distributed transfer decouples metadata
Auxiliary files
Data structure in filesystem
Local database
Large files
File server
Headers / footers
Application- / format-specific engineering
In-band encoding

SLIDE 33

Overloaded Namespace

Metadata in local database
In overloaded namespace
  "Appear" as header
Change open(), read(), close() sequence
  Transparently append lineage
Change open(), write(), close() sequence
  Transparently extract lineage
seek() semantics not supported
  Limited to ftp, scp, GridFTP, Web browsers
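
One way to picture the overloaded namespace is as a pair of stream transformations: reads emit the lineage as a header ahead of the file bytes, and writes strip it back out. The framing used here (a 4-byte length prefix followed by JSON) and both function names are assumptions for illustration; the slides do not specify the wire format.

```python
import json
import struct


def read_with_lineage(data: bytes, lineage: dict) -> bytes:
    """What a reader in the overloaded namespace sees: header, then file bytes."""
    header = json.dumps(lineage, sort_keys=True).encode()
    return struct.pack("!I", len(header)) + header + data


def write_with_lineage(stream: bytes):
    """Split an incoming stream into (lineage, file bytes) on the receiving side."""
    (length,) = struct.unpack("!I", stream[:4])
    lineage = json.loads(stream[4:4 + length])
    return lineage, stream[4 + length:]


lineage = {"output": "out", "inputs": ["in1", "in2"], "owner": "alice"}
wire = read_with_lineage(b"file contents", lineage)
recovered_lineage, recovered_data = write_with_lineage(wire)
```

Because the header shifts every byte offset, this only works for tools that transfer files sequentially, which matches the slide's note that seek() semantics are not supported.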

SLIDE 34

SQL Response Time

SLIDE 35

HQL Response Time

SLIDE 36

Record Replication

Exploit redundancy between lineage
Merge forest into single graph
Weight vertices with metadata size
Weight edges with tree (DAG) count
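
The merge step can be sketched as below, assuming each lineage tree is given as a list of (output, input) edges and that per-vertex metadata sizes are known. The `merge_forest` name and the input format are hypothetical.

```python
from collections import Counter


def merge_forest(trees, metadata_size):
    """Merge lineage trees into one weighted graph.

    trees: list of edge lists [(output, input), ...], one list per tree
    metadata_size: vertex -> size of its lineage metadata
    Returns (vertex_weight, edge_weight), where edge_weight counts the
    number of trees that contain each edge.
    """
    vertex_weight = {}
    edge_weight = Counter()
    for edges in trees:
        for edge in set(edges):  # count each edge once per tree
            edge_weight[edge] += 1
            for vertex in edge:
                vertex_weight[vertex] = metadata_size[vertex]
    return vertex_weight, edge_weight


size = {"a": 40, "b": 40, "c": 64, "d": 64}
trees = [
    [("c", "a"), ("c", "b")],                          # lineage of c
    [("d", "a"), ("d", "c"), ("c", "a"), ("c", "b")],  # lineage of d, which reuses c
]
vertex_weight, edge_weight = merge_forest(trees, size)
```

Edges shared by many trees carry high weights, which marks the records that benefit most from replication.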

SLIDE 37

Goal

Partition graph
  Minimize remote queries
  Minimize global cache size
Approach
  Replicate vertex metadata in cluster
  Intra-cluster queries in cache
  Inter-cluster queries remote

SLIDE 38

Partitioning

Transform to clustering
Construct objective function
Augmenting edges for vertex weights
Consolidating edges
Optimization
Approximation

SLIDE 39

System Support for Forensic Inference

SLIDE 40

Motivation

Virtual conflict management
  Specify detailed policy
  Prevent disallowed actions
Physical world differs
  Retrospective accountability
  Punishment deters crimes

SLIDE 41

Enforcement Choices

Proactive security
  Must characterize a priori
  Supports pseudonymity
Reactive security
  Adjudicate a posteriori
  Accounts for complex context, e.g. intent
  Monitoring distributed among population

SLIDE 42

Changing Approach

Limits of proactive approach
  Institution-wide specification complex
  Characterizing attacks never ends
Reactive security enablers
  Cryptographic digital identities
  Trusted Platform Module
Forensics matches reactive paradigm

SLIDE 43

Ex Post Facto Rules

Policy creator’s burden is reduced
Individual’s freedom is broader
Legal semantics are interpretable
Addresses false positive / negative tradeoff
  Intrusion detection systems

SLIDE 44

Evidence Standards

1923 - Frye case
  Based in scientific knowledge
  Assist the trier of fact
1993 - Daubert case
  Relevancy
  Reliability of collection process
1997 - Havner case
  If "foundational data underlying testimony are unreliable", considered "no evidence"

SLIDE 45

Evidence Selection

Sarbanes-Oxley Act
  Publicly traded companies
  Information flow controls
Health Insurance Portability and Accountability Act
  Healthcare providers
  Data privacy protection
Gramm-Leach-Bliley Act
  Financial firms, educational institutions
  Personal information safeguards

SLIDE 46

Forensic Analysis

Reproducible by opposing counsel
Framework must be agreed upon
Commutativity of operations
Invertibility implications
Establishing chain of custody

SLIDE 47

Formal Framework

Standardization
  Laws on digital evidence encoded in logic
Automation
  Manage the proliferation of evidence
Soundness
  Constructive proofs of conclusions
Completeness
  Claims not inferred can be ruled out

SLIDE 48

CyberTrail

CyberLogic
  Provides attestations ( :⊲ )
  Predicates, certificates for trust queries
  Protocols are distributed logic programs
Digital artifacts - Authority :⊲ Statement
Intuitionistic - ambiguity in real world

SLIDE 49

Case Study

Intermediate granularity auditing
  User-space filesystem
  Facts that user can attest
Future - automate reasoning
  Interface with λ-Prolog interpreter

SLIDE 50

Facts Generated

Process(p)
Owner(e, p)
File(f1)    Input(p, f1)
File(f2)    Input(p, f2)
File(f3)    Output(p, f3)

p is a process identifier
e is process p’s owner
f1, f2 are files read by p
f3 is a file written by p

SLIDE 51

Digital Artifacts

Corresponding attestations:
  e :⊲ Owner(e, p)
  e :⊲ Input(p, f1)
  e :⊲ Input(p, f2)
  e :⊲ Output(p, f3)

SLIDE 52

Forensic Analysis

List all users who modified f0
  Authors(f0) := Output(p, f0) ∧ Owner(e, p) ∧ Input(p, f1) ∧ Authors(f1)
Validate against digital artifacts
  Authors(f0) := Output(p, f0) ∧ Owner(e, p) ∧ Input(p, f1) ∧ Authors(f1)
                 ∧ e :⊲ Output(p, f0) ∧ e :⊲ Owner(e, p) ∧ e :⊲ Input(p, f1)
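
The Authors rule can be approximated procedurally, as in the sketch below. It is an interpretation rather than a transcription of the logic program: files with no recorded producer are treated as the base case (the slide leaves this implicit), and attestations are modeled as a set of (authority, fact) pairs. The function and parameter names are hypothetical.

```python
def authors(f0, outputs, inputs, owner, attested=None, seen=None):
    """Owners of the processes on the derivation path of file f0.

    outputs, inputs: sets of (process, file) facts; owner: process -> user.
    attested: optional set of (user, fact); when given, a fact is used only
    if the process owner attested it.
    """
    seen = set() if seen is None else seen
    if f0 in seen:
        return set()
    seen.add(f0)
    result = set()
    for p, f in outputs:
        if f != f0:
            continue
        e = owner[p]
        needed = {(e, ("Output", p, f0)), (e, ("Owner", e, p))}
        if attested is not None and not needed <= attested:
            continue
        result.add(e)
        for q, f1 in inputs:
            if q != p:
                continue
            if attested is not None and (e, ("Input", p, f1)) not in attested:
                continue
            result |= authors(f1, outputs, inputs, owner, attested, seen)
    return result


outputs = {("p0", "f0"), ("p1", "f1")}
inputs = {("p0", "f1"), ("p1", "f2")}
owner = {"p0": "bob", "p1": "alice"}
bob_only = {("bob", ("Output", "p0", "f0")), ("bob", ("Owner", "bob", "p0")),
            ("bob", ("Input", "p0", "f1"))}
unvalidated = authors("f0", outputs, inputs, owner)
validated = authors("f0", outputs, inputs, owner, attested=bob_only)
```

Without validation both users are reported; with only bob's attestations available, alice's contribution cannot be confirmed and drops out.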

SLIDE 53

More Forensic Analysis

Check digital chain of custody from f0 to f1
  Chain(f0, f1) := Chain(f, f1) ∧ Output(p, f) ∧ Input(p, f0)
                   ∧ e :⊲ Output(p, f) ∧ e :⊲ Input(p, f0)
Find files derived from f0
  Derivatives(f0) := Input(p, f0) ∧ Output(p, f1) ∧ Derivatives(f1)
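
Both queries are forward traversals over the same facts. The sketch below is one possible reading, with assumptions stated: Chain(f, f) is taken as the base case, the attesting authority e is taken to be the process owner, and all names are hypothetical.

```python
def derivatives(f0, inputs, outputs, seen=None):
    """All files reachable from f0 through Input/Output facts."""
    seen = set() if seen is None else seen
    for p, f in inputs:
        if f != f0:
            continue
        for q, f1 in outputs:
            if q == p and f1 not in seen:
                seen.add(f1)
                derivatives(f1, inputs, outputs, seen)
    return seen


def chain(f0, f1, inputs, outputs, owner, attested, seen=None):
    """True if attested Input/Output facts link f0 forward to f1."""
    if f0 == f1:
        return True  # assumed base case
    seen = set() if seen is None else seen
    seen.add(f0)
    for p, f_in in inputs:
        if f_in != f0 or (owner[p], ("Input", p, f0)) not in attested:
            continue
        for q, f in outputs:
            if q != p or f in seen:
                continue
            if (owner[p], ("Output", p, f)) not in attested:
                continue
            if chain(f, f1, inputs, outputs, owner, attested, seen):
                return True
    return False


inputs = {("p1", "f0"), ("p2", "fa")}
outputs = {("p1", "fa"), ("p2", "f1")}
owner = {"p1": "alice", "p2": "bob"}
full = {("alice", ("Input", "p1", "f0")), ("alice", ("Output", "p1", "fa")),
        ("bob", ("Input", "p2", "fa")), ("bob", ("Output", "p2", "f1"))}
partial = full - {("bob", ("Output", "p2", "f1"))}
```

A single missing attestation anywhere along the path is enough to break the chain of custody, even though the derivation itself is still recorded.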

SLIDE 54

SDCI Obligations

SLIDE 55

NMI Build and Test

Attended April 2008 workshop
Interactive use has limited support
Development on VMware Workstation
Distributed testing on PlanetLab
NMI use after code matures

SLIDE 56

Licensing

Our code
  SPADE, JavaFUSE (GPL)
Dependencies
  MySQL database (GPL / MySQL)
  GNU Multiple Precision library (LGPL)
  FUSE filesystem driver (LGPL)
  Auditing library (LGPL)
  Pairing Based Cryptography library (GPL)
  Hashish (Bloom filter) library (GPL)
  PlanetLab FUSE wrapper (GPL)

SLIDE 57

End user interaction

NIGHTINGALE project
  15 universities / companies
  Data from multilingual sources (Mandarin, Arabic, English)
  Automated speech recognition, machine translation, distillation
  Queries from monolingual users

SLIDE 58

Cyber Infrastructure

Berkeley Customs Grid middleware
500 node cluster at SRI
Provenance needed to:
  Dynamically alter workflow
  Synthesize self-contained stages
NFS access to data
  Remap NFS mount point
  SPADE bridge between namespaces

SLIDE 59

Other user outreach

Tanu Malik, Purdue Cyber Center
  Chemical engineering simulations
  SPADE for provenance tracking
Decentralized lineage caching
  Alternative pruning approach
  Bypass-yield - agnostic to query workload
  SkyQuery / National Virtual Observatory
