Vassil Roussev The Current Forensic Workflow Forensic Target (3TB) - PowerPoint PPT Presentation

Vassil Roussev

The Current Forensic Workflow
Forensic Target (3TB)
• Clone @150MB/s: ~5.5 hrs
• Process @10MB/s*: ~82.5 hrs
We can start working on the case after 88 hours.
* http://accessdata.com/distributed-processing
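The timing figures on this slide follow from dividing the target size by the throughput of each stage. The sketch below redoes that arithmetic; it assumes decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes), which the slide does not state, so the results only roughly match the slide's rounded values.

```python
# Rough check of the slide's timing figures.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TARGET_BYTES = 3 * 10**12


def hours_at(rate_mb_per_s: float, size_bytes: int = TARGET_BYTES) -> float:
    """Hours needed to stream size_bytes at the given rate in MB/s."""
    return size_bytes / (rate_mb_per_s * 10**6) / 3600


clone_hours = hours_at(150)     # cloning pass at 150 MB/s
process_hours = hours_at(10)    # processing pass at 10 MB/s
sequential_total = clone_hours + process_hours  # stages run one after the other

if __name__ == "__main__":
    print(f"clone {clone_hours:.1f} h, process {process_hours:.1f} h, "
          f"total {sequential_total:.1f} h")
```

Because the two stages run back to back, the slow processing stage dominates the total.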

Scalable Forensic Workflow
Forensic Target (3TB)
• Clone and Process @150MB/s
We can start working on the case immediately.

Current Forensic Processing
• Hashing/filtering/correlation
• File carving/reconstruction
• Indexing
The ultimate goal of this work is to make similarity hash-based correlation scalable & I/O-bound.

Motivation for similarity approach: Traditional hash filtering is failing
• Known file filtering:
  o Crypto-hash known files, store in library (e.g. NSRL)
  o Hash files on target
  o Filter in/out depending on interest
• Challenges
  o Static libraries are falling behind
    - Dynamic software updates, trivial artifact transformations
    - We need version correlation
  o Need to find embedded objects
    - Block/file in file/volume/network trace
  o Need higher-level correlations
    - Disk-to-RAM
    - Disk-to-network
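The known-file filtering steps can be sketched as follows. This is a minimal illustration, not the workflow of any particular tool: the in-memory `library` set stands in for a reference set such as NSRL, and the file names and contents are made up.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Hex-encoded SHA-1 of the full content."""
    return hashlib.sha1(data).hexdigest()


def filter_known(target_files: dict[str, bytes], known_hashes: set[str]):
    """Split files into (known, unknown) name lists by exact SHA-1 match."""
    known, unknown = [], []
    for name, content in target_files.items():
        (known if sha1_hex(content) in known_hashes else unknown).append(name)
    return known, unknown


# Hypothetical reference library built from a single "known" file.
library = {sha1_hex(b"original release v1.0")}
target = {
    "a.bin": b"original release v1.0",  # exact copy: matches
    "b.bin": b"original release v1.1",  # one byte differs: no match
}
known, unknown = filter_known(target, library)
```

The second file differs from the known one by a single byte and is no longer recognized, which is the "trivial artifact transformations" problem listed under Challenges.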

Scenario #1: Fragment Identification
Source artifacts (files); disk fragments (sectors); network fragments (packets)
• Given a fragment, identify source
  o Fragments of interest are 1-4KB in size
  o Fragment alignment is arbitrary

Scenario #2: Artifact Similarity
Similar files (shared content/format); similar drives (shared blocks/files)
• Given two binary objects, detect similarity/versioning
  o Similarity here is purely syntactic;
  o Relies on commonality of the binary representations.

Solution: Similarity Digests
• Is this fragment present on the drive? Run sdhash on both objects and compare sdbf 1 with sdbf 2: score 0 .. 100
• Are these artifacts correlated? Run sdhash on both objects and compare sdbf 1 with sdbf 2: score 0 .. 100
All correlations based on bitstream commonality

Quick Review: Similarity digests & sdhash

Generating sdhash fingerprints (1)
Digital artifact (block/file/packet/volume) as byte stream
Features (all 64-byte sequences)
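Treating an artifact as a byte stream and taking "all 64-byte sequences" amounts to a sliding window that advances one byte at a time. A minimal sketch (the function name `features` is chosen here for illustration):

```python
from typing import Iterator

FEATURE_SIZE = 64  # bytes, per the slide


def features(data: bytes, size: int = FEATURE_SIZE) -> Iterator[bytes]:
    """Yield every `size`-byte sequence, one per starting offset."""
    for offset in range(len(data) - size + 1):
        yield data[offset:offset + size]


sample = bytes(range(100))            # 100-byte toy artifact
feature_list = list(features(sample))  # 100 - 64 + 1 windows
```

An input shorter than one feature yields nothing, and an input of n bytes yields n - 63 candidate features, which is why a selection step is needed before anything is stored.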

Generating sdhash fingerprints (2)
Digital artifact
Select characteristic features (statistically improbable/rare)

Generating sdhash fingerprints (3)
Feature Selection Process: All features -> H norm score (0..1000) -> Weak Feature Filter (removes data with low information content) -> Rare Local Feature Selector
[Figure (a): H norm distribution for doc files; x-axis: H norm (0 to 1000), y-axis: Probability]
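One way to obtain a per-feature score on a 0..1000 scale is to normalize the Shannon entropy of the feature's bytes. The sketch below makes that assumption explicit: the scaling by log2 of the feature length and the cutoff `WEAK_THRESHOLD` are illustrative choices, not values taken from the slide.

```python
import math
from collections import Counter


def h_norm(feature: bytes) -> int:
    """Shannon entropy of the byte values, scaled to the integer range 0..1000."""
    n = len(feature)
    if n < 2:
        return 0
    counts = Counter(feature)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_entropy = math.log2(n)  # reached when every byte in the feature is distinct
    return int(1000 * entropy / max_entropy)


WEAK_THRESHOLD = 100  # hypothetical cutoff for "low information content"


def is_weak(feature: bytes) -> bool:
    """True if the feature would be dropped by this illustrative weak filter."""
    return h_norm(feature) <= WEAK_THRESHOLD
```

A run of 64 identical bytes scores 0 and is filtered out, while 64 distinct byte values score 1000.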

Generating sdhash fingerprints (4)
• Selected features from each chunk of source data (8-10K avg) are hashed with SHA-1 and inserted into a Bloom filter
• Artifact SD fingerprint = bf 1 + bf 2 + bf 3 + …: a sequence of Bloom filters (sdbf)
• Bloom filter = local SD fingerprint
  o 256 bytes
  o up to 128/160 features
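Inserting a SHA-1-hashed feature into a 256-byte Bloom filter can be sketched as below. The slide gives the filter size and the hash function; the number of bit positions per feature (`NUM_HASHES = 5`) and the way they are derived from the digest are assumptions made for this example.

```python
import hashlib

BF_BYTES = 256          # filter size from the slide
BF_BITS = BF_BYTES * 8  # 2048 bits
NUM_HASHES = 5          # assumption: five bit positions per feature


def bit_positions(feature: bytes) -> list[int]:
    """Derive NUM_HASHES bit indexes from the feature's SHA-1 digest."""
    digest = hashlib.sha1(feature).digest()  # 20 bytes = five 32-bit words
    words = [int.from_bytes(digest[i:i + 4], "big") for i in range(0, 20, 4)]
    return [w % BF_BITS for w in words[:NUM_HASHES]]


def bf_insert(bf: bytearray, feature: bytes) -> None:
    """Set the feature's bits in the filter."""
    for pos in bit_positions(feature):
        bf[pos // 8] |= 1 << (pos % 8)


def bf_contains(bf: bytearray, feature: bytes) -> bool:
    """True if all of the feature's bits are set (false positives are possible)."""
    return all(bf[pos // 8] & (1 << (pos % 8)) for pos in bit_positions(feature))


bf = bytearray(BF_BYTES)
bf_insert(bf, b"x" * 64)
```

Once a filter reaches its feature limit, a new one is started, which is how the digest becomes a sequence of filters.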

Bloom filter (BF) comparison
bf A, bf B -> bitwise AND -> BF Score (0 .. 100)
• Based on BF theory, overlap due to chance is analytically predictable.
• Additional BF overlap is proportional to overlap in features.
• BF Score is tuned such that BF Score(A random, B random) = 0.
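The idea of subtracting the chance overlap can be illustrated with standard Bloom filter math. This is a simplified sketch, not the tuned scoring used by sdhash: filters are modeled as Python integers, five bit positions per feature is an assumption, and the linear scaling between chance overlap and maximum possible overlap is one plausible choice.

```python
import hashlib

M = 2048  # bits per filter (256 bytes)
K = 5     # assumption: bit positions set per feature


def make_bf(features) -> int:
    """Build a filter as an int bitmask from an iterable of byte strings."""
    bf = 0
    for f in features:
        d = hashlib.sha1(f).digest()
        for i in range(K):
            bf |= 1 << (int.from_bytes(d[4 * i:4 * i + 4], "big") % M)
    return bf


def expected_chance_overlap(n_a: int, n_b: int) -> float:
    """Expected number of common bits for two filters holding unrelated features."""
    p = 1 - 1 / M
    return M * (1 - p ** (K * n_a)) * (1 - p ** (K * n_b))


def bf_score(bf_a: int, n_a: int, bf_b: int, n_b: int) -> float:
    """Scale the overlap beyond chance to 0..100 (simplified illustration)."""
    observed = bin(bf_a & bf_b).count("1")        # bitwise AND, then popcount
    chance = expected_chance_overlap(n_a, n_b)
    best = min(bin(bf_a).count("1"), bin(bf_b).count("1"))
    if best <= chance:
        return 0.0
    return 100.0 * max(0.0, min(1.0, (observed - chance) / (best - chance)))
```

Comparing a filter with itself gives 100, while two filters built from unrelated features land near 0 because their overlap is about what chance predicts.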

SDBF fingerprint comparison
SD A = bf A 1, bf A 2, …, bf A n; SD B = bf B 1, bf B 2, …, bf B m
Each filter of SD A is compared against every filter of SD B, and the best score in each row is kept:
$$\max_i = \max_{1 \le j \le m} \mathrm{BFScore}\left(bf_A^{\,i},\, bf_B^{\,j}\right), \qquad i = 1, \dots, n$$
$$\mathrm{SDScore}(A, B) = \mathrm{Average}(\max_1, \max_2, \dots, \max_n) = \frac{1}{n} \sum_{i=1}^{n} \max_i$$
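The average-of-row-maxima rule can be written directly. In this sketch the per-filter scoring function is passed in as a parameter, and `toy_score` is a hypothetical stand-in used only to make the example self-contained.

```python
from typing import Callable, Sequence, TypeVar

BF = TypeVar("BF")


def sd_score(sd_a: Sequence[BF], sd_b: Sequence[BF],
             bf_score: Callable[[BF, BF], float]) -> float:
    """Average, over the filters of A, of the best score against any filter of B."""
    if not sd_a or not sd_b:
        return 0.0
    best_per_filter = [max(bf_score(a, b) for b in sd_b) for a in sd_a]
    return sum(best_per_filter) / len(best_per_filter)


# Toy stand-in for BF Score: 100 for equal filters, 0 otherwise (hypothetical).
def toy_score(a: int, b: int) -> float:
    return 100.0 if a == b else 0.0


score = sd_score([1, 2, 3, 4], [3, 1, 9], toy_score)
```

Here two of the four filters in A find a match in B, so the result is 50. The cost is n times m filter comparisons, which is why fewer filters per digest translates into faster comparisons.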

Scaling up: Block-aligned digests & parallelization

Block-aligned similarity digests (sdbf-dd)
• Selected features from each fixed 16K block are hashed with SHA-1 and inserted into a Bloom filter
• Artifact SD fingerprint = bf 1 + bf 2 + bf 3 + …: a sequence of Bloom filters (sdbf-dd)
• Bloom filter = local SD fingerprint
  o 256 bytes
  o up to 192 features
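Because every 16K block gets its own filter, blocks can be digested independently and each filter maps back to a fixed byte range of the source. The sketch below shows only that structure: `block_digest` is a placeholder (a plain SHA-1 of the block) standing in for the real per-block Bloom filter, and the thread pool is used merely to show that the work items are independent.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16 * 1024  # 16K blocks, per the slide


def block_digest(block: bytes) -> bytes:
    """Placeholder per-block digest; the real tool would emit a 256-byte Bloom filter."""
    return hashlib.sha1(block).digest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[off:off + size] for off in range(0, len(data), size)]


def digest_parallel(data: bytes, workers: int = 4) -> list[bytes]:
    """Digest each block independently; map() preserves block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(block_digest, split_blocks(data)))


def source_range(index: int, size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Byte range [start, end) of the source covered by digest entry `index`."""
    return index * size, (index + 1) * size


data = bytes(40 * 1024)          # 40K of zeros: two full blocks and one partial
digests = digest_parallel(data)
```

The parallel result equals the sequential one, and entry 2 of the digest corresponds to source bytes 32768 onward.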

Advantages & challenges for block-aligned similarity digests (sdbf-dd)
• Advantages
  o Parallelizable computation
  o Direct mapping to source data
  o Shorter (1.6% vs 2.6% of source)
    - Faster comparisons (fewer BFs)
• Challenges
  o Less reliable for smaller files
  o Sparse data
  o Compatibility with sdbf digests
• Solution
  o Increase features for sdbf filters: 128 -> 160
  o Use 192 features per BF for sdbf-dd filters
  o Use compatible BF parameters to allow sdbf <-> sdbf-dd comparisons
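The 1.6% and 2.6% figures are consistent with one 256-byte filter per unit of source data. A quick check, where the 9.6K average for sdbf is an assumed value inside the "8-10K avg" range quoted earlier in the deck:

```python
BF_BYTES = 256  # size of one Bloom filter, per the slides


def digest_ratio(source_bytes_per_filter: float) -> float:
    """Digest size as a fraction of the source data it covers."""
    return BF_BYTES / source_bytes_per_filter


sdbf_dd_ratio = digest_ratio(16 * 1024)  # one filter per 16K block
sdbf_ratio = digest_ratio(9.6 * 1024)    # assumption: ~9.6K of source per filter

if __name__ == "__main__":
    print(f"sdbf-dd: {sdbf_dd_ratio:.2%}, sdbf: {sdbf_ratio:.2%}")
```

Fewer filters per byte of source also means fewer filter-to-filter comparisons when two digests are compared.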

sdhash 1.6: sdbf vs. sdbf-dd accuracy

Sequential throughput: sdhash 1.3
• Hash generation rate
  o Six-core Intel Xeon X5670 @ 2.93GHz: ~27MB/s per core
  o Quad-Core Intel Xeon @ 2.8 GHz: ~20MB/s per core
• Hash comparison
  o 1MB vs. 1MB: 0.5ms
• T5 corpus (4,457 files, all pairs)
  o 10mln file comparisons in ~15min
    - 667K file comps per minute (~11K per second)
    - Single core
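The T5 comparison figures can be rechecked from the numbers given: the all-pairs count for 4,457 files, and the rate implied by roughly 10 million comparisons in about 15 minutes. Dividing by 15 minutes gives about 667K comparisons per minute, which is about 11K per second.

```python
from math import comb

T5_FILES = 4457
pairs = comb(T5_FILES, 2)   # unordered pairs of distinct files

comparisons = 10_000_000    # "10mln", as rounded on the slide
elapsed_s = 15 * 60         # ~15 minutes

per_minute = comparisons / 15
per_second = comparisons / elapsed_s

if __name__ == "__main__":
    print(f"{pairs:,} pairs; {per_minute:,.0f}/min; {per_second:,.0f}/s")
```

The per-second rate is of the same order as the 0.5 ms quoted for a 1MB vs. 1MB comparison, given that many corpus files are smaller than 1MB.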

sdhash 1.6: File-parallel generation rates on 27GB real data (in RAM)

sdhash 1.6: Optimal file-parallel generation: 5GB synthetic target (RAM)

sdhash-dd: Hash generation rates 10GB in-RAM target (RAM)

Throughput summary: sdhash 1.6
• Parallel hash generation
  o sdbf: file-parallel execution
    - 260 MB/s on 12-core/24-threaded machine
  o sdbf-dd: block-parallel execution
    - 370 MB/s (SHA1 -> 330MB/s)
• Optimized hash comparison rates
  o 24 threads: 86.6 mln BF/s
    - 1.4 TB/s for small file comparison (<16KB)
I.e., we can search for a small file in a reference set of 1.4TB in 1s
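The 1.4 TB/s figure can be reproduced from the filter comparison rate under one assumption: a query file under 16KB produces a single filter, and each filter in the reference set covers one 16K block of source data, so every filter comparison "covers" 16K of reference data.

```python
BF_PER_SECOND = 86.6e6   # comparison rate from the slide (24 threads)
BLOCK_BYTES = 16 * 1024  # assumption: each reference filter covers one 16K block

covered_bytes_per_second = BF_PER_SECOND * BLOCK_BYTES
covered_tb_per_second = covered_bytes_per_second / 1e12  # decimal terabytes

if __name__ == "__main__":
    print(f"{covered_tb_per_second:.2f} TB of reference data per second")
```

Larger query files produce more filters, so the effective rate drops in proportion to the number of filters in the query digest.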

The Envisioned Architecture

The Current State
• libsdbf
  o API: C/C++, C#, Python
• CLI: sdhash
  o Files: sdhash-file
  o Network: sdhash-pcap
  o Disk: sdhash-dd
• Service: sdbf_d
  o Cluster: sdbfCluster
  o Client: sdbfWeb
  o Client: sdbfViz

Todo List (1)
• libsdbf
  o C++ rewrite (v2.0)
  o TBB parallelization
• sdhash-file
  o More command line options/compatibility w/ssdeep
  o Service-based processing (w/ sdbf_d)
• GPU acceleration
• sdhash-pcap
  o Pcap-aware processing:
    - payload extraction, file discovery, timelining

Todo List (2)
• sdbf_d
  o Persistence: XML
  o Service interface: JSON
  o Server clustering
• sdbfWeb
  o Browser-based management/query
• sdbfViz
  o Large-scale visualization & clustering

Further Development
• Integration w/ RDS
  o sdhash-set: construct SDBFs from existing SHA1 sets
    - Compare/identify whole folders, distributions, etc.
• Structural feature selection
  o E.g., exe/dll, pdf, zip, …
• Optimizations
  o Sampling
  o Skipping
    - Under min continuous block assumption
  o Cluster "core" extraction/comparison
• Representation
  o Multi-resolution digests
  o New crypto hashes
  o Data offsets

Thank you!
• http://roussev.net/sdhash
  o wget http://roussev.net/sdhash/sdhash-1.6.zip
  o make
  o ./sdhash
• Contact: Vassil Roussev, vassil@roussev.net
• Reminder
  o DFRWS'12: Washington DC, Aug 6-8
  o Paper deadline: Feb 20, 2012
  o Data sniffing challenge to be released shortly
