Vassil Roussev: The Current Forensic Workflow


  1. Vassil Roussev

  2. The Current Forensic Workflow. Forensic target (3TB): clone @150MB/s takes ~5.5 hrs, then process @10MB/s* takes ~82.5 hrs. We can start working on the case only after ~88 hours. (* processing rate per http://accessdata.com/distributed-processing)
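
As a rough check of the timing figures above (assuming decimal units and the stated rates; the 82.5-hour figure on the slide reflects rounding):

```latex
\frac{3\,\mathrm{TB}}{150\,\mathrm{MB/s}} = 20{,}000\,\mathrm{s} \approx 5.5\,\mathrm{h},
\qquad
\frac{3\,\mathrm{TB}}{10\,\mathrm{MB/s}} = 300{,}000\,\mathrm{s} \approx 83\,\mathrm{h},
\qquad
5.5\,\mathrm{h} + 83\,\mathrm{h} \approx 88\,\mathrm{h}
```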

  3. Scalable Forensic Workflow. Forensic target (3TB): clone and process concurrently @150MB/s. We can start working on the case immediately.

  4. Current Forensic Processing: hashing/filtering/correlation; file carving/reconstruction; indexing. The ultimate goal of this work is to make similarity hash-based correlation scalable and I/O-bound.

  5. Motivation for the similarity approach: traditional hash filtering is failing.
     - Known-file filtering: crypto-hash known files and store them in a library (e.g., NSRL); hash the files on the target; filter them in or out depending on interest.
     - Challenges:
       o Static libraries are falling behind: dynamic software updates and trivial artifact transformations mean we need version correlation.
       o We need to find embedded objects: a block/file inside a file, volume, or network trace.
       o We need higher-level correlations: disk-to-RAM, disk-to-network.

  6. Scenario #1: Fragment Identification. Source artifacts (files) vs. disk fragments (sectors) and network fragments (packets).
     - Given a fragment, identify its source:
       o Fragments of interest are 1-4KB in size.
       o Fragment alignment is arbitrary.

  7. Scenario #2: Artifact Similarity. Similar files (shared content/format) and similar drives (shared blocks/files).
     - Given two binary objects, detect similarity/versioning:
       o Similarity here is purely syntactic;
       o it relies on commonality of the binary representations.

  8. Solution: Similarity Digests. sdhash turns each artifact into a digest (sdbf 1, sdbf 2, ...); comparing digests answers questions such as "Is this fragment present on the drive?" and "Are these artifacts correlated?" with a score in the 0..100 range. All correlations are based on bitstream commonality.

  9. Quick Review: Similarity digests & sdhash

  10. Generating sdhash fingerprints (1): treat the digital artifact (block/file/packet/volume) as a byte stream; the features are all 64-byte sequences (a sliding window over the stream).

  11. Generating sdhash fingerprints (2): from the digital artifact, select the characteristic features (those that are statistically improbable/rare).

  12. Generating sdhash fingerprints (3): the feature selection process. All features pass through a weak-feature filter, based on normalized entropy (H_norm, scaled 0..1000), that discards data with low information content, and then through a rare-feature selector that keeps only locally rare features. [Figure (a): probability distribution of H_norm for doc files.]
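
A minimal sketch of the kind of entropy-based weak-feature filtering this slide describes. The 64-byte feature size and the 0..1000 H_norm scale come from the slides; the function names and thresholds below are illustrative and do not reproduce sdhash's actual precedence-rank selection.

```python
import math

def h_norm(window: bytes) -> int:
    """Normalized Shannon entropy of a 64-byte window, scaled to 0..1000
    (the scale used on the slide's H_norm axis)."""
    counts = [0] * 256
    for b in window:
        counts[b] += 1
    n = len(window)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
    max_entropy = math.log2(n)      # at most 64 distinct bytes -> log2(64) = 6 bits
    return round(1000 * entropy / max_entropy)

def candidate_features(data: bytes, win: int = 64,
                       low: int = 100, high: int = 990):
    """Weak-feature filter: slide a 64-byte window over the data and drop
    windows whose normalized entropy marks them as low-information (and,
    optionally, windows at the very top of the scale). The thresholds here
    are illustrative, not sdhash's."""
    for i in range(len(data) - win + 1):
        w = data[i:i + win]
        if low <= h_norm(w) <= high:
            yield i, w
```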

  13. Generating sdhash fingerprints (4): the selected features are SHA-1 hashed and inserted into Bloom filters, one filter per ~8-10KB of data on average. The artifact's SD fingerprint is the resulting sequence of Bloom filters (sdbf): bf1 + bf2 + bf3 + ... Each Bloom filter is a local SD fingerprint of 256 bytes holding up to 128/160 features.
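
The construction just described, as a rough sketch: each selected feature is SHA-1 hashed, a few hash-derived bit positions are set in the current 256-byte Bloom filter, and a new filter is started once the per-filter cap is reached. The 256-byte filter size and the 160-feature cap are from the slides; the number of bits set per feature (K_HASHES) is an illustrative choice, not necessarily sdhash's.

```python
import hashlib

BF_BYTES = 256          # 256-byte Bloom filter (2048 bits), per the slide
BF_BITS = BF_BYTES * 8
MAX_FEATURES = 160      # per-filter cap for sdbf in sdhash 1.6, per the slides
K_HASHES = 5            # bits set per feature; illustrative choice

def _bit_indices(feature: bytes):
    """Derive K_HASHES bit positions from the SHA-1 digest of the feature."""
    digest = hashlib.sha1(feature).digest()
    for i in range(K_HASHES):
        chunk = digest[4 * i: 4 * i + 4]
        yield int.from_bytes(chunk, "big") % BF_BITS

def build_sdbf(features):
    """Fold an iterable of selected features into a sequence of Bloom filters."""
    filters, current, count = [], bytearray(BF_BYTES), 0
    for feat in features:
        for idx in _bit_indices(feat):
            current[idx // 8] |= 1 << (idx % 8)
        count += 1
        if count == MAX_FEATURES:       # filter is full: start the next one
            filters.append(bytes(current))
            current, count = bytearray(BF_BYTES), 0
    if count:
        filters.append(bytes(current))
    return filters
```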

  14. Bloom filter (BF) comparison: two filters bf_A and bf_B are combined with a bitwise AND to produce a BF Score in 0..100. Based on BF theory, the overlap due to chance is analytically predictable; any additional BF overlap is proportional to the overlap in features. The BF Score is tuned such that BFScore(A_random, B_random) = 0.
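
A minimal sketch of the comparison this slide describes: AND the two filters, count the common bits, subtract the overlap expected by chance, and scale to 0..100 so that two random filters score ~0. The chance-overlap estimate below is a simple independence approximation standing in for the slide's analytically predicted value.

```python
def popcount(buf: bytes) -> int:
    return sum(bin(b).count("1") for b in buf)

def bf_score(bf_a: bytes, bf_b: bytes) -> int:
    """Compare two equally sized Bloom filters; result in 0..100."""
    assert len(bf_a) == len(bf_b)
    bits = len(bf_a) * 8
    common = popcount(bytes(x & y for x, y in zip(bf_a, bf_b)))
    set_a, set_b = popcount(bf_a), popcount(bf_b)
    expected = set_a * set_b / bits        # overlap expected purely by chance
    max_common = min(set_a, set_b)
    if max_common <= expected:             # nothing beyond chance
        return 0
    score = 100 * (common - expected) / (max_common - expected)
    return max(0, min(100, round(score)))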

  15. SDBF fingerprint comparison: given SD_A = (bf_A^1, ..., bf_A^n) and SD_B = (bf_B^1, ..., bf_B^m), compute BFScore(bf_A^i, bf_B^j) for every pair of filters, let max_i be the maximum over j of BFScore(bf_A^i, bf_B^j), and set SDScore(A, B) = Average(max_1, max_2, ..., max_n).
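
The max/average scheme above in code form, reusing the bf_score sketch from the previous slide; the slide leaves operand ordering unspecified, so this sketch simply uses A's filters as the rows.

```python
def sd_score(sd_a, sd_b) -> int:
    """SDScore(A, B) = Average(max_1, ..., max_n), where max_i is the best
    BFScore of A's i-th Bloom filter against any filter of B."""
    if not sd_a or not sd_b:
        return 0
    maxima = [max(bf_score(bf_a, bf_b) for bf_b in sd_b) for bf_a in sd_a]
    return round(sum(maxima) / len(maxima))
```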

  16. Scaling up: Block-aligned digests & parallelization

  17. Block-aligned similarity digests (sdbf-dd): the artifact is split into fixed 16K blocks, each block's selected features are SHA-1 hashed into that block's own Bloom filter, and the artifact's SD fingerprint is the sequence of Bloom filters (sdbf-dd): bf1 + bf2 + bf3 + ... Each Bloom filter is a local SD fingerprint of 256 bytes holding up to 192 features.
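
A sketch of the block-parallel structure this implies, reusing candidate_features, _bit_indices and BF_BYTES from the earlier sketches. The 16K block size and the 192-feature cap come from the slide; spreading blocks over a process pool is an assumed parallelization layout, not sdhash's actual implementation.

```python
from concurrent.futures import ProcessPoolExecutor

BLOCK = 16 * 1024          # one Bloom filter per 16K block (per the slide)
DD_MAX_FEATURES = 192      # per-block feature cap for sdbf-dd (per the slide)

def digest_block(block: bytes) -> bytes:
    """Digest one 16K block into a single 256-byte Bloom filter."""
    bf = bytearray(BF_BYTES)
    count = 0
    for _, feat in candidate_features(block):
        for idx in _bit_indices(feat):
            bf[idx // 8] |= 1 << (idx % 8)
        count += 1
        if count == DD_MAX_FEATURES:       # this block's filter is full
            break
    return bytes(bf)

def sdbf_dd(data: bytes, workers: int = 8):
    """Block-aligned digest: blocks are independent, so the computation is
    embarrassingly parallel (unlike the sequential sdbf feature stream)."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest_block, blocks))
```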

  18. Advantages & challenges for block-aligned similarity digests (sdbf-dd).
     - Advantages:
       o Parallelizable computation
       o Direct mapping to source data
       o Shorter digests (1.6% vs. 2.6% of source size)
       o Faster comparisons (fewer BFs)
     - Challenges:
       o Less reliable for smaller files
       o Sparse data
       o Compatibility with sdbf digests
     - Solution:
       o Increase the feature count for sdbf filters from 128 to 160
       o Use 192 features per BF for sdbf-dd filters
       o Use compatible BF parameters to allow sdbf vs. sdbf-dd comparisons
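
The 1.6% vs. 2.6% figures are consistent with one 256-byte filter per covered span, assuming ~10KB of data per sdbf filter (a rough check; real digests also carry a small header):

```latex
\text{sdbf-dd: } \frac{256\ \mathrm{B}}{16\,384\ \mathrm{B}} \approx 1.6\%
\qquad
\text{sdbf: } \frac{256\ \mathrm{B}}{\sim\!10{,}000\ \mathrm{B}} \approx 2.6\%
```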

  19. sdhash 1.6: sdbf vs. sdbf-dd accuracy

  20. Sequential throughput: sdhash 1.3.
     - Hash generation rate:
       o Six-core Intel Xeon X5670 @ 2.93GHz: ~27MB/s per core
       o Quad-core Intel Xeon @ 2.8GHz: ~20MB/s per core
     - Hash comparison:
       o 1MB vs. 1MB: 0.5ms
       o T5 corpus (4,457 files, all pairs): ~10 mln file comparisons in ~15 min on a single core, i.e., ~667K file comparisons per minute (~11K/s)
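
The all-pairs rate can be checked directly (which is why it is stated per minute above):

```latex
\binom{4457}{2} \approx 9.93\times 10^{6}\ \text{pairs},
\qquad
\frac{9.93\times 10^{6}\ \text{pairs}}{15\ \text{min}} \approx 6.6\times 10^{5}\ \text{pairs/min}
\approx 1.1\times 10^{4}\ \text{pairs/s}
```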

  21. sdhash 1.6: File-parallel generation rates on 27GB of real data (in RAM)

  22. sdhash 1.6: Optimal file-parallel generation on a 5GB synthetic target (in RAM)

  23. sdhash-dd: hash generation rates on a 10GB in-RAM target

  24. Throughput summary: sdhash 1.6.
     - Parallel hash generation:
       o sdbf: file-parallel execution, 260 MB/s on a 12-core/24-thread machine
       o sdbf-dd: block-parallel execution, 370 MB/s (SHA-1: 330 MB/s)
     - Optimized hash comparison rates:
       o 24 threads: 86.6 mln BF comparisons/s, i.e., ~1.4 TB/s for small-file comparison (<16KB); in other words, we can search for a small file against a 1.4TB reference set in 1 second.
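
The 1.4 TB/s figure follows from each block-aligned Bloom filter standing in for 16K of reference data:

```latex
86.6\times 10^{6}\ \tfrac{\mathrm{BF}}{\mathrm{s}} \times 16\,384\ \tfrac{\mathrm{B}}{\mathrm{BF}}
\approx 1.42\times 10^{12}\ \tfrac{\mathrm{B}}{\mathrm{s}} \approx 1.4\ \mathrm{TB/s}
```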

  25. The Envisioned Architecture

  26. The Current State: libsdbf as the core library, with an API for C/C++, C#, and Python; CLI tools sdhash-file (files), sdhash-pcap (network), sdhash-dd (disk), and sdbfCluster (cluster); the sdbf_d service; and the sdbfWeb and sdbfViz clients.

  27. Todo List (1):
     - libsdbf:
       o C++ rewrite (v2.0)
       o TBB parallelization
     - sdhash-file:
       o More command-line options / compatibility with ssdeep
       o Service-based processing (with sdbf_d)
     - GPU acceleration
     - sdhash-pcap:
       o Pcap-aware processing: payload extraction, file discovery, timelining

  28. Todo List (2):
     - sdbf_d:
       o Persistence: XML
       o Service interface: JSON
       o Server clustering
     - sdbfWeb:
       o Browser-based management/query
     - sdbfViz:
       o Large-scale visualization & clustering

  29. Further Development:
     - Integration with RDS:
       o sdhash-set: construct SDBFs from existing SHA-1 sets
       o Compare/identify whole folders, distributions, etc.
     - Structural feature selection:
       o E.g., exe/dll, pdf, zip, ...
     - Optimizations:
       o Sampling
       o Skipping (under a minimum-continuous-block assumption)
       o Cluster "core" extraction/comparison
     - Representation:
       o Multi-resolution digests
       o New crypto hashes
       o Data offsets

  30. Thank you!
     - http://roussev.net/sdhash
       o wget http://roussev.net/sdhash/sdhash-1.6.zip
       o make
       o ./sdhash
     - Contact: Vassil Roussev, vassil@roussev.net
     - Reminder: DFRWS'12, Washington DC, Aug 6-8; paper deadline Feb 20, 2012; data sniffing challenge to be released shortly.
