

SLIDE 1

Jackstraws: Picking Command and Control Connections from Bot Traffic

Grégoire Jacob¹, Ralf Hund², Christopher Kruegel¹, Thorsten Holz²

¹ University of California, Santa Barbara / ² Ruhr-University Bochum

Fri Aug 12 2011

  • G. Jacob (UCSB)

Fri Aug 12 2011 1 / 20

SLIDE 2

Introduction: the botnet threat

What do botnets do?

• Support large-scale malicious activities and the underground economy
• Coordination of malicious attacks, e.g., denial of service, spam campaigns, click fraud
• Sensitive information theft, e.g., credentials, credit card numbers

Why are botnets so convenient for attackers?

• Command & Control (C&C) infrastructure for remote control
• Incoming commands to trigger attacks and updates
• Outgoing responses for status monitoring and information leakage

SLIDE 3

Introduction: fighting against botnets

Botnet detection and mitigation

• Host-based techniques

  • Traditional malware detection and mitigation
  • Signature matching and behavior monitoring

• Network-based techniques

  • Blacklisting IPs related to C&C servers
  • Signatures matching C&C protocol and commands

• Automatic generation of these signatures, IP lists or models

  • Requires clean, C&C-only logs of network traffic and system calls

Difficulty of identifying C&C traffic

• Potentially encrypted C&C traffic
• Non-C&C or “noise” traffic interleaved

  • Malicious connections to 3rd party websites (e.g., part of the attacks)
  • Configuration connections (e.g., connectivity tests, time recovery)
  • Fake benign connections (e.g., mimicry of legitimate applications)

SLIDE 4

Introduction: identifying C&C traffic

Our approach: Jackstraws

• Combination of network traces and host-based activity

  • Rationale: C&C traffic results in observable host activity, e.g. system modifications, critical information accesses
  • Host-based model: system call graphs with data dependency
  • Network-related link: each graph associated with a network connection

• Machine learning to identify and generalize C&C-related host activity

  • Rationale: similar commands result in similar core activities, even for different bots
  • Mining significant activities: graph mining over known connections
  • Identifying similar activity types: graph clustering
  • Abstracting activity types: graph merging into templates
  • Detecting C&C activity: template matching over unknown connections

SLIDE 5

System: Jackstraws overview

System architecture

SLIDE 6

System: graph collection

Analysis environment

• Logging: system calls and network API calls
• Tainting: data flows in memory and over the file system

Graph generation

• Input: trace of system and network calls
• Output: a call graph for each successful connection
• Algorithm:

  • Graph root: successful connect and associated sends/recvs
  • Node extension: recursive backward dependency over system calls
  • Node labeling: call parameters, with resource names abstracted
  • Graph collapsing: collapse duplicate nodes
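
The graph-generation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Call` record, its `reads`/`writes` taint-label sets and the helper names are hypothetical, and the sketch follows data dependencies in both directions from the root calls because the slide does not detail the traversal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One logged call (hypothetical record format)."""
    idx: int                          # position in the trace
    name: str                         # e.g. "recv", "NtWriteFile"
    reads: frozenset = frozenset()    # taint labels the call consumes
    writes: frozenset = frozenset()   # taint labels the call produces


def depends_on(later, earlier):
    """True when `later` consumes data that `earlier` produced."""
    return earlier.idx < later.idx and bool(earlier.writes & later.reads)


def build_graph(trace, roots):
    """Grow a graph from the root calls by following data dependencies recursively."""
    nodes = {c.idx for c in roots}
    edges = set()
    work = list(roots)
    while work:
        cur = work.pop()
        for other in trace:
            if depends_on(cur, other):
                edge = (other.idx, cur.idx)      # producer -> consumer
            elif depends_on(other, cur):
                edge = (cur.idx, other.idx)
            else:
                continue
            edges.add(edge)
            if other.idx not in nodes:
                nodes.add(other.idx)
                work.append(other)
    return nodes, edges


trace = [
    Call(0, "connect", writes=frozenset({"sock"})),
    Call(1, "recv", reads=frozenset({"sock"}), writes=frozenset({"buf"})),
    Call(2, "NtCreateFile", writes=frozenset({"fh"})),
    Call(3, "NtWriteFile", reads=frozenset({"buf", "fh"})),
    Call(4, "NtQuerySystemTime", writes=frozenset({"time"})),  # unrelated call
]
# Root: the connect and its associated recv.
nodes, edges = build_graph(trace, roots=trace[:2])
```

In this toy trace the unrelated `NtQuerySystemTime` call shares no data with the connection, so it stays out of the graph.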

SLIDE 7

System: graph collection

Graph generation

Figure: example behavior graph before and after collapsing duplicate nodes (node and edge labels recovered from the slide).

Before collapsing:

  • network: recv
  • systemcall: NtWriteFile (three duplicate nodes), each with edge label arg: Buffer=buf
  • systemcall: NtCreateFile
      FileName: isSystemDirectory/isExecutable
      DesiredAccess: FileReadAttributes
      Attributes: AttributeNormal
      CreateDisposition: FileSupersede
  • Edge label between NtCreateFile and NtWriteFile: arg: FileHandle=FileHandle

After collapsing:

  • network: recv
  • systemcall: NtWriteFile (Collapse: isMultiple), with edge label arg: Buffer=buf
  • systemcall: NtCreateFile
      FileName: isSystemDirectory/isExecutable
      DesiredAccess: FileReadAttributes
      Attributes: AttributeNormal
      CreateDisposition: FileSupersede
  • Edge label between NtCreateFile and NtWriteFile: arg: FileHandle=FileHandle
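
The collapsing step shown on this slide could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' code: the graph encoding (`labels`, `edges`), the producer-to-consumer edge direction and the function name `collapse_duplicates` are hypothetical, and the merge runs as a single pass over nodes that share a label and the same labeled neighbours.

```python
from collections import defaultdict


def collapse_duplicates(labels, edges):
    """Merge nodes with an identical label and identical labeled neighbours.

    labels: {node_id: label}; edges: set of (src, dst, edge_label).
    Returns (new_labels, new_edges, multiple), where `multiple` holds the
    representatives that stand for more than one original node.
    """
    signature = defaultdict(set)
    for src, dst, lab in edges:
        signature[src].add(("out", dst, lab))
        signature[dst].add(("in", src, lab))
    groups = defaultdict(list)
    for node in sorted(labels):
        groups[(labels[node], frozenset(signature[node]))].append(node)
    # The smallest node id of each group acts as its representative.
    rep = {n: members[0] for members in groups.values() for n in members}
    new_labels = {members[0]: labels[members[0]] for members in groups.values()}
    new_edges = {(rep[s], rep[d], lab) for s, d, lab in edges}
    multiple = {members[0] for members in groups.values() if len(members) > 1}
    return new_labels, new_edges, multiple


labels = {0: "recv", 1: "NtWriteFile", 2: "NtWriteFile", 3: "NtWriteFile", 4: "NtCreateFile"}
edges = set()
for write in (1, 2, 3):
    edges.add((0, write, "Buffer=buf"))
    edges.add((4, write, "FileHandle=FileHandle"))

new_labels, new_edges, multiple = collapse_duplicates(labels, edges)
```

The three `NtWriteFile` nodes end up as one node flagged as multiple, mirroring the `Collapse: isMultiple` annotation on the slide.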

SLIDE 8

System: graph mining

Frequent subgraph mining:

• Input: call graphs associated with malicious vs. benign connections
• Output: significant subgraphs covering only malicious (C&C) activity
• Algorithm:

  • Graph mining: frequent subgraphs from malicious connections
  • Maximization: stripping induced subgraphs from the mined set
  • Set difference: stripping subgraphs included in benign connections
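
The three mining steps can be illustrated with a deliberately simplified sketch. The big assumption here is that each graph is reduced to a set of labeled edges, so that "subgraph of" becomes plain set inclusion; real frequent subgraph mining needs isomorphism tests and a dedicated miner. All function names and the `max_size` cap are hypothetical.

```python
from itertools import combinations


def frequent_patterns(graphs, min_support, max_size=3):
    """Edge sets contained in at least a `min_support` fraction of the graphs."""
    needed = min_support * len(graphs)
    found = set()
    for g in graphs:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(g), size):
                pattern = frozenset(combo)
                if pattern in found:
                    continue
                if sum(pattern <= h for h in graphs) >= needed:
                    found.add(pattern)
    return found


def maximal(patterns):
    """Maximization: drop every pattern strictly contained in another mined pattern."""
    return {p for p in patterns if not any(p < q for q in patterns)}


def malicious_only(patterns, benign_graphs):
    """Set difference: drop every pattern that also occurs in a benign graph."""
    return {p for p in patterns if not any(p <= b for b in benign_graphs)}


a = ("recv", "NtCreateFile")
b = ("NtCreateFile", "NtWriteFile")
d = ("connect", "NtQuerySystemTime")
e = ("connect", "send")
malicious = [frozenset({a, b}), frozenset({a, b}), frozenset({d}), frozenset({d})]
benign = [frozenset({d, e})]

mined = frequent_patterns(malicious, min_support=0.5)
significant = malicious_only(maximal(mined), benign)
```

Here the download-like pattern `{a, b}` survives, while the time query `{d}` is discarded because it also appears in a benign connection.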

SLIDE 9

System: graph mining

Frequent subgraph mining

SLIDE 10

System: graph clustering and template generation

Graph clustering:

• Input: significant malicious subgraphs
• Output: clusters grouping graphs that represent similar activity
• Algorithm:

  • Graph similarity: common edges in the maximal common subgraph
  • Graph clustering: clustering by repeated bisection
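
A rough sketch of the two clustering ingredients follows. Several points are assumptions made for illustration: graphs are non-empty sets of labeled edges (so the common subgraph is a set intersection), similarity is normalised by the larger graph, and the bisection rule (split around the least similar pair until average and minimum similarity thresholds hold) is a stand-in for whatever criterion the authors' clustering actually optimises.

```python
from itertools import combinations


def similarity(g1, g2):
    """Shared edges relative to the larger graph (graphs must be non-empty)."""
    return len(g1 & g2) / max(len(g1), len(g2))


def cohesive(cluster, avg_min=0.6, low_min=0.4):
    """True when pairwise similarities meet the average and minimum thresholds."""
    sims = [similarity(x, y) for x, y in combinations(cluster, 2)]
    return not sims or (sum(sims) / len(sims) >= avg_min and min(sims) >= low_min)


def bisect_repeatedly(cluster):
    """Split a cluster around its least similar pair until every part is cohesive."""
    if cohesive(cluster):
        return [cluster]
    seed_a, seed_b = min(combinations(cluster, 2), key=lambda p: similarity(*p))
    left = [g for g in cluster if similarity(g, seed_a) >= similarity(g, seed_b)]
    right = [g for g in cluster if similarity(g, seed_a) < similarity(g, seed_b)]
    return bisect_repeatedly(left) + bisect_repeatedly(right)


g1 = frozenset({"recv->NtCreateFile", "recv->NtWriteFile", "NtCreateFile->NtWriteFile"})
g2 = g1 | {"NtWriteFile->process:start"}
g3 = frozenset({"NtOpenKey->send", "NtQueryValueKey->send"})
g4 = g3 | {"NtReadFile->send"}

clusters = bisect_repeatedly([g1, g2, g3, g4])
```

With these toy graphs the download-like graphs and the leakage-like graphs end up in separate clusters.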

Template generation:

• Input: clusters of similar malicious subgraphs
• Output: graph template covering the graphs of the cluster
• Algorithm:

  • Template construction: minimal common supergraph
  • Template generalization: supergraph weighted by node frequency

    + Frequent nodes constitute the core activity shared by bots
    + Infrequent nodes constitute optional activity specific to different bots
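
The idea of a frequency-weighted supergraph can be sketched as follows. The sketch assumes that nodes are identified by their labels, so the common supergraph is simply the union of the edges; the `core_threshold` value and the function names are hypothetical and not taken from the slides.

```python
from collections import Counter


def nodes_of(graph):
    """Node labels appearing in a graph given as a set of (src, dst) edges."""
    return {node for edge in graph for node in edge}


def build_template(cluster, core_threshold=0.8):
    """Union of the cluster's graphs, with each node weighted by its frequency."""
    edges = set().union(*cluster)
    counts = Counter(node for graph in cluster for node in nodes_of(graph))
    weights = {node: count / len(cluster) for node, count in counts.items()}
    core = {node for node, weight in weights.items() if weight >= core_threshold}
    return {
        "edges": edges,
        "weights": weights,
        "core": core,                      # frequent nodes: shared core activity
        "optional": set(weights) - core,   # infrequent nodes: bot-specific extras
    }


base = {("recv", "NtCreateFile"), ("NtCreateFile", "NtWriteFile")}
cluster = [
    frozenset(base),
    frozenset(base | {("NtWriteFile", "process: start")}),
    frozenset(base | {("recv", "NtAllocateVirtualMemory")}),
]
template = build_template(cluster)
```

In this example the receive, create and write nodes appear in every graph and form the core, while the process start and memory allocation nodes are kept as optional.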

SLIDE 11

System: graph clustering and template generation

Graph clustering and template generation

SLIDE 12

System: template matching

Template matching:

• Input: template, unlabeled collected call graphs
• Output: match result
• Algorithm:

  • Core matching: subgraph isomorphism with core nodes

    + Mandatory nodes must be present

  • Extended match: maximal common subgraph for optional nodes

    + Isomorphism result used to initialize search
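
The two matching phases can be sketched with a small backtracking search. This is a simplified stand-in, not the authors' matcher: it checks that every template edge between mapped nodes exists in the graph (a non-induced match), and the extension phase greedily adds optional nodes starting from the core mapping. The data layout and all function names are hypothetical.

```python
def consistent(mapping, t_edges, g_edges):
    """Every template edge between mapped nodes must also exist in the graph."""
    return all((mapping[a], mapping[b]) in g_edges
               for a, b in t_edges if a in mapping and b in mapping)


def candidates(t, mapping, t_labels, g_labels):
    """Unused graph nodes carrying the same label as template node `t`."""
    used = set(mapping.values())
    return [g for g in g_labels if g not in used and g_labels[g] == t_labels[t]]


def match_core(core, t_labels, t_edges, g_labels, g_edges, mapping=None):
    """Backtracking search that maps every core node; returns a dict or None."""
    mapping = mapping or {}
    todo = [t for t in core if t not in mapping]
    if not todo:
        return mapping
    t = todo[0]
    for g in candidates(t, mapping, t_labels, g_labels):
        trial = {**mapping, t: g}
        if consistent(trial, t_edges, g_edges):
            found = match_core(core, t_labels, t_edges, g_labels, g_edges, trial)
            if found is not None:
                return found
    return None


def extend_match(mapping, t_labels, t_edges, g_labels, g_edges):
    """Greedily add optional template nodes, starting from the core mapping."""
    mapping = dict(mapping)
    for t in t_labels:
        if t in mapping:
            continue
        for g in candidates(t, mapping, t_labels, g_labels):
            trial = {**mapping, t: g}
            if consistent(trial, t_edges, g_edges):
                mapping = trial
                break
    return mapping


t_labels = {"r": "recv", "c": "NtCreateFile", "w": "NtWriteFile", "p": "process: start"}
t_edges = {("r", "w"), ("c", "w"), ("w", "p")}
core = ["r", "c", "w"]                       # "p" is optional

g_labels = {1: "recv", 2: "NtCreateFile", 3: "NtWriteFile", 5: "process: start"}
g_edges = {(1, 3), (2, 3), (3, 5)}
core_map = match_core(core, t_labels, t_edges, g_labels, g_edges)
full_map = extend_match(core_map, t_labels, t_edges, g_labels, g_edges)

# A graph where the received data never reaches the write call does not match.
no_match = match_core(core, t_labels, t_edges, g_labels, {(2, 3)})
```

A connection would be flagged only when `match_core` succeeds; the extended mapping then indicates how much of the optional activity is also present.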

SLIDE 13

System: template matching

Template matching

Figure: example of a template matched against a behavior graph. The node and edge labels below are recovered from the slide; the layout of the edges is not recoverable from the text.

Nodes:

  • systemcall: recv
  • network: connect
      port: 443
      #ip=193.23.126.55
      #ip=94.75.255.138
  • systemcall: NtCreateFile
      Filename: inProgramDirectory\isExecutable
      DesiredAccess: FileReadAttributes
      Attributes: AttributeNormal
      CreateDisposition: FileSupersede
      #Filename=\??\C:\Program Files\temp\ldr.exe
  • systemcall: NtCreateFile
      Filename: inProgramDirectory\isExecutable
      DesiredAccess: FileReadAttributes | FileWriteAttributes
      Attributes: AttributeNormal
      CreateDisposition: FileSupersede
      #Filename=\??\C:\Program Files\temp\ldr.exe
  • network: recv (Collapse: isMultiple)
  • systemcall: NtAllocateVirtualMemory (*: *)
  • systemcall: NtDeviceIoControlFile (*: *)
  • systemcall: NtWriteFile (Collapse: isMultiple)
  • systemcall: NtSetInformationFile (Collapse: isMultiple)
  • process: start

Edge labels:

  • arg: ip=buf
  • arg: ObjectAttributes=buf (twice)
  • arg: Socket=Socket
  • arg: ObjectAttributes=RegionSize
  • arg: InputBuffer=buf
  • arg: Buffer=buf
  • arg: Length=buf
  • arg: FileInformation=buf
  • arg: buf=buf
  • arg: FileHandle=FileHandle (twice)

SLIDE 14

Evaluation: dataset presentation

Collected botnet traffic

• 37,572 bot samples corresponding to 745 families (e.g. EgroupDial, Palevo, Virut)
• 130,635 network connections and associated behavior graphs (successful connections only)

Labeling connections for ground truth

• Manually-crafted network signatures: 385 C&C, 162 benign
• 10,801 malicious connections
• 12,367 benign connections
• 66,538 unknown connections
• 40,929 incomplete or irrelevant graphs removed
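
As a quick consistency check on the figures above, the four categories add up to the 130,635 collected connections. The short snippet below verifies this and derives each category's share; the variable names are only for illustration.

```python
total_connections = 130_635
categories = {
    "malicious": 10_801,
    "benign": 12_367,
    "unknown": 66_538,
    "removed": 40_929,
}

# The labeled, unknown and removed connections account for the whole dataset.
assert sum(categories.values()) == total_connections

# Share of each category in percent, rounded to one decimal place.
shares = {name: round(100 * count / total_connections, 1)
          for name, count in categories.items()}
```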

SLIDE 15

Evaluation: dataset presentation

Training and testing sets

SLIDE 16

Evaluation: training the system

System configuration

• Mining frequency threshold: 10%

  • Trade-off between maximum coverage and mining runtime

• Bisection threshold: 60% average and 40% minimal similarity

  • Higher thresholds reduce the effect of generalization

System runtime

• Mining: 16 h, Clustering: 4.5 h, Generalization: 30 min
• Reasonable processing time given the NP-hardness of the algorithms

Templates quality

• 417 templates generated

  • 397 templates semantically meaningful

• Different types of commands covered

  • Information leakage, download and execute, startup, stealth

SLIDE 17

Evaluation: testing the system

Testing over labeled connections

• Detection rate: 81.6%
• Detection without the generalization: 66.0%
• Detection of new families that were missing in the training set
• False negatives: 18.4%, mainly due to incomplete/infrequent activity
• False positives: 0.2%, mainly due to weaker templates

SLIDE 18

Evaluation: testing the system

Testing over unknown connections

• 66,538 unknown connections
• New matches: 9,464 connections
• New detected families: 193 not covered by network signatures
• New detected variants: missed by outdated network signatures
• False negatives: high proportion of benign traffic (manual verification)
• False positives: 27

SLIDE 19

Evaluation: system limitations

Testing over unknown connections

Weakness                | Consequences         | Potential remediation                                        | Supported
Dynamic analysis        | Incomplete call logs | Enhanced analysis environment: e.g. multi-path execution     | ✕
Computational time      | Non-termination      | Algorithm optimizations: e.g. node labeling                  | ✓
                        |                      | Algorithm optimizations: e.g. graph collapsing               | ✓
Interleaved calls       | Noise against mining | System calls selection: e.g. calls with data dependency      | ✓
Functional polymorphism | No core activity     | Normalizing graphs: e.g. duplicate nodes collapsing          | ✓
                        |                      | Rewriting rules: e.g. equivalent operations                  | ✕

SLIDE 20

Conclusion: Jackstraws

Contributions

• Solution to the problem of identifying C&C traffic from noise
• Automated generation of templates representing C&C behaviors
• Gains provided by the template generalization:

  • Protocol-agnostic representation of C&C activity
  • Increased level of understanding for analysts
  • Coverage extended to families unknown during training