SLIDE 1 Stack Overflow Considered Harmful?
The Impact of Copy&Paste on Android Application Security
- F. Fischer*, K. Böttinger*, H. Xiao*, C. Stransky†, Y. Acar†, M. Backes†, S. Fahl†
*Fraunhofer AISEC †CISPA, Saarland University
Presentation by Kevin Liao
SLIDE 2
Code copypasta insecure?
SLIDE 3
How prolific are security-related code snippets from Stack Overflow in Android applications?
Research question
SLIDE 4
Rather than discuss results at the end…
- Present results first, then analyze the methodology
- Does the methodology convince us of the results?
This talk
SLIDE 5
The high-level approach
SLIDE 6
The high-level approach
Extract security-related snippets
SLIDE 7
The high-level approach
Security analysis
SLIDE 8
The high-level approach
Identify code reuse
SLIDE 9
Results: Alarming (potentially)
SLIDE 10
Extracted snippets
30 million posts
2 million Android-related posts
~4,000 security-related snippets
SLIDE 11 Security classification
Secure 70% Insecure 30%
SLIDE 12
Prevalence of code reuse
1.3 million free apps
2,673 secure snippets
1,161 insecure snippets
SLIDE 16 Apps with security-related snippets
Secure 2% Insecure 98%
SLIDE 17 Top-offender? TLS…
Other 8% Empty TrustManager 92%
TrustManager verification
SLIDE 18 Next top-offender? Symmetric crypto
Other 91% AES/ECB 9%
ECB mode
SLIDE 20
Do insecure snippets have lower scores?
SLIDE 21
Do insecure snippets with a warning have lower scores?
SLIDE 22
Are high view count/score snippets copy&pasted more?
SLIDE 23
Are high view count/score snippets with a warning copy&pasted less?
SLIDE 24
Discussion of methodology
Extract security-related snippets
SLIDE 25 Extract security-related snippets
- 1. Get all posts with ‘Android’ tag
- 2. Filter code snippets that use security APIs:
- TLS/SSL
- Symmetric/asymmetric crypto
- RNG
- Signatures
- Message digests
- Authentication/access control
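The two extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the post structure (`tags`, `snippets`), the function names, and the short list of API names per category are all assumptions made for the example, and plain substring matching is far cruder than a real filter would need to be.

```python
# Hypothetical category -> API-name fragments; a real list would be much longer.
SECURITY_APIS = {
    "TLS/SSL": ["SSLContext", "TrustManager", "HostnameVerifier"],
    "Symmetric/asymmetric crypto": ["Cipher", "SecretKeySpec", "KeyPairGenerator"],
    "RNG": ["SecureRandom"],
    "Signatures": ["Signature"],
    "Message digests": ["MessageDigest"],
    "Authentication/access control": ["javax.security.auth"],
}


def security_categories(snippet):
    """Return the sorted categories whose API names occur in the snippet text."""
    return sorted(
        category
        for category, names in SECURITY_APIS.items()
        if any(name in snippet for name in names)
    )


def filter_security_snippets(posts):
    """Step 1: keep posts tagged 'android'. Step 2: keep snippets that
    mention at least one security API. Returns (snippet, categories) pairs."""
    selected = []
    for post in posts:
        if "android" not in post["tags"]:
            continue
        for snippet in post["snippets"]:
            categories = security_categories(snippet)
            if categories:
                selected.append((snippet, categories))
    return selected
```

Substring matching will produce false positives (any identifier containing "Signature", for instance), which is one reason the snippet-extraction step deserves discussion.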
SLIDE 26
Discuss snippet extraction
SLIDE 27
Discussion of methodology
Security analysis
SLIDE 28 Security analysis
- 1. Manually label snippets as secure or insecure
- 2. Train a binary classifier to automatically label all snippets as secure or insecure
SLIDE 29 tl;dr for labeling rules
- SSL/TLS: Use TLS v1.1 or greater; don’t use old crypto
- Symmetric: Don’t use old crypto; don’t use ECB; don’t use static/zeroed/derived keys or IVs
- Asymmetric: Use >= 2048-bit RSA; use >= 224-bit ECC
- Hashing: Don’t use MD-family
- RNG: Use a cryptographically secure RNG with a securely random seed
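To make a few of these rules concrete, here is a sketch of how they might be checked mechanically with regular expressions. It is illustration only: the labeling in the talk was done manually, the rule names and patterns below are assumptions, and they cover only a small subset of the rules (ECB, MD-family hashes, pre-TLS-v1.1 protocol names, `java.util.Random`).

```python
import re

# Hypothetical (rule name, pattern) pairs over Java source text.
INSECURE_RULES = [
    ("ECB mode",
     re.compile(r'Cipher\.getInstance\(\s*"[^"]*/ECB/', re.IGNORECASE)),
    ("MD-family hash",
     re.compile(r'MessageDigest\.getInstance\(\s*"MD[245]"', re.IGNORECASE)),
    ("old TLS/SSL version",
     re.compile(r'SSLContext\.getInstance\(\s*"(SSL(v[23])?|TLSv1)"')),
    ("non-crypto RNG",
     re.compile(r'\bnew\s+Random\s*\(')),
]


def violated_rules(snippet):
    """Return the names of all rules whose pattern matches the snippet.
    An empty list means only that none of these few rules fired,
    not that the snippet is secure."""
    return [name for name, pattern in INSECURE_RULES if pattern.search(snippet)]
```

For example, `violated_rules('Cipher.getInstance("AES/ECB/PKCS5Padding")')` reports the ECB rule, while the same call with `"AES/GCM/NoPadding"` reports nothing. Rules such as "static/zeroed/derived keys or IVs" are much harder to express as text patterns.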
SLIDE 30
Security score of training set
SLIDE 31
Train SVM binary classifier
SLIDE 32 Feature selection
- Based on tf-idf
- “The features rely merely on the vocabulary level of input code snippets, without even understanding how they are functioning.”
- Claim: Can be more accurate and more scalable than rule-based methods
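As a rough illustration of what "vocabulary-level" tf-idf features look like, the sketch below computes them by hand for code snippets. It assumes one common formulation (term frequency normalized by snippet length, times log(N / document frequency)) and a simple identifier tokenizer; the exact variant and tokenization used for the classifier in the talk may differ.

```python
import math
import re
from collections import Counter


def tokenize(snippet):
    """Split a code snippet into identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", snippet)


def tf_idf(snippets):
    """Return one {token: weight} dict per snippet.
    weight = (count / tokens in snippet) * log(N / snippets containing token)."""
    docs = [Counter(tokenize(s)) for s in snippets]
    n_docs = len(docs)

    # Document frequency: in how many snippets does each token appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())

    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in doc.items()
        })
    return vectors
```

Given the two snippets `Cipher.getInstance("AES/ECB/PKCS5Padding")` and `Cipher.getInstance("AES/GCM/NoPadding")`, tokens shared by both (`Cipher`, `getInstance`, `AES`) get weight 0, while `ECB` and `GCM` get positive weight. The distinguishing vocabulary carries the signal, with no understanding of what the code does.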
SLIDE 33 https://chrisalbon.com/machine_learning/preprocessing_text/tf-idf/
SLIDE 34 Security classification
Secure 70% Insecure 30%
SLIDE 35
Discuss security classification
SLIDE 36
Discussion of methodology
Identify code reuse
SLIDE 37 Identify code reuse
- 1. Transform source code and Dalvik executables into same IR
- 2. Identify similar code snippets using Program Dependency Graphs (PDGs)
SLIDE 38
IR transformation
[Figure: Source code -> PPA -> Typed AST; Dalvik executable -> Lift -> Bytecode]
SLIDE 39 Program Dependency Graphs
- Generate PDG for each method
- Nodes: Statements in methods
- Edges: Data and control dependence
SLIDE 40
Dependency edges
Data: S2 depends on S1, since S2 reads A, which S1 writes.
Control: S2 depends on A, since A determines whether S2 executes.
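The two edge types can be made concrete with a small sketch. It assumes each statement is already summarized as the variables it defines, the variables it uses, and the statement that guards it; that tuple format and the function name are hypothetical. Data dependence is approximated by "most recent definition in program order", which is only valid for straight-line code; a real PDG construction would compute reaching definitions over the control-flow graph.

```python
def build_pdg(statements):
    """statements: list of (label, defs, uses, guard) in program order, where
    defs/uses are sets of variable names and guard is the label of the
    controlling statement (or None).
    Returns (data_edges, control_edges) as sets of (source, target) labels."""
    last_def = {}  # variable -> label of the statement that last defined it
    data_edges, control_edges = set(), set()

    for label, defs, uses, guard in statements:
        # Data dependence: this statement reads a variable defined earlier.
        for var in uses:
            if var in last_def:
                data_edges.add((last_def[var], label))
        # Control dependence: the guard decides whether this statement runs.
        if guard is not None:
            control_edges.add((guard, label))
        for var in defs:
            last_def[var] = label

    return data_edges, control_edges
```

For the program `S1: a = read(); S2: b = a + 1; S3: if (b > 0) { S4: c = b; }`, this yields data edges S1->S2, S2->S3, S2->S4 and one control edge S3->S4.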
SLIDE 41
Examples of PDGs
SLIDE 42
Prevalence of code reuse
SLIDE 43
Discuss identification of code reuse
SLIDE 44 Final discussion
- About results?
- About methodology?
- About future work?