Breaking CAPTCHAs on the Dark Web: Using neural networks to enable scraping

slide-1
SLIDE 1

Breaking CAPTCHAs on the Dark Web

Using neural networks to enable scraping

RP #62, Kevin Csuka & Dirk Gaastra

Supervisor: Yonne de Bruijn, Fox-IT

6 February 2018

University of Amsterdam

slide-2
SLIDE 2

Introduction

slide-5
SLIDE 5

Scraping the Dark Web

Useful for threat intelligence companies ... but sometimes hard to get to. Blockades such as CAPTCHAs are the main obstacle for scrapers.

1

slide-7
SLIDE 7

CAPTCHA

Figure 1: CAPTCHA example

  • Completely Automated Public Turing test to tell Computers and Humans Apart
  • Test to determine whether the user is human or not

2

slide-11
SLIDE 11

Main question

How would a scraper be able to circumvent CAPTCHAs that prevent it from properly scraping dark web websites?

Sub-questions:

  • 1. What is the impact of solving CAPTCHAs?
  • 2. Can CAPTCHAs be solved using Optical Character Recognition (OCR)?
  • 3. Can CAPTCHAs be solved using Machine Learning (ML)?

3

slide-12
SLIDE 12

Related Work

slide-15
SLIDE 15

Related Work

  • 1. Lawrence et al. created their own dark web scraping tool, D-miner; CAPTCHAs were solved by human labor [1]
  • 2. Ryan Mitchell demonstrated how to solve CAPTCHAs using Optical Character Recognition with Tesseract [2]
  • 3. Arun Patala previously used Torch to train a neural network to solve CAPTCHAs [3]

4

slide-16
SLIDE 16

Methods

slide-17
SLIDE 17

Methods

Two methods to answer the questions:

  • 1. Categorizing dark web websites
  • 2. Breaking CAPTCHAs

5

slide-23
SLIDE 23
  • 1. Categorizing websites

Analysis of 633 dark web websites

  • Which ones are up?
  • Are there any duplicates?
  • Which ones block scraping?
  • What kind of blockade are they using?
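The first two questions (which sites are up, and which are duplicates) could be approximated as follows. This is only a minimal sketch, not the method used in the study: it assumes the page bodies have already been fetched, treats two sites as duplicates when their whitespace-normalized, lowercased bodies hash to the same SHA-256 digest, and uses hypothetical names (`normalize`, `group_duplicates`) and made-up URLs.

```python
import hashlib
from collections import defaultdict


def normalize(body: str) -> str:
    """Collapse whitespace and lowercase, so trivially different copies hash equally."""
    return " ".join(body.split()).lower()


def group_duplicates(pages: dict) -> list:
    """Group URLs whose normalized page bodies have the same SHA-256 digest.

    `pages` maps a URL to its fetched body text (None means the site was down).
    Returns only the groups that contain more than one URL.
    """
    by_digest = defaultdict(list)
    for url, body in pages.items():
        if body is None:  # site did not respond; nothing to compare
            continue
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Hypothetical, hand-made input; real bodies would come from the scraper.
pages = {
    "http://aaa.onion": "<h1>Market</h1>  Login",
    "http://bbb.onion": "<h1>market</h1> login",
    "http://ccc.onion": "<h1>Forum</h1>",
    "http://ddd.onion": None,
}
duplicates = group_duplicates(pages)
up = [url for url, body in pages.items() if body is not None]
```

An exact hash only catches identical copies; mirrors that differ by a session token or timestamp would need a fuzzier comparison.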

6

slide-29
SLIDE 29
  • 2. Breaking CAPTCHAs

There are three common approaches to defeating CAPTCHAs:

  • 1. Using a service that solves CAPTCHAs through human labor
  • 2. Exploiting bugs in the implementation that allow the attacker to bypass the CAPTCHA
  • 3. Using character recognition software to solve the CAPTCHA

8

slide-30
SLIDE 30
  • 2. Breaking CAPTCHAs - Dataset

Testing two common types of CAPTCHA:

Figure 2: CAPTCHAs set 1, generated using PHP

Figure 3: CAPTCHAs set 2, generated with Python

9

slide-31
SLIDE 31
  • 2. Breaking CAPTCHAs

Figure 4: Training the neural network

10

slide-32
SLIDE 32
  • 2. Breaking CAPTCHAs

Figure 5: Login web page with generated CAPTCHA

11

slide-33
SLIDE 33
  • 2. Breaking CAPTCHAs

Figure 6: Workflow of solving CAPTCHA with TensorFlow via Scrapy

12

slide-34
SLIDE 34

Results

slide-36
SLIDE 36
  • 1. Categorizing websites

Figure 7: Percentage of scraping blockades using CAPTCHAs (n = 465)

13

slide-37
SLIDE 37
  • 1. Categorizing websites

Figure 8: Percentage of scraping blockades using CAPTCHAs (n = 465, n = 55)

14

slide-39
SLIDE 39
  • 2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Figure 9: Success rate of Tesseract and TensorFlow (n = 1,000), higher is better

15

slide-41
SLIDE 41
  • 2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Levenshtein distance: the minimum number of single-character edits needed to turn the solver's output into the correct result [5]

E.g. kitten to mitten = 1

Figure 10: Combined Levenshtein distance, lower is better
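The two metrics on these slides can be sketched in a few lines. This is an illustrative implementation under assumptions, not the evaluation code used for the figures: "combined" is assumed to mean the sum of per-CAPTCHA distances, "success" is assumed to mean an exact match, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def success_rate(predictions, answers):
    """Fraction of CAPTCHAs whose predicted text matches the answer exactly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def combined_distance(predictions, answers):
    """Sum of per-CAPTCHA Levenshtein distances (assumed meaning of 'combined')."""
    return sum(levenshtein(p, a) for p, a in zip(predictions, answers))


answers = ["kitten", "x7Gq2", "abcde"]      # made-up ground truth
predictions = ["mitten", "x7Gq2", "abde"]   # made-up solver output
print(levenshtein("kitten", "mitten"))      # 1, as in the slide's example
```

The distance is useful next to the success rate because it separates a solver that misses by one character from one that returns unrelated text.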

16

slide-42
SLIDE 42

Conclusion

slide-45
SLIDE 45

Conclusion

  • Circumventing CAPTCHAs is necessary to scrape blocked parts of websites
  • Machine Learning is most effective
  • However, if immediacy takes precedence over success rate and accuracy, then Tesseract (OCR) might be a better option

17

slide-46
SLIDE 46

Future Research

slide-49
SLIDE 49

Future Research

A more granular analysis of dark web websites:

  • What content?
  • Any content hidden, due to lack of privileges?

18

slide-50
SLIDE 50

Future Research

Increase readability for Tesseract by "cleaning up" the image

Figure 11: Removing noise from CAPTCHA [6]
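One simple way such a clean-up step could work is a median filter over a binarized image. The sketch below is an assumption-laden illustration, not the technique shown in Figure 11: it represents the image as a plain list of rows with 1 for ink and 0 for background, and the function name `median_filter` and the sample image are hypothetical.

```python
def median_filter(image):
    """Apply a 3x3 median filter to a binary image given as a list of rows (0 or 1).

    Each output pixel becomes the median of its neighbourhood (the upper
    median when the window has an even size). Pixels outside the image are
    ignored, so border pixels use a smaller window.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            row.append(window[len(window) // 2])  # median of the window
        result.append(row)
    return result


# Made-up 5x5 image: a solid 3x3 block (the "character") plus one stray pixel.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
cleaned = median_filter(noisy)
```

The isolated pixel disappears while the centre of the block survives, but the filter also erodes the block's corners, so the window size would need tuning against real CAPTCHAs.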

19

slide-51
SLIDE 51

Future Research

Achieve more efficient model training by using character segmentation

Figure 12: CAPTCHA character segmentation [7]
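A basic form of character segmentation is a vertical projection: columns without any ink are taken as gaps between characters. The sketch below assumes a binarized image stored as a list of rows (1 = ink) and characters that do not touch or overlap; the name `segment_columns` and the sample image are hypothetical, and the method behind Figure 12 may differ.

```python
def segment_columns(image):
    """Split a binary image (list of rows, 1 = ink) into character column ranges.

    Returns (start, end) pairs with `end` exclusive, one per run of
    consecutive columns that contain at least one ink pixel.
    """
    width = len(image[0])
    has_ink = [any(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a character ends
            start = None
    if start is not None:                  # character touching the right edge
        segments.append((start, width))
    return segments


# Made-up 3x8 image with two "characters" separated by empty columns.
image = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],
]
segments = segment_columns(image)
crops = [[row[a:b] for row in image] for a, b in segments]
```

Each crop could then be classified on its own, which turns one multi-character problem into several single-character ones.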

20

slide-54
SLIDE 54

Future Research

Try more CAPTCHAs:

  • Increased difficulty
  • If software to generate the CAPTCHAs, including the answers, is not available, send a training set to be solved by human labor. This costs money: $1.39 per 1,000 images [8]

21

slide-55
SLIDE 55

Questions

?

22

slide-56
SLIDE 56

References

[1] Lawrence, H., Hughes, A., Tonic, R., & Zou, C. (2017, October). D-miner: A framework for mining, searching, visualizing, and alerting on darknet events. In Communications and Network Security (CNS), 2017 IEEE Conference on (pp. 1-9). IEEE.

[2] Mitchell, R. (2015). Web scraping with Python: collecting data from the modern web. O'Reilly Media, Inc.

[3] Arun Patala. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/

[4] people.cs.pitt.edu

[5] extremetech.com

[6] ahm3dibrahim.wordpress.com

[7] medium.com

[8] http://www.deathbycaptcha.com/

23