Breaking CAPTCHAs on the Dark Web: Using neural networks to enable scraping


  1. Breaking CAPTCHAs on the Dark Web: Using neural networks to enable scraping
     RP #62, Kevin Csuka & Dirk Gaastra
     Supervisor: Yonne de Bruijn, Fox-IT
     6 February 2018, University of Amsterdam

  2. Introduction

  5. Scraping the Dark Web
     Useful for threat intelligence companies, but sometimes hard to get to.
     Blockades such as CAPTCHAs are the main issue for scrapers.

  7. CAPTCHA
     Figure 1: CAPTCHA example
     • Completely Automated Public Turing test to tell Computers and Humans Apart
     • Test to determine whether the user is human or not

  11. Main question
      How would a scraper be able to circumvent CAPTCHAs that prevent it from properly scraping dark web websites?
      Sub-questions:
      1. What is the impact of solving CAPTCHAs?
      2. Can CAPTCHAs be solved using Optical Character Recognition (OCR)?
      3. Can CAPTCHAs be solved using Machine Learning (ML)?

  12. Related Work

  15. Related Work
      1. Lawrence et al. created their own dark web scraping tool, D-miner; CAPTCHAs were solved by human labor [1]
      2. Ryan Mitchell demonstrated how to solve CAPTCHAs using Optical Character Recognition with Tesseract [2]
      3. Arun Patala previously used Torch to train a neural network to solve CAPTCHAs [3]

  16. Methods

  17. Methods
      Two methods to answer the questions:
      1. Categorizing dark web websites
      2. Breaking CAPTCHAs

  23. 1. Categorizing websites
      Analysis of 633 dark web websites:
      • Which ones are up?
      • Are there any duplicates?
      • Which ones block scraping?
      • What kind of blockade are they using?
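
The four questions above amount to a filter-and-tally pass over crawl results. The following is a minimal sketch of that idea, not the authors' tooling: the record layout, the `.onion` names, the blockade labels, and the choice to detect duplicates by hashing the page body are all assumptions made for illustration.

```python
from collections import Counter
import hashlib

# Hypothetical crawl records: (address, reachable, page body, blockade type or None).
records = [
    ("site-a.onion", True, "<html>market</html>", "captcha"),
    ("site-b.onion", True, "<html>market</html>", "captcha"),  # mirror of site-a
    ("site-c.onion", False, "", None),                         # site is down
    ("site-d.onion", True, "<html>forum</html>", "login"),
    ("site-e.onion", True, "<html>blog</html>", None),         # no blockade
]

def categorize(records):
    """Drop unreachable sites and duplicates, then tally blockade types."""
    seen_digests = set()
    unique = []
    for address, reachable, body, blockade in records:
        if not reachable:
            continue  # "Which ones are up?"
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue  # "Are there any duplicates?" (assumed: identical page body)
        seen_digests.add(digest)
        unique.append((address, blockade))
    # "Which ones block scraping?" and "What kind of blockade?"
    blockades = Counter(b for _, b in unique if b is not None)
    return unique, blockades

unique_sites, blockades = categorize(records)
print(len(unique_sites))   # 3
print(dict(blockades))
```

Hashing the full page body only catches exact mirrors; near-duplicates with differing timestamps or session tokens would need a fuzzier comparison.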

  28. 2. Breaking CAPTCHAs
      There are 3 common approaches to defeat CAPTCHAs:
      1. Using a service which solves CAPTCHAs through human labor
      2. Exploiting bugs in the implementation that allow the attacker to bypass the CAPTCHA
      3. Using character recognition software to solve the CAPTCHA

  30. 2. Breaking CAPTCHAs - Dataset
      Testing two common types of CAPTCHA:
      Figure 2: CAPTCHA set 1, generated using PHP
      Figure 3: CAPTCHA set 2, generated with Python

  31. 2. Breaking CAPTCHAs
      Figure 4: Training the neural network

  32. 2. Breaking CAPTCHAs
      Figure 5: Login web page with generated CAPTCHA

  33. 2. Breaking CAPTCHAs
      Figure 6: Workflow of solving CAPTCHA with TensorFlow via Scrapy

  34. Results

  36. 1. Categorizing websites
      Figure 7: Percentage of scraping blockades using CAPTCHAs (n = 465)

  37. 1. Categorizing websites
      Figure 8: Percentage of scraping blockades using CAPTCHAs (n = 465, n = 55)

  39. 2. Breaking CAPTCHAs - TensorFlow vs. Tesseract
      Figure 9: Success rate of Tesseract and TensorFlow (n = 1,000), higher is better
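
The slides do not define how the success rate in Figure 9 was computed. A minimal sketch, assuming it is the fraction of CAPTCHAs whose predicted text matches the known answer exactly; the sample strings below are made up for illustration.

```python
def success_rate(predictions, answers):
    """Fraction of CAPTCHAs whose predicted text equals the answer exactly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical solver outputs and ground-truth answers.
predicted = ["x7kq2", "m3nd8", "p0w4r", "abcde"]
expected = ["x7kq2", "m3nd0", "p0w4r", "abcde"]
print(success_rate(predicted, expected))  # 0.75
```

An exact-match rate treats a one-character miss the same as a completely wrong answer, which is why an edit-distance measure is a useful complement.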

  41. 2. Breaking CAPTCHAs - TensorFlow vs. Tesseract
      Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to get the correct result [5]
      E.g. kitten to mitten = 1
      Figure 10: Combined Levenshtein distance, lower is better
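
The Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below is one common formulation, not necessarily the implementation used for Figure 10; `combined_distance` assumes that "combined" means the sum of distances over all test CAPTCHAs, which the slides do not state.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def combined_distance(predictions, answers):
    """Assumed meaning of 'combined': sum of per-CAPTCHA distances."""
    return sum(levenshtein(p, a) for p, a in zip(predictions, answers))

print(levenshtein("kitten", "mitten"))  # 1
```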

  42. Conclusion

  45. Conclusion
      • Circumventing CAPTCHAs is necessary to scrape blocked parts of websites
      • Machine Learning is the most effective approach
      • However, if immediacy takes precedence over success rate and accuracy, then Tesseract (OCR) might be a better option

  46. Future Research

  49. Future Research
      A more granular analysis of dark web websites:
      • What content do they hold?
      • Is any content hidden due to lack of privileges?

  50. Future Research
      Increase readability for Tesseract by "cleaning up" the image
      Figure 11: Removing noise from CAPTCHA [6]
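
The slides do not say which clean-up method Figure 11 shows. As one possible illustration, the sketch below binarizes a grayscale grid with a fixed threshold and clears isolated ink pixels; the threshold of 128, the list-of-lists image layout, and the function names are assumptions, and a real pipeline would more likely use an imaging library.

```python
def binarize(gray, threshold=128):
    """Map a grayscale grid (0-255) to 1 for dark ink pixels, 0 for background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

def remove_specks(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned

# Tiny made-up image: a three-pixel stroke plus one isolated noise pixel.
gray = [
    [255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255],
    [255,  15, 255, 255,  30],
    [255, 255, 255, 255, 255],
]
for row in remove_specks(binarize(gray)):
    print(row)
```

This only removes single-pixel noise; lines drawn through the characters would need a different filter.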

  51. Future Research
      Achieve a more efficient training model by using character segmentation
      Figure 12: CAPTCHA character segmentation [7]
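
One simple way to segment characters is a vertical projection: find runs of columns that contain ink and cut at the empty columns between them. The sketch below assumes a binary image stored as a list of rows (1 = ink) and is not necessarily the method shown in Figure 12.

```python
def segment_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                     # a character begins
        elif not ink and start is not None:
            segments.append((start, x))   # an empty column ends it
            start = None
    if start is not None:
        segments.append((start, width))   # character touches the right edge
    return segments

# Made-up two-row image with three ink runs.
image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
print(segment_columns(image))  # [(0, 2), (4, 5), (6, 8)]
```

Each segment could then be classified on its own, which reduces the problem to single-character recognition. The approach breaks down when characters overlap or touch, since no empty column separates them.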

  54. Future Research
      Try more CAPTCHAs:
      • Increased difficulty
      • If software to generate the CAPTCHAs, including the answers, is not available, send a training set to be solved by human labor. This costs money: $1.39 per 1,000 images [8]

  55. Questions?

  56. References
      [1] Lawrence, H., Hughes, A., Tonic, R., & Zou, C. (2017, October). D-miner: A framework for mining, searching, visualizing, and alerting on darknet events. In Communications and Network Security (CNS), 2017 IEEE Conference on (pp. 1-9). IEEE.
      [2] Mitchell, R. (2015). Web scraping with Python: collecting data from the modern web. O'Reilly Media, Inc.
      [3] Arun Patala. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/
      [4] people.cs.pitt.edu
      [5] extremetech.com
      [6] ahm3dibrahim.wordpress.com
      [7] medium.com
      [8] http://www.deathbycaptcha.com/
