Breaking reCAPTCHA: A Holistic Approach via Shape Recognition


Breaking reCAPTCHA: A Holistic Approach via Shape Recognition. IFIP SEC 2011. Paul Baecher, Niklas Büscher, Marc Fischlin, Benjamin Milde. Darmstadt University of Technology, supported by DFG Heisenberg and Emmy Noether Programmes.


slide-1
SLIDE 1

Breaking reCAPTCHA: A Holistic Approach via Shape Recognition

IFIP SEC 2011

Paul Baecher, Niklas Büscher, Marc Fischlin, Benjamin Milde

Darmstadt University of Technology, supported by DFG Heisenberg and Emmy Noether Programmes

slide-2
SLIDE 2

Introduction


slide-3
SLIDE 3

What Are CAPTCHAs?

  • Completely Automated Public Turing test to tell Computers and Humans Apart
  • “reverse” Turing test, term coined by [vABHL03]
  • challenge/response protocol where
      • the response should be easy for humans to find
      • the response should be hard for machines to compute
          • a machine success rate of 0.01% already counts as a break, according to [CLSC05, vAMM+08]
  • application: protect online services from automated use

image: cryptographp

slide-4
SLIDE 4

reCAPTCHA

figure: example challenges from the 1st, 2nd, 3rd, and 4th generation

  • Very popular CAPTCHA service by Google
  • may be considered quite “strong”
  • unique feature: uses OCR source to generate challenges
      • two words per challenge: a scan word and a verification word
      • the words are dictionary words...


slide-5
SLIDE 5

reCAPTCHA Today

reCAPTCHA as of June 2011 (5th generation)


slide-6
SLIDE 6

Breaking reCAPTCHA


slide-7
SLIDE 7

Breaking reCAPTCHA – Approach

  • Typical approach to break text CAPTCHAs
      • segment into individual letters/digits
      • recognize each letter/digit individually


slide-8
SLIDE 8

Breaking reCAPTCHA – Approach

  • Typical approach to break text CAPTCHAs
      • segment into individual letters/digits
      • recognize each letter/digit individually
      • non-trivial segmentation is considered hard [CLSC05]
  • our approach
      • match entire words at once (holistically)
      • i.e., skip segmentation and treat whole words the way the typical approach treats single letters


slide-9
SLIDE 9

High-level Overview

pipeline: scale 200% → detect edges → remove ellipse → shape representation (no ellipse)

  • Third-generation reCAPTCHA challenges add inverted ellipses


slide-10
SLIDE 10

Removing the ellipse

  • 1. Approximate ellipse center
original challenge


slide-11
SLIDE 11

Removing the ellipse

  • 1. Approximate ellipse center

after erosion operations


slide-12
SLIDE 12

Removing the ellipse

  • 1. Approximate ellipse center

after dilation operations


slide-13
SLIDE 13

Removing the ellipse

  • 1. Approximate ellipse center

center approximated
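
The erosion, dilation, and centroid steps shown on the last four slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 3x3 structuring element, the binary list-of-lists image format, the number of rounds, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: 1 = foreground pixel, 0 = background


def erode(img: Image) -> Image:
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):  # border pixels are treated as background
        for x in range(1, w - 1):
            if all(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def dilate(img: Image) -> Image:
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w and img[yy][xx]:
                        out[y][x] = 1
    return out


def approximate_center(img: Image, rounds: int = 1) -> Tuple[float, float]:
    """Erode, then dilate, then return the centroid (x, y) of what is left."""
    for _ in range(rounds):
        img = erode(img)  # thin strokes vanish, thick blobs shrink
    for _ in range(rounds):
        img = dilate(img)  # surviving blobs grow back
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("nothing left after erosion; use fewer rounds")
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
```

The idea is that thin letter strokes disappear under erosion while a thick region survives, so the centroid of the remaining pixels gives a rough estimate of the center.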


slide-14
SLIDE 14

Removing the ellipse

  • 1. Approximate ellipse center
  • 2. run edge detection on the challenge image

edge detection
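
The slides do not name the edge detector used. As a stand-in, the sketch below extracts contour pixels from a binary image by keeping every foreground pixel that touches the background. The image format and the function name are assumptions made for the example.

```python
from typing import List, Set, Tuple

Image = List[List[int]]  # assumed format: 1 = foreground pixel, 0 = background


def contour_pixels(img: Image) -> Set[Tuple[int, int]]:
    """Return (x, y) of foreground pixels with a background 4-neighbour."""
    h, w = len(img), len(img[0])

    def is_background(y: int, x: int) -> bool:
        # Anything outside the image counts as background.
        return not (0 <= y < h and 0 <= x < w) or img[y][x] == 0

    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y)
        for y in range(h)
        for x in range(w)
        if img[y][x] and any(is_background(y + dy, x + dx) for dy, dx in neighbours)
    }
```

The resulting set of contour points is the kind of input the classification step in the next slides would label as either ellipse or word.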


slide-15
SLIDE 15

Removing the ellipse

  • 1. Approximate ellipse center
  • 2. run edge detection on the challenge image
  • 3. use machine learning to classify contour pixels

after classification, 1 round


slide-16
SLIDE 16

Removing the ellipse

  • 1. Approximate ellipse center
  • 2. run edge detection on the challenge image
  • 3. use machine learning to classify contour pixels

after classification, 4 rounds


slide-17
SLIDE 17

Removing the ellipse

  • 1. Approximate ellipse center
  • 2. run edge detection on the challenge image
  • 3. use machine learning to classify contour pixels

after classification, 9 rounds


slide-18
SLIDE 18

Matching Shapes

  • Contour line (without ellipse) describes the shape of a word
  • reCAPTCHA words are dictionary words
  • key idea: prepare a database of all dictionary words and use common shape matching techniques


slide-19
SLIDE 19

Matching Shapes

  • Contour line (without ellipse) describes the shape of a word
  • reCAPTCHA words are dictionary words
  • key idea: prepare a database of all dictionary words and use common shape matching techniques

  • How to build a database of all dictionary words?
  • How to “match” two shapes?


slide-20
SLIDE 20

Shape Recognition


slide-21
SLIDE 21

Shape Recognition

  • Well-studied problem in Computer Vision
  • powerful technique: Shape Contexts (SC)
      • invariant under translation and scaling
      • compact description of the shape

diagram: challenge shape → create SC → challenge SC; reference shapes → create SC → reference SCs; the challenge SC is matched against the reference SCs


slide-22
SLIDE 22

From Shapes to Shape Contexts

  • Convert shape (set of points in polar space) into SC (sets of two-dimensional histograms)

  • example for one point:

figure: two-dimensional histogram over distance bins and angle bins

  • use a χ²-distance to match sets of histograms
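
A minimal sketch of the conversion and the χ² comparison described on this slide. The bin counts, the radius limits, the normalisation by mean pairwise distance, and the nearest-histogram averaging in `shape_distance` are assumptions chosen for illustration; the slides do not give the parameters or the matching rule the authors used.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Histogram = List[float]


def shape_contexts(points: Sequence[Point], r_bins: int = 5, theta_bins: int = 12,
                   r_inner: float = 0.125, r_outer: float = 2.0) -> List[Histogram]:
    """Return one normalised log-polar histogram per point of the shape."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Dividing by the mean pairwise distance makes the descriptor scale invariant.
    total = sum(math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
                for i in range(n) for j in range(i + 1, n))
    mean_dist = total / (n * (n - 1) / 2)
    if mean_dist == 0:
        raise ValueError("all points coincide")
    log_inner, log_span = math.log(r_inner), math.log(r_outer / r_inner)
    result = []
    for i, (px, py) in enumerate(points):
        hist = [0.0] * (r_bins * theta_bins)
        for j, (qx, qy) in enumerate(points):
            dx, dy = qx - px, qy - py  # relative offsets: translation invariant
            if i == j or (dx == 0 and dy == 0):
                continue
            frac = (math.log(math.hypot(dx, dy) / mean_dist) - log_inner) / log_span
            r_bin = min(r_bins - 1, max(0, int(frac * r_bins)))  # clamp to range
            theta = math.atan2(dy, dx) % (2 * math.pi)
            t_bin = min(theta_bins - 1, int(theta / (2 * math.pi) * theta_bins))
            hist[r_bin * theta_bins + t_bin] += 1.0
        count = sum(hist)
        result.append([v / count for v in hist] if count else hist)
    return result


def chi2_distance(g: Histogram, h: Histogram) -> float:
    """0.5 * sum((g_k - h_k)^2 / (g_k + h_k)) over bins that are not both empty."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(g, h) if a + b > 0)


def shape_distance(a: List[Histogram], b: List[Histogram]) -> float:
    """Average chi^2 cost from each histogram in a to its closest one in b."""
    return sum(min(chi2_distance(g, h) for h in b) for g in a) / len(a)
```

Because only relative offsets and normalised distances enter the histograms, a shifted or uniformly scaled copy of a shape yields the same descriptors, which is the invariance claimed on the previous slide.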


slide-23
SLIDE 23

Matching Shape Contexts Efficiently

  • Naive approach is prohibitively slow for 20K dictionary words
  • more efficient strategy needed
      • work on a random subset of the shape's points
      • start with a small subset and double it gradually
      • results in logarithmic search space reduction
  • first/last character special treatment
      • easy to detect, allows pruning large chunks of the dictionary
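
The strategy above can be sketched as a coarse-to-fine search. This is an illustration under stated assumptions, not the authors' algorithm: halving the candidate list each round (`keep_fraction`), the point limits, and the `cost(word, n_points)` callback, which stands for "matching cost of this dictionary word using n sampled points", are hypothetical choices for the example.

```python
from typing import Callable, List, Optional, Sequence


def prune_by_first_char(words: Sequence[str], first_char: str) -> List[str]:
    """Drop every dictionary word that does not start with the detected character."""
    return [w for w in words if w.startswith(first_char)]


def progressive_match(candidates: Sequence[str],
                      cost: Callable[[str, int], float],
                      start_points: int = 8,
                      max_points: int = 128,
                      keep_fraction: float = 0.5) -> Optional[str]:
    """Rank candidates cheaply, keep the best fraction, then double the sample size."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must lie strictly between 0 and 1")
    remaining = list(candidates)
    n_points = start_points
    while len(remaining) > 1:
        # Cheap comparison on few points first; accuracy grows as candidates shrink.
        ranked = sorted(remaining, key=lambda w: cost(w, n_points))
        remaining = ranked[:max(1, int(len(ranked) * keep_fraction))]
        n_points = min(max_points, n_points * 2)
    return remaining[0] if remaining else None
```

With a fixed keep fraction the number of rounds grows with the logarithm of the dictionary size, and the expensive comparisons on many points run only for the few candidates that survive the early rounds.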


slide-24
SLIDE 24

Experimental Results


slide-25
SLIDE 25

Results

reCAPTCHA generation            2        3        4
Test set size                 496     1005      301
Total success rate          12.7%     5.9%    11.6%
Run time                    24.5s    17.5s    15.4s
Dictionary success rate       22%   10.43%    23.5%
First character detected    90.2%    73.2%    84.6%

  • Recall that a CAPTCHA is considered broken at a machine success rate of 0.01%
  • performance measurement on verification words only


slide-26
SLIDE 26

The End

Thank you!

?


slide-27
SLIDE 27

References

[CLSC05] Kumar Chellapilla, Kevin Larson, Patrice Y. Simard, and Mary Czerwinski. Building segmentation based human-friendly human interaction proofs (HIPs). In HIP, volume 3517 of Lecture Notes in Computer Science, pages 1–26. Springer-Verlag, 2005.

[vABHL03] Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In Eli Biham, editor, Advances in Cryptology – EUROCRYPT 2003, volume 2656 of Lecture Notes in Computer Science, pages 294–311, Warsaw, Poland, May 4–8, 2003. Springer, Berlin, Germany.

[vAMM+08] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.