Determine the true author of anonymous documents - Team Worden




slide-1
SLIDE 1

Determine the true author of anonymous documents

/31 1

slide-2
SLIDE 2
  • Marc Barrowclift
  • Travis Dutko
  • Corey Everitt
  • Jiakang Jin
  • Eric Nordstrom
  • Ivan "Frankie" Orrego

Stakeholder - Rachel Greenstadt Advisor - Jeff Salvage

Team Worden

/31 2

slide-3
SLIDE 3
  • Protect & identify authors of anonymous works
  • Real-world example: J.K. Rowling’s The Cuckoo’s Calling, published under the pseudonym Robert Galbraith

What Is Worden?

/31 3

slide-4
SLIDE 4

What Is Worden?

/31 4

  • Our project
  • Dramatically restructure JStylo
  • Why care?
slide-5
SLIDE 5

Live Demo

/31 5

IDENTIFY ANONYMOUS DOCUMENT

slide-6
SLIDE 6

UI Design & User studies

/31 6

Domain Expert, Average, Intermediate

slide-7
SLIDE 7
  • IKVM.NET
  • Compile .jar dependencies into .NET assemblies usable from C#
  • Rapid prototyping due to familiarity

Original Client / Server Architecture

/31 7

slide-8
SLIDE 8
  • Data prep and packaging done on client side to meet deadlines
  • Angular.js MVC Client
  • Spring MVC Server

Final Client / Server Architecture

/31 8

slide-9
SLIDE 9
  • Two-way data binding
  • Allows proper HTTP abstraction
  • Handles DOM manipulation
  • Control over information flow
  • Highly modularized

Client Architecture

/31 9

slide-10
SLIDE 10
  • Only utilizing the “C” in MVC
  • 1. Picks up HTTP traffic
  • 2. Repackages it
  • 3. Pipes to JStylo
  • 4. Returns a JSON response to controller
  • SaaS (Stylometry as a Service)
  • Independent of web browser

Server Architecture

/31 10
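
The four controller steps above can be sketched in a few lines. This is an illustration only, not the project's Spring controller: the function names, the request fields ("known", "anonymous") and the stub standing in for JStylo are all assumptions made for the example.

```python
import json


def classify_stub(documents):
    """Hypothetical stand-in for the JStylo call; returns a fixed answer."""
    return {"author": "unknown", "confidence": 0.0}


def handle_request(raw_body, classify=classify_stub):
    """Controller-only flow: parse, repackage, call the engine, answer in JSON."""
    # 1. Pick up the HTTP traffic (here: the raw request body as a string).
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})

    # 2. Repackage it into the shape the engine expects (assumed shape).
    documents = {
        "known": payload.get("known", {}),        # author name -> list of texts
        "anonymous": payload.get("anonymous", ""),
    }
    if not documents["anonymous"]:
        return 400, json.dumps({"error": "missing anonymous document"})

    # 3. Pipe it to the stylometry engine.
    result = classify(documents)

    # 4. Return a JSON response.
    return 200, json.dumps(result)
```

Because the function only takes a string and returns a status plus a JSON string, any HTTP client can drive it, which is the sense in which the service is independent of the web browser.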

slide-11
SLIDE 11
  • Feature Extraction Engine
  • Reduces documents to data
  • Machine Learning Engine
  • Interprets data

Backend System Architecture

/31 11

slide-12
SLIDE 12
  • Feature Extraction Engine
  • Convert raw words into numeric data
  • Tools: JGAAP, Stanford POS tagger.

Backend System Architecture

/31 12

slide-13
SLIDE 13
  • Feature Extraction Engine
  • Convert raw words into numeric data
  • Tools: JGAAP, Stanford POS tagger.

Backend System Architecture

/31 13

Example: “Sell as a great software engineering projct”

  • Word Bigrams

Sell as | as a | a great | great software | software engineering | engineering projct

  • Misspellings

project → projct: count = 1

  • Letter Bigrams

in: 2 ng: 2 se: 1 ro: 1 ll: 1 ea: 1 …
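
The three features in the example above can be reproduced with a short sketch. It is not how JGAAP or JStylo implement them: the function names are hypothetical, letter bigrams are counted inside each word and ignore case, and the misspelling list is a hand-supplied lookup table assumed for the example.

```python
import re
from collections import Counter


def word_bigrams(text):
    """Return adjacent word pairs in order of appearance."""
    words = text.split()
    return list(zip(words, words[1:]))


def letter_bigrams(text):
    """Count adjacent letter pairs inside each word, ignoring case."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def misspellings(text, known_misspellings):
    """Count words found in a lookup table of misspelling -> correct form."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in known_misspellings:
            counts[word] += 1
    return counts


sentence = "Sell as a great software engineering projct"
print(word_bigrams(sentence))
print(letter_bigrams(sentence).most_common(6))
print(misspellings(sentence, {"projct": "project"}))
```

Each function turns raw text into counts, which is the "numeric data" the machine learning engine consumes.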

slide-14
SLIDE 14
  • Machine Learning Engine
  • Interprets/Classifies data
  • Tools: Weka, Apache Spark

Backend System Architecture

/31 14

slide-15
SLIDE 15
  • Machine Learning Engine
  • Interprets/Classifies data
  • Tools: Weka, Apache Spark

Backend System Architecture

/31 15

Image Source: Wikipedia
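
To make "interprets/classifies data" concrete, here is a minimal sketch of one possible classifier. It is a deliberately simple substitute (one letter-bigram profile per author, compared by cosine similarity), not the algorithm Weka or Apache Spark would run, and every function name is an assumption made for the example.

```python
import math
from collections import Counter


def features(text):
    """Letter-bigram counts as a sparse feature vector."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter()
    for word in cleaned.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(known):
    """Sum each author's feature vectors into one profile per author."""
    profiles = {}
    for author, texts in known.items():
        profile = Counter()
        for text in texts:
            profile.update(features(text))
        profiles[author] = profile
    return profiles


def classify(profiles, text):
    """Return the author whose profile is most similar to the text."""
    vector = features(text)
    return max(profiles, key=lambda author: cosine(profiles[author], vector))
```

Usage: build profiles from documents of known authorship with `train`, then call `classify` on the anonymous document.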

slide-16
SLIDE 16
  • Open source development
  • Builds upon work from dozens of research students
  • Apache Spark machine learning library added
  • Refactoring
  • Separate each component into its own, independent module
  • Decouple the third-party library, Weka

Design & Construction

/31 16

slide-17
SLIDE 17

Before

/31 17

slide-18
SLIDE 18

After

/31 18

slide-19
SLIDE 19

Refactoring Progress

/31 19

slide-20
SLIDE 20

Design & Construction Cont.

/31 20

  • Design Patterns
  • Creational: Builder (API), Singleton
  • Structural: Adapter (Machine Learning integration), Decorator (Feature Extraction Engine), Facade (API)

  • Testing
  • 76% of code touched was covered
slide-21
SLIDE 21

Machine Learning Adapter

/31 21
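
The Adapter pattern named on this slide can be sketched as follows. The diagram itself is not reproduced in the text, so this is a generic illustration rather than the project's design: `Classifier`, `ThirdPartyModel` and `ThirdPartyAdapter` are hypothetical names, and the toy model stands in for an external library such as Weka or Spark.

```python
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Interface the rest of the system codes against (hypothetical)."""

    @abstractmethod
    def train(self, vectors, labels):
        ...

    @abstractmethod
    def predict(self, vector):
        ...


class ThirdPartyModel:
    """Stand-in for an external library with its own method names."""

    def fit_rows(self, rows):
        self._rows = list(rows)  # each row is (vector, label)

    def label_for(self, vector):
        # Toy rule: label of the stored row with the smallest squared distance.
        def distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], vector))
        return min(self._rows, key=distance)[1]


class ThirdPartyAdapter(Classifier):
    """Adapter: translates the common interface into the library's calls."""

    def __init__(self, model):
        self._model = model

    def train(self, vectors, labels):
        self._model.fit_rows(zip(vectors, labels))

    def predict(self, vector):
        return self._model.label_for(vector)
```

With this shape, adding another machine learning library means writing one more adapter; callers that depend only on `Classifier` stay unchanged.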

slide-22
SLIDE 22

Feature Extraction Decorator

/31 22
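
A Decorator applied to feature extraction might look like the sketch below. The slide's diagram is not reproduced in the text, so the classes here (`WordCounts`, `Lowercased`, `Normalized`) are hypothetical and only show the general idea: each wrapper exposes the same `extract` method and adds one processing step around the extractor it wraps.

```python
from collections import Counter


class WordCounts:
    """Base extractor: raw word counts."""

    def extract(self, text):
        return Counter(text.split())


class Lowercased:
    """Decorator: lowercases the text before delegating."""

    def __init__(self, inner):
        self._inner = inner

    def extract(self, text):
        return self._inner.extract(text.lower())


class Normalized:
    """Decorator: rescales counts to relative frequencies after delegating."""

    def __init__(self, inner):
        self._inner = inner

    def extract(self, text):
        counts = self._inner.extract(text)
        total = sum(counts.values())
        if not total:
            return {}
        return {key: value / total for key, value in counts.items()}


extractor = Normalized(Lowercased(WordCounts()))
print(extractor.extract("The the cat cat"))
```

Steps can be stacked in any order or left out without touching the base extractor.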

slide-23
SLIDE 23

Annotated Demo

/31 23

TEST YOUR OWN DOCUMENT

slide-29
SLIDE 29

Impact

/31 29

  • Enhance understanding of privacy vulnerabilities in a surveillance world
  • Education is enough to combat naïve attacks [1]
  • Evolving JStylo into the next stage of its lifecycle

[1] https://www.princeton.edu/~aylinc/papers/Aylin_PETS12_anonymouth.pdf
slide-30
SLIDE 30

Software Evolution

/31 30

  • Increased Ease of Extension
  • Decoupling / Modularization
  • Industry Standard Design Pattern
  • New Machine Learning libraries
  • Better dependency management / updatability
  • New methods of processing
  • Web front end
  • Cluster-computing
  • JSON API
slide-31
SLIDE 31

Software Evolution Cont.

/31 31

  • Future Work
  • More machine learning libraries
  • Feature Extraction overhaul
  • Verification – solving new problems
slide-32
SLIDE 32

For sponsoring Worden’s server

A Special Thanks To Our Sponsor

slide-33
SLIDE 33

Questions?