Your words betray you! The role of language in cyber crime

slide-1
SLIDE 1

Awais Rashid

Your words betray you! The role of language in cyber crime investigations

slide-2
SLIDE 2

Physical World Online World

Digital World

slide-3
SLIDE 3

dual use

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
  • 1.6% of searches and 2.4% of responses on the Gnutella network alone (study by Hughes et al., 2006)
  • Hundreds or thousands of searches per second
    – Approx. 600,000 searches per day on Gnutella alone
  • Specialist vocabulary: 53% of searches and 88% of responses used such keywords.
    – Vocabulary changes over time.

P2P Study

slide-8
SLIDE 8

SEARCH FREQUENCY

Topic Popularity

Top 100 Frequent Searches

slide-9
SLIDE 9

Core of Distributors

slide-10
SLIDE 10

Chat and Social Networking

slide-11
SLIDE 11
slide-12
SLIDE 12

Digital Personas

slide-13
SLIDE 13
slide-14
SLIDE 14

Do you Know Who you are Talking to?

? ?

18.3%

slide-15
SLIDE 15
slide-16
SLIDE 16

Isis: Protecting Children in Online Social Networks (EPSRC/ESRC)

iCOP: Identifying and Catching Originators in P2P Networks (EC Safer Internet Programme)

Experience from

slide-17
SLIDE 17

Detecting Deceptive Digital Personas

slide-18
SLIDE 18
slide-19
SLIDE 19

Stylistic Language “Fingerprint”

[Figure labels: Individual_1, Individual_2, Individual_3, Individual_4, each shown with “New text”]
slide-20
SLIDE 20

Age and Gender Analysis

[Diagram labels: Stylistic Features (word level, syntactic level, semantic level); Reference Data Sets (Male, Female); Distance Measure; Classifier]

slide-21
SLIDE 21
slide-22
SLIDE 22

No Deception – Age (Precision)

[Chart: Precision (%) against Threshold (%), with series Level 1 to Level 5. Annotated values: 72.24%, 77.35%]

slide-23
SLIDE 23

No Deception – Age (Recall)

[Chart: Recall (%) against Threshold (%), with series Level 1 to Level 5]

slide-24
SLIDE 24

No Deception – Gender

[Chart: Recall and Precision (%) against Threshold (%). Annotated values: 66.86%, 71.07%]

slide-25
SLIDE 25

Deception Detection

? ?

18.3%

slide-26
SLIDE 26

Deception Detection

? ?

84.29%

slide-27
SLIDE 27

Deception Detection

? ?

93.18%

slide-28
SLIDE 28
  • Being used by law enforcement following trials and commercialisation via a spin-out company (Relative Insight)
  • UK case study for Internet Governance Forum in 2009, 2010
  • Featured in international TV and print news media
  • Part of evidence to UK Select Committee on Child Protection and EU Policy frameworks.
  • Chosen as one of the 100 Big Ideas for the future by Universities UK and Research Councils UK (2011)
  • Mobile App built on the digital persona analysis demonstrated to the Prime Minister at WeProtect, Dec. 2014
  • An Impact Case Study for REF2014

  • An Impact Case Study for REF2014

Is it being used?

slide-29
SLIDE 29

Detecting Specialist Tactics, e.g., Vocabulary

slide-30
SLIDE 30

§ Using query analysis to automatically triage and identify potential candidates for new CSA (child sexual abuse) media

§ New text analysis techniques to automatically flag potential CSA media based on their filename

  • (Semi-)automatic video and image analysis techniques to assess CSA content

Detecting new/unknown CSA media in P2P Networks

slide-31
SLIDE 31

§ Compiling a CSA dataset
§ Filenames = short text samples
§ Presence of non-standard forms & “specialised” vocabulary

Filename Classification

Key challenges

slide-32
SLIDE 32

§ Manual collection through LE (law enforcement) → 268 CSA filenames
§ Legal pornography sites → 10K non-CSA filenames
§ Simulate real-life data distribution in P2P

Filename Classification (2)

Dataset

slide-33
SLIDE 33

§ Semantic features

  • Known CSA keywords
  • Explicit language use
  • References to children, young age
  • Family relations

Filename Classification (3)

Feature Selection

Original filename

ptl0lita12yo.jpeg

Semantic Feats.

[paedo_keyword] [child_ref]

slide-34
SLIDE 34

§ Character n-grams

  • slices of 2, 3 and 4 consecutive characters

Filename Classification (4)

Feature Selection

Original filename

ptl0lita12yo.jpeg

  • Char. 2-grams

pt tl l0 0l li it ta a1 12 2y yo

  • Char. 3-grams

ptl tl0 l0l 0li lit ita ta1 a12 12y 2yo

  • Char. 4-grams

ptl0 tl0l l0li 0lit lita ita1 ta12 a12y 12yo
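
The slicing illustrated on this slide can be sketched in a few lines of Python. This is only an illustration on a neutral, made-up filename. It assumes the extension is dropped before slicing (as the slide's example appears to do) and that n-grams are kept in order with duplicates. The function names are hypothetical and are not taken from the iCOP toolkit.

```python
from pathlib import PurePosixPath


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def filename_ngrams(filename: str, sizes=(2, 3, 4)) -> dict[int, list[str]]:
    """Character n-grams of a filename's stem, one list per n-gram size.

    Assumption: the extension is removed first, so "report14.pdf"
    is sliced as "report14".
    """
    stem = PurePosixPath(filename).stem
    return {n: char_ngrams(stem, n) for n in sizes}


# Neutral, made-up filename used purely for illustration.
grams = filename_ngrams("report14.pdf")
for size, values in grams.items():
    print(size, " ".join(values))
```

A stem of length m yields m - n + 1 n-grams of size n, so short filenames still produce a usable number of features.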

slide-35
SLIDE 35

§ Support Vector Machines (LibShortText)
§ 5-fold cross-validation
§ Evaluation:

  • Overall system accuracy
  • Precision, Recall and F-score per class label

Filename Classification (5)

Experimental Setup
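
As a rough illustration of the 5-fold cross-validation step in this setup, the sketch below splits a labelled dataset into five folds and holds each one out in turn. It uses only the Python standard library and does not use LibShortText. Stratifying by class is an assumption made here because the dataset is heavily imbalanced; the slides do not say how the folds were drawn. The labels and counts are toy values.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split item indices into k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so that, within a class,
        # fold sizes differ by at most one.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out each fold once."""
    for held_out, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# Hypothetical toy labels: 10 minority-class items and 90 majority-class items.
labels = ["CSA"] * 10 + ["non-CSA"] * 90
folds = stratified_folds(labels, k=5)
for train, test in train_test_splits(folds):
    print(len(train), len(test))
```

Each item appears in exactly one test fold, so every filename is scored once by a model that never saw it during training.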

slide-36
SLIDE 36

Filename Classification (6)

Results

Scores of the SVM classifier (%)

Features         Class     Precision  Recall  F-score
Semantic feats.  CSA             5.7    21.3      9.0
Semantic feats.  Non-CSA        97.7    90.6     94.0
Char. n-grams    CSA            89.8    62.3     73.6
Char. n-grams    Non-CSA        99.0    99.8     99.4
Combined         CSA            89.9    66.1     76.1
Combined         Non-CSA        99.1    99.8     99.5
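
For readers who want to relate the three reported metrics, the sketch below computes per-class precision, recall and F-score from confusion counts. It assumes that "F-score" means the balanced F1 (the harmonic mean of precision and recall), which the slides do not state explicitly. The confusion counts are hypothetical and are not taken from the study.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts for illustration only.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f, 3))

# Consistency check on the reported character n-gram row for the CSA class:
# precision 89.8 and recall 62.3 give an F1 of about 73.6.
print(round(f1_from(89.8, 62.3), 1))
```

Because the reported precision and recall are themselves rounded, recomputing F1 from them can differ from a reported F-score in the last digit.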

slide-37
SLIDE 37
slide-38
SLIDE 38

The iCOP Toolkit

slide-39
SLIDE 39
  • Training days for European Law Enforcement personnel
    – Participants from 8 European countries and Interpol
    – Hands-on sessions on live P2P data
  • Live demonstration at Interpol at end of project
  • Being utilised by several law enforcement agencies in Europe

Is it being used?

slide-40
SLIDE 40
slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43
slide-44
SLIDE 44

Isis

  • A. Rashid, A. Baron, P. Rayson, C. May-Chahal, P. Greenwood, J. Walkerdine (2013). “Who Am I? Analysing Digital Personas in Cyber Crime Investigations”, IEEE Computer, 46(4).

  • C. May-Chahal, C. Mason, A. Rashid, P. Greenwood, J. Walkerdine, P. Rayson (2014). “Safeguarding Cyborg Childhoods: Incorporating the On/Offline Behaviour of Children into Everyday Social Work Practices”, British Journal of Social Work.

Further Information

slide-45
SLIDE 45

iCOP

  • C. Peersman, C. Schulze, A. Rashid, M. Brennan, C. Fischer (2014). “iCOP: Automatically Identifying New Child Abuse Media in P2P Networks”, IEEE Symposium on Security and Privacy Workshops 2014, pp. 124-131.

  • C. Peersman, C. Schulze, A. Rashid, M. Brennan, C. Fischer (2016). “iCOP: live forensics to reveal previously unknown criminal media on P2P networks”, Digital Investigation, 18, pp. 50-64.

Further Information

slide-46
SLIDE 46

General

  • M. Edwards, A. Rashid, P. Rayson (2015). “A Systematic Survey of Online Data Mining Technology Intended for Law Enforcement”, ACM Computing Surveys, 48(1).

  • A. Rashid, J. Weckert, R. Lucas (2009). “Software Engineering Ethics in a Digital World”, IEEE Computer, 42(6), pp. 34-41.

  • A. Rashid, K. Moore, C. May-Chahal, R. Chitchyan (2015). “Managing emergent ethical concerns for software engineering in society”, Proc. ICSE 2015, Software Engineering in Society, pp. 523-526. IEEE.

Further Information