Discovery and retrieval of terrorism-related content on the Web and social media. Dr. Theodora Tsikrika, Multimedia Knowledge and Social Media Analytics Lab, Information Technologies Institute, Centre for Research and Technology Hellas (CERTH)


SLIDE 1

Date: Venue:

Discovery and retrieval of terrorism-related content on the Web and social media

  • Dr. Theodora Tsikrika

Multimedia Knowledge and Social Media Analytics Lab Information Technologies Institute Centre for Research and Technology Hellas (CERTH)

SLIDE 2

Research & Innovation Activities

  • HOMER: HomeMade Explosives (HMEs) and Recipes characterisation

– FP7 project (Nov 2013 - Dec 2016)
– https://www.homer-project.eu/

  • TENSOR: reTriEval and aNalysis of heterogeneouS online content for terrOrist activity Recognition

– H2020 IA (Sep 2016 - Aug 2019)
– http://tensor-project.eu/

SLIDE 3

The Web

SLIDE 4

Surface vs. Deep vs. Dark Web

  • Surface Web:

– Readily accessible content
– Indexed by search engines

  • Deep Web:

– Further user actions needed in order to access the content
– Special techniques needed for crawling/indexing
– Much larger than the Surface Web

  • Dark Web:

– Special software needed in order to access the content
– Provides users with anonymity
– Includes several darknets (e.g., TOR, I2P, Freenet)
– Usage: illegal marketplaces, whistleblowing, Bitcoin transactions, etc.
– User base: from journalists and law enforcement agencies (LEAs) to criminals

SLIDE 5

Motivation

  • Challenges for Law Enforcement Agencies (LEAs):

– Extensive use of the Surface Web & Dark Web for communication and diffusion of terrorism-related information:
    • Propaganda and radicalization
    • Tutorials on the construction of explosives and weapons
– Need for effective and efficient domain-specific discovery tools

  • Barriers:

– Surface Web discovery tools: effective for general search, more limited for domain-specific search
– Dark Web discovery tools: limited for both general & domain-specific search

SLIDE 6

Domain-specific discovery methods

  • 1. Focused crawling

– Domain-specific document collection
– Automatically traversing the link structure of the Web
– Selecting links to follow based on their relevance to the domain

  • 2. Search engine querying

– Automatically querying search engines/social media using their APIs
– (Semi-)automatic domain-specific query generation & expansion

  • 3. Hybrid approach

– (1) + (2) + post-retrieval classification
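
The focused-crawling idea in (1) can be sketched as a breadth-first traversal that only follows links scoring above a relevance threshold. This is a minimal illustration over a toy in-memory link graph, not the crawler used in the talk: `focused_crawl`, `get_links`, `score_link`, `TOY_WEB` and the default threshold are all hypothetical names and values introduced here.

```python
from collections import deque


def focused_crawl(seeds, get_links, score_link, threshold=0.5, max_depth=2):
    """Breadth-first crawl that only follows links scored as relevant."""
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            if link in visited:
                continue
            # score_link stands in for a trained link-selection classifier.
            if score_link(url, link) >= threshold:
                visited.add(link)
                queue.append((link, depth + 1))
    return collected


# Toy "Web": page -> outgoing links (all page names are made up).
TOY_WEB = {
    "seed": ["on-topic-1", "off-topic-1"],
    "on-topic-1": ["on-topic-2", "off-topic-2"],
    "on-topic-2": ["on-topic-3"],
}


def toy_links(url):
    return TOY_WEB.get(url, [])


def toy_score(source, target):
    # Placeholder relevance score derived from the page name only.
    return 0.9 if target.startswith("on-topic") else 0.1


crawled = focused_crawl(["seed"], toy_links, toy_score)
```

In a real system `get_links` would fetch and parse pages, and `score_link` would be the classifier's confidence for the link.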

SLIDE 7

Crawling

SLIDE 8

Focused crawling

SLIDE 9

Focused crawling

  • Classifier-guided link selection, based on:

– Anchor text
– URL terms
– Text window (x = 100 characters) surrounding the anchor text
– Web page text
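
As a rough illustration of the four evidence sources listed above, the sketch below gathers them for a single outgoing link. It assumes the window means up to `window` characters on each side of the anchor text, which the slide does not specify; `link_features` and the example page are hypothetical.

```python
import re
from urllib.parse import urlparse


def link_features(page_text, anchor_text, url, window=100):
    """Collect the textual evidence available for one outgoing link."""
    # URL terms: split host and path on non-alphanumeric characters.
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    url_terms = [t for t in re.split(r"[^A-Za-z0-9]+", raw) if t]

    # Text window: up to `window` characters on each side of the anchor text
    # (assumption: the slide only says "x = 100 characters").
    pos = page_text.find(anchor_text)
    if pos == -1:
        context = ""
    else:
        start = max(0, pos - window)
        end = min(len(page_text), pos + len(anchor_text) + window)
        context = page_text[start:end]

    return {
        "anchor_text": anchor_text,
        "url_terms": url_terms,
        "text_window": context,
        "page_text": page_text,
    }


features = link_features(
    page_text="Intro paragraph. Read the full report here for details. Footer.",
    anchor_text="full report",
    url="http://example.org/reports/2016-summary.html",
    window=10,
)
```

Each field could then be vectorised (for example as a bag of words) before being passed to the link-selection classifier.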

SLIDE 10

Focused crawling (+ Dark Web)

SLIDE 11

Experiments

  • Seed set: 5 pages (1 Surface Web, 1 TOR, 2 I2P, 1 Freenet)
  • Seed set obtained from: LEA representatives + domain experts
  • Crawling depth = 2
  • Link selection classifier / Web page classifier

– Training set: 400 samples (105 pos, 295 neg) / 600 samples (250 pos, 350 neg)
– SVM classifier with RBF kernel

Results per classification threshold:

Threshold                                      0.5    0.6    0.7    0.8    0.9

Link-based classifier
  Precision                                    0.63   0.63   0.77   0.77   0.97
  Recall                                       1.00   0.91   0.87   0.84   0.42
  F-measure                                    0.77   0.74   0.82   0.80   0.58

Link-based classifier + Web page classifier
  Precision                                    0.86   0.87   0.87   0.87   0.94
  Recall                                       1.00   0.99   0.96   0.92   0.47
  F-measure                                    0.93   0.92   0.91   0.90   0.62
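
For reference, the F-measure rows are consistent with the standard harmonic mean of precision and recall. The sketch below recomputes the link-based classifier row from the reported precision and recall; the function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the link-based classifier, as reported above.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
link_precision = [0.63, 0.63, 0.77, 0.77, 0.97]
link_recall = [1.00, 0.91, 0.87, 0.84, 0.42]

link_f = [round(f_measure(p, r), 2) for p, r in zip(link_precision, link_recall)]
```

Because the reported precision and recall are already rounded to two decimals, a recomputed value can differ from the table in the last digit.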

SLIDE 12

Search engine querying

  • Query generation & expansion

1. Exploit domain-specific knowledge for query generation
2. Apply machine learning/deep learning for query expansion

  • Query submission

1. Multiple queries automatically submitted
2. Search results merged (duplicate removal, re-ranking)
3. Post-retrieval classification (filtering step)
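
The merging step can be sketched as follows. The slide does not say which re-ranking method is used; this example substitutes reciprocal rank fusion as one common choice, with a deliberately crude URL normalisation for duplicate removal. `merge_results` and the example URLs are hypothetical.

```python
def merge_results(result_lists, k=60):
    """Fuse ranked URL lists; duplicates collapse into one scored entry."""
    scores = {}
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            key = url.rstrip("/")  # crude normalisation: ignore trailing slash
            if key in seen:
                continue  # count each URL once per result list
            seen.add(key)
            # Reciprocal rank fusion: URLs ranked high in several lists win.
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


results_query_1 = ["http://a.example/x", "http://b.example/y", "http://c.example/z"]
results_query_2 = ["http://b.example/y/", "http://d.example/w"]

merged = merge_results([results_query_1, results_query_2])
```

The post-retrieval classification step would then filter `merged` with a page classifier before presenting results.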

SLIDE 13

Query generation - patterns

Concepts and their keywords:

  • _explosive_: acetone peroxide, anfo, c-4, hmtd, lead azide, lead picrate, mercury fulminate, nitrocellulose, nitrogen triiodide, nitroglycerin, nitroglycol, potassium chlorate, petn, picric acid, rdx, r-salt, semtex, tatp, trinitrotoluene TNT, urea nitrate
  • _ingredient_: ammonium nitrate, potassium nitrate
  • _context_: anarchist, islam
  • _object_: bomb(s), explosive(s), ied, pyrotechnics, homemade bomb(s), homemade explosive(s), homemade ied, homemade pyrotechnics, improvised bomb(s), improvised explosive(s), improvised pyrotechnics
  • _action_: how to make, manufacture, making, preparation, synthesis
  • _recipe_: recipe(s), preparatory manual
  • _resource_: book, forum, handbook, pdf, torrent, video

SLIDE 14

Query generation - patterns

Patterns Equivalent _ingredient_ _explosive_ _explosive_ _object_ _explosive_ plastic homemade _explosive_ _object_ _object_ _recipe_ _recipe_ _object_ _action_ _explosive_ _explosive_ _action_ _action_ _explosive_ at home _action_ _explosive_ _object_ _explosive_ _object_ _action_ _action_ _object_ _explosive_ _action_ _explosive_ powder _action_ _object_ _object_ _action_ _action_ _action_ _explosive_

SLIDE 15

Query generation - patterns

Pattern: _object_ _recipe_

Candidate queries:

  • homemade bomb recipe
  • homemade explosive recipe
  • improvised bomb recipe
  • improvised explosive recipe
  • ied recipe
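
Expanding a pattern into candidate queries amounts to a Cartesian product over the keyword lists of its concept slots. The sketch below reproduces the example above using only the keywords shown on this slide; `expand_pattern` and the `CONCEPTS` dictionary layout are assumptions made for illustration.

```python
from itertools import product

# Keyword lists limited to the ones appearing in the example above.
CONCEPTS = {
    "_object_": [
        "homemade bomb",
        "homemade explosive",
        "improvised bomb",
        "improvised explosive",
        "ied",
    ],
    "_recipe_": ["recipe"],
}


def expand_pattern(pattern, concepts):
    """Replace every concept slot by each of its keywords, in all combinations."""
    # Tokens that are not concept names are kept as literal query words.
    slots = [concepts.get(token, [token]) for token in pattern.split()]
    return [" ".join(combo) for combo in product(*slots)]


queries = expand_pattern("_object_ _recipe_", CONCEPTS)
```

Literal words in a pattern pass through unchanged, so mixed patterns of slots and fixed words can be expanded the same way.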

SLIDE 16

Query generation - patterns

[Figure: precision per group of HME queries. One bar per keyword (ammonium nitrate, urea nitrate, trinitrotoluene TNT, tatp, semtex, r-salt, rdx, picric acid, petn, potassium chlorate, nitroglycol, nitroglycerin, nitrogen triiodide, nitrocellulose, mercury fulminate, lead picrate, lead azide, hmtd, c-4, black gunpowder, anfo, acetone peroxide) plus one bar for all queries. Axis: precision, from 0.0 to 1.0.]

Experimental evaluation

  • 414 queries
  • top 10 results retrieved
  • 1157 unique URLs
  • manually assessed

SLIDE 17

Query expansion

  • Machine learning techniques (decision trees) for generating candidate expansion terms:

problem OR bombs OR home OR time OR impact OR ^glass OR heating OR terms OR acid OR ^power OR ^rights OR ^time OR grams OR alcohol OR cap OR fuel OR reaction OR (explosive AND ^petn) OR (explosive AND ^world) OR (explosive AND acid)

  • Simplification:

heating OR grams OR fuel OR reaction OR (explosive AND acid)
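
One way to read the simplified expression is as a disjunction of clauses, each clause being a set of terms that must all occur. The sketch below encodes it that way and checks whether a text satisfies it. In the talk the expression is used for query expansion, so applying it as a local text matcher is purely an illustration; `SIMPLIFIED_QUERY` and `matches` are hypothetical names.

```python
import re

# Each set is one AND-clause; the query is the OR of all clauses.
SIMPLIFIED_QUERY = [
    {"heating"},
    {"grams"},
    {"fuel"},
    {"reaction"},
    {"explosive", "acid"},
]


def matches(query, text):
    """True if at least one clause has all of its terms present in the text."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(clause <= terms for clause in query)


single_term_hit = matches(SIMPLIFIED_QUERY, "The reaction was slow.")
partial_clause = matches(SIMPLIFIED_QUERY, "A report on acid rain.")
full_clause = matches(SIMPLIFIED_QUERY, "Explosive growth of acid jazz.")
```

The last example also shows why a post-retrieval classification step is still needed: keyword-level matching alone accepts clearly off-topic text.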

SLIDE 18

Hybrid discovery approach

SLIDE 19

Social media discovery framework

SLIDE 20

Key player identification

  • Aim: identify key players in terrorism-related social media networks
  • Goal: remove key players → destroy internal connectivity → community breaks up into small isolated networks
  • Motivation: social media networks exhibit scale-free topology

– power-law degree distribution
– robust to random attacks
– vulnerable to targeted attacks

  • Approach: targeted attacks based on centrality measures (existing + new)
  • Evaluation: social media network of terrorism-related Twitter posts
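
The contrast between targeted and random attacks can be illustrated on a toy network by removing a node and measuring the largest connected component that remains. The sketch below uses degree centrality only, as the simplest stand-in for the centrality measures mentioned above; the graph, the function names and the tie-breaking rule are all made up for this example.

```python
def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            node = stack.pop()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def degree_attack(adj, n_remove):
    """Pick the n_remove highest-degree nodes, recomputing degree after each pick."""
    removed = set()
    for _ in range(n_remove):
        remaining = [n for n in adj if n not in removed]
        target = max(
            remaining,
            key=lambda n: (sum(nb not in removed for nb in adj[n]), n),
        )
        removed.add(target)
    return removed


# Toy network: two connected hubs, each with four leaf accounts.
edges = [("h1", "h2")]
edges += [("h1", f"a{i}") for i in range(4)]
edges += [("h2", f"b{i}") for i in range(4)]
network = build_graph(edges)

size_before = largest_component_size(network)
size_after_targeted = largest_component_size(network, degree_attack(network, 1))
size_after_leaf = largest_component_size(network, {"a0"})
```

Removing one hub halves the largest component here, whereas removing a leaf, the likely outcome of a random pick in a hub-dominated network, barely changes it.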

SLIDE 21

Terrorism-related social media discovery

  • Social media network of Twitter accounts

– query Twitter API
– Arabic keywords provided by LEAs + domain experts
– keywords related to Caliphate state (ISIS)

  • Dataset:

– 38,766 posts by 5,461 users
– 100 posts manually assessed for relevance
– users linked through mentions
– largest connected component: 3,600 users / 9,203 links
– power-law exponent: 2.56 (p-value = 0.7780)
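
A minimal sketch of how "users linked through mentions" could be turned into network links, assuming each post is available as an (author, text) pair and that mentions follow the usual `@handle` form. The data layout, the regular expression and the example posts are assumptions, not the dataset's actual format.

```python
import re

MENTION = re.compile(r"@(\w+)")


def mention_links(posts):
    """Undirected user-user links: one link per author/mentioned-user pair."""
    links = set()
    for author, text in posts:
        author = author.lower()
        for mentioned in MENTION.findall(text):
            mentioned = mentioned.lower()  # handles are case-insensitive
            if mentioned != author:  # ignore self-mentions
                links.add(frozenset((author, mentioned)))
    return links


# Made-up posts for illustration only.
posts = [
    ("u1", "reply to @u2 and @U3"),
    ("u2", "thanks @u1"),
    ("u4", "no mentions here"),
]

links = mention_links(posts)
linked_users = set().union(*links)
```

Users without any mention link, such as `u4` here, stay outside every connected component of size greater than one.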

SLIDE 22

Results: largest connected component decay

Decrease in relative size of the largest connected component, by attack strategy:

  • Random attack: 5%
  • Closeness centrality: 27.1%
  • Rest of centrality measures: 44–49%
  • MEB: 50.1%

SLIDE 23

Results: key players

  • Top-10 key players identified by each of the 7 centrality measures

– 18 unique Twitter user accounts

  • 10 days after dataset construction:

– 14 out of 18 suspended
– 10 out of 14 suspensions took place within 72 hours of account creation

  • Further evidence of dataset relevance
  • High volatility

SLIDE 24

Conclusions

  • Domain-specific discovery tools

– Build your own search engine
– Exploit capabilities of already existing search systems
– Combine them in a hybrid approach
– Exploit social network structures

  • Challenges

– Multilingual and multimedia content
– From Surface to Dark Web
– Volatility (Dark Web, social media)
– Validating sources (misinformation, disinformation, etc.)
– Legal, ethical and privacy aspects

SLIDE 25

References

1. G. Kalpakis, T. Tsikrika, N. Cunningham, C. Iliou, S. Vrochidis, J. Middleton, I. Kompatsiaris, "OSINT and the Dark Web". In "Open Source Intelligence Investigation - From Strategy to Implementation", B. Akhgar, P. S. Bayerl, F. Sampson (Eds.), Springer, 2016

2. I. Gialampoukidis, G. Kalpakis, T. Tsikrika, S. Papadopoulos, S. Vrochidis, I. Kompatsiaris, "Detection of Terrorism-related Twitter Communities using Centrality Scores". In Proceedings of International Workshop on Multimedia Forensics and Security (MFSec 2017), Bucharest, Romania, June 06, 2017

3. T. Tsikrika, B. Akhgar, V. Katos, S. Vrochidis, P. Burnap, M. L. Williams, "1st International Workshop on Search and Mining Terrorist Online Content & Advances in Data Science for Cyber Security and Risk on the Web". In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM 2017), pp. 823-824, Cambridge, UK, February 2017

4. C. Iliou, G. Kalpakis, T. Tsikrika, S. Vrochidis, I. Kompatsiaris, "Hybrid Focused Crawling for Homemade Explosives Discovery on Surface and Dark Web". 11th International Conference on Availability, Reliability and Security (ARES 2016), Salzburg, Austria, Aug 2016

5. I. Gialampoukidis, G. Kalpakis, T. Tsikrika, S. Vrochidis, I. Kompatsiaris, "Key player identification in terrorism-related social media networks using centrality measures". In European Intelligence and Security Informatics Conference (EISIC 2016), pp. 112-115, IEEE, 2016

6. G. Kalpakis, T. Tsikrika, C. Iliou, T. Mironidis, S. Vrochidis, J. Middleton, U. Williamson, I. Kompatsiaris, "Interactive Discovery and Retrieval of Web Resources Containing Home Made Explosive Recipes". 4th International Conference on Human Aspects of Information Security, Privacy and Trust, Toronto, Canada, 17 - 22 July 2016

SLIDE 26

THANK YOU!

http://mklab.iti.gr theodora.tsikrika@iti.gr