PyData Bratislava
22nd May 2019
Data-driven Approaches for Detection of Antisocial Behavior
Veronika Žatková, Ivan Srba, Róbert Móro (FIIT STU)
WHO ARE WE?
Ivan and Róbert - Researchers @FIIT STU
Veronika - Master student @FIIT STU
Our topics of interest:
▪ Data science ▪ Machine learning ▪ Data mining ▪ Computational social science ▪ Social computing
Source: https://kinsta.com/blog/wordpress-social-media-plugins/
Source: https://www.wsj.com/articles/scholars-get-the-real-scoop-on-fake-news-1515360315
Source: https://www.poynter.org/fact-checking/2019/is-expert-crowdsourcing-the-solution-to-health-misinformation/
Source: https://www.edutopia.org/blog/how-respond-when-students-use-hate-speech-richard-curwin
How can data-driven approaches help to characterize, detect and mitigate such antisocial behavior?
WHAT ARE WE WORKING ON?
Two research projects:
https://rebelion.fiit.stuba.sk/
Cooperation:
Data science perspective
ANTISOCIAL BEHAVIOR
TASKS
Characterization
▪ what characterizes or distinguishes, e.g., fake news from true news, how does it spread, and who shares it?
Detection
▪ how can we automatically detect fake news, hate speech, etc.?
Mitigation
▪ how can we stop, e.g., the spread of fake news in a transparent, trustworthy, ethical way?
TECHNIQUES
▪ Machine learning ▪ Data mining ▪ Natural language processing ▪ Neural networks and deep learning
OPEN PROBLEMS
Exploiting content, user and context data
▪ Multisource approaches ▪ Multimodal approaches ▪ Multilingual approaches ▪ Extended context
Addressing unlabelled and dynamic data
▪ Unsupervised, semi-supervised and ensemble models (e.g. multiview learning) ▪ Active learning
Investigating new mitigation approaches
▪ Early warning system ▪ On-site warning system ▪ Education and training
No suitable content-rich and benchmark datasets
No suitable applications and platforms to deploy solutions
Platform for monitoring antisocial behavior
IMPLEMENTATION
Primary implementation language: Python
DevOps
▪ Docker ▪ Travis CI
CENTRAL DATA STORAGE
Mediates data transfer between all platform modules
Three layers
▪ Evidence layer ▪ Inference and prediction layer ▪ Platform management layer
Implementation
▪ Flask ▪ PostgreSQL ▪ REST APIs + Apistrap + Schematics ▪ Swagger
http://flask.pocoo.org/ https://github.com/Cognexa/apistrap https://schematics.readthedocs.io/en/latest https://swagger.io/
WEB MONITORING
Crawls and parses data from various data sources by means of data providers
Data sources
▪ News sites ▪ Fact-checking sites ▪ Social networks ▪ Existing datasets
Event-based architecture
Supports scheduling
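The combination of an event-based architecture with scheduling can be illustrated with a small standard-library sketch. This is only an assumption about the general pattern, not the platform's code: the deck lists Celery with RabbitMQ for this purpose, whereas the sketch swaps in Python's `sched` module and a hypothetical in-process `EventBus`. The event name `article_crawled`, the `crawl` function, and the `FakeClock` (used so the example finishes instantly) are all invented for illustration.

```python
import sched


class EventBus:
    """Hypothetical in-process publish/subscribe bus."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers.get(event_type, []):
            handler(payload)


class FakeClock:
    """Deterministic clock: 'sleeping' just advances the time counter."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


bus = EventBus()
clock = FakeClock()
stored = []  # stands in for writing to the central data storage

# A subscriber reacts to every crawled article by "storing" it.
bus.subscribe("article_crawled", stored.append)


def crawl(source):
    """Hypothetical data provider run: emits one event per crawl."""
    bus.publish("article_crawled", {"source": source, "at": clock.now})


# Schedule two crawls at different delays (in simulated seconds).
scheduler = sched.scheduler(clock.time, clock.sleep)
scheduler.enter(60, 1, crawl, argument=("news-site",))
scheduler.enter(30, 1, crawl, argument=("fact-checking-site",))
scheduler.run()
```

After `scheduler.run()` returns, `stored` holds the two events in the order they were due, with the fact-checking crawl first because it was scheduled with the shorter delay.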
Data providers
▪ Site-specific crawlers and parsers ▪ RSS feeds ▪ News site generic crawler and parser ▪ News API
Chaining of data providers
RSS feed → Site-specific parser
https://newsapi.org/
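Chaining an RSS feed provider with a site-specific parser can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the Monant implementation (which, according to the deck, builds on libraries such as Feedparser and Scrapy). The class names `RssFeedProvider` and `SiteSpecificParser`, the `chain` helper, and the sample feed and pages are hypothetical; fetching is replaced by an in-memory dictionary so the example runs offline.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample data standing in for real network responses.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Article A</title><link>https://example.org/a</link></item>
  <item><title>Article B</title><link>https://example.org/b</link></item>
</channel></rss>"""

SAMPLE_PAGES = {
    "https://example.org/a": "<html><body><h1>Article A</h1><p>First body.</p></body></html>",
    "https://example.org/b": "<html><body><h1>Article B</h1><p>Second body.</p></body></html>",
}


class RssFeedProvider:
    """First provider in the chain: yields article URLs found in an RSS feed."""

    def __init__(self, feed_xml):
        self.feed_xml = feed_xml

    def provide(self):
        root = ET.fromstring(self.feed_xml)
        for item in root.iter("item"):
            link = item.findtext("link")
            if link:
                yield link.strip()


class _TitleAndBodyExtractor(HTMLParser):
    """Collects the text inside <h1> and <p> elements."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title += data
        elif self._current == "p":
            self.paragraphs.append(data)


class SiteSpecificParser:
    """Second provider: turns each URL from the previous provider into an article."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable mapping a URL to its HTML (assumed)

    def provide(self, urls):
        for url in urls:
            extractor = _TitleAndBodyExtractor()
            extractor.feed(self.fetch(url))
            yield {
                "url": url,
                "title": extractor.title,
                "body": " ".join(extractor.paragraphs),
            }


def chain(feed_provider, parser):
    """Feeds the output of the first provider into the second."""
    return list(parser.provide(feed_provider.provide()))


articles = chain(RssFeedProvider(SAMPLE_FEED), SiteSpecificParser(SAMPLE_PAGES.__getitem__))
```

The point of the design is that each provider only knows its own input and output, so a generic feed reader can be combined with whichever parser fits the target site.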
Implementation
▪ Scrapy library ▪ Beautiful Soup library ▪ Newspaper library ▪ Feedparser library ▪ Celery + RabbitMQ + Flower
https://scrapy.org/ https://www.crummy.com/software/BeautifulSoup/ https://github.com/codelucas/newspaper/tree/master/newspape https://github.com/kurtmckee/feedparser
PLATFORM MANAGEMENT
Manages the data flows between all platform modules
Web monitoring management
▪ Monitors (e.g. “Monitoring of health misinformation in Europe”)
Data storage management
▪ Access control to central data storage
Implementation
▪ Django ▪ Flask-JWT (not implemented yet)
https://www.djangoproject.com/ https://flask-jwt-extended.readthedocs.io/en/latest/
AI CORE
Makes it easy to extend the platform with a wide variety of data-driven methods
User and domain modeling methods
▪ Derive and maintain user and content characteristics ▪ Sources and their trust, authors’ credibility, ...
Prediction methods
▪ Characterize and detect antisocial behavior
Implementation
▪ Independent of the platform ▪ Central storage allows easy data exchange between methods
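One way to picture methods that stay independent of the platform while exchanging data through the central storage is sketched below. It is an assumption-laden toy, not the platform's design: `InMemoryStorage` stands in for the central data storage, and `source_trust_method`, `prediction_method`, the `flagged` field, and the 0.5 threshold are all hypothetical. The only thing the two methods share is the storage interface.

```python
from collections import defaultdict


class InMemoryStorage:
    """Hypothetical in-memory stand-in for the central data storage."""

    def __init__(self):
        self._data = defaultdict(dict)

    def put(self, kind, key, value):
        self._data[kind][key] = value

    def get(self, kind, key, default=None):
        return self._data[kind].get(key, default)

    def items(self, kind):
        return list(self._data[kind].items())


def source_trust_method(storage):
    """Modeling method: derives a toy trust score per source.

    Trust is the fraction of a source's articles that are not flagged.
    """
    totals, flagged = defaultdict(int), defaultdict(int)
    for _, article in storage.items("article"):
        totals[article["source"]] += 1
        flagged[article["source"]] += int(article["flagged"])
    for source, total in totals.items():
        storage.put("source_trust", source, 1 - flagged[source] / total)


def prediction_method(storage, threshold=0.5):
    """Prediction method: marks articles from low-trust sources as suspicious."""
    for article_id, article in storage.items("article"):
        trust = storage.get("source_trust", article["source"], default=1.0)
        storage.put("prediction", article_id, trust < threshold)


# Toy data; the sources and flags are invented for illustration.
storage = InMemoryStorage()
storage.put("article", "a1", {"source": "site-x", "flagged": True})
storage.put("article", "a2", {"source": "site-x", "flagged": True})
storage.put("article", "a3", {"source": "site-x", "flagged": False})
storage.put("article", "a4", {"source": "site-y", "flagged": False})

# Methods never call each other; they only read from and write to storage.
for method in (source_trust_method, prediction_method):
    method(storage)
```

Because the prediction method reads trust scores from storage rather than from the modeling method directly, either one could be replaced without touching the other.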
END-USER SERVICES
Serve as an interface for experts (e.g., journalists) and the general public
Examples
▪ Real-time monitoring and visualization tool ▪ URL and user history verifier ▪ Education and training tool
The first prototype of Monant was developed by a team of our students
Source: https://patientengagementhit.com/news/patient-access-to-preventive-care-key-for-cancer-care-equity
Source: https://www.cancer.news/2019-04-24-green-coffee-blueberries-tomatoes-strawberries-have-chlorogenic-acid.html
NATURAL NEWS NETWORK
CASE STUDY - HEALTHCARE MISINFORMATION
Task: To characterize the number of misinformative articles containing false claims related to cancer treatment
Data providers
▪ Custom crawlers and parsers of Natural News network ▪ Additional data providers to be used
▪ badatel.net ▪ RSS parser ▪ Newspaper crawler and parser ▪ News API
CASE STUDY - HEALTHCARE MISINFORMATION
Articles: 40,198 news articles from 23 sites
Claims: 139 cancer "treatments"
Source of claims: https://docs.google.com/spreadsheets/d/1EyhHFv2WswRNrFZ-O6SjF5m_9EhnV6zCZ0RdSX5TtFM/edit#gid=0
Mapping: 6,222 news articles (15.5%) contain at least one cancer “treatment” claim
▪ The average number of claims per article is 1.93 ▪ The maximum number of claims in a single article is 9
▪ The most frequent claims:
▪ Antioxidants (2,459 articles) ▪ Herbalism (1,715 articles) ▪ Poly-MVA (Lipoic Acid Mineral Complex, 723 articles) ▪ Superfood (609 articles)
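The kind of claim-to-article mapping behind these figures can be sketched as follows. The slides do not say how the mapping was performed, so this example assumes simple case-insensitive keyword matching, and it assumes the average is taken over articles with at least one claim. The claim list is a toy subset and the article texts are invented; `map_claims` is a hypothetical helper.

```python
import re
from statistics import mean

# Toy subset of claim keywords and invented article texts.
claims = ["antioxidants", "herbalism", "superfood"]
articles = {
    "a1": "Antioxidants and superfood diets are promoted as cancer cures.",
    "a2": "A report on hospital funding.",
    "a3": "Herbalism is presented as an alternative to chemotherapy.",
    "a4": "Local sports results.",
}


def map_claims(articles, claims):
    """Returns, per article, the list of claim keywords found in its text."""
    patterns = {
        claim: re.compile(r"\b" + re.escape(claim) + r"\b", re.IGNORECASE)
        for claim in claims
    }
    return {
        article_id: [c for c, pattern in patterns.items() if pattern.search(text)]
        for article_id, text in articles.items()
    }


mapping = map_claims(articles, claims)

# Keep only the articles in which at least one claim was found.
matched = {aid: found for aid, found in mapping.items() if found}

share = len(matched) / len(articles)                         # share of articles with a claim
avg_claims = mean(len(found) for found in matched.values())  # average over matched articles
max_claims = max(len(found) for found in matched.values())   # largest number in one article
```

On the toy data, two of four articles match, which corresponds to the same three statistics reported on the slide: the share of matching articles, the average number of claims per matching article, and the maximum.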
CONCLUSIONS
Monant addresses a lack of datasets and suitable
More interesting problems (e.g., automatic detection) lie ahead of us. We have some first results in fake news detection that we plan to deploy to the platform.
Interested in more info?
https://rebelion.fiit.stuba.sk/