CX4242: Data & Visual Analytics
Mahdi Roozbahani
Lecturer, Computational Science and Engineering, Georgia Tech
Assignments Overview
(Tentative and subject to change)
CX 4242
Assignment 1
Platforms, Languages & Technologies
Python, Gephi, SQLite, D3, OpenRefine
Questions
Q1: Collecting and visualizing data (Python & Gephi)
Q2: Analysing data using SQLite
Q3: D3 Warmup
Q4: Analysing data through OpenRefine
Assignment 2
Platforms, Languages & Technologies
D3, Tableau
Questions
Q1: Designing a good table and visualizing data with Tableau
Q2: Force directed graph using D3
Q3: Scatter plots using D3
Q4: Heatmap using D3
Q5: Interactive visualization using D3
Q6: Choropleth map using D3
Q7: Pros and cons of various visualization tools
Assignment 3
Platforms, Languages & Technologies
Java, Hadoop, Spark, Pig, Azure
Questions
Q1: Analyzing a graph with Hadoop/Java
Q2: Analyzing a graph with Spark/Scala on Databricks
Q3: Analyzing data with Pig on AWS
Q4: Analyzing a graph using Hadoop on Microsoft Azure
Q5: Regression using Azure ML Studio
Assignment 4
Platforms, Languages & Technologies
PyPy, PageRank, Random Forest, Scikit-Learn
Questions
Q1: Scalable single-machine PageRank
Q2: Implementing a random forest classifier
Q3: Using Scikit-Learn for running various classifiers
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Building blocks. Not Rigid “Steps”.
Can skip some
Can go back (two-way street)
- Data types inform visualization design
- Data size informs choice of algorithms
- Visualization motivates more data cleaning
- Visualization challenges algorithm assumptions (e.g., user finds that results don’t make sense)
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
How does “big data” affect the process?
(Hint: almost everything is harder!)
The Vs of big data (3Vs originally, then 7, now 42)
Volume: “billions”, “petabytes” are common
Velocity: think Twitter, fraud detection, etc.
Variety: text (webpages), video (YouTube)…
Veracity: uncertainty of data
Variability
Visualization
Value
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://dataconomy.com/seven-vs-big-data/
https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
Three Example Projects
from Polo and Mahdi's research group
Apolo Graph Exploration: Machine Learning + Visualization
Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.
Beautiful, Hairball, Death Star, Spaghetti
Finding More Relevant Nodes
Apolo uses guilt-by-association (Belief Propagation)
HCI paper, Data Mining paper
Citation network
Demo: Mapping the Sensemaking Literature
Nodes: 80k papers from Google Scholar (node size: number of citations)
Edges: 150k citations
Key Ideas (Recap)
Specify exemplars
Find other relevant nodes (BP)
What did Apolo go through?
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Scrape Google Scholar. No API. 😪
Design inference algorithm (Which nodes to show next?)
Paper, talks, lectures
Interactive visualization you just saw
You will use a new Apolo prototype (called Argo)
Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. ACM Conference on Human Factors in Computing Systems (CHI) 2011. May 7-12, 2011.
NetProbe: Fraud Detection in Online Auction
NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007
Find bad sellers (fraudsters) on eBay who don’t deliver their items
NetProbe: The Problem
Buyer → $$$ → Seller
Non-delivery fraud is a common auction fraud
source: https://www.fbi.gov/contact-us/field-offices/portland/news/press-releases/fbi-tech-tuesday---building-a-digital-defense-against-auction-fraud
NetProbe: Key Ideas
Fraudsters fabricate their reputation by “trading” with their accomplices
Fake transactions form near-bipartite cores
How to detect them?
NetProbe: Key Ideas
Use Belief Propagation
F: Fraudster, A: Accomplice, H: Honest
Darker means more likely
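The belief propagation idea above can be sketched in a few lines of Python. This is a minimal loopy BP sketch over the three states F, A, H, not NetProbe's actual implementation: the edge potentials in `PSI`, the prior values, the node names, and the function name `belief_propagation` are all illustrative assumptions chosen so that fraudsters mostly link to accomplices, and accomplices link to both fraudsters and honest users.

```python
from math import prod

STATES = ("fraudster", "accomplice", "honest")

# Illustrative edge potentials (an assumption, not NetProbe's published values).
# PSI[a][b] scores how compatible neighboring states a and b are.
PSI = [
    [0.05, 0.90, 0.05],  # fraudster  next to F / A / H
    [0.90, 0.10, 0.50],  # accomplice next to F / A / H
    [0.05, 0.50, 0.50],  # honest     next to F / A / H
]


def normalize(values):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def belief_propagation(edges, priors, n_iters=20):
    """Run loopy belief propagation on an undirected graph.

    edges:  iterable of (u, v) pairs.
    priors: dict mapping a node to [P(F), P(A), P(H)]; other nodes are uniform.
    Returns a dict mapping each node to its normalized belief over STATES.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    k = len(STATES)
    uniform = [1.0 / k] * k
    phi = {node: priors.get(node, uniform) for node in neighbors}

    # One message per directed edge, initialized to uniform.
    msgs = {(u, v): uniform[:] for u in neighbors for v in neighbors[u]}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in range(k):
                total = 0.0
                for xi in range(k):
                    # Product of messages into i from every neighbor except j.
                    incoming = prod(
                        msgs[(n, i)][xi] for n in neighbors[i] if n != j
                    )
                    total += phi[i][xi] * PSI[xi][xj] * incoming
                out.append(total)
            new_msgs[(i, j)] = normalize(out)
        msgs = new_msgs

    beliefs = {}
    for i in neighbors:
        raw = [
            phi[i][x] * prod(msgs[(n, i)][x] for n in neighbors[i])
            for x in range(k)
        ]
        beliefs[i] = normalize(raw)
    return beliefs


if __name__ == "__main__":
    # Hypothetical toy network: f1, f2 trade with a1, a2 (a near-bipartite
    # core), while a1, a2 also trade with ordinary users h1, h2, h3.
    toy_edges = [
        ("f1", "a1"), ("f1", "a2"), ("f2", "a1"), ("f2", "a2"),
        ("a1", "h1"), ("a2", "h2"), ("h1", "h2"), ("h1", "h3"), ("h2", "h3"),
    ]
    toy_priors = {"f1": [0.9, 0.05, 0.05]}  # f1 is a suspected fraudster
    for node, belief in sorted(belief_propagation(toy_edges, toy_priors).items()):
        print(node, [round(b, 3) for b in belief])
```

Each node's belief plays the role of the shading on the slide: the larger the value for a state, the more likely the node is in that state under the assumed potentials.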
NetProbe: Main Results
“Belgian Police”
What did NetProbe go through?
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Scraping (built a “scraper”/“crawler”)
Design detection algorithm
Not released
Paper, talks, lectures
NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. International Conference on World Wide Web (WWW) 2007. May 8-12, 2007. Banff, Alberta, Canada. Pages 201-210.
FONT TELLER
Homework 1 (out next week; tasks subject to change)
- Simple “End-to-end” analysis
- Collect data using API
- Store in SQLite database
- Create graph from data
- Analyze, using SQL queries (e.g., create graph’s degree distribution)
- Visualize graph using Gephi
- Describe your discoveries
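As a rough illustration of the "store in SQLite, then analyze with SQL" steps, the sketch below computes a graph's degree distribution with a single query. The table name `edges`, its columns `source` and `target`, and the function name `degree_distribution` are assumptions made for this example, not the homework's required schema. It also assumes an undirected graph in which each edge is stored once and there are no self-loops.

```python
import sqlite3


def degree_distribution(edges):
    """Return [(degree, number_of_nodes), ...] for an undirected edge list."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

    query = """
        WITH endpoints AS (
            -- Each edge contributes one degree to each of its endpoints.
            SELECT source AS node FROM edges
            UNION ALL
            SELECT target AS node FROM edges
        ),
        degrees AS (
            SELECT node, COUNT(*) AS degree
            FROM endpoints
            GROUP BY node
        )
        SELECT degree, COUNT(*) AS num_nodes
        FROM degrees
        GROUP BY degree
        ORDER BY degree
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Hypothetical toy graph: a has degree 3, b and c have degree 2, d has degree 1.
    sample = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
    print(degree_distribution(sample))  # [(1, 1), (2, 2), (3, 1)]
```

The same query should work against a file-backed database by passing a filename to `sqlite3.connect` instead of `":memory:"`.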
Collection Cleaning Integration Visualization Analysis Presentation Dissemination