Data & Visual Analytics (Mahdi Roozbahani, Lecturer)



slide-1
SLIDE 1

CX4242:

Data & Visual Analytics

Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

slide-2
SLIDE 2

Assignments Overview

(Tentative and subject to change)

CX 4242

slide-3
SLIDE 3

Assignment 1

Platforms, Languages & Technologies

Python, Gephi, SQLite, D3, OpenRefine

Questions

Q1: Collecting and visualizing data (Python & Gephi)
Q2: Analyzing data using SQLite
Q3: D3 warmup
Q4: Analyzing data through OpenRefine

slide-4
SLIDE 4

Assignment 2

Platforms, Languages & Technologies

D3, Tableau

Questions

Q1: Designing a good table and visualizing data with Tableau
Q2: Force-directed graph using D3
Q3: Scatter plots using D3
Q4: Heatmap using D3
Q5: Interactive visualization using D3
Q6: Choropleth map using D3
Q7: Pros and cons of various visualization tools

slide-5
SLIDE 5

Assignment 3

Platforms, Languages & Technologies

Java, Hadoop, Spark, Pig, Azure

Questions

Q1: Analyzing a graph with Hadoop/Java
Q2: Analyzing a graph with Spark/Scala on Databricks
Q3: Analyzing data with Pig on AWS
Q4: Analyzing a graph using Hadoop on Microsoft Azure
Q5: Regression using Azure ML Studio

slide-6
SLIDE 6

Assignment 4

Platforms, Languages & Technologies

PyPy, PageRank, Random Forest, scikit-learn

Questions

Q1: Scalable single-machine PageRank
Q2: Implementing a random forest classifier
Q3: Using scikit-learn to run various classifiers

slide-7
SLIDE 7

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

slide-8
SLIDE 8

Building blocks. Not Rigid “Steps”.

Can skip some
Can go back (two-way street)

  • Data types inform visualization design
  • Data size informs choice of algorithms
  • Visualization motivates more data cleaning
  • Visualization challenges algorithm assumptions, e.g., the user finds that results don’t make sense


slide-9
SLIDE 9

How does “big data” affect the process?

(Hint: almost everything is harder!)

The Vs of big data (3 Vs originally, then 7, now 42)
Volume: “billions” and “petabytes” are common
Velocity: think Twitter, fraud detection, etc.
Variety: text (webpages), video (YouTube)…
Veracity: uncertainty of data
Variability
Visualization
Value


http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://dataconomy.com/seven-vs-big-data/
https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx

slide-10
SLIDE 10

Three Example Projects

from Polo and Mahdi’s research group

slide-11
SLIDE 11

Apolo Graph Exploration: Machine Learning + Visualization

Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.

slide-12
SLIDE 12


Beautiful Hairball Death Star Spaghetti

slide-13
SLIDE 13

Finding More Relevant Nodes

Apolo uses guilt-by-association (Belief Propagation)

[Figure: citation network containing an HCI paper and a Data Mining paper]

slide-14
SLIDE 14

Demo: Mapping the Sensemaking Literature

Nodes: 80k papers from Google Scholar (node size: number of citations)
Edges: 150k citations

slide-15
SLIDE 15
slide-16
SLIDE 16

Key Ideas (Recap)

Specify exemplars
Find other relevant nodes (BP)
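The two ideas above can be illustrated with a minimal sketch of loopy belief propagation on a toy citation graph. This is not Apolo's actual implementation: the node names, the exemplar priors, and the `homophily` value of 0.8 are all assumptions chosen for illustration. Exemplars get a strong prior for their class, every other node starts uniform, and neighboring nodes are assumed more likely to share a class than not.

```python
import numpy as np

def belief_propagation(edges, priors, n_states=2, homophily=0.8, n_iters=20):
    """Loopy BP on an undirected graph; returns {node: class probabilities}."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Edge potential (assumed): neighbors share a class with prob. `homophily`.
    psi = np.full((n_states, n_states), (1.0 - homophily) / (n_states - 1))
    np.fill_diagonal(psi, homophily)

    uniform = np.full(n_states, 1.0 / n_states)
    # Node potentials: exemplars carry a strong prior, other nodes are uniform.
    phi = {n: np.asarray(priors.get(n, uniform), dtype=float) for n in neighbors}
    # msgs[(u, v)] is the message from u to v, a vector over v's classes.
    msgs = {(u, v): uniform.copy() for u in neighbors for v in neighbors[u]}

    for _ in range(n_iters):
        new_msgs = {}
        for (u, v) in msgs:
            # Combine u's prior with messages from all neighbors except v.
            incoming = phi[u].copy()
            for w in neighbors[u]:
                if w != v:
                    incoming = incoming * msgs[(w, u)]
            m = psi.T @ incoming  # sum over u's classes
            new_msgs[(u, v)] = m / m.sum()
        msgs = new_msgs

    beliefs = {}
    for n in neighbors:
        b = phi[n].copy()
        for w in neighbors[n]:
            b = b * msgs[(w, n)]
        beliefs[n] = b / b.sum()
    return beliefs

# Hypothetical toy graph: a path from an HCI exemplar to a data mining exemplar.
citation_edges = [("hci_seed", "a"), ("a", "b"), ("b", "d"),
                  ("d", "c"), ("c", "dm_seed")]
# Class 0 = HCI, class 1 = data mining.
exemplar_priors = {"hci_seed": [0.99, 0.01], "dm_seed": [0.01, 0.99]}

beliefs = belief_propagation(citation_edges, exemplar_priors)
for node in ["a", "b", "d", "c"]:
    print(node, beliefs[node].round(3))
```

Nodes closer to the HCI exemplar end up with a higher belief for class 0, and nodes closer to the data mining exemplar lean toward class 1, which is the guilt-by-association effect in miniature.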


slide-17
SLIDE 17

What did Apolo go through?

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

Scrape Google Scholar. No API. 😪
Design inference algorithm (Which nodes to show next?)
Paper, talks, lectures
Interactive visualization you just saw
You will use a new Apolo prototype (called Argo)

slide-18
SLIDE 18


Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. ACM Conference on Human Factors in Computing Systems (CHI) 2011. May 7-12, 2011.

slide-19
SLIDE 19

NetProbe: Fraud Detection in Online Auction

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

slide-20
SLIDE 20

NetProbe: The Problem

Find bad sellers (fraudsters) on eBay who don’t deliver their items

[Figure: Buyer pays ($$$) Seller]

Non-delivery fraud is a common type of auction fraud

source: https://www.fbi.gov/contact-us/field-offices/portland/news/press-releases/fbi-tech-tuesday---building-a-digital-defense-against-auction-fraud

slide-21
SLIDE 21


slide-22
SLIDE 22

NetProbe: Key Ideas

Fraudsters fabricate their reputation by “trading” with their accomplices
Fake transactions form near-bipartite cores
How to detect them?

slide-23
SLIDE 23

NetProbe: Key Ideas

Use Belief Propagation

F: Fraudster, A: Accomplice, H: Honest

Darker means more likely

slide-24
SLIDE 24

NetProbe: Main Results


slide-25
SLIDE 25


“Belgian Police”

slide-26
SLIDE 26


slide-27
SLIDE 27

What did NetProbe go through?

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

Scraping (built a “scraper”/“crawler”)
Design detection algorithm
Not released
Paper, talks, lectures

slide-28
SLIDE 28


NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. International Conference on World Wide Web (WWW) 2007. May 8-12, 2007. Banff, Alberta, Canada. Pages 201-210.

slide-29
SLIDE 29

FONT TELLER

slide-30
SLIDE 30

Homework 1 (out next week; tasks subject to change)

  • Simple “End-to-end” analysis
  • Collect data using API
  • Store in SQLite database
  • Create graph from data
  • Analyze, using SQL queries (e.g., create graph’s degree distribution)

  • Visualize graph using Gephi
  • Describe your discoveries
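As a rough illustration of the "store in SQLite, then analyze with SQL" steps, here is a minimal sketch that loads an edge list into an in-memory SQLite database and computes the graph's degree distribution with one query. The `edges` table, its `source`/`target` columns, and the sample edges are hypothetical; the actual homework data and schema are not specified in these slides. The graph is treated as undirected, so each edge counts toward the degree of both endpoints.

```python
import sqlite3

# Count each node's degree, then count how many nodes have each degree.
DEGREE_DISTRIBUTION_SQL = """
WITH endpoints AS (
    SELECT source AS node FROM edges
    UNION ALL
    SELECT target AS node FROM edges
),
degrees AS (
    SELECT node, COUNT(*) AS degree
    FROM endpoints
    GROUP BY node
)
SELECT degree, COUNT(*) AS num_nodes
FROM degrees
GROUP BY degree
ORDER BY degree
"""

def degree_distribution(edge_list):
    """Return [(degree, number_of_nodes), ...] for an undirected edge list."""
    conn = sqlite3.connect(":memory:")  # use a file path to persist the data
    try:
        conn.execute(
            "CREATE TABLE edges (source TEXT NOT NULL, target TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO edges VALUES (?, ?)", edge_list)
        return conn.execute(DEGREE_DISTRIBUTION_SQL).fetchall()
    finally:
        conn.close()

# Hypothetical sample graph with four nodes and four edges.
sample_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_distribution(sample_edges))  # [(1, 1), (2, 2), (3, 1)]
```

The same `edges` table could be exported as a CSV edge list for Gephi, which keeps collection, storage, and visualization working from one source of truth.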
