Analytics Building Blocks, Duen Horng (Polo) Chau, Assistant Professor



slide-1
SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Analytics Building Blocks

Duen Horng (Polo) Chau


Assistant Professor
 Associate Director, MS Analytics
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

slide-2
SLIDE 2


What is Data & Visual Analytics?

slide-3
SLIDE 3


What is Data & Visual Analytics?

No formal definition!

slide-4
SLIDE 4

What is Data & Visual Analytics?

No formal definition!

Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.

slide-5
SLIDE 5


What are the “ingredients”?

slide-6
SLIDE 6


What are the “ingredients”?

Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc.

It wasn’t this complex before the big data era. Why?

slide-7
SLIDE 7


http://spanning.com/blog/choosing-between-storage-based-and-unlimited-storage-for-cloud-data-backup/

slide-8
SLIDE 8

What is big data? Why care?

  • Many companies’ businesses are based on big data (Google, Facebook, Amazon, Apple, Symantec, LinkedIn, and many more)
  • Web search
      • Rank webpages (PageRank algorithm)
      • Predict what you’re going to type
  • Advertisement (e.g., on Facebook)
      • Infer users’ interest; show relevant ads
      • Infer what you like, based on what your friends like
  • Recommendation systems (e.g., Netflix, Pandora, Amazon)
  • Online education
  • Health IT: patient records (EMR)
  • Bio and chemical modeling
  • Finance
  • Cybersecurity
  • Internet of Things (IoT)

(“Big data” is a buzzword; so is “IoT”, the Internet of Things.)

slide-9
SLIDE 9

Good news! Many jobs!

Most companies are looking for “data scientists”.

“The data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team.”

  • Gartner (http://www.gartner.com/it-glossary/data-scientist)

Breadth of knowledge is important. This course helps you learn some important skills.

slide-10
SLIDE 10

Analytics Building Blocks

slide-11
SLIDE 11

Collection, Cleaning, Integration, Visualization, Analysis, Presentation, Dissemination

slide-12
SLIDE 12

Building blocks, not “steps”

  • Can skip some
  • Can go back (two-way street)
  • Examples
      • Data types inform visualization design
      • Data informs choice of algorithms
      • Visualization informs data cleaning (dirty data)
      • Visualization informs algorithm design (user finds that results don’t make sense)


slide-13
SLIDE 13

How does big data affect the process?

The Vs of big data (3Vs, 4Vs, now 7Vs)

  • Volume: “billions”, “petabytes” are common
  • Velocity: think Twitter, fraud detection, etc.
  • Variety: text (webpages), video (YouTube)…
  • Veracity: uncertainty of data
  • Variability
  • Visualization
  • Value


http://www.ibmbigdatahub.com/infographic/four-vs-big-data
 http://dataconomy.com/seven-vs-big-data/

slide-14
SLIDE 14

Gartner's 2016 Hype Cycle

http://www.gartner.com/newsroom/id/3412017
https://en.wikipedia.org/wiki/Hype_cycle

slide-15
SLIDE 15

“Artificial Intelligence”

slide-16
SLIDE 16

We’re in the 3rd wave of “AI” boom

  • Two “AI winters” before
    https://en.wikipedia.org/wiki/History_of_artificial_intelligence
  • We should be cautiously optimistic (Polo’s motto)

slide-17
SLIDE 17
slide-18
SLIDE 18

AI Safety

slide-19
SLIDE 19

Good Read about AI: White House Report

Preparing for the Future of Artificial Intelligence

https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf

slide-20
SLIDE 20

“The Current State of AI

Remarkable progress has been made on what is known as Narrow AI, which addresses specific application areas such as playing strategic games, language translation, self-driving vehicles, and image recognition. Narrow AI underpins many commercial services such as trip planning, shopper recommendation systems, and ad targeting, and is finding important applications in medical diagnosis, education, and scientific research. These have all had significant societal benefits and have contributed to the economic vitality of the Nation.

slide-21
SLIDE 21

General AI (sometimes called Artificial General Intelligence, or AGI) refers to a notional future AI system that exhibits apparently intelligent behavior at least as advanced as a person across the full range of cognitive tasks. A broad chasm seems to separate today’s Narrow AI from the much more difficult challenge of General AI. Attempts to reach General AI by expanding Narrow AI solutions have made little headway over many decades of research. The current consensus of the private-sector expert community, with which the NSTC Committee on Technology concurs, is that General AI will not be achieved for at least decades.”

slide-22
SLIDE 22
slide-23
SLIDE 23

No Matrix or SkyNet in Your Lifetime

slide-24
SLIDE 24

Schedule

Collection, Cleaning, Integration, Visualization, Analysis, Presentation, Dissemination

slide-25
SLIDE 25

Two Example Projects from Polo Club

slide-26
SLIDE 26

Apolo Graph Exploration: Machine Learning + Visualization

Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.

slide-27
SLIDE 27


slide-28
SLIDE 28


Beautiful Hairball Death Star Spaghetti

slide-29
SLIDE 29

Finding More Relevant Nodes

[Figure: citation network with an HCI paper and a Data Mining paper]

slide-30
SLIDE 30


slide-31
SLIDE 31

Finding More Relevant Nodes

Apolo uses guilt-by-association (Belief Propagation)

[Figure: citation network with an HCI paper and a Data Mining paper]

slide-32
SLIDE 32

Demo: Mapping the Sensemaking Literature

Nodes: 80k papers from Google Scholar (node size: #citations)
Edges: 150k citations

slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35

Key Ideas (Recap)

  • Specify exemplars
  • Find other relevant nodes (BP)
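A minimal Python sketch of the guilt-by-association idea recapped above: the user marks a few exemplar nodes, and relevance spreads to nearby nodes along the edges. This is not Apolo's implementation. It swaps in a simple neighbor-averaging propagation for full Belief Propagation, and the function name, the damping value, and the toy citation edges are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_relevance(edges, exemplars, damping=0.5, iterations=20):
    """Spread relevance from exemplar nodes to their neighbors.

    Simplified stand-in for Belief Propagation: in each round a node's score
    mixes its own prior (1.0 for exemplars, 0.0 otherwise) with the average
    score of its neighbors.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    prior = {n: (1.0 if n in exemplars else 0.0) for n in neighbors}
    score = dict(prior)
    for _ in range(iterations):
        new_score = {}
        for node, nbrs in neighbors.items():
            avg = sum(score[m] for m in nbrs) / len(nbrs)
            new_score[node] = (1 - damping) * prior[node] + damping * avg
        score = new_score
    return score


# Hypothetical citation network: a chain from HCI papers to data mining papers.
edges = [
    ("hci_1", "hci_2"),
    ("hci_2", "hci_3"),
    ("hci_3", "bridge"),
    ("bridge", "dm_1"),
    ("dm_1", "dm_2"),
]
scores = propagate_relevance(edges, exemplars={"hci_1"})

# Rank the non-exemplar papers by how relevant they look to the exemplar.
ranked = sorted((n for n in scores if n != "hci_1"), key=scores.get, reverse=True)
print(ranked)
```

Papers closer to the exemplar receive higher scores, which is the "which nodes to show next?" question in miniature.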

slide-36
SLIDE 36

What did Apolo go through?

Collection, Cleaning, Integration, Visualization, Analysis, Presentation, Dissemination

  • Scrape Google Scholar. No API :(
  • Design inference algorithm (Which nodes to show next?)
  • Paper, talks, lectures
  • Interactive visualization you just saw

slide-37
SLIDE 37


Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. ACM Conference on Human Factors in Computing Systems (CHI) 2011. May 7-12, 2011.

slide-38
SLIDE 38

NetProbe: Fraud Detection in Online Auction

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

slide-39
SLIDE 39

NetProbe: The Problem

Find bad sellers (fraudsters) on eBay who don’t deliver their items

[Figure: Buyer, $$$, Seller]

Auction fraud was the #3 online crime in 2010 (source: www.ic3.gov)

slide-40
SLIDE 40


slide-41
SLIDE 41

NetProbe: Key Ideas

  • Fraudsters fabricate their reputation by “trading” with their accomplices
  • Fake transactions form near bipartite cores
  • How to detect them?

slide-42
SLIDE 42

NetProbe: Key Ideas

Use Belief Propagation

F: Fraudster, A: Accomplice, H: Honest

Darker means more likely
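A hedged Python sketch of loopy Belief Propagation over the three states named on this slide (Fraudster, Accomplice, Honest). It is an illustration, not NetProbe's code: the EPS value, the PSI compatibility table, the priors, and the tiny trading graph are assumptions chosen so that fraudsters mostly link to accomplices, while accomplices link to both fraudsters and honest users.

```python
STATES = ("F", "A", "H")  # Fraudster, Accomplice, Honest
EPS = 0.05  # assumed noise level, for illustration only

# PSI[x_i][x_j]: assumed compatibility of a node in state x_i having a
# neighbor in state x_j (each row sums to 1).
PSI = {
    "F": {"F": EPS, "A": 1 - 2 * EPS, "H": EPS},
    "A": {"F": 0.5, "A": 2 * EPS, "H": 0.5 - 2 * EPS},
    "H": {"F": EPS, "A": (1 - EPS) / 2, "H": (1 - EPS) / 2},
}


def normalize(vec):
    total = sum(vec.values())
    return {s: v / total for s, v in vec.items()}


def belief_propagation(edges, priors, iterations=30):
    """Synchronous loopy BP on an undirected graph; returns per-node beliefs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    uniform = {s: 1.0 / len(STATES) for s in STATES}
    phi = {n: priors.get(n, uniform) for n in neighbors}
    # messages[(i, j)][x_j]: what node i tells neighbor j about j's state.
    messages = {(i, j): dict(uniform) for i in neighbors for j in neighbors[i]}

    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            out = {}
            for xj in STATES:
                total = 0.0
                for xi in STATES:
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:  # exclude the recipient's own message
                            incoming *= messages[(k, i)][xi]
                    total += phi[i][xi] * PSI[xi][xj] * incoming
                out[xj] = total
            updated[(i, j)] = normalize(out)
        messages = updated

    beliefs = {}
    for i in neighbors:
        b = dict(phi[i])
        for k in neighbors[i]:
            for s in STATES:
                b[s] *= messages[(k, i)][s]
        beliefs[i] = normalize(b)
    return beliefs


# Hypothetical trading graph: f1 and f2 both trade with a1 and a2 (a small
# bipartite core); a1 and a2 also trade with honest users h1 and h2.
edges = [("f1", "a1"), ("f1", "a2"), ("f2", "a1"), ("f2", "a2"),
         ("a1", "h1"), ("a2", "h2")]
priors = {
    "f1": {"F": 0.9, "A": 0.05, "H": 0.05},  # assumed: a reported fraudster
    "h1": {"F": 0.05, "A": 0.05, "H": 0.9},  # assumed: trusted users
    "h2": {"F": 0.05, "A": 0.05, "H": 0.9},
}
beliefs = belief_propagation(edges, priors)
for node in sorted(beliefs):
    print(node, {s: round(p, 3) for s, p in beliefs[node].items()})
```

In this toy graph the two middle nodes come out as most likely accomplices, even though they were given no prior.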

slide-43
SLIDE 43

NetProbe: Main Results


slide-44
SLIDE 44


slide-45
SLIDE 45


slide-46
SLIDE 46


“Belgian Police”

slide-47
SLIDE 47


slide-48
SLIDE 48

What did NetProbe go through?

Collection, Cleaning, Integration, Visualization, Analysis, Presentation, Dissemination

  • Scraping (built a “scraper”/“crawler”)
  • Design detection algorithm
  • Not released
  • Paper, talks, lectures

slide-49
SLIDE 49


NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. International Conference on World Wide Web (WWW) 2007. May 8-12, 2007. Banff, Alberta, Canada. Pages 201-210.

slide-50
SLIDE 50

Homework 1 (out next week; tasks subject to change)

  • Simple “end-to-end” analysis
  • Collect data using an API
      • Movies (actors, directors, related movies, etc.)
  • Store in SQLite database
  • Transform data to movie-movie network
  • Analyze, using SQL queries (e.g., create graph’s degree distribution)
  • Visualize, using Gephi
  • Describe your discoveries
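A rough Python sketch of the "store in SQLite, then analyze with SQL" steps listed above, using the standard sqlite3 module. The table names, column names, and sample movies are hypothetical and not taken from the homework; the query computes the degree distribution of a small undirected movie-movie network.

```python
import sqlite3

# In-memory database; the schema below is an assumed layout for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
-- One row per undirected movie-movie edge (e.g., "related movies").
CREATE TABLE related (source INTEGER, target INTEGER);
""")

movies = [(1, "Movie A"), (2, "Movie B"), (3, "Movie C"), (4, "Movie D")]
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
conn.executemany("INSERT INTO movies VALUES (?, ?)", movies)
conn.executemany("INSERT INTO related VALUES (?, ?)", edges)

# Degree distribution: for each degree value, how many movies have it.
degree_distribution = conn.execute("""
WITH endpoints AS (
    SELECT source AS movie_id FROM related
    UNION ALL
    SELECT target AS movie_id FROM related
),
degrees AS (
    SELECT movie_id, COUNT(*) AS degree
    FROM endpoints
    GROUP BY movie_id
)
SELECT degree, COUNT(*) AS num_movies
FROM degrees
GROUP BY degree
ORDER BY degree
""").fetchall()

for degree, num_movies in degree_distribution:
    print(f"degree {degree}: {num_movies} movie(s)")
```

Each edge is counted once for each endpoint (the UNION ALL), so a movie's degree is the number of edges that touch it.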
