http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242:
Data & Visual Analytics
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Google “Polo Chau” (only one in the world)
How to address Polo?
Grammatically correct: Prof. Chau, Dr. Chau
Grammatically incorrect, but popular
Course Registration
This classroom seats 300. Almost all physical seats have been filled. If you are on the waitlist, please wait for seats to be released (some students typically “drop” after today).
Course TAs
Be very, very nice to them!
Office hours and locations (TBD) on course homepage
poloclub.gatech.edu/cse6242
Neetha Ravishankar, Jennifer Ma, Mansi Mathur, Arathi Arivayutham, Vineet Vinayak Pasupulety, Siddharth Gulati
Acar
@Symantec
Robert Brian Chad
@Southwestern Univ
Shang Srishti
@Apple
Florian
Shan
@Oracle
Aakash
Samuel
CMU Masters
Jerry
Stanford PhD
Paras
Berkeley PhD
Victor
Peter
UCLA PhD
Meera
@Microsoft
Fred Andy Nilaksh Madhuri Matthew Bob
poloclub.gatech.edu
We work with (really) large data.
Internet
50 Billion Web Pages
www.worldwidewebsize.com www.opte.org
2 Billion Users
Citation Network
www.scirus.com/press/html/feb_2006.html#2 Modified from well-formed.eigenfactor.org
250 Million Articles
Who-follows-whom (500 million users) Who-buys-what (120 million users)
cellphone network
Who-calls-whom (100 million users)
Protein-protein interactions
200 million possible interactions in human genome
Many More
Sources: www.selectscience.net www.phonedog.com www.mediabistro.com www.practicalecommerce.com/
“Big Data” Analyzed
DATA INSIGH
Graph                          Nodes          Edges
YahooWeb                       1.4 Billion    6 Billion
Symantec Machine-File Graph    1 Billion      37 Billion
Twitter                        104 Million    3.7 Billion
Phone call network             30 Million     260 Million
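To get a feel for how densely connected these graphs are, the edges-per-node ratio can be worked out directly from the table. This is only a small illustrative sketch: the counts are copied from the table, and the names `graphs` and `ratios` are made up for this example.

```python
# Node and edge counts copied from the table above.
graphs = {
    "YahooWeb": (1.4e9, 6e9),
    "Symantec Machine-File Graph": (1e9, 37e9),
    "Twitter": (104e6, 3.7e9),
    "Phone call network": (30e6, 260e6),
}

# Edges per node: a rough measure of how densely connected each graph is.
ratios = {name: edges / nodes for name, (nodes, edges) in graphs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} edges per node")
```

The ratios differ by almost an order of magnitude across the four graphs, so "big" does not mean the same thing for each of them.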
We also work with small data. Small data also needs love.
Number of items an average human holds in working memory: 7 ± 2
George Miller, 1956
How to do that?
Both develop methods for making sense of network data
How to do that?
COMPUTATION                                  INTERACTIVE VIS
Automatic                                    User-driven; iterative
Summarization, clustering, classification    Interaction, visualization
>Millions of nodes                           Thousands of nodes
Our research combines the Best of Both Worlds
Our Approach for Big Data Analytics
DATA MINING                                  HCI
Automatic                                    User-driven; iterative
Summarization, clustering, classification    Interaction, visualization
>Millions of items                           Thousands of items
Human-Computer Interaction
Our mission & vision:
Scalable, interactive, usable tools for big data analytics
“Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.”
(Einstein might or might not have said this.)
Machine Learning + Visualization
Recently received $1.2 Million NSF award
http://www.scs.gatech.edu/news/522401/12m-nsf-award-helps-consumers-enter-age-big-data
Carina: Million-node Graph Exploration in Web Browser [WWW’17]
Find co-directors who made at least two films together, starring the same actor.
VISAGE: Interactive Visual Graph Querying
VISAGE: Interactive Visual Graph Querying. Robert Pienta, Acar Tamersoy, Sham Navathe, Hanghang Tong, Alex Endert, Duen Horng Chau. International Working Conference on Advanced Visual Interfaces (AVI 2016).
SIGMOD’17 Best Demo, honorable mention
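The example query on this slide (co-directors who made at least two films together, starring the same actor) can be sketched in plain Python over a toy film table. This is a minimal sketch and not how VISAGE itself works: the film, director, and actor names are made up, `co_directors_with_shared_actor` is a hypothetical helper, and the query is read as "pairs of directors who co-directed at least two films that share an actor".

```python
from collections import defaultdict
from itertools import combinations

# Toy data (made up for illustration): film -> (directors, actors)
films = {
    "Film A": ({"Dana", "Eli"}, {"Actor X", "Actor Y"}),
    "Film B": ({"Dana", "Eli"}, {"Actor X"}),
    "Film C": ({"Dana", "Fay"}, {"Actor Z"}),
    "Film D": ({"Eli", "Fay"}, {"Actor Y"}),
    "Film E": ({"Eli", "Fay"}, {"Actor Z"}),
}


def co_directors_with_shared_actor(films, min_films=2):
    """Return {(director1, director2): {actor: [films]}} for qualifying pairs."""
    # (director pair, actor) -> films the pair co-directed that star the actor
    by_pair_actor = defaultdict(list)
    for title, (directors, actors) in films.items():
        for pair in combinations(sorted(directors), 2):
            for actor in actors:
                by_pair_actor[(pair, actor)].append(title)

    # Keep only (pair, actor) combinations backed by enough films.
    result = defaultdict(dict)
    for (pair, actor), titles in by_pair_actor.items():
        if len(titles) >= min_films:
            result[pair][actor] = sorted(titles)
    return dict(result)


matches = co_directors_with_shared_actor(films)
print(matches)
```

In this toy data, Eli and Fay also co-directed two films, but no actor appears in both, so only the Dana and Eli pair qualifies. Writing even this small query by hand takes some care, which is the motivation for building it visually.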
ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models. Minsuk Kahng, Pierre Andrews, Aditya Kalro, Duen Horng (Polo) Chau. IEEE Transactions on Visualization and Computer Graphics (Proc. VAST'17), Jan 2018.
Visualization & Interpretation of Deep Learning Models
Deployed on ML platform of
Polo’s primary application area:
Cyber Security
Patented with Symantec
Finds malware from 37 billion file relationships
Serving 120 million users worldwide
Published at SDM’11, KDD’14
Polonium & AESOP
Auction Fraud Detection on eBay
$$$
Best papers of SDM 2014
(top data mining conference)
Detecting Fake Yelp Reviews
Insider Trading Detection
with Securities and Exchange Commission (SEC)
Logistics
Course homepage: poloclub.gatech.edu/cse6242/ (all assignments and slides posted here)
Piazza: goo.gl/cGvHeE (discussion, Q&A, find teammates)
Assignment submission: T-Square (use Piazza for discussion)
Make sure you’re at the right Piazza! (CSE-6242-O01, CSE-6242-OAN have their Piazza forums too)
Course Homepage
For syllabus, HWs, projects, datasets, etc.
Google “cse6242”
poloclub.gatech.edu/cse6242/2018spring
Join Piazza ASAP
goo.gl/cGvHeE
Important to join Piazza because…
data science in general
Course Goals
What is Data & Visual Analytics?
No formal definition!
Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.
What are the “ingredients”?
Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc.
It wasn’t this complex before the big data era. Why?
http://spanning.com/blog/choosing-between-storage-based-and-unlimited-storage-for-cloud-data-backup/
What is big data? Why care?
Many businesses are based on big data.
Search engines: rank webpages, predict what you’re going to type
Advertisement: infer what you like, based on what your friends like; show relevant ads
E-commerce: recommend movies/products (e.g., Netflix, Amazon)
Health IT: patient records (EMR)
Finance
Good news! Many jobs!
Most companies are looking for “data scientists”
The data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team
Breadth of knowledge is important. This course helps you learn some important skills.
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Course Schedule
(Analytics Building Blocks)
Building blocks. Not Rigid “Steps”
Can skip some
Can go back (two-way street)
assumptions e.g., user finds that results don’t make sense
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
and use them in complementary ways
real data & problems
Course Goals
Grading
Collaborating on homework
Late submission policy
Working on Homework
You’ll be writing a lot of code
Q: Is it OK to copy and use code found on the web?
A: No
Q: Why?
A: Here’s why…
WARNING: Do not plagiarize!
modifying it.
understand what it is doing, and then try to accomplish what it is trying to do using your own code. It’s also good practice to cite the sources (e.g., as part of your code comments).
use your own words, otherwise it will be considered
Are You Ready to Take this Course?
short amount of time
JavaScript + CSS + HTML
The best way to find out is to check out previous semester’s homework assignments
e.g., http://poloclub.gatech.edu/cse6242/2017fall/
From Previous Classes…
conferences (KDD, IUI, etc.)
IUI Full conference paper
KDD Workshop paper
IUI Poster paper
“I feel like the concepts from your class are like a rite of passage for an aspiring data scientist. Assignments lead to a feelings of accomplishment and truly progressing in my area of passion.”
“I really get more intuition about how to deal with data with some powerful tools in HW3 [uses AWS]. That feeling is beyond description for me.”
“I would like to say thank you for your class! Thanks to the skills I got from the class and the project, I got the offer.”
What Polo expects from you
answer questions on Piazza
class for Q&A
FREE After-class Coffee ☕
(+2 volunteers) for FREE after-class coffee
pastries — whatever you want
starting next week!