Data Collection Duen Horng (Polo) Chau Assistant Professor - - PowerPoint PPT Presentation

data collection
SMART_READER_LITE
LIVE PREVIEW

Data Collection Duen Horng (Polo) Chau Assistant Professor - - PowerPoint PPT Presentation

http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Data Collection Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by


slide-1
SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Data Collection

Duen Horng (Polo) Chau


Assistant Professor
 Associate Director, MS Analytics
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

slide-2
SLIDE 2

How to Collect Data?

2

Method Effort Download Low API Medium Scrape/Crawl High

slide-3
SLIDE 3

How to Collect Data?

2

Method Effort Download Low API Medium Scrape/Crawl High

slide-4
SLIDE 4

Data you can just download

Yahoo WebScope datasets NYC Taxi data: Trip (11GB), Fare (7.7GB) StackOverflow (xml) Atlanta crime data (csv) Soccer statistics …

3

More on course website:
 http://poloclub.gatech.edu/cse6242/2016spring/#datasets

slide-5
SLIDE 5

Data you can just download

If you have leads, let us know on Piazza!

4

More datasets on course website:
 http://poloclub.gatech.edu/cse6242/2016spring/#datasets

slide-6
SLIDE 6

5

http://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning?soc_src=mail&soc_trk=ma

slide-7
SLIDE 7

6

https://webscope.sandbox.yahoo.com

slide-8
SLIDE 8

Collect Data via APIs

Twitter (small subset)


https://dev.twitter.com/streaming/overview

Last.fm (Pandora has unofficial API) Flickr Facebook (your friends only) CrunchBase (database about companies) Rotten Tomatoes not free anymore :-( iTunes

7

slide-9
SLIDE 9

Data that needs scraping

Amazon (reviews, product info) ESPN eBay Google Play Google Scholar

8

slide-10
SLIDE 10

How to Scrape?


Google Play example
 Goal: build network of similar apps

9

slide-11
SLIDE 11

How to Scrape?


Google Play example
 Goal: build network of similar apps

10

https://play.google.com/store/apps/details?
 id=com.shazam.android https://play.google.com/store/apps/details?
 id=com.spotify.music

slide-12
SLIDE 12

How to Store the Data?
 What’s the Easiest Way?

slide-13
SLIDE 13

Easiest Way to Store Data

As comma-separated files (CSV) But may not be easy to parse. Why?

12

slide-14
SLIDE 14

Easiest Way to Store Data

13

https://en.wikipedia.org/wiki/Comma-separated_values

slide-15
SLIDE 15

Most popular embedded database in the world iPhone (iOS), Android, Chrome (browsers), Mac, etc. Self-contained: one file contains data + schema Serverless: database right on your computer Zero-configuration: no need to set up!

http://www.sqlite.org/different.html 14 http://www.sqlite.org

slide-16
SLIDE 16

SQL Refresher: create table

>sqlite3 database.db sqlite> create table student(ssn integer, name text); sqlite> .schema CREATE TABLE student(ssn integer, name text);

ssn name

15

slide-17
SLIDE 17

SQL Refresher: insert rows

insert into student values(111, "Smith"); insert into student values(222, "Johnson"); insert into student values(333, "Obama"); select * from student;

ssn name 111 Smith 222 Johnson 333 Obama

16

slide-18
SLIDE 18

SQL Refresher: create another table

create table takes
 (ssn integer, course_id integer, grade integer);
 sqlite>.schema CREATE TABLE student(ssn integer, name text); CREATE TABLE takes (ssn integer, course_id integer, grade integer);

ssn course_id grade

17

slide-19
SLIDE 19

SQL Refresher: joining 2 tables

More than one tables - joins E.g., create roster for this course (6242)

ssn course_id grade 111 6242 100 222 6242 90 222 4000 80 ssn name 111 Smith 222 Johnson 333 Obama

18

slide-20
SLIDE 20

SQL Refresher: joining 2 tables + filtering

select name from student, takes
 where student.ssn = takes.ssn and takes.course_id = 6242;

ssn course_id grade 111 6242 100 222 6242 90 222 4000 80 ssn name 111 Smith 222 Johnson 333 Obama

19

slide-21
SLIDE 21

SQL General Form

select a1, a2, ... an 
 from t1, t2, ... tm 
 where predicate
 [order by ....]
 [group by ...]
 [having ...]

20

slide-22
SLIDE 22

Find ssn and GPA for each student

select ssn, avg(grade) 
 from takes 
 group by ssn
 having avg(grade) > 90;

ssn course_id grade 111 6242 100 222 6242 90 222 4000 80 ssn avg(grade) 111 100 222 85

21

slide-23
SLIDE 23

What if slow?

Important sanity check: Have you (or someone) created appropriate indexes? 
 SQLite’s indices use B-tree data structure.
 O(logN) speed for adding/finding/deleting an item create index student_ssn_index on student(ssn);

22