Syllabus link 1 syllabus BDA17 Syllabus version C 2 Syllabus - - PowerPoint PPT Presentation



SLIDE 1
SLIDE 2

Syllabus link

Syllabus

SLIDE 3
  • 1 Syllabus: BDA17 Syllabus version C
  • 2 Syllabus supplement about memos: Memorandum format for assignments
  • 3 Overheads: Microsoft intro BDA day 1, Microsoft presentation
  • 4 Overheads: some sample problems, BDA examples
SLIDE 4
SLIDE 5
  • 1 Main course Web site https://irgn452.wordpress.com
  • 2 Handouts page https://irgn452.wordpress.com/irgn452-big-data-analytics/handouts/
  • 3 TritonED page
  • 5 Udacity course page https://classroom.udacity.com/courses/ud651/lessons/755618712/concepts/8140985970923

SLIDE 6
  • 1 Main text: DMBA-R, Shmueli, w/ R, Wiley (Bohn)
  • 2 Rattle text http://link.springer.com/book/10.1007/978-1-4419-9890-3
  • 3 ISLR book http://link.springer.com/book/10.1007/978-1-4614-7138-7
  • 4 Use R! series http://link.springer.com/search?facet-series=%226991%22&facet-content-type=%22Book%22
  • 5 R for Stata users http://link.springer.com/book/10.1007/978-1-4419-1318-0
  • 6 R in Action: R in Action, Second Edition
  • 7 Library page for O’Reilly books http://ucsd.worldcat.org/title/r-cookbook/oclc/733755354?referer=br&ht=edition
  • Udacity course https://classroom.udacity.com/courses/ud651/lessons/755298985/concepts/8651687310923#

SLIDE 7
  • From Week 3 onward, all homework must be done using R.
  • Attend the weekly TA tutorials, which will cover both R and data analytics. These tutorials are recommended for all students.
  • Take a special evening R test in week 10. The test is open book, open computer.
  • Submit all code for your final project as a loadable R workspace.

SLIDE 8


SLIDE 9
  • Go through the Udacity course, up to ggplot.
  • DMBA Ch. 3: Can you reproduce pp. 60 and 61?
  • Use ggplot2.
  • Read Ch. 2
SLIDE 10

BIG DATA

SLIDE 11

What happened? Why did it happen? What will happen? How can we make it happen?

Traditional BI Advanced Analytics

SLIDE 12

Source: Drew Conway, http://www.dataists.com/2010/09/the-data-science-venn-diagram/

Diagram labels: Data Integration, Mashups, Applications, Models, Visualization, Predictions, Uncertainty, Problems, Data Sources, Credibility, Effective Data Applications

SLIDE 13
SLIDE 14
SLIDE 15
  • ETL
  • Marketing channel data
  • Behavioral variables
  • Promotional data
  • Overlay data
  • Exploratory data analysis
  • Time-to-event models
  • GAM survival models
  • Scoring for inference
  • Scoring for prediction
  • 5 billion scores per day per retailer

CUSTOM DATA FORMAT CUSTOM VARIABLES (PMML)
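
To put the scoring volume on this slide in perspective, the daily figure can be converted to a sustained per-second rate. The following is a minimal sketch in Python, assuming the load is spread evenly over a 24-hour day (the slide does not say so); the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def scores_per_second(scores_per_day: float) -> float:
    """Average scoring rate, assuming an evenly spread load (an assumption)."""
    return scores_per_day / SECONDS_PER_DAY


# 5 billion scores per day per retailer, the figure quoted on the slide
rate = scores_per_second(5e9)
print(f"{rate:,.0f} scores per second per retailer")  # prints "57,870 scores per second per retailer"
```

Real traffic is likely to be bursty, so peak rates could be well above this average.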

SLIDE 16
SLIDE 17

Trends

SLIDE 18

Image: http://commons.wikimedia.org/wiki/File:Google%E2%80%99s_First_Production_Server.jpg (CC-BY-2.0)

  • 1996: 10 x 4 GB hard drives
  • 2000: 5000 Linux PCs
  • Today: > 2 billion servers (estimated)

“I don't think the web would exist without open source and Linux. So there would have been no Google.”

— Chris DiBona, Google

SLIDE 19
  • Cost Reduction (freedom to use / redistribute)
  • Time-to-market (freedom to share)
  • Innovation (freedom to tinker)
SLIDE 20
  • Most widely used data analysis software
  • Most powerful statistical programming language
  • Create beautiful and unique data visualizations
  • Thriving open-source community
  • Fills the talent gap

www.revolutionanalytics.com/what-is-r

SLIDE 21

New York Times, June 25 2009 (3 hours after Michael Jackson’s death)

SLIDE 22
  • TrueSkill Matchmaking System
  • Player Churn
  • Game design
  • In-game purchase optimization
  • Fraud detection
  • Player communities
SLIDE 23
SLIDE 24
  • R ≃ Stata
  • Across all fields
  • In economics, Stata dominates (not shown)

SLIDE 25

blog.revolutionanalytics.com/popularity

  • Rexer Data Miner Survey
  • IEEE Spectrum, July 2014

Language popularity: R ranked #9 (IEEE Spectrum Top Programming Languages)

SLIDE 26

THE FUTURE: CLOUD DATA ANALYTICS SERVICES

SLIDE 27
  • Exposing the expertise of data scientists as APIs
  • Bringing the utility of data science to applications
  • Addressing the Data Science talent gap
SLIDE 28

Azure: Huge infrastructure scale

19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing

  • 100+ datacenters
  • One of the top 3 networks in the world (coverage, speed, connections)
  • 2x AWS and 6x Google number of offered regions
  • G Series: largest VM available in the market (32 cores, 448 GB RAM, SSD…)

Regions and locations (map legend: Operational / Announced):

  • Central US: Iowa
  • West US: California
  • North Europe: Ireland
  • East US: Virginia
  • East US 2: Virginia
  • US Gov: Virginia
  • North Central US: Illinois
  • US Gov: Iowa
  • South Central US: Texas
  • Brazil South: Sao Paulo
  • West Europe: Netherlands
  • China North *: Beijing
  • China South *: Shanghai
  • Japan East: Saitama
  • Japan West: Osaka
  • India West: TBD
  • India East: TBD
  • East Asia: Hong Kong
  • SE Asia: Singapore
  • Australia West: Melbourne
  • Australia East: Sydney

* Operated by 21Vianet

SLIDE 29
SLIDE 30

Data Scientist

Interact directly with data

Built-in to SQL Server

Data Developer/DBA

Manage data and analytics together

SQL Server 2016

Built-in in-database analytics

Example Solutions

  • Fraud detection
  • Sales forecasting
  • Warehouse efficiency
  • Predictive maintenance

Relational Data

Analytic Library

T-SQL Interface

Extensibility: R

R Integration


Microsoft Azure Machine Learning Marketplace

New R scripts


SLIDE 31

[Chart: axes are rows and minutes. Series shown: R on a server pulling data via SQL; R on a server invoking RRE ScaleR inside the EDW]

SLIDE 32