Syllabus link 1 syllabus BDA17 Syllabus version C 2 Syllabus - - PowerPoint PPT Presentation
Syllabus link
Syllabus
- 1 syllabus BDA17 Syllabus version C
- 2 Syllabus supplement about memos: Memorandum format for assignments
- 3 Overheads: Microsoft intro BDA day 1 Microsoft presentation
- 4 Overheads: some sample problems BDA examples
- 1 Main course Web site https://irgn452.wordpress.com
- 2 Handouts page https://irgn452.wordpress.com/irgn452-big-data-analytics/handouts/
- 3 TritonED page
- 5 Udacity course page https://classroom.udacity.com/courses/ud651/lessons/755618712/concepts/8140985970923
- 1 main text DMBA-R Shmueli w R Wiley (Bohn)
- 2 Rattle text http://link.springer.com/book/10.1007/978-1-4419-9890-3
- 3 ISLR book http://link.springer.com/book/10.1007/978-1-4614-7138-7
- 4 Use R! series http://link.springer.com/search?facet-series=%226991%22&facet-content-type=%22Book%22
- 5 R for Stata users http://link.springer.com/book/10.1007/978-1-4419-1318-0
- 6 R in Action: R in Action, Second Edition
- 7 Library page for O’Reilly books http://ucsd.worldcat.org/title/r-cookbook/oclc/733755354?referer=br&ht=edition
- Udacity course https://classroom.udacity.com/courses/ud651/lessons/755298985/concepts/8651687310923#
- From Week 3 onward, all homework must be done using R.
- Attend the weekly TA tutorials, which will cover both R and data analytics. These tutorials are recommended for all students.
- Take a special evening R test in week 10. The test is open book, open computer.
- Submit all code for your final project as a loadable R workspace.
- Go through the Udacity course, up to ggplot
- DMBA Ch. 3: Can you reproduce pp. 60 and 61?
- Use ggplot2.
- Read Ch. 2
BIG DATA
What happened? Why did it happen? What will happen? How can we make it happen?
Traditional BI Advanced Analytics
Drew Conway, http://www.dataists.com/2010/09/the-data-science-venn-diagram/
Diagram labels: Data Integration Mashups Applications Models Visualization Predictions Uncertainty Problems Data Sources Credibility Effective Data Applications
- ETL
- Marketing channel data
- Behavioral variables
- Promotional data
- Overlay data
- Exploratory data analysis
- Time-to-event models
- GAM survival models
- Scoring for inference
- Scoring for prediction
- 5 billion scores per day per retailer
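To put the scoring volume above in perspective, here is a rough back-of-the-envelope conversion to a per-second rate. It is only a sketch: it assumes the load is spread evenly over a 24-hour day, which the slide does not state, and the variable names are illustrative.

```python
# Back-of-the-envelope: convert the slide's daily scoring volume to a rate.
# Assumption (not from the slide): load is spread evenly across 24 hours.
SCORES_PER_DAY = 5_000_000_000      # "5 billion scores per day per retailer"
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds

scores_per_second = SCORES_PER_DAY / SECONDS_PER_DAY
print(f"about {scores_per_second:,.0f} scores per second per retailer")
```

Real traffic is rarely uniform, so peak rates would be higher than this average.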
CUSTOM DATA FORMAT CUSTOM VARIABLES (PMML)
Trends
http://commons.wikimedia.org/wiki/File:Google%E2%80%99s_First_Production_Server.jpg CC-BY-2.0
1996: 10 x 4 GB hard drives
2000: 5000 Linux PCs
Today: > 2 billion servers (estimated)
“I don't think the web would exist without open source and Linux. So there would have been no Google.”
— Chris DiBona, Google
- Cost Reduction (freedom to use / redistribute)
- Time-to-market (freedom to share)
- Innovation (freedom to tinker)
- Most widely used data analysis software
- Most powerful statistical programming language
- Create beautiful and unique data visualizations
- Thriving open-source community
- Fills the talent gap
www.revolutionanalytics.com/what-is-r
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
- TrueSkill Matchmaking System
- Player Churn
- Game design
- In-game purchase optimization
- Fraud detection
- Player communities
- R ≃ Stata
- Across all fields
- In economics, Stata dominates (not shown)
blog.revolutionanalytics.com/popularity
- Rexer Data Miner Survey
- IEEE Spectrum, July 2014
#9: R Language Popularity
IEEE Spectrum Top Programming Languages
THE FUTURE: CLOUD DATA ANALYTICS SERVICES
- Exposing the expertise of data scientists as APIs
- Bringing the utility of data science to applications
- Addressing the Data Science talent gap
Azure: Huge infrastructure scale
19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing
- 100+ datacenters
- One of the top 3 networks in the world (coverage, speed, connections)
- 2x AWS and 6x Google number of offered regions
- G Series – Largest VM available in the market – 32 cores, 448 GB RAM, SSD…
Map legend: Operational, Announced
- Central US: Iowa
- West US: California
- North Europe: Ireland
- East US: Virginia
- East US 2: Virginia
- US Gov: Virginia
- North Central US: Illinois
- US Gov: Iowa
- South Central US: Texas
- Brazil South: Sao Paulo
- West Europe: Netherlands
- China North *: Beijing
- China South *: Shanghai
- Japan East: Saitama
- Japan West: Osaka
- India West: TBD
- India East: TBD
- East Asia: Hong Kong
- SE Asia: Singapore
- Australia West: Melbourne
- Australia East: Sydney
* Operated by 21Vianet
Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and analytics together
SQL Server 2016
Built-in in-database analytics
Example Solutions
- Fraud detection
- Sales forecasting
- Warehouse efficiency
- Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility ? R
R Integration
Microsoft Azure Machine Learning Marketplace
New R scripts
rows minutes