www.bigbang-datascience.com
Agenda
BBDS Team Data Explosion Why Data Science? What is Data Science? Type of Analytics Data Science Portfolio Data Science Process Career in Data Science Machine Learning Data Types BBDS 12 Weeks Program & BBDS Programs
BBDS Team
- Dr. Ying Xie is a tenured full professor and PhD advisor with extensive research and industrial experience in the field of machine learning and deep learning
- Dr. Xie currently serves as the Director of the Equifax Data Science Research Lab @KSU. In the past, Dr. Xie was the chief scientist at Araicom Life Science. He has worked with numerous companies, such as LexisNexis and Emerson Climate Technology, on collaborative research
Dr. Ying Xie
14+ years of experience in IT. 7+ years of experience as an SAP consultant in Cloud Services. 1+ years of experience in Data Science and related technologies. Master's degree (MS-IT).
Shan Nabi
25 years of experience in IT. 18 years of experience in education: computer science, mathematics, engineering. 2 Master's degrees (MS-Electrical Engineering & Computer Science, MS-Education).
Edward Bujak
12+ years of experience in IT (Service Delivery Management). 7+ years of experience in Data Analytics. 3+ years of experience in Data Science and related technologies. 3 Master's degrees (MBA, MS-IT & MS-Data Science). Founder of Big Bang Data Science Solutions.
Mo Medwani
Data Explosion
Some interesting facts about Data
- Every day, we create 2.5 quintillion bytes of data. So much, in fact, that 90% of the data in the world today has been created in the last two years alone.
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 PB of data.
- Twitter generates 12 TB of data every day.
- Airbus A380 generates 10 TB every 30 minutes of flight.
- NYSE generates a TB of data every month.
What do we do with so much data? Ignore it, or use it.
How much data is getting generated?
How much data is getting generated?
The model has changed …
Old Model – only a few companies were generating data (like news outlets); everyone else was consuming it. New Model – all of us are generating data, and all of us are consuming data.
Opportunities for New Approach to Analytics
In 2020, the world will generate 50x more data than it generated in 2011. Over 2.5 exabytes (2.5 billion gigabytes) of data are generated every day.
“In its raw form, oil has little value. Once processed and refined, it helps power the world.” —Ann Winblad “Data is the new oil.” —Clive Humby, CNBC
Data -The Most Valuable Resource
What is Data Science?
Data Science uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products. “The ability to take data—to be able to understand it, process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.”
- Hal Varian, Google’s Chief Economist
Data Science – A Definition
A decade after the term data science was first used, there is continued debate among practitioners and academics about what data science means.
Data Science – A Definition
Source: Big Data University
Multidisciplinary
- Statistics quantifies numbers
- Data Mining explains patterns
- Machine Learning predicts with models
- Artificial Intelligence behaves and reasons
From my perspective, a data scientist has a blend of many skills.
Data Science – A Visual Definition
Data Science is a “concept of unifying statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. (IBM)
Data Science – A Definition
Data science overlaps with:
- Computer science: computational complexity, Internet topology and graph theory, distributed architectures such as Hadoop, data plumbing, data compression, computer programming (Python, Perl, R), and processing sensor and streaming data
- Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling…
- Machine learning and data mining
- Operations research: data science encompasses most of operations research, as well as any techniques aimed at optimizing decisions based on analyzing data
- Business intelligence: every BI aspect of designing/creating/identifying great metrics and KPIs, creating database schemas (be it NoSQL or not), dashboard design and visuals, and data-driven strategies to optimize decisions and ROI is data science
DS vs. Analytics Disciplines
Why Data Science?
Harvard Business Review: data scientist is the sexiest career of the 21st century.
- Salary trends have followed the impact of data science. With a national average salary of $118,000 (which rises to $126,000 in Silicon Valley), data science has become a lucrative career path where you can solve hard problems and drive social impact.
- LinkedIn: Statistical Analysis & Data Mining were the hottest skills that got recruiters’ attention in 2014.
- Glassdoor ranked data scientist as the #1 job to pursue in 2016.
- McKinsey: the US alone faces a shortage of 150,000+ data analysts and an additional 1.5 million data-savvy managers.
“Data Science” an Emerging Field
O’Reilly Radar report, 2011: the future belongs to the companies and people that turn data into products.
Turn data into data products.
Goal of Data Science
Types of Analytics
There are four distinct types of Analytics:
- Descriptive: explains what has happened
- Diagnostic: suggests why it happened
- Predictive: indicates what could happen
- Prescriptive: recommends what should happen
There are several areas of Analytics:
- Customer Analytics is a process that helps organizations make critical decisions and deliver options that are anticipated by customers. All the telecom companies these days use different marketing methods to retain their customers.
- Financial Analytics helps financial executives explore different ways to answer specific finance-related business questions and forecast future financial situations. Reading cash flow statements, balance sheets, and income statements comes under financial analytics.
- Performance Analytics is the practice of using data and technology to study how your business is performing in order to continuously make it better. In HR management, the performance of employees is monitored on a regular basis against the expected outcome. In the banking industry, credit scores are built to predict an individual’s delinquency behavior and represent the creditworthiness of each individual.
- Risk Analytics foresees the uncertainties of the predicted future and helps evaluate a project’s success or failure.
Data Science Portfolio
Data Scientist Profile (Competencies)
1. Quantitative skills, such as mathematics or statistics.
2. Technical aptitude, such as software engineering, machine learning, and programming skills.
3. Skeptical: this may be a counterintuitive trait, but it is important that data scientists can examine their work critically rather than in a one-sided way.
4. Curious & Creative: data scientists must be passionate about data and finding creative ways to solve problems and portray information.
5. Communicative & Collaborative: it is not enough to have strong quantitative or engineering skills. To make a project resonate, you must be able to articulate the business value in a clear way, and work collaboratively with project sponsors and key stakeholders.
Data Science Is a Team Sport
“Citizen Data Scientist”?
Market trends indicate the emergence of the “Citizen Data Scientist”.
Different Data Science Roles
- Citizen Data Scientist: no coding (R/Python) background, but some statistical & analytical experience
- Data Scientists: rely on their training in statistics and mathematical modeling
- Business Analysts: rely more heavily on their analytical skills and domain expertise
- Data Engineers: rely mostly on software engineering skills
Different Data Science Roles
Before we dive into what skills you need to become a data scientist, you should be aware that there are different roles in data science.

Data Scientists
One definition of a data scientist is someone who knows more about programming than a statistician, and more statistics than a software engineer. Data scientists store and clean large amounts of data, explore data sets to identify potential insights, build predictive models, and weave a story around the findings. They are the bridge between programming and algorithmic thinking. A data scientist might use historical data to build a model that predicts the number of credit card defaults in the following month, and use their data engineering skills to implement a simulation of that model on some sample data.
Different Data Science Roles
Data Analysts & Business Analysts
Data analysts sift through data and provide reports and visualizations to explain what the data can offer. In some ways, you can think of them as junior data scientists, or the first step on the way to a data science job. Business analysts are adjacent to data analysts, but are more concerned with the business implications of the data.
Different Data Science Roles
Data Engineers
Data engineers are software engineers who handle large amounts of data. They are responsible for managing database systems, scaling the data architecture to multiple servers, and writing complex queries to sift through the data. They might also clean up data sets and implement complex requests from data scientists (e.g. take the predictive model from the data scientist and implement it in production-ready code). Data engineers, in addition to knowing a breadth of programming languages (e.g. Ruby or Python), will usually know some Hadoop-based technologies (e.g. MapReduce, Hive, and Pig) and database technologies (e.g. MySQL, Cassandra, and MongoDB).
Data Science Process
“What does a data scientist do?” or “What does a day in the data science life look like?”
Data Analytics Lifecycle Dell
Data Analytics Lifecycle IBM
Data Analytics Process
- CRISP-DM: provides useful input on ways to frame analytics problems and is a popular approach for data mining.
- Scientific Method: in use for centuries, it still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test them.
- Tom Davenport’s DELTA framework: offers an approach for data analytics projects, including the context of the organization’s skills, datasets, and leadership engagement.
- Doug Hubbard’s Applied Information Economics (AIE) approach: provides a framework for measuring intangibles, plus guidance on developing decision models, calibrating expert estimates, and deriving the expected value of information.
- “MAD Skills” by Cohen et al.: offers input on several data mining techniques that focus on model planning, execution, and key findings.
Data Analytics Process (CRISP-DM)
Business Understanding
- Determine Business Objectives
- Assess Situation
- Determine Data Mining Goals
- Produce Project Plan

Data Understanding
- Collect Initial Data
- Describe Data
- Explore Data
- Verify Data Quality

Data Preparation
- Select Data
- Clean Data
- Construct Data
- Integrate Data
- Format Data

Modeling
- Select Modeling Technique
- Generate Test Design
- Build Model
- Assess Model

Evaluation
- Evaluate Results
- Review Process
- Determine Next Steps
Data Scientist Daily Activities

Step 1: Frame the problem
Before you can solve a problem, you have to define the problem. You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the intuition to translate scarce inputs into actionable outputs, and to ask the questions that nobody else is asking.

Step 2: Collect the raw data needed for your problem
Once you’ve defined the problem, you’ll need data to give you the insight needed to develop a solution. This part of the process involves thinking through what data you'll need and finding ways to get that data, whether it's querying internal databases or purchasing external datasets.

Step 3: Process the data for analysis
After you've collected all the raw data, you’ll need to process it before you can do any analysis. Oftentimes, data can be messy, especially if it hasn’t been well maintained. You'll see errors that will corrupt your analysis: values set to null though they are actually zero, duplicate values, and missing values. It's up to you to go through and check your data to make sure you'll get accurate insights. You’ll want to check for the following common errors:
- 1. Missing values
- 2. Corrupted values
- 3. Time zone differences
- 4. Date range errors, such as data registered from before sales started
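Those checks can be sketched in plain Python. The records and numbers below are invented for illustration (in practice you would likely reach for a library such as pandas), but the three checks mirror the list above: missing values, duplicates, and date range errors.

```python
from datetime import date

# Hypothetical raw sales records with deliberate problems baked in.
raw = [
    {"order_id": 1, "amount": 100.0, "date": date(2016, 1, 5)},
    {"order_id": 2, "amount": None,  "date": date(2016, 1, 6)},   # missing value
    {"order_id": 2, "amount": None,  "date": date(2016, 1, 6)},   # duplicate row
    {"order_id": 3, "amount": 250.0, "date": date(2015, 12, 1)},  # before sales started
    {"order_id": 4, "amount": 0.0,   "date": date(2016, 1, 7)},   # a real zero, not null
]
sales_start = date(2016, 1, 1)

# Check 1: missing values.
missing = sum(1 for r in raw if r["amount"] is None)

# Check 2: duplicate rows.
seen, duplicates = set(), 0
for r in raw:
    key = (r["order_id"], r["amount"], r["date"])
    duplicates += key in seen
    seen.add(key)

# Check 3: date range errors.
out_of_range = sum(1 for r in raw if r["date"] < sales_start)

# Keep only rows that pass all three checks.
clean, seen = [], set()
for r in raw:
    key = (r["order_id"], r["amount"], r["date"])
    ok = key not in seen and r["amount"] is not None and r["date"] >= sales_start
    seen.add(key)
    if ok:
        clean.append(r)
```

On this toy input the checks find two missing values, one duplicate, and one out-of-range date, leaving two clean rows.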
Data Analytics Process (CRISP-DM)

Step 4: Explore the data
When your data is clean, you should start playing with it. The difficulty here isn't coming up with ideas to test; it's coming up with ideas that are likely to turn into useful insight. You should look for interesting patterns that explain why sales are reduced for the segment of the population you've identified as the problem. You may notice they're not very active on social media, with few having Twitter or Facebook accounts. You may also notice that most people in this segment are older than your general audience. At this point, you can begin to trace these patterns to analyze the data more deeply.

Step 5: Perform in-depth analysis
This step of the process is where you will need to apply your statistical, mathematical, and technological knowledge, and leverage all the data science tools at your disposal to crunch the data and find every insight you can.
Data Analytics Process (CRISP-DM)

Step 6: Communicate results of the analysis
It’s important that your stakeholders understand why the insights you’ve uncovered are important. Ultimately, you’ve been called upon to create a solution throughout the data science process, and how properly you communicate your results will define action or inaction on your proposals.
Career in Data Science
What You Need to Learn to Become a Data Scientist
This next section covers all of the data science skills you’ll need to learn. You’ll also learn about the tools you need to do your job. Most data scientists use a combination of skills every day, some of which they have taught themselves on the job or otherwise. They also come from various backgrounds. There isn’t any one specific academic credential that is required to be an effective data scientist.
Is Data Science for me?
Source: Big Data University
Three-Legged Stool
One way to understand the collaborations that lead to Data Science success is to think of a three-legged stool: Skills (5), Domains (5), and Lifecycle (5). Each leg is critical to the stool remaining stable and fulfilling its intended purpose.
Three-Legged Stool (Skills)
Tableau
Tableau is not only an ultra-powerful tool for seasoned analysts, but is also so easy to learn that it is a great entry point into the world of data. Tableau is like a Data Science career hack.
R/Python
These two programming languages have become the two titans of Data Science. While very different in nature, they both facilitate the same thing: statistical analysis of unlimited complexity. Knowing at least one is a must. Knowing both puts you miles ahead.
SQL (PostgreSQL)
Knowing how to efficiently query a database is a crucial part of a Data Scientist’s job: to analyze the data, you first need to go get it. SQL programming also develops a certain way of thinking about data, which helps you see the big picture and workflow of your analysis.
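A minimal sketch of "going to get the data" with SQL, using Python's built-in sqlite3 module so it runs anywhere. The `transactions` table, its columns, and the values are invented for illustration.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 40.0), ("bob", 15.5), ("alice", 60.0)],
)

# Aggregate spend per customer: the kind of query that fetches and
# summarizes the data before any analysis starts.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM transactions GROUP BY customer ORDER BY total DESC"
).fetchall()
```

The query returns one row per customer with their total spend, highest first.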
Statistics
Needless to say, if you want to be successful as a Data Scientist you will need to develop a certain level of statistical acumen. Start with logistic regression, A/B testing, and the Law of Large Numbers.
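The Law of Large Numbers mentioned above can be seen in a few lines: the sample mean of fair coin flips drifts toward the true probability 0.5 as the sample grows. The sample sizes are arbitrary choices for the demonstration.

```python
import random

random.seed(42)  # fixed seed so the demonstration is reproducible

def sample_mean(n):
    # Fraction of "heads" in n simulated fair coin flips.
    return sum(random.random() < 0.5 for _ in range(n)) / n

small = sample_mean(100)       # noisy estimate from few flips
large = sample_mean(100_000)   # much closer to the true value 0.5
```

With 100,000 flips the estimate lands well within a percentage point of 0.5, while the 100-flip estimate can wander noticeably further.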
Presentation
Preparing the data, building models, creating visualizations, and deriving insights are only half of the job. To be a successful Data Scientist you need to be able to communicate your insights to your audience.
Three-Legged Stool (Domains)
Data Mining /BI Tools
Also known as ad-hoc analytics, data mining is the process of deriving new insights from data. Though different in essence, creating business intelligence (BI) tools is closely related, because these insights often need to be streamlined and integrated into the business.
Machine Learning/Modeling
Machine Learning is popping up everywhere: recommender systems on Amazon & Netflix, speech-to-text, face recognition on your phone – the list goes on.
Advanced Analytics
With Advanced Analytics you create simulations to help real-world businesses identify opportunities for improvements
Computer Forensics
Computer Forensics, Fraud Analytics, and Cyber Security all deal with slightly different things; however, the overall objectives are extraction, analysis, protection, and even ethical hacking of information for legal purposes.
Big Data
Big data refers to dealing with data sets so large and complex that traditional applications simply cannot cope with them. Remember the rule of the “3 Vs”: Volume, Variety, Velocity.
Three-Legged Stool (Lifecycle)
Phase 1 : Identify The Problem
Ever heard the phrase “Here’s some data, can you find some insights?” Too often stakeholders approach Data Scientists with vague or even undefined goals. Understanding the end goal is very important and sets up the rest of the project for success. (Time consumption: 10%)
Phase 2 :Prepare the Data
Data can come from many sources, be in the wrong format, have anomalies, and a myriad of other problems. A single mistake in this stage can render the rest of the analysis useless. (Time consumption: 70%)
Phase 3: Analyze the Data
Creating models, performing data mining, running text analytics, setting up simulations: this is the most fun and exciting part. If the previous stages have been done correctly, analyzing the data and deriving insights will feel like a breeze. (Time consumption: 10%)
Phase 4 : Visualize Insights
Visualizing goes hand-in-hand with analyzing. This is a very powerful technique, as seeing the data in various forms and shapes can help uncover insights that are otherwise not evident. (Time consumption: 10%)
Phase 5: Present Findings
Presenting findings is a whole separate “Bonus” stage. You need to not only convey the insights in your audience’s language but also get buy-in from them to take action based on those insights
Machine Learning
Machine Learning vs. Programming
A young child is playing at home... And he sees a candle! He cautiously waddles over. Out of curiosity, he sticks his hand over the candle flame. "Ouch!," he yells, as he yanks his hand back. "Hmm... that red and bright thing really hurts!" Two days later, he's playing in the kitchen... And he sees a stove-top! Again, he cautiously waddles over. He's curious again, and he's thinking about sticking his hand over it. Suddenly, he notices that it's red and bright! "Ahh..." he thinks to himself, "not today!" He remembers that red and bright means pain, and he ignores the stove top.
He learned the pattern “red and bright means pain.” On the other hand, if he had ignored the stove-top simply because his parents warned him, that would be “explicit programming” instead of machine learning.
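The child's lesson can be mimicked by a toy learner. This is a deliberately tiny sketch (a 1-nearest-neighbour rule over invented examples), not any particular library's API: instead of hard-coding "red and bright means pain", the rule is recovered from labelled experiences.

```python
# Each experience is ((is_red, is_bright), label) with 1 = pain, 0 = no pain.
experiences = [
    ((1, 1), 1),  # candle flame: red and bright -> ouch
    ((0, 0), 0),  # teddy bear: neither -> fine
    ((0, 1), 0),  # lamp: bright but not red -> fine
]

def predict(features):
    # 1-nearest-neighbour: act like the most similar past experience.
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    nearest = min(experiences, key=lambda e: distance(e[0], features))
    return nearest[1]

stove_top = (1, 1)          # red and bright, never touched before
avoid = predict(stove_top)  # the learned warning, not a programmed one
```

The stove-top was never in the training data, yet the learner generalizes from the candle: that generalization from examples is the essence of machine learning.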
Machine Learning
Supervised Learning (Regression)
Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise, your model is predicting present but unknown values. Regression techniques vary from MLR to SVR and Boosted Trees.
Supervised Learning (Classification)
Unlike regression, where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications, from medicine to marketing. Classification models include linear models like Logistic Regression and SVM, and nonlinear ones like K-NN, Kernel SVM, and Random Forests.
Unsupervised Learning (Clustering)
Clustering is similar to classification, but the basis is different: in clustering you don’t know what you are looking for. When you use clustering algorithms on your dataset, unexpected things can suddenly pop up, like structures, clusters, and groupings you would never have thought of otherwise.
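A toy k-means sketch (plain Python, 1-D data invented for illustration) shows the "no labels given" idea: the algorithm is never told what the groups are, yet two clusters emerge on their own.

```python
data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [data[0], data[-1]]  # naive initialisation: first and last point

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    groups = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        groups[nearest].append(x)
    # Update step: each centroid moves to the mean of its group.
    centroids = [sum(g) / len(g) for g in groups]
```

The loop converges with the low values around one centroid (1.5) and the high values around the other (10.5), even though no labels were ever provided.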
Reinforcement Learning
Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to solve interacting problems, where the data observed up to time t is considered to decide which action to take at time t + 1. It is also used in Artificial Intelligence when training machines to perform tasks such as walking. Desired outcomes provide the AI with a reward, undesired ones with a punishment; machines learn through trial and error. Techniques include Thompson Sampling, Upper Confidence Bound, and Q-Learning.
Natural Language Processing
Teaching machines to understand what is said in spoken and written word is the focus of Natural Language Processing. Whenever you dictate something into your iPhone/Android device and it is converted to text, that’s an NLP algorithm in action. Methods include decision trees, Markov processes, and more.
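As a flavour of the simplest possible NLP pipeline: tokenise text, then score it against tiny hand-made sentiment word lists. The word lists are invented for illustration, and real NLP methods are far richer than this.

```python
# Minimal hand-made sentiment lexicons (assumptions, not a real resource).
positive = {"great", "love", "excellent"}
negative = {"bad", "hate", "terrible"}

def sentiment(text):
    # Tokenise: lowercase, strip periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    # Score: +1 per positive word, -1 per negative word.
    score = sum((t in positive) - (t in negative) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

label = sentiment("I love this excellent phone.")
```

Even this crude bag-of-words scorer captures the core loop of NLP: turn raw text into tokens, then into a decision.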
Data Structure Types
Data Structures
Four Main Types of Data Structures
Structured data: data with a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple spreadsheets).

Semi-structured data: textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
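The "self-describing" property of semi-structured data is what makes parsing possible without a fixed table layout. A small sketch with Python's standard XML parser, on a made-up orders fragment:

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured XML fragment: the tags describe the data.
doc = """
<orders>
  <order id="1"><amount>100.0</amount></order>
  <order id="2"><amount>250.0</amount></order>
</orders>
"""

root = ET.fromstring(doc)
# The discernible pattern (order/amount) lets us recover the records.
amounts = [float(o.find("amount").text) for o in root.findall("order")]
```

No schema or column positions were needed; the element names alone were enough to extract the values.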
Four Main Types of Data Structures
Quasi-structured data: textual data with erratic formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).

Unstructured data: data that has no inherent structure, which may include text documents, PDFs, images, and videos.
Four Main Types of Data Structures
For each data type shown, answer these questions:
- 1. What types of analytics are performed on these data?
- 2. Who analyzes this kind of data?
- 3. What types of data repositories are suited for each, or what requirements may you have for storing and cataloguing this kind of data?
- 4. Who consumes the data?
- 5. Who manages and owns the data?
Bringing Tools into the Data Science Process
Each of these most common data analytics tools has its strengths and weaknesses, and each one can be applied at different stages of the data science process: Excel, SQL, Python, R, Hadoop, NoSQL.
Collect Data: X X X
Process Data: X X X X X
Explore Data: X X X X X
Analyze Data: X X X X X
Communicate Data: X X X
Python vs R
The data science community tends to use either Python or R. Here are some of the differences.

USAGE
Python is often used by computer programmers, since it is the Swiss Army knife of programming languages: versatile enough that you can build websites and do data analysis at the same time. R is primarily used by researchers.
SYNTAX
Python has a nice, clear, “English-like” syntax that makes debugging and understanding code easier, while R has unconventional syntax that can be tricky to understand, especially if you have learned another programming language first.
LEARNING CURVE
R is slightly harder to pick up, especially since it doesn’t follow the normal conventions other common programming languages have. Python is simple enough that it makes for a really good first programming language to learn.
POPULARITY
Python has always been among the top 5 most popular programming languages on GitHub, a common repository of code that often tracks usage habits across all programmers quite accurately, while R typically hovers below the top 10.
FOCUS ON DATA SCIENCE
Python is a general-purpose language, and there is less focus on data analysis packages than in R. Nevertheless, there are very good options for Python, such as Pandas, a data analysis library built just for it.
Recap : Five Things you Need to Remember
- First things first: the Harvard Business Review calls data science the hottest job of the 21st century (and so does the job site Glassdoor). Data Science is a rewarding career.
- Data Science is one field that’s continually welcoming talent and paying well: the average salary for a Data Scientist is $105,000 in the U.S., and the demand for jobs is only increasing.
- Every company has a distinct approach to data science, and because the field is rapidly changing, it is important to stay up-to-date with the latest technologies.
- The demand is high, but there is a major shortage of qualified Data Scientists. To grow your career, you should seek knowledge in universally recognized and adopted technologies like SAS/R, Python coding, SQL databases, and Hadoop to help you move into data science.
- You don’t necessarily have to possess a degree or a Ph.D. to be a data scientist. But this career does require proficiency in statistics, analytics tools, communication skills, commendable knowledge of quants, and business acumen. A successful data scientist puts all these skills to use, which is no small feat.
The good news is…..
BBDS Program
Data Science from ZERO to HERO
The Course Assumptions

The course assumes that you know close to nothing about Data Science and ML. Its goal is to give you the concepts and intuitions you need to actually implement programs capable of learning from data. We will cover a large number of techniques, from the simplest and most commonly used (such as Linear Regression) to some Deep Learning techniques that regularly win competitions.
The course Key Topics
Skills, Domains, Lifecycle, CRISP-DM
Week 1 - Fundamentals
Learn Data Science basics: a good understanding of Data Science and Data Analytics, an overview of the EMC and CAP certificates, and an introduction to concepts, methodologies, and best practices.
Course 1: Introduction to Data Science
Course 2: Learning Path, CAP Certificate
Course 3: Crash Course in R
Course 4: Tableau: Introduction & Basics: Bar Charts; RapidMiner: Introduction to RapidMiner
Course 5: Capstone Project – Project Selection (Open data sets and problems)
Week 2 - Business & Problem Understanding
Determine Business Objectives, Assess Situation, Determine Data Mining Goals, and Produce Project Plan. Gain a good understanding of problem framing, decision framing, decision analysis, and decision implementation using DecisionsFirst Modeler.
Course 1: Business Understanding & Problem Framing
Course 2: Learning Path, EMC Certificate
Course 3: Crash Course in Python (Scikit-Learn, NumPy, Matplotlib, Pandas)
Course 4: Introduction to BigML, SAS Enterprise Miner, IBM Watson
Course 5: Capstone Project – Project Scope (Data preparation & analytic approaches)
Week 3 - Data Understanding & Data Preparation
Exploratory Data Analysis using R & Python: descriptive statistics, hypothesis testing, data preprocessing, missing values imputation, and data transformation. Dive deep into the R programming language, from basic syntax to advanced packages and data visualization (e.g. reshape2, dplyr, string manipulation, ggplot2, R Shiny).
Course 1: Data Understanding & Data Preparation
Course 2: Exploratory Data Analysis in R
Course 3: Exploratory Data Analysis in Python
Course 4: Tableau: Time Series, Aggregation, and Filters; RapidMiner: Data Preparation & Correlation Analysis
Course 5: Capstone Project – Analytics Approach I (Data preparation & classification)
Week 4 - Supervised Learning – Classification (Part 1)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Data Mining & Machine Learning (Classification Analysis)
Course 2: Decision Tree & Random Forest in R
Course 3: Decision Tree & Random Forest in Python
Course 4: Tableau Basics: Maps, Scatterplots & Dashboards; RapidMiner: Decision Tree
Course 5: Capstone Project – Analytic Approach II (Machine learning techniques)
Week 5 - Supervised Learning – Classification (Part 2)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Logistic Regression, KNN, Naïve Bayes in R
Course 2: Logistic Regression, KNN, Naïve Bayes in Python
Course 3: SVM, Kernel SVM in R & Python
Course 4: Tableau: Joining and Blending Data, plus Dual Axis Charts; RapidMiner: Logistic Regression, KNN, Naïve Bayes
Course 5: Capstone Project – Project Analysis Techniques (Presentation techniques)
Week 6 - Supervised Learning – Regression (Part 1)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Data Mining & Machine Learning (Regression Analysis)
Course 2: Decision Tree & Random Forest in R
Course 3: Decision Tree & Random Forest in Python
Course 4: Tableau: Table Calculations, Advanced Dashboards, Storytelling; RapidMiner: Linear Regression
Course 5: Capstone Project – Project Analysis Techniques (Data visualization techniques)
Week 7 - Supervised Learning – Regression (Part 2)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Simple Linear, Multiple Linear, Polynomial Linear in R
Course 2: Simple Linear, Multiple Linear, Polynomial Linear in Python
Course 3: Support Vector Machine in R & Python
Course 4: Tableau: Table Calculations, Advanced Dashboards, Storytelling
Course 5: Capstone Project – Data Analysis Execution Plan (Data visualization tools)
Week 8 - Unsupervised Learning – Clustering & Association Rules
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Unsupervised Learning.
Course 1: Clustering Analysis & Association Rules
Course 2: K-Means, Hierarchical Clustering, Apriori, Eclat in R
Course 3: K-Means, Hierarchical Clustering, Apriori, Eclat in Python
Course 4: Tableau: Advanced Data Preparation; RapidMiner: K-Means Clustering and Association Rules
Course 5: Capstone Project – Data Analysis Review (Interpretation)
Week 10 - Reinforcement Learning & Dimensionality Reduction
Deepen machine learning skills with R and scikit learn. Focus on data cleaning, feature extraction, modeling, and model selection using Reinforcement Learning Course 1 : RL- Random Selection, UCB, Thompson Sampling in R Course 2 : RL- Random Selection, UCB, Thompson Sampling in Python Course 3 : DR -PCA, LDA & Kernel PCA in R Course 4 : DR- PCA, LDA & Kernel PCA in Python Course 5 : Capstone Project – Data Analysis Execution Plan (Data visualization tools)
Week 9 - Deep Learning, NLP, Text Mining
Deepen machine learning skills with R and scikit learn. Focus on data cleaning, feature extraction, modeling, and model selection using NLP, Text Mining, Sentiment Analysis, Deep learning with Theano, TenserFlow & Keras, Neural Networks learn, Convolutional Neural Networks Course 1 : Text Analysis – Neural Networks Course 2 : Natural Language Processing in R & Python Course 3 : ANN in R & Python Course 4 : CNN in Python RapidMiner : Text Mining and Neural Networks Course 5 : Capstone Project – Data Analysis Review (Interpretation)
Week 11 - Model Assessment, Validation, Optimization & Tuning
Introduction to Cost Functions, Objective Functions, Model Optimization, Model Tuning, Gradient Boosting, and Grid and Random Search. Analyze the performance of each algorithm and discuss the results.
- Course 1: Model Assessment – CM, ROC, Rank-Ordered Approach, R2, MSE, MAE…
- Course 2: k-Fold Cross Validation & Grid Search in R
- Course 3: k-Fold Cross Validation & Grid Search in Python
- Course 4: XGBoost, AdaBoost in R & Python
- RapidMiner: Cross-Validation
- Course 5: Capstone Project – Presentation (Research and trends in data analytics)
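The k-fold cross-validation idea from this week can be sketched from scratch. The toy "model" below (predict the training mean, score with mean absolute error) is invented for the demo; in the course the same loop is handled by caret in R or scikit-learn's `GridSearchCV` in Python.

```python
# From-scratch sketch of k-fold cross-validation: rotate which fold is
# held out for testing, train on the rest, and average the scores.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_size = n // k
    folds = []
    for i in range(k):
        start = i * fold_size
        stop = n if i == k - 1 else start + fold_size
        folds.append(list(range(start, stop)))
    return folds

def cross_validate(data, k, score_fn):
    """Average score_fn(train, test) over the k train/test splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(score_fn(train, test))
    return sum(scores) / k

# Toy "model": predict the training mean; score = mean absolute error.
def mae_of_mean_model(train, test):
    pred = sum(train) / len(train)
    return sum(abs(y - pred) for y in test) / len(test)

cv_error = cross_validate(list(range(10)), k=5, score_fn=mae_of_mean_model)
```

Grid search simply wraps this loop around every candidate hyperparameter setting and keeps the one with the best averaged score.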
Week 12 – Predictive Analytics, Cognitive Computing & Big Data
Learn the concepts of high-performance computing with parallel computing and cognitive computing, and build skills in IBM Bluemix and Watson Analytics. Introduction to MapReduce, Hadoop, Hive, Spark, and Spark MLlib.
- Course 1: Classification & Regression Analysis in SAS & BigML
- Course 2: Clustering, Text Analysis & Neural Nets in SAS & BigML
- Course 3: Cognitive Computing in IBM Watson Analytics
- Course 4: Introduction to Big Data and Apache Spark
- Course 5: Capstone Project – Final Report
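The MapReduce pattern mentioned above can be illustrated in plain Python as the classic word count. This is a conceptual, single-machine sketch; real Hadoop or Spark jobs distribute the map, shuffle, and reduce phases across a cluster.

```python
# Conceptual sketch of MapReduce as a word count, written in plain Python.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts grouped by word (the shuffle is the grouping)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["big data big compute", "big deal"]))
```

Because each map call touches only its own line and each reduce call only its own key, both phases parallelize naturally, which is the whole point of the pattern.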
Week 1-12: Capstone Analytical Project
Complete a capstone project covering: Project Selection, Project Scope, Analytics Approach, Project Analysis, Data Analysis Techniques, Data Analysis Execution Plan, Data Analysis Review, Analytical Technique, Data Model Analysis, Data Analysis Presentation, and the Final Project Report.
Predicting the Future: Machine Learning Algorithm Map

Supervised Learning
- Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machine (SVM), Kernel SVM, Naïve Bayes, Decision Tree, Random Forest
- Regression: Simple Linear, Multiple Linear, Polynomial, Support Vector Regression (SVR), Decision Tree, Random Forest, Neural Nets

Unsupervised Learning
- Clustering: K-Means, Hierarchical Clustering (HCA), Expectation Maximization
- Dimensionality Reduction: PCA, Kernel PCA, Locally Linear Embedding, Linear Discriminant Analysis (LDA)
- Association Rules: Apriori, Eclat

Reinforcement Learning
- Upper Confidence Bound (UCB), Thompson Sampling

Also covered: XGBoost, T-SNE, Text Mining, Deep Learning, Big Data

Process phases: Business Understanding, Data Understanding, Data Preparation
12 Weeks Program (3rd batch starts on Sept 13th through April 7th)
- Frequency: 3 times a week (10 to 12 hours a week) – Mandatory
- 2 to 3 additional times a week if needed – Optional
- Extensive Live Online Training
- Instructor-Led Course
- Training Video Recordings
- Quality Training Materials
- Two-Way Interactive Sessions
- Flexible Online Schedules
- Job Oriented Training
- Mock Exam/Assessment
- Graded Assignments & Professional Certificate
- Interview Prep
- Job Placement and Placement Guidance
What you get when you're done:
- A working model to showcase your skills
- Data Science Methodology
- A skills matrix to identify your best-fit roles
- A career path to learn, grow, and prosper
- Repeat anytime at no cost
- Possible apprenticeship (2-3 projects)
- Post-training support in your career search
- 36+ Labs (12 Tableau Labs, 12 Python Labs, 12 R Labs, and more)
- 3 + Cohort Projects
- 1 Group Project
- 12 + Mini projects (R & Python)
- Algorithms/Models:
- Time Series Analysis & Anomaly Detection
- Least Absolute Shrinkage and Selection Operator (LASSO) Regression, Elastic Net Regression (Regularized) & Ridge Regression (for data suffering from multicollinearity)
- Supervised Neural Net (RCC)
- 100 + Interview Q&A
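The regularized regression item above (LASSO, Elastic Net, Ridge) can be illustrated with a tiny one-dimensional ridge example. This is a hedged sketch using the closed-form solution with no intercept; the course uses library implementations such as glmnet and scikit-learn, and the data below is made up.

```python
# Hedged sketch of 1-D ridge regression: the L2 penalty alpha shrinks
# the fitted slope toward zero, which stabilizes estimates when
# predictors are correlated (multicollinearity).

def ridge_slope(xs, ys, alpha):
    """Closed-form 1-D ridge fit (no intercept): slope = x.y / (x.x + alpha)."""
    xy = sum(x * y for x, y in zip(xs, ys))
    xx = sum(x * x for x in xs)
    return xy / (xx + alpha)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # underlying slope is 2
ols = ridge_slope(xs, ys, alpha=0.0)        # alpha=0 is plain least squares
shrunk = ridge_slope(xs, ys, alpha=7.0)     # penalized slope is smaller
```

LASSO swaps the L2 penalty for L1 (which can zero out coefficients entirely, selecting features), and Elastic Net blends the two.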
#1: A skilled chef (human guidance)
First, even though we are "teaching computers to learn on their own," human guidance plays a huge role. As you'll see, you'll need to make dozens of decisions along the way. In fact, the very first major decision is how to road-map your project for guaranteed success.
#2: Fresh ingredients (clean, relevant data)
The second essential element is the quality of your data. Garbage in = garbage out, no matter which algorithms you use. Professional data scientists spend most of their time understanding the data, cleaning it, and engineering new features.
#3: Don't overcook it (avoid overfitting)
One of the most dangerous pitfalls in machine learning is overfitting. An overfit model has "memorized" the noise in the training set instead of learning the true underlying patterns. An overfit model within a hedge fund can cost millions of dollars in losses. An overfit model within a hospital can cost thousands of lives. For most applications, the stakes won't be quite that high, but overfitting is still the single largest mistake you must avoid.
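The overfitting warning can be made concrete with a toy experiment: a model that memorizes its training points scores perfectly on them yet fails on unseen data, while a trivially simple model generalizes. The noisy data below is invented purely for the demo.

```python
# Toy overfitting demo: a "memorizer" vs. a simple mean model on noisy
# data whose true signal is just y = 5 plus noise.
import random

random.seed(1)
train = [(x, 5 + random.uniform(-1, 1)) for x in range(10)]
test = [(x, 5 + random.uniform(-1, 1)) for x in range(10, 20)]

lookup = dict(train)                              # "memorizer" model
mean_y = sum(y for _, y in train) / len(train)    # simple mean model

def mse(points, predict):
    """Mean squared error of a prediction function over (x, y) points."""
    return sum((y - predict(x)) ** 2 for x, y in points) / len(points)

memorizer_train = mse(train, lambda x: lookup[x])         # perfect: 0.0
memorizer_test = mse(test, lambda x: lookup.get(x, 0.0))  # falls apart
simple_test = mse(test, lambda x: mean_y)                 # generalizes
```

The memorizer's zero training error is exactly the trap: training performance alone says nothing about performance on data the model has never seen, which is why the Week 11 validation techniques (held-out test sets, k-fold cross-validation) exist.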
- Become an immediate contributor on a data science team
- Assist in reframing a business challenge as an analytics challenge
- Deploy a structured lifecycle approach to data analytics problems
- Apply appropriate analytic techniques and tools to analyze big data
- Tell a compelling story with the data to drive business action
- Use open source tools such as R and Python
- Use Tableau, RapidMiner, BigML, SAS Enterprise Miner, IBM Watson Analytics, IBM Bluemix
- EMC Data Scientist Associate (EMCDSA) Certification
- CAP: Certified Analytics Professional from INFORMS
- Cross-platform learning
- Unlimited users
- Share files
- Pin down posts as per your choice
- Only notified when you are tagged
- Private chats
Other BBDS training programs:
- Advanced R-Bootcamp
- Advanced Python-Bootcamp
- Deep Learning
- Big Data
- Data Visualization
Full list of Big Bang Data Science Institute training services: http://www.bigbang-datascience.com/training-programs
"Because of the media coverage around data science and the characterization of data scientists as 'rock stars,' you may feel like it's impossible for you to enter into this realm. If you're the type of person who loves to solve puzzles and find patterns, whether or not you consider yourself a quant, then data science is for you." – Cathy O'Neil & Rachel Schutt, Doing Data Science
Q & A
BIG BANG DATA SCIENCE INSTITUTE
LEARN . ACHIEVE. STANDOUT