www.bigbang-datascience.com
Agenda
BBDS Team Data Explosion Why Data Science? What is Data Science? Type of Analytics Data Science Portfolio Data Science Process Career in Data Science Machine Learning Data Types BBDS 12 Weeks Program & BBDS Programs
BBDS Team
- Dr. Ying Xie is a tenured full professor and PhD advisor with extensive research and industrial experience in the field of machine learning and deep learning
- Dr. Xie currently serves as the Director of the Equifax Data Science Research Lab @KSU. In the past, Dr. Xie was the chief scientist at Araicom Life Science. He has worked with numerous companies, such as LexisNexis and Emerson Climate Technology, on collaborative research
Dr. Ying Xie
14+ years of experience in IT. 7+ years of experience as an SAP consultant in Cloud Services. 1+ years of experience in Data Science and related technologies. Master's degree (MS-IT).
Shan Nabi
25 years of experience in IT. 18 years of experience in education: computer science, mathematics, engineering. 2 Master's degrees (MS-Electrical Engineering & Computer Science, MS-Education).
Edward Bujak
12+ years of experience in IT (Service Delivery Management). 7+ years of experience in Data Analytics. 3+ years of experience in Data Science and related technologies. 3 Master's degrees (MBA, MS-IT & MS-Data Science). Founder of Big Bang Data Science Solutions.
Mo Medwani
Data Explosion
Some interesting facts about Data
- Every day, we create 2.5 quintillion bytes of data. So much, in fact, that 90% of the data in the world today has been created in the last two years alone.
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 PB of data.
- Twitter generates 12 TB of data every day.
- Airbus A380 generates 10 TB every 30 minutes of flight.
- NYSE generates a TB of data every month.
What do we do with so much data? Ignore it, or use it.
How much data is getting generated?
How much data is getting generated?
The model has changed …
Old Model – only a few companies were generating data (like news outlets); everyone else was consuming it. New Model – all of us are generating data, and all of us are consuming data.
Opportunities for New Approach to Analytics
In 2020, the world will generate 50x more data than it generated in 2011. Over 2.5 exabytes (2.5 billion gigabytes) of data are generated every day.
“In its raw form, oil has little value. Once processed and refined, it helps power the world.” —Ann Winblad “Data is the new oil.” —Clive Humby, CNBC
Data -The Most Valuable Resource
What is Data Science?
Data Science uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products. “The ability to take data—to be able to understand it, process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.”
- Hal Varian, Google’s Chief Economist
Data Science – A Definition
A decade after the term data science was first used, there is continued debate among practitioners and academics about what data science means.
Data Science – A Definition
Source: Big Data University
Multidisciplinary
- Statistics quantifies numbers
- Data Mining explains patterns
- Machine Learning predicts with models
- Artificial Intelligence behaves and reasons
From my perspective, a data scientist has a blend of many skills.
Data Science – A Visual Definition
Data Science is a “concept of unifying statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. (IBM)
Data Science – A Definition
Data science overlaps with:
- Computer science: computational complexity, Internet topology and graph theory, distributed architectures such as Hadoop, data plumbing, data compression, computer programming (Python, Perl, R), and processing sensor and streaming data
- Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling…
- Machine learning and data mining
- Operations research: data science encompasses most of operations research, as well as any techniques aimed at optimizing decisions based on analyzing data
- Business intelligence: every BI aspect of designing/creating/identifying great metrics and KPIs, creating database schemas (be it NoSQL or not), dashboard design and visuals, and data-driven strategies to optimize decisions and ROI is data science
DS vs. Analytics Disciplines
Why Data Science?
Harvard Business Review: data scientist is the sexiest career of the 21st century.
- Salary trends have followed the impact of data science. With a national average salary of $118,000 (which rises to $126,000 in Silicon Valley), data science has become a lucrative career path where you can solve hard problems and drive social impact.
- LinkedIn: Statistical Analysis & Data Mining were the hottest skills that got recruiters’ attention in 2014.
- Glassdoor ranked data scientist as the #1 job to pursue in 2016.
- McKinsey: the US alone faces a shortage of 150,000+ data analysts and an additional 1.5 million data-savvy managers.
“Data Science” an Emerging Field
O’Reilly Radar report, 2011: the future belongs to the companies and people that turn data into products.
Turn data into data products.
Goal of Data Science
Types of Analytics
There are four distinct types of Analytics:
- Descriptive: explains what has happened
- Diagnostic: suggests why it happened
- Predictive: indicates what could happen
- Prescriptive: recommends what should happen
There are several areas of Analytics:
- Customer Analytics is a process that helps organizations make critical decisions and deliver options that are anticipated by customers. All the telecom companies these days use different marketing methods to retain their customers.
- Financial Analytics helps financial executives explore different ways to answer specific finance-related business questions and forecast future financial situations. Reading cash flow statements, balance sheets, and income statements comes under financial analytics.
- Performance Analytics is the practice of using data and technology to study how your business is performing in order to continuously make it better. In HR management, the performance of employees is monitored on a regular basis against the expected outcome. In the banking industry, credit scores are built to predict an individual’s delinquency behavior and represent the creditworthiness of each individual.
- Risk Analytics foresees the uncertainties of the predicted future and helps evaluate a project’s success or failure.
Data Science Portfolio
Data Scientist Profile (Competencies)
1. Quantitative skills, such as mathematics or statistics.
2. Technical aptitude, such as software engineering, machine learning, and programming skills.
3. Skeptical: this may be a counterintuitive trait, but it is important that data scientists can examine their work critically rather than in a one-sided way.
4. Curious & Creative: data scientists must be passionate about data and finding creative ways to solve problems and portray information.
5. Communicative & Collaborative: it is not enough to have strong quantitative or engineering skills. To make a project resonate, you must be able to articulate the business value in a clear way, and work collaboratively with project sponsors and key stakeholders.
Data Science Is a Team Sport
“Citizen Data Scientist”?
Market trends indicate the emergence of the “Citizen Data Scientist”.
Different Data Science Roles
- Citizen Data Scientist: no coding (R/Python) background, but some statistical & analytical experience
- Data Scientists: rely on their training in statistics and mathematical modeling
- Business Analysts: rely more heavily on their analytical skills and domain expertise
- Data Engineers: rely mostly on software engineering skills
Different Data Science Roles
Before we dive into what skills you need to become a data scientist, you should be aware that there are different roles in data science.

Data Scientists
One definition of a data scientist is someone who knows more about programming than a statistician, and more statistics than a software engineer. Data scientists store and clean large amounts of data, explore data sets to identify potential insights, build predictive models, and weave a story around the findings. They are the bridge between programming and algorithmic thinking. A data scientist might use historical data to build a model that predicts the number of credit card defaults in the following month, and use their data engineering skills to implement a simulation of that model on some sample data.
Different Data Science Roles
Data Analysts & Business Analysts
Data analysts sift through data and provide reports and visualizations to explain what the data can offer. In some ways, you can think of them as junior data scientists, or the first step on the way to a data science job. Business analysts are adjacent to data analysts, but are more concerned with the business implications of the data.
Different Data Science Roles
Data Engineers
Data engineers are software engineers who handle large amounts of data. They are responsible for managing database systems, scaling the data architecture to multiple servers, and writing complex queries to sift through the data. They might also clean up data sets and implement complex requests from data scientists (e.g. take the predictive model from the data scientist and implement it in production-ready code). Data engineers, in addition to knowing a breadth of programming languages (e.g. Ruby or Python), will usually know some Hadoop-based technologies (e.g. MapReduce, Hive, and Pig) and database technologies (e.g. MySQL, Cassandra, and MongoDB).
Data Science Process
“What does a data scientist do?” or “What does a day in the data science life look like?”
Data Analytics Lifecycle Dell
Data Analytics Lifecycle IBM
Data Analytics Process
- CRISP-DM: provides useful input on ways to frame analytics problems and is a popular approach for data mining.
- Scientific Method: in use for centuries, it still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test them.
- Tom Davenport’s DELTA framework: offers an approach for data analytics projects, including the context of the organization’s skills, datasets, and leadership engagement.
- Doug Hubbard’s Applied Information Economics (AIE) approach: provides a framework for measuring intangibles, plus guidance on developing decision models, calibrating expert estimates, and deriving the expected value of information.
- “MAD Skills” by Cohen et al.: offers input on several data mining techniques that focus on model planning, execution, and key findings.
Data Analytics Process (CRISP-DM)
Business Understanding
- Determine Business Objectives
- Assess Situation
- Determine Data Mining Goals
- Produce Project Plan

Data Understanding
- Collect Initial Data
- Describe Data
- Explore Data
- Verify Data Quality

Data Preparation
- Select Data
- Clean Data
- Construct Data
- Integrate Data
- Format Data

Modeling
- Select Modeling Technique
- Generate Test Design
- Build Model
- Assess Model

Evaluation
- Evaluate Results
- Review Process
- Determine Next Steps
Data Scientist Daily Activities

Step 1: Frame the problem
Before you can solve a problem, you have to define the problem. You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the intuition to translate scarce inputs into actionable outputs, and to ask the questions that nobody else is asking.

Step 2: Collect the raw data needed for your problem
Once you’ve defined the problem, you’ll need data to give you the insight needed to develop a solution. This part of the process involves thinking through what data you'll need and finding ways to get that data, whether it's querying internal databases or purchasing external datasets.

Step 3: Process the data for analysis
After you've collected all the raw data, you’ll need to process it before you can do any analysis. Oftentimes, data can be messy, especially if it hasn’t been well maintained. You'll see errors that will corrupt your analysis: values set to null though they are actually zero, duplicate values, and missing values. It's up to you to go through and check your data to make sure you'll get accurate insights. You’ll want to check for the following common errors:
- 1. Missing values
- 2. Corrupted values
- 3. Time zone differences
- 4. Date range errors, such as data registered from before sales started
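Those checks can be sketched in plain Python. The records and numbers below are invented for illustration (in practice you would likely reach for a library such as pandas), but the three checks mirror the list above: missing values, duplicates, and date range errors.

```python
from datetime import date

# Hypothetical raw sales records with deliberate problems baked in.
raw = [
    {"order_id": 1, "amount": 100.0, "date": date(2016, 1, 5)},
    {"order_id": 2, "amount": None,  "date": date(2016, 1, 6)},   # missing value
    {"order_id": 2, "amount": None,  "date": date(2016, 1, 6)},   # duplicate row
    {"order_id": 3, "amount": 250.0, "date": date(2015, 12, 1)},  # before sales started
    {"order_id": 4, "amount": 0.0,   "date": date(2016, 1, 7)},   # a real zero, not null
]
sales_start = date(2016, 1, 1)

# Check 1: missing values.
missing = sum(1 for r in raw if r["amount"] is None)

# Check 2: duplicate rows.
seen, duplicates = set(), 0
for r in raw:
    key = (r["order_id"], r["amount"], r["date"])
    duplicates += key in seen
    seen.add(key)

# Check 3: date range errors.
out_of_range = sum(1 for r in raw if r["date"] < sales_start)

# Keep only rows that pass all three checks.
clean, seen = [], set()
for r in raw:
    key = (r["order_id"], r["amount"], r["date"])
    ok = key not in seen and r["amount"] is not None and r["date"] >= sales_start
    seen.add(key)
    if ok:
        clean.append(r)
```

On this toy input the checks find two missing values, one duplicate, and one out-of-range date, leaving two clean rows.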
Data Analytics Process (CRISP-DM)

Step 4: Explore the data
When your data is clean, you should start playing with it. The difficulty here isn't coming up with ideas to test; it's coming up with ideas that are likely to turn into useful insight. You should look for interesting patterns that explain why sales are reduced for the segment of the population you've identified as the problem. You may notice they're not very active on social media, with few having Twitter or Facebook accounts. You may also notice that most people in this segment are older than your general audience. At this point, you can begin to trace these patterns to analyze the data more deeply.

Step 5: Perform in-depth analysis
This step of the process is where you will need to apply your statistical, mathematical, and technological knowledge, and leverage all the data science tools at your disposal to crunch the data and find every insight you can.
Data Analytics Process (CRISP-DM)

Step 6: Communicate results of the analysis
It’s important that your stakeholders understand why the insights you’ve uncovered are important. Ultimately, you’ve been called upon to create a solution throughout the data science process, and how properly you communicate your results will define action or inaction on your proposals.
Career in Data Science
What You Need to Learn to Become a Data Scientist
This next section covers all of the data science skills you’ll need to learn. You’ll also learn about the tools you need to do your job. Most data scientists use a combination of skills every day, some of which they have taught themselves on the job or otherwise. They also come from various backgrounds. There isn’t any one specific academic credential that is required to be an effective data scientist.
Is Data Science for me?
Source: Big Data University
Three-Legged Stool
One way to understand the collaborations that lead to Data Science success is to think of a three-legged stool: Skills (5), Domains (5), and Lifecycle (5). Each leg is critical to the stool remaining stable and fulfilling its intended purpose.
Three-Legged Stool (Skills)
Tableau
Tableau is not only an ultra-powerful tool for seasoned analysts, but is also so easy to learn that it is a great entry point into the world of data. Tableau is like a Data Science career hack.
R/Python
These two programming languages have become the two titans of Data Science. While very different in nature, they both facilitate the same thing: statistical analysis of unlimited complexity. Knowing at least one is a must. Knowing both puts you miles ahead.
SQL (PostgreSQL)
Knowing how to efficiently query a database is a crucial part of a Data Scientist’s job: to analyze the data, you first need to go get it. SQL programming also develops a certain way of thinking about data, which helps you see the big picture and workflow of your analysis.
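A minimal sketch of "going to get the data" with SQL, using Python's built-in sqlite3 module so it runs anywhere. The `transactions` table, its columns, and the values are invented for illustration.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 40.0), ("bob", 15.5), ("alice", 60.0)],
)

# Aggregate spend per customer: the kind of query that fetches and
# summarizes the data before any analysis starts.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM transactions GROUP BY customer ORDER BY total DESC"
).fetchall()
```

The query returns one row per customer with their total spend, highest first.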
Statistics
Needless to say, if you want to be successful as a Data Scientist you will need to develop a certain level of statistical acumen. Start with logistic regression, A/B testing, and the Law of Large Numbers.
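The Law of Large Numbers mentioned above can be seen in a few lines: the sample mean of fair coin flips drifts toward the true probability 0.5 as the sample grows. The sample sizes are arbitrary choices for the demonstration.

```python
import random

random.seed(42)  # fixed seed so the demonstration is reproducible

def sample_mean(n):
    # Fraction of "heads" in n simulated fair coin flips.
    return sum(random.random() < 0.5 for _ in range(n)) / n

small = sample_mean(100)       # noisy estimate from few flips
large = sample_mean(100_000)   # much closer to the true value 0.5
```

With 100,000 flips the estimate lands well within a percentage point of 0.5, while the 100-flip estimate can wander noticeably further.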
Presentation
Preparing the data, building models, creating visualizations, and deriving insights are only half of the job. To be a successful Data Scientist you need to be able to communicate your insights to your audience.
Three-Legged Stool (Domains)
Data Mining /BI Tools
Also known as ad-hoc analytics, data mining is the process of deriving new insights from data. Though different in essence, creating business intelligence (BI) tools is closely related, because these insights often need to be streamlined and integrated into the business.
Machine Learning/Modeling
Machine Learning is popping up everywhere: recommender systems on Amazon & Netflix, speech-to-text, face recognition on your phone – the list goes on.
Advanced Analytics
With Advanced Analytics you create simulations to help real-world businesses identify opportunities for improvements
Computer Forensics
Computer Forensics, Fraud Analytics, and Cyber Security all deal with slightly different things; however, the overall objectives are extraction, analysis, protection, and even ethical hacking of information for legal purposes.
Big Data
Big data refers to dealing with data sets so large and complex that traditional applications simply cannot cope with them. Remember the rule of the “3 Vs”: Volume, Variety, Velocity.
Three-Legged Stool (Lifecycle)
Phase 1 : Identify The Problem
Ever heard the phrase “Here’s some data, can you find some insights?” Too often stakeholders approach Data Scientists with vague or even undefined goals. Understanding the end goal is very important and sets up the rest of the project for success. (Time consumption: 10%)
Phase 2 :Prepare the Data
Data can come from many sources, be in the wrong format, have anomalies, and a myriad of other problems. A single mistake in this stage can render the rest of the analysis useless. (Time consumption: 70%)
Phase 3: Analyze the Data
Creating models, performing data mining, running text analytics, setting up simulations: this is the most fun and exciting part. If the previous stages have been done correctly, analyzing the data and deriving insights will feel like a breeze. (Time consumption: 10%)
Phase 4 : Visualize Insights
Visualizing goes hand-in-hand with analyzing. This is a very powerful technique, as seeing the data in various forms and shapes can help uncover insights that are otherwise not evident. (Time consumption: 10%)
Phase 5: Present Findings
Presenting findings is a whole separate “Bonus” stage. You need to not only convey the insights in your audience’s language but also get buy-in from them to take action based on those insights
Machine Learning
Machine Learning vs. Programming
A young child is playing at home... And he sees a candle! He cautiously waddles over. Out of curiosity, he sticks his hand over the candle flame. "Ouch!," he yells, as he yanks his hand back. "Hmm... that red and bright thing really hurts!" Two days later, he's playing in the kitchen... And he sees a stove-top! Again, he cautiously waddles over. He's curious again, and he's thinking about sticking his hand over it. Suddenly, he notices that it's red and bright! "Ahh..." he thinks to himself, "not today!" He remembers that red and bright means pain, and he ignores the stove top.
He learned the pattern “red and bright means pain.” On the other hand, if he had ignored the stove-top simply because his parents warned him, that would be “explicit programming” instead of machine learning.
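The child's lesson can be mimicked by a toy learner. This is a deliberately tiny sketch (a 1-nearest-neighbour rule over invented examples), not any particular library's API: instead of hard-coding "red and bright means pain", the rule is recovered from labelled experiences.

```python
# Each experience is ((is_red, is_bright), label) with 1 = pain, 0 = no pain.
experiences = [
    ((1, 1), 1),  # candle flame: red and bright -> ouch
    ((0, 0), 0),  # teddy bear: neither -> fine
    ((0, 1), 0),  # lamp: bright but not red -> fine
]

def predict(features):
    # 1-nearest-neighbour: act like the most similar past experience.
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    nearest = min(experiences, key=lambda e: distance(e[0], features))
    return nearest[1]

stove_top = (1, 1)          # red and bright, never touched before
avoid = predict(stove_top)  # the learned warning, not a programmed one
```

The stove-top was never in the training data, yet the learner generalizes from the candle: that generalization from examples is the essence of machine learning.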
Machine Learning
Supervised Learning (Regression)
Regression models (both linear and non-linear) are used for predicting a real value, like salary for example. If your independent variable is time, then you are forecasting future values; otherwise, your model is predicting present but unknown values. Regression techniques vary from MLR to SVR and Boosted Trees.
Supervised Learning (Classification)
Unlike regression, where you predict a continuous number, you use classification to predict a category. There is a wide variety of classification applications, from medicine to marketing. Classification models include linear models like Logistic Regression and SVM, and nonlinear ones like K-NN, Kernel SVM, and Random Forests.
Unsupervised Learning (Clustering)
Clustering is similar to classification, but the basis is different: in clustering you don’t know what you are looking for. When you use clustering algorithms on your dataset, unexpected things can suddenly pop up, like structures, clusters, and groupings you would never have thought of otherwise.
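A toy k-means sketch (plain Python, 1-D data invented for illustration) shows the "no labels given" idea: the algorithm is never told what the groups are, yet two clusters emerge on their own.

```python
data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [data[0], data[-1]]  # naive initialisation: first and last point

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    groups = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        groups[nearest].append(x)
    # Update step: each centroid moves to the mean of its group.
    centroids = [sum(g) / len(g) for g in groups]
```

The loop converges with the low values around one centroid (1.5) and the high values around the other (10.5), even though no labels were ever provided.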
Reinforcement Learning
Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to solve interacting problems, where the data observed up to time t is considered to decide which action to take at time t + 1. It is also used in Artificial Intelligence when training machines to perform tasks such as walking. Desired outcomes provide the AI with a reward, undesired ones with a punishment; machines learn through trial and error. Techniques include Thompson Sampling, Upper Confidence Bound, and Q-Learning.
Natural Language Processing
Teaching machines to understand what is said in spoken and written word is the focus of Natural Language Processing. Whenever you dictate something into your iPhone/Android device and it is converted to text, that’s an NLP algorithm in action. Methods include decision trees, Markov processes, and more.
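As a flavour of the simplest possible NLP pipeline: tokenise text, then score it against tiny hand-made sentiment word lists. The word lists are invented for illustration, and real NLP methods are far richer than this.

```python
# Minimal hand-made sentiment lexicons (assumptions, not a real resource).
positive = {"great", "love", "excellent"}
negative = {"bad", "hate", "terrible"}

def sentiment(text):
    # Tokenise: lowercase, strip periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    # Score: +1 per positive word, -1 per negative word.
    score = sum((t in positive) - (t in negative) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

label = sentiment("I love this excellent phone.")
```

Even this crude bag-of-words scorer captures the core loop of NLP: turn raw text into tokens, then into a decision.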
Data Structure Types
Data Structures
Four Main Types of Data Structures
Structured data: data with a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple spreadsheets).

Semi-structured data: textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
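The "self-describing" property of semi-structured data is what makes parsing possible without a fixed table layout. A small sketch with Python's standard XML parser, on a made-up orders fragment:

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured XML fragment: the tags describe the data.
doc = """
<orders>
  <order id="1"><amount>100.0</amount></order>
  <order id="2"><amount>250.0</amount></order>
</orders>
"""

root = ET.fromstring(doc)
# The discernible pattern (order/amount) lets us recover the records.
amounts = [float(o.find("amount").text) for o in root.findall("order")]
```

No schema or column positions were needed; the element names alone were enough to extract the values.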
Four Main Types of Data Structures
Quasi-structured data: textual data with erratic formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).

Unstructured data: data that has no inherent structure, which may include text documents, PDFs, images, and videos.
Four Main Types of Data Structures
For each data type shown, answer these questions:
- 1. What types of analytics are performed on these data?
- 2. Who analyzes this kind of data?
- 3. What types of data repositories are suited for each, or what requirements may you have for storing and cataloguing this kind of data?
- 4. Who consumes the data?
- 5. Who manages and owns the data?
Bringing Tools into the Data Science Process
Each of these most common data analytics tools has its strengths and weaknesses, and each one can be applied at different stages of the data science process: Excel, SQL, Python, R, Hadoop, NoSQL.
Collect Data: X X X
Process Data: X X X X X
Explore Data: X X X X X
Analyze Data: X X X X X
Communicate Data: X X X
Python vs R
The data science community tends to use either Python or R. Here are some of the differences.

USAGE
Python is often used by computer programmers, since it is the Swiss Army knife of programming languages: versatile enough that you can build websites and do data analysis at the same time. R is primarily used by researchers.
SYNTAX
Python has a nice, clear, “English-like” syntax that makes debugging and understanding code easier, while R has unconventional syntax that can be tricky to understand, especially if you have learned another programming language first.
LEARNING CURVE
R is slightly harder to pick up, especially since it doesn’t follow the normal conventions other common programming languages have. Python is simple enough that it makes for a really good first programming language to learn.
POPULARITY
Python has always been among the top 5 most popular programming languages on GitHub, a common repository of code that often tracks usage habits across all programmers quite accurately, while R typically hovers below the top 10.
FOCUS ON DATA SCIENCE
Python is a general-purpose language, and there is less focus on data analysis packages than in R. Nevertheless, there are very good options for Python, such as Pandas, a data analysis library built just for it.
Recap : Five Things you Need to Remember
- First things first: the Harvard Business Review calls data science the hottest job of the 21st century (and so does the job site Glassdoor). Data Science is a rewarding career.
- Data Science is one field that’s continually welcoming talent and paying well: the average salary for a Data Scientist is $105,000 in the U.S., and the demand for jobs is only increasing.
- Every company has a distinct approach to data science, and because the field is rapidly changing, it is important to stay up-to-date with the latest technologies.
- The demand is high, but there is a major shortage of qualified Data Scientists. To grow your career, you should seek knowledge in universally recognized and adopted technologies like SAS/R, Python coding, SQL databases, and Hadoop to help you move into data science.
- You don’t necessarily have to possess a degree or a Ph.D. to be a data scientist. But this career does require proficiency in statistics, analytics tools, communication skills, commendable knowledge of quants, and business acumen. A successful data scientist puts all these skills to use, which is no small feat.
The good news is…..
BBDS Program
Data Science from ZERO to HERO
The Course Assumptions

The course assumes that you know close to nothing about Data Science and ML. Its goal is to give you the concepts and intuitions you need to actually implement programs capable of learning from data. We will cover a large number of techniques, from the simplest and most commonly used (such as Linear Regression) to some Deep Learning techniques that regularly win competitions.
The course Key Topics
Skills, Domains, Lifecycle, CRISP-DM
Week 1 - Fundamentals
Learn Data Science basics: a good understanding of Data Science and Data Analytics, an overview of the EMC and CAP certificates, and an introduction to concepts, methodologies, and best practices.
Course 1: Introduction to Data Science
Course 2: Learning Path, CAP Certificate
Course 3: Crash Course in R
Course 4: Tableau: Introduction & Basics: Bar Charts; RapidMiner: Introduction to RapidMiner
Course 5: Capstone Project – Project Selection (Open data sets and problems)
Week 2 - Business & Problem Understanding
Determine Business Objectives, Assess Situation, Determine Data Mining Goals, and Produce Project Plan. Gain a good understanding of problem framing, decision framing, decision analysis, and decision implementation using DecisionsFirst Modeler.
Course 1: Business Understanding & Problem Framing
Course 2: Learning Path, EMC Certificate
Course 3: Crash Course in Python (Scikit-Learn, NumPy, Matplotlib, Pandas)
Course 4: Introduction to BigML, SAS Enterprise Miner, IBM Watson
Course 5: Capstone Project – Project Scope (Data preparation & analytic approaches)
Week 3 - Data Understanding & Data Preparation
Exploratory Data Analysis using R & Python: descriptive statistics, hypothesis testing, data preprocessing, missing values imputation, and data transformation. Dive deep into the R programming language, from basic syntax to advanced packages and data visualization (e.g. reshape2, dplyr, string manipulation, ggplot2, R Shiny).
Course 1: Data Understanding & Data Preparation
Course 2: Exploratory Data Analysis in R
Course 3: Exploratory Data Analysis in Python
Course 4: Tableau: Time Series, Aggregation, and Filters; RapidMiner: Data Preparation & Correlation Analysis
Course 5: Capstone Project – Analytics Approach I (Data preparation & classification)
Week 4 - Supervised Learning – Classification (Part 1)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Data Mining & Machine Learning (Classification Analysis)
Course 2: Decision Tree & Random Forest in R
Course 3: Decision Tree & Random Forest in Python
Course 4: Tableau Basics: Maps, Scatterplots & Dashboards; RapidMiner: Decision Tree
Course 5: Capstone Project – Analytic Approach II (Machine learning techniques)
Week 5 - Supervised Learning – Classification (Part 2)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Logistic Regression, KNN, Naïve Bayes in R
Course 2: Logistic Regression, KNN, Naïve Bayes in Python
Course 3: SVM, Kernel SVM in R & Python
Course 4: Tableau: Joining and Blending Data, plus Dual Axis Charts; RapidMiner: Logistic Regression, KNN, Naïve Bayes
Course 5: Capstone Project – Project Analysis Techniques (Presentation techniques)
Week 6 - Supervised Learning – Regression (Part 1)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Data Mining & Machine Learning (Regression Analysis)
Course 2: Decision Tree & Random Forest in R
Course 3: Decision Tree & Random Forest in Python
Course 4: Tableau: Table Calculations, Advanced Dashboards, Storytelling; RapidMiner: Linear Regression
Course 5: Capstone Project – Project Analysis Techniques (Data visualization techniques)
Week 7 - Supervised Learning – Regression (Part 2)
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Supervised Learning.
Course 1: Simple Linear, Multiple Linear, Polynomial Linear in R
Course 2: Simple Linear, Multiple Linear, Polynomial Linear in Python
Course 3: Support Vector Machine in R & Python
Course 4: Tableau: Table Calculations, Advanced Dashboards, Storytelling
Course 5: Capstone Project – Data Analysis Execution Plan (Data visualization tools)
Week 8 - Unsupervised Learning – Clustering & Association Rules
Deepen machine learning skills with R and scikit-learn. Focus on data cleaning, feature extraction, modeling, and model selection using Unsupervised Learning.
Course 1: Clustering Analysis & Association Rules
Course 2: K-Means, Hierarchical Clustering, Apriori, Eclat in R
Course 3: K-Means, Hierarchical Clustering, Apriori, Eclat in Python
Course 4: Tableau: Advanced Data Preparation; RapidMiner: K-Means Clustering and Association Rules
Course 5: Capstone Project – Data Analysis Review (Interpretation)
Week 10 - Reinforcement Learning & Dimensionality Reduction
Deepen machine learning skills with R and scikit learn. Focus on data cleaning, feature extraction, modeling, and model selection using Reinforcement Learning Course 1 : RL- Random Selection, UCB, Thompson Sampling in R Course 2 : RL- Random Selection, UCB, Thompson Sampling in Python Course 3 : DR -PCA, LDA & Kernel PCA in R Course 4 : DR- PCA, LDA & Kernel PCA in Python Course 5 : Capstone Project – Data Analysis Execution Plan (Data visualization tools)
Week 9 - Deep Learning, NLP, Text Mining
Deepen machine learning skills with R and scikit learn. Focus on data cleaning, feature extraction, modeling, and model selection using NLP, Text Mining, Sentiment Analysis, Deep learning with Theano, TenserFlow & Keras, Neural Networks learn, Convolutional Neural Networks Course 1 : Text Analysis – Neural Networks Course 2 : Natural Language Processing in R & Python Course 3 : ANN in R & Python Course 4 : CNN in Python RapidMiner : Text Mining and Neural Networks Course 5 : Capstone Project – Data Analysis Review (Interpretation)
Week 11 - Model Assessment, Validation, Optimization & Tuning
Introduction to Cost Functions, Objective Functions, Model Optimization, Model Tuning, Gradient Boosting, and Grid and Random Search. Analyze the performance of each algorithm and discuss the results.
- Course 1: Model Assessment – CM, ROC, Rank-Ordered Approach, R2, MSE, MAE…
- Course 2: k-Fold Cross Validation & Grid Search in R
- Course 3: k-Fold Cross Validation & Grid Search in Python
- Course 4: XGBoost, AdaBoost in R & Python
- RapidMiner: Cross-Validation
- Course 5: Capstone Project – Presentation (Research and trends in data analytics)
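The k-fold cross-validation idea from this week can be sketched from scratch. The toy "model" below (predict the training mean, score with mean absolute error) is invented for the demo; in the course the same loop is handled by caret in R or scikit-learn's `GridSearchCV` in Python.

```python
# From-scratch sketch of k-fold cross-validation: rotate which fold is
# held out for testing, train on the rest, and average the scores.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_size = n // k
    folds = []
    for i in range(k):
        start = i * fold_size
        stop = n if i == k - 1 else start + fold_size
        folds.append(list(range(start, stop)))
    return folds

def cross_validate(data, k, score_fn):
    """Average score_fn(train, test) over the k train/test splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(score_fn(train, test))
    return sum(scores) / k

# Toy "model": predict the training mean; score = mean absolute error.
def mae_of_mean_model(train, test):
    pred = sum(train) / len(train)
    return sum(abs(y - pred) for y in test) / len(test)

cv_error = cross_validate(list(range(10)), k=5, score_fn=mae_of_mean_model)
```

Grid search simply wraps this loop around every candidate hyperparameter setting and keeps the one with the best averaged score.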
Week 12 – Predictive Analytics, Cognitive Computing & Big Data
Learn the concepts of high-performance computing with parallel computing and cognitive computing, and build skills in IBM Bluemix and Watson Analytics. Introduction to MapReduce, Hadoop, Hive, Spark, and Spark MLlib.
- Course 1: Classification & Regression Analysis in SAS & BigML
- Course 2: Clustering, Text Analysis & Neural Nets in SAS & BigML
- Course 3: Cognitive Computing in IBM Watson Analytics
- Course 4: Introduction to Big Data and Apache Spark
- Course 5: Capstone Project – Final Report
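The MapReduce pattern mentioned above can be illustrated in plain Python as the classic word count. This is a conceptual, single-machine sketch; real Hadoop or Spark jobs distribute the map, shuffle, and reduce phases across a cluster.

```python
# Conceptual sketch of MapReduce as a word count, written in plain Python.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts grouped by word (the shuffle is the grouping)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["big data big compute", "big deal"]))
```

Because each map call touches only its own line and each reduce call only its own key, both phases parallelize naturally, which is the whole point of the pattern.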
Week 1-12: Capstone Analytical Project
Complete a capstone project covering: Project Selection, Project Scope, Analytics Approach, Project Analysis, Data Analysis Techniques, Data Analysis Execution Plan, Data Analysis Review, Analytical Technique, Data Model Analysis, Data Analysis Presentation, and the Final Project Report.
Predicting the Future: Machine Learning Algorithm Map

Supervised Learning
- Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machine (SVM), Kernel SVM, Naïve Bayes, Decision Tree, Random Forest
- Regression: Simple Linear, Multiple Linear, Polynomial, Support Vector Regression (SVR), Decision Tree, Random Forest, Neural Nets

Unsupervised Learning
- Clustering: K-Means, Hierarchical Clustering (HCA), Expectation Maximization
- Dimensionality Reduction: PCA, Kernel PCA, Locally Linear Embedding, Linear Discriminant Analysis (LDA)
- Association Rules: Apriori, Eclat

Reinforcement Learning
- Upper Confidence Bound (UCB), Thompson Sampling

Also covered: XGBoost, T-SNE, Text Mining, Deep Learning, Big Data

Process phases: Business Understanding, Data Understanding, Data Preparation
12 Weeks Program (3rd batch starts on Sept 13th through April 7th)
- Frequency: 3 times a week (10 to 12 hours a week) – Mandatory
- 2 to 3 additional times a week if needed – Optional
- Extensive Live Online Training
- Instructor-Led Course
- Training Video Recordings
- Quality Training Materials
- Two-Way Interactive Sessions
- Flexible Online Schedules
- Job Oriented Training
- Mock Exam/Assessment
- Graded Assignments & Professional Certificate
- Interview Prep
- Job Placement and Placement Guidance
What you get when you're done:
- A working model to showcase your skills
- Data Science Methodology
- A skills matrix to identify your best-fit roles
- A career path to learn, grow, and prosper
- Repeat anytime at no cost
- Possible apprenticeship (2-3 projects)
- Post-training support in your career search
- 36+ Labs (12 Tableau Labs, 12 Python Labs, 12 R Labs, and more)
- 3 + Cohort Projects
- 1 Group Project
- 12 + Mini projects (R & Python)
- Algorithms/Models:
- Time Series Analysis & Anomaly Detection
- Least Absolute Shrinkage and Selection Operator (LASSO) Regression, Elastic Net Regression (Regularized) & Ridge Regression (for data suffering from multicollinearity)
- Supervised Neural Net (RCC)
- 100 + Interview Q&A
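The regularized regression item above (LASSO, Elastic Net, Ridge) can be illustrated with a tiny one-dimensional ridge example. This is a hedged sketch using the closed-form solution with no intercept; the course uses library implementations such as glmnet and scikit-learn, and the data below is made up.

```python
# Hedged sketch of 1-D ridge regression: the L2 penalty alpha shrinks
# the fitted slope toward zero, which stabilizes estimates when
# predictors are correlated (multicollinearity).

def ridge_slope(xs, ys, alpha):
    """Closed-form 1-D ridge fit (no intercept): slope = x.y / (x.x + alpha)."""
    xy = sum(x * y for x, y in zip(xs, ys))
    xx = sum(x * x for x in xs)
    return xy / (xx + alpha)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # underlying slope is 2
ols = ridge_slope(xs, ys, alpha=0.0)        # alpha=0 is plain least squares
shrunk = ridge_slope(xs, ys, alpha=7.0)     # penalized slope is smaller
```

LASSO swaps the L2 penalty for L1 (which can zero out coefficients entirely, selecting features), and Elastic Net blends the two.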
#1: A skilled chef (human guidance)
First, even though we are "teaching computers to learn on their own," human guidance plays a huge role. As you'll see, you'll need to make dozens of decisions along the way. In fact, the very first major decision is how to road-map your project for guaranteed success.
#2: Fresh ingredients (clean, relevant data)
The second essential element is the quality of your data. Garbage in = garbage out, no matter which algorithms you use. Professional data scientists spend most of their time understanding the data, cleaning it, and engineering new features.
#3: Don't overcook it (avoid overfitting)
One of the most dangerous pitfalls in machine learning is overfitting. An overfit model has "memorized" the noise in the training set instead of learning the true underlying patterns. An overfit model within a hedge fund can cost millions of dollars in losses. An overfit model within a hospital can cost thousands of lives. For most applications, the stakes won't be quite that high, but overfitting is still the single largest mistake you must avoid.
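The overfitting warning can be made concrete with a toy experiment: a model that memorizes its training points scores perfectly on them yet fails on unseen data, while a trivially simple model generalizes. The noisy data below is invented purely for the demo.

```python
# Toy overfitting demo: a "memorizer" vs. a simple mean model on noisy
# data whose true signal is just y = 5 plus noise.
import random

random.seed(1)
train = [(x, 5 + random.uniform(-1, 1)) for x in range(10)]
test = [(x, 5 + random.uniform(-1, 1)) for x in range(10, 20)]

lookup = dict(train)                              # "memorizer" model
mean_y = sum(y for _, y in train) / len(train)    # simple mean model

def mse(points, predict):
    """Mean squared error of a prediction function over (x, y) points."""
    return sum((y - predict(x)) ** 2 for x, y in points) / len(points)

memorizer_train = mse(train, lambda x: lookup[x])         # perfect: 0.0
memorizer_test = mse(test, lambda x: lookup.get(x, 0.0))  # falls apart
simple_test = mse(test, lambda x: mean_y)                 # generalizes
```

The memorizer's zero training error is exactly the trap: training performance alone says nothing about performance on data the model has never seen, which is why the Week 11 validation techniques (held-out test sets, k-fold cross-validation) exist.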
- Become an immediate contributor on a data science team
- Assist in reframing a business challenge as an analytics challenge
- Deploy a structured lifecycle approach to data analytics problems
- Apply appropriate analytic techniques and tools to analyze big data
- Tell a compelling story with the data to drive business action
- Use open source tools such as R and Python
- Use Tableau, RapidMiner, BigML, SAS Enterprise Miner, IBM Watson Analytics, IBM Bluemix
- EMC Data Scientist Associate (EMCDSA) Certification
- CAP: Certified Analytics Professional from INFORMS
- Cross-platform learning
- Unlimited users
- Share files
- Pin down posts as per your choice
- Only notified when you are tagged
- Private chats
Other BBDS training programs:
- Advanced R-Bootcamp
- Advanced Python-Bootcamp
- Deep Learning
- Big Data
- Data Visualization
Full list of Big Bang Data Science Institute training services: http://www.bigbang-datascience.com/training-programs
"Because of the media coverage around data science and the characterization of data scientists as 'rock stars,' you may feel like it's impossible for you to enter into this realm. If you're the type of person who loves to solve puzzles and find patterns, whether or not you consider yourself a quant, then data science is for you." – Cathy O'Neil & Rachel Schutt, Doing Data Science
Q & A
BIG BANG DATA SCIENCE INSTITUTE
LEARN . ACHIEVE. STANDOUT