The bottom line
We are the data science people but the world needs to know about it
Wrangling vs Analytics
[Diagram labels: wrangling, analytics]
- Wrangling: data processing that allows meaningful analysis to begin (extraction, integration, cleaning, querying, etc.; basically the SIGMOD/PODS CFP)
- Requires more effort (usually 50-80%)
This is what we do
- But the world sees the end result
- The 80-20 rule: 20% of effort gets 80% of PR
- But we need to be better at it
- Some ammunition...
Data analysts’ favorite tools
[Bar chart: share of respondents (axis from 0% to 70%) using each tool; tools are categorized as language, data platform, or analytics. Tools as listed along the axis: Teradata; SPSS; Perl; Amazon Elastic MapReduce (EMR); HBase; Weka; Amazon Redshift; Pig; C; SQLite; Scala; PowerPivot; C++; SAS; Apache Hadoop; MongoDB; Visual Basic/VBA; Cloudera; Spark; Hive; Homegrown analysis tools; D3; Oracle; PostgreSQL; Java; Matplotlib (Python); JavaScript; Tableau; Microsoft SQL Server; ggplot; Python: numpy, scipy, scikit-learn; MySQL; R; Python; Excel; SQL]
Future data analysts’ favorite tools
The world needs to know
- ... but it’s much more fun doing research than talking to
the “real world”
- Still, we are not a small community, and we have
people with different skills
- One example: we convinced our funders (EPSRC)
that data management is an essential part of “big data”
- The more people get the message, the healthier our