The Hitch-Hackers Guide to Data Science ... or what I wish I’d known

SLIDE 1

Science Data Acquisition Machines ToolBox Conclusion

The Hitch-Hackers Guide to Data Science

... or what I wish I’d known when I was younger

Jaroslav Vážný

Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz

April 3, 2014

Jaroslav Vážný Practical approach

SLIDE 2

1 Science
2 Data Acquisition
3 Machines
4 ToolBox
5 Conclusion

SLIDE 3

What is Science?

The whole of science is nothing more than a refinement of everyday thinking.

Albert Einstein

SLIDE 4

More than Science

Mistakes/Feedback
No pain no gain
Pain == gain?
Everything is hard until someone makes it easy

SLIDE 5

MOOC == new era?

https://www.khanacademy.org/
https://www.coursera.org/
https://www.udacity.com/
https://www.edx.org/

SLIDE 6

Reproducibility

http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
http://nbviewer.ipython.org/
http://pdos.csail.mit.edu/scigen/ ;-)

SLIDE 7

We are all humans

SLIDE 8

We are all humans/animals

SLIDE 9

We are all humans/animals/idiots

SLIDE 10

Probability

Test your intuition!
You roll a die and get a 6 five times. What is P(6)?
Monty Hall problem
Show examples in IPython!
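One way to test intuition about the Monty Hall problem is to simulate it. The sketch below is not taken from the talk; it assumes the standard rules (three doors, one car, the host always opens a goat door that the player did not pick), and the function names and the number of trials are illustrative choices.

```python
import random


def monty_hall_trial(switch, rng):
    """Play one game; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is still closed and not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the winning probability by repeated simulation."""
    rng = random.Random(seed)
    wins = sum(monty_hall_trial(switch, rng) for _ in range(trials))
    return wins / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Under these rules the estimate for staying should come out near 1/3 and the estimate for switching near 2/3.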

SLIDE 11

Bayes’s theorem

Suppose the probability (for anyone) of having AIDS is:

P(AIDS) = 0.001
P(no AIDS) = 0.999

Consider an AIDS test: the result is + or -

P(+|AIDS) = 0.98
P(-|AIDS) = 0.02
P(+|no AIDS) = 0.03
P(-|no AIDS) = 0.97

SLIDE 12

Bayes’s theorem solution

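\[
P(\mathrm{AIDS} \mid +)
  = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}
         {P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \text{no AIDS})\,P(\text{no AIDS})}
  = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999}
  \approx 0.032
\]

Your viewpoint: my degree of belief that I have AIDS is 3.2%
Your doctor’s viewpoint: 3.2% of people like this will have AIDS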
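The same calculation can be checked in a few lines of Python. This is a minimal sketch using only the probabilities given on the slides; the function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior, p_pos_given_cond, p_pos_given_no_cond):
    """Return P(condition | positive test) using Bayes's theorem."""
    true_pos = p_pos_given_cond * prior             # P(+|AIDS) P(AIDS)
    false_pos = p_pos_given_no_cond * (1 - prior)   # P(+|no AIDS) P(no AIDS)
    return true_pos / (true_pos + false_pos)


p = posterior(prior=0.001, p_pos_given_cond=0.98, p_pos_given_no_cond=0.03)
print(f"P(AIDS|+) = {p:.3f}")  # prints P(AIDS|+) = 0.032
```

Changing `prior` shows how strongly the result depends on how rare the condition is: the false positives from the large healthy population dominate the small number of true positives.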

SLIDE 13

We are all humans/animals/idiots/liars

SLIDE 14

Data Avalanche?

Large Synoptic Survey Telescope

20 TB per night
60 PB for the raw data (after 10 years)
15 PB for the catalog database
The total data volume after processing will be several hundred PB

CERN: 1 PB per day
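As a quick sanity check on the LSST figures, the nightly rate and the ten-year raw volume can be compared directly. This sketch assumes decimal units (1 PB = 1000 TB) and a constant 20 TB on every observing night; neither assumption is stated on the slide.

```python
TB = 10**12  # bytes, decimal units assumed
PB = 10**15

per_night = 20 * TB
raw_after_10_years = 60 * PB

# How many observing nights does the quoted raw volume correspond to?
nights_total = raw_after_10_years / per_night
nights_per_year = nights_total / 10

print(nights_total, nights_per_year)  # 3000.0 300.0
```

Under those assumptions the two numbers are consistent with roughly 300 observing nights per year.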

SLIDE 15

Sloan Digital Sky Survey

Why is it important?

Lots of data (>10^6 objects)
Perfect documentation
Tools to access the data

Where can I learn it?

http://www.sdss3.org/

SLIDE 16

Virtual Observatory

Why is it important?

Uniform access to astronomy data
Based on Web standards
Many tools with VO support (TOPCAT, Aladin, TAPsh)

Where can I learn it?

http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work

SLIDE 17

What is

Machine Learning (Data astrology)
Data Mining
Artificial Intelligence

SLIDE 18

Supervised Machine Learning

[Diagram: Supervised Learning Model]
Training: text, documents, images, etc. together with their labels -> feature vectors -> machine learning algorithm -> predictive model
Prediction: new text, document, image, etc. -> feature vector -> predictive model -> expected label
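The training and prediction flow on this slide can be sketched in plain Python. The example below is not from the talk: it swaps in a nearest-centroid classifier as a deliberately simple stand-in for the "machine learning algorithm" box, and the toy feature vectors and the labels "star" and "quasar" are hypothetical.

```python
import math
from collections import defaultdict


def fit(feature_vectors, labels):
    """Training: average the feature vectors of each label into one centroid."""
    groups = defaultdict(list)
    for x, y in zip(feature_vectors, labels):
        groups[y].append(x)
    return {
        y: [sum(col) / len(xs) for col in zip(*xs)]
        for y, xs in groups.items()
    }


def predict(model, x):
    """Prediction: return the label whose centroid is closest to x."""
    return min(model, key=lambda y: math.dist(model[y], x))


# Hypothetical toy training set: two features per object, two classes.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
y_train = ["star", "star", "quasar", "quasar"]

model = fit(X_train, y_train)          # feature vectors + labels -> model
print(predict(model, [0.15, 0.15]))    # star
print(predict(model, [0.95, 0.90]))    # quasar
```

The `fit` and `predict` split mirrors the two rows of the diagram; a library such as scikit-learn exposes the same two steps for far more capable models.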

SLIDE 19

Overfit/underfit

SLIDE 20

Unsupervised Machine Learning

[Diagram: Unsupervised Learning Model]
Training: text, documents, images, etc. -> feature vectors -> machine learning algorithm -> predictive model
Prediction: new text, document, image, etc. -> feature vector -> predictive model -> likelihood, or cluster ID, or better representation
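To illustrate the "cluster ID" output without any labels, here is a minimal k-means sketch in plain Python. It is not from the talk: k-means is just one possible unsupervised algorithm, and the toy points and the choice of starting centroids are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (its cluster ID)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means starting from the given initial centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: give every point the ID of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Hypothetical toy data: two well-separated groups, no labels given.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A new feature vector gets its cluster ID from `nearest(new_point, centroids)`, which corresponds to the prediction row of the diagram.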

SLIDE 21

Star spectrum

SLIDE 22

Example of feature extraction

SLIDE 23

Example: Decision Tree

ug <= 0.663668
| gr <= -0.191208: 1 (7.0)
| gr > -0.191208: 3 (104.0/5.0)
ug > 0.663668
| ri <= 0.285854: 1 (88.0/5.0)
| ri > 0.285854
| | ri <= 0.314657
| | | gr <= 0.692108: 2 (6.0)
| | | gr > 0.692108: 1 (3.0)
| | ri > 0.314657: 2 (90.0/2.0)
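A tree like the one above is just a chain of threshold tests, so it can be transcribed directly into a function. The sketch below copies the thresholds and class numbers from the slide; the function name `classify` is illustrative, and reading `ug`, `gr` and `ri` as colour indices (u-g, g-r, r-i) is an assumption, since the slide does not define them or the classes 1, 2 and 3. The figures in parentheses on each leaf appear to be instance counts and are not used here.

```python
def classify(ug, gr, ri):
    """Return the class (1, 2 or 3) predicted by the tree on the slide."""
    if ug <= 0.663668:
        # Left branch: decided by gr alone.
        return 1 if gr <= -0.191208 else 3
    # Right branch (ug > 0.663668): decided mostly by ri.
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        # Narrow ri band: gr breaks the tie.
        return 2 if gr <= 0.692108 else 1
    return 2


print(classify(ug=0.5, gr=0.0, ri=0.1))   # 3
print(classify(ug=1.0, gr=0.5, ri=0.4))   # 2
```

Being able to read the model as a handful of `if` statements is the main appeal of decision trees compared with less transparent classifiers.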

SLIDE 24

Example: Support Vector Machine

SLIDE 25

Data exploration

http://ipython.org/
http://scikit-learn.org/stable/
http://pandas.pydata.org/

SLIDE 26

Development

https://github.com/
Tests
Funny hat
https://www.python.org/

SLIDE 27

References

http://ipython.org/
http://www.greenteapress.com/thinkstats/
http://www.greenteapress.com/thinkpython/
http://scikit-learn.org/stable/
http://pandas.pydata.org/
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
http://www.galaxyzoo.org/
http://www.planethunters.org/
http://www.sdss3.org/

SLIDE 28

Discussion
