Big Data – so what’s the big deal? Jevin West, Information School - PowerPoint PPT Presentation

Big Data – so what’s the big deal? Jevin West, Information School, University of Washington, DataLab (MGH 310E), jevinw@uw.edu. January 26, 2017. What is Data Science? Spring Quarter, 2017: http://callingbullshit.org


slide-1
SLIDE 1

Big Data – so what’s the big deal?

Jevin West
Information School, University of Washington
DataLab (MGH 310E)
jevinw@uw.edu
January 26, 2017

slide-2
SLIDE 2

What is Data Science?

slide-3
SLIDE 3

http://callingbullshit.org Spring Quarter, 2017

slide-4
SLIDE 4

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

slide-5
SLIDE 5

Want to be a data scientist?

slide-6
SLIDE 6

‘The Data Scientist’

  • Communication skills
  • Ethical Reasoning
  • Information/Data Management
  • Personnel Management
  • Interdisciplinary
  • Adaptable

slide-7
SLIDE 7

Data Scientist

Drew Conway, NYU

slide-8
SLIDE 8

Examples of data science

slide-9
SLIDE 9

Agenda

  • What is data science?
  • Cautionary Tales
  • Data Science at UW and in Seattle
  • Big data – why should you care?
  • More cautionary Tales (Data and Society)
  • Data Science, in action
  • DataLab
  • Data for Social Good
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Universities are going big

slide-13
SLIDE 13
slide-14
SLIDE 14

Big Data at UW

  • LSST
  • CS (Farecast)
  • Libraries (digital content)
  • Oceanography
  • Neuroscience
slide-15
SLIDE 15
slide-16
SLIDE 16

Data Science at the Information School

  • Data Science Option (~ Spring 2016)
  • INFO 370: Introduction to Data Science (Fall)
  • INFO 371: Machine Learning (Spring)
  • INFO 445: Advanced Database Design, Management, and Maintenance

  • INFO 474: Interactive Data Visualization
slide-17
SLIDE 17

Other Classes in iSchool

  • INFX 551 (4 credits) – Fundamentals of Data Curation
  • INFX 576 (4 credits) – Social Network Analysis
  • INFO 470 (5 credits) – Research Methods
  • INFX 573 (4 credits) – Introduction to Data Science
  • INFX 574 (4 credits) – Core Methods in Data Science and Analytics
  • INFX 575 (4 credits) – Advanced Methods in Data Science and Analytics

slide-18
SLIDE 18

Extra Credit

slide-19
SLIDE 19

What is big data?

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

“Yes, some of the best theorizing comes after collecting data because then you become aware of another reality…”

Robert Shiller, Nobel Prize in Economics (2013)

slide-23
SLIDE 23

Data Exhaust: by-product of human activity

Examples: cell phone locations, purchase transactions, social media

Barabási et al., Nature (2008); Ginsberg et al., Nature (2009)

slide-24
SLIDE 24

Why big data?

  • Cheaper sensors (climate research, astronomy, high energy physics, high-throughput gene sequencing, cell phones)
  • Cheaper storage (4 TB, $168)
  • People willing to share their personal information (Facebook, social media)

  • Faster communication (internet, cell phones)
  • Other reasons?
slide-25
SLIDE 25

The Four A’s and V’s

  • Architecture
  • Acquisition
  • Analysis
  • Archiving
  • Volume
  • Velocity
  • Variety
  • Veracity
slide-26
SLIDE 26

References

slide-27
SLIDE 27

Why should you care about big data?

A shortage of 1.5 million workers to fill data jobs!

slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30
slide-31
SLIDE 31

Concerns

  • Privacy
  • Overconfidence and Overfitting
  • Correlation versus causation
  • Who owns big data?
  • What else?
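
The "Overconfidence and Overfitting" concern above can be illustrated with a minimal sketch. Everything here is a made-up assumption for illustration (the function name, the sample sizes, the coin-flip labels); it is not taken from the slides. A lookup-table "model" memorizes labels that are pure noise, so it looks perfect on the data it has seen and does no better than chance on new data.

```python
import random

def memorizer_accuracy(n_train=200, n_test=200, seed=0):
    """Train a lookup-table 'model' on pure-noise labels and score it."""
    rng = random.Random(seed)
    # Features are unique IDs; labels are coin flips, so nothing is learnable.
    train = {i: rng.randint(0, 1) for i in range(n_train)}
    test = {i: rng.randint(0, 1) for i in range(n_train, n_train + n_test)}

    def predict(x):
        # Return the memorized label if x was seen in training, else guess 0.
        return train.get(x, 0)

    train_acc = sum(predict(x) == y for x, y in train.items()) / n_train
    test_acc = sum(predict(x) == y for x, y in test.items()) / n_test
    return train_acc, test_acc

train_acc, test_acc = memorizer_accuracy()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the point: a high score on data the model has already seen says little about how it will behave on data it has not.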
slide-32
SLIDE 32
slide-33
SLIDE 33

Big Data is messy

slide-34
SLIDE 34

http://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/

slide-35
SLIDE 35

New MIT algorithm rubs shoulders with human intuition in big data analysis

https://www.washingtonpost.com/news/speaking-of-science/wp/2015/10/19/new-mit-algorithm-rubs-shoulders-with-human-intuition-in-big-data-analysis/

slide-36
SLIDE 36
slide-37
SLIDE 37

Correlation versus Causation
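
One way to see the difference is a small simulation with a hidden common cause. This is a sketch under invented assumptions: the variable names (temperature, ice cream sales, sunburns) and all numbers are hypothetical and do not come from the slides. Two quantities that never influence each other still correlate strongly because both depend on a third.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
# Hypothetical confounder: daily temperature drives both outcomes.
temperature = [rng.gauss(20, 8) for _ in range(5000)]
ice_cream = [t + rng.gauss(0, 4) for t in temperature]
sunburns = [t + rng.gauss(0, 4) for t in temperature]

# Raw correlation is strong even though neither variable affects the other.
r_raw = pearson(ice_cream, sunburns)

# Subtract the confounder's contribution (its coefficient is 1 by construction
# in this simulation); only independent noise is left.
ice_resid = [i - t for i, t in zip(ice_cream, temperature)]
burn_resid = [s - t for s, t in zip(sunburns, temperature)]
r_adjusted = pearson(ice_resid, burn_resid)

print(f"raw r = {r_raw:.2f}, after removing temperature r = {r_adjusted:.2f}")
```

In real data the confounder is usually unknown or unmeasured, which is why a strong correlation alone cannot settle a causal question.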

slide-38
SLIDE 38

http://www.washingtonpost.com/news/wonkblog/wp/2015/10/01/the-hidden-inequality-of-who-dies-in-car-crashes/

slide-39
SLIDE 39

Sampling
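
The slide gives only the heading, so the following is one possible reading rather than the speaker's stated point: a large sample does not help if it is collected in a biased way. The sketch below uses an invented population and an invented selection rule (only values above 50 are observed) to compare a small random sample with a much larger biased one.

```python
import random

rng = random.Random(7)
# Hypothetical population of 100,000 values centered on 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Small but random: every member is equally likely to be picked.
random_sample = rng.sample(population, 500)
random_mean = sum(random_sample) / len(random_sample)

# Large but biased: only members with values above 50 are ever observed.
biased_sample = [x for x in population if x > 50]
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"true mean:          {true_mean:.1f}")
print(f"random sample (n={len(random_sample)}): {random_mean:.1f}")
print(f"biased sample (n={len(biased_sample)}): {biased_mean:.1f}")
```

The random sample of 500 lands close to the true mean, while the biased sample stays far off no matter how many records it contains.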

slide-40
SLIDE 40

Big Data in action

slide-41
SLIDE 41

DJ Patil

slide-42
SLIDE 42

If you had access to the personal calendars of 200 million people, what could you do with it? What products could you create?

slide-43
SLIDE 43

Is there a secondary market for the data that companies are collecting?

slide-44
SLIDE 44

Big data is about asking good questions

slide-45
SLIDE 45
slide-46
SLIDE 46

Science of Science

Jevin West | jevinw@uw.edu | @jevinwest | jevinwest.org

slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49

Figure: map of citation flow among scientific fields (e.g., Molecular & Cell Biology, Medicine, Physics, Neuroscience, Ecology & Evolution, Economics). Legend for fields A and B: citation flow from B to A, citation flow from A to B, citation flow within field, citation flow out of field.

slide-50
SLIDE 50

slide-51
SLIDE 51
slide-52
SLIDE 52

West, Wesley-Smith, Bergstrom (2016) A recommendation system based on hierarchical clustering of an article-level citation network. IEEE Transactions on Big Data (in press)

slide-53
SLIDE 53

Mining the literature

In collaboration with P. I. Imoukhuede, University of Illinois

slide-54
SLIDE 54

http://jevinwest.org

slide-55
SLIDE 55

Why should you care about big data?

  • Jobs
  • Privacy

slide-56
SLIDE 56

Enjoy the wave but be cautious…

slide-57
SLIDE 57

Big Data involves people

slide-58
SLIDE 58

“Data is increasingly digital air: the oxygen we breathe and the carbon dioxide that we exhale. It can be a source of both sustenance and pollution.” -- danah boyd

  • D. Boyd & K. Crawford (2011) Six Provocations for Big Data. SSRN
slide-59
SLIDE 59
slide-60
SLIDE 60

Jevin West | jevinw@uw.edu | @jevinwest | Website: jevinwest.org | Lab: datalab.ischool.uw.edu