Data Mining and Exploration (Michael Gutmann)

SLIDE 1

Data Mining and Exploration

Michael Gutmann

michael.gutmann@ed.ac.uk
http://homepages.inf.ed.ac.uk/mgutmann
Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh

19th January 2017

Michael Gutmann DME 1 / 14

SLIDE 2

Data

Oxford dictionary:

◮ Plural of datum
◮ From Latin: dare, to give; datum: something given
◮ A piece of information
◮ Facts [...] collected together for reference or analysis
◮ Things [...] making the basis of reasoning or calculation

SLIDE 3

Illustration by Frederic Dorr Steele

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay”

Sherlock Holmes

SLIDE 4

Data sources

◮ Scientific measurements
◮ Business records
◮ Medical tests
◮ Paying by credit card
◮ Using the mobile phone
◮ Social media
◮ Machines
◮ ...

SLIDE 5

Scientific data

Large Hadron Collider:

◮ Particles collide at high energies, creating new particles that decay in complex ways
◮ The raw data per collision event is around one MB
◮ About 600 million events per second

⇒ 600 terabytes of data per second

Source: https://home.cern/about/computing
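The arithmetic behind the 600 terabytes figure can be checked in a few lines. This is a minimal sketch that uses the rounded figures from the slide and assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the variable names are illustrative only.

```python
# Rounded figures from the slide; decimal (SI) units are an assumption.
BYTES_PER_MB = 10**6
BYTES_PER_TB = 10**12

bytes_per_event = 1 * BYTES_PER_MB   # around one MB of raw data per collision event
events_per_second = 600 * 10**6      # about 600 million events per second

# Data rate = size of one event times the number of events per second.
bytes_per_second = bytes_per_event * events_per_second
terabytes_per_second = bytes_per_second / BYTES_PER_TB

print(f"{terabytes_per_second:.0f} TB per second")  # prints "600 TB per second"
```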

SLIDE 6

Human generated data

On a single day

◮ 500 million tweets
◮ 4.3 billion Facebook messages
◮ 6 billion Google searches
◮ 205 billion emails
◮ ...

Source: https://www.gwava.com/blog/internet-data-created-daily

SLIDE 7

Human generated data

Source: https://www.domo.com/blog/data-never-sleeps-4-0

SLIDE 8

Machine generated data

◮ Airplane engine: 5,000 sensors, 10 GB of data per second
◮ Internet of Things

Sources: From Machine-To-Machine to the Internet of Things, Ch 2, 2014; aviationweek.com/connected-aerospace/internet-aircraft-things-industry-set-be-transformed

SLIDE 9

Data mining ≈ data analysis ≈ data science

First sentences from the corresponding Wikipedia pages:

◮ The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use

◮ Analysis of data is a process of [...] with the goal of discovering useful information, suggesting conclusions, and supporting decision-making

◮ Data science [...] is an interdisciplinary field about scientific processes and systems to extract knowledge or insights from data in various forms [...]

SLIDE 10

Data mining ≈ data analysis ≈ data science

In short:

◮ Data → knowledge
◮ Evidence → conclusions
◮ Pieces of information → actionable information
◮ The process of “making the bricks out of the clay”

SLIDE 11

Data analysis as statistical inference

Given a data generating process (the data source), what are the properties of the outcomes (the data)?

Given the outcomes (the data), what can we say about the process that generated them?

Based on Figure 1 of All of Statistics by Larry Wasserman

SLIDE 12

Data analysis as statistical inference

Data are a realisation of a random vector x with some probability distribution that we don’t know.
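Going from a process to data, and from data back to the process, can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the data generating process is a one-dimensional Gaussian with mean 2.0 and standard deviation 1.5; neither the distribution nor the parameter values come from the slides. Sampling from the process is the forward direction, and estimating its parameters from the sample alone is the inverse direction.

```python
import random
import statistics

# Hypothetical data generating process (an assumption for illustration):
# a Gaussian with mean 2.0 and standard deviation 1.5.
TRUE_MEAN = 2.0
TRUE_STD = 1.5

rng = random.Random(0)  # fixed seed so the run is reproducible

# Forward direction: given the process, generate outcomes (the data).
data = [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(10_000)]

# Inverse direction: given only the data, estimate properties of the process.
mean_hat = statistics.fmean(data)
std_hat = statistics.stdev(data)

print(f"estimated mean: {mean_hat:.2f} (true value {TRUE_MEAN})")
print(f"estimated std:  {std_hat:.2f} (true value {TRUE_STD})")
```

In practice the true values are unknown, so the estimates cannot be compared against them as they are here; the comparison is only possible because the process was chosen by hand.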

SLIDE 13

Data analysis process

Objectives and key results

  • where are you now?
  • what do you want to do?
  • constraints?

Get (raw) data

  • understand the sampling process
  • any biases?
  • feedback loops between data analysis and collection?

Sanity checks, exploratory data analysis

  • become familiar with the data
  • spot unexpected properties
  • anomalies, outliers, missing data?

Prep data for further analysis

  • merge data sets, reformat
  • select/exclude data
  • provide clear rationale for selection/exclusion

Build and fit model

  • generalisation is the goal
  • choice of evaluation metric
  • choice of hyperparameters

Summarise, visualise results

  • can you tell a simple and coherent story?
  • what makes sense, what not?
  • limitations, uncertainties?

Deploy the product / Communicate findings
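The middle stages of this process can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: the synthetic data (a noisy line y = 3x + 1 with one injected missing value and one injected outlier), the outlier rule (median absolute deviation with a cut-off factor of 5), the 80/20 split, and mean squared error as the evaluation metric. It is one possible way to walk through the stages, not a prescribed method.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

# Get (raw) data: synthetic (x, y) pairs from a hypothetical process y = 3x + 1 + noise.
raw = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    raw.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.5)))
raw[5] = (raw[5][0], None)     # inject a missing value
raw[17] = (raw[17][0], 500.0)  # inject a gross outlier

# Sanity checks: count missing values and flag outliers with a robust rule.
n_missing = sum(1 for _, y in raw if y is None)
observed = [(x, y) for x, y in raw if y is not None]
ys = [y for _, y in observed]
med = statistics.median(ys)
mad = statistics.median([abs(y - med) for y in ys])

def is_outlier(y):
    """Flag values far from the median (the cut-off factor 5 is an assumption)."""
    return abs(y - med) > 5 * 1.4826 * mad

n_outliers = sum(1 for y in ys if is_outlier(y))

# Prep data: exclude missing values and flagged outliers; the rationale is the rule above.
clean = [(x, y) for x, y in observed if not is_outlier(y)]

# Build and fit model: hold out part of the data to assess generalisation.
rng.shuffle(clean)
split = int(0.8 * len(clean))
train, held_out = clean[:split], clean[split:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar = statistics.fmean([x for x, _ in points])
    y_bar = statistics.fmean([y for _, y in points])
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mse(points, slope, intercept):
    """Mean squared error, the evaluation metric chosen for this sketch."""
    return statistics.fmean([(y - (slope * x + intercept)) ** 2 for x, y in points])

slope, intercept = fit_line(train)
held_out_mse = mse(held_out, slope, intercept)

# Summarise results.
print(f"missing: {n_missing}, outliers: {n_outliers}, kept: {len(clean)}")
print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
print(f"held-out MSE: {held_out_mse:.2f}")
```

Evaluating on data that was not used for fitting is what makes the reported error an estimate of generalisation rather than of fit to the training set.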

SLIDE 14

Plan for DME

[Figure: the data analysis process diagram from the previous slide, annotated with the course components Lectures 1-3, Lecture 4, Lecture 5, Mini-project, and Presentations.]