TADA! Topics in Algorithmic Data Analysis, Jilles Vreeken, 24 April 2015



SLIDE 1

TADA!

Topics in Algorithmic Data Analysis

24 April 2015

Jilles Vreeken

SLIDE 2

Question of the Course

What are the hot topics in data mining that are cool*?

* and important to know

SLIDE 3

Question of the Course

How can we extract novel knowledge and insight from large data?

SLIDE 4

Organization

This is an advanced lecture,

  • with lectures,
  • and reading,
  • and assignments.

Beware!

  • this lecture will be well-worth its 5 ECTS

SLIDE 5

I’m not afraid!

  • You will be, you will be.

I’m not afraid.

SLIDE 6

I’m not afraid!

  • You will be, you will be.

I’m not afraid. Yes… you will be. You will be.

SLIDE 7

Organization

This is an advanced lecture,

  • with lectures,
  • and reading,
  • and assignments.

Beware!

  • this lecture will be well-worth its 5 ECTS
  • a lot of reading, a lot of thinking;

it’ll take quite some effort, but you’ll learn a lot

SLIDE 8

Reading Materials

We’ll mainly consider scientific articles

All will be available on the website

  • directly accessible from the MPI network,
  • or using login/password that you can get by email

SLIDE 9

Lectures

Meetings that cover the basic topics

  • format: ‘sit, listen, shut up interact’

Required reading

  • announced on website
  • read at your own convenience, but, strongly preferred, before the lecture

SLIDE 10

Exam

Type tba

  • most likely oral

Day and place tba

  • most likely in early August

Grading

  • final grade will be based on final exam and assignments

SLIDE 11

Assignments: general

4 assignments

Grading scale: fail, pass, excellent.

You may fail on one assignment

  • two fails and you fail the course

Every excellent gives 1/3 bonus point on final exam grade

  • with maximum of 1 full point

You must pass the final exam to pass the course
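
The rules on this slide can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names and the grade labels as strings are hypothetical, and the slide does not say how the bonus is combined with the exam grade, so the sketch only computes the bonus itself and the pass/fail conditions.

```python
from fractions import Fraction


def assignment_outcome(grades):
    """Summarise assignment results.

    grades: list of 'fail', 'pass' or 'excellent' (hypothetical labels),
    one entry per assignment.
    Returns (assignments_ok, bonus).
    """
    fails = grades.count("fail")
    excellents = grades.count("excellent")
    # One fail is tolerated; two fails mean failing the course.
    assignments_ok = fails <= 1
    # Each 'excellent' is worth 1/3 bonus point, capped at 1 full point.
    bonus = min(Fraction(1), Fraction(excellents, 3))
    return assignments_ok, bonus


def course_passed(grades, exam_passed):
    """The final exam must be passed, and at most one assignment failed."""
    assignments_ok, _ = assignment_outcome(grades)
    return exam_passed and assignments_ok
```

With four assignments, four excellents would add up to 4/3, so the cap of one full point only matters in that case.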

SLIDE 12

Assignments: requirements

To be written in proper academic-style English

Use proper citations

  • you are given sources
  • you are encouraged to find additional sources
  • all sources must be mentioned
  • plagiarism → instant fail (at best)

SLIDE 13

Assignments: format

Return assignment reports as PDF files by email

  • no .doc(x), .odt, .rtf, .txt, .xml, .html, .pages, .ps, .eps, .etc

No page limit!

  • probably most will need 3 to 5 pages
  • more is not necessarily better

Reports must clearly state on the first page

  • name, matriculation number, email address and topic

SLIDE 14

Assignments: returning

Assignment reports are to be returned by email

  • tada@mpi-inf.mpg.de

Deadline is at 14:00 hours on the stated day

  • NO delays, no excuses, time based on mail time stamp.

Submissions that I receive before the DL day I will ACK

SLIDE 15

Assignments: grading

Assignments are not for repeating what papers say

  • perhaps surprisingly, but I have already read the papers.

You are expected to critically discuss the sources, build connections, point out differences, provide new insights, etc.

Some assignments are marked as hard

  • this is because they are
  • and this will be taken into account when grading

SLIDE 16

News & Updates

Urgent and personal messages by email

  • everything else via the website

SLIDE 17

Question of the Course

How can we extract novel knowledge and insight from large data?

SLIDE 18

SLIDE 19

1st Paradigm: Empirical Science

For thousands of years, science was empirical: describing natural phenomena

SLIDE 20

2nd Paradigm: Theoretical Science

The last few hundred years, science was theoretical: used models, generalizations, made predictions

SLIDE 21

3rd Paradigm: Computational Science

The last decades, science was computational: complex models simulating complex phenomena

SLIDE 22

4th Paradigm: Data-Intensive Science

Interesting phenomena are too complex to come up with good hypotheses. We need to unify theory, experimentation, and simulation

capture data, mine hypotheses, inspect and evaluate, generate extra data to select the best ones, iterate

iterative procedure between world and model, scientist in the middle

SLIDE 23

Power laws

SLIDE 24

Shopping Data

Which products are often bought together?
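
The question on this slide can be made concrete with a minimal sketch that counts how often pairs of products co-occur in baskets. It is only an illustration, not a method taken from the lecture: the function name `frequent_pairs`, the `min_support` threshold, and the toy baskets are all assumptions, and practical frequent-itemset miners handle larger itemsets and far larger data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return product pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has a single canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy data, made up for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["beer", "chips"],
    ["bread", "milk"],
    ["beer", "chips", "bread"],
]

result = frequent_pairs(baskets, min_support=2)
# Pairs seen in at least two baskets:
# ('bread', 'butter'), ('bread', 'milk'), ('beer', 'chips')
```

Counting every pair is quadratic in basket size, which is acceptable for a toy example but is one reason dedicated pattern-mining algorithms exist.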

SLIDE 25

Train Delays

Which trains are delayed because of other trains?

SLIDE 26

Drug Discovery

What part of the molecule makes the drug work?

SLIDE 27

More patterns than you can shake a stick at

SLIDE 28

Pattern-based Modelling

[Figure: stemmed word patterns, labelled "Mining Algorithm" (support vector machin svm, associ rule mine, nearest neighbor, frequent itemset mine, naïv bay, linear discrimin analysi lda, cluster high dimension, state art, frequent pattern mine algorithm, synthet real); summary of JMLR abstract database]

SLIDE 29

Summarising

Which sales characterise your customers?

SLIDE 30

Summarising

SLIDE 31

Jilles Vreeken’s Professional Network as of April 21, 2015

SLIDE 32

Google Flu

SLIDE 33

Quite Healthy

SLIDE 34

Patient Deceased

SLIDE 35

Big Data, Bigger Data, Biggest Data

SLIDE 36

No model is perfect

SLIDE 37

Science has lots of data, not the tools to analyse it

SLIDE 38

Social Science & the Web

SLIDE 39

Astronomy

Sloan Sky Survey:

100TB between 2000 and 2008

1 billion objects: 260M galaxies, 260M stars

non-trivial analysis: currently impossible

SLIDE 40

With Your Help!

Maybe!