Wrapping It Up Jilles Vreeken 31 July 2015 What did we do? - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Wrapping It Up

Jilles Vreeken

31 July 2015

slide-2
SLIDE 2

What did we do?

Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-me-anything> (Subjective) Interestingness

slide-3
SLIDE 3

Overview of the hot topics in data mining that Jilles thinks are cool

strongly biased sample – by interest and available time

I wanted to give a general picture of what data mining is, what makes it special, and what’s currently happening at the edge of human knowledge

Take Home: Overall

slide-4
SLIDE 4

Data mining is descriptive, not predictive: the goal is to give you insight into your data and to offer (parts of) candidate hypotheses. What you do with those is up to you.

Key Take-Home Message

slide-5
SLIDE 5

Exploratory data analysis: wandering around your data, looking for interesting things, without being asked questions you cannot know the answer to.

Questions like:

  • What distribution should we assume?
  • How many clusters/factors/patterns do you want?
  • Please parameterize this Bayesian network?

Take Home: Information Theory

slide-6
SLIDE 6

Pattern mining aims to provide simple descriptions of the structures that your data exhibits locally. Mining patterns is easy. Mining interesting patterns that are significant, and doing so without redundancy, not so much.

Take Home: Patterns
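
To make the redundancy point concrete, here is a minimal Python sketch that is not taken from the lecture. It enumerates frequent itemsets by brute force on a made-up toy database, then keeps only the closed itemsets (those with no proper superset of equal support), which is one standard way to reduce redundancy. The function names, the toy transactions, and the choice of closedness as the filter are assumptions for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at least
    min_support transactions (brute force, fine for toy data only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: if no itemset of this size is
            # frequent, no larger itemset can be frequent either.
            break
    return result


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with equal support."""
    return {
        itemset: support
        for itemset, support in frequent.items()
        if not any(itemset < other and support == frequent[other]
                   for other in frequent)
    }


# Hypothetical toy database of four transactions.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"},
)]

frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
print(len(frequent), "frequent itemsets,", len(closed), "closed")
```

Even on four transactions the frequent set is more than twice the size of the closed set, and the gap grows quickly on real data. Closedness only removes exact redundancy; it says nothing about significance or interestingness.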

slide-7
SLIDE 7

Information Theory is a branch of statistics, concerned with measuring information.

  • information = reduction of uncertainty
  • uncertainty can be quantified in bits
  • everything new you learn about your data allows you to compress it better

Take Home: Information Theory
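
As a small illustration of "information = reduction of uncertainty", the following Python sketch computes Shannon entropy in bits from observed frequencies, and the entropy that remains once a second variable is known. It is a plug-in estimate on toy data; the function names and the example values are assumptions, not material from the slides.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy_bits(xs, ys):
    """H(X | Y): average remaining uncertainty about X once Y is known."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy_bits(g) for g in groups.values())


coin = ["H", "T", "H", "T"]
print(entropy_bits(coin))            # 1.0 bit for a fair coin

xs = [0, 1, 2, 3]                    # four equally likely outcomes
ys = [0, 0, 1, 1]                    # the high bit of each outcome
h_x = entropy_bits(xs)               # 2.0 bits
h_x_given_y = conditional_entropy_bits(xs, ys)   # 1.0 bit
print(h_x - h_x_given_y)             # learning ys removed 1.0 bit
```

Knowing `ys` removes one bit of uncertainty about `xs`, which is exactly the number of bits an ideal code would save per symbol when it is allowed to condition on `ys`.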

slide-8
SLIDE 8

Correlations can be spurious and deceiving.

Mutual information (MI) is a strong notion of interaction.

  • Based on Shannon entropy, MI is hard to compute for continuous-valued data without making assumptions about the distribution.
  • Based on cumulative entropy, MI can detect non-linear correlations without requiring such assumptions.

Take Home: Correlations
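
The sketch below shows, on a tiny discrete example, how mutual information picks up a non-linear dependency that Pearson correlation misses. It uses the plug-in Shannon estimate I(X;Y) = H(X) + H(Y) - H(X,Y), which is only suitable for discrete data; it does not implement the cumulative-entropy variant mentioned on the slide. All names and values are assumptions for illustration.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete data."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to stay stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]             # y is fully determined by x, non-linearly

r = pearson(xs, ys)                  # zero: no linear relationship
mi = mutual_information_bits(xs, ys) # equals H(Y), since y is a function of x
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits")
```

On real-valued samples every observation tends to be unique, so this plug-in estimate degenerates; that is the practical difficulty the slide refers to.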

slide-9
SLIDE 9

Causality is a difficult concept. Standard probabilistic approaches based on likelihood cannot detect the causal direction between a pair of variables. Additive noise models and information-theoretic measures can.

Oh, and storks cause babies.

Take Home: Causation

slide-10
SLIDE 10

Interestingness is ultimately subjective

Still, to have algorithms that can find potentially interesting things we somehow need to formalize it

Take Home: Interestingness

slide-11
SLIDE 11

Most graph mining approaches are global and predictive: ‘Explain everything in one go’. Real graphs are too complex for that.

Taking a local and descriptive approach allows for more detailed results, richer problems, easier formalization, and efficient solutions.

Very little done so far, many cool open problems.

Take Home: Graph Mining

slide-12
SLIDE 12

Data is rarely static, even though many algorithms expect that. Streaming algorithms work when data is too big to fit anywhere, while dynamic algorithms aim to adjust the answer as the data changes.

Take Home: Dynamic Data
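
The two settings can be illustrated with two textbook techniques, sketched here in Python; neither is claimed to be what the lecture covered. Reservoir sampling stands in for the streaming setting (one pass, memory bounded by the sample size), and an incrementally maintained mean stands in for the dynamic setting (the answer is updated as points arrive and leave, without recomputing from scratch). The names `reservoir_sample` and `RunningMean` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Uniform random sample of k items from a stream of unknown length,
    using one pass and O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # keep the new item with probability k/(i+1)
    return sample


class RunningMean:
    """Mean that is adjusted in O(1) when a value is added or removed."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def add(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def remove(self, x):
        # Assumes x was added earlier; this sketch does not verify that.
        if self.n == 0:
            raise ValueError("nothing to remove")
        if self.n == 1:
            self.n, self.mean = 0, 0.0
        else:
            self.mean = (self.mean * self.n - x) / (self.n - 1)
            self.n -= 1


sample = reservoir_sample(range(1000), k=5, rng=random.Random(0))

running = RunningMean()
for value in (1, 2, 3, 6):
    running.add(value)
mean_before = running.mean           # mean of 1, 2, 3, 6
running.remove(6)
mean_after = running.mean            # mean of 1, 2, 3
```

The sample never holds more than `k` items regardless of stream length, and `RunningMean` never revisits old data; those are the defining constraints of the two settings.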

slide-13
SLIDE 13

“What the hell was he thinking??”

I wanted you to learn

  • to read scientific papers without getting lost in details
  • to quickly form high-level pictures of complex ideas
  • to read critically, seeing through scientific sales pitches
  • to show independent thinking, and make ideas your own

I was not disappointed.

Take Home: Assignments

slide-14
SLIDE 14

Data analysis

  • is important, upcoming, but still very young
  • aims to tackle impossible problems, such as finding interesting things in enormous search spaces
  • is a weird mix of theory and practice: likes to be foundational, yet not afraid of ad hoc

and, not unimportant, it’s lots of fun.

Take Home: TADA

slide-15
SLIDE 15

The Exam

type: oral
when: August 3rd and 4th
time: individual
where: E1.7 room 3.01
what: all material discussed in the lectures, plus one assignment (your choice) per topic

The Re-Exam

type: oral
when: September 28th
time: individual
where: E1.7 room 3.01

Exam dates

slide-16
SLIDE 16

“Class should end in time :)”
“The amount of time necessary for every assignment.”

Evaluation: I did not like

slide-17
SLIDE 17

“More motivated slide” More details on the why?
“Bit heavy course for 5 ECTS” Yes.
“More practical follow-up to implement/text ideas” Maybe…
“Discuss assignments the day it is brought online” We can do that.

Evaluation: Suggestions

slide-18
SLIDE 18

Master thesis projects

  • in principle: yes!
  • in practice: depends on background, motivation, interests, and grades, plus on whether I have time
  • interested? mail me

Student Research Assistant (HiWi) positions

  • in principle: maybe…
  • in practice: depends on background, grades, and in particular your motivation and interests
  • interested? mail me, include CV and grades

Things to do

slide-19
SLIDE 19

Sample Topics

Graphs

  • characterising viruses
  • realistic graph generators
  • interesting subgraphs
  • comparing graphs

Causality

  • did X cause Y?
  • mining causal graphs
  • what’s the cause of this?
  • predicting the future

Useful Patterns

  • tell me… about this
  • privacy & data generation
  • pattern-based indexing
  • noise reduction

Rich Data & Text

  • pattern-based topic models
  • grammar & compression
  • rich MaxEnt modelling
  • outliers in rich data
slide-20
SLIDE 20

Good Reads

The Information, James Gleick
(great light reading)

Elements of Information Theory, Thomas Cover & Joy Thomas
(very good textbook)

Data Analysis: a Bayesian Tutorial, D.S. Sivia & J. Skilling
(very good, but skip the MaxEnt stuff)

slide-21
SLIDE 21

Well, ok… let me advertise

Information Retrieval and Data Mining

together with Gerhard Weikum
Core Lecture, 9 ECTS

In addition, Hoang Vu and Mario will likely teach one or two courses next semester

Options include:

  • Causal Inference (seminar+lectures)
  • Mining High Dimensional Data (seminar+lectures)
  • Mining (Correlated) Patterns (seminar+lectures)

Teach us More!

slide-22
SLIDE 22

Question Time!

slide-23
SLIDE 23

“What is your opinion on privacy preserving data mining? Have you ever worked with it? Do you think it is useful, or does it somehow contradicts 'the spirit' of data mining?”

Privacy & Data Mining

slide-24
SLIDE 24

“Have you ever worked with text mining? Do you think considering grammar is necessary, or is mere statistics enough?”

Text Mining

slide-25
SLIDE 25

“Does Big Data exist?”
“How big is Big Data?”
“When is the data Big enough? Is more data always better?”

Big Data

slide-26
SLIDE 26

MapReduce, Hadoop, Bigtable, Cassandra, Spark, Dremel, etc., etc.: engineering or science?

Essentially tricks – not magic – that work well for certain specific problems

For KDD 2014, at least 25 out of 150 presentations will be specifically aimed at ‘large scale’ stuff

Mining Massive Data

slide-27
SLIDE 27

“How about data analytics in the cloud?”

Mining the Cloud

slide-28
SLIDE 28

Many, many, many papers about social network analysis.

So far: lots of statistics, not much ‘mining’. That is, most are about how to model a graph probabilistically, how to fit a given distribution.

The Elephant in the Room: what is the ‘graph’ distribution? Nobody knows. Yet.

Social Network Analysis

slide-29
SLIDE 29

Your Question Here!

slide-30
SLIDE 30

Conclusions

This concludes TADA’15. I hope you enjoyed the ride.

slide-31
SLIDE 31


Thank you!