Apache Mahout Making data analysis easy Isabel Drost Nighttime: - - PowerPoint PPT Presentation

Apache Mahout Making data analysis easy Isabel Drost Nighttime: Co-Founder, committer Apache Mahout. Organiser of Berlin Hadoop Get Together. Daytime: Software developer. Guest lecturer at TU Berlin. Co-Organiser Berlin Buzzwords 2010.


slide-1
SLIDE 1

Apache Mahout

Making data analysis easy

slide-2
SLIDE 2

Isabel Drost

Nighttime:

Co-Founder, committer Apache Mahout. Organiser of Berlin Hadoop Get Together.

Daytime:

Software developer. Guest lecturer at TU Berlin. Co-Organiser Berlin Buzzwords 2010.

slide-3
SLIDE 3
  • “Mastering Data-Intensive Collaboration and Decision Making”
  • EU-funded research project
      – Number of partners: 8
      – Coordinator: Research Academic Computer Technology Institute (CTI), Greece

slide-4
SLIDE 4

Hello Devoxx!

slide-5
SLIDE 5

Hello Devoxx!

slide-6
SLIDE 6

Hello Devoxx!

slide-7
SLIDE 7

Hello Devoxx!

slide-8
SLIDE 8

Hello Devoxx!

slide-9
SLIDE 9

Hello Devoxx!

Machine learning background?

slide-10
SLIDE 10

Hello Devoxx!

slide-11
SLIDE 11

Agenda

  • Data Mining/ Machine Learning?
  • Why is scaling hard?
  • Going beyond simple statistics.
slide-12
SLIDE 12

Data Mining Applications

  • Marketing.
  • Surveillance.
  • Fraud Detection.
  • Scientific Discovery.
  • Discover items usually purchased together.

= Extracting patterns from data.

slide-13
SLIDE 13

Machine Learning Applications

  • E-Mail spam classification.
  • News-topic discovery.
  • Building recommender systems.

= Extracting prediction models from data.

slide-14
SLIDE 14

Machine learning – what's that?

slide-15
SLIDE 15

Archimedes taking a Warm Bath

Image by John Leech, from: The Comic History of Rome by Gilbert Abbott A Beckett. Bradbury, Evans & Co, London, 1850s.

slide-16
SLIDE 16

Archimedes model of nature

slide-17
SLIDE 17

June 25, 2008 by chase-me http://www.flickr.com/photos/sasy/2609508999

slide-18
SLIDE 18

An SVM's model of nature

slide-19
SLIDE 19

The challenge

slide-20
SLIDE 20

Mission

Provide scalable data mining algorithms.

slide-21
SLIDE 21

http://www.flickr.com/photos/honou/2936937247/

slide-22
SLIDE 22

HowTo: From data to information.

slide-23
SLIDE 23

January 3, 2006 by Matt Callow http://www.flickr.com/photos/blackcustard/81680010

slide-24
SLIDE 24

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

slide-25
SLIDE 25

http://www.flickr.com/photos/disowned/1158260369/

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under way at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and matrix operations.

slide-26
SLIDE 26

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/in/photostream/ http://www.flickr.com/photos/noodlepie/2675987121/

http://www.flickr.com/photos/topsy/204929063/

slide-27
SLIDE 27

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

slide-28
SLIDE 28

From data to information.

  • Collect data and define your learning problem.
  • Data preparation.
  • Training a prediction model.
  • Checking the performance of your model.
slide-29
SLIDE 29
slide-30
SLIDE 30
  • Remove noise.
slide-31
SLIDE 31
  • Remove noise.
  • Convert text to vectors.
slide-32
SLIDE 32

From texts to vectors

slide-33
SLIDE 33

Sunny weather High performance computing

If we looked at two words only:

slide-34
SLIDE 34

Aaron Zuse

slide-35
SLIDE 35

Binary bag of words

  • Imagine an n-dimensional space.
  • Each dimension = one possible word in texts.
  • Entry in vector is one, if word occurs in text.
  • Problem:
  • Number of word occurrences not accounted for.

\[ b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{otherwise} \end{cases} \]
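
The binary encoding above can be sketched in a few lines of plain Python. This is only an illustration, not Mahout code: the function name `binary_bag_of_words`, the whitespace tokenizer, and the sample texts are assumptions made for the example.

```python
def binary_bag_of_words(documents):
    """Return (vocabulary, vectors); vectors[j][i] is 1 if word i occurs in document j."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per distinct word seen in any text.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors


docs = ["sunny weather today", "high performance computing", "sunny sunny weather"]
vocab, vecs = binary_bag_of_words(docs)
for doc, vec in zip(docs, vecs):
    print(vec, doc)
```

The third text mentions "sunny" twice, yet its entry is still 1, which is exactly the problem named on the slide.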

slide-36
SLIDE 36

Term Frequency

  • Imagine an n-dimensional space.
  • Each dimension = one possible word in texts.
  • Entry in vector equal to the word's frequency.
  • Problem:
  • Common words dominate vectors.

\[ b_{i,j} = n_{i,j} \]
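
A minimal sketch of term-frequency vectors, again in plain Python rather than Mahout. The name `term_frequency_vectors` and the sample sentences are made up for illustration; tokenization is a simple whitespace split.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors); vectors[j][i] is the count n_{i,j} of word i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["the sun and the weather", "the cluster and the network"]
vocab, vecs = term_frequency_vectors(docs)
print(vocab)
print(vecs)
```

In both vectors the largest entry belongs to "the", which illustrates how common words dominate.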

slide-37
SLIDE 37

TF with stop wording

  • Imagine an n-dimensional space.
  • Each dimension = one possible word in texts.
  • Filter stopwords.
  • Entry in vector equal to the word's frequency.
  • Problem:
  • Common and uncommon words with same weight.

\[ b_{i,j} = n_{i,j} \]
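
Stopword filtering can be sketched as a small preprocessing step before counting. The stopword set below is a tiny hand-picked assumption for the example; real systems use much longer, language-specific lists.

```python
from collections import Counter

# Assumption: an illustrative stopword list, far shorter than a real one.
STOPWORDS = {"the", "and", "a", "of", "is"}


def tf_without_stopwords(document, stopwords=STOPWORDS):
    """Count word frequencies after dropping stopwords."""
    tokens = [t for t in document.lower().split() if t not in stopwords]
    return Counter(tokens)


tf = tf_without_stopwords("the sun and the weather")
print(dict(tf))
```

The remaining words all carry the same weight regardless of how rare they are across the collection, which is the problem the next weighting scheme addresses.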

slide-38
SLIDE 38

TF- IDF

  • Imagine an n-dimensional space.
  • Each dimension = one possible word in texts.
  • Filter stopwords.
  • Entry in vector equal to the weighted frequency.
  • Problem:
  • Long texts get larger values.

\[ b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{d : t_i \in d\}|} \]
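
The TF-IDF weighting can be written out directly from the formula: raw count times the logarithm of (number of documents / number of documents containing the term). The sketch below assumes already tokenized, stopword-filtered input; the function name `tf_idf_vectors` and the toy documents are illustrative, and this is not Mahout's implementation.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with entries n_{i,j} * log(|D| / df_i)."""
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append(
            [counts[t] * math.log(num_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


docs = [
    ["sunny", "weather"],
    ["sunny", "computing"],
    ["sunny", "computing", "computing"],
]
vocab, vecs = tf_idf_vectors(docs)
print(vocab)
for vec in vecs:
    print([round(x, 3) for x in vec])
```

"sunny" occurs in every document, so its weight is zero everywhere, while the rarer "weather" gets the largest weight per occurrence.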

slide-39
SLIDE 39

Normalized TF- IDF

  • Imagine an n-dimensional space.
  • Each dimension = one possible word in texts.
  • Filter stopwords.
  • Entry in vector equal to the weighted frequency.
  • Normalize vectors.
  • Problem:
  • Additional domain knowledge ignored.

\[ b_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{d : t_i \in d\}|} \]

slide-40
SLIDE 40

Reality

  • There are a few more words in news.
  • Use all relevant features/ signals available.
  • Words.
  • Header fields.
  • Characteristics of publishing url.
  • Usually pipeline of feature extractors.
slide-41
SLIDE 41

From data to information.

  • Collect data and define your learning problem.
  • Data preparation.
  • Training a prediction model.
  • Checking the performance of your model.
slide-42
SLIDE 42

Step 2: Similarity

slide-43
SLIDE 43

Euclidean

slide-44
SLIDE 44

Euclidean

slide-45
SLIDE 45

Euclidean Cosine
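
The two measures named on these slides can be compared with a short sketch. The vectors `short_text` and `long_text` are invented for the example; they point in the same direction but differ in length.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short_text = [1.0, 2.0]
long_text = [10.0, 20.0]  # same word proportions, ten times the counts
print(euclidean_distance(short_text, long_text))
print(cosine_similarity(short_text, long_text))
```

Euclidean distance treats the two texts as far apart, while cosine similarity, which only looks at the angle, treats them as essentially identical.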

slide-46
SLIDE 46

Step 3: Clustering

slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51

Until stable.
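
The picture sequence on the preceding slides is not reproduced in this transcript. Assuming it shows a k-means-style loop (assign each point to its nearest centre, move each centre to the mean of its points, repeat until stable), a minimal single-machine sketch could look like this. Taking the first k points as seeds is a simplification chosen to keep the example deterministic; it is not how Mahout selects seeds.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Return (centroids, assignments) after iterating until assignments stop changing."""
    # Assumption: the first k points serve as initial seeds.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # stable: nothing moved between clusters
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```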

slide-52
SLIDE 52

Reality

  • Seed selection.
  • Choice of initial k.
  • Continuous updates.
  • Regular addition of clusters.
slide-53
SLIDE 53

From data to information.

  • Collect data and define your learning problem.
  • Data preparation.
  • Training a prediction model.
  • Checking the performance of your model.
slide-54
SLIDE 54

Evaluation

  • Compare against gold standard.
  • Use quality measures.
  • Manual inspection.
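
As one example of comparing a result against a gold standard, the sketch below computes cluster purity: for each cluster, count the most frequent gold label, then divide the sum by the number of items. Purity is chosen here purely for illustration; the slides do not name a specific quality measure, and the labels are invented.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(gold_labels)


clusters = [0, 0, 0, 1, 1]
gold = ["sports", "sports", "politics", "politics", "politics"]
score = purity(clusters, gold)
print(score)
```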
slide-55
SLIDE 55

From data to information.

  • Collect data and define your learning problem.
  • Data preparation.
  • Training a prediction model.
  • Checking the performance of your model.
slide-56
SLIDE 56

http://www.flickr.com/photos/generated/943078008/

slide-57
SLIDE 57
slide-58
SLIDE 58

What else does Mahout have to offer?

slide-59
SLIDE 59

Identify dominant topics

  • Given a dataset of texts, identify main topics.
  • Examples:
  • Dominant topics in set of mails.
  • Identify news message categories.

Algorithms: Parallel LDA

slide-60
SLIDE 60

Assign items to defined categories.

  • Given pre-defined categories, assign items to them.
slide-61
SLIDE 61

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

slide-62
SLIDE 62
slide-63
SLIDE 63
slide-64
SLIDE 64

Recommendation mining.

  • Collaborative filtering.
slide-65
SLIDE 65

Show most relevant ads

slide-66
SLIDE 66

Show most relevant ads

slide-67
SLIDE 67

http://www.flickr.com/photos/alainpicard/4175214747

http://www.flickr.com/photos/25831000@N08/4156701164

http://www.flickr.com/photos/jfclere/4061801735 http://www.flickr.com/photos/claudio_ar/2643165035/

http://www.flickr.com/photos/claudio_ar/2643180457

Thanks to Falko Menge for the pictures of Brussels.

http://www.flickr.com/photos/joachim_s_mueller/2417313476/ http://www.flickr.com/photos/sebastian_bergmann/1244514498 http://www.flickr.com/photos/philfotos/4510197138/

Recommending places

slide-68
SLIDE 68

Recommending people

slide-69
SLIDE 69

Recommendation mining.

  • Online collaborative filtering on single machine.
  • Offline Map/Reduce based version.
  • Content similarity can be integrated.
  • Based on former Taste project.
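
To make the idea of collaborative filtering concrete, here is a small user-based sketch in plain Python. It does not use the Mahout/Taste API; the function names, the cosine weighting, and the rating data are all assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


ratings = {
    "alice": {"hadoop": 5, "lucene": 4},
    "bob": {"hadoop": 5, "lucene": 4, "mahout": 5},
    "carol": {"solr": 2, "knitting": 5},
}
print(recommend(ratings, "alice"))
```

Carol shares no rated items with Alice, so only Bob's ratings contribute to the recommendation.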
slide-70
SLIDE 70

Frequent pattern mining

  • Given groups of items, find commonly co-occurring items.
  • Examples:
  • In shopping carts find items bought together.
  • In query logs find queries issued in one session.
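
The shopping-cart example can be sketched by brute-force counting of item pairs and keeping those that reach a minimum support. This is a simplification for illustration, not the algorithm Mahout ships; `frequent_pairs`, `min_support`, and the carts are made-up names and data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: count} for item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so each pair has one canonical form per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


carts = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
pairs = frequent_pairs(carts, min_support=2)
print(pairs)
```

Counting all pairs grows quadratically with cart size, which is one reason dedicated frequent-pattern algorithms exist for large datasets.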
slide-71
SLIDE 71

By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/ By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/

slide-72
SLIDE 72

By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/ By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/ By quinnanya, http://www.flickr.com/photos/quinnanya/2806883231/

slide-73
SLIDE 73

March 14, 2009 by Artful Magpie http://www.flickr.com/photos/kmtucker/3355551036/

Requirements to get started

slide-74
SLIDE 74
slide-75
SLIDE 75
slide-76
SLIDE 76
slide-77
SLIDE 77
slide-78
SLIDE 78

Why go for Apache Mahout?

slide-79
SLIDE 79

Jumpstart your project with proven code.

January 8, 2008 by dreizehn28 http://www.flickr.com/photos/1328/2176949559

slide-80
SLIDE 80

Discuss ideas and problems online.

November 16, 2005 [phil h] http://www.flickr.com/photos/hi-phi/64055296

slide-81
SLIDE 81
slide-82
SLIDE 82

Become a committer.

slide-83
SLIDE 83

Become a committer: Of Apache Mahout

Sebastian Schelter, Jake Mannix, Benson Margulies, Robin Anil, David Hall, AbdelHakim Deneche, Karl Wettin, Sean Owen, Grant Ingersoll, Otis Gospodnetic, Drew Farris, Jeff Eastman, Ted Dunning, Isabel Drost

Emeritus: Niranjan Balasubramanian, Erik Hatcher, Ozgur Yilmazel, Dawid Weiss

slide-84
SLIDE 84

*-user@mahout.apache.org
*-dev@mahout.apache.org

  • Interest in solving hard problems.
  • Being part of lively community.
  • Engineering best practices.
  • Bug reports, patches, features.
  • Documentation, code, examples.

Image by: Patrick McEvoy

slide-85
SLIDE 85
slide-86
SLIDE 86

Thanks to Tim Lossen et al. for taking amazing pictures of the conf.

slide-87
SLIDE 87

Berlin Buzzwords 2011

Search/ Store/ Scale

May/ June 2011

Thanks to Tim Lossen et al. for taking amazing pictures of the conf.

slide-88
SLIDE 88

*-user@mahout.apache.org
*-dev@mahout.apache.org

  • Interest in solving hard problems.
  • Being part of lively community.
  • Engineering best practices.
  • Bug reports, patches, features.
  • Documentation, code, examples.

Image by: Patrick McEvoy

slide-89
SLIDE 89