SLIDE 1

Recommender System Experiments with MyMediaLite

Or: Everything you always wanted to know about offline experiments* (*but were afraid to ask)

Zeno Gantner <zeno.gantner@nokia.com> Nokia Location & Commerce, Berlin

SLIDE 2

HERE Maps by Nokia … in Berlin

  • ca. 800 people
  • HERE Maps platform
    – mobile apps
      • HERE Drive
      • HERE Maps
      • HERE Transit (public transport)
    – customers
      • Yahoo Maps
      • Bing Maps
      • major car companies: BMW, VW, Toyota, ...

SLIDE 3

HERE Maps by Nokia … in Berlin

Maps Search Team

  • #bbuzz regulars
  • 3 of us contributed to Lucene 4.3.0 ;-)

http://2011.berlinbuzzwords.de/content/improving-search-ranking-through-ab-tests-case-study
http://2012.berlinbuzzwords.de/sessions/efficient-scoring-lucene
http://2012.berlinbuzzwords.de/sessions/introducing-cascalog-functional-data-processing-hadoop
http://2012.berlinbuzzwords.de/sessions/relevance-optimization-check-candidate-lists
https://issues.apache.org/jira/browse/LUCENE-4930
https://issues.apache.org/jira/browse/LUCENE-4571

SLIDE 4
(c) Paul L. Dineen; license: CC by; source: http://www.flickr.com/photos/pauldineen/4529216647/sizes/o/in/photostream/
SLIDE 5
SLIDE 6

+ = ?

SLIDE 7

Data + Software/Algorithms = ???

(c) Joon Han, license: CC by-sa 3.0, source: http://en.wikipedia.org/wiki/File:Groundhog_day_tip_top_bistro.jpg

(c) Diliff; license: CC by 3.0

Real-world deployments

SLIDE 8

Data mining competitions

SLIDE 9

Research

SLIDE 10

+ = ?

SLIDE 11

RecSys Experiments with MyMediaLite

  • 1. Interaction Data
  • 2. Baseline Methods
  • 3. Apples and Oranges
  • 4. Metrics
  • 5. Hyperparameter Tuning
  • 6. Reproducibility
SLIDE 12

Running Example: MyMediaLite

  • RecSys toolkit and evaluation framework
  • written in C#/Mono
  • usable from C#, Python, Ruby, F#
  • 2 Java ports (RapidMiner plugin)
  • regular releases (every 2-3 months) since 2010
  • simple
  • choice
  • free
  • documented
  • tested

http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite

SLIDE 13

Running Example: MyMediaLite

command-line tools

  • rating_prediction
  • item_recommendation

Find all examples here: http://github.com/zenogantner/mml-eval-examples

SLIDE 14
  • 1. Interaction Data

Explicit feedback: not always there.

Implicit feedback:

  • views
  • clicks
  • purchases

Often positive-only.

SLIDE 15
  • 1. Interaction Data

User ID   Item ID   Timestamp
196       242       881250949
186       302       891717742
22        377       878887116
244       51        880606923
...       ...       ...

item_recommendation --training-file=F1 --test-file=F2

  • IDs can be (almost) arbitrary strings
  • timestamps are optional
  • Separator: whitespace, tab, comma, ::
  • Alternative timestamp format: yyyy-mm-dd
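To make the format concrete, here is a minimal reader sketch in Python (read_interactions is a hypothetical helper, not part of MyMediaLite; it assumes whitespace separators, while MyMediaLite also accepts tab, comma, and ::):

def read_interactions(path):
    """Read (user, item, timestamp) triples; the timestamp column is optional."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()                 # whitespace-separated
            if not fields:
                continue                          # skip empty lines
            user, item = fields[0], fields[1]     # IDs kept as strings
            ts = int(fields[2]) if len(fields) > 2 else None
            rows.append((user, item, ts))
    return rows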

SLIDE 16

Random Splits

item_recommendation … --test-ratio=0.25

Shuffle and split: Simple, but:

  • Does not take temporal trends into account.
  • Does not use all data for testing.
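For illustration, a minimal shuffle-and-split sketch in Python (random_split is a hypothetical helper mirroring what --test-ratio=0.25 does, not MyMediaLite code; it assumes the interaction tuples from the reader sketch above):

import random

def random_split(interactions, test_ratio=0.25, seed=1):
    """Shuffle, then hold out a fraction of the data for testing."""
    data = list(interactions)
    random.Random(seed).shuffle(data)      # fixed seed => reproducible split
    n_test = int(len(data) * test_ratio)
    return data[n_test:], data[:n_test]    # (training set, test set)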
SLIDE 17

k-fold Cross-Validation

item_recommendation … --cross-validation=4

Shuffle and split:

  • Uses each data point for evaluation.
  • Does not take temporal trends into account.
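A minimal sketch of the k-fold idea in Python (hypothetical helper, not MyMediaLite's implementation; same interaction tuples as before):

import random

def k_fold_splits(interactions, k=4, seed=1):
    """Yield k (train, test) pairs; each point appears in exactly one test set."""
    data = list(interactions)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k disjoint folds
    for i in range(k):
        train = [x for j in range(k) if j != i for x in folds[j]]
        yield train, folds[i]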
SLIDE 18

Chronological Splits

rating_prediction … --chronological-split=0.25 rating_prediction … --chronological-split=01/01/2002

Sort chronologically and split:

  • Use the past to predict the “future”.
  • Takes trends in the data into account.

    – time of day, day of week
    – season
    – trending products
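A minimal sketch of a chronological ratio split in Python (hypothetical helper; assumes each interaction carries a timestamp in its third field):

def chronological_split(interactions, test_ratio=0.25):
    """Sort by time; train on the oldest interactions, test on the newest."""
    data = sorted(interactions, key=lambda x: x[2])   # sort by timestamp
    cut = len(data) - int(len(data) * test_ratio)
    return data[:cut], data[cut:]                     # (past, "future")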

SLIDE 19
(c) Serolillo, license: CC by 2.5
SLIDE 20
  • 2. Baseline Methods

Why compare against baselines?

  • Absolute numbers have no meaning.
    – … well, at least here.
  • Relative numbers may also have no meaning …
    – … if you compare to the wrong things.

Good baselines:

  • the strongest solution that is still simple
  • the existing solution
  • standard solutions
    – coll. filtering: kNN, vanilla matrix factorization

SLIDE 21
  • 2. Baseline Methods

item_recommendation … --recommender=Random
item_recommendation … --recommender=MostPopular
item_recommendation … --recommender=MostPopularByAttributes --item-attributes=ARTISTS

Item recommendation baselines:

  • random
  • popular items (by attribute/category)
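For intuition, a minimal sketch of the most-popular baseline in Python (hypothetical helper, not MyMediaLite's implementation):

from collections import Counter

def most_popular(train, k=10):
    """Rank items by interaction count; recommend the same top-k to every user."""
    counts = Counter(item for user, item, ts in train)
    return [item for item, _ in counts.most_common(k)]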
SLIDE 22
(c) Michael Collins; license: CC by 2.0
SLIDE 23
  • 3. Apples and Oranges

Always check if you measure on the same splits. It happens quite often …

SLIDE 24
  • 3. Apples and Oranges

Always check if you measure on the same splits. It happens quite often … e.g. this ICML 2013 paper:

SLIDE 25
  • 3. Apples and Oranges
SLIDE 26
  • 3. Apples and Oranges
  • On chronological splits of the Netflix dataset, matrix factorization (“SVD”) models usually do not perform below 0.9 RMSE.
  • Chronological splits can be much harder than random splits!

Lessons:

  • Baselines are important – they can also help us to “debug” experiments.
  • Do not compare between simple splits and chronological splits.

SLIDE 27
(c) Pastorius; license: CC by 3.0; source: http://commons.wikimedia.org/wiki/File:Plastic_tape_m
SLIDE 28
  • 4. Metrics

What is the right metric?

  • Know your goal.
    – It always depends on what you want to achieve.
    – What to measure?
  • Criticize your metrics.
    – They may ignore important aspects of your problem.
    – They are just approximations of user behavior.
  • Eyeball the results.
    – Your metrics may fail to catch WTF results.

http://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/

SLIDE 29
  • 4. Metrics

item_recommendation ... --measures="prec@5,NDCG"

Precision at k

  • fraction of “correct” items in the top k results (number of correct items divided by k)
  • The choice of k is specific to your application.
  • very simple
  • easy to understand and explain

More ranking measures: NDCG, MAP, ERR

SLIDE 30
  • 4. Metrics

Precision at k

Example: top-4 recommendations: bad, good, bad, bad
→ 1 “good” item in the top 4 → precision at 4 = 1/4
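The same computation as a minimal Python sketch (precision_at_k is a hypothetical helper, not MyMediaLite code):

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# The slide's example: one relevant ("good") item among four recommendations.
print(precision_at_k(["a", "g", "c", "d"], relevant={"g"}, k=4))  # 0.25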
SLIDE 31
  • 5. Hyperparameter Tuning

item_recommendation … --recommender=WRMF --recommender-options="reg=0.01 alpha=2"

  • Hyperparameters, e.g.
    – regularization to control overfitting
    – learning rate (for gradient descent methods)
    – stopping criterion
  • You have to do it. Also for your baselines.
  • Don't get too fancy.
    – Grid search will do it in most cases.
  • More advanced:
    – Nelder-Mead/Simplex

SLIDE 32
  • 5. Hyperparameter Tuning

rating_prediction … --search-hp

Grid search

  • simple
  • brute force
  • embarrassingly parallel

“A practical guide to SVM classification” http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
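A minimal brute-force grid search sketch in Python (train_model and evaluate are hypothetical callbacks you would supply; the grid values are made up for illustration):

import itertools

def grid_search(train, valid, train_model, evaluate):
    """Try every combination in a small grid; keep the best validation score."""
    grid = {"reg": [0.001, 0.01, 0.1], "alpha": [1, 2, 4]}
    best_score, best_params = None, None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid, values))
        model = train_model(train, **params)   # each grid point is independent,
        score = evaluate(model, valid)         # hence embarrassingly parallel
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_params, best_score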

SLIDE 33
  • 6. Reproducible Experiments

item_recommendation … --random-seed=1

Random seed

  • “random” splitting
  • training initialization
  • debugging
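To see why a fixed seed matters, a tiny Python check (illustrative only, mirroring what --random-seed=1 achieves in MyMediaLite):

import random

# Two shuffles with the same seed produce the same order,
# hence the same "random" split and the same initialization.
data = list(range(10))
a, b = list(data), list(data)
random.Random(1).shuffle(a)
random.Random(1).shuffle(b)
assert a == b   # reproducible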
SLIDE 34
  • 6. Reproducible Experiments

item_recommendation … --random-seed=1

Besides random seed:

  • Put everything in version control.
    – data, software
    – scripts and configuration
  • Use build tools like make for automation.
    – make knows when to re-run your data preprocessing steps.

http://bitaesthetics.com/posts/make-for-data-scientists.html

SLIDE 35
  • 6. Reproducible Experiments

item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"

Re-use evaluation code. Create predictions using external software. Use MyMediaLite for evaluation.

SLIDE 36
  • 6. Reproducible Experiments

item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"

Why re-use evaluation code?

  • Evaluation protocols (splitting + candidate selection + metrics) are not easy to get right.
  • Ensures comparability.
    – more configuration kept fixed => less risk of accidental differences
  • Laziness!
SLIDE 37
(c) by Caucas; license: CC by-nc-nd 2.0; source: http://www.flickr.com/photos/thecaucas/2597813380/sizes/o/
SLIDE 38

Summary

  • 1. Split your data appropriately.
  • 2. Do not compare apples and oranges.
  • 3. Compare against simple and strong baselines.
  • 4. Precision at k is a metric that is easy to explain.
  • 5. Grid search is a simple method for hyperparameter tuning.
  • 6. Make your experiments reproducible.
  • 7. MyMediaLite can help you with some of these things ;-). Try it out!

SLIDE 39

http://github.com/zenogantner/mml-eval-examples
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite

(c) Michael Sauers; license CC by-nc-sa 2.0