New approaches for evaluation: correctness and freshness


  1. New approaches for evaluation: correctness and freshness. Pablo Sánchez, Rus M. Mesas, Alejandro Bellogín. Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. V Congreso Español de Recuperación de Información (CERI 2018)

  2-3. Outline: 1. Recommender Systems, 2. Freshness, 3. Correctness, 4. Experiments, 5. Conclusions and future work

  4-7. Recommender Systems: suggest new items to users based on their tastes and needs. Measure the quality of recommendations. How? Several evaluation dimensions: Error, Ranking, Novelty / Diversity. We will focus on Freshness and Correctness (from Sánchez and Bellogín (2018); Mesas and Bellogín (2017)).

  8-12. Different notions of quality [figure: three recommendation lists R1, R2, R3 compared on a coverage scale from 0 to 100]. Best in Relevance? R2 > R1 > R3. Best in Novelty? R1 > R3 > R2. Best in Freshness? R3 > R1 > R2. Best in Coverage-Relevance tradeoff? R1 > R3 > R2?? or R1 > R2 > R3??

  13. Outline: 1. Recommender Systems, 2. Freshness, 3. Correctness, 4. Experiments, 5. Conclusions and future work

  14-17. Preliminaries: framework proposed in Vargas and Castells (2011):
       $m(R_u \mid \theta) = C \sum_{i_n \in R_u} \mathrm{disc}(n)\, p(\mathrm{rel} \mid i_n, u)\, \mathrm{nov}(i_n \mid \theta)$   (1)
     where R_u is the list of items recommended to user u, θ is a contextual variable (e.g., the user profile), disc(n) is a discount model (e.g., an nDCG-style discount), p(rel | i_n, u) is the relevance component, and nov(i_n | θ) is the novelty model. With this framework we can derive multiple metrics; however, all of them are time-agnostic. We propose to replace the novelty component, defining new time-aware novelty models nov(i_n | θ_t).
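
  Illustrative sketch (not from the slides): a minimal Python rendering of a metric in this framework. The names framework_metric, disc, p_rel and nov, and the example discount, are assumptions made for the example only.

      import math

      def framework_metric(recommended, user, disc, p_rel, nov, C=1.0):
          """Eq. (1): m(R_u | theta) = C * sum_n disc(n) * p(rel | i_n, u) * nov(i_n | theta)."""
          return C * sum(
              disc(n) * p_rel(item, user) * nov(item)
              for n, item in enumerate(recommended, start=1)
          )

      # Hypothetical instantiation: a log2 rank discount with constant relevance and novelty.
      score = framework_metric(
          recommended=["i1", "i2", "i3"],
          user="u1",
          disc=lambda n: 1.0 / math.log2(n + 1),
          p_rel=lambda item, u: 1.0,
          nov=lambda item: 1.0,
      )

  Plugging a time-aware nov(i_n | θ_t) into the same skeleton, as proposed above, yields a freshness metric.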

  18-22. Time-Aware Novelty Metrics: classic metrics do not provide any information about the evolution of the items, so we may recommend relevant but well-known (old) items. Every item in the system can be modeled with a temporal representation:
       $\theta_t = \{\theta_t(i)\} = \{(i, \langle t_1(i), \ldots, t_n(i) \rangle)\}$   (2)
     There are two different sources for the timestamps: metadata information (release date of movies or songs, creation time, etc.) and the rating history of the items; a sketch of the second option follows.
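
  Illustrative sketch (not from the slides): building the temporal representation of eq. (2) from a rating log. The (user, item, rating, timestamp) tuple format and the function name are assumptions made for the example.

      from collections import defaultdict

      def build_item_time_profiles(interactions):
          """Eq. (2): map each item i to its chronologically sorted timestamps <t_1(i), ..., t_n(i)>."""
          theta_t = defaultdict(list)
          for _user, item, _rating, timestamp in interactions:
              theta_t[item].append(timestamp)
          return {item: sorted(times) for item, times in theta_t.items()}

      # Hypothetical usage:
      log = [("u1", "i1", 4, 100), ("u2", "i1", 5, 300), ("u1", "i2", 3, 200)]
      profiles = build_item_time_profiles(log)   # {"i1": [100, 300], "i2": [200]}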

  23. Time-Aware Novelty Metrics [figure]

  24-27. Modeling time profiles for items: how can we aggregate the temporal representation? We explored four possibilities: take the first interaction (FIN), take the last interaction (LIN), take the average of the rating times (AIN), or take the median of the rating times (MIN). Each case defines a function f(θ_t(i)); see the sketch below.
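
  Illustrative sketch (not from the slides): the four aggregation functions f(θ_t(i)), assuming θ_t(i) is an item's chronologically sorted timestamp list as built in the previous sketch.

      from statistics import mean, median

      # One aggregation per time-profile model.
      AGGREGATORS = {
          "FIN": lambda times: times[0],    # first interaction
          "LIN": lambda times: times[-1],   # last interaction
          "AIN": mean,                      # average of the interaction times
          "MIN": median,                    # median of the interaction times
      }

      def item_time(times, model):
          """f(theta_t(i)) for one of FIN, LIN, AIN, MIN."""
          return AGGREGATORS[model](times)

      # e.g. item_time([100, 300, 400], "MIN") -> 300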

  28. Modeling time profiles for items: an example [figure: interaction timelines for several items]

  29. Modeling time profiles for items: an example. Which model better represents the freshness of the items? FIN? i2 > i10 > i9 > i1. LIN? i9 > i1 > i10 > i2. MIN? i10 > i2 > i9 > i1. AIN? i9 > i10 > i2 > i1.

  30. Outline: 1. Recommender Systems, 2. Freshness, 3. Correctness, 4. Experiments, 5. Conclusions and future work

  31-33. Motivation. Goal: balancing coverage and precision. Some researchers (Herlocker et al. (2004); Gunawardana and Shani (2015)) have warned that this is still an open problem in Recommender Systems evaluation. Typical situation: recommendations with low confidence should not be presented to the user, so coverage is reduced in exchange for (potentially) more relevant recommendations.

  34-37. Our proposal: Correctness metrics. Adapted from Question Answering (Peñas and Rodrigo (2011)): each question has several options but only one answer is correct, and if an answer is not given it should not be counted as incorrect (the algorithm decided not to answer). Applied to recommenders: if two systems recommend the same number of relevant items but one of them retrieves fewer items overall, it should be considered better than the other.

  38-39. Our proposal: Correctness metrics, based on users:
       $\mathrm{User\ Correctness}(u) = \frac{1}{N}\left(TP(u) + TP(u)\,\frac{NR(u)}{N}\right)$   (3)
       $\mathrm{Recall\ User\ Correctness}(u) = \frac{1}{|T(u)|}\left(TP(u) + TP(u)\,\frac{NR(u)}{N}\right)$   (4)
     where TP(u) is the number of relevant items recommended to user u, FP(u) is the number of non-relevant items recommended to user u, N is the cutoff, NR(u) = N - (TP(u) + FP(u)), and |T(u)| is the number of relevant items in the test set of user u.
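
  Illustrative sketch (not from the slides): per-user computation of eqs. (3) and (4) as reconstructed above; the function names are assumptions, and the recall variant follows that reconstruction (normalising by |T(u)| instead of N).

      def user_correctness(tp, fp, n_cutoff):
          """Eq. (3): precision-style correctness that gives partial credit for unfilled slots."""
          nr = n_cutoff - (tp + fp)            # NR(u): positions the system chose not to fill
          return (tp + tp * nr / n_cutoff) / n_cutoff

      def recall_user_correctness(tp, fp, n_cutoff, n_test_relevant):
          """Eq. (4): recall-style variant, normalised by |T(u)| instead of the cutoff N."""
          nr = n_cutoff - (tp + fp)
          return (tp + tp * nr / n_cutoff) / n_test_relevant

  A system that recommends fewer items (larger NR(u)) while keeping the same TP(u) scores higher, which matches the intuition on the previous slide.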

  40. Outline (current section: Experiments): 1. Recommender Systems, 2. Freshness, 3. Correctness, 4. Experiments, 5. Conclusions and future work

  41. Freshness results: Are the recommendations obtained by the different algorithms temporally novel (fresh)? Do the different novelty models produce similar results?
