Understanding the Implications of Recommender Systems on Our Views and Behaviors
Gedas Adomavicius, University of Minnesota
Joint work with Jesse Bockstedt, Shawn Curley, Jingjing Zhang
Recommender Systems: Feedback Loop
Recommender System (consumer preference estimation) ⇄ Consumer (preference, purchasing, consumption):
– Predicted ratings flow from the system to the consumer, expressing recommendations for unknown items
– Actual ratings flow from the consumer back to the system, expressing preferences for consumed items
– Accuracy is evaluated by comparing predicted ratings against actual ratings
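The loop can be made concrete with a small simulation. This is a minimal sketch, assuming a simple "pull toward the anchor" bias model; all function names and parameter values are illustrative assumptions, not the studies' actual procedure.

```python
import random

def true_preference():
    """Draw a consumer's unbiased preference on a 1-5 scale."""
    return random.uniform(1, 5)

def biased_rating(preference, shown_prediction, pull=0.3):
    """Submitted rating drifts toward the displayed prediction
    (anchoring); pull=0 would mean no bias."""
    return preference + pull * (shown_prediction - preference)

def feedback_loop(rounds=5, initial_prediction=4.5):
    """Each round, the system re-estimates preferences from the
    (biased) ratings it collected and shows that as the next
    prediction -- biased input contaminates the next estimate."""
    prediction = initial_prediction
    for i in range(rounds):
        ratings = [biased_rating(true_preference(), prediction)
                   for _ in range(1000)]
        prediction = sum(ratings) / len(ratings)  # naive re-estimation
        print(f"round {i + 1}: system estimate = {prediction:.2f}")

feedback_loop()
```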
– Think of the last two digits of your social security number
– Now bid on products…
– People with higher social security numbers made bids 60-120% higher
[Example] Participant with SSN ending 14: estimate 45; participant with SSN ending 86: estimate 67
– First: establishment of anchoring and adjustment as leading to biases in judgment
– Second: psychological explanations for anchoring effects (Russo 2010), e.g., anchors suggesting a distribution of uncertain values, or anchors taken as hints toward the correct answer ("trust" in the system)
– Third: anchoring in real-world contexts, e.g., Galinsky & Murnighan (2006) investigate anchoring effects in auctions
[Illustration] Two recommendations ("We think you'll like it") for the same item, whose unbiased preference is 3.0/5, anchor the stated preference rating at 2.5/5 (low recommendation) or 3.5/5 (high recommendation)
| | Cosley et al. (2003) | Our Prior Studies |
|---|---|---|
| Setting | Recommender systems | Recommender systems |
| Type of task | Preference (no objective standard) | Preference and willingness-to-pay (no objective standard) |
| Stimuli | Multiple movies | Single/multiple TV shows, jokes, songs |
| Recommendations | System-based | System-based, plus artificially generated |
| Manipulations | Two: High vs. Low | Multiple: High vs. Low; also a range of manipulations |
| Timing (process implications) | Retrospective (retrieval; uncertainty) | Point of consumption (integrating & responding; no uncertainty) |
| Explanations | None | Directly (timing, perceived-reliability hypotheses) and indirectly provide evidence relative to possible explanations that have been posited for anchoring |
– Significant observed anchoring effect of the provided artificial recommendation (High vs. Low)
– No significant difference between the Before and After conditions
– No significant impact in the Weak condition (WeakHigh vs. WeakLow)
– Artificial high recommendation did not raise ratings significantly (High vs. Control)
– Artificial low recommendation significantly lowered ratings (Low vs. Control)
– Seven recommendation techniques were tested on the dataset
– The item-based collaborative filtering approach was the best performer
Recommendation conditions (see the sketch after this list):
– High (predicted rating plus 1.5)
– Accurate (predicted rating)
– Low (predicted rating minus 1.5)
– Control (no prediction)
Target items:
– never seen before by the participant
– predicted rating for this user between 2.5 and 3.5
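A minimal sketch of the design elements above. The slides name item-based collaborative filtering but no specific implementation, so the weighted-sum prediction below, and all identifiers (`similarity`, `predict_rating`, etc.), are illustrative assumptions:

```python
def predict_rating(user_ratings, similarity, target_item):
    """Item-based CF sketch: average of the user's own ratings,
    weighted by item-item similarity to the target item."""
    num = sum(similarity[target_item][j] * r for j, r in user_ratings.items())
    den = sum(abs(similarity[target_item][j]) for j in user_ratings)
    return num / den if den else None

def shown_rating(predicted, condition):
    """The four recommendation conditions from the study design."""
    return {"High": predicted + 1.5,       # predicted rating plus 1.5
            "Accurate": predicted,         # unperturbed prediction
            "Low": predicted - 1.5,        # predicted rating minus 1.5
            "Control": None}[condition]    # no prediction shown

def eligible(predicted, seen_before):
    """Target items: never seen by the participant, with a
    predicted rating for this user between 2.5 and 3.5."""
    return (not seen_before) and 2.5 <= predicted <= 3.5
```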
[Figure] Mean rating drift vs. perturbation of the recommendation (-1.5 to +1.5 stars): drift increases monotonically with the perturbation, from -0.53 at the lowest perturbation to +0.53 at the highest (intermediate values: -0.41, -0.23, -0.20, 0.07, 0.28); Control group drift: -0.04
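The drift measure behind these numbers is simply the submitted rating minus the system's unperturbed prediction, averaged within each perturbation level. A sketch with hypothetical column names and toy data:

```python
import pandas as pd

# hypothetical layout: one row per (participant, item) rating event
df = pd.DataFrame({
    "perturbation": [-1.5, -1.5, 0.0, 1.5, 1.5],
    "predicted":    [3.0, 3.2, 3.1, 2.8, 3.1],
    "submitted":    [2.4, 2.8, 3.0, 3.3, 3.7],
})

df["drift"] = df["submitted"] - df["predicted"]  # rating drift
print(df.groupby("perturbation")["drift"].mean())
```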
– Anchoring impacts not only recalled preferences (e.g., Cosley et al. 2003) but also preference ratings at the point of consumption
Willingness to pay by treatment group:

| Treatment | Control | Low | Mid | High |
|---|---|---|---|---|
| WTP (¢) | 24.295 | 16.848 | 23.322 | 26.453 |
– In addition, we ran a variation of Study 4 with mandatory song sampling; there was no qualitative change in results
Anchoring effects were observed for:
– Recall of prior preferences
– Generation of new preferences (i.e., at the time of consumption)
– Economic actions and purchasing behaviors
Across item domains:
– Movies
– Jokes
– TV shows
– Song purchases
– Education, cognitive strategies, decision models, etc.
– Incentives, choice architecture, etc.
Removing bias vs. maintaining the usefulness of recommendations
[Illustration] A recommendation ("We think you'll like it:") displayed alongside the user rating input
Unbiased Preference?
– Eliminate or reduce anchoring biases at rating‐collection time through the design of the user interface
– Presentation (interface design) of recommender system ratings
– Some rating presentation formats may impact (reduce) the amount of bias created by recommendations
– Between-subjects lab experiment (see the sketch below)
– Random assignment to presentation-style treatments
– Measure relative changes in decision bias through user-reported preference ratings
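A sketch of the between-subjects assignment and a simple bias measure (mean rating under high anchors minus mean rating under low anchors, per group). The group names come from the design described next; everything else is an illustrative assumption:

```python
import random

GROUPS = ["Numeric-Precise", "Numeric-Vague", "Graphic-Precise",
          "Graphic-Vague", "Star-Numeric", "Star-Only", "Binary"]

def assign_group():
    """Between-subjects design: each participant sees exactly one
    presentation style for the whole session."""
    return random.choice(GROUPS)

def anchoring_bias(ratings_by_group):
    """ratings_by_group maps group -> {"High": [...], "Low": [...]};
    the bias measure is the High-Low gap in mean ratings."""
    return {g: sum(r["High"]) / len(r["High"])
               - sum(r["Low"]) / len(r["Low"])
            for g, r in ratings_by_group.items()}
```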
2 × 2 design (Information Representation × Recommendation Vagueness):

| | Numeric | Graphic |
|---|---|---|
| Precise | Numeric-Precise | Graphic-Precise |
| Vague | Numeric-Vague | Graphic-Vague |
| Group | N |
|---|---|
| Numeric-Precise | 40 |
| Numeric-Vague | 39 |
| Graphic-Precise | 40 |
| Graphic-Vague | 40 |
| Star-Numeric | 45 |
| Star-Only | 43 |
| Binary | 40 |

(The "Example Displays of Predicted Rating" column contained images.)
Joke selection (Jester dataset):
– Removed jokes that were not displayed/rated (according to the Jester group)
– Removed some with objectionable content
– Removed the longest jokes
Why jokes?
– Fast: allows multiple observations per participant
– Subjective tastes
– Information good
– Used extensively in prior research
– The Jester DB makes it easy to create a real recommender system
– Experiment participants enjoy the experience
Participant summary statistics:

| # of participants (n) | 287 |
|---|---|
| Age: mean (SD) | 22.7 (4.68) |
| Gender | 144 M, 143 F |
| Native speaker of English | 70.7% (203/287) |
| Prior experience with recommender systems | 74.9% (215/287) |
| Student level | 185 undergrad, 87 grad, 15 other |
Within-subject conditions:

| Condition | N | Description |
|---|---|---|
| High-Artificial | 5 | Randomly generated high recommendations, ~U[3.5, 4.5] |
| Low-Artificial | 5 | Randomly generated low recommendations, ~U[1.5, 2.5] |
| High-Perturbed | 5 | Actual predictions, perturbed upward by 1 star |
| Low-Perturbed | 5 | Actual predictions, perturbed downward by 1 star |
| Accurate | 5 | Actual algorithmic predictions (i.e., not perturbed) |
| Control | 5 | No predictions provided |
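A sketch of how the displayed recommendation could be generated per condition. The U[...] ranges and one-star perturbations are from the table above; the function itself is an illustrative assumption:

```python
import random

def displayed_recommendation(condition, predicted):
    """Map a within-subject condition to the rating shown for one joke."""
    if condition == "High-Artificial":
        return random.uniform(3.5, 4.5)  # ~U[3.5, 4.5]
    if condition == "Low-Artificial":
        return random.uniform(1.5, 2.5)  # ~U[1.5, 2.5]
    if condition == "High-Perturbed":
        return predicted + 1.0           # actual prediction, +1 star
    if condition == "Low-Perturbed":
        return predicted - 1.0           # actual prediction, -1 star
    if condition == "Accurate":
        return predicted                 # unperturbed prediction
    return None                          # Control: no prediction shown
```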
[Figure] Mean pre-treatment joke ratings by presentation group (Binary, Graphic-Precise, Graphic-Vague, Numeric-Precise, Numeric-Vague, Star-Numeric, Star-Only)
[Figure] Mean user rating (bars are one standard error) by condition (L = Low, C = Control, H = High) within each presentation group (Star-Only, Star-Numeric, Numeric-Vague, Numeric-Precise, Graphic-Vague, Graphic-Precise, Binary)

The size of the anchoring effect differs across representations (F(6, 280) = 2.24, p < 0.05)
Artificial recommendations, Models 1-3 (baseline: Numeric-Precise):

| | Model 1 (High Only) | Model 2 (Low Only) | Model 3 (High & Low) |
|---|---|---|---|
| Anchoring (High=1) | | | 0.793*** |
| ShownRatingNoise | 0.348*** | 0.247** | 0.287*** |
| PredictedRating | 0.300*** | 0.280*** | 0.274*** |
| Group: Numeric-Vague | -0.234** | -0.071 | -0.070 |
| Group: Star-Numeric | -0.156 | -0.006 | -0.018 |
| Group: Star-Only | -0.383*** | -0.013 | -0.028 |
| Group: Graphic-Precise | -0.049 | 0.316** | 0.298* |
| Group: Graphic-Vague | -0.203 | 0.178 | 0.167 |
| Group: Binary | -0.390*** | 0.042 | 0.039 |
| Numeric-Vague × Anchoring | | | -0.169 |
| Star-Numeric × Anchoring | | | -0.125 |
| Star-Only × Anchoring | | | -0.344* |
| Graphic-Precise × Anchoring | | | -0.328* |
| Graphic-Vague × Anchoring | | | -0.363* |
| Binary × Anchoring | | | -0.426** |
| jokeFunniness | 0.635*** | 0.541*** | 0.595*** |
| age | -0.003 | -0.006 | -0.005 |
| male | 0.092 | -0.005 | 0.045 |
| undergrad | -0.149* | -0.097 | -0.127* |
| native | -0.131* | 0.002 | -0.067 |
| IfUsedRecSys | 0.064 | 0.006 | 0.036 |
| PredictionAccurate | 0.122*** | 0.008 | 0.066** |
| PredictionUseful | 0.079*** | -0.020 | 0.030 |
| Numeracy | 0.011 | 0.001 | 0.006 |
| Constant | -0.549 | 0.132 | -0.584 |
| R² (overall) | 0.268 | 0.140 | 0.246 |

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

– Anchoring bias is significant
– Compared to the numeric groups, the effect sizes of the non-numeric groups are much smaller
– The "bias-reducing" effects can be highly asymmetric
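The slides do not say how these models were estimated (the overall R² suggests a panel estimator). As a hedged stand-in, here is how a regression of this shape, an anchoring dummy interacted with presentation group against a Numeric-Precise baseline, can be fit with pooled OLS and participant-clustered standard errors in statsmodels; the synthetic data and column names are assumptions, and several controls are omitted for brevity:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
groups = ["Numeric-Precise", "Numeric-Vague", "Star-Numeric", "Star-Only",
          "Graphic-Precise", "Graphic-Vague", "Binary"]

# synthetic stand-in data: one row per (participant, joke) observation
n = 1000
df = pd.DataFrame({
    "participant_id": rng.integers(0, 100, n),
    "group": rng.choice(groups, n),
    "anchoring_high": rng.integers(0, 2, n),   # 1 = high anchor shown
    "predicted_rating": rng.uniform(1, 5, n),
    "joke_funniness": rng.uniform(1, 5, n),
})
df["user_rating"] = (0.3 * df["predicted_rating"]
                     + 0.8 * df["anchoring_high"]
                     + rng.normal(0, 1, n))

# anchoring dummy interacted with presentation group,
# Numeric-Precise as the baseline (as in Models 1-3)
model = smf.ols(
    "user_rating ~ anchoring_high * C(group, Treatment('Numeric-Precise'))"
    " + predicted_rating + joke_funniness",
    data=df,
)
result = model.fit(cov_type="cluster",
                   cov_kwds={"groups": df["participant_id"]})
print(result.summary())
```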
Artificial recommendations, Model 4 (Numeric/Graphic × Precise/Vague):

| | Model 4 |
|---|---|
| Anchoring (High=1) | 0.404*** |
| ShownRatingNoise | 0.205* |
| PredictedRating | 0.22*** |
| Representation (Numeric=1) | -0.266* |
| Vagueness (Precise=1) | 0.099 |
| Numeric × Precise | 0.011 |
| Numeric × Anchoring | 0.253* |
| Precise × Anchoring | 0.104 |
| jokeFunniness | 0.712*** |
| age | -0.008 |
| male | 0.044 |
| undergrad | -0.186* |
| native | -0.111 |
| IfUsedRecSys | 0.061 |
| PredictionAccurate | 0.077* |
| PredictionUseful | 0.026 |
| Numeracy | 0.013 |
| Constant | -0.669 |
| R² (overall) | 0.253 |

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

– Anchoring bias is significant
– The High-Low difference is significantly larger for Numeric groups than for Graphic groups
– No significant difference between Precise and Vague groups
[Figure] Mean rating drift (bars are one standard error) by condition (L = Low, A = Accurate, H = High) within each presentation group (Star-Only, Star-Numeric, Numeric-Vague, Numeric-Precise, Graphic-Vague, Graphic-Precise, Binary)

Dependent variable: Rating Drift = Submitted Rating - Predicted Rating
Perturbed recommendations, Model 5 (per-group interactions; baseline: Numeric-Precise), selected coefficients: Anchoring (High=1) 0.777***; a control covariate 0.136***; group × Anchoring interactions: Numeric-Vague -0.040, Star-Numeric -0.189, Star-Only -0.140, Graphic-Precise -0.285, Graphic-Vague -0.301*, Binary -0.361*

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

– The non-numeric displays Binary and Graphic-Vague generated much smaller rating drifts away from actual preference
Perturbed recommendations, Model 6 (Numeric/Graphic × Precise/Vague):

| | Model 6 |
|---|---|
| Anchoring (High=1) | 0.468*** |
| PredictedRating | -0.228* |
| Representation (Numeric=1) | -0.256** |
| Vagueness (Precise=1) | 0.031 |
| Numeric × Precise | 0.094 |
| Numeric × Anchoring | 0.264* |
| Precise × Anchoring | 0.030 |
| jokeFunniness | 0.431*** |
| age | 0.003 |
| male | 0.070 |
| undergrad | -0.136 |
| native | -0.052 |
| IfUsedRecSys | 0.066 |
| PredictionAccurate | 0.085* |
| PredictionUseful | -0.040 |
| Numeracy | 0.022* |
| Constant | -1.650* |
| R² (overall) | 0.152 |

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

– Anchoring bias is significant
– The High-Low difference is significantly larger for Numeric groups than for Graphic groups
– No significant difference between Precise and Vague groups
– Alternative interface designs?
– Mechanisms behind the design outcomes? (e.g., vagueness as a double anchor?)
– Bias vs. usefulness tradeoff?
– Non-interface design approaches (user education, post hoc de-biasing)
– Biases in aggregate review ratings vs. personalized recommender system ratings
– Longitudinal characteristics of recommendation biases
– Bias implications in the real world (e.g., recommender performance evaluation, manipulation/abuse)