
Understanding the Implications of Recommender Systems on Our Views and Behaviors

Gedas Adomavicius, University of Minnesota

Joint work with Jesse Bockstedt, Shawn Curley, Jingjing Zhang


Recommender Systems: Feedback Loop

[Figure: The Recommender System (consumer preference estimation) and the Consumer (preference, purchasing, consumption) form a loop: the system shows Predicted Ratings (expressing recommendations for unknown items), and the consumer submits Actual Ratings (expressing preferences for consumed items), against which accuracy is measured.]
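The loop above can be illustrated with a toy simulation (purely hypothetical; the weighting model and function name are illustrative, not from the talk): if submitted ratings anchor on shown predictions, the bias is fed back into the system's inputs.

```python
def feedback_loop(true_pref, anchor_weight, rounds, noise=0.0):
    """Toy model of the feedback loop: each round the system predicts from
    past submitted ratings, the user anchors on the shown prediction, and
    the (possibly biased) rating is fed back as a new input."""
    history = [true_pref]  # seed with one unbiased rating
    for _ in range(rounds):
        predicted = sum(history) / len(history)  # naive preference estimate
        shown = predicted + noise                # recommendation shown to user
        # Anchoring: the submitted rating drifts toward the shown rating
        submitted = (1 - anchor_weight) * true_pref + anchor_weight * shown
        history.append(submitted)                # contaminated input
    return history[-1]
```

With `anchor_weight = 0` the rating stays truthful; with a positive weight and an upward-biased recommendation, submitted ratings drift away from the true preference round after round.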

Relevant Notion: Decision Heuristics

Anchoring

The process of seeding a thought in people’s minds and having that thought influence their later actions (Ariely 2008).

Anchoring and Adjustment:

A person begins with a first approximation (anchor) and then makes incremental adjustments based on additional information (Tversky & Kahneman 1974).


Anchoring and Adjustment Heuristic

  • Decision makers use implicitly suggested reference points (the anchor) as a starting point and make adjustments from it until they reach a reasonable estimate (Tversky & Kahneman 1974)

  • Example: numeric anchoring (Ariely et al. 2006)

– Think of the last two digits of your social security number
– Now bid on products…
– People with higher social security numbers made bids 60‐120% higher


[Figure: e.g., a subject with SSN ending 14 estimates 45; a subject with SSN ending 86 estimates 67.]

Related Literature: Anchoring Effects

  • Three waves of anchoring research (Epley and Gilovich 2010)

– First: establishment of anchoring and adjustment as leading to biases in judgment

  • E.g., Tversky & Kahneman 1974; Chapman & Bornstein 1996; Northcraft & Neale 1987

– Second: psychological explanations for anchoring effects (Russo 2010).

  • Uncertainty leads to a search from the anchor to the first plausible value among the distribution of uncertain values

  • Anchor leads to biased retrieval of anchor‐consistent knowledge
  • Numerical priming
  • Providing content relevant to one’s preference (e.g., anchor is viewed as a suggestion of the correct answer; “trust” in the system)

– Third: anchoring in real world contexts

  • E.g., Johnson, Schnytzer & Liu (2009) study anchoring in horserace betting; Ku, Galinsky & Murnighan (2006) investigate anchoring effects in auctions.

  • Recommender systems (Cosley et al. 2003)


Anchoring in Recommendations


[Figure: the same item with an unbiased preference of 3.0/5 receives a rating of 2.5/5 when shown with a low recommendation and 3.5/5 when shown with a high recommendation (“We think you’ll like it”).]

Related Literature: Anchoring and Recommender Systems

Setting — Cosley et al. (2003): Recommender systems; Our Prior Studies: Recommender systems
Type of task — Cosley et al.: Preference (no objective standard); Ours: Preference and Willingness‐to‐Pay (no objective standard)
Stimuli — Cosley et al.: Multiple movies; Ours: Single/multiple TV shows, jokes, songs
Recommendations — Cosley et al.: System‐based; Ours: System‐based, plus artificially generated
Manipulations — Cosley et al.: Two: High vs. Low; Ours: Multiple: High vs. Low; also range of manipulations
Timing (process implications) — Cosley et al.: Retrospective (Retrieval; Uncertainty); Ours: Point of Consumption (Integrating & Responding; No Uncertainty)
Explanations — Cosley et al.: None; Ours: Directly (timing, perceived reliability hypotheses) and indirectly provide evidence relative to possible explanations that have been posited for anchoring


Prior Research on Biases in Recommender Systems

Cosley et al. (2003)

  • Impact of system‐generated recommendations on user re‐ratings of movies
  • Recall task
  • High test‐retest rating consistency with no recommendations
  • Showing system’s ratings biased users’ subsequently submitted ratings in the direction of recommendation


Are Biases Bad?

Biases could be undesirable for a recommender system (Cosley et al. 2003, Adomavicius et al. 2013):

– Contaminate the recommender system’s inputs, weakening the system’s ability to provide high‐quality recommendations in subsequent iterations
– Can lead to users having a distorted view of items’ relevance
– Can lead to the recommender system having a distorted view of users’ preferences
– Provide opportunities for manipulation


Our Prior Studies

  • Motivation:

– Deepen our understanding of anchoring biases within the important context of recommender systems
– Anchoring effects in preference setting (both in terms of item ratings and willingness to pay) and at the time of consumption
– Provide evidence relative to the proposed explanations for anchoring effects


Prior Studies: General Research Question

Whether and to what extent do system ratings that are displayed to users influence users’ preferences and behaviors at the time of consumption?


Studies 1‐3: Preference ratings
Studies 4‐5: Willingness‐to‐pay


Studies 1‐3: Impact on Preference Ratings

  • Effect of system’s recommendations on self‐reported preference ratings

– Observed with different information good types: TV shows, jokes

  • Research issues:

– Anchoring issue (High/Low recommendation)
– Timing issue (Before/After consumption)
– Perceived system reliability issue (Strong/Weak perceived system reliability)
– Perturbation size issue (impact of perturbation size on anchoring effect)
– Symmetry/asymmetry of effects


General Procedure

  • Rate multiple items – inputs for recommender system
  • See a recommendation for the viewed instance(s)
  • View 1 or more instances of item to be rated

– Preference at time of consumption!
– Minimal uncertainty and biased recall

  • Provide a preference rating for the viewed instance(s)


Study 1 ‐ Design

  • Rated 105 TV shows
  • Watched an episode of 1 show (all saw same episode)
  • Received an artificial rating of 4.5 or 1.5
  • DV: Actual Rating (submitted by user after consumption)
  • Tested 3 hypotheses

– Anchoring (i.e., anchoring direction)
  • High (4.5 out of 5) vs. Low (1.5 out of 5) anchor
– Timing (of recommendation)
  • Before viewing vs. After viewing
– Perceived reliability (of recommendation)
  • Weak vs. Strong
– Control Group


Study 1 ‐ Results

  • Anchoring hypothesis – supported

– Significant observed anchoring effect of the provided artificial recommendation (High vs. Low)

  • Timing hypothesis – not supported

– No significant difference of Before vs. After

  • Perceived system reliability – supported

– No significant impact in the Weak condition (WeakHigh vs. WeakLow)

  • Asymmetry of the anchoring effect

– Artificial high recommendation did not raise ratings significantly (High vs. Control)
– Artificial low recommendation significantly lowered ratings (Low vs. Control)


Study 2 ‐ Design

  • Anchors were based on an actual recommender system

– Seven recommendation techniques were tested on the dataset
– Item‐based collaborative filtering approach was the best performer

  • Test of Anchoring Hypothesis

– High (predicted rating plus 1.5)
– Accurate (predicted rating)
– Low (predicted rating minus 1.5)
– Control (no prediction)

  • Each subject watched a show (not all the same)

– She/he had never seen before
– Had predicted rating for this user between 2.5 and 3.5

  • DV: Rating Drift = Actual Rating – Predicted Rating
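Item‐based collaborative filtering, the best performer here, can be sketched in miniature as follows (a toy illustration, not the study's implementation; the dictionary rating format and function names are hypothetical): predict a user's rating for an item as a similarity‐weighted average of that user's own ratings of other items.

```python
import math

def item_similarity(ratings, i, j):
    """Cosine similarity between items i and j over users who rated both.
    `ratings` is {user: {item: rating}} (hypothetical toy format)."""
    common = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    ni = math.sqrt(sum(ratings[u][i] ** 2 for u in common))
    nj = math.sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (ni * nj) if ni and nj else 0.0

def predict(ratings, user, item):
    """Item-based CF: similarity-weighted average of the user's own ratings."""
    sims = [(item_similarity(ratings, item, j), r)
            for j, r in ratings[user].items() if j != item]
    num = sum(s * r for s, r in sims if s > 0)
    den = sum(s for s, _ in sims if s > 0)
    return num / den if den else None
```

Such a predicted rating is then the baseline from which the High/Accurate/Low anchors above are constructed, and Rating Drift is simply the submitted rating minus this prediction.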


Study 2 ‐ Results

  • Effects of providing recommendation

– Accurate prediction had no impact (Accurate vs. Control)

  • Anchoring effect

– High recommendation condition led to significant difference in rating drift compared to the Low condition (High vs. Low)

  • Symmetry

– Aggregate over multiple shows: High/Low effects are symmetric
– Single show (Show effect): High/Low effects are asymmetric (& different from Study 1)


Study 3: Granularity of Anchoring Effects


What is the functional form of the anchoring effect? Three possibilities: [figure showing three candidate functional forms not shown]

Study 3 ‐ Design

  • Anchors were based on an actual recommender system
  • Anchoring: Within‐Subjects Design

– Each evaluated 50 jokes
– Among the remaining 50 jokes:
  • Perturbations of ‐1.5, ‐1, ‐.5, 0, .5, 1, 1.5
  • Control (no prediction)

  • Used jokes to get multiple ratings, still at time of consumption
  • DV: Rating Drift = Actual Rating – Predicted Rating
  • Regression done for each individual subject (N = 40 per subject)


Study 3 ‐ Aggregated Analysis

  • Aggregated across items and subjects, for each perturbation:

Perturbation of Recommendation:  ‐1.5   ‐1     ‐0.5   0      0.5   1     1.5
Mean Rating Drift:               ‐0.53  ‐0.41  ‐0.23  ‐0.20  0.07  0.28  0.53

Control: ‐0.04
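As a quick check, fitting a line to the aggregated means reported above reproduces a slope close to the reported individual‐level mean slope (the pairing of drift values to perturbations, including ‐0.20 at zero perturbation, follows the design's perturbation list):

```python
# Mean rating drift by perturbation, from the aggregated analysis above
perturbations = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
mean_drift = [-0.53, -0.41, -0.23, -0.20, 0.07, 0.28, 0.53]

# Ordinary least squares slope: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
n = len(perturbations)
x_bar = sum(perturbations) / n
y_bar = sum(mean_drift) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(perturbations, mean_drift))
         / sum((x - x_bar) ** 2 for x in perturbations))
print(round(slope, 2))  # ≈ 0.35
```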

Study 3 ‐ Results

  • Anchoring effect occurs at the individual level
  • Effect is linear (Mean Slope = .35)

– No significant curvilinearity found
– Positive and negative slopes did not significantly differ

  • Symmetry

– Aggregate over multiple jokes: High/Low effects are symmetric


Takeaways – Preference Ratings

  • Biased recommendations influence consumers’ preference ratings

– Anchoring not only impacts recalled preferences (e.g., Cosley et al. 2003), but also impacts preference ratings at the point of consumption

  • Perceived reliability of the recommendation matters
  • Timing of recommendation has no significant effect
  • Perturbations have a proportional (linear) effect on user‐submitted ratings (both negative and positive)
  • Asymmetry of anchoring effects

– Context‐specific (e.g., item‐specific?)
– Interesting direction for future work

User preference ratings are malleable and can be significantly influenced by the recommendations received.


General Research Question

Whether and to what extent do system ratings that are displayed to users influence users’ preferences and behaviors at the time of consumption?


Studies 1‐3: Preference ratings
Studies 4‐5: Willingness‐to‐pay


Overview

  • Study 4: Randomly generated high (low) recommendations
  • Study 5: Recommendations that contain significant error in an upward (downward) direction


General Procedure: Design

Stimulus pool: 200 popular songs

  • Bottom half of year‐end Billboard 100 charts, 2006‐2009

Within‐subjects design

  • Task 1: Rate 50 random songs – inputs for recommender system
  • Task 2: Identify 40 songs not owned
  • Task 3: WTP for 40 songs: study‐dependent manipulations

Song samples readily available throughout the study


General Procedure: Responses

  • Pay: $10 fixed + $5 endowment
  • Response for Task 3: Willingness‐To‐Pay (WTP) judgments [$0, $0.99]
  • 5 of 40 songs randomly selected

– Selling prices generated randomly
– Song purchases using endowment
– Songs gifted through Amazon.com
– Becker‐DeGroot‐Marschak (1964) procedure

  • To incentivize accurate WTP reporting

Study 4, Task 3

  • WTP for 40 non‐owned songs

– 10 randomly generated low recommendations ~ U[1 star, 2 stars]
– 10 randomly generated mid‐range recommendations ~ U[2.5 stars, 3.5 stars]
– 10 randomly generated high recommendations ~ U[4 stars, 5 stars]
– 10 with no recommendation (control)
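The condition assignment above can be sketched as follows (a hypothetical reconstruction for illustration; condition names and structure are not from the study's code):

```python
import random

# Star-rating ranges for the randomly generated recommendations
RANGES = {"low": (1.0, 2.0), "mid": (2.5, 3.5), "high": (4.0, 5.0)}

def assign_recommendations(songs, rng=random):
    """Assign each of 40 songs to a condition and draw its shown rating
    uniformly from that condition's range (None = control, no rating)."""
    shuffled = songs[:]
    rng.shuffle(shuffled)
    conditions = ["low"] * 10 + ["mid"] * 10 + ["high"] * 10 + ["control"] * 10
    shown = {}
    for song, cond in zip(shuffled, conditions):
        if cond == "control":
            shown[song] = None                      # no recommendation shown
        else:
            lo, hi = RANGES[cond]
            shown[song] = round(rng.uniform(lo, hi), 1)
    return shown
```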


Mean WTP, Study 4


Willingness to Pay by Treatment Group (mean WTP, in ¢):

Control 24.295   Low 16.848   Mid 23.322   High 26.453

Study 5, Task 3

  • WTP for 40 non‐owned songs

– 15 songs perturbed upward: Recommend = Predicted + (5 each @ .5, 1, 1.5)
– 5 songs not perturbed: Recommend = Predicted
– 15 songs perturbed downward: Recommend = Predicted − (5 each @ .5, 1, 1.5)
– 5 with no recommendation (control)


Takeaways – Willingness to Pay

  • Consistent impact of recommendations on WTP
  • 7‐17% increase in WTP for 1‐star increase in shown rating
  • I.e., recommendations not only impact preference ratings, but also economic behavior, even when factors of potential uncertainty and biased recall are essentially eliminated

– In addition, we ran a variation of Study 4 with mandatory song sampling – no qualitative change in results

  • Potential “scale compatibility” explanation is also essentially eliminated
  • Effects observed with artificial recommendations, and also with perturbed recommendations from an actual recommender system


Recommender Systems and Bias: What We Know

  • We know personalized recommendations can cause bias
  • We know they cause biases in various types of preference decisions:

– Recall of prior preferences
– Generation of new preferences (i.e., at the time of consumption)
– Economic actions and purchasing behaviors

  • We know they cause biases in various contexts including:

– Movies
– Jokes
– TV shows
– Song purchases


Latest Study: Research Question


How can we correct the biases in consumers’ preference ratings caused by interacting with a recommender system?

Two Potential Approaches

  • Bias‐aware interface design (proactive):

– Interface design‐based approach that incorporates a user interface for rating collection by presenting recommendations in a way that eliminates (or minimizes) anchoring effects

  • Post‐hoc rating adjustment (reactive):

– Computational/algorithmic approach that attempts to properly adjust the user‐submitted ratings by taking into account the system recommendation observed by the user

  • Some preliminary attempts made; to be explored more in the future


Avoiding bias is hard in this context

Common approaches for removing bias a priori (e.g., Soll et al. 2014)

  • Modify the person

– Education, cognitive strategies, decision models, etc.

  • Modify the environment

– Incentives, choice architecture, etc.


[Figure: tradeoff between removing bias and maintaining the usefulness of recommendations.]

Avoiding Bias

  • Information representation (and scale compatibility)

– The size of scales and compatibility among choices can create anchoring biases (e.g., Tversky et al. 1988)
– Potential de‐biasing approach: change (i.e., “soften”) the representation of recommendation ratings (and possibly also the scale) by using graphical representation of numeric data (e.g., Galesic et al. 2009)


[Figure: example recommendation display (“We think you’ll like it:”) shown alongside the user rating input.]


Avoiding Bias (cont.)

  • Preference uncertainty and vagueness of information

– Uncertainty in preferences is one key driver of anchoring bias (e.g., Jacowitz and Kahneman 1995)
– Adjustment from a precise anchor into a plausible preference range
– Potential de‐biasing approach: Introduce vagueness into the recommendation to reduce anchoring and prompt more consideration in judgment (e.g., Dieckmann et al. 2010, Joslyn and LeClerc 2012)


Unbiased Preference?

Bias‐Aware User Interface Study: High‐Level Overview

  • Objective:

– Eliminate or reduce anchoring biases at rating‐collection time through the design of the user interface

  • Manipulation:

– Presentation (interface design) of recommender system ratings
– Some rating presentation formats may impact (reduce) the amount of bias created by recommendations

  • Methodology:

– Between‐subjects lab experiment
– Random treatment of presentation styles
– Measure relative changes in decision bias through user‐reported preference ratings


Experiment Design

  • Outcome Variable: Bias in user preference ratings
  • Two main factors (2 × 2 between‐subjects design):

– Information representation: Numeric vs. Graphical rating displays
– Vagueness of recommendation: Precise vs. Vague rating values


Information Representation:   Numeric           Graphic
Recommendation Vagueness:
  Precise                     Numeric‐Precise   Graphic‐Precise
  Vague                       Numeric‐Vague     Graphic‐Vague

Additional Treatment Groups

  • Industry Standard Designs

– Star‐rating representations:
  • E.g., 4.5/5 stars on Netflix or Amazon
  • Star‐Only treatment
  • Star‐Numeric treatment (stars along with a numeric rating)

– Binary representation:
  • E.g., “Love it” or “Hate it” on iTunes or Pandora
  • Binary: only “thumbs up (down)” are displayed for high (low) predictions

  • 7 total treatment groups


Recommendation Displays (Manipulation Between Subjects)

Group             N
Numeric‐Precise   40
Numeric‐Vague     39
Graphic‐Precise   40
Graphic‐Vague     40
Star‐Numeric      45
Star‐Only         43
Binary            40

[Example displays of the predicted rating for each group not shown.]


Stimuli

  • Jester Online Joke Recommendation Repository (Goldberg et al. 2001) at Berkeley
  • 150 jokes with ratings from over 150K anonymous users
  • Selected 100 usable jokes for the study

– Removed jokes that were not displayed/rated (according to Jester group)
– Removed some with objectionable content
– Removed jokes greatest in length

  • Why Jokes?

– Fast – can have multiple observations per participant
– Subjective tastes
– Information good
– Used extensively in prior research
– Jester DB makes it easy to create a real recommender system
– Experiment participants enjoy the experience


Participants


Participant Summary Statistics

# of participants (n): 287
Age, Mean (SD): 22.7 (4.68)
Gender: 144 M, 143 F
Native speaker of English: 70.7% (203/287)
Prior experience with recommender systems: 74.9% (215/287)
Student level: 185 undergrad, 87 grad, 15 others

Experiment Procedure

TASK 1:

  • Participant asked to rate 50 random jokes (of 100) with preference ratings on a 1‐5 star scale to provide training data for the recommender system
  • Provides training data for generating recommendations later
  • Provides context to participant that the system will generate useful recommendations


Experiment Procedure (cont.)

TASK 2:

  • Participant is randomly assigned to one of the 7 interface treatment groups. All recommendations will be displayed in that interface format.
  • From the remaining 50 jokes, 30 jokes are randomly selected and presented to the participant with recommendation ratings (except 5 control) with some within‐subject manipulations.
  • Participants are asked to read and rate each joke.



Experiment Procedure (cont.)

TASK 3: Participants complete a short survey on demographic and other factors used for controls.

  • Age and gender
  • Prior experience with recommender systems
  • Opinions about usefulness of recommendations
  • Native speaker
  • Numeracy


Generating Recommendations

First Approach: Artificial Recommendations

– Randomly drawn ratings from uniform distributions
  • High ~ U[3.5, 4.5]
  • Low ~ U[1.5, 2.5]
  • Random so as to not make manipulation systematic and obvious
– Allows us to control for value ranges shown
– Pure manipulation – not based on individual preferences


Generating Recommendations

Second Approach: Perturbed Recommendations

– Start with real rating predictions generated using item‐based collaborative filtering
  • Best performing for this dataset among several algorithms tested
– Perturb the predictions by 1 star up/down
– Allows us to control for individual preferences
– More realistic representation of decision environment
– Simulates real‐world recommendation system error
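The perturbation step is straightforward; a minimal sketch (the clipping to the rating scale is an assumption on our part, since the slides do not say how out‐of‐range values were handled):

```python
def perturbed_recommendation(predicted, direction, scale=(1.0, 5.0)):
    """Perturb a real prediction by 1 star up or down, clipped to the
    rating scale. Sketch only; names and clipping behavior are assumed."""
    lo, hi = scale
    shown = predicted + (1.0 if direction == "up" else -1.0)
    return min(max(shown, lo), hi)
```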


Within‐Subject Manipulation

Each subject read and rated 30 jokes, 5 for each condition:


Within‐subject conditions:

Condition        N  Description
High‐Artificial  5  Randomly generated high recommendations, ~U[3.5, 4.5]
Low‐Artificial   5  Randomly generated low recommendations, ~U[1.5, 2.5]
High‐Perturbed   5  Actual predictions that were perturbed upward by 1 star
Low‐Perturbed    5  Actual predictions that were perturbed downward by 1 star
Accurate         5  Actual algorithmic predictions (i.e., not perturbed)
Control          5  No predictions were provided

Control Analysis


[Figure: mean pre‐treatment joke ratings (1‐5 scale) by treatment group: Binary, Graphic‐Precise, Graphic‐Vague, Numeric‐Precise, Numeric‐Vague, Star‐Number, Star‐Only.]


RESULTS: ARTIFICIAL RECOMMENDATIONS


Expectations for Artificial Recommendations Analysis

  • IF biases go away (for a given interface design)

– Anchoring effects should not be observed
– I.e., no significant differences between user preference ratings based on recommendations provided (mean ratings should be the same regardless of high, low, and control recommendation conditions)

  • User interfaces that exhibit reduced anchoring effects should reduce biases


Artificial Recommendations Analysis


[Figure: mean user rating (bars are one standard error) for the Low (L), Control (C), and High (H) conditions within each display group: Star‐Only, Star‐Number, Numeric‐Vague, Numeric‐Precise, Graphic‐Vague, Graphic‐Precise, Binary.]

  • Significant effects in all conditions (one‐tailed p‐value < 0.001 for all High vs. Low tests)
  • One‐way ANOVA suggests significant difference in effect sizes among different rating representations (F(6, 280) = 2.24, p < 0.05)

Regression: Artificial Recommendations


UserRating_ij = b0 + b1(Group_ij) + b2(High_ij) + b3(Group_ij × High_ij) + b4(ShownRatingNoise_ij) + b5(PredictedRating_ij) + b6(Controls_i) + u_i + ε_ij

Controls:

  • Joke funniness
  • Age
  • Undergraduate
  • NativeSpeaker
  • IfUsedRecSys
  • RecomAccurate
  • RecomUseful
  • Numeracy

  • Examine if some interface designs can reduce anchoring biases in consumer preference ratings
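In miniature, the interaction structure of this regression (an anchoring main effect plus a Group × Anchoring interaction that captures bias reduction) can be illustrated with synthetic data; the subject random effect u_i is omitted for brevity, and all names and numbers below are made up for illustration:

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients for y = X b + e."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)       # 0 = Numeric-Precise baseline, 1 = Binary
high = rng.integers(0, 2, n)        # 1 = high anchor shown
predicted = rng.uniform(1, 5, n)
# Synthetic data-generating process: strong anchoring at baseline (+0.8),
# attenuated in the non-numeric group (-0.4 interaction)
y = (1.0 + 0.8 * high - 0.4 * group * high + 0.3 * predicted
     + rng.normal(0, 0.3, n))

# Columns: intercept, group, high, group x high, predicted rating
X = np.column_stack([np.ones(n), group, high, group * high, predicted])
b = fit_ols(X, y)
# b[2] recovers the anchoring effect at baseline; b[3] the attenuation
```

A negative interaction coefficient, as in the results that follow, is exactly the sense in which a display format "reduces" anchoring bias relative to the Numeric‐Precise baseline.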


Selected Regression Terms (baseline: Numeric‐Precise)

High = 1                  0.793***
ShownRatingNoise          0.287***
Numeric‐Vague × High     ‐0.169
Star‐Numeric × High      ‐0.125
Star‐Only × High         ‐0.344*
Graphic‐Precise × High   ‐0.328*
Graphic‐Vague × High     ‐0.363*
Binary × High            ‐0.426**

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

Anchoring bias is significant. Compared to numeric groups, the effect sizes of non‐numeric groups are much smaller.

Artificial Recommendations (baseline: Numeric‐Precise)

                              Model 1     Model 2     Model 3
                              High Only   Low Only    High & Low
Anchoring (High=1)                                     0.793***
ShownRatingNoise              0.348***    0.247**      0.287***
PredictedRating               0.300***    0.280***     0.274***
Group
  Numeric‐Vague              ‐0.234**    ‐0.071       ‐0.070
  Star‐Numeric               ‐0.156      ‐0.006       ‐0.018
  Star‐Only                  ‐0.383***   ‐0.013       ‐0.028
  Graphic‐Precise            ‐0.049       0.316**      0.298*
  Graphic‐Vague              ‐0.203       0.178        0.167
  Binary                     ‐0.390***    0.042        0.039
Interactions
  Numeric‐Vague × Anchoring                           ‐0.169
  Star‐Numeric × Anchoring                            ‐0.125
  Star‐Only × Anchoring                               ‐0.344*
  Graphic‐Precise × Anchoring                         ‐0.328*
  Graphic‐Vague × Anchoring                           ‐0.363*
  Binary × Anchoring                                  ‐0.426**
Controls
  jokeFunniness               0.635***    0.541***     0.595***
  age                        ‐0.003      ‐0.006       ‐0.005
  male                        0.092      ‐0.005        0.045
  undergrad                  ‐0.149*     ‐0.097       ‐0.127*
  native                     ‐0.131*      0.002       ‐0.067
  IfUsedRecSys                0.064       0.006        0.036
  PredictionAccurate          0.122***    0.008        0.066**
  PredictionUseful            0.079***   ‐0.020        0.030
  Numeracy                    0.011       0.001        0.006
Constant                     ‐0.549       0.132       ‐0.584
R² (overall)                  0.268       0.140        0.246

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

The “bias‐reducing” effects can be highly asymmetric.


Regression Analysis: Representation, Vagueness

Artificial Recommendations, Model 4: Numeric/Graphic × Precise/Vague

Anchoring (High=1)           0.404***
ShownRatingNoise             0.205*
PredictedRating              0.22***
Representation (Numeric=1)  ‐0.266*
Vagueness (Precise=1)        0.099
Numeric × Precise            0.011
Numeric × Anchoring          0.253*
Precise × Anchoring          0.104
Controls
  jokeFunniness              0.712***
  age                       ‐0.008
  male                       0.044
  undergrad                 ‐0.186*
  native                    ‐0.111
  IfUsedRecSys               0.061
  PredictionAccurate         0.077*
  PredictionUseful           0.026
  Numeracy                   0.013
Constant                    ‐0.669
R² (overall)                 0.253

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

Anchoring bias is significant. The High‐Low difference is significantly larger for Numeric groups than Graphic groups. No significant difference between Precise and Vague groups.

RESULTS: PERTURBED RECOMMENDATIONS



Expectations for Perturbed Recommendations Analysis

  • IF biases go away (for a given interface design)

– Anchoring effects should not be observed
– I.e., (perturbed) recommendations provided should not cause user preference ratings to significantly differ from predicted ratings

  • User interfaces that exhibit reduced anchoring effects should reduce biases


Perturbed Recommendations Analysis


  • Significant effects in all conditions (one‐tailed p‐value < 0.001 for all High vs. Low tests)

Dependent Variable: Rating Drift = Submitted Rating – Predicted Rating

[Figure: mean rating drift (bars are one standard error) for the Low (L), Accurate (A), and High (H) conditions within each display group: Star‐Only, Star‐Number, Numeric‐Vague, Numeric‐Precise, Graphic‐Vague, Graphic‐Precise, Binary.]


Regression: Perturbed Recommendations


RatingDrift_ij = b0 + b1(Group_ij) + b2(High_ij) + b3(Group_ij × High_ij) + b4(PredictedRating_ij) + b5(Controls_i) + u_i + ε_ij

Controls:

  • Joke funniness
  • Age
  • Undergraduate
  • NativeSpeaker
  • IfUsedRecSys
  • RecomAccurate
  • RecomUseful
  • Numeracy

  • Examine if some interface designs can reduce anchoring biases in consumer preference ratings

Selected Regression Terms: Perturbed Recommendations

High = 1                  0.777***
Predicted Rating          0.136***
Numeric‐Vague × High     ‐0.040
Star‐Numeric × High      ‐0.189
Star‐Only × High         ‐0.140
Graphic‐Precise × High   ‐0.285
Graphic‐Vague × High     ‐0.301*
Binary × High            ‐0.361*

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

Anchoring bias is significant. The non‐numeric displays Binary and Graphic‐Vague generated much smaller rating drifts away from actual preference.


Regression Analysis: Representation, Vagueness

Perturbed Recommendations, Model 6: Numeric/Graphic × Precise/Vague

Anchoring (High=1)           0.468***
PredictedRating             ‐0.228*
Representation (Numeric=1)  ‐0.256**
Vagueness (Precise=1)        0.031
Numeric × Precise            0.094
Numeric × Anchoring          0.264*
Precise × Anchoring          0.030
Controls
  jokeFunniness              0.431***
  age                        0.003
  male                       0.070
  undergrad                 ‐0.136
  native                    ‐0.052
  IfUsedRecSys               0.066
  PredictionAccurate         0.085*
  PredictionUseful          ‐0.040
  Numeracy                   0.022*
Constant                    ‐1.650*
R² (overall)                 0.152

* p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001

Anchoring bias is significant. The High‐Low difference is significantly larger for Numeric groups than Graphic groups. No significant difference between Precise and Vague groups.

Summary of Results

  • Anchoring biases could not be completely eliminated in any of the tested designs, but…
  • Interface design factors seem promising for reducing bias
  • Information representation seems to be a robust factor: Graphical, Binary, and Star‐Only representations appear to generate less bias in consumer preference ratings than other representations

– Numeric‐Precise display tends to generate largest biases

  • Vagueness in recommendation rating does not seem to reduce bias



Conclusions & Future Work

  • Biases are pervasive and difficult to eliminate
  • Some promise in de‐biasing through interface design, but a lot more research opportunities exist, e.g.,

– Alternative interface designs?
– Mechanisms behind the design outcomes? (E.g., vagueness – double anchor?)
– Bias vs. usefulness tradeoff?

  • Other related issues:

– Non‐interface design approaches (user education, post hoc de‐biasing)
– Biases in aggregate review ratings vs. personalized recommender systems ratings
– Longitudinal characteristics of recommendation biases
– Bias implications in real world (e.g., recommender performance evaluation, manipulation/abuse)

  • This is a very problem‐rich and growing area of research


Thank You!