SLIDE 1

Searching for Arms

Daniel Fershtman and Alessandro Pavan, October 1, 2019

SLIDE 2

Motivation

• Experimentation/sequential learning central to many problems
• In many cases, endogenous set of alternatives/arms: search
• Tradeoff: exploring existing alternatives vs. searching for new ones

SLIDE 3

Motivation

Example

• Consumer sequentially explores different alternatives within a “consideration set”, while expanding the consideration set through search
• Firm interviews candidates, while searching for additional suitable candidates to interview
• Researcher splits time on several ongoing projects of unknown return, while also searching for new projects
• Difference: experimentation is directed, search is undirected

SLIDE 4

This Paper

• Multi-armed bandit problem with endogenous set of arms
• Optimal policy: index policy (with special index for search)
• Extension to problems with irreversible choice (based on partial information)
• Weitzman: special case where the set of boxes is exogenous and uncertainty is resolved after the first inspection

SLIDE 5

Search Index

Definition

$$G^S(\omega^S) = \sup_{\tau,\pi}\ \frac{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\left(r^\pi_s - c^\pi_s\right)\,\right|\,\omega^S\right]}{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\,\right|\,\omega^S\right]}$$

• Recursive representation:

$$G^S(\omega^S) = \frac{\mathbb{E}^{\chi^*}\left[\left.\sum_{s=0}^{\tau^*-1}\delta^s\left(r_s - c_s\right)\,\right|\,\omega^S\right]}{\mathbb{E}^{\chi^*}\left[\left.\sum_{s=0}^{\tau^*-1}\delta^s\,\right|\,\omega^S\right]}$$

• χ∗: policy selecting the physical arm with the highest Gittins index (among those brought by new search) if that index is higher than the search index, and search otherwise
• τ∗: first time the search index and the indexes of all physical arms brought by new search fall below the value of the search index at the time the search was launched

SLIDE 6

Difficulties

• Opportunity cost of search depends on the entire composition of the current choice set: e.g., the profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and on past interviews
• Non-stationarity in search technology: search outcome may depend on the type and number of arms previously found, and on past search costs
• Search competes with its own “descendants” (i.e., with arms discovered through past searches): correlation
• Treating search as a “meta arm” requires decisions within the meta arm to be invariant to info outside the meta arm
• Bandit problems with meta arms (e.g., arms that can be activated with different intensities, “super-processes”) rarely admit an index solution

SLIDE 7

Literature

Bandits

Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky (1995), Keller and Rady (1999)... Surveys: Bergemann and Välimäki (2008), Hörner and Skrzypacz (2017)

Bandits with time-varying set of alternatives

Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...

Sequential search for best alternative (Pandora’s problem)

Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016), Doval (2018)...

Experimentation before irreversible choice

Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...

⇒ KEY DIFFERENCE: Endogeneity of set of arms

SLIDE 8

Plan

1. Model
2. Optimal policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 9

Model

SLIDE 10

Model: Environment

• Discrete time: t = 0, 1, ..., ∞
• Available “physical” arms in period t: I_t = {1, ..., n_t} (I_0 exogenous)
• At each t, DM either pulls an arm among I_t or searches for new arms
• Opt-out: arm i = 0 (fixed reward equal to the outside option)
• Pulling arm i ∈ I_t: reward r_i ∈ ℝ and transition to a new “state”
• Search: costly, stochastic set of new arms I_{t+1} \ I_t

SLIDE 11

Model: “Physical” Arms

• “State” of a physical arm: ω^P = (ξ, θ) ∈ Ω^P
  • ξ ∈ Ξ: persistent “type”
  • θ ∈ Θ: evolving state
• Example: ξ: type of research project/idea (theory, empirical, experimental); θ = (σ^m): history of signals about the project’s impact; r: utility from working on the project
• H_{ω^P}: distribution over Ω^P, given ω^P
• Reward: r(ω^P)
• Usual assumptions: arm’s state “frozen” when not pulled; time-autonomous processes; evolution of arms’ states independent across arms, conditional on arms’ types
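To make the state objects concrete, here is a minimal Python sketch (my own representation, not code from the paper) of a physical arm’s state ω^P = (ξ, θ), with the “frozen when not pulled” convention built in; `draw_signal` and `reward` are hypothetical model primitives supplied by the modeler.

```python
# Sketch of omega^P = (xi, theta); `draw_signal` and `reward` are
# hypothetical primitives of the modeler, not from the paper.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class PhysicalArm:
    xi: str                          # persistent type (e.g., "theory", "empirical")
    theta: Tuple[float, ...] = ()    # evolving state: history of signals so far

    def pull(self, draw_signal: Callable, reward: Callable) -> float:
        """Collect r(omega^P) and transition theta; an arm that is not
        pulled keeps its state frozen (this method is simply not called)."""
        r = reward(self.xi, self.theta)
        self.theta = self.theta + (draw_signal(self.xi, self.theta),)
        return r
```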

SLIDE 12

Model: Search Technology

• State of the search technology: ω^S = ((c_0, E_0), (c_1, E_1), ..., (c_m, E_m)) ∈ Ω^S
  • m: number of past searches
  • c_k: cost of the k-th search
  • E_k = (n_k(ξ) : ξ ∈ Ξ): result of the k-th search, with n_k(ξ) ∈ ℕ the number of arms of type ξ found
• H_{ω^S}: joint distribution over (c, E), given ω^S
• Key assumptions: independence of calendar time; independence of arms’ idiosyncratic shocks θ; correlation through ξ
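A companion sketch (again my own, hypothetical representation) for the search-technology state ω^S as the history of past search costs and results:

```python
# Sketch of omega^S = ((c_0, E_0), ..., (c_m, E_m)); each E_k maps a
# persistent type xi to n_k(xi), the number of type-xi arms found.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SearchState:
    history: List[Tuple[float, Dict[str, int]]] = field(default_factory=list)

    def record(self, cost: float, found: Dict[str, int]) -> None:
        """Append the cost c_k and result E_k of the k-th search."""
        self.history.append((cost, found))

    @property
    def m(self) -> int:
        """Number of past searches."""
        return len(self.history)
```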

SLIDE 13

Model: Search Technology

• Stochasticity in the search technology:
  • learning about alternatives not yet in the consideration set
  • evolution of the DM’s ability to find new alternatives (e.g., limited set of outside alternatives, fatigue/experience)

SLIDE 14

Model: States and Policies

• Period-t state: S_t ≡ (ω^S_t, S^P_t)
  • ω^S_t: state of the search technology
  • S^P_t ≡ (S_t(ω^P) : ω^P ∈ Ω^P): state of the physical arms
  • S^P_t(ω^P): number of physical arms in state ω^P ∈ Ω^P
• Definition eliminates dependence on calendar time, while keeping track of all relevant information
• Policy χ describes feasible decisions at all histories
• Policy χ is optimal if it maximizes the expected discounted sum of net payoffs

$$\mathbb{E}^{\chi}\left[\left.\sum_{t=0}^{\infty}\delta^t\left(\sum_{j=1}^{n_t} x_{jt}\,r_{jt} - c_t\,y_t\right)\,\right|\,S_0\right]$$
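As a sketch of what this objective computes, the following Monte Carlo evaluator (hypothetical `env`/`policy` interfaces, not the paper’s code) estimates the expected discounted sum of rewards net of search costs for a given policy χ:

```python
# Monte Carlo estimate of E^chi[ sum_t delta^t (sum_j x_jt r_jt - c_t y_t) | S_0 ].
# `env` and `policy` are hypothetical interfaces: env.reset() returns S_0,
# env.step(action) returns (reward, search_cost, next_state).

def discounted_payoff(env, policy, delta=0.95, horizon=500):
    """One simulated path of the discounted net payoff."""
    state, total = env.reset(), 0.0
    for t in range(horizon):                 # horizon truncates the infinite sum
        action = policy(state)               # ("pull", arm_id) or ("search", None)
        reward, cost, state = env.step(action)
        total += delta**t * (reward - cost)
    return total

def value_estimate(env, policy, n_paths=1000):
    """Average over simulated paths, approximating the expectation over histories."""
    return sum(discounted_payoff(env, policy) for _ in range(n_paths)) / n_paths
```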

SLIDE 15

Plan

1. Model
2. Optimal policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 16

Optimal Policy

SLIDE 17

Indexes for Physical Arms

Index for “physical” arms:

$$G^P(\omega^P) \equiv \sup_{\tau>0}\ \frac{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s r_s\,\right|\,\omega^P\right]}{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\,\right|\,\omega^P\right]}$$

• τ: stopping time

Interpretations:
• maximal expected discounted reward, per unit of expected discounted time (Gittins)
• annuity that makes the DM indifferent between stopping right away and continuing with the option to retire in the future (Whittle)
• fair charge (Weber)
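For intuition, here is a minimal sketch (my own, not the paper’s code) computing the Gittins index of a Bayesian Bernoulli arm via the Whittle retirement interpretation above: bisect on the annuity M at which retiring and continuing tie.

```python
# Gittins index of a Bernoulli arm with Beta(a, b) posterior, via Whittle's
# retirement calibration.  A sketch under my own assumptions (truncated
# state space); not code from the paper.
DELTA = 0.9      # discount factor
DEPTH = 200      # truncation: force retirement after DEPTH pulls (approximation)

def continuation_value(a0, b0, M, delta=DELTA, depth=DEPTH):
    """Value in state Beta(a0, b0) when retiring pays the annuity M forever."""
    retire = M / (1.0 - delta)
    layer = [retire] * (depth + 1)           # terminal layer: must retire
    for n in range(depth - 1, -1, -1):       # backward induction over pull depth
        new_layer = []
        for k in range(n + 1):               # k successes among the first n pulls
            a, b = a0 + k, b0 + (n - k)
            p = a / (a + b)                  # predictive success probability
            pull = p * (1 + delta * layer[k + 1]) + (1 - p) * delta * layer[k]
            new_layer.append(max(retire, pull))
        layer = new_layer
    return layer[0]

def gittins_index(a, b, tol=1e-6):
    """Annuity at which the DM is indifferent between retiring and pulling."""
    lo, hi = 0.0, 1.0                        # rewards lie in {0, 1}
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        if continuation_value(a, b, M) > M / (1 - DELTA) + 1e-12:
            lo = M                           # continuing strictly better: raise M
        else:
            hi = M
    return 0.5 * (lo + hi)

print(gittins_index(1, 1))   # exceeds the myopic mean 0.5: exploration bonus
```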

SLIDE 18

Index for Search

Index for search:

$$G^S(\omega^S) \equiv \sup_{\pi,\tau}\ \frac{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\left(r^\pi_s - c^\pi_s\right)\,\right|\,\omega^S\right]}{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\,\right|\,\omega^S\right]}$$

• τ: stopping time
• π: choice among arms discovered AFTER t and FUTURE searches
• r^π_s, c^π_s: stochastic rewards/costs under rule π

• Interpretation: fair (flow) price for visiting “casinos” found stochastically over time, playing in them, and continuing to search for other casinos
• Definition accommodates correlation among arms found over time, and is compatible with the possibility that search lasts indefinitely and brings an unbounded set of alternatives

SLIDE 19

Index policy

Definition

The index policy selects, at each t:
• “search” iff G^S(ω^S_t) ≥ G^*(S^P_t), the maximal index among the available physical arms
• otherwise, any “physical” arm with index G^*(S^P_t)
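A minimal sketch of the resulting per-period decision rule (the index evaluators are passed in as hypothetical interfaces; this is my own illustration, not the paper’s code):

```python
# One period of the index policy: search iff G^S(omega^S_t) >= G^*(S^P_t).
# `physical_index` and `search_index` are assumed evaluators of G^P and G^S.

def index_policy_step(physical_states, search_state, physical_index, search_index):
    """Return ("search", None) or ("pull", arm_id), as in Theorem 1."""
    g_star, best_arm = max(
        (physical_index(w), i) for i, w in enumerate(physical_states)
    )
    if search_index(search_state) >= g_star:    # ties broken toward search here
        return ("search", None)
    return ("pull", best_arm)
```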

SLIDE 20

Optimality of index policy

Theorem 1

Index policy optimal in bandit problem with search for new arms

SLIDE 21

Implications of Index Policy

• Each period, the DM must assign a task to a worker
• Each worker can be ξ = Male or ξ = Female: different processes over signals/rewards
• Probability search brings a Male: 0.8
• Fixing the value of the highest index, the optimality of searching for new candidates is the same whether you have 49 M and 1 F, or 25 M and 25 F
• Given the highest physical index G^*(S^P_t), the composition of the set of physical arms is irrelevant for the decision to search
• However, the opportunity cost of search (the value of continuing with the current agents) depends on the number of M and F (and on past outcomes)
• The maximal index among current arms is NOT a sufficient statistic for the state of the current arms when it comes to the continuation payoff with the current arms

SLIDE 22

Plan

1. Model
2. Optimal policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 23

Dynamics

SLIDE 24

Dynamics under index policy

• Stationary search technology (H_{ω^S} = H^S for all ω^S): if the DM searches at t, all physical arms present at t are never pulled again (search = replacement)
• Result extends to “improving” search technologies: physical arms required to pass more stringent tests over time
• Deteriorating search technology (e.g., finite set of arms): the DM may return to arms present before the last search

SLIDE 25

Plan

1. Model
2. Optimal policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 26

Proof of Main Theorem

SLIDE 27

Proof of Theorem 1: Road Map

1. Characterization of the payoff under the index policy
   • representation uses a “timing process” based on optimal stopping in indexes
   • physical arms: stop when the index drops below its initial value (Mandelbaum, 1986)
   • search: stop when the search index and all indexes of newly arrived arms are smaller than the value of the search index when the search began
2. Dynamic programming: the payoff function under the index policy solves the dynamic programming equation

SLIDE 28

Proof: Step 1

• κ(v|S) ∈ ℕ ∪ {∞}: minimal time until all indexes (search, existing arms, newly found arms) are weakly below v ∈ ℝ₊

Lemma 1

The payoff V(S_0) under the index policy, starting from state S_0, satisfies
$$V(S_0) = \int_0^{\infty}\Big[1 - \underbrace{\mathbb{E}\,\delta^{\kappa(v|S_0)}}_{\substack{\text{expected discounted time till all}\\\text{indexes drop weakly below }v}}\Big]\,dv$$
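A toy numeric check of this representation (my own illustration, not the paper’s code): two deterministic arms and no search. For a deterministic arm the Gittins index is the best reward per unit of discounted time, and V is the average, i.e. (1 − δ)-normalized, payoff.

```python
# Verify V(S0) = integral_0^inf [1 - delta^kappa(v)] dv on a deterministic toy.
delta = 0.9
arms = [[5.0, 4.0, 1.0], [3.0, 2.0]]     # finite reward streams, 0 forever after

def det_index(stream):
    """Gittins index of a deterministic arm: max_tau discounted-reward ratio."""
    best = num = den = 0.0
    for s, r in enumerate(stream):
        num, den = num + delta**s * r, den + delta**s
        best = max(best, num / den)
    return best

pos, payoff, max_index_path = [0, 0], 0.0, []
for t in range(12):                       # 12 periods exhaust both arms
    idx = [det_index(a[p:]) for a, p in zip(arms, pos)]
    i = max(range(2), key=lambda j: idx[j])
    max_index_path.append(max(idx))
    if pos[i] < len(arms[i]):             # an exhausted arm yields 0
        payoff += delta**t * arms[i][pos[i]]
        pos[i] += 1

def kappa(v):
    """First period at which every current index is weakly below v."""
    return next(t for t, m in enumerate(max_index_path) if m <= v)

dv = 1e-3                                 # Riemann sum over v in [0, 5)
integral = sum((1 - delta ** kappa(k * dv)) * dv for k in range(5000))
print((1 - delta) * payoff, integral)     # both print ~1.3144
```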

SLIDE 29

Proof: Step 2

V(S_0) solves the dynamic programming equation:
$$V(S_0) = \max\left\{\underbrace{V^S(\omega^S|S_0)}_{\substack{\text{value from searching and}\\\text{reverting to index policy}}},\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\ S^P_0(\hat\omega^P)>0\}}\underbrace{V^P(\omega^P|S_0)}_{\substack{\text{value from pulling the arm and}\\\text{reverting to index policy}}}\right\}$$
Proof uses:
• the representation of the payoff under the index policy from Lemma 1
• a decomposition of the overall problem into a collection of binary problems in which the choice is between a single arm (possibly search) and an auxiliary fictitious arm with fixed reward

SLIDE 30

Plan

1. Model
2. Index policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 31

Applications

SLIDE 32

Dynamic Matching on a Platform

• Platform dynamically matches agents
• Shocks to match quality; gradual learning about attractiveness
• Platform solicits buyers/sellers in response to past bids (match outcomes)
• Joint dynamics of bidding, matching, and solicitation
• Distortions in solicitation dynamics (due to market power + private info)

SLIDE 33

Design of Search Engines

• Representative buyer uses a search engine to identify a product to purchase
• Search brings a set of sponsored and organic links; clicking on a link brings additional information
• GSP auction: sellers compete by submitting bids; higher bids buy higher positions; payments linked to clicks
• Result permits endogenizing click-through rates (CTR) and characterizing firms’ value for being in different positions/pages
• Auction design: how many products per page? payments?

SLIDE 34

Plan

1. Model
2. Index policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting

SLIDE 35

Extensions

SLIDE 36

Extension 1: Irreversible Choice

• In each period, the DM can: search for new alternatives; experiment with existing ones; or irreversibly select one alternative from those found through past searches
• A type-ξ arm must be pulled M_ξ ≥ 0 times before the DM can irreversibly commit to it (Weitzman: M_ξ = 1 for all ξ)
• Flow payoff from irreversibly selecting an arm in state ω^P: R(ω^P)

SLIDE 37

Extension 1: Irreversible Choice

• Partial order on states of physical arms: ω^P ⪰ ω̂^P
• E.g., with ω^P = (ξ, σ, m), where m is the number of times the arm has been activated: (ξ, σ, m) ⪰ (ξ, σ, m̂) if m ≥ m̂

Definition

Type ξ satisfies the “better-later-than-sooner” property if, for any ω^P ⪰ ω̂^P, either R(ω^P) ≥ R(ω̂^P) or R(ω^P), R(ω̂^P) ≤ 0. Weitzman: special case in which R(ω̂^P) = R(ω^P).

Theorem

Suppose all types satisfy the “better-later-than-sooner” property. Then the index policy is optimal.
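To connect with the named special case: in Weitzman’s (undiscounted) Pandora problem, each box’s index is the reservation value z solving c = E[max(v − z, 0)]. A minimal sketch for a Uniform[0, 1] prize (my own illustration, not the paper’s code):

```python
# Weitzman reservation value for a Uniform[0,1] prize with inspection cost c:
# c = E[max(v - z, 0)] = (1 - z)^2 / 2  =>  z = 1 - sqrt(2c).
import math

def reservation_value_uniform(c: float) -> float:
    assert 0.0 < c <= 0.5, "cost must leave inspection potentially worthwhile"
    return 1.0 - math.sqrt(2.0 * c)

print(reservation_value_uniform(0.02))   # 0.8: inspect a new box only while the
                                         # best prize in hand is below 0.8
```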

SLIDE 38

Extension 2: Search frictions

• Results extend to settings where the pull of an arm occupies an arbitrary number of periods (before a different action may be taken)
• The relative length of time in which pulling arms is interrupted for search can be made arbitrarily small (by re-scaling payoffs and adjusting the discount factor)
• Hence the analysis extends to settings where search and experimentation run “virtually” in parallel

SLIDE 39

Conclusions

• Experimentation with an endogenous set of alternatives, determined by past searches
• Optimal policy: index policy
  • “physical” arms: Gittins (1979) index
  • “search” arm: special index with a recursive structure; accounts for selection from the new arms found
• Constant or improving search technology: search = replacement; otherwise, existing arms are put on hold and resumed later
• Irreversible actions: under the “better-later-than-sooner” property, the index policy is optimal
• Applications: mediated matching, design of search engines, R&D and patenting

SLIDE 40

Conclusions

THANKS!

SLIDE 41

Meta Arms

• Arm 1: 1,000 the first time; λ ∈ {1, 10} on subsequent pulls (equal probability, perfectly persistent)
• Arm 2 (meta arm) can be used in two modes:
  • 2(A): 100 the first time, 0 thereafter
  • 2(B): 11 each period
• Selection of Arm 2’s mode is irreversible
• Optimal policy (δ = .9): start with Arm 1
  • if λ = 10, use Arm 2 in mode 2(A) for one period, followed by Arm 1 thereafter
  • if λ = 1, use Arm 2 in mode 2(B) thereafter
• No index representation, no matter the index definition
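A quick arithmetic check of the claimed optimum (my own verification; it assumes the first pull of Arm 1 reveals λ):

```python
# Candidate continuation values after the first pull of Arm 1 (delta = 0.9).
delta = 0.9

def annuity(r):                              # value of receiving r every period
    return r / (1 - delta)

for lam in (1, 10):
    arm1_forever = annuity(lam)              # keep pulling Arm 1
    mode_b_forever = annuity(11)             # commit to mode 2(B)
    mode_a_then_arm1 = 100 + delta * annuity(lam)   # one pull of 2(A), then Arm 1
    print(lam, arm1_forever, mode_b_forever, mode_a_then_arm1)

# lam = 10: 2(A)-then-Arm-1 gives 190 > 110 (2(B)) > 100 (Arm 1), as claimed.
# lam = 1:  2(B) gives 110 > 109 (2(A)-then-Arm-1) > 10 (Arm 1), as claimed.
```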


SLIDE 42

Policy: formal definition

• Period-t decision: d_t ≡ (x_t, y_t)
  • x_it = 1 if “physical” arm i is pulled; x_it = 0 otherwise
  • y_t = 1 if search; y_t = 0 otherwise
• Sequence of decisions d = (d_t)_{t=0}^∞ is feasible if, for all t ≥ 0:
  • x_jt = 1 only if j ∈ I_t
  • Σ_{j∈I_t} x_jt + y_t = 1
• A rule χ governing feasible decisions (d_t)_{t≥0} is a policy iff the sequence of decisions {d^χ_t}_{t≥0} under χ is {F^χ_t}_{t≥0}-adapted, where {F^χ_t}_{t≥0} is the natural filtration induced by χ


SLIDE 43

Recursive characterization of index for search

The index of the search arm can be re-written as
$$G^S(\omega^S) = \frac{\mathbb{E}^{\chi^*}\left[\left.\sum_{s=0}^{\tau^*-1}\delta^s\left(r_s-c_s\right)\,\right|\,\omega^S\right]}{\mathbb{E}^{\chi^*}\left[\left.\sum_{s=0}^{\tau^*-1}\delta^s\,\right|\,\omega^S\right]},$$
where χ∗ is the index policy and τ∗ is the first time s ≥ 1 at which the index of search and the indexes of all physical arms obtained through search fall below the value of the search index at s = 0.


SLIDE 44

Proof of Lemma 1

• v⁰ = max{G^*(S^P_0), G^S(ω^S_0)}
• t_0: first time all indexes (including search) are strictly below v⁰ (t_0 = ∞ if this event never occurs)
• η(v⁰|S_0): discounted sum of rewards, net of search costs, till t_0 (includes rewards from newly arrived arms)
• v¹ = max{G^*(S^P_{t_0}), G^S(ω^S_{t_0})} (note: t_0 = κ(v¹|S_0))
• ...
• η(v^i|S_0): net rewards between κ(v^i|S_0) and κ(v^{i+1}|S_0) − 1
• Stochastic sequence of values (v^i)_{i≥0}, times (κ(v^i|S_0))_{i≥0}, and discounted net rewards (η(v^i|S_0))_{i≥0}

SLIDES 45–53

Proof of Lemma 1 (graphical steps of the construction; figure-only slides, no additional text)
SLIDE 54

Proof of Lemma 1

(Average) payoff under the index policy:
$$V(S_0) = (1-\delta)\,\mathbb{E}\left[\left.\sum_{i=0}^{\infty}\delta^{\kappa(v^i)}\eta(v^i)\,\right|\,S_0\right].$$
• Starting at κ(v^i), the optimal stopping time in the index defining v^i is κ(v^{i+1}):
  • if v^i is the index of a physical arm, κ(v^{i+1}) is the first time its index drops below v^i
  • if v^i is the index of the search arm, κ(v^{i+1}) is the first time the search index and the indexes of all arms discovered after κ(v^i) drop below v^i
• Hence, v^i = expected discounted sum of net rewards, per unit of expected discounted time, from κ(v^i) until κ(v^{i+1}) − 1:
$$v^i = \frac{\mathbb{E}\left[\eta(v^i)\,\middle|\,\mathcal{F}_{\kappa(v^i)}\right]}{\mathbb{E}\left[1-\delta^{\kappa(v^{i+1})-\kappa(v^i)}\,\middle|\,\mathcal{F}_{\kappa(v^i)}\right]\big/(1-\delta)}$$
• The same is true if multiple arms and/or search have index equal to v^i at κ(v^i)

SLIDE 55

Proof of Lemma 1

Plugging in the expression for v^i,
$$V(S_0) = \mathbb{E}\left[\left.\sum_{i=0}^{\infty} v^i\left(\delta^{\kappa(v^i)} - \delta^{\kappa(v^{i+1})}\right)\right|\,S_0\right]$$
Therefore,
$$V(S_0) = \mathbb{E}\left[\left.\int_0^{\infty} v\,d\delta^{\kappa(v)}\,\right|\,S_0\right] = \int_0^{\infty}\left[1-\mathbb{E}\,\delta^{\kappa(v|S_0)}\right]dv$$


SLIDE 56

Proof of DP

Want to show that V(S_0) solves the dynamic programming equation:
$$V(S_0) = \max\left\{\underbrace{V^S(\omega^S|S_0)}_{\substack{\text{value from searching and}\\\text{reverting to index policy}}},\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\ S^P_0(\hat\omega^P)>0\}}\underbrace{V^P(\omega^P|S_0)}_{\substack{\text{value from pulling the physical arm}\\\text{and reverting to index policy}}}\right\}$$

SLIDE 57

Auxiliary arms

• e(ω^A_M): state with a single auxiliary arm yielding fixed reward M
• Note:
$$\kappa\!\left(v\,\middle|\,\underbrace{S_0\vee e(\omega^A_M)}_{S_0\ +\ \text{auxiliary arm}}\right) = \begin{cases}\kappa(v|S_0) & \text{if } v\geq M\\ \infty & \text{otherwise}\end{cases}$$
• From Lemma 1, the payoff from the index policy when the auxiliary arm is added:
$$V\!\left(S_0\vee e(\omega^A_M)\right) = \int_0^{\infty}\left[1-\mathbb{E}\,\delta^{\kappa(v|S_0\vee e(\omega^A_M))}\right]dv = M + \int_M^{\infty}\left[1-\mathbb{E}\,\delta^{\kappa(v|S_0)}\right]dv = V(S_0) + \int_0^{M}\mathbb{E}\,\delta^{\kappa(v|S_0)}\,dv$$
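A numeric check of this identity (my own illustration), reusing the two deterministic arms [5, 4, 1] and [3, 2] from the Lemma 1 sketch with δ = 0.9, plus an auxiliary arm paying M = 2.5: its index is always M, so the index policy pulls 5, 4, 3 and then the auxiliary arm forever.

```python
delta, M = 0.9, 2.5

# Left side: average payoff with the auxiliary arm added, V(S0 v e(M)).
lhs = (1 - delta) * (5 + delta * 4 + delta**2 * 3) + delta**3 * M

# Right side: V(S0) = 1.31441 from the Lemma 1 sketch, plus the integral of
# E delta^kappa(v|S0): kappa = 5, 4, 3 on [0,1), [1,2), [2,2.5) respectively.
rhs = 1.31441 + (delta**5 * 1.0 + delta**4 * 1.0 + delta**3 * 0.5)

print(lhs, rhs)   # both ~2.9255
```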

SLIDE 58

Auxiliary arms

$$D^S\!\left(\omega^S\,\middle|\,e(\omega^S)\vee e(\omega^A_M)\right) \equiv \underbrace{V\!\left(e(\omega^S)\vee e(\omega^A_M)\right)}_{\substack{\text{value under index policy, given}\\\text{only search + auxiliary arm}}} - \underbrace{V^S\!\left(\omega^S\,\middle|\,e(\omega^S)\vee e(\omega^A_M)\right)}_{\substack{\text{value of searching and reverting to index}\\\text{policy, given only search + auxiliary arm}}} = \begin{cases}0 & \text{if } M \leq G^S(\omega^S)\\ >0 & \text{if } M > G^S(\omega^S)\end{cases}$$

• D^S is the loss from starting with search, given only search + the auxiliary arm
• Similarly, for a physical arm in state ω^P:
$$D^P\!\left(\omega^P\,\middle|\,e(\omega^P)\vee e(\omega^A_M)\right) = \begin{cases}0 & \text{if } M \leq G^P(\omega^P)\\ >0 & \text{if } M > G^P(\omega^P)\end{cases}$$

SLIDE 59

Proof that V solves the Bellman equation

Can show (“tedious”):
$$D^S(\omega^S|S_0) = \int_0^{v^0} D^S\!\left(\omega^S\,\middle|\,e(\omega^S)\vee e(\omega^A_M)\right)\,d\,\mathbb{E}\,\delta^{\kappa(M|S^P_0)}$$
Hence:
$$D^S(\omega^S|S_0) = 0 \iff D^S\!\left(\omega^S\,\middle|\,e(\omega^S)\vee e(\omega^A_M)\right) = 0\ \ \forall M\in\left[0,\max\{G^*(S^P_0),\,G^S(\omega^S)\}\right] \iff G^*(S^P_0)\leq G^S(\omega^S)$$
• the loss from starting with search is 0 iff search has the largest index, and > 0 otherwise
• Similarly, D^P(ω^P|S_0) = 0 ⟺ G^P(ω^P) = G^*(S^P_0) ≥ G^S(ω^S)
• Hence
$$V(S_0) = \max\left\{V^S(\omega^S|S_0),\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\ S^P_0(\hat\omega^P)>0\}} V^P(\omega^P|S_0)\right\}$$
• V(S_0) solves the dynamic programming equation (hence the index policy is optimal)
SLIDE 60

Validation

• Assumption: for any S and any policy χ,
$$\lim_{t\to\infty}\ \delta^t\,\mathbb{E}^{\chi}\left[\left.\sum_{s=t}^{\infty}\delta^{s-t}\left(\sum_{j=1}^{n_s} x_{js}\,r_{js} - c_s\,y_s\right)\right|\,S\right] = 0$$
• Under this assumption, the solution to the DP equation coincides with the value function
• Assumption satisfied if rewards/costs are uniformly bounded
• Also compatible with unbounded rewards/costs: e.g., arms are sampling processes, with rewards drawn from a Normal distribution with unknown mean


SLIDE 61

Irreversible Choice: Proof

• Fictitious environment with no irreversible choice
• For any physical arm in state ω^P found through search, or pulled in period t, an “auxiliary” arm with fixed reward R(ω^P) is also “found” at t
• Auxiliary arms remain in the same state forever and do not generate other auxiliary arms
• Pulling the auxiliary arm corresponding to arm j is equivalent to irreversibly choosing arm j (once pulled, it is pulled forever)
• Given state ω^P, the NEW index of a physical arm is
$$\hat{G}^P(\omega^P) \equiv \sup_{\pi,\tau}\ \frac{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\,\tilde{r}_s\,\right|\,\omega^P\right]}{\mathbb{E}\left[\left.\sum_{s=0}^{\tau-1}\delta^s\,\right|\,\omega^P\right]}$$
(similar to the search index)
  • the rule π specifies a selection over primitive and auxiliary arms
  • r̃_s: period-s reward (can coincide with R(ω̂^P) in case the period-s selection is an auxiliary arm)
• Index for search as before, but with search adjusted to include the discovery of auxiliary arms
• Index policy optimal in the fictitious environment
• Difficulty: recasting the problem this way is possible only if auxiliary arms corresponding to past states of the same arm are never selected; this is guaranteed by the “better-later-than-sooner” property
