Searching for Arms
Daniel Fershtman Alessandro Pavan October 1, 2019
Motivation
Experimentation/sequential learning central to many problems
In many cases, endogenous set of alternatives/arms ⇒ search
Tradeoff: exploring existing alternatives vs. searching for new ones
Examples
  Consumer sequentially explores different alternatives within "consideration set", while expanding consideration set through search
  Firm interviews candidates, while searching for additional suitable candidates to interview
  Researcher splits time across several ongoing projects of unknown return, while also searching for new projects
Difference: experimentation is directed; search is undirected
Multi-armed bandit problem with endogenous set of arms
Optimal policy: index policy (with special index for search)
Extension to problems with irreversible choice (based on partial information)
Weitzman: special case where set of boxes exogenous and uncertainty resolved after first inspection
Definition

The search index:

$$G^S(\omega^S) = \sup_{\tau,\pi}\;\frac{E\left[\sum_{s=0}^{\tau-1}\delta^s\left(r^{\pi}_s - c^{\pi}_s\right)\,\middle|\,\omega^S\right]}{E\left[\sum_{s=0}^{\tau-1}\delta^s\,\middle|\,\omega^S\right]}$$

Equivalently,

$$G^S(\omega^S) = \frac{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1}\delta^s\left(r_s - c_s\right)\,\middle|\,\omega^S\right]}{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1}\delta^s\,\middle|\,\omega^S\right]}$$
χ∗: index policy (pull a physical arm brought by new search if its index is higher than the search index, and search otherwise)
τ∗: first time the search index and the indexes of all physical arms brought by new search fall below the value of the search index at the time search was launched
Opportunity cost of search depends on entire composition of current choice set
  e.g., profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and past interviews
Non-stationarity in search technology: search outcome may depend on
  type and number of arms previously found
  past search costs
Search competes with its own "descendants" (i.e., with arms discovered through past searches): correlation
Treating search as "meta arm" requires decisions within meta arm invariant to info
  bandit problems with meta arms (e.g., arms that can be activated with different intensities, i.e. "super-processes") rarely admit index solution
Bandits
Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky (1995), Keller and Rady (1999)... Surveys: Bergemann and Välimäki (2008), Hörner and Skrzypacz (2017)
Bandits with time-varying set of alternatives
Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...
Sequential search for best alternative (Pandora’s problem)
Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016), Doval (2018)...
Experimentation before irreversible choice
Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...
⇒ KEY DIFFERENCE: Endogeneity of set of arms
1. Model
2. Optimal policy
3. Dynamics
4. Proof of main theorem
5. Applications
6. Extensions: irreversible choice, search frictions, multiple search arms, no discounting
Model

Discrete time: t = 0, 1, ..., ∞
Available "physical" arms in period t: It = {1, ..., nt} (I0 exogenous)
At each t, DM either
  pulls an arm in It, or
  searches for new arms
Pulling arm i ∈ It yields a reward ri ∈ R and a transition to a new "state"
Search is costly and yields a stochastic set of new arms It+1\It

"State" of a physical arm: ωP = (ξ, θ) ∈ ΩP
  ξ ∈ Ξ: persistent "type"
  θ ∈ Θ: evolving state
Example:
  ξ: type of research project/idea (theory, empirical, experimental)
  θ = (σm): history of signals about project's impact
  r: utility from working on project
HωP : distribution over ΩP, given ωP
Reward: r(ωP)
Usual assumptions:
  arm's state "frozen" when not pulled
  time-autonomous processes
  evolution of arms' states independent across arms, conditional on arms' types
State of search technology: ωS = ((c0, E0), (c1, E1), ..., (cm, Em)) ∈ ΩS
  m: number of past searches
  ck: cost of k-th search
  Ek = (nk(ξ) : ξ ∈ Ξ): result of k-th search
  nk(ξ) ∈ N: number of arms of type ξ found
HωS : joint distribution over (c, E), given ωS
Key assumptions:
  independence of calendar time
  independence of arms' idiosyncratic shocks, θ
  correlation through ξ
Stochasticity in search technology:
  learning about alternatives not yet in consideration set
  evolution of DM's ability to find new alternatives (e.g., limited set of outside alternatives, fatigue/experience)
Period-t state: S_t ≡ (ω^S_t, S^P_t)
  ω^S_t: state of search technology
  S^P_t ≡ (S_t(ω^P) : ω^P ∈ Ω^P): state of physical arms
  S^P_t(ω^P): number of physical arms in state ω^P ∈ Ω^P
Definition eliminates dependence on calendar time, while keeping track of relevant information
Policy χ describes feasible decisions at all histories
Policy χ optimal if it maximizes expected discounted sum of net payoffs

$$E^{\chi}\left[\sum_{t=0}^{\infty}\delta^t\left(\sum_{j\in I_t} x_{jt}\,r_{jt} - c_t\,y_t\right)\middle|\,S_0\right]$$
Optimal policy
Index for "physical" arms:

$$G^P(\omega^P) \equiv \sup_{\tau>0}\;\frac{E\left[\sum_{s=0}^{\tau-1}\delta^s\,r_s\,\middle|\,\omega^P\right]}{E\left[\sum_{s=0}^{\tau-1}\delta^s\,\middle|\,\omega^P\right]}$$
Interpretations:
  maximal expected discounted reward, per unit of expected discounted time (Gittins)
  annuity that makes DM indifferent between stopping right away and continuing with option to retire in the future (Whittle)
  fair charge (Weber)
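The Gittins quotient above is easy to evaluate in the simplest case. A minimal sketch (my own illustration, not from the paper): for an arm with a deterministic reward sequence, the sup over stopping times reduces to a max over horizons τ = 1, 2, ...

```python
def gittins_deterministic(rewards, delta=0.9):
    """G^P for an arm with a known (deterministic) reward sequence:
    max over horizons tau of  sum_{s<tau} delta^s r_s / sum_{s<tau} delta^s."""
    best = float("-inf")
    num = den = 0.0
    for s, r in enumerate(rewards):
        num += delta**s * r      # discounted reward through horizon s+1
        den += delta**s          # discounted time through horizon s+1
        best = max(best, num / den)
    return best

# Front-loaded arm: optimal to stop right away, so the index is the first
# reward; a back-loaded arm averages in the later, larger reward.
print(gittins_deterministic([5, 1, 1]))   # 5.0
print(gittins_deterministic([1, 10]))     # (1 + 0.9*10)/(1 + 0.9) = 10/1.9
```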
Index for search:

$$G^S(\omega^S) \equiv \sup_{\pi,\tau}\;\frac{E\left[\sum_{s=0}^{\tau-1}\delta^s\left(r^{\pi}_s - c^{\pi}_s\right)\,\middle|\,\omega^S\right]}{E\left[\sum_{s=0}^{\tau-1}\delta^s\,\middle|\,\omega^S\right]}$$

π: choice among arms discovered AFTER t and FUTURE searches
r^π_s, c^π_s: stochastic rewards/costs under rule π
Interpretation: fair (flow) price for visiting "casinos" found stochastically over time, playing in them, and continuing to search for other casinos
Definition:
  accommodates correlation among arms found over time
  compatible with possibility that search lasts indefinitely and brings unbounded set of alternatives
Definition

Index policy: at each t, select "search" iff G^S(ω^S_t) ≥ G∗(S^P_t), where G∗(S^P_t) is the maximal index among available physical arms; otherwise pull an arm with maximal index
Theorem 1
Index policy optimal in bandit problem with search for new arms
Example
  Each period, DM must assign a task to a worker
  Each worker is of type ξ = Male or ξ = Female, with different processes over signals/rewards
  Probability search brings a Male: 0.8
  Fixing value of highest index, optimality of searching for new candidates is the same whether DM has 49 M and 1 F, or 25 M and 25 F
  Given highest physical index G∗(S^P_t), composition of set of physical arms irrelevant for decision to search
  However, opportunity cost of search (value of continuing with current agents) depends on number of M and F (and past outcomes)
  Maximal index among current arms NOT a sufficient statistic for state of current arms when it comes to continuation payoff with current arms
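The decision rule of the index policy reduces to a one-line comparison. A minimal sketch (function names and the toy indexes are my own; computing the indexes G^S and G^P is the hard part and is taken as given here):

```python
def index_policy_step(omega_S, physical_states, G_S, G_P):
    """One period of the index policy: search iff the search index weakly
    exceeds the maximal index among available physical arms."""
    if not physical_states:
        return "search"
    best_arm = max(physical_states, key=G_P)
    return "search" if G_S(omega_S) >= G_P(best_arm) else best_arm

# Toy indexes: each physical state is its own index; search index is 5.
print(index_policy_step("wS", [3.0, 7.0], lambda w: 5.0, lambda x: x))  # 7.0
print(index_policy_step("wS", [3.0, 4.0], lambda w: 5.0, lambda x: x))  # search
```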
Dynamics
Stationary search technology: HωS = HS for all ωS
  if DM searches at t, all physical arms present at t never pulled again (search = replacement)
Result extends to "improving search technologies": physical arms required to pass more stringent tests over time
Deteriorating search technology (e.g., finite set of arms):
  DM may return to arms present before last search
Proof of main theorem
1. Characterization of payoff under index policy
  representation uses "timing process" based on optimal stopping in indexes
  definition:
    physical arms: stop when index drops below its initial value (Mandelbaum, 1986)
    search: stop when search index and all indexes of newly arrived arms smaller than value of search index when search began
2. Dynamic programming
  payoff function under index policy solves dynamic programming equation
  κ(v|S) ∈ N ∪ {∞}: minimal time until all indexes (search/existing arms/newly found arms) weakly below v ∈ R+
Lemma 1

V(S_0), the payoff under the index policy starting from state S_0, satisfies

$$V(S_0) = \int_0^{\infty}\left[1 - E\,\delta^{\kappa(v|S_0)}\right]dv$$

(κ(v|S_0): time till all indexes drop weakly below v)

V(S_0) solves the dynamic programming equation:

$$V(S_0) = \max\left\{V^S(\omega^S|S_0),\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\,S^P_0(\hat\omega^P)>0\}} V^P(\omega^P|S_0)\right\}$$

V^S(ω^S|S_0): payoff from searching first and reverting to index policy thereafter
V^P(ω^P|S_0): payoff from first pulling a physical arm in state ω^P and reverting to index policy thereafter

Proof uses:
  representation of payoff under index policy from Lemma 1
  decomposition of overall problem into collection of binary problems where choice is between single arm (possibly search) and auxiliary fictitious arm with fixed reward
Applications
Mediated matching
  Platform dynamically matches agents
  Shocks to match quality
  Gradual learning about attractiveness
  Platform solicits buyers/sellers in response to past bids (match outcomes)
  Joint dynamics of bidding, matching, solicitation
  Distortions in solicitation dynamics (due to market power + private info)
Search engines
  Representative buyer uses search engine to identify product to purchase
  Search brings set of sponsored and organic links
  Clicking on a link brings additional information
  GSP auction: sellers compete by submitting bids
    higher bids: higher positions
    payments linked to clicks
  Result permits
    endogenizing click-through rates (CTR)
    characterizing firms' value for being on different positions/pages
  Auction design: how many products per page? payments?
Extensions
Irreversible choice

In each period, DM can
  search for new alternatives
  experiment with existing ones
  irreversibly select one alternative from those found through past searches
Type-ξ arm must be pulled Mξ ≥ 0 times before DM can irreversibly commit to it (Weitzman: Mξ = 1 for all ξ)
Flow payoff from irreversibly selecting arm in state ωP: R(ωP)

Partial order ⪰ on states of physical arms
  e.g., ωP = (ξ, σ, m), where m is the number of times the arm has been activated: (ξ, σ, m) ⪰ (ξ, σ, m̂) if m ≥ m̂

Definition

Type ξ satisfies the "better-later-than-sooner" property if, for any ωP ⪰ ω̂P, either R(ωP) ≥ R(ω̂P) or R(ωP), R(ω̂P) ≤ 0.
Weitzman: special case in which R(ω̂P) = R(ωP)
Theorem
Suppose all types satisfy the "better-later-than-sooner" property. Then the index policy is optimal.
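The property can be checked mechanically along a chain of states ordered by the activation count m. A small sketch (helper name and toy payoff functions are my own, assuming states differ only in m):

```python
def better_later_than_sooner(R, ms):
    """Check: for every pair m >= m_hat, either R(m) >= R(m_hat)
    or both flow payoffs are <= 0."""
    return all(R(m) >= R(mh) or (R(m) <= 0 and R(mh) <= 0)
               for m in ms for mh in ms if m >= mh)

# R nondecreasing in m satisfies the property (committing later pays more);
# R decreasing through positive values violates it.
print(better_later_than_sooner(lambda m: m - 3, range(6)))   # True
print(better_later_than_sooner(lambda m: 1 - m, range(6)))   # False
```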
Results extend to settings where the pull of an arm occupies an arbitrary number of periods (before a different action may be taken)
Relative length of time in which pulling arms is interrupted for search can be made arbitrarily small (by re-scaling payoffs and adjusting discount factor)
Hence analysis extends to settings where search and experimentation run "virtually" in parallel
Conclusions
  Experimentation with endogenous set of alternatives determined by past searches
  Optimal policy: index policy
    "physical" arms: Gittins (1979) index
    "search" arm: special index with recursive structure; accounts for selection from new arms found
  Constant, or improving, search technology: search = replacement
    otherwise, existing arms put on hold and resumed later
  Irreversible actions: "better-later-than-sooner" property ⇒ index policy optimal
  Applications: mediated matching, design of search engines, R&D and patenting
Arm 1: 1,000 first time; λ ∈ {1, 10} subsequent times (equal probability, perfectly persistent)
Arm 2 (meta arm) can be used in two modes:
  2(A): 100 first time, 0 thereafter
  2(B): 11 each period
Selection of Arm 2's mode irreversible
Optimal policy (δ = 0.9): start with Arm 1
  if λ = 10, use Arm 2 in mode 2(A) for one period, followed by Arm 1 thereafter
  if λ = 1, use Arm 2 in mode 2(B) thereafter
No index representation, no matter the index definition
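The stated optimal policy can be verified by enumerating the relevant continuation plans once Arm 1's first pull reveals λ (a sketch of my own; it takes as given the slide's claim that starting with Arm 1 is optimal, and uses the irreversibility of Arm 2's mode choice to restrict attention to plans committing to one mode):

```python
delta = 0.9

def annuity(r, start):
    """Discounted value of reward r per period from period `start` onward."""
    return r * delta**start / (1 - delta)

def best_continuation(lam):
    """Best plan from period 1 on, once Arm 1's persistent lambda is known."""
    plans = {
        "Arm 1 forever":         annuity(lam, 1),
        "2(A) once, then Arm 1": delta * 100 + annuity(lam, 2),
        "2(B) forever":          annuity(11, 1),
    }
    return max(plans.items(), key=lambda kv: kv[1])

plan_hi, v_hi = best_continuation(10)  # "2(A) once, then Arm 1" (value 171)
plan_lo, v_lo = best_continuation(1)   # "2(B) forever"          (value 99)
V = 1000 + 0.5 * v_hi + 0.5 * v_lo
print(plan_hi, plan_lo, V)             # total value ~= 1135
```

Note the λ-dependence: mode 2(A) is chosen only on the high-λ branch, so no single index attached to Arm 2 (however defined) can reproduce this behavior.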
Period-t decision: d_t ≡ (x_t, y_t)
  x_{it} = 1 if "physical" arm i pulled; x_{it} = 0 otherwise
  y_t = 1 if search; y_t = 0 otherwise
Sequence of decisions d = (d_t)_{t=0}^∞ feasible if, for all t ≥ 0: x_{jt} = 1 only if j ∈ I_t
Rule χ governing feasible decisions (d_t)_{t≥0} is a policy iff the sequence of decisions {d^χ_t}_{t≥0} under χ is {F^χ_t}_{t≥0}-adapted, where {F^χ_t}_{t≥0} is the natural filtration induced by χ
Index of search arm can be rewritten as

$$G^S(\omega^S) = \frac{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1}\delta^s\left(r_s - c_s\right)\,\middle|\,\omega^S\right]}{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1}\delta^s\,\middle|\,\omega^S\right]}$$

where χ∗ is the index policy and τ∗ is the first time s ≥ 1 at which the index of search and the indexes of all physical arms obtained through search fall below the value of the search index at s = 0.
v^0 = max{G∗(S^P_0), G^S(ω^S_0)}
t_0: first time all indexes (including search) strictly below v^0 (t_0 = ∞ if event never occurs)
η(v^0|S_0): discounted sum of rewards, net of search costs, till t_0 (includes rewards from newly arrived arms)
v^1 = max{G∗(S^P_{t_0}), G^S(ω^S_{t_0})} (note: t_0 = κ(v^1|S_0))
...
η(v^i|S_0): net rewards between κ(v^i|S_0) and κ(v^{i+1}|S_0) − 1
Stochastic sequence of values (v^i)_{i≥0}, times (κ(v^i|S_0))_{i≥0}, and discounted net rewards (η(v^i|S_0))_{i≥0}

(Average) payoff under index policy:

$$V(S_0) = (1-\delta)\,E\left[\sum_{i\ge 0}\delta^{\kappa(v^i)}\,\eta(v^i)\,\middle|\,S_0\right]$$

Starting at κ(v^i), the optimal stopping time in the index defining v^i is κ(v^{i+1}):
  if v^i is the index of a physical arm, κ(v^{i+1}) is the first time its index drops below v^i
  if v^i is the index of the search arm, κ(v^{i+1}) is the first time the search index and the indexes of all arms discovered after κ(v^i) drop below v^i
Hence v^i = expected discounted sum of net rewards, per unit of expected discounted time, from κ(v^i) until κ(v^{i+1}) − 1 (conditional on the information at κ(v^i)):

$$v^i = \frac{E\left[\delta^{\kappa(v^i)}\,\eta(v^i)\right]}{E\left[\sum_{s=\kappa(v^i)}^{\kappa(v^{i+1})-1}\delta^s\right]}$$

Same true if multiple arms and/or search have index equal to v^i at κ(v^i)

Plugging in the expression for v^i:

$$V(S_0) = E\left[\sum_{i\ge 0} v^i\left(\delta^{\kappa(v^i)} - \delta^{\kappa(v^{i+1})}\right)\middle|\,S_0\right]$$

$$V(S_0) = E\left[\int_0^{\infty} v\,d\delta^{\kappa(v)}\,\middle|\,S_0\right] = \int_0^{\infty}\left[1 - E\,\delta^{\kappa(v|S_0)}\right]dv$$
Want to show that V(S_0) solves the dynamic programming equation

$$V(S_0) = \max\left\{V^S(\omega^S|S_0),\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\,S^P_0(\hat\omega^P)>0\}} V^P(\omega^P|S_0)\right\}$$

V^S(ω^S|S_0): payoff from searching first and reverting to index policy thereafter
V^P(ω^P|S_0): payoff from first pulling a physical arm and reverting to index policy thereafter

e(ω^A_M): state with a single auxiliary arm yielding fixed reward M

Note: κ(v | S_0 ∨ e(ω^A_M)) = κ(v|S_0) if v ≥ M, and = ∞ if v < M

From Lemma 1, payoff from index policy when auxiliary arm added:

$$V(S_0 \vee e(\omega^A_M)) = \int_0^{\infty}\left[1 - E\,\delta^{\kappa(v|S_0 \vee e(\omega^A_M))}\right]dv = M + \int_M^{\infty}\left[1 - E\,\delta^{\kappa(v|S_0)}\right]dv = V(S_0) + \int_0^{M} E\,\delta^{\kappa(v|S_0)}\,dv$$

D^S(ω^S | e(ω^S) ∨ e(ω^A_M)): loss from starting with search, given only search + auxiliary arm:

$$D^S(\omega^S\,|\,e(\omega^S) \vee e(\omega^A_M)) \equiv V(e(\omega^S) \vee e(\omega^A_M)) - V^S(\omega^S\,|\,e(\omega^S) \vee e(\omega^A_M))$$

(payoff under index policy given only search + auxiliary arm, minus payoff from starting with search and reverting to index policy given only search + auxiliary arm)
  = 0 if M ≤ G^S(ω^S), > 0 if M > G^S(ω^S)

Similarly, for a physical arm in state ω^P: D^P(ω^P | e(ω^P) ∨ e(ω^A_M)) = 0 if M ≤ G^P(ω^P), > 0 if M > G^P(ω^P)

Can show ("tedious"):

$$D^S(\omega^S|S_0) = \int_0^{v^0} D^S(\omega^S\,|\,e(\omega^S) \vee e(\omega^A_M))\,dE\,\delta^{\kappa(M|S^P_0)}$$

Hence:
  D^S(ω^S|S_0) = 0 ⇔ D^S(ω^S | e(ω^S) ∨ e(ω^A_M)) = 0 for all M ∈ [0, max{G∗(S^P_0), G^S(ω^S)}] ⇔ G∗(S^P_0) ≤ G^S(ω^S)
  loss from starting with search = 0 iff search has largest index, and > 0 otherwise
Similarly, D^P(ω^P|S_0) = 0 ⇔ G^P(ω^P) = G∗(S^P_0) ≥ G^S(ω^S)

Hence

$$V(S_0) = \max\left\{V^S(\omega^S|S_0),\ \max_{\omega^P\in\{\hat\omega^P\in\Omega^P:\,S^P_0(\hat\omega^P)>0\}} V^P(\omega^P|S_0)\right\}$$
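The auxiliary-arm identity V(S_0 ∨ e(ω^A_M)) = V(S_0) + ∫_0^M E δ^κ(v|S_0) dv can be checked numerically on a toy instance of my own (a single deterministic arm paying 5 once and 2 thereafter, δ = 0.9, auxiliary reward M = 3):

```python
delta, M = 0.9, 3.0

def kappa(v):
    """kappa(v|S0) for a lone arm whose index is 5 initially, 2 afterwards."""
    if v >= 5:
        return 0
    if v >= 2:
        return 1
    return float("inf")

dv = 1e-3
# V(S0) via the integral representation of Lemma 1
V0 = sum((1 - delta**kappa(i * dv)) * dv for i in range(6000))
# Right-hand side of the identity: V(S0) + int_0^M delta^kappa(v) dv
rhs = V0 + sum(delta**kappa(i * dv) * dv for i in range(3000))  # grid on [0, M)

# Left-hand side, computed directly: with the auxiliary arm (index M = 3),
# the index policy pulls the arm once (index 5 > 3), then the auxiliary
# arm forever (3 > 2); average payoff = (1-delta)(5 + M*delta/(1-delta)).
lhs = (1 - delta) * (5 + M * delta / (1 - delta))

print(lhs, rhs)   # both ~= 3.2
```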
Assumption: for any S and any policy χ,

$$\lim_{t\to\infty}\delta^t\,E^{\chi}\left[\sum_{s=t}^{\infty}\delta^{s-t}\left(\sum_{j} x_{js}\,r_{js} - c_s\,y_s\right)\middle|\,S\right] = 0$$

Solution to DP equation coincides with value function
Assumption satisfied if rewards/costs uniformly bounded
Also compatible with unbounded rewards/costs: e.g., arms are sampling processes, with rewards drawn from a Normal distribution with unknown mean
Fictitious environment with no irreversible choice
  For any physical arm in state ω^P found through search, or pulled in period t, an "auxiliary" arm with fixed reward R(ω^P) is also "found" at t
  Auxiliary arms remain in the same state forever and do not generate other auxiliary arms
  Pulling the auxiliary arm corresponding to arm j is equivalent to irreversibly choosing arm j (once pulled, it will be pulled forever)

Given state ω^P, NEW index of physical arm:

$$\hat G^P(\omega^P) \equiv \sup_{\pi,\tau}\;\frac{E\left[\sum_{s=0}^{\tau-1}\delta^s\,\tilde r_s\,\middle|\,\omega^P\right]}{E\left[\sum_{s=0}^{\tau-1}\delta^s\,\middle|\,\omega^P\right]}$$

(similar to the search index)
  rule π specifies selection over primitive and auxiliary arms
  r̃_s: period-s reward (can coincide with R(ω̂^P) in case the period-s selection is an auxiliary arm)
Index for search as before, but search adjusted to include discovery of auxiliary arms
Index policy optimal in fictitious environment
Difficulty: recasting the problem this way is possible only if auxiliary arms corresponding to past states of the same arm are never selected: guaranteed by the "better-later-than-sooner" property