Strategy-proof estimators for simple regression
By Javier Perote (University of Salamanca) and Juan Perote-Peña (University of Zaragoza)
MOTIVATION
- First, this is the continuation of a research project that introduces private information and strategic considerations into well-known “aggregation” and “decision” techniques such as:
  – Operations Research (PERT, queuing theory, linear programming, …)
  – Multicriteria decision making
  – Clustering techniques
  – Econometrics
- Are these techniques “robust” to individual manipulation using the private information?
MOTIVATION
- Secondly, strategic data manipulation evokes the literature on “robustness” to random contamination and on outlier detection: most of the estimators proposed in that literature use the properties of the median to aggregate data
- Interestingly, the median as an allocation device to aggregate information is strategy-proof in some contexts, e.g., when individuals have “single-peaked” preferences on a single dimension in public goods allocation problems
- Can the incentives literature (from social choice theory) answer questions in econometrics?
STRUCTURE OF THE PAPER
- First, we argue that the informational problem can be very important in some econometric studies. Therefore, designing estimators that are robust to data manipulation can be useful
- Secondly, we examine the most popular estimator, OLS, and show that it may lead to sample contamination (it is NOT robust)
- Then, we propose a whole family of estimators for the simple regression case that can be proved to be immune to this kind of data contamination
- Finally, we compare some of them with OLS in a Monte Carlo experiment
WHAT KIND OF PROBLEM?
- Some econometric problems use reported or declared information from agents or individuals (e.g., questionnaires) that cannot be easily and costlessly observed or verified, i.e., it is the agent’s private information
- The information extracted from the data is (or can be) used to allocate “something” or to assess policies that might be important to the agents
- Therefore, the agents might be tempted to report false information if they think that the data managing process can be profitably manipulated
AN EXAMPLE
- A big firm or a government department has a number of divisions (perhaps located in different regions)
- Measures of the output “produced” by the divisions (for instance, the number of clients served in a month) cannot be verified without important costs (inventory costs, monitoring costs, etc.)
- Therefore, the information about each division’s output is privately owned by the division manager and is reported by him to the firm’s manager
THE MODEL WITH THE EXAMPLE
- Some of the inputs affecting each division’s output are known to the planner (the firm’s boss), maybe because the planner himself “allocated” them in the past (e.g., the number of workers in each division, the estimated demand in each region, the division’s monthly budget, etc.)
- $N = \{1, 2, \ldots, n\}$: set of divisions (= agents)
- $i, j \in N$: each agent is also an “observation”
- $y_i, \ \forall i \in N$: division $i$’s measure of (true) output
- $\tilde{y}_i, \ \forall i \in N$: division $i$’s reported output
THE MODEL WITH THE EXAMPLE
- $x_i, \ \forall i \in N$: publicly known explanatory variable
- True data generating process: $y_i = \beta_0 + \beta_1 x_i + e_i$, where $i = 1, \ldots, n$ and $e_i \sim N(0, \sigma)$ is an i.i.d. random variable (error term or random shock)
- Let
$$X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_i \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \quad \text{and} \quad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{bmatrix}$$
- $(X, Y)$: true sample
THE MODEL WITH THE EXAMPLE
- A regression estimator is a function “T” of the sample $(X, Y)$: $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)' = T(X, Y)$
- The estimated or predicted values of the response variable for each observation are generated as: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \ \forall i \in N$
- And the residuals $\hat{e}_i, \ \forall i \in N$, are the differences: $\hat{e}_i = y_i - \hat{y}_i$
- The most widely used estimator is the OLS one: $\hat{\beta}_{OLS} = \arg\min \sum_{i=1}^{n} \hat{e}_i^2, \ \forall (X, Y)$
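The estimator, fitted values and residuals defined above can be sketched in a few lines of Python. This is only a minimal illustration under assumed inputs: the function names `ols` and `residuals` and the five-point sample are hypothetical and do not come from the slides.

```python
def ols(x, y):
    """Closed-form OLS for the simple regression y = b0 + b1 * x.

    Returns (b0, b1), the minimiser of the sum of squared residuals.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y, b0, b1):
    """Residuals e_i = y_i - (b0 + b1 * x_i)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical sample, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 3.0, 4.0, 5.0]
b0, b1 = ols(x, y)
e_hat = residuals(x, y, b0, b1)
```

With an intercept in the model, the OLS residuals sum to zero, which gives a quick sanity check on any implementation.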
THE MODEL WITH THE EXAMPLE
- When the true sample $(X, Y)$ is known to the planner, the OLS estimator is the minimum-variance unbiased estimator (good properties)
- But when the true sample is unknown, the only information received by the planner is $(X, \tilde{Y})$ instead of $(X, Y)$. Applying OLS to the reported sample $(X, \tilde{Y})$ only maintains the good properties when no agent lies (i.e., $\tilde{Y} = Y$)
- QUESTION: In which cases will the agents lie?
THE MODEL WITH THE EXAMPLE
- We must assume some “preferences” guiding the agents’ declaring behaviour. We opt for the…
- SINGLE-PEAKEDNESS ASSUMPTION:
- Agent $i \in N$ with true response value $y_i$ has single-peaked preferences $R_i^{y_i}$ (with strict part $P_i^{y_i}$) on the real line $E$ if:
- (i) $y_i \; P_i^{y_i} \; v, \ \forall v \in E, \ v \neq y_i$
- (ii) $\forall v, v'$ with $v > v' > 0$: $(y_i + v') \; P_i^{y_i} \; (y_i + v)$ and $(y_i - v') \; P_i^{y_i} \; (y_i - v)$
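As a concrete illustration (an assumed example, not taken from the slides), distance-based preferences are single-peaked: agent $i$ ranks predictions by how close they are to $y_i$. The utility function $u_i$ below is a hypothetical representation introduced only to check conditions (i) and (ii).

```latex
u_i(v) = -\lvert v - y_i \rvert
\quad\Longrightarrow\quad
\begin{cases}
\text{(i)}  & u_i(y_i) = 0 > -\lvert v - y_i \rvert = u_i(v) \quad \forall v \neq y_i, \\
\text{(ii)} & u_i(y_i \pm v') = -v' > -v = u_i(y_i \pm v) \quad \forall v > v' > 0.
\end{cases}
```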
EXAMPLE OF SINGLE-PEAKEDNESS
- Possible single-peaked preferences for $i \in N$
[Figure: preference “intensity” plotted over the real line $E$ of predicted values $\hat{y}_i$, with the true value $y_i$ marked on the axis.]
THE MODEL WITH THE EXAMPLE
- Let us use the partitioned notation: $Y = (y_i, Y_{-i})$
- Def: Regression estimator $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)' = T(X, \tilde{y}_i, \tilde{Y}_{-i})$ is manipulable at sample $(X, \tilde{Y}) \in Z$ by observation $i \in \{1, \ldots, n\}$ if $\exists R_i^{\tilde{y}_i} \in \Re$, $\exists y_i \in E$ ($y_i \neq \tilde{y}_i$) such that
$$\left[\hat{\beta}_0(X, y_i, \tilde{Y}_{-i}) + \hat{\beta}_1(X, y_i, \tilde{Y}_{-i})\, x_i\right] \; P_i^{\tilde{y}_i} \; \left[\hat{\beta}_0(X, \tilde{y}_i, \tilde{Y}_{-i}) + \hat{\beta}_1(X, \tilde{y}_i, \tilde{Y}_{-i})\, x_i\right]$$
- Def: Regression estimator $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)' = T(X, \tilde{y}_i, \tilde{Y}_{-i})$ is strategy-proof if it is NOT manipulable at any sample $(X, \tilde{Y}) \in Z$ for any observation $i \in \{1, \ldots, n\}$
SOME EXAMPLES
- The workers’ union’s wage setting problem
[Figure: plot labelled with $L_i$, $p_i q_i - rK_i$, $w_i$ and the reported response $\tilde{y}_i = w_i L_i + FB_i$.]
SOME EXAMPLES
- The efficiency frontier estimation problem
[Figure: plot labelled with $\log \sigma_i$, $\log r_i$, $\sigma_i$ and the estimated coefficient $\hat{\beta}_1$; DGP: $\log r_i = \beta_0 + \beta_1 \log \sigma_i + e_i$.]
SOME EXAMPLES
- The tax pay-as-you-go (PAYG) rates allocation problem
[Figure: PAYG tax schedule $t_i = \hat{\beta}_0 + \hat{\beta}_1 I_i$, where $t_i$ is the PAYG average tax rate and $I_i$ is income; the plot marks an income of \$10,000 and rates of 20% and 30%.]
OLS IS NOT STRATEGY-PROOF
- Example:
[Figure: true response values for 5 observations in the $(x_i, \tilde{y}_i)$ plane, with observation 2 at $(x_2, y_2)$ highlighted.]
OLS IS NOT STRATEGY-PROOF
- Example:
[Figure: the 5 observations in the $(x_i, \tilde{y}_i)$ plane, with observation 2 at $(x_2, y_2)$ highlighted; the OLS estimator generates the regression line.]
OLS IS NOT STRATEGY-PROOF
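- Example: by lying and under-reporting $y_2$ (lie: $\tilde{y}_2 \neq y_2$), agent 2 can be better off. The regression line shifts slightly downwards, and the new prediction at $x_2$ is closer to the true $y_2$
[Figure: the 5 observations in the $(x_i, \tilde{y}_i)$ plane, showing the true $y_2$, the declared $\tilde{y}_2$ below it, and the shifted regression line.]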
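The manipulation argument can be reproduced numerically. The sketch below uses a made-up five-point sample (not the data behind the slide's figure) in which the honest OLS prediction for agent 2 lies above his true value; under-reporting pulls the line down and moves his prediction closer to the truth. The helper names `ols` and `prediction` are hypothetical.

```python
def ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def prediction(x, y, i):
    """OLS prediction for observation i when the planner sees sample (x, y)."""
    b0, b1 = ols(x, y)
    return b0 + b1 * x[i]


# Hypothetical true sample; agent 2 of the slides is index 1 here.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_true = [1.0, 1.0, 3.0, 4.0, 5.0]
agent = 1

honest = prediction(x, y_true, agent)      # prediction under truthful reports

y_reported = list(y_true)
y_reported[agent] = 0.0                    # agent 2 under-reports his output
after_lie = prediction(x, y_reported, agent)

gap_honest = abs(honest - y_true[agent])   # distance from the true y_2
gap_lie = abs(after_lie - y_true[agent])   # smaller: the lie pays off
```

The lie works because each observation has positive leverage on its own OLS prediction, so lowering the report always lowers the fitted value at that point.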
A STRATEGY-PROOF ESTIMATOR
- Only recommended for the case of $Z = (X, \tilde{Y})$ such that $x_i > 0 \ \forall i \in N$ and $\beta_0 = 0$: it is an extension of the median voter theorem, the MV estimator, defined as:
$$\hat{\beta}_1 = \operatorname{med}_{i \in N} \left\{ \frac{\tilde{y}_i}{x_i} \right\}, \qquad \hat{\beta}_0 = \operatorname{med}(\tilde{y}_i - \hat{\beta}_1 x_i)$$
[Figure: case of 5 observations, numbered 1 to 5, in the $(x_i, \tilde{y}_i)$ plane, with observation 2 at $(x_2, \tilde{y}_2)$ highlighted.]
- $\hat{\beta}_1$ is the median of the slopes
- and $\hat{\beta}_0$ is always the origin
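A minimal sketch of the MV estimator follows, together with a brute-force check that no single agent gains from misreporting on a grid of lies. The sample, the grid and the helper names (`mv_estimator`, `gains_from_lying`) are hypothetical, and agents are assumed to care only about the distance between their prediction and their true value, which is one particular single-peaked preference.

```python
from statistics import median


def mv_estimator(x, y_reported):
    """MV estimator: median of the slopes y_i / x_i, then median residual."""
    if any(xi <= 0 for xi in x):
        raise ValueError("the MV estimator assumes x_i > 0 for every i")
    b1 = median([yi / xi for xi, yi in zip(x, y_reported)])
    b0 = median([yi - b1 * xi for xi, yi in zip(x, y_reported)])
    return b0, b1


# Hypothetical true sample with an odd number of observations.
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y_true = [3.0, 2.0, 6.0, 10.0, 4.0]
b0, b1 = mv_estimator(x, y_true)


def gains_from_lying(i, lies):
    """True if some lie brings agent i's prediction strictly closer to y_i."""
    honest_gap = abs(b0 + b1 * x[i] - y_true[i])
    for lie in lies:
        y_rep = list(y_true)
        y_rep[i] = lie
        c0, c1 = mv_estimator(x, y_rep)
        if abs(c0 + c1 * x[i] - y_true[i]) < honest_gap - 1e-9:
            return True
    return False


lies = [k * 0.25 for k in range(-40, 81)]   # candidate reports from -10 to 20
manipulable = any(gains_from_lying(i, lies) for i in range(len(x)))
```

A grid search is of course not a proof; it only illustrates the median-voter logic that a lie can move the median slope, if at all, away from the liar's own ideal slope.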
CRM ESTIMATORS
- The clockwise repeated median (CRM) estimators are a family of strategy-proof estimators valid for every sample $Z = (X, \tilde{Y})$ such that $x_i \neq x_j \ \forall i, j \in N$
- They are parameterised by two sets $S, S' \subseteq N$ with either $S \cap S' = \emptyset$ or $S \subseteq S'$. First we calculate the clockwise angle of any pair of declared observations $i, j \in N$:
[Equation: $CWA((x_i, \tilde{y}_i), (x_j, \tilde{y}_j))$, written in terms of $\pi$, $\operatorname{sign}(x_j - x_i)$ and the sign and arctangent of the pairwise slope $\frac{\tilde{y}_j - \tilde{y}_i}{x_j - x_i}$.]
CRM ESTIMATORS
- Then, we define the directing angle,
$$DA(X, \tilde{Y}) = \operatorname{med}_{i \in S} \; \operatorname{med}_{j \in S',\, j \neq i} \; CWA((x_i, \tilde{y}_i), (x_j, \tilde{y}_j))$$
- And finally, the regression estimator is obtained as
$$\hat{\beta}_1 = \tan\left[\pi - DA(X, \tilde{Y}) - \frac{\pi}{2} \operatorname{sign}\big(DA(X, \tilde{Y}) - \pi\big)\right], \qquad \hat{\beta}_0 = \operatorname{med}_{i \in S} (\tilde{y}_i - \hat{\beta}_1 x_i)$$
- Some members of this class are known estimators:
CRM ESTIMATORS
- If $S = S' = N$, we obtain a clockwise extension of the repeated median estimator (Siegel, 1982)
- If $S = N \setminus \{h\}$, $S' = \{h\}$, we obtain a clockwise extension of the median star estimator (Simon, 1986)
- If $S = \{h \in N : x_h \leq \operatorname{med}_j x_j\}$, $S' = \{h \in N : x_h > \operatorname{med}_j x_j\}$, we obtain the Brown and Mood (1951) technique and, slightly changed, Tukey’s (1970/71) resistant line method
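For orientation, the classical (non-clockwise) repeated median that the first case extends can be sketched as below. This is the standard slope-based version, not the authors' clockwise variant: it takes a median of pairwise slopes per observation, then the median of those medians, and uses the ordinary median for even counts. The function name and the sample are hypothetical.

```python
from statistics import median


def repeated_median(x, y):
    """Classical repeated median line (slope-based); returns (b0, b1).

    Assumes all x values are distinct, as the slides require.
    """
    n = len(x)
    inner = []
    for i in range(n):
        # Slopes of the lines joining observation i with every other one.
        slopes = [(y[j] - y[i]) / (x[j] - x[i]) for j in range(n) if j != i]
        inner.append(median(slopes))
    b1 = median(inner)
    b0 = median([yi - b1 * xi for xi, yi in zip(x, y)])
    return b0, b1


# Hypothetical sample: four points on y = 2 + 0.5 x plus one gross outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.5, 3.0, 3.5, 4.0, 100.0]
b0, b1 = repeated_median(x, y)
```

On this sample the outlier affects only its own inner median, so the outer median still recovers the line through the other four points.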
CRM ESTIMATORS: AN EXAMPLE
- Let’s consider $S = S' = N$ and a sample $Z = (X, \tilde{Y})$ of 5 declared observations
[Figure sequence: the 5 declared observations, numbered 1 to 5, in the $(x_i, \tilde{y}_i)$ plane.]
- First, we calculate each observation’s clockwise angles. We start with observation 1: find the vectors connecting 1 with every other observation…
- Now, look at the clockwise angle of 1 with 2
- We first find the median of the 4 clockwise angles of observation 1. Note: when there’s an even number of angles, we take the highest median (convention)
- We represent the median angle with a small arrow pointing to the corresponding observation, and we proceed to find the median angle for each observation, 1 to 5
- We re-order the arrows from the one with the biggest clockwise angle to the one with the smallest, and we find the median of all of them, i.e., the one starting from observation 3
- Observation 3 is called the directing observation, pointing to observation 4, and its clockwise angle is the directing angle. The slope of the regression line is given by the directing angle and the intercept is immediate
- The CRM estimators are always such that the regression line passes through two different observations (here, 3 and 4). Observations below (above) the regression line always have bigger (smaller) angles than the directing one
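The two-stage median in this walk-through can be sketched independently of how the clockwise angles are computed. The code below takes a table of angles as given and applies the "highest median" convention via `statistics.median_high`. The angle values are made up purely so that the outcome mirrors the slide (observation 3 pointing to observation 4); they are not derived from the figure or from the CWA formula, and `directing` is a hypothetical helper name.

```python
from statistics import median_high


def directing(angles, S, S_prime):
    """Return (directing observation, observation it points to, directing angle).

    angles[i][j] is the clockwise angle of the pair (i, j), assumed given.
    Inner step: for each i in S, the highest median over j in S', j != i.
    Outer step: the highest median of those inner medians over i in S.
    """
    inner = []
    for i in S:
        # Tuples sort by angle first, so median_high also tells us which j.
        angle, j = median_high([(angles[i][j], j) for j in S_prime if j != i])
        inner.append((angle, i, j))
    angle, i, j = median_high(inner)
    return i, j, angle


# Made-up clockwise angles (radians) for 5 observations.
angles = {
    1: {2: 1.0, 3: 1.2, 4: 1.1, 5: 0.9},
    2: {1: 4.5, 3: 1.3, 4: 1.0, 5: 0.8},
    3: {1: 0.7, 2: 0.6, 4: 2.0, 5: 2.5},
    4: {1: 4.0, 2: 4.1, 3: 5.0, 5: 1.0},
    5: {1: 4.3, 2: 4.6, 3: 4.8, 4: 5.2},
}
N = [1, 2, 3, 4, 5]
result = directing(angles, S=N, S_prime=N)   # the case S = S' = N
```

Passing other index sets for `S` and `S_prime` gives the remaining members of the family, since only the sets over which the two medians run change.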
CRM ESTIMATORS: OTHER EXAMPLE
- Let’s consider $S = \{1, 2\}$ and $S' = \{4, 5\}$: resistant line
[Figure sequence: 5 declared observations, numbered 1 to 5, in the $(x_i, \tilde{y}_i)$ plane, with the resistant line drawn as the regression line.]
- Agent 3 cannot change the line, and agents 2 and 5 cannot be better off. Only 1 and 4 might want to lie
- Agent 4’s lies below the regression line will not change it. If 4 reports a $\tilde{y}_4$ over the regression line, it will only shift it upwards
- New regression line with the lie ($\tilde{y}_4$): 4 is now worse off, since his prediction is even further from his true value
- Agent 1 cannot change the line with lies over it, and can only shift it downwards by using lies $\tilde{y}_1$ below the regression line; therefore lying does not pay
THE SIMULATION RESULTS
- We undertake a Monte Carlo experiment comparing OLS estimates from a strategically manipulated sample with some CRM estimators, which avoid manipulation but are biased. Two DGPs:
- DGP1: $y_i = 5 - 0.5 x_i + e_i$, where $e_i \sim N(0, 1)$ i.i.d.
- DGP2: $y_i = -5 + 0.5 x_i + e_i$, where $e_i \sim N(0, 1)$ i.i.d.
- We must also assume a sample contamination process for the OLS regression (somewhat arbitrary). In particular, less than 1/3 of the observations on average were strategically contaminated
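A rough sketch of such an experiment for DGP1 is given below. Everything beyond the DGP itself is an assumption made for illustration: the sample size (20), the design $x_i = i$, the number of replications, and in particular the contamination rule, in which each agent lies with probability 0.3 by pushing his report 2 units further away from the honest OLS line. The slides do not specify their contamination process, so this is not a reconstruction of it.

```python
import random


def ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def simulate(reps=200, n=20, b0=5.0, b1=-0.5, sigma=1.0,
             p_lie=0.3, shift=2.0, seed=0):
    """Mean OLS estimates on honest and on contaminated samples (DGP1)."""
    rng = random.Random(seed)
    x = [float(i) for i in range(1, n + 1)]          # assumed design
    clean, dirty = [], []
    for _ in range(reps):
        y = [b0 + b1 * xi + rng.gauss(0.0, sigma) for xi in x]
        a0, a1 = ols(x, y)
        clean.append((a0, a1))
        reported = []
        for xi, yi in zip(x, y):
            if rng.random() < p_lie:
                # Assumed rule: exaggerate in the direction that pulls the
                # line towards the agent's true value.
                direction = 1.0 if yi > a0 + a1 * xi else -1.0
                reported.append(yi + direction * shift)
            else:
                reported.append(yi)
        dirty.append(ols(x, reported))

    def mean(pairs, k):
        return sum(p[k] for p in pairs) / len(pairs)

    return {
        "ols": (mean(clean, 0), mean(clean, 1)),
        "contaminated_ols": (mean(dirty, 0), mean(dirty, 1)),
    }


results = simulate()
```

Adding a strategy-proof estimator to the comparison only requires computing it on the honest sample inside the same loop, since by construction no agent has a reason to misreport to it.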
THE SIMULATION RESULTS
[Figure 4: DGP $y_i = 5 - 0.5x_i + e_i$; $V(e_i) = 1$. Horizontal axis: observations. Series: OLS, Repeated Median, Resistant Line, Median Star, Contaminated OLS.]
[Figure 5: DGP $y_i = -5 + 0.5x_i + e_i$; $V(e_i) = 1$. Horizontal axis: observations. Series: OLS, Repeated Median, Resistant Line, Median Star, Contaminated OLS.]
[Figure 6: Simulated histograms for the regression intercept ($y_i = 5 - 0.5x_i + e_i$; $V(e_i) = 0.01$). Series: OLS, Contaminated OLS, Resistant Line, Repeated Median, Median Star.]
[Figure 7: Simulated histograms for the regression slope ($y_i = 5 - 0.5x_i + e_i$; $V(e_i) = 0.01$). Series: OLS, Contaminated OLS, Resistant Line, Repeated Median, Median Star.]
CONCLUSIONS
- In some contexts, strategy-proofness might be an important and desirable property when the information extraction is linked to resource allocation or policy assessment and part of the sample is private information
- In these cases, a loss in consistency from using a CRM estimator instead of OLS might be a low price to pay for an “honestly revealed” sample
- The commitment to use the CRM estimator for the resource allocation after extracting the information must be clear: there might be an inconsistency problem
CONCLUSIONS
- Most CRM estimators also have high breakdown points and are robust to contamination by outliers
- CRM estimators only work for single-peaked preferences. If agents have other objectives, like minimising $\hat{y}_i - \tilde{y}_i$ instead of $\hat{y}_i - y_i$, for example, the search for strategy-proof estimators must start again