 
              Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
In the rest of this course we switch gears to the interplay between information theory and statistics.
1. In this lecture, we introduce the basic elements of statistical decision theory:
▶ It is about how to make decisions from data samples collected from a statistical model.
▶ It is about how to evaluate decision-making algorithms (decision rules) under a statistical model.
▶ It also serves as an overview of the contents to be covered in the follow-up lectures.
2. In the follow-up lectures, we will go into the details of several topics, including:
▶ Hypothesis testing: large-sample asymptotic performance limits
▶ Point estimation: Bayes vs. minimax, lower bounding techniques, high-dimensional problems, etc.
Alongside, we will introduce tools and techniques for investigating the asymptotic performance of several statistical problems, and show their interplay with information theory:
▶ Tools from probability theory: large deviations, concentration inequalities, etc.
▶ Elements from information theory: information measures, lower bounding techniques, etc.
Overview of this Lecture
In this lecture, the goal is to establish the basics of statistical decision theory.
1. We will begin by setting up the framework of statistical decision theory, including:
▶ Statistical experiment: parameter space, data samples, statistical model
▶ Decision rule: deterministic vs. randomized
▶ Performance evaluation: loss function, risk, minimax vs. Bayes
2. Next, we will introduce two basic statistical decision-making problems:
▶ Hypothesis testing
▶ Point estimation
Statistical Model and Decision Making Statistical Model and Decision Making 1 Basic Framework Examples Paradigms Hypothesis Testing 2 Basics Estimation 3 Mean-Squared Error (MSE) and Cramér-Rao Lower Bound Maximum Likelihood Estimator, Consistency, and Efficiency 4 / 55 I-Hsiang Wang IT Lecture 7
Statistical Model and Decision Making: Basic Framework
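[Figure: block diagram of the framework. Statistical experiment: a parameter $\theta \in \Theta$ selects the distribution $P_\theta(\cdot)$, which generates the data $X \in \mathcal{X}$. Decision making: the decision rule maps the data to the inferred result $\hat{T} = \tau(X)$.]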
Statistical Experiment
Statistical model: a collection of data-generating distributions $\mathcal{P} \triangleq \{ P_\theta \mid \theta \in \Theta \}$, where
▶ $\Theta$ is called the parameter space; it can be finite, countably infinite, or uncountable.
▶ $P_\theta(\cdot)$ is a probability distribution that accounts for the implicit randomness in experiments, sampling, or making observations.
Data (sample/outcome/observation): $X$ is generated by a random draw from $P_\theta$, that is, $X \sim P_\theta$.
▶ $X$ could be a random variable, vector, matrix, process, etc.
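The definitions above can be made concrete with a small sketch. It assumes a Bernoulli model, which is not part of the lecture: the parameter space is $\Theta = [0, 1]$, $P_\theta$ is the i.i.d. Bernoulli($\theta$) distribution on $\{0, 1\}^n$, and the function name `draw_sample` is hypothetical.

```python
import random

def draw_sample(theta, n, rng):
    """Draw X = (X_1, ..., X_n) i.i.d. from P_theta = Bernoulli(theta).

    Assumed model for illustration: Theta = [0, 1], X in {0, 1}^n.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in the parameter space [0, 1]")
    # rng.random() is uniform on [0, 1), so each entry is 1 with probability theta.
    return [1 if rng.random() < theta else 0 for _ in range(n)]

rng = random.Random(0)          # fixed seed so the draw is reproducible
x = draw_sample(0.3, 10, rng)   # one realization of X ~ P_theta with theta = 0.3
print(x)
```

Each value of `theta` picks one member of the family $\mathcal{P}$; the statistician sees only `x`, not `theta`.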
Inference Task
Objective: $T(\theta)$, a function of the parameter $\theta$.
From the data $X \sim P_\theta$, one would like to infer $T(\theta)$.
Decision Rule
Decision rule (deterministic): $\tau(\cdot)$ is a function of $X$. $\hat{T} = \tau(X)$ is the inferred result.
Decision rule (randomized): $\tau(\cdot, \cdot)$ is a function of $(X, U)$, where $U$ is external randomness. $\hat{T} = \tau(X, U)$ is the inferred result.
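As a minimal sketch of the two kinds of decision rule, assume binary data $x \in \{0, 1\}^n$ and the task $T(\theta) = \theta$ (the mean of each entry). Both rules below are illustrative choices, not rules from the lecture, and $U$ is assumed to be a number in $[0, 1)$ supplied by the caller.

```python
def tau_deterministic(x):
    """Deterministic rule: the estimate depends on the data x only (sample mean)."""
    return sum(x) / len(x)

def tau_randomized(x, u):
    """Randomized rule: the estimate depends on the data x and external randomness u.

    u is assumed to lie in [0, 1); it selects one coordinate of x,
    and that single observation is returned as the estimate.
    """
    index = min(int(u * len(x)), len(x) - 1)
    return x[index]

x = [1, 0, 1, 1]
print(tau_deterministic(x))      # same output every time for the same x
print(tau_randomized(x, 0.3))    # output changes with u even though x is fixed
```

The point of the contrast is that `tau_deterministic` is a function of `x` alone, while `tau_randomized` can give different answers on the same data.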
[Figure: the block diagram extended with the loss $l(T(\theta), \tau(X))$ and the expectation $\mathbb{E}_{X \sim P_\theta}[\cdot]$, which yields the risk $L_\theta(\tau)$.]
Performance Evaluation: how good is a decision rule $\tau$?
Loss function: $l(T(\theta), \tau(X))$ measures how bad the decision rule $\tau$ is (at a specific data point $X$).
Note: since $X$ is random, $l(T(\theta), \tau(X))$ is also random.
Risk: $L_\theta(\tau) \triangleq \mathbb{E}_{X \sim P_\theta}[ l(T(\theta), \tau(X)) ]$ measures how bad the decision rule $\tau$ is on average when the true parameter is $\theta$.
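The risk can be computed exactly in small cases. The sketch below assumes a Bernoulli($\theta$) model with $n$ i.i.d. samples, squared loss, and $T(\theta) = \theta$; none of these choices come from the lecture. It only handles rules that depend on the data through the count $k = \sum_i x_i$, which follows a Binomial($n$, $\theta$) distribution.

```python
from math import comb

def squared_loss(t, t_hat):
    """Loss l(T(theta), tau(x)) for the assumed squared-error criterion."""
    return (t - t_hat) ** 2

def risk(theta, n, tau):
    """Exact risk E_{X ~ P_theta}[ l(theta, tau(X)) ].

    tau(k, n) is a rule that sees the data only through k = sum(x),
    so the expectation is a finite sum over the binomial distribution of k.
    """
    total = 0.0
    for k in range(n + 1):
        p_k = comb(n, k) * theta**k * (1 - theta) ** (n - k)
        total += p_k * squared_loss(theta, tau(k, n))
    return total

def sample_mean(k, n):
    return k / n

def always_half(k, n):
    return 0.5   # ignores the data entirely

for theta in (0.1, 0.3, 0.5):
    print(theta, risk(theta, 10, sample_mean), risk(theta, 10, always_half))
```

For the sample mean the sum equals $\theta(1-\theta)/n$, while the constant rule has risk $(\theta - 0.5)^2$. Neither rule is better for every $\theta$, since the risk is a function of the unknown parameter.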
Performance Evaluation: what if the decision rule $\tau$ is randomized?
The loss function becomes $l(T(\theta), \tau(X, U))$.
The risk becomes $L_\theta(\tau) \triangleq \mathbb{E}_{U, X \sim P_\theta}[ l(T(\theta), \tau(X, U)) ]$.
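For a randomized rule the expectation runs over both $X$ and $U$. As a rough illustration, assume again a Bernoulli($\theta$) model with squared loss, and a hypothetical rule that returns one coordinate $X_U$, where $U$ is uniform on the index set and independent of $X$ (the independence is an assumption of this sketch).

```python
from itertools import product

def risk_random_coordinate(theta, n):
    """Exact risk E_{U, X ~ P_theta}[ (theta - X_U)^2 ] by full enumeration.

    X ranges over {0, 1}^n with i.i.d. Bernoulli(theta) entries;
    U is uniform on {0, ..., n-1} and independent of X (assumed).
    """
    total = 0.0
    for x in product([0, 1], repeat=n):
        k = sum(x)
        p_x = theta**k * (1 - theta) ** (n - k)
        # Inner average over the external randomness U.
        avg_loss_over_u = sum((theta - x[u]) ** 2 for u in range(n)) / n
        total += p_x * avg_loss_over_u
    return total

print(risk_random_coordinate(0.3, 4))
```

Under these assumptions the result equals $\theta(1-\theta)$, the variance of a single observation, whatever the value of $n$.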
Statistical Model and Decision Making: Examples
Sometimes we care about the inferred object itself.
Example: Decoding
Decoding in channel coding over a DMC is one example that we are familiar with.
Parameter is the message: $\theta \leftrightarrow m$
Parameter space is the message set: $\Theta \leftrightarrow \{1, 2, \ldots, 2^{NR}\}$
Data is the received sequence: $X \leftrightarrow Y^N$
Statistical model is encoder + channel: $P_\theta(x) \leftrightarrow \prod_{i=1}^{N} P_{Y|X}(y_i \mid x_i(m))$
Task is to decode the message: $T(\theta) \leftrightarrow m$
Decision rule is the decoding algorithm: $\tau(X) \leftrightarrow \mathrm{dec}(Y^N)$
Loss function is the 0-1 loss: $l(T(\theta), \tau(x)) \leftrightarrow \mathbb{1}\{ m \neq \mathrm{dec}(y^N) \}$
Risk is the decoding error probability: $L_\theta(\tau) \leftrightarrow \lambda_{m,\mathrm{dec}} \triangleq P\{ m \neq \mathrm{dec}(Y^N) \mid m \text{ is sent} \}$
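The correspondence above can be sketched for a toy code. The setup is an assumption made for illustration, not the lecture's: two messages $m \in \{0, 1\}$, a length-$N$ repetition code with $N$ odd, a binary symmetric channel with crossover probability `p`, and a majority-vote decoder.

```python
from math import comb

def encode(m, N):
    """Encoder x^N(m): repeat the message bit N times."""
    return [m] * N

def decode(y):
    """Decision rule dec(y^N): majority vote over the received bits."""
    return 1 if sum(y) > len(y) / 2 else 0

def zero_one_loss(m, m_hat):
    """0-1 loss: 1 if the decoded message differs from the sent one."""
    return 1 if m != m_hat else 0

def error_probability(N, p):
    """Risk lambda_{m,dec} = P{ m != dec(Y^N) | m is sent } for odd N.

    The majority vote fails exactly when more than N/2 bits are flipped,
    so the risk is a binomial tail and is the same for m = 0 and m = 1.
    """
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k)
               for k in range(N // 2 + 1, N + 1))

print(zero_one_loss(1, decode([1, 0, 1])))   # loss on one received sequence
print(error_probability(3, 0.1))             # risk: the loss averaged over the channel
```

Here the loss is evaluated on one received sequence, while the risk averages it over the channel noise given the message.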
Example: Hypothesis Testing
Decoding in channel coding belongs to a more general class of problems called hypothesis testing.
Parameter space is a finite set: $|\Theta| < \infty$
Task is to infer the parameter $\theta$: $T(\theta) = \theta$
Loss function is the 0-1 loss: $l(T(\theta), \tau(x)) = \mathbb{1}\{\theta \neq \tau(x)\}$
Risk is the probability of error: $L_\theta(\tau) = P_{X \sim P_\theta}\{\theta \neq \tau(X)\}$
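With a finite parameter space there is one risk value per hypothesis. The sketch below assumes a binary test, $\Theta = \{0, 1\}$, where $P_\theta$ is i.i.d. Bernoulli($p_\theta$) on $n$ samples and the rule declares $\theta = 1$ when the number of ones reaches a threshold. The model and the threshold rule are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P{ sum(X) = k } when X has n i.i.d. Bernoulli(p) entries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def risks(n, p0, p1, threshold):
    """Return (L_0, L_1) for the rule tau(x) = 1 if sum(x) >= threshold else 0.

    L_0 = P_{X ~ P_0}{ tau(X) != 0 }: the rule says 1 although theta = 0.
    L_1 = P_{X ~ P_1}{ tau(X) != 1 }: the rule says 0 although theta = 1.
    """
    risk_0 = sum(binom_pmf(k, n, p0) for k in range(threshold, n + 1))
    risk_1 = sum(binom_pmf(k, n, p1) for k in range(0, threshold))
    return risk_0, risk_1

for t in range(0, 6):
    print(t, risks(5, 0.2, 0.8, t))
```

Raising the threshold lowers the first risk and raises the second, so comparing rules requires deciding how to weigh the two.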
Example: Density Estimation
Estimate the probability density function from the collected samples.
Parameter space is a (huge) set of density functions: $\Theta \leftrightarrow \mathcal{F} = \{ f : \mathbb{R} \to [0, +\infty) \text{ which is concave/continuous/Lipschitz continuous/etc.} \}$
Data is the observed i.i.d. sequence: $X \leftrightarrow X^n$, $X_i \overset{\text{i.i.d.}}{\sim} f(\cdot)$
Task is to infer the density function $f(\cdot)$: $T(\theta) \leftrightarrow f$
Decision rule is the density estimator: $\tau(X) \leftrightarrow \hat{f}_{X^n}(\cdot)$
Loss function is some sort of divergence: $l(T(\theta), \tau(x)) \leftrightarrow D( f \,\|\, \hat{f}_{x^n} )$
Risk is the expected loss: $L_\theta(\tau) \leftrightarrow \mathbb{E}_{X^n \sim f^{\otimes n}}[ D( f \,\|\, \hat{f}_{X^n} ) ]$
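A rough sketch of this example, under several assumptions that are not in the lecture: the true density $f$ is Uniform$[0, 1)$, the estimator is an equal-width histogram, and the discrepancy $D$ is taken to be the integrated squared error. The risk is approximated by Monte Carlo averaging over independently drawn data sets.

```python
import random

def histogram_density(samples, bins):
    """Histogram estimate of a density on [0, 1): height on each equal-width cell."""
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    n = len(samples)
    # Height = (fraction of samples in the cell) / (cell width 1/bins).
    return [c * bins / n for c in counts]

def ise_to_uniform(heights):
    """Loss: integrated squared error between the histogram and f = 1 on [0, 1)."""
    width = 1.0 / len(heights)
    return sum(width * (h - 1.0) ** 2 for h in heights)

def monte_carlo_risk(n, bins, trials, seed):
    """Approximate E_{X^n ~ f^n}[ loss ] by averaging over `trials` data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [rng.random() for _ in range(n)]
        total += ise_to_uniform(histogram_density(samples, bins))
    return total / trials

print(monte_carlo_risk(n=100, bins=4, trials=2000, seed=0))
```

For this particular setup the exact risk is $(B - 1)/n$ with $B$ bins, so the printed estimate should land near 0.03.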
Sometimes we care about the utility of the inferred object.
Example: Classification/Prediction
A basic problem in learning is to train a classifier that predicts the category of a new object.
Parameter space is a collection of labelings: $\Theta \leftrightarrow \mathcal{H} = \{ h : \mathcal{X} \to [1:K] \}$
Data is the training data set: $X \leftrightarrow (X^n, Y^n)$, with labels $Y_i \in [1:K]$.
Statistical model is the noisy labeling: $P_\theta(x) \leftrightarrow \prod_{i=1}^{n} P_X(x_i) P_{Y|h(X)}(y_i \mid h(x_i))$
Task is to infer the true labeling $h \in \mathcal{H}$: $T(\theta) \leftrightarrow h(\cdot)$, $\tau(X) \leftrightarrow \hat{h}_{X^n, Y^n}(\cdot)$.
Loss function is the prediction error probability: $l(T, \tau(x)) \leftrightarrow \mathbb{E}_{X \sim P_X}[ \mathbb{1}\{ h(X) \neq \hat{h}(X) \} ]$
(Note: this is still random, as $\hat{h}$ depends on the randomly drawn training data $(X, Y)^n$.)
Risk is the averaged loss over training: $L_\theta(\tau) \leftrightarrow \mathbb{E}_{(X^n, Y^n) \sim (P_X P_{Y|h(X)})^{\otimes n}}[ \mathbb{E}_{X \sim P_X}[ \mathbb{1}\{ h(X) \neq \hat{h}(X) \} ] ]$
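The two nested expectations can be separated in a toy sketch. Everything specific here is an assumption for illustration: three feature values with uniform $P_X$, two labels written as 0 and 1, label noise that flips the true label with probability `flip_prob`, and a learner that takes a per-feature majority vote (defaulting to 0 for unseen features).

```python
import random
from collections import Counter, defaultdict

FEATURES = [0, 1, 2]
P_X = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # assumed feature distribution

def h_true(x):
    """Assumed true labeling h."""
    return x % 2

def loss(h, h_hat):
    """Inner expectation: P_X{ h(X) != h_hat(X) }, exact since X is finite."""
    return sum(P_X[x] for x in FEATURES if h(x) != h_hat(x))

def draw_training_set(n, flip_prob, rng):
    """Draw (X^n, Y^n): X_i ~ P_X, Y_i = h(X_i) flipped with probability flip_prob."""
    data = []
    for _ in range(n):
        x = rng.choice(FEATURES)
        y = h_true(x)
        if rng.random() < flip_prob:
            y = 1 - y
        data.append((x, y))
    return data

def train(data):
    """Decision rule: majority label per feature value, 0 if the feature was never seen."""
    votes = defaultdict(Counter)
    for x, y in data:
        votes[x][y] += 1
    table = {x: (votes[x].most_common(1)[0][0] if votes[x] else 0) for x in FEATURES}
    return lambda x: table[x]

def risk(n, flip_prob, trials, seed):
    """Outer expectation over training sets, approximated by Monte Carlo."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h_hat = train(draw_training_set(n, flip_prob, rng))
        total += loss(h_true, h_hat)
    return total / trials

print(risk(n=20, flip_prob=0.2, trials=500, seed=0))
```

`loss` is exact for a fixed trained classifier, while `risk` averages that loss over the random training set, mirroring the inner and outer expectations in the risk formula.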