Statistical Inference

Suppose we have a variable $X$ (which may or may not be a random variable) that represents the state of nature. We observe a variable $Y$ which is obtained from $X$ by some model of the world $P_{Y|X}$.

Suppose we know that $X \sim \pi$ where $\pi$ is a probability distribution. If we observe $Y = y$, then the a posteriori estimate of $X$ is given by Bayes' rule:

$$\text{Pr}\left\{X=x | Y=y\right\} = \frac{P_{Y|X}(y|x)\pi(x)}{\sum_{\tilde{x}}P_{Y|X}(y|\tilde{x})\pi(\tilde{x})} \propto P_{Y|X}(y|x)\pi(x).$$

Since the posterior is proportional to the product of the model and the prior, we don't actually need to compute the normalizing constant to figure out the most likely $X$.
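
To make this concrete, here is a minimal numerical sketch with a made-up setup: $X$ is one of three possible coin biases drawn from a three-point prior, and $Y$ is the number of heads in 10 flips. The argmax of the unnormalized products already identifies the most likely $X$.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical setup: X is the unknown bias of a coin, drawn from a
# three-point prior, and Y is the number of heads in n = 10 flips.
x_values = np.array([0.3, 0.5, 0.7])    # possible states of nature
prior = np.array([0.25, 0.5, 0.25])     # pi(x)
n, y = 10, 7                            # we observe Y = 7 heads

likelihood = binom.pmf(y, n, x_values)  # P_{Y|X}(y|x) for each x
unnormalized = likelihood * prior       # numerator of Bayes' rule
posterior = unnormalized / unnormalized.sum()

# The argmax is unchanged by the normalizing constant.
assert np.argmax(posterior) == np.argmax(unnormalized)
print(posterior)                        # Pr{X = x | Y = 7} for each x
```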

Definition 77

The Maximum A Posteriori (MAP) estimate is given by

$$\hat{X}_{MAP}(y) = \text{argmax}_x\, P_{Y|X}(y|x)\pi(x)$$

If we have no prior information on $X$, then we can assume $\pi$ is uniform, reducing Definition 77 to an optimization over the model alone.

Definition 78

The Maximum Likelihood (ML) estimate is given by

$$\hat{X}_{ML}(y) = \text{argmax}_x\, P_{Y|X}(y|x)$$
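
To see the two estimates side by side, here is a sketch reusing the hypothetical coin model above, with the prior now concentrated on the fair coin so that MAP and ML disagree:

```python
import numpy as np
from scipy.stats import binom

# Same hypothetical coin model as above, but with a prior concentrated
# on the fair coin so that the two estimates disagree.
x_values = np.array([0.3, 0.5, 0.7])
prior = np.array([0.1, 0.8, 0.1])
n, y = 10, 7

likelihood = binom.pmf(y, n, x_values)

x_map = x_values[np.argmax(likelihood * prior)]  # argmax_x P_{Y|X}(y|x) pi(x)
x_ml = x_values[np.argmax(likelihood)]           # argmax_x P_{Y|X}(y|x)

print(x_map, x_ml)  # 0.5 vs 0.7: the strong prior pulls MAP toward 0.5
```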

Binary Hypothesis Testing

Definition 79

A Binary Hypothesis Test is a type of statistical inference where the unknown variable $X \in \{0, 1\}$.

Since there are only two possible values of $X$ in a binary test, there are two "hypotheses," and we want to accept the more likely one.

Definition 80

The Null Hypothesis $H_0$ says that $Y \sim P_{Y|X=0}$.

Definition 81

The Alternate Hypothesis $H_1$ says that $Y \sim P_{Y|X=1}$.

With two possible hypotheses, there are two kinds of errors we can make.

Definition 82

A Type I error (false positive) is when we incorrectly reject the null hypothesis. The Type I error probability is then

$$\text{Pr}\left\{\hat{X}(Y) = 1 | X = 0\right\}$$

Definition 83

A Type II error (false negative) is when we incorrectly accept the null hypothesis. The Type II error probability is then

$$\text{Pr}\left\{\hat{X}(Y) = 0 | X = 1\right\}$$
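
Both error probabilities are straightforward to estimate by simulation. A minimal sketch, assuming a made-up Gaussian model ($Y \sim \mathcal{N}(0,1)$ under $X=0$ and $Y \sim \mathcal{N}(2,1)$ under $X=1$) and the rule that declares $\hat{X} = 1$ whenever $y > 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000

y_given_0 = rng.normal(0.0, 1.0, trials)  # samples of Y | X = 0
y_given_1 = rng.normal(2.0, 1.0, trials)  # samples of Y | X = 1

type_one = np.mean(y_given_0 > 1.0)   # Pr{Xhat(Y) = 1 | X = 0}
type_two = np.mean(y_given_1 <= 1.0)  # Pr{Xhat(Y) = 0 | X = 1}

print(type_one, type_two)  # both should be near 1 - Phi(1) ~ 0.159
```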

Our goal is to create a decision rule $\hat{X}: \mathcal{Y} \to \{0, 1\}$ that we can use to predict $X$. Depending on what the decision rule is used for, there will be requirements on how large the probabilities of Type I and Type II errors can be. We can formulate the search for a hypothesis test as an optimization: for some $\beta \in [0, 1]$, we want to find

$$\hat{X}_\beta(Y) = \text{argmin}\,\text{Pr}\left\{\hat{X}(Y)=0 | X=1\right\} \quad : \quad \text{Pr}\left\{\hat{X}(Y)=1 | X=0\right\} \leq \beta. \qquad (1)$$

Intuitively, our test should depend on $p_{Y|X}(y|1)$ and $p_{Y|X}(y|0)$ since these quantities tell us how likely our observations are under each possible ground truth. We can define a ratio that formally compares these two quantities.

Definition 84

The likelihood ratio is given by

$$L(y) = \frac{p_{Y|X}(y|1)}{p_{Y|X}(y|0)}$$
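
For example, under the same hypothetical Gaussian model used in the simulation above, the likelihood ratio has a closed form and is monotonically increasing in $y$ (a fact we will use again at the end of this section):

```python
from scipy.stats import norm

# Hypothetical model from the sketch above: p_{Y|X}(y|0) = N(0, 1)
# and p_{Y|X}(y|1) = N(2, 1).
def likelihood_ratio(y):
    return norm.pdf(y, loc=2.0) / norm.pdf(y, loc=0.0)

# For equal-variance Gaussians this simplifies to exp(2y - 2),
# which is monotonically increasing in y.
print(likelihood_ratio(0.0), likelihood_ratio(1.0), likelihood_ratio(2.0))
```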

Notice that we can write the ML estimate as a threshold on the likelihood ratio: if $L(y) \geq 1$, we declare $X = 1$, and otherwise $X = 0$. However, there is no particular reason that $1$ must always be the number at which we threshold the likelihood ratio, and so we can generalize this idea to form different kinds of tests.

Definition 85

For some threshold $c$ and randomization probability $\gamma$, a threshold test is of the form

$$\hat{X}(y) = \begin{cases} 1 & \text{if } L(y) > c\\ 0 & \text{if } L(y) < c\\ \text{Bernoulli}(\gamma) & \text{if } L(y) = c. \end{cases}$$
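
Definition 85 translates almost directly into code. A sketch where `L` is any likelihood-ratio function, and `c` and `gamma` are the free parameters of the test:

```python
import numpy as np

rng = np.random.default_rng(1)

def threshold_test(y, L, c, gamma):
    """Decide 1 if L(y) > c, 0 if L(y) < c, and flip a
    Bernoulli(gamma) coin on the boundary L(y) = c."""
    ratio = L(y)
    if ratio > c:
        return 1
    if ratio < c:
        return 0
    return int(rng.random() < gamma)

# e.g. with the Gaussian likelihood_ratio above:
# threshold_test(1.3, likelihood_ratio, c=1.0, gamma=0.5)
```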

MAP fits into the framework of a threshold test since, writing $\pi_i = \pi(i)$, we have

$$\hat{X}_{MAP}(y) = \begin{cases} 1 & \text{if } L(y) \geq \frac{\pi_0}{\pi_1}\\ 0 & \text{if } L(y) < \frac{\pi_0}{\pi_1} \end{cases}$$

It turns out that threshold tests are optimal with respect to solving Equation 1.

Theorem 44 (Neyman Pearson Lemma)

Given $\beta \in [0, 1]$, the optimal decision rule for

$$\hat{X}_\beta(Y) = \text{argmin}\,\text{Pr}\left\{\hat{X}(Y)=0 | X=1\right\} \quad : \quad \text{Pr}\left\{\hat{X}(Y)=1 | X=0\right\} \leq \beta$$

is a threshold test.

When $L(y)$ is monotonically increasing or decreasing, we can make the decision rule even simpler since it can be turned into a threshold on $y$ itself. For example, if $L(y)$ is monotonically increasing, then an optimal decision rule might be

$$\hat{X}(y) = \begin{cases} 1 & \text{if } y > c\\ 0 & \text{if } y < c\\ \text{Bernoulli}(\gamma) & \text{if } y = c. \end{cases}$$
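
Randomization matters when $Y$ is discrete, since no deterministic threshold may meet the Type I constraint with equality. Here is a sketch of this construction for a hypothetical binomial model where $L(y)$ is increasing in $y$, say $Y \sim \text{Binomial}(10, 0.5)$ under $X = 0$ and $Y \sim \text{Binomial}(10, 0.8)$ under $X = 1$:

```python
import numpy as np
from scipy.stats import binom

n, p0, beta = 10, 0.5, 0.05

# Smallest c with Pr{Y > c | X = 0} <= beta (L(y) is increasing in y,
# so thresholding L is the same as thresholding y).
tail = binom.sf(np.arange(n + 1), n, p0)  # Pr{Y > c} for c = 0, ..., n
c = int(np.argmax(tail <= beta))

# Pick gamma so the Type I error is exactly beta:
# Pr{Y > c | X = 0} + gamma * Pr{Y = c | X = 0} = beta.
gamma = (beta - binom.sf(c, n, p0)) / binom.pmf(c, n, p0)

print(c, gamma)  # decide 1 if y > c, 0 if y < c, Bernoulli(gamma) if y = c
```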

Figure 2: Inference Setup