Law of large numbers

Note that in this section we are dealing with random variables that are independent and identically distributed, abbreviated i.i.d. The law of large numbers studies the convergence of the average of a large number of i.i.d. random variables.

We first prove the following important lemma.

Lemma. [Chebyshev's Inequality] Let $X$ be a random variable with $E(X) < \infty$ and $\operatorname{Var}(X) < \infty$. Then for any $\epsilon > 0$, we have
$$P\big(|X - E(X)| \ge \epsilon\big) \le \frac{\operatorname{Var}(X)}{\epsilon^2}. \tag{5-1}$$
In other words, we have
$$P\big(|X - E(X)| < \epsilon\big) \ge 1 - \frac{\operatorname{Var}(X)}{\epsilon^2}. \tag{5-2}$$

Proof ▸

We assume $X$ is a discrete random variable; the argument extends easily to the case where $X$ is continuous. We denote $E(X) = \mu$ and write $f(x)$ for the p.m.f. of $X$. We first expand the LHS of (5-1) and obtain
$$P(|X - \mu| \ge \epsilon) = \sum_{|x - \mu| \ge \epsilon} f(x).$$
On the other hand, we have
$$\operatorname{Var}(X) = \sum_x (x - \mu)^2 f(x) \ge \sum_{|x - \mu| \ge \epsilon} (x - \mu)^2 f(x) \ge \sum_{|x - \mu| \ge \epsilon} \epsilon^2 f(x) = \epsilon^2 \sum_{|x - \mu| \ge \epsilon} f(x) = \epsilon^2\, P(|X - \mu| \ge \epsilon).$$
Therefore, we have $P(|X - \mu| \ge \epsilon) \le \dfrac{\operatorname{Var}(X)}{\epsilon^2}$. ◼

Chebyshev’s Inequality is the best possible inequality in the sense that, for any ϵ>0, it is possible to give an example of a random variable for which Chebyshev’s Inequality is in fact an equality.

Example. Fix $\epsilon > 0$ and suppose $X$ is a random variable with $f(\epsilon) = f(-\epsilon) = \frac{1}{2}$, i.e. $X$ takes the values $\pm\epsilon$ with equal probability. Clearly, $E(X) = 0 < \infty$ and $\operatorname{Var}(X) = E(X^2) = \epsilon^2 < \infty$. Therefore, $\dfrac{\operatorname{Var}(X)}{\epsilon^2} = 1$. Also note that $P(|X - \mu| \ge \epsilon) = 1$. The equality sign in Chebyshev's inequality holds, so no better general bound is possible.
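
The tightness of the bound can also be checked numerically. Below is a minimal sketch (assuming NumPy is available; the two distributions and the value of $\epsilon$ are illustrative choices, not part of the notes) comparing the empirical tail probability with the Chebyshev bound for the two-point variable above and for a uniform variable, where the bound is loose.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 2.0

# Two-point distribution X = ±eps with probability 1/2 each: Chebyshev is tight.
x_two_point = rng.choice([-eps, eps], size=100_000)

# A Uniform(-3, 3) variable for comparison: Chebyshev is loose here.
x_uniform = rng.uniform(-3, 3, size=100_000)

for name, x in [("two-point", x_two_point), ("uniform(-3,3)", x_uniform)]:
    mu, var = x.mean(), x.var()
    empirical = np.mean(np.abs(x - mu) >= eps)   # estimate of P(|X - mu| >= eps)
    bound = var / eps**2                          # Chebyshev upper bound
    print(f"{name:14s}  P(|X-mu|>=eps) ~ {empirical:.3f}   Var/eps^2 = {bound:.3f}")
```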

Theorem 1. [Law of large numbers] Consider a sequence of i.i.d. random variables $X_i$ with finite mean and variance. Denote $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$. Define $Q_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$. Then for any $\epsilon > 0$,
$$\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) = 0, \quad\text{or equivalently}\quad \lim_{n\to\infty} P(|Q_n - \mu| < \epsilon) = 1.$$
This means $Q_n$ converges to $\mu$ in probability.

Proof ▸

We notice that
$$E(Q_n) = \sum_{i=1}^{n} E\left(\frac{1}{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\, n\mu = \mu,$$
which shows that the expectation of $Q_n$ is the same as the expectation of the $X_i$. We also have
$$\operatorname{Var}(Q_n) = \operatorname{Var}\!\left(\frac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n},$$
where the second equality uses the independence of the $X_i$. Using Chebyshev's inequality, for any $\epsilon > 0$ we have
$$P(|Q_n - \mu| \ge \epsilon) \le \frac{\operatorname{Var}(Q_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}.$$
Therefore,
$$\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) \le \lim_{n\to\infty} \frac{\sigma^2}{n\epsilon^2} = 0.$$
Since the probability is nonnegative, we must have $\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) = 0$. This finishes the proof. ◼

This result is significant from the viewpoint of frequentist statistics. Recall that the probability of an event $A$ is motivated by $P(A) \approx N(A)/N$, where $N(A)$ and $N$ are the number of occurrences of $A$ and the total number of experiments, respectively. Now let $X_i = \mathbf{1}_A$, the indicator of the event $A$ in the $i$-th experiment. Since the experiments are independent, we are actually performing a series of Bernoulli trials, and each $X_i$ is a Bernoulli variable. Then we can write $N(A) = X_1 + \cdots + X_n$ with $N = n$, so
$$\frac{N(A)}{N} = \frac{1}{n}(X_1 + \cdots + X_n) = Q_n.$$
Note that $E(X_i) = P(A)$ and $\operatorname{Var}(X_i) = P(A) - P(A)^2$, both finite. Therefore, $N(A)/N \to P(A)$ in probability as $n \to \infty$.
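
This frequency interpretation is easy to simulate. Here is a minimal sketch, assuming NumPy; the event probability $p$ and the sample sizes are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3  # P(A), the probability we are estimating by relative frequency

# X_i = 1_A for independent repetitions of the experiment (Bernoulli trials).
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.random(n) < p          # indicator of A in each of n experiments
    q_n = x.mean()                 # Q_n = N(A)/N, the relative frequency
    print(f"n = {n:>6d}   N(A)/N = {q_n:.4f}   |Q_n - p| = {abs(q_n - p):.4f}")
```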

Remark. For the proof above to work, $\operatorname{Var}(X)$ must be finite. When the tails of the distribution are too heavy, the law itself may fail, as the following example shows.

Example. [Cauchy distribution] The Cauchy distribution is given by $f(x) = \dfrac{1}{\pi(1 + x^2)}$, where $1/\pi$ is the normalizing constant. Let $X$ be a random variable with the Cauchy distribution. Although the Cauchy density looks similar to the normal density, $X$ has no finite variance: the tails decay so slowly as $|x| \to \infty$ that $\int x^2 f(x)\,dx$ diverges. In fact even $E(X)$ is not defined, since $\int |x| f(x)\,dx$ also diverges; by symmetry the natural center of the distribution is $\mu = 0$. So the question is: does $Q_n$ converge to $\mu$? The answer is negative; indeed $Q_n$ has exactly the same Cauchy distribution as a single $X_i$ for every $n$, so it never concentrates. This example shows that when the variance is not finite, the law of large numbers may fail.
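
The failure can be observed numerically: the average of Cauchy samples keeps fluctuating no matter how large $n$ gets, while the average of normal samples settles down. A minimal sketch, assuming NumPy (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

for n in [100, 10_000, 1_000_000]:
    normal_avg = rng.normal(0, 1, n).mean()        # finite variance: concentrates near 0
    cauchy_avg = rng.standard_cauchy(n).mean()     # infinite variance (no mean): does not
    print(f"n = {n:>8d}   normal Q_n = {normal_avg:+.4f}   Cauchy Q_n = {cauchy_avg:+.4f}")
```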


Conditional distributions and conditional expectation

(This section is a supplement to the lecture.)

Definition 1. The conditional distribution function of $Y$ given $X = x$ is the function $F_{Y|X}(\cdot \mid x)$ given by
$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} \frac{f(x, v)}{f_X(x)}\, dv$$
for any $x$ such that $f_X(x) > 0$. It is sometimes denoted $P(Y \le y \mid X = x)$.

Remembering that distribution functions are integrals of density functions, we are led to the following definition.

Definition 2. The conditional density function of $F_{Y|X}$, written $f_{Y|X}$, is given by
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\, dy}$$
for any $x$ such that $f_X(x) > 0$.

Theorem 2. The conditional expectation $\psi(X) = E(Y \mid X)$ satisfies $E(\psi(X)) = E(Y)$.

Theorem 3. The conditional expectation $\psi(X) = E(Y \mid X)$ satisfies $E(\psi(X)\, g(X)) = E(Y g(X))$ for any function $g$ for which both expectations exist.
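
Theorems 2 and 3 can be verified by simulation for a joint distribution where $E(Y \mid X)$ is known in closed form. A minimal sketch, assuming NumPy; the particular choice $Y = X^2 + \text{noise}$ (so that $\psi(X) = X^2$) and $g = \sin$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.normal(0, 1, n)
y = x**2 + rng.normal(0, 1, n)     # here psi(X) = E(Y|X) = X^2 exactly

psi = x**2
g = np.sin(x)                      # any function g with finite expectations works

print("E(psi(X))    =", psi.mean(),       "   E(Y)      =", y.mean())        # Theorem 2
print("E(psi(X)g(X))=", (psi * g).mean(), "   E(Y g(X)) =", (y * g).mean())  # Theorem 3
```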

Functions of continuous random variables

Let $X$ be a random variable with density function $f$, and let $g : \mathbb{R} \to \mathbb{R}$ be a sufficiently nice function. Then $Y = g(X)$ is a random variable also. In order to calculate the distribution of $Y$, we proceed thus:
$$P(Y \le y) = P(g(X) \le y) = P\big(g(X) \in (-\infty, y]\big) = P\big(X \in g^{-1}(-\infty, y]\big) = \int_{g^{-1}(-\infty, y]} f(x)\, dx.$$
Here $g^{-1}$ is defined as follows: if $A \subseteq \mathbb{R}$, then $g^{-1}A = \{x \in \mathbb{R} : g(x) \in A\}$.

Example. Let $g(x) = ax + b$ for fixed $a, b \in \mathbb{R}$ with $a \ne 0$. Then $Y = g(X) = aX + b$ has distribution function
$$P(Y \le y) = P(aX + b \le y) = \begin{cases} P\big(X \le (y-b)/a\big) & \text{if } a > 0, \\ P\big(X \ge (y-b)/a\big) & \text{if } a < 0. \end{cases}$$
Differentiate to obtain $f_Y(y) = |a|^{-1} f_X\big((y-b)/a\big)$.
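
As a sanity check on the formula $f_Y(y) = |a|^{-1} f_X((y-b)/a)$, the following sketch (assuming NumPy and SciPy; the choices $X \sim N(0,1)$, $a = -2$, $b = 1$ are arbitrary) compares an empirical histogram of $Y = aX + b$ with the transformed density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
a, b = -2.0, 1.0
x = rng.normal(0, 1, 1_000_000)
y = a * x + b

# Empirical density of Y on a grid vs the predicted |a|^{-1} f_X((y-b)/a).
edges = np.linspace(-6, 8, 29)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = norm.pdf((centers - b) / a) / abs(a)

print("max |empirical - predicted| =", np.max(np.abs(hist - predicted)))
```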

More generally, if $X_1$ and $X_2$ have joint density function $f$, and $g, h$ are two functions mapping $\mathbb{R}^2 \to \mathbb{R}$, then we can use the Jacobian to find the joint density function of the pair $Y_1 = g(X_1, X_2)$, $Y_2 = h(X_1, X_2)$.

Let $y_1 = y_1(x_1, x_2)$, $y_2 = y_2(x_1, x_2)$ be a one-one mapping $T : (x_1, x_2) \mapsto (y_1, y_2)$ taking some domain $D \subseteq \mathbb{R}^2$ onto some range $R \subseteq \mathbb{R}^2$. The transformation can be inverted as $x_1 = x_1(y_1, y_2)$, $x_2 = x_2(y_1, y_2)$; the Jacobian of this inverse is defined to be the determinant
$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} \\[6pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = \frac{\partial x_1}{\partial y_1}\frac{\partial x_2}{\partial y_2} - \frac{\partial x_1}{\partial y_2}\frac{\partial x_2}{\partial y_1},$$
which we express as a function $J = J(y_1, y_2)$. We assume the partial derivatives are continuous.

Theorem 4. If $g : \mathbb{R}^2 \to \mathbb{R}$, and $T$ maps the set $A \subseteq D$ onto the set $B \subseteq R$, then
$$\iint_A g(x_1, x_2)\, dx_1\, dx_2 = \iint_B g\big(x_1(y_1, y_2),\, x_2(y_1, y_2)\big)\, |J(y_1, y_2)|\, dy_1\, dy_2.$$

Corollary 4.1. If $X_1$, $X_2$ have joint density function $f$, then the pair $Y_1, Y_2$ given by $(Y_1, Y_2) = T(X_1, X_2)$ has joint density function
$$f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} f\big(x_1(y_1, y_2),\, x_2(y_1, y_2)\big)\, |J(y_1, y_2)| & \text{if } (y_1, y_2) \text{ is in the range of } T, \\ 0 & \text{otherwise.} \end{cases}$$

A similar result holds for mappings of $\mathbb{R}^n$ into $\mathbb{R}^n$. This technique is sometimes referred to as the method of change of variables.

Example. Suppose that
$$X_1 = aY_1 + bY_2, \qquad X_2 = cY_1 + dY_2,$$
where $ad - bc \ne 0$. Check that
$$f_{Y_1,Y_2}(y_1, y_2) = |ad - bc|\, f_{X_1,X_2}(ay_1 + by_2,\, cy_1 + dy_2).$$
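
Here is a numerical check of Corollary 4.1 for this linear example. It is a sketch assuming NumPy and SciPy; the matrix entries and the choice of independent standard normal $X_1, X_2$ are illustrative, chosen so that the density of $(Y_1, Y_2)$ is also available directly for comparison:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

a, b, c, d = 1.0, 2.0, -1.0, 3.0              # ad - bc = 5 (nonzero)
A = np.array([[a, b], [c, d]])                # X1 = a Y1 + b Y2, X2 = c Y1 + d Y2

# Take X1, X2 independent N(0,1); then Y = A^{-1} X is N(0, A^{-1} A^{-T}).
direct = multivariate_normal(mean=[0.0, 0.0], cov=np.linalg.inv(A) @ np.linalg.inv(A).T)

def f_y_via_corollary(y1, y2):
    """f_{Y1,Y2}(y1, y2) = |ad - bc| f_{X1,X2}(a y1 + b y2, c y1 + d y2)."""
    x1, x2 = a * y1 + b * y2, c * y1 + d * y2
    return abs(a * d - b * c) * norm.pdf(x1) * norm.pdf(x2)

for y1, y2 in [(0.0, 0.0), (0.3, -0.2), (1.0, 0.5)]:
    print(f_y_via_corollary(y1, y2), direct.pdf([y1, y2]))   # the two values agree
```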

Multivariate normal distribution

Definition and properties

Definition 3. The vector $X = (X_1, X_2, \ldots, X_n)$ has the **multivariate normal distribution** (or **multinormal distribution**), written $N(\mu, V)$, if its joint density function is
$$f(x) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[-\tfrac{1}{2}(x - \mu)^T V^{-1} (x - \mu)\right], \qquad x \in \mathbb{R}^n,$$
where $V$ is a positive definite symmetric matrix.
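
The normalization in this density can be checked against a library implementation. A minimal sketch assuming NumPy and SciPy; the particular $\mu$, $V$, and evaluation point are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.6],
              [0.6, 1.0]])          # positive definite symmetric

def mvn_pdf(x, mu, V):
    """Evaluate f(x) = exp(-(x-mu)^T V^{-1} (x-mu) / 2) / sqrt((2*pi)^n |V|)."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(V) @ diff
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(V))
    return np.exp(-0.5 * quad) / norm_const

x = np.array([0.5, -1.0])
print(mvn_pdf(x, mu, V), multivariate_normal(mean=mu, cov=V).pdf(x))  # should agree
```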

Theorem 5. If X is N(μ,V), then

  1. $E(X) = \mu$, which is to say that $E(X_i) = \mu_i$ for all $i$,
  2. $V = (v_{ij})$ is called the covariance matrix, because $v_{ij} = \operatorname{cov}(X_i, X_j)$.

Theorem 6. If $X = (X_1, X_2, \ldots, X_n)$ is $N(\mu, V)$ and $Y = (Y_1, Y_2, \ldots, Y_m)$ is given by $Y = XD$ for some $n \times m$ matrix $D$ of rank $m \le n$, then $Y$ is $N(\mu D, D^T V D)$.

Definition 4. The vector $X = (X_1, X_2, \ldots, X_n)$ of random variables is said to have the **multivariate normal distribution** whenever, for all $a \in \mathbb{R}^n$, the linear combination $Xa^T = a_1 X_1 + \cdots + a_n X_n$ has a normal distribution.
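
Definition 4 can be illustrated by sampling: every linear combination $a_1 X_1 + \cdots + a_n X_n$ of a multinormal vector should itself look normal, with mean $\sum_i a_i \mu_i$ and variance $a^T V a$. A minimal sketch assuming NumPy and SciPy; $\mu$, $V$, and $a$ are arbitrary choices:

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 0.5]])
a = np.array([0.7, -1.2, 2.0])

X = rng.multivariate_normal(mu, V, size=200_000)   # rows are samples of (X1, X2, X3)
lin = X @ a                                        # a1 X1 + a2 X2 + a3 X3

m, s2 = a @ mu, a @ V @ a                          # predicted mean and variance
print("sample mean/var:", lin.mean(), lin.var(), "   predicted:", m, s2)
# Kolmogorov-Smirnov test of the standardized combination against the N(0,1) c.d.f.
print(kstest((lin - m) / np.sqrt(s2), norm.cdf))
```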

Distributions arising from the normal distribution

Suppose that $X_1, X_2, \ldots, X_n$ is a collection of independent $N(\mu, \sigma^2)$ variables for some fixed but unknown values of $\mu$ and $\sigma^2$. We can use them to estimate $\mu$ and $\sigma^2$.

Definition 5. The **sample mean** of a sequence of random variables $X_1, X_2, \ldots, X_n$ is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. It is usually used as a guess at the value of $\mu$.

Definition 6. The **sample variance** of a sequence of random variables $X_1, X_2, \ldots, X_n$ is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. It is usually used as a guess at the value of $\sigma^2$.

Remark. The sample mean and the sample variance have the property of being 'unbiased' in that $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$. Note that in some texts the sample variance is defined with $n$ in place of $(n-1)$.

Theorem 7. If $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables, then $\bar{X}$ and $S^2$ are independent. We have that $\bar{X}$ is $N(\mu, \sigma^2/n)$ and $(n-1)S^2/\sigma^2$ is $\chi^2(n-1)$.

Definition 7. If $X_1, X_2, \ldots, X_n$ are independent standard normal random variables, then the sum of their squares, $Q = \sum_{i=1}^{n} X_i^2$, is distributed according to the $\chi^2$ distribution with $n$ **degrees of freedom**. This is usually denoted as $Q \sim \chi^2(n)$ or $Q \sim \chi_n^2$. The probability density function (p.d.f.) of the $\chi^2$ distribution is
$$f(x; n) = \begin{cases} \dfrac{x^{n/2 - 1}\, e^{-x/2}}{2^{n/2}\, \Gamma(n/2)} & x > 0, \\ 0 & \text{otherwise.} \end{cases}$$
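
Both Definition 7 and the claim of Theorem 7 about $(n-1)S^2/\sigma^2$ can be checked by simulation. A minimal sketch assuming NumPy; $n$, $\mu$, $\sigma$, and the number of replications are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 200_000

# Definition 7: Q, a sum of n squared standard normals, is chi^2(n),
# so E(Q) = n and Var(Q) = 2n.
q = (rng.normal(0, 1, size=(reps, n)) ** 2).sum(axis=1)
print("E(Q)   =", q.mean(), "  (chi^2(n) mean     =", n, ")")
print("Var(Q) =", q.var(),  "  (chi^2(n) variance =", 2 * n, ")")

# Theorem 7: for N(mu, sigma^2) samples, (n-1) S^2 / sigma^2 is chi^2(n-1).
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)                  # sample variance S^2 with divisor n-1
stat = (n - 1) * s2 / sigma**2
print("E((n-1)S^2/sigma^2) =", stat.mean(), "  (chi^2(n-1) mean =", n - 1, ")")
```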

Sampling from a distribution

A basic way of generating a random variable with given distribution function is to use the following theorem.

Theorem 8. [Inverse transform technique] Let F be a distribution function, and let U be uniformly distributed on the interval [0,1].

  1. If $F$ is a continuous function, the random variable $X = F^{-1}(U)$ has distribution function $F$.
  2. Let $F$ be the distribution function of a random variable taking non-negative integer values. The random variable $X$ given by $X = k$ if and only if $F(k-1) < U \le F(k)$ has distribution function $F$.
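
A minimal sketch of both parts of the theorem, assuming NumPy and SciPy; the exponential and Poisson target distributions are illustrative choices:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
u = rng.random(100_000)                      # U uniform on [0, 1]

# Part 1: continuous F. For Exponential(lam), F(x) = 1 - exp(-lam x),
# so F^{-1}(u) = -log(1 - u) / lam.
lam = 2.0
x_exp = -np.log(1.0 - u) / lam
print("exponential: sample mean =", x_exp.mean(), "  target 1/lam =", 1 / lam)

# Part 2: integer-valued F. Set X = k iff F(k-1) < U <= F(k); here F is the
# Poisson(3) c.d.f. tabulated up to a generous cutoff.
ks = np.arange(0, 50)
F = poisson.cdf(ks, mu=3.0)
x_pois = np.searchsorted(F, u)               # smallest k with U <= F(k)
print("poisson:     sample mean =", x_pois.mean(), "  target =", 3.0)
```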