Expectation of discrete random variables

The intuition behind expectation is the average value of an experiment. Suppose we repeat an experiment $N$ times. The probability of each possible outcome $x$ is then approximated by the relative frequency \begin{equation} \mathbf{P}(x) \approx f(x) = \frac{\text{freq}(x)}{N}. \end{equation} Then the average outcome is \begin{equation} m \approx \frac{1}{N}\sum_{x} \text{freq}(x)x = \sum_x f(x)x. \end{equation}
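As a sanity check of this frequency picture, here is a minimal simulation sketch (the fair-die distribution and the sample size are illustrative assumptions, not from the text): the empirical average of $N$ outcomes is close to $\sum_x xf(x)$.

```python
# Frequency interpretation of expectation: the empirical average of N rolls
# of a fair die (an illustrative choice) approaches sum_x x*f(x) = 3.5.
import random
from collections import Counter

N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

freq = Counter(rolls)                                  # freq(x) for each outcome x
empirical_mean = sum(x * freq[x] for x in freq) / N    # (1/N) * sum_x freq(x) * x
exact_mean = sum(x * (1 / 6) for x in range(1, 7))     # sum_x x * f(x) = 3.5

print(empirical_mean, exact_mean)                      # e.g. 3.49... vs 3.5
```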

Definition 1. The mean value, or expectation, or expected value of the random variable $X$ with mass function $f$ is defined to be $$\begin{equation} \mathbf{E}(X) = \sum_{x:f(x)>0} xf(x) \end{equation}$$ whenever this sum is absolutely convergent.

Remark.

  1. For notational convenience, we also write $\mathbf{E}(X) = \sum_x xf(x)$.
  2. We require **absolute convergence** in order that $\mathbf{E}(X)$ be unchanged by reordering the $x_i$.

Lemma. If $X$ has mass function $f$ and $g:\mathbb{R}\to\mathbb{R}$, then $$\begin{equation} \mathbf{E}(g(X)) = \sum_x g(x) f(x) \end{equation}$$ whenever this sum is absolutely convergent.

Example. If $X$ is a random variable with mass function $f$, and $g(x) = x^2$, then $$\begin{equation} \mathbf{E}(X^2) = \sum_x g(x)f(x) = \sum_x x^2 f(x). \end{equation}$$
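The lemma translates directly into code. A minimal sketch, assuming a fair-die p.m.f. purely for illustration:

```python
# E(g(X)) = sum_x g(x) f(x), applied to g(x) = x and g(x) = x^2.
pmf = {x: 1 / 6 for x in range(1, 7)}        # hypothetical p.m.f. f(x) of a fair die

def expect(g, pmf):
    """Return E(g(X)) = sum over the support of g(x) * f(x)."""
    return sum(g(x) * f for x, f in pmf.items())

E_X = expect(lambda x: x, pmf)               # 3.5
E_X2 = expect(lambda x: x ** 2, pmf)         # 91/6 ≈ 15.1667
print(E_X, E_X2)
```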

Definition 2. If $k$ is a positive integer, the $k$th moment $m_k$ of $X$ is defined to be \begin{equation} m_k = \mathbf{E}(X^k). \end{equation} The $k$th central moment $\sigma_k$ is defined as \begin{equation} \sigma_k = \mathbf{E}\left( (X-m_1)^k \right) = \mathbf{E}\left( (X-\mathbf{E}(X))^k \right). \end{equation}

The two moments of most use are $m_1 = \mathbf{E}(X)$ and $\sigma_2 = \mathbf{E}\left( (X - \mathbf{E}(X))^2 \right)$, called the mean (or expectation) and variance of $X$. These two quantities are measures of the mean and dispersion of $X$; that is, $m_1$ is the average value of $X$, and $\sigma_2$ measures the amount by which $X$ tends to deviate from this average. The mean $m_1$ is often denoted $\mu$, and the variance of $X$ is often denoted $\mathbf{Var}(X)$. The positive square root $\sigma = \sqrt{\mathbf{Var}(X)}$ is called the standard deviation, and in this notation $\sigma_2 = \sigma^2$.

The central moments $\{\sigma_i\}$ can be expressed in terms of the ordinary moments $\{m_i\}$. For example, $\sigma_1 = 0$, and

\[\begin{equation} \begin{aligned} \sigma_{2} &=\sum_{x}\left(x-m_{1}\right)^{2} f(x) \\ &=\sum_{x} x^{2} f(x)-2 m_{1} \sum_{x} x f(x)+m_{1}^{2} \sum_{x} f(x) \\ &=m_{2}-m_{1}^{2} , \end{aligned} \end{equation}\]

which may be written as \(\begin{equation} \tag{4-1} \mathbf{Var}(X) = \mathbf{E}\left( (X-\mathbf{E}(X))^2 \right) = \mathbf{E}(X^2) - \mathbf{E}(X)^2. \end{equation}\)
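A minimal numerical check of Eq. (4-1) on an arbitrary p.m.f. (the mass function below is a made-up example):

```python
# Check Var(X) = E((X - E(X))^2) = E(X^2) - E(X)^2 on a small hypothetical p.m.f.
pmf = {0: 0.2, 1: 0.5, 3: 0.3}                              # f(x)

m1 = sum(x * f for x, f in pmf.items())                     # E(X)
m2 = sum(x ** 2 * f for x, f in pmf.items())                # E(X^2)
central = sum((x - m1) ** 2 * f for x, f in pmf.items())    # E((X - E(X))^2)

print(central, m2 - m1 ** 2)                                # both 1.24
```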

Example. [Binomial variables] Let $X$ be a random variable with binomial distribution. The p.m.f. is $$\begin{equation} f(k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0,\dots, n, \end{equation}$$ where $q = 1-p$. The expectation of $X$ is $$\begin{equation} \mathbf{E}(X) = \sum_{k=0}^n k f(k) = \sum_{k=0}^n k\binom{n}{k} p^k q^{n-k}. \end{equation}$$ We use the following algebraic identity to compute $\mathbf{E}(X)$: $$\begin{equation} \label{eq:4.2} \tag{4-2} \sum_{k=0}^n \binom{n}{k} x^k = (1+x)^n. \end{equation}$$ Differentiating this identity and multiplying by $x$, we obtain $$\begin{equation} \label{eq:4.3} \tag{4-3} \sum_{k=0}^n k \binom{n}{k} x^k = nx(1+x)^{n-1}. \end{equation}$$ We substitute $x = p / q$ and multiply by $q^n$ to obtain $\mathbf{E}(X) = np$. A similar argument shows that the variance of $X$ is given by $\mathbf{Var}(X) = npq$.
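A short sketch verifying $\mathbf{E}(X) = np$ and $\mathbf{Var}(X) = npq$ directly from the binomial p.m.f. (the values of $n$ and $p$ are arbitrary choices for illustration):

```python
# Verify E(X) = np and Var(X) = npq for the binomial p.m.f. by direct summation.
from math import comb

n, p = 10, 0.3
q = 1 - p
pmf = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}   # f(k)

mean = sum(k * f for k, f in pmf.items())                  # E(X)
var = sum(k ** 2 * f for k, f in pmf.items()) - mean ** 2  # E(X^2) - E(X)^2

print(mean, n * p)       # 3.0  3.0
print(var, n * p * q)    # 2.1  2.1
```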

We can think of the process of calculating expectations as a linear operator on the space of random variables.

Theorem 2. The expectation operator $\mathbf{E}$ has the following properties:

  1. if $X\geq 0$, then $\mathbf{E}(X) \geq 0$,
  2. if $a, b \in \mathbb{R}$, then $\mathbf{E}(aX+bY) = a\mathbf{E}(X) + b\mathbf{E}(Y)$,
  3. the random variable 1, taking the value 1 always, has expectation $\mathbf{E}(1) = 1$.

Proof ▸

We only prove the second property, which is also called linearity. We must use the joint p.m.f. of $X$ and $Y$ to compute the expectation. $$\begin{equation} \begin{split} \mathbf{E}(aX+bY) &= \sum_{i, j} (ax_i + by_j) f(x_i, y_j) \\ &= a \sum_{i,j} x_i f(x_i, y_j) + b\sum_{i,j} y_j f(x_i, y_j) \\ &= a\sum_{i} x_i f_X(x_i) + b\sum_{j}y_j f_Y(y_j) \\ &= a\mathbf{E}(X) + b\mathbf{E}(Y), \end{split} \end{equation}$$ where $f_X(x)$ and $f_Y(y)$ are the marginal p.m.f.s of $X$ and $Y$ respectively. ◼
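A minimal numerical illustration of linearity, using a small made-up joint p.m.f.; note that $X$ and $Y$ need not be independent for this property to hold:

```python
# Linearity of expectation E(aX + bY) = aE(X) + bE(Y), computed from a
# hypothetical joint p.m.f. f(x, y).
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.4, (1, 1): 0.2,
}
a, b = 2.0, -3.0

lhs = sum((a * x + b * y) * f for (x, y), f in joint.items())
E_X = sum(x * f for (x, y), f in joint.items())   # marginal sum over y
E_Y = sum(y * f for (x, y), f in joint.items())   # marginal sum over x

print(lhs, a * E_X + b * E_Y)                     # both -0.3
```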

Remark. It is NOT in general true that $\mathbf{E}(XY)$ is the same as $\mathbf{E}(X)\mathbf{E}(Y)$.

Lemma. If $X$ and $Y$ are independent, then $\mathbf{E}(XY) = \mathbf{E}(X)\mathbf{E}(Y)$.

Proof ▸

If $X$ and $Y$ are independent, then $f(x,y) = f_X(x) f_Y(y)$, so $$\begin{equation} \mathbf{E}(XY) = \sum_{i,j} x_i y_j f(x_i,y_j) = \left( \sum_{i} x_i f_X(x_i) \right) \left( \sum_{j} y_j f_Y(y_j) \right) = \mathbf{E}(X) \mathbf{E}(Y). \end{equation}$$ ◼

Definition 3. $X$ and $Y$ are called **uncorrelated** if $\mathbf{E}(XY) = \mathbf{E}(X)\mathbf{E}(Y)$.

Remark. Independent variables are uncorrelated. But the converse is **NOT** true.
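The standard counterexample for the converse (an illustrative assumption, not taken from the text) is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$: they are uncorrelated yet clearly dependent. A minimal sketch:

```python
# X uniform on {-1, 0, 1}, Y = X^2: uncorrelated but NOT independent.
pmf_X = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
joint = {(x, x ** 2): f for x, f in pmf_X.items()}    # mass sits on the curve y = x^2

E_X = sum(x * f for (x, y), f in joint.items())       # 0
E_Y = sum(y * f for (x, y), f in joint.items())       # 2/3
E_XY = sum(x * y * f for (x, y), f in joint.items())  # 0, so E(XY) = E(X)E(Y)

# Independence fails: f(0, 1) = 0, but f_X(0) * f_Y(1) = (1/3) * (2/3) != 0.
print(E_XY, E_X * E_Y)                                # 0.0  0.0
```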

Theorem 3. For random variables $X$ and $Y$,

  1. $\mathbf{Var}(aX) = a^2 \mathbf{Var}(X)$ for $a \in \mathbb{R}$,
  2. $\mathbf{Var}(X+Y) = \mathbf{Var}(X) + \mathbf{Var}(Y)$ if $X$ and $Y$ are uncorrelated.

Remark. The above theorem shows that the variance operator $\mathbf{Var}$ is **NOT** a linear operator, even when it is applied only to uncorrelated variables.
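A quick simulation sketch of Theorem 3 (the particular distributions of $X$ and $Y$ below are arbitrary illustrative choices):

```python
# Var(aX) = a^2 Var(X); Var(X + Y) = Var(X) + Var(Y) for independent
# (hence uncorrelated) samples X and Y.
import random

N, a = 200_000, 3.0
X = [random.gauss(0, 1) for _ in range(N)]      # illustrative choice of X
Y = [random.random() for _ in range(N)]         # independent of X

def var(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

print(var([a * x for x in X]), a ** 2 * var(X))             # approximately equal
print(var([x + y for x, y in zip(X, Y)]), var(X) + var(Y))  # approximately equal
```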

Sometimes the sum $S = \sum_x xf(x)$ does not converge absolutely, in which case the mean of the distribution does not exist. Here is an example.

Example. [A distribution without a mean] Let $X$ have mass function $$\begin{equation} f(k) = Ak^{-2}, \quad k = \pm 1, \pm 2, \dots, \end{equation}$$ where $A$ is chosen so that $\sum_k f(k) = 1$. The sum $\sum_k kf(k) = A\sum_{k\neq 0} k^{-1}$ does not converge absolutely, because both the positive and the negative parts diverge.
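To see the divergence numerically, here is a minimal sketch (here $A = 3/\pi^2$, since $\sum_{k \neq 0} k^{-2} = \pi^2/3$): the positive part of the sum grows like $A\log K$ without bound.

```python
# The positive part A * sum_{k=1..K} 1/k of sum_k k f(k) diverges as K grows,
# even though the symmetric partial sums cancel to zero.
from math import pi

A = 3 / pi ** 2               # normalizing constant: sum_{k != 0} A / k^2 = 1

def positive_part(K):
    return A * sum(1 / k for k in range(1, K + 1))

for K in (10, 1_000, 100_000):
    print(K, positive_part(K))   # grows roughly like A * log K
```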

The example above is a convenient point at which to note that we can base probability theory upon the expectation operator $\mathbf{E}$ rather than upon the probability measure $\mathbf{P}$. Roughly speaking, the way we proceed is to postulate axioms, such as (1)-(3) of Theorem 2, for a so-called “expectation operator” $\mathbf{E}$ acting on a space of “random variables”. The probability of an event can then be recaptured by defining $\mathbf{P}(A) = \mathbf{E}(I_A)$.

Recall that the indicator function of a set $A$ is defined as \(\begin{equation} I_A(\omega) = \begin{cases} 1 & \omega \in A, \\ 0 & \omega \not\in A, \end{cases} \end{equation}\) and that $\mathbf{E}(I_A) = \mathbf{P}(A)$.
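A minimal simulation sketch of the identity $\mathbf{P}(A) = \mathbf{E}(I_A)$; the event $A$ (an even die roll) is an illustrative assumption:

```python
# Estimate P(A) as the sample mean of the indicator I_A, here for the
# hypothetical event A = {fair die roll is even}.
import random

N = 100_000
indicator = [1 if random.randint(1, 6) % 2 == 0 else 0 for _ in range(N)]

print(sum(indicator) / N)     # approximately E(I_A) = P(A) = 0.5
```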

Dependence of discrete random variables

Definition 4. The **joint distribution function** $F:\mathbb{R}^2 \to [0,1]$ of $X$ and $Y$, where $X$ and $Y$ are discrete variables, is given by $$\begin{equation} F(x, y) = \mathbf{P}(X\leq x \text{ and } Y \leq y). \end{equation}$$ Their **joint mass function** $f:\mathbb{R}^2 \to [0,1]$ is given by $$\begin{equation} f(x,y) = \mathbf{P}(X = x \text{ and } Y = y). \end{equation}$$

We write $F_{X,Y}$ and $f_{X,Y}$ when we need to stress the role of $X$ and $Y$. We may think of the joint mass function in the following way. If $A_x = \{X = x\}$ and $B_y = \{Y = y\}$, then \(\begin{equation} f(x,y) = \mathbf{P}(A_x \cap B_y). \end{equation}\)

Lemma. The discrete random variables $X$ and $Y$ are **independent** if and only if $$\begin{equation} \tag{4-4} f_{X,Y}(x,y) = f_X(x)f_Y(y) \quad \forall x,y \in \mathbb{R}. \end{equation}$$ More generally, $X$ and $Y$ are independent if and only if $f_{X,Y}(x, y)$ can be **factorized as the product** $g(x)h(y)$ of a function of $x$ alone and a function of $y$ alone.

Remark. We stress that the factorization Eq.(4-4) must hold for all $x$ and $y$ in order that $X$ and $Y$ be independent.

Lemma. $$\begin{equation} \mathbf{E}(g(X, Y))=\sum_{x, y} g(x, y) f_{X, Y}(x, y). \end{equation}$$
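A minimal sketch combining the two lemmas above on a made-up joint p.m.f.: build $f_{X,Y}$ as the product of its marginals (so Eq. (4-4) holds by construction) and evaluate $\mathbf{E}(g(X,Y))$ for $g(x,y) = xy$:

```python
# Check the factorization f(x, y) = f_X(x) f_Y(y) and compute
# E(g(X, Y)) = sum_{x, y} g(x, y) f(x, y) for g(x, y) = x * y.
f_X = {0: 0.4, 1: 0.6}                     # hypothetical marginals
f_Y = {1: 0.5, 2: 0.5}
joint = {(x, y): f_X[x] * f_Y[y] for x in f_X for y in f_Y}

independent = all(abs(joint[(x, y)] - f_X[x] * f_Y[y]) < 1e-12
                  for x in f_X for y in f_Y)              # True by construction
E_XY = sum(x * y * f for (x, y), f in joint.items())      # 0.9

E_X = sum(x * p for x, p in f_X.items())
E_Y = sum(y * p for y, p in f_Y.items())
print(independent, E_XY, E_X * E_Y)                       # True 0.9 0.9
```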

Definition 5. The covariance of $X$ and $Y$ is $$\begin{equation} \mathbf{cov}(X,Y) = \mathbf{E}\left( (X-\mathbf{E}(X))(Y-\mathbf{E}(Y)) \right). \end{equation}$$ The correlation (coefficient) of $X$ and $Y$ is $$\begin{equation} \mathbf{corr}(X, Y) = \rho(X,Y) = \frac{\mathbf{cov}(X,Y)}{\sqrt{\mathbf{Var}(X)\mathbf{Var}(Y)}} \end{equation}$$ as long as the variances are non-zero.

Remark. Notice the following two equations.

  1. $\mathbf{cov}(X,X) = \mathbf{Var}(X)$,
  2. $\mathbf{cov}(X,Y) = \mathbf{E}(XY) - \mathbf{E}(X)\mathbf{E}(Y)$.

Covariance itself is not a satisfactory measure of dependence because the scale of values which $\mathbf{cov}(X, Y)$ may take contains no points which are clearly interpretable in terms of the relationship between $X$ and $Y$.
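A small worked sketch on a hypothetical joint p.m.f., computing the covariance via $\mathbf{cov}(X,Y) = \mathbf{E}(XY) - \mathbf{E}(X)\mathbf{E}(Y)$ and normalizing it to the correlation coefficient:

```python
# Covariance and correlation from a made-up joint p.m.f. f(x, y).
from math import sqrt

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def E(g):
    return sum(g(x, y) * f for (x, y), f in joint.items())

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - E_X * E_Y                 # 0.1
var_X = E(lambda x, y: x ** 2) - E_X ** 2               # 0.25
var_Y = E(lambda x, y: y ** 2) - E_Y ** 2               # 0.24

rho = cov / sqrt(var_X * var_Y)
print(cov, rho)      # rho ≈ 0.408, and indeed |rho| <= 1
```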

Theorem 4. [Cauchy-Schwarz inequality] For random variables $X$ and $Y$, $$\begin{equation} \mathbf{E}(XY)^2 \leq \mathbf{E}(X^2) \mathbf{E}(Y^2) \end{equation}$$ with equality if and only if $\mathbf{P}(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.

Proof ▸

For $a, b \in \mathbb{R}$, let $Z = aX - bY$. Then $$\begin{equation} 0 \leq \mathbf{E}(Z^2) = a^2 \mathbf{E}(X^2) - 2ab\mathbf{E}(XY) + b^2\mathbf{E}(Y^2). \end{equation}$$ The right-hand side is a quadratic in the variable $a$ that is non-negative for every $a$, so it has at most one real root and its discriminant must be non-positive. That is to say, if $b \neq 0$, $$\begin{equation} \mathbf{E}(XY)^2 - \mathbf{E}(X^2) \mathbf{E}(Y^2) \leq 0. \end{equation}$$ The discriminant is zero if and only if the quadratic has a real root. This occurs if and only if $$\begin{equation} \mathbf{E}\left( (aX-bY)^2 \right) = 0 \end{equation}$$ for some $a$ and $b$, that is, if and only if $\mathbf{P}(aX = bY) = 1$. ◼

We define $X' = X-\mathbf{E}(X)$ and $Y' = Y - \mathbf{E}(Y)$. Since the Cauchy-Schwarz inequality holds for every pair of random variables, it holds for $X'$ and $Y'$. Therefore, \(\begin{equation} \mathbf{E}(X'Y')^2 \leq \mathbf{E}(X'^2) \mathbf{E}(Y'^2) \quad \Leftrightarrow \quad \mathbf{cov}(X, Y)^2 \leq \mathbf{Var}(X)\mathbf{Var}(Y). \end{equation}\) Therefore, \(\begin{equation} \rho(X,Y)^2 \leq 1 \quad \Rightarrow \quad \rho(X,Y) \in [-1, 1], \end{equation}\) which gives the following lemma.

Lemma. The correlation coefficient $\rho$ satisfies $\left\vert \rho (X, Y) \right\vert \leq 1$ with equality if and only if $\mathbf{P}(aX + bY = c) = 1$ for some $a, b, c \in \mathbb{R}$.

Expectation of continuous random variables

Idea of translating expectation from discrete to continuous

Suppose we have a continuous random variable $X$ with probability density function $f$. We partition the range of $X$ into small intervals of width $\Delta x$, so that $p_i = \mathbf{P}(x_i \leq X < x_i + \Delta x) \approx f(x_i)\Delta x$; equivalently, $p_i / \Delta x$ approximates the density at $x_i$. Therefore, \(\begin{equation} \mathbf{E}(X) \approx \sum_{i} x_i p_i = \sum_{i} x_i f(x_i) \Delta x, \end{equation}\) which is a Riemann sum. Taking the limit $\Delta x \to 0$, we get \(\begin{equation} \mathbf{E}(X) = \int_{-\infty}^\infty x f(x)dx. \end{equation}\)
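A minimal numerical sketch of this Riemann-sum picture, assuming (purely for illustration) the exponential density $f(x) = e^{-x}$ for $x \geq 0$, whose mean is $1$:

```python
# Riemann-sum approximation E(X) ≈ sum_i x_i f(x_i) Δx for f(x) = e^{-x}, x >= 0.
from math import exp

dx = 0.001
xs = [i * dx for i in range(int(50 / dx))]     # truncate the tail at x = 50

riemann_mean = sum(x * exp(-x) * dx for x in xs)
print(riemann_mean)                            # ≈ 1.0 = ∫ x e^{-x} dx
```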

Expectation

Definition 6. The **expectation** of a continuous random variable $X$ with density function $f$ is given by $$\begin{equation} \mathbf{E}(X) = \int_{-\infty}^\infty xf(x) dx \end{equation}$$ whenever this integral exists.

Theorem 5. If $X$ and $g(X)$ are continuous random variables, then $$\begin{equation} \mathbf{E}\left( g(X) \right) = \int_{-\infty}^\infty g(x)f(x) dx. \end{equation}$$

Definition 7. The $k$th **moment** of a continuous variable $X$ is defined as $$\begin{equation} \mathbf{E}(X^k) = \int_{-\infty}^\infty x^k f(x) dx \end{equation}$$ whenever the integral converges.

Example. [Cauchy distribution] The random variable $X$ has the Cauchy distribution if it has density function $$\begin{equation} f(x) = \frac{1}{\pi (1+x^2)}, \quad x \in \mathbb{R}. \end{equation}$$ This distribution is notable for having no moments: the integral $\int_{-\infty}^\infty x^k f(x) dx$ fails to converge absolutely for every $k \geq 1$.
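A minimal simulation sketch of the consequence: because the mean does not exist, running sample averages of Cauchy variates never settle down (the inverse-CDF sampler and the seed are implementation choices for illustration):

```python
# Running averages of Cauchy samples do not converge to any limit.
import math
import random

def cauchy_sample():
    # Inverse-CDF sampling: F^{-1}(u) = tan(pi * (u - 1/2)).
    return math.tan(math.pi * (random.random() - 0.5))

random.seed(0)
running_sum = 0.0
for n in range(1, 1_000_001):
    running_sum += cauchy_sample()
    if n in (10, 1_000, 100_000, 1_000_000):
        print(n, running_sum / n)   # keeps jumping around instead of converging
```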

Dependence of continuous random variables

Definition 8. The **joint distribution function** of $X$ and $Y$ is the function $F: \mathbb{R}^2 \to [0, 1]$ given by $$\begin{equation} F(x,y) = \mathbf{P}(X\leq x, Y \leq y). \end{equation}$$

Definition 9. The random variables $X$ and $Y$ are **(jointly) continuous** with **joint (probability) density function** $f : \mathbb{R}^2 \to [0, \infty)$ if $$\begin{equation} F(x, y)=\int_{v=-\infty}^{y} \int_{u=-\infty}^{x} f(u, v) d u d v \quad \text{for each } x, y\in\mathbb{R}. \end{equation}$$ If $F$ is sufficiently differentiable at the point $(x , y)$, then we usually specify $$\begin{equation} f(x, y)=\frac{\partial^{2}}{\partial x \partial y} F(x, y). \end{equation}$$

Probabilities:

\[\begin{equation} \begin{aligned} \mathbf{P}(a \leq X \leq b, c \leq Y \leq d) &=F(b, d)-F(a, d)-F(b, c)+F(a, c) \\ &=\int_{y=c}^{d} \int_{x=a}^{b} f(x, y) d x d y. \end{aligned} \end{equation}\]

If $B$ is a sufficiently nice subset of $\mathbb{R}^2$, then \(\begin{equation} \mathbf{P} \left( (X, Y) \in B \right)=\iint_{B} f(x, y) d x d y. \end{equation}\)

Marginal distributions: The marginal distribution functions of $X$ and $Y$ are

\[\begin{equation} F_{X}(x)=\mathbf{P}(X \leq x)=F(x, \infty), \quad F_{Y}(y)=\mathbf{P}(Y \leq y)=F(\infty, y). \end{equation}\] \[\begin{equation} F_{X}(x)=\int_{-\infty}^{x}\left(\int_{-\infty}^{\infty} f(u, y) d y\right) d u. \end{equation}\]

Marginal density function of $X$ and $Y$: \(\begin{equation} f_{X}(x)=\int_{-\infty}^{\infty} f(x, y) d y, \quad f_{Y}(y)=\int_{-\infty}^{\infty} f(x, y) d x. \end{equation}\)
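A minimal numerical sketch of these formulas, assuming (for illustration only) the joint density $f(x,y) = x + y$ on $[0,1]^2$ and zero elsewhere: integrating out $y$ recovers the marginal $f_X(x) = x + \tfrac{1}{2}$, and a double integral gives $\mathbf{E}(XY) = \tfrac{1}{3}$.

```python
# Midpoint-rule approximations of a marginal density and of E(XY) for the
# hypothetical joint density f(x, y) = x + y on the unit square.
n = 400
h = 1.0 / n
grid = [(i + 0.5) * h for i in range(n)]

def f(x, y):
    return x + y                               # joint density (zero outside [0,1]^2)

def f_X(x):
    return sum(f(x, y) * h for y in grid)      # ≈ x + 1/2

E_XY = sum(x * y * f(x, y) * h * h for x in grid for y in grid)
print(f_X(0.3), E_XY)                          # ≈ 0.8 and ≈ 1/3
```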

Expectation: If $g: \mathbb{R}^2 \to \mathbb{R}$ is a sufficiently nice function, then

\[\begin{equation} \mathbf{E}(g(X, Y))=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y) d x d y; \end{equation}\]

in particular, setting $g(x, y) = ax + by$,

\[\begin{equation} \mathbf{E}(aX+bY) = a\mathbf{E}(X) + b\mathbf{E}(Y). \end{equation}\]

Independence: The random variables $X$ and $Y$ are independent if and only if

\[\begin{equation} F(x,y) = F_X(x) F_Y(y) \quad \forall x, y \in \mathbb{R}, \end{equation}\]

which, for continuous random variables, is equivalent to requiring that

\[\begin{equation} f(x,y) = f_X(x) f_Y(y). \end{equation}\]

Theorem 6. [Cauchy-Schwarz inequality] For any pair $X, Y$ of jointly continuous variables, we have that $$\begin{equation} \mathbf{E}(XY)^2 \leq \mathbf{E}(X^2) \mathbf{E}(Y^2), \end{equation}$$ with equality if and only if $\mathbf{P}(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.