Expectation of discrete random variables

The intuition behind expectation is the average value of an experiment. Suppose we repeat an experiment $N$ times. The probability of each possible outcome $x$ is then approximated by the relative frequency \begin{equation} \mathbf{P}(x) \approx f(x) = \frac{\text{freq}(x)}{N}. \end{equation} Then the average outcome is \begin{equation} m \approx \frac{1}{N}\sum_{x} \text{freq}(x)x = \sum_x f(x)x. \end{equation}
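As a sanity check of this frequency picture, here is a minimal simulation sketch (the fair-die distribution and the sample size are illustrative assumptions, not from the text): the empirical average of $N$ outcomes is close to $\sum_x xf(x)$.

```python
# Frequency interpretation of expectation: the empirical average of N rolls
# of a fair die (an illustrative choice) approaches sum_x x*f(x) = 3.5.
import random
from collections import Counter

N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

freq = Counter(rolls)                                  # freq(x) for each outcome x
empirical_mean = sum(x * freq[x] for x in freq) / N    # (1/N) * sum_x freq(x) * x
exact_mean = sum(x * (1 / 6) for x in range(1, 7))     # sum_x x * f(x) = 3.5

print(empirical_mean, exact_mean)                      # e.g. 3.49... vs 3.5
```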

Definition 1. The mean value, or expectation, or expected value of the random variable $X$ with mass function $f$ is defined to be $$\begin{equation} \mathbf{E}(X) = \sum_{x:f(x)>0} xf(x) \end{equation}$$ whenever this sum is absolutely convergent.

Remark.

  1. For notational convenience, we also write $\mathbf{E}(X) = \sum_x xf(x)$.
  2. We require **absolute convergence** in order that $\mathbf{E}(X)$ be unchanged by reordering the $x_i$.

Lemma. If $X$ has mass function $f$ and $g:\mathbb{R}\to\mathbb{R}$, then $$\begin{equation} \mathbf{E}(g(X)) = \sum_x g(x) f(x) \end{equation}$$ whenever this sum is absolutely convergent.

Example. If $X$ is a random variable with mass function $f$, and $g(x) = x^2$, then $$\begin{equation} \mathbf{E}(X^2) = \sum_x g(x)f(x) = \sum_x x^2 f(x). \end{equation}$$
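The lemma translates directly into code. A minimal sketch, assuming a fair-die p.m.f. purely for illustration:

```python
# E(g(X)) = sum_x g(x) f(x), applied to g(x) = x and g(x) = x^2.
pmf = {x: 1 / 6 for x in range(1, 7)}        # hypothetical p.m.f. f(x) of a fair die

def expect(g, pmf):
    """Return E(g(X)) = sum over the support of g(x) * f(x)."""
    return sum(g(x) * f for x, f in pmf.items())

E_X = expect(lambda x: x, pmf)               # 3.5
E_X2 = expect(lambda x: x ** 2, pmf)         # 91/6 ≈ 15.1667
print(E_X, E_X2)
```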

Definition 2. If $k$ is a positive integer, the $k$th moment $m_k$ of $X$ is defined to be \begin{equation} m_k = \mathbf{E}(X^k). \end{equation} The $k$th central moment $\sigma_k$ is defined as \begin{equation} \sigma_k = \mathbf{E}\left( (X-m_1)^k \right) = \mathbf{E}\left( (X-\mathbf{E}(X))^k \right). \end{equation}

The two moments of most use are $m_1 = \mathbf{E}(X)$ and $\sigma_2 = \mathbf{E}\left( (X - \mathbf{E}(X))^2 \right)$, called the mean (or expectation) and variance of $X$. These two quantities are measures of the mean and dispersion of $X$; that is, $m_1$ is the average value of $X$, and $\sigma_2$ measures the amount by which $X$ tends to deviate from this average. The mean $m_1$ is often denoted $\mu$, and the variance of $X$ is often denoted $\mathbf{Var}(X)$. The positive square root $\sigma = \sqrt{\mathbf{Var}(X)}$ is called the standard deviation, and in this notation $\sigma_2 = \sigma^2$.

The central moments $\{\sigma_i\}$ can be expressed in terms of the ordinary moments $\{m_i\}$. For example, $\sigma_1 = 0$, and

\[\begin{equation} \begin{aligned} \sigma_{2} &=\sum_{x}\left(x-m_{1}\right)^{2} f(x) \\ &=\sum_{x} x^{2} f(x)-2 m_{1} \sum_{x} x f(x)+m_{1}^{2} \sum_{x} f(x) \\ &=m_{2}-m_{1}^{2} , \end{aligned} \end{equation}\]

which may be written as \(\begin{equation} \tag{4-1} \mathbf{Var}(X) = \mathbf{E}\left( (X-\mathbf{E}(X))^2 \right) = \mathbf{E}(X^2) - \mathbf{E}(X)^2. \end{equation}\)
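A minimal numerical check of Eq. (4-1) on an arbitrary p.m.f. (the mass function below is a made-up example):

```python
# Check Var(X) = E((X - E(X))^2) = E(X^2) - E(X)^2 on a small hypothetical p.m.f.
pmf = {0: 0.2, 1: 0.5, 3: 0.3}                              # f(x)

m1 = sum(x * f for x, f in pmf.items())                     # E(X)
m2 = sum(x ** 2 * f for x, f in pmf.items())                # E(X^2)
central = sum((x - m1) ** 2 * f for x, f in pmf.items())    # E((X - E(X))^2)

print(central, m2 - m1 ** 2)                                # both 1.24
```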

Example. [Binomial variables] Let $X$ be a random variable with binomial distribution. The p.m.f. is $$\begin{equation} f(k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0,\dots, n, \end{equation}$$ where $q = 1-p$. The expectation of $X$ is $$\begin{equation} \mathbf{E}(X) = \sum_{k=0}^n k f(k) = \sum_{k=0}^n k\binom{n}{k} p^k q^{n-k}. \end{equation}$$ We use the following algebraic identity to compute $\mathbf{E}(X)$: $$\begin{equation} \label{eq:4.2} \tag{4-2} \sum_{k=0}^n \binom{n}{k} x^k = (1+x)^n. \end{equation}$$ Differentiating this identity and multiplying by $x$, we obtain $$\begin{equation} \label{eq:4.3} \tag{4-3} \sum_{k=0}^n k \binom{n}{k} x^k = nx(1+x)^{n-1}. \end{equation}$$ We substitute $x = p / q$ and multiply by $q^n$ to obtain $\mathbf{E}(X) = np$. A similar argument shows that the variance of $X$ is given by $\mathbf{Var}(X) = npq$.
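A short sketch verifying $\mathbf{E}(X) = np$ and $\mathbf{Var}(X) = npq$ directly from the binomial p.m.f. (the values of $n$ and $p$ are arbitrary choices for illustration):

```python
# Verify E(X) = np and Var(X) = npq for the binomial p.m.f. by direct summation.
from math import comb

n, p = 10, 0.3
q = 1 - p
pmf = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}   # f(k)

mean = sum(k * f for k, f in pmf.items())                  # E(X)
var = sum(k ** 2 * f for k, f in pmf.items()) - mean ** 2  # E(X^2) - E(X)^2

print(mean, n * p)       # 3.0  3.0
print(var, n * p * q)    # 2.1  2.1
```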

We can think of the process of calculating expectations as a linear operator on the space of random variables.

Theorem 2. The expectation operator $\mathbf{E}$ has the following properties:

  1. if $X\geq 0$, then $\mathbf{E}(X) \geq 0$,
  2. if $a, b \in \mathbb{R}$, then $\mathbf{E}(aX+bY) = a\mathbf{E}(X) + b\mathbf{E}(Y)$,
  3. the random variable 1, taking the value 1 always, has expectation $\mathbf{E}(1) = 1$.

Proof ▸

We only prove the second property, which is also called linearity. We must use the joint p.m.f. of $X$ and $Y$ to compute the expectation. $$\begin{equation} \begin{split} \mathbf{E}(aX+bY) &= \sum_{i, j} (ax_i + by_j) f(x_i, y_j) \\ &= a \sum_{i,j} x_i f(x_i, y_j) + b\sum_{i,j} y_j f(x_i, y_j) \\ &= a\sum_{i} x_i f_X(x_i) + b\sum_{j}y_j f_Y(y_j) \\ &= a\mathbf{E}(X) + b\mathbf{E}(Y), \end{split} \end{equation}$$ where $f_X(x)$ and $f_Y(y)$ are the marginal p.m.f.s of $X$ and $Y$ respectively. ◼
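A minimal numerical illustration of linearity, using a small made-up joint p.m.f.; note that $X$ and $Y$ need not be independent for this property to hold:

```python
# Linearity of expectation E(aX + bY) = aE(X) + bE(Y), computed from a
# hypothetical joint p.m.f. f(x, y).
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.4, (1, 1): 0.2,
}
a, b = 2.0, -3.0

lhs = sum((a * x + b * y) * f for (x, y), f in joint.items())
E_X = sum(x * f for (x, y), f in joint.items())   # marginal sum over y
E_Y = sum(y * f for (x, y), f in joint.items())   # marginal sum over x

print(lhs, a * E_X + b * E_Y)                     # both -0.3
```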

Remark. It is NOT in general true that $\mathbf{E}(XY)$ is the same as $\mathbf{E}(X)\mathbf{E}(Y)$.

Lemma. If $X$ and $Y$ are independent, then $\mathbf{E}(XY) = \mathbf{E}(X)\mathbf{E}(Y)$.

Proof ▸

If $X$ and $Y$ are independent, then $f(x,y) = f_X(x) f_Y(y)$, so $$\begin{equation} \mathbf{E}(XY) = \sum_{i,j} x_i y_j f(x_i,y_j) = \left( \sum_{i} x_i f_X(x_i) \right) \left( \sum_{j} y_j f_Y(y_j) \right) = \mathbf{E}(X) \mathbf{E}(Y). \end{equation}$$ ◼

Definition 3. $X$ and $Y$ are called **uncorrelated** if $\mathbf{E}(XY) = \mathbf{E}(X)\mathbf{E}(Y)$.

Remark. Independent variables are uncorrelated. But the converse is **NOT** true.
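The standard counterexample for the converse (an illustrative assumption, not taken from the text) is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$: they are uncorrelated yet clearly dependent. A minimal sketch:

```python
# X uniform on {-1, 0, 1}, Y = X^2: uncorrelated but NOT independent.
pmf_X = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
joint = {(x, x ** 2): f for x, f in pmf_X.items()}    # mass sits on the curve y = x^2

E_X = sum(x * f for (x, y), f in joint.items())       # 0
E_Y = sum(y * f for (x, y), f in joint.items())       # 2/3
E_XY = sum(x * y * f for (x, y), f in joint.items())  # 0, so E(XY) = E(X)E(Y)

# Independence fails: f(0, 1) = 0, but f_X(0) * f_Y(1) = (1/3) * (2/3) != 0.
print(E_XY, E_X * E_Y)                                # 0.0  0.0
```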

Theorem 3. For random variables $X$ and $Y$,

  1. $\mathbf{Var}(aX) = a^2 \mathbf{Var}(X)$ for $a \in \mathbb{R}$,
  2. $\mathbf{Var}(X+Y) = \mathbf{Var}(X) + \mathbf{Var}(Y)$ if $X$ and $Y$ are uncorrelated.

Remark. The above theorem shows that the variance operator $\mathbf{Var}$ is **NOT** a linear operator, even when it is applied only to uncorrelated variables.
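A quick simulation sketch of Theorem 3 (the particular distributions of $X$ and $Y$ below are arbitrary illustrative choices):

```python
# Var(aX) = a^2 Var(X); Var(X + Y) = Var(X) + Var(Y) for independent
# (hence uncorrelated) samples X and Y.
import random

N, a = 200_000, 3.0
X = [random.gauss(0, 1) for _ in range(N)]      # illustrative choice of X
Y = [random.random() for _ in range(N)]         # independent of X

def var(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

print(var([a * x for x in X]), a ** 2 * var(X))             # approximately equal
print(var([x + y for x, y in zip(X, Y)]), var(X) + var(Y))  # approximately equal
```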

Sometimes the sum $S = \sum_x xf(x)$ does not converge absolutely, in which case the mean of the distribution does not exist. Here is an example.

Example. [A distribution without a mean] Let $X$ have mass function $$\begin{equation} f(k) = Ak^{-2}, \quad k = \pm 1, \pm 2, \dots, \end{equation}$$ where $A$ is chosen so that $\sum_k f(k) = 1$. The sum $\sum_k kf(k) = A\sum_{k\neq 0} k^{-1}$ does not converge absolutely, because both the positive and the negative parts diverge.
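To see the divergence numerically, here is a minimal sketch (here $A = 3/\pi^2$, since $\sum_{k \neq 0} k^{-2} = \pi^2/3$): the positive part of the sum grows like $A\log K$ without bound.

```python
# The positive part A * sum_{k=1..K} 1/k of sum_k k f(k) diverges as K grows,
# even though the symmetric partial sums cancel to zero.
from math import pi

A = 3 / pi ** 2               # normalizing constant: sum_{k != 0} A / k^2 = 1

def positive_part(K):
    return A * sum(1 / k for k in range(1, K + 1))

for K in (10, 1_000, 100_000):
    print(K, positive_part(K))   # grows roughly like A * log K
```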

The example above is a convenient point at which to note that we can base probability theory upon the expectation operator $\mathbf{E}$ rather than upon the probability measure $\mathbf{P}$. Roughly speaking, the way we proceed is to postulate axioms, such as (1)-(3) of Theorem 2, for a so-called “expectation operator” $\mathbf{E}$ acting on a space of “random variables”. The probability of an event can then be recaptured by defining $\mathbf{P}(A) = \mathbf{E}(I_A)$.

Recall that the indicator function of a set $A$ is defined as \(\begin{equation} I_A(\omega) = \begin{cases} 1 & \omega \in A, \\ 0 & \omega \not\in A, \end{cases} \end{equation}\) and that $\mathbf{E}(I_A) = \mathbf{P}(A)$.
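A minimal simulation sketch of the identity $\mathbf{P}(A) = \mathbf{E}(I_A)$; the event $A$ (an even die roll) is an illustrative assumption:

```python
# Estimate P(A) as the sample mean of the indicator I_A, here for the
# hypothetical event A = {fair die roll is even}.
import random

N = 100_000
indicator = [1 if random.randint(1, 6) % 2 == 0 else 0 for _ in range(N)]

print(sum(indicator) / N)     # approximately E(I_A) = P(A) = 0.5
```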

Dependence of discrete random variables

Definition 4. The **joint distribution function** $F:\mathbb{R}^2 \to [0,1]$ of $X$ and $Y$, where $X$ and $Y$ are discrete variables, is given by $$\begin{equation} F(x, y) = \mathbf{P}(X\leq x \text{ and } Y \leq y). \end{equation}$$ Their **joint mass function** $f:\mathbb{R}^2 \to [0,1]$ is given by $$\begin{equation} f(x,y) = \mathbf{P}(X = x \text{ and } Y = y). \end{equation}$$

We write $F_{X,Y}$ and $f_{X,Y}$ when we need to stress the role of $X$ and $Y$. We may think of the joint mass function in the following way. If $A_x = \{X = x\}$ and $B_y = \{Y = y\}$, then \(\begin{equation} f(x,y) = \mathbf{P}(A_x \cap B_y). \end{equation}\)

Lemma. The discrete random variables $X$ and $Y$ are **independent** if and only if $$\begin{equation} \tag{4-4} f_{X,Y}(x,y) = f_X(x)f_Y(y) \quad \forall x,y \in \mathbb{R}. \end{equation}$$ More generally, $X$ and $Y$ are independent if and only if $f_{X,Y}(x, y)$ can be **factorized as the product** $g(x)h(y)$ of a function of $x$ alone and a function of $y$ alone.

Remark. We stress that the factorization Eq.(4-4) must hold for all $x$ and $y$ in order that $X$ and $Y$ be independent.

Lemma. $$\begin{equation} \mathbf{E}(g(X, Y))=\sum_{x, y} g(x, y) f_{X, Y}(x, y). \end{equation}$$
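A minimal sketch combining the two lemmas above on a made-up joint p.m.f.: build $f_{X,Y}$ as the product of its marginals (so Eq. (4-4) holds by construction) and evaluate $\mathbf{E}(g(X,Y))$ for $g(x,y) = xy$:

```python
# Check the factorization f(x, y) = f_X(x) f_Y(y) and compute
# E(g(X, Y)) = sum_{x, y} g(x, y) f(x, y) for g(x, y) = x * y.
f_X = {0: 0.4, 1: 0.6}                     # hypothetical marginals
f_Y = {1: 0.5, 2: 0.5}
joint = {(x, y): f_X[x] * f_Y[y] for x in f_X for y in f_Y}

independent = all(abs(joint[(x, y)] - f_X[x] * f_Y[y]) < 1e-12
                  for x in f_X for y in f_Y)              # True by construction
E_XY = sum(x * y * f for (x, y), f in joint.items())      # 0.9

E_X = sum(x * p for x, p in f_X.items())
E_Y = sum(y * p for y, p in f_Y.items())
print(independent, E_XY, E_X * E_Y)                       # True 0.9 0.9
```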

Definition 5. The covariance of $X$ and $Y$ is $$\begin{equation} \mathbf{cov}(X,Y) = \mathbf{E}\left( (X-\mathbf{E}(X))(Y-\mathbf{E}(Y)) \right). \end{equation}$$ The correlation (coefficient) of $X$ and $Y$ is $$\begin{equation} \mathbf{corr}(X, Y) = \rho(X,Y) = \frac{\mathbf{cov}(X,Y)}{\sqrt{\mathbf{Var}(X)\mathbf{Var}(Y)}} \end{equation}$$ as long as the variances are non-zero.

Remark. Notice the following two equations.

  1. $\mathbf{cov}(X,X) = \mathbf{Var}(X)$,
  2. $\mathbf{cov}(X,Y) = \mathbf{E}(XY) - \mathbf{E}(X)\mathbf{E}(Y)$.

Covariance itself is not a satisfactory measure of dependence because the scale of values which $\mathbf{cov}(X, Y)$ may take contains no points which are clearly interpretable in terms of the relationship between $X$ and $Y$.
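A small worked sketch on a hypothetical joint p.m.f., computing the covariance via $\mathbf{cov}(X,Y) = \mathbf{E}(XY) - \mathbf{E}(X)\mathbf{E}(Y)$ and normalizing it to the correlation coefficient:

```python
# Covariance and correlation from a made-up joint p.m.f. f(x, y).
from math import sqrt

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def E(g):
    return sum(g(x, y) * f for (x, y), f in joint.items())

E_X, E_Y = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - E_X * E_Y                 # 0.1
var_X = E(lambda x, y: x ** 2) - E_X ** 2               # 0.25
var_Y = E(lambda x, y: y ** 2) - E_Y ** 2               # 0.24

rho = cov / sqrt(var_X * var_Y)
print(cov, rho)      # rho ≈ 0.408, and indeed |rho| <= 1
```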

Theorem 4. [Cauchy-Schwarz inequality] For random variables $X$ and $Y$, $$\begin{equation} \mathbf{E}(XY)^2 \leq \mathbf{E}(X^2) \mathbf{E}(Y^2) \end{equation}$$ with equality if and only if $\mathbf{P}(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.

Proof ▸

For $a, b \in \mathbb{R}$, let $Z = aX - bY$. Then $$\begin{equation} 0 \leq \mathbf{E}(Z^2) = a^2 \mathbf{E}(X^2) - 2ab\mathbf{E}(XY) + b^2\mathbf{E}(Y^2). \end{equation}$$ The right-hand side is a quadratic in the variable $a$ that is non-negative for every $a$, so it has at most one real root and its discriminant must be non-positive. That is to say, if $b \neq 0$, $$\begin{equation} \mathbf{E}(XY)^2 - \mathbf{E}(X^2) \mathbf{E}(Y^2) \leq 0. \end{equation}$$ The discriminant is zero if and only if the quadratic has a real root. This occurs if and only if $$\begin{equation} \mathbf{E}\left( (aX-bY)^2 \right) = 0 \end{equation}$$ for some $a$ and $b$, that is, if and only if $\mathbf{P}(aX = bY) = 1$. ◼

We define $X' = X-\mathbf{E}(X)$ and $Y' = Y - \mathbf{E}(Y)$. Since the Cauchy-Schwarz inequality holds for every pair of random variables, it holds for $X'$ and $Y'$. Therefore, \(\begin{equation} \mathbf{E}(X'Y')^2 \leq \mathbf{E}(X'^2) \mathbf{E}(Y'^2) \quad \Leftrightarrow \quad \mathbf{cov}(X, Y)^2 \leq \mathbf{Var}(X)\mathbf{Var}(Y). \end{equation}\) Therefore, \(\begin{equation} \rho(X,Y)^2 \leq 1 \quad \Rightarrow \quad \rho(X,Y) \in [-1, 1], \end{equation}\) which gives the following lemma.

Lemma. The correlation coefficient $\rho$ satisfies $\left\vert \rho (X, Y) \right\vert \leq 1$ with equality if and only if $\mathbf{P}(aX + bY = c) = 1$ for some $a, b, c \in \mathbb{R}$.

Expectation of continuous random variables

Idea of translating expectation from discrete to continuous

Suppose we have a continuous random variable $X$ with probability density function $f$. We partition the range of $X$ into small intervals of width $\Delta x$, so that $p_i = \mathbf{P}(x_i \leq X < x_i + \Delta x) \approx f(x_i)\Delta x$; equivalently, $p_i / \Delta x$ approximates the density at $x_i$. Therefore, \(\begin{equation} \mathbf{E}(X) \approx \sum_{i} x_i p_i = \sum_{i} x_i f(x_i) \Delta x, \end{equation}\) which is a Riemann sum. Taking the limit $\Delta x \to 0$, we get \(\begin{equation} \mathbf{E}(X) = \int_{-\infty}^\infty x f(x)dx. \end{equation}\)
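A minimal numerical sketch of this Riemann-sum picture, assuming (purely for illustration) the exponential density $f(x) = e^{-x}$ for $x \geq 0$, whose mean is $1$:

```python
# Riemann-sum approximation E(X) ≈ sum_i x_i f(x_i) Δx for f(x) = e^{-x}, x >= 0.
from math import exp

dx = 0.001
xs = [i * dx for i in range(int(50 / dx))]     # truncate the tail at x = 50

riemann_mean = sum(x * exp(-x) * dx for x in xs)
print(riemann_mean)                            # ≈ 1.0 = ∫ x e^{-x} dx
```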

Expectation

Definition 6. The **expectation** of a continuous random variable $X$ with density function $f$ is given by $$\begin{equation} \mathbf{E}(X) = \int_{-\infty}^\infty xf(x) dx \end{equation}$$ whenever this integral exists.

Theorem 5. If $X$ and $g(X)$ are continuous random variables, then $$\begin{equation} \mathbf{E}\left( g(X) \right) = \int_{-\infty}^\infty g(x)f(x) dx. \end{equation}$$

Definition 7. The $k$th **moment** of a continuous variable $X$ is defined as $$\begin{equation} \mathbf{E}(X^k) = \int_{-\infty}^\infty x^k f(x) dx \end{equation}$$ whenever the integral converges.

Example. [Cauchy distribution] The random variable $X$ has the Cauchy distribution if it has density function $$\begin{equation} f(x) = \frac{1}{\pi (1+x^2)}, \quad x \in \mathbb{R}. \end{equation}$$ This distribution is notable for having no moments: the integral $\int_{-\infty}^\infty x^k f(x) dx$ fails to converge absolutely for every $k \geq 1$.
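A minimal simulation sketch of the consequence: because the mean does not exist, running sample averages of Cauchy variates never settle down (the inverse-CDF sampler and the seed are implementation choices for illustration):

```python
# Running averages of Cauchy samples do not converge to any limit.
import math
import random

def cauchy_sample():
    # Inverse-CDF sampling: F^{-1}(u) = tan(pi * (u - 1/2)).
    return math.tan(math.pi * (random.random() - 0.5))

random.seed(0)
running_sum = 0.0
for n in range(1, 1_000_001):
    running_sum += cauchy_sample()
    if n in (10, 1_000, 100_000, 1_000_000):
        print(n, running_sum / n)   # keeps jumping around instead of converging
```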

Dependence of continuous random variables

Definition 8. The **joint distribution function** of $X$ and $Y$ is the function $F: \mathbb{R}^2 \to [0, 1]$ given by $$\begin{equation} F(x,y) = \mathbf{P}(X\leq x, Y \leq y). \end{equation}$$

Definition 9. The random variables $X$ and $Y$ are **(jointly) continuous** with **joint (probability) density function** $f : \mathbb{R}^2 \to [0, \infty)$ if $$\begin{equation} F(x, y)=\int_{v=-\infty}^{y} \int_{u=-\infty}^{x} f(u, v) d u d v \quad \text{for each } x, y\in\mathbb{R}. \end{equation}$$ If $F$ is sufficiently differentiable at the point $(x , y)$, then we usually specify $$\begin{equation} f(x, y)=\frac{\partial^{2}}{\partial x \partial y} F(x, y). \end{equation}$$

Probabilities:

\[\begin{equation} \begin{aligned} \mathbf{P}(a \leq X \leq b, c \leq Y \leq d) &=F(b, d)-F(a, d)-F(b, c)+F(a, c) \\ &=\int_{y=c}^{d} \int_{x=a}^{b} f(x, y) d x d y. \end{aligned} \end{equation}\]

If $B$ is a sufficiently nice subset of $\mathbb{R}^2$, then \(\begin{equation} \mathbf{P} \left( (X, Y) \in B \right)=\iint_{B} f(x, y) d x d y. \end{equation}\)

Marginal distributions: The marginal distribution functions of $X$ and $Y$ are

\[\begin{equation} F_{X}(x)=\mathbf{P}(X \leq x)=F(x, \infty), \quad F_{Y}(y)=\mathbf{P}(Y \leq y)=F(\infty, y). \end{equation}\] \[\begin{equation} F_{X}(x)=\int_{-\infty}^{x}\left(\int_{-\infty}^{\infty} f(u, y) d y\right) d u. \end{equation}\]

Marginal density function of $X$ and $Y$: \(\begin{equation} f_{X}(x)=\int_{-\infty}^{\infty} f(x, y) d y, \quad f_{Y}(y)=\int_{-\infty}^{\infty} f(x, y) d x. \end{equation}\)
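A minimal numerical sketch of these formulas, assuming (for illustration only) the joint density $f(x,y) = x + y$ on $[0,1]^2$ and zero elsewhere: integrating out $y$ recovers the marginal $f_X(x) = x + \tfrac{1}{2}$, and a double integral gives $\mathbf{E}(XY) = \tfrac{1}{3}$.

```python
# Midpoint-rule approximations of a marginal density and of E(XY) for the
# hypothetical joint density f(x, y) = x + y on the unit square.
n = 400
h = 1.0 / n
grid = [(i + 0.5) * h for i in range(n)]

def f(x, y):
    return x + y                               # joint density (zero outside [0,1]^2)

def f_X(x):
    return sum(f(x, y) * h for y in grid)      # ≈ x + 1/2

E_XY = sum(x * y * f(x, y) * h * h for x in grid for y in grid)
print(f_X(0.3), E_XY)                          # ≈ 0.8 and ≈ 1/3
```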

Expectation: If $g: \mathbb{R}^2 \to \mathbb{R}$ is a sufficiently nice function, then

\[\begin{equation} \mathbf{E}(g(X, Y))=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y) d x d y; \end{equation}\]

in particular, setting $g(x, y) = ax + by$,

\[\begin{equation} \mathbf{E}(aX+bY) = a\mathbf{E}(X) + b\mathbf{E}(Y). \end{equation}\]

Independence: The random variables $X$ and $Y$ are independent if and only if

\[\begin{equation} F(x,y) = F_X(x) F_Y(y) \quad \forall x, y \in \mathbb{R}, \end{equation}\]

which, for continuous random variables, is equivalent to requiring that

\[\begin{equation} f(x,y) = f_X(x) f_Y(y). \end{equation}\]

Theorem 6. [Cauchy-Schwarz inequality] For any pair $X, Y$ of jointly continuous variables, we have that $$\begin{equation} \mathbf{E}(XY)^2 \leq \mathbf{E}(X^2) \mathbf{E}(Y^2), \end{equation}$$ with equality if and only if $\mathbf{P}(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.