Expectation of discrete random variables

The intuition behind expectation is the average value of an experiment. Suppose we repeat an experiment $N$ times. The probability of each possible outcome $x$ can be approximated by $P(x) \approx f(x) = \frac{\mathrm{freq}(x)}{N}$. Then the average outcome is $m \approx \frac{1}{N}\sum_x \mathrm{freq}(x)\,x = \sum_x f(x)\,x$.
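As a quick illustrative sketch of this frequency interpretation (the fair-die example and all names below are our own, not from the text):

```python
import random
from collections import Counter

# Approximate E(X) for a fair six-sided die: repeat the experiment N times and
# average the outcomes, i.e. sum over x of (freq(x) / N) * x.
N = 100_000
outcomes = [random.randint(1, 6) for _ in range(N)]
freq = Counter(outcomes)

empirical_mean = sum((freq[x] / N) * x for x in freq)
exact_mean = sum(range(1, 7)) / 6  # = 3.5

print(f"empirical ~ {empirical_mean:.3f}, exact = {exact_mean}")
```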

Definition 1. The mean value, or expectation, or expected value of the random variable $X$ with mass function $f$ is defined to be $E(X) = \sum_{x:\, f(x) > 0} x f(x)$ whenever this sum is absolutely convergent.

Remark.

  1. For notational convenience, we also write $E(X) = \sum_x x f(x)$.
  2. We require **absolute convergence** in order that $E(X)$ be unchanged by reordering the $x_i$.

Lemma. If $X$ has mass function $f$ and $g : \mathbb{R} \to \mathbb{R}$, then $E(g(X)) = \sum_x g(x) f(x)$ whenever this sum is absolutely convergent.

Example. If $X$ is a random variable with mass function $f$, and $g(x) = x^2$, then $E(X^2) = \sum_x g(x) f(x) = \sum_x x^2 f(x)$.

Definition 2. If $k$ is a positive integer, the $k$th moment $m_k$ of $X$ is defined to be $m_k = E(X^k)$. The $k$th central moment $\sigma_k$ is defined as $\sigma_k = E((X - m_1)^k) = E((X - E(X))^k)$.

The two moments of most use are $m_1 = E(X)$ and $\sigma_2 = E((X - E(X))^2)$, called the mean (or expectation) and variance of $X$. These two quantities are measures of the mean and dispersion of $X$; that is, $m_1$ is the average value of $X$, and $\sigma_2$ measures the amount by which $X$ tends to deviate from this average. The mean $m_1$ is often denoted $\mu$, and the variance of $X$ is often denoted $\mathrm{Var}(X)$. The positive square root $\sigma = \sqrt{\mathrm{Var}(X)}$ is called the standard deviation, and in this notation $\sigma_2 = \sigma^2$.
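For concreteness, here is a minimal sketch computing $m_1$, $\sigma_2$ and $\sigma$ directly from a mass function, using a fair die as an invented example; it also checks the identity $\mathrm{Var}(X) = E(X^2) - E(X)^2$ derived below.

```python
# Compute m1 = E(X), sigma_2 = Var(X) and the standard deviation directly from
# a mass function; the fair die below is an invented example.
pmf = {x: 1 / 6 for x in range(1, 7)}

m1 = sum(x * p for x, p in pmf.items())                # mean
m2 = sum(x**2 * p for x, p in pmf.items())             # second moment E(X^2)
sigma2 = sum((x - m1)**2 * p for x, p in pmf.items())  # second central moment

assert abs(sigma2 - (m2 - m1**2)) < 1e-9               # Var(X) = E(X^2) - E(X)^2
print(m1, sigma2, sigma2 ** 0.5)                       # 3.5, ~2.917, ~1.708
```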

The central moments $\sigma_i$ can be expressed in terms of the ordinary moments $m_i$. For example, $\sigma_1 = 0$, and

$$\sigma_2 = \sum_x (x - m_1)^2 f(x) = \sum_x x^2 f(x) - 2 m_1 \sum_x x f(x) + m_1^2 \sum_x f(x) = m_2 - m_1^2,$$

which may be written as
$$\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2. \tag{4-1}$$

Example. [Binomial variables] Let $X$ be a random variable with the binomial distribution. The p.m.f. is $f(k) = \binom{n}{k} p^k q^{n-k}$, $k = 0, \dots, n$, where $q = 1 - p$. The expectation of $X$ is $E(X) = \sum_{k=0}^n k f(k) = \sum_{k=0}^n k \binom{n}{k} p^k q^{n-k}$. We use the following algebraic identity to compute $E(X)$:
$$\sum_{k=0}^n \binom{n}{k} x^k = (1 + x)^n. \tag{4-2}$$
Differentiating it and multiplying by $x$, we obtain
$$\sum_{k=0}^n k \binom{n}{k} x^k = n x (1 + x)^{n-1}. \tag{4-3}$$
Substituting $x = p/q$ gives $E(X) = q^n \sum_{k=0}^n k \binom{n}{k} (p/q)^k = q^n \cdot n \frac{p}{q} \Big(\frac{p+q}{q}\Big)^{n-1} = np$. A similar argument shows that the variance of $X$ is given by $\mathrm{Var}(X) = npq$.
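A quick numerical check of these formulas (the values of $n$ and $p$ are arbitrary choices for illustration):

```python
from math import comb

# Check E(X) = n*p and Var(X) = n*p*q for the binomial p.m.f. by direct
# summation; n and p are arbitrary illustrative values.
n, p = 10, 0.3
q = 1 - p
f = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

mean = sum(k * f[k] for k in range(n + 1))
var = sum((k - mean)**2 * f[k] for k in range(n + 1))

print(mean, n * p)     # both 3.0 (up to floating-point error)
print(var, n * p * q)  # both 2.1 (up to floating-point error)
```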

We can think of expectation as a linear operator acting on the space of random variables.

Theorem 2. The expectation operator E has the following properties:

  1. if $X \ge 0$, then $E(X) \ge 0$,
  2. if $a, b \in \mathbb{R}$, then $E(aX + bY) = aE(X) + bE(Y)$,
  3. the random variable $1$, taking the value $1$ always, has expectation $E(1) = 1$.

Proof.

We only prove the second property, which is also called linearity. We must use the joint p.m.f. of $X$ and $Y$ to compute the expectation:
$$E(aX + bY) = \sum_{i,j} (a x_i + b y_j) f(x_i, y_j) = a \sum_{i,j} x_i f(x_i, y_j) + b \sum_{i,j} y_j f(x_i, y_j) = a \sum_i x_i f_X(x_i) + b \sum_j y_j f_Y(y_j) = a E(X) + b E(Y),$$
where $f_X(x)$ and $f_Y(y)$ are the marginal p.m.f.s of $X$ and $Y$ respectively. ◼
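A small numerical sanity check of linearity, using an invented joint p.m.f. (the table and coefficients below are our choices for illustration):

```python
# A small invented joint p.m.f. f(x, y); X takes values 0, 1 and Y takes 0, 1, 2.
f = {(0, 0): 0.1, (0, 1): 0.2, (0, 2): 0.1,
     (1, 0): 0.2, (1, 1): 0.1, (1, 2): 0.3}

a, b = 2.0, -3.0
lhs = sum((a * x + b * y) * p for (x, y), p in f.items())  # E(aX + bY)
EX = sum(x * p for (x, y), p in f.items())                 # E(X)
EY = sum(y * p for (x, y), p in f.items())                 # E(Y)

print(lhs, a * EX + b * EY)  # the two numbers agree
```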

Remark. It is NOT in general true that E(XY) is the same as E(X)E(Y).

Lemma. If X and Y are independent, then E(XY)=E(X)E(Y).

Proof.

If $X$ and $Y$ are independent, then $f(x, y) = f_X(x) f_Y(y)$, and so
$$E(XY) = \sum_i \sum_j x_i y_j f(x_i, y_j) = \Big( \sum_i x_i f_X(x_i) \Big) \Big( \sum_j y_j f_Y(y_j) \Big) = E(X) E(Y).$$ ◼

Definition 3. X and Y are called **uncorrelated** if E(XY)=E(X)E(Y).

Remark. Independent variables are uncorrelated. But the converse is **NOT** true.
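A standard counterexample, sketched below in a short script (the specific distribution is our choice of illustration, not from the text): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$.

```python
from fractions import Fraction

# Standard counterexample: X uniform on {-1, 0, 1} and Y = X^2 are uncorrelated
# (E(XY) = E(X)E(Y)) but clearly not independent.
pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

EX = sum(x * p for x, p in pmf_X.items())          # E(X)  = 0
EY = sum(x**2 * p for x, p in pmf_X.items())       # E(Y)  = 2/3
EXY = sum(x * x**2 * p for x, p in pmf_X.items())  # E(XY) = E(X^3) = 0

print(EXY == EX * EY)  # True: uncorrelated

# Not independent: P(X = 0, Y = 1) = 0 while P(X = 0) * P(Y = 1) = (1/3)(2/3).
print(Fraction(0) == pmf_X[0] * (pmf_X[-1] + pmf_X[1]))  # False
```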

Theorem 3. For random variables X and Y,

  1. $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$ for $a \in \mathbb{R}$,
  2. $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ if $X$ and $Y$ are uncorrelated.

Remark. The above theorem shows that the variance operator Var is **NOT** a linear operator, even when it is applied only to uncorrelated variables.

Sometimes the sum $S = \sum_x x f(x)$ does not converge absolutely, in which case the mean of the distribution does not exist. Here is an example.

Example. [A distribution without a mean] Let $X$ have mass function $f(k) = A k^{-2}$, $k = \pm 1, \pm 2, \dots$, where $A$ is chosen so that $\sum_k f(k) = 1$. The sum $\sum_k k f(k) = A \sum_{k \neq 0} k^{-1}$ does not converge absolutely, because both the positive part and the negative part diverge.
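A short sketch of why the mean fails to exist here (the normalizing constant $A = 3/\pi^2$ follows from $\sum_{k \ge 1} k^{-2} = \pi^2/6$): the partial sums of $|k| f(k)$ grow without bound.

```python
from math import pi

# For f(k) = A * k**(-2), k = +-1, +-2, ..., the normalizing constant is
# A = 3 / pi**2, since the sum over k != 0 of k**(-2) equals pi**2 / 3.
# The symmetric partial sums of k * f(k) are all zero, but the partial sums of
# |k| * f(k) grow like a harmonic series, so there is no absolute convergence.
A = 3 / pi**2

for K in (10, 1_000, 100_000):
    abs_sum = sum(A / abs(k) for k in range(-K, K + 1) if k != 0)
    print(K, round(abs_sum, 3))  # keeps growing with K, roughly like 2*A*ln(K)
```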

This example is a convenient place to point out that we can base probability theory upon the expectation operator $E$ rather than upon the probability measure $P$. Roughly speaking, the way we proceed is to postulate axioms, such as (1)-(3) of Theorem 2 above, for a so-called "expectation operator" $E$ acting on a space of "random variables". The probability of an event can then be recaptured by defining $P(A) = E(I_A)$.

Recall that the indicator function of a set $A$ is defined as
$$I_A(\omega) = \begin{cases} 1 & \omega \in A, \\ 0 & \omega \notin A. \end{cases}$$
In addition, we have $E(I_A) = P(A)$.

Dependence of discrete random variables

Definition 4. The **joint distribution function** $F : \mathbb{R}^2 \to [0, 1]$ of $X$ and $Y$, where $X$ and $Y$ are discrete variables, is given by $F(x, y) = P(X \le x \text{ and } Y \le y)$. Their **joint mass function** $f : \mathbb{R}^2 \to [0, 1]$ is given by $f(x, y) = P(X = x \text{ and } Y = y)$.

We write $F_{X,Y}$ and $f_{X,Y}$ when we need to stress the role of $X$ and $Y$. We may think of the joint mass function in the following way: if $A_x = \{X = x\}$ and $B_y = \{Y = y\}$, then $f(x, y) = P(A_x \cap B_y)$.

Lemma. The discrete random variables $X$ and $Y$ are **independent** if and only if
$$f_{X,Y}(x, y) = f_X(x) f_Y(y) \quad \text{for all } x, y \in \mathbb{R}. \tag{4-4}$$
More generally, $X$ and $Y$ are independent if and only if $f_{X,Y}(x, y)$ can be **factorized as the product** $g(x) h(y)$ of a function of $x$ alone and a function of $y$ alone.

Remark. We stress that the factorization Eq.(4-4) must hold for all x and y in order that X and Y be independent.

Lemma. $E(g(X, Y)) = \sum_{x, y} g(x, y) f_{X,Y}(x, y)$.

Definition 5. The covariance of $X$ and $Y$ is $\mathrm{cov}(X, Y) = E\big((X - E(X))(Y - E(Y))\big)$. The correlation (coefficient) of $X$ and $Y$ is $\mathrm{corr}(X, Y) = \rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$ as long as the variances are non-zero.

Remark. Notice the following two equations.

  1. $\mathrm{cov}(X, X) = \mathrm{Var}(X)$,
  2. $\mathrm{cov}(X, Y) = E(XY) - E(X)E(Y)$.

Covariance itself is not a satisfactory measure of dependence because the scale of values which cov(X,Y) may take contains no points which are clearly interpretable in terms of the relationship between X and Y.
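To illustrate this point numerically, the sketch below uses an invented joint p.m.f. and shows that rescaling $X$ rescales $\mathrm{cov}(X, Y)$ while leaving $\rho(X, Y)$ unchanged.

```python
# Rescaling X rescales cov(X, Y) but leaves the correlation coefficient
# unchanged; the joint p.m.f. below is invented purely for illustration.
f = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def cov_and_corr(scale):
    pts = {(scale * x, y): p for (x, y), p in f.items()}
    EX = sum(x * p for (x, y), p in pts.items())
    EY = sum(y * p for (x, y), p in pts.items())
    EXY = sum(x * y * p for (x, y), p in pts.items())
    VX = sum((x - EX)**2 * p for (x, y), p in pts.items())
    VY = sum((y - EY)**2 * p for (x, y), p in pts.items())
    cov = EXY - EX * EY
    return cov, cov / (VX * VY) ** 0.5

print(cov_and_corr(1))    # cov = 0.1,  rho ~ 0.408
print(cov_and_corr(100))  # cov = 10.0, rho ~ 0.408 (covariance scales, rho does not)
```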

Theorem 4. [Cauchy-Schwarz inequality] For random variables $X$ and $Y$, $E(XY)^2 \le E(X^2) E(Y^2)$, with equality if and only if $P(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.

Proof.

For $a, b \in \mathbb{R}$, let $Z = aX - bY$. Then
$$0 \le E(Z^2) = a^2 E(X^2) - 2ab\,E(XY) + b^2 E(Y^2).$$
Thus the right-hand side is a quadratic in the variable $a$ with at most one real root, so its discriminant must be non-positive. That is to say, if $b \neq 0$,
$$E(XY)^2 - E(X^2) E(Y^2) \le 0.$$
The discriminant is zero if and only if the quadratic has a real root. This occurs if and only if $E((aX - bY)^2) = 0$ for some $a$ and $b$. ◼

Now define $X' = X - E(X)$ and $Y' = Y - E(Y)$. Since the Cauchy-Schwarz inequality holds for every pair of random variables, it holds in particular for $X'$ and $Y'$:
$$E(X'Y')^2 \le E(X'^2)\,E(Y'^2), \quad \text{that is,} \quad \mathrm{cov}(X, Y)^2 \le \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
Therefore $\rho(X, Y)^2 \le 1$, i.e. $\rho(X, Y) \in [-1, 1]$, which gives the following lemma.

Lemma. The correlation coefficient $\rho$ satisfies $|\rho(X, Y)| \le 1$, with equality if and only if $P(aX + bY = c) = 1$ for some $a, b, c \in \mathbb{R}$.

Expectation of continuous random variables

Idea of translating expectation from discrete to continuous

Suppose we have a continuous random variable $X$ with probability density function $f$. We split the range of $X$ into small intervals of width $\Delta x$. The probability that $X$ falls in the interval containing $x_i$ is then $p_i \approx f(x_i)\,\Delta x$; equivalently, $p_i / \Delta x$ approximates the density at $x_i$. Therefore
$$E(X) \approx \sum_i x_i p_i = \sum_i x_i f(x_i)\,\Delta x,$$
which is a Riemann sum. Taking the limit, we get $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$.
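As a rough numerical illustration of this limiting process (the exponential density and the truncation point are our choices, not from the text), a Riemann sum with small $\Delta x$ already lands close to the exact mean:

```python
from math import exp

# Riemann-sum approximation of E(X) = integral of x * f(x) dx for the
# exponential density f(x) = exp(-x), x >= 0, whose exact mean is 1.
dx = 0.001
xs = [i * dx for i in range(int(50 / dx))]  # truncate the integral at x = 50
approx = sum(x * exp(-x) * dx for x in xs)

print(approx)  # close to the exact value E(X) = 1
```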

Expectation

Definition 6. The **expectation** of a continuous random variable $X$ with density function $f$ is given by $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ whenever this integral exists.

Theorem 5. If $X$ and $g(X)$ are continuous random variables, then $E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.

Definition 7. The $k$th **moment** of a continuous variable $X$ is defined as $E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$ whenever the integral converges.

Example. [Cauchy distribution] The random variable $X$ has the Cauchy distribution if it has density function $f(x) = \dfrac{1}{\pi (1 + x^2)}$, $x \in \mathbb{R}$. This distribution is notable for having no moments.
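A short simulation sketch of this pathology (the inverse-c.d.f. sampling formula is standard; the sample sizes are arbitrary): sample means of Cauchy variables are themselves Cauchy distributed, so they do not settle down as the sample size grows.

```python
import math
import random

# Sample means of Cauchy variables do not settle down as the sample size grows:
# the average of n independent Cauchy variables is again Cauchy distributed.
# Samples are drawn via the inverse c.d.f., X = tan(pi * (U - 1/2)), U uniform.
def cauchy_sample():
    return math.tan(math.pi * (random.random() - 0.5))

for n in (10**3, 10**4, 10**5, 10**6):
    mean = sum(cauchy_sample() for _ in range(n)) / n
    print(n, round(mean, 2))  # the averages fluctuate instead of converging
```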

Dependence of continuous random variables

Definition 8. The **joint distribution function** of $X$ and $Y$ is the function $F : \mathbb{R}^2 \to [0, 1]$ given by $F(x, y) = P(X \le x, Y \le y)$.

Definition 9. The random variables $X$ and $Y$ are **(jointly) continuous** with **joint (probability) density function** $f : \mathbb{R}^2 \to [0, \infty)$ if
$$F(x, y) = \int_{v=-\infty}^{y} \int_{u=-\infty}^{x} f(u, v)\,du\,dv \quad \text{for each } x, y \in \mathbb{R}.$$
If $F$ is sufficiently differentiable at the point $(x, y)$, then we usually specify
$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y).$$

Probabilities:

$$P(a \le X \le b,\ c \le Y \le d) = F(b, d) - F(a, d) - F(b, c) + F(a, c) = \int_{y=c}^{d} \int_{x=a}^{b} f(x, y)\,dx\,dy.$$

If $B$ is a sufficiently nice subset of $\mathbb{R}^2$, then $P((X, Y) \in B) = \iint_B f(x, y)\,dx\,dy$.

Marginal distributions: The marginal distribution functions of X and Y are

$$F_X(x) = P(X \le x) = F(x, \infty), \qquad F_Y(y) = P(Y \le y) = F(\infty, y).$$
In terms of the joint density, $F_X(x) = \int_{-\infty}^{x} \left( \int_{-\infty}^{\infty} f(u, y)\,dy \right) du$.

Marginal density functions of $X$ and $Y$: $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$, $\quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$.
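A minimal numerical sketch of recovering a marginal density (the joint density $f(x, y) = x + y$ on the unit square is an invented example):

```python
# Recover the marginal density f_X(x) = integral of f(x, y) dy numerically for
# the invented joint density f(x, y) = x + y on the unit square; the exact
# marginal is f_X(x) = x + 1/2.
def f(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def marginal_X(x, dy=1e-4):
    # midpoint rule over y on [0, 1]
    return sum(f(x, (j + 0.5) * dy) * dy for j in range(int(1 / dy)))

for x in (0.2, 0.5, 0.8):
    print(x, round(marginal_X(x), 4), x + 0.5)  # numerical vs exact marginal
```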

Expectation: If $g : \mathbb{R}^2 \to \mathbb{R}$ is a sufficiently nice function, then

$$E(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy;$$

in particular, setting $g(x, y) = ax + by$,

$$E(aX + bY) = aE(X) + bE(Y).$$

Independence: The random variables X and Y are independent if and only if

$$F(x, y) = F_X(x) F_Y(y) \quad \text{for all } x, y \in \mathbb{R},$$

which, for continuous random variables, is equivalent to requiring that

$$f(x, y) = f_X(x) f_Y(y).$$

Theorem 6. [Cauchy-Schwarz inequality] For any pair $X, Y$ of jointly continuous variables, we have that $E(XY)^2 \le E(X^2) E(Y^2)$, with equality if and only if $P(aX = bY) = 1$ for some real $a$ and $b$, at least one of which is non-zero.