Law of large numbers

Note that in this section we are dealing with random variables that are independent and identically distributed, abbreviated i.i.d. The law of large numbers studies the convergence of the average of a large number of i.i.d. random variables.

We first prove the following important lemma.

Lemma. [Chebyshev's Inequality] Let $X$ be a random variable with $E(X) < \infty$ and $\operatorname{Var}(X) < \infty$. Then for any $\epsilon > 0$, we have
$$P\big(|X - E(X)| \ge \epsilon\big) \le \frac{\operatorname{Var}(X)}{\epsilon^2}. \tag{5-1}$$
In other words, we have
$$P\big(|X - E(X)| < \epsilon\big) \ge 1 - \frac{\operatorname{Var}(X)}{\epsilon^2}. \tag{5-2}$$

Proof ▸

We assume $X$ is a discrete random variable; the argument extends easily to the case where $X$ is continuous. We denote $E(X) = \mu$ and write $f(x)$ for the p.m.f. of $X$. We first expand the LHS of (5-1) and obtain
$$P(|X - \mu| \ge \epsilon) = \sum_{|x - \mu| \ge \epsilon} f(x).$$
On the other hand, we have
$$\operatorname{Var}(X) = \sum_x (x - \mu)^2 f(x) \ge \sum_{|x - \mu| \ge \epsilon} (x - \mu)^2 f(x) \ge \sum_{|x - \mu| \ge \epsilon} \epsilon^2 f(x) = \epsilon^2 \sum_{|x - \mu| \ge \epsilon} f(x) = \epsilon^2\, P(|X - \mu| \ge \epsilon).$$
Therefore, we have $P(|X - \mu| \ge \epsilon) \le \dfrac{\operatorname{Var}(X)}{\epsilon^2}$. ◼

Chebyshev’s Inequality is the best possible inequality in the sense that, for any ϵ>0, it is possible to give an example of a random variable for which Chebyshev’s Inequality is in fact an equality.

Example. Fix $\epsilon > 0$ and suppose $X$ is a random variable with $f(\epsilon) = f(-\epsilon) = \frac{1}{2}$, i.e. $X$ takes the values $\pm\epsilon$ with equal probability. Clearly, $E(X) = 0 < \infty$ and $\operatorname{Var}(X) = E(X^2) = \epsilon^2 < \infty$. Therefore, $\dfrac{\operatorname{Var}(X)}{\epsilon^2} = 1$. Also note that $P(|X - \mu| \ge \epsilon) = 1$. The equality sign in Chebyshev's inequality holds, so no better general bound is possible.
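
The tightness of the bound can also be checked numerically. Below is a minimal sketch (assuming NumPy is available; the two distributions and the value of $\epsilon$ are illustrative choices, not part of the notes) comparing the empirical tail probability with the Chebyshev bound for the two-point variable above and for a uniform variable, where the bound is loose.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 2.0

# Two-point distribution X = ±eps with probability 1/2 each: Chebyshev is tight.
x_two_point = rng.choice([-eps, eps], size=100_000)

# A Uniform(-3, 3) variable for comparison: Chebyshev is loose here.
x_uniform = rng.uniform(-3, 3, size=100_000)

for name, x in [("two-point", x_two_point), ("uniform(-3,3)", x_uniform)]:
    mu, var = x.mean(), x.var()
    empirical = np.mean(np.abs(x - mu) >= eps)   # estimate of P(|X - mu| >= eps)
    bound = var / eps**2                          # Chebyshev upper bound
    print(f"{name:14s}  P(|X-mu|>=eps) ~ {empirical:.3f}   Var/eps^2 = {bound:.3f}")
```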

Theorem 1. [Law of large numbers] Consider a sequence of i.i.d. random variables $X_i$ with finite mean and variance. Denote $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$. Define $Q_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$. Then for any $\epsilon > 0$,
$$\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) = 0, \quad\text{or equivalently}\quad \lim_{n\to\infty} P(|Q_n - \mu| < \epsilon) = 1.$$
This means $Q_n$ converges to $\mu$ in probability.

Proof ▸

We notice that
$$E(Q_n) = \sum_{i=1}^{n} E\left(\frac{1}{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\, n\mu = \mu,$$
which shows that the expectation of $Q_n$ is the same as the expectation of the $X_i$. We also have
$$\operatorname{Var}(Q_n) = \operatorname{Var}\!\left(\frac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n},$$
where the second equality uses the independence of the $X_i$. Using Chebyshev's inequality, for any $\epsilon > 0$ we have
$$P(|Q_n - \mu| \ge \epsilon) \le \frac{\operatorname{Var}(Q_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}.$$
Therefore,
$$\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) \le \lim_{n\to\infty} \frac{\sigma^2}{n\epsilon^2} = 0.$$
Since the probability is nonnegative, we must have $\lim_{n\to\infty} P(|Q_n - \mu| \ge \epsilon) = 0$. This finishes the proof. ◼

This result is significant from the viewpoint of frequentist statistics. Recall that the probability of an event $A$ is motivated by $P(A) \approx N(A)/N$, where $N(A)$ and $N$ are the number of occurrences of $A$ and the total number of experiments, respectively. Now let $X_i = \mathbf{1}_A$, the indicator of the event $A$ in the $i$-th experiment. Since the experiments are independent, we are actually performing a series of Bernoulli trials, and each $X_i$ is a Bernoulli variable. Then we can write $N(A) = X_1 + \cdots + X_n$ with $N = n$, so
$$\frac{N(A)}{N} = \frac{1}{n}(X_1 + \cdots + X_n) = Q_n.$$
Note that $E(X_i) = P(A)$ and $\operatorname{Var}(X_i) = P(A) - P(A)^2$, both finite. Therefore, $N(A)/N \to P(A)$ in probability as $n \to \infty$.
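
This frequency interpretation is easy to simulate. Here is a minimal sketch, assuming NumPy; the event probability $p$ and the sample sizes are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3  # P(A), the probability we are estimating by relative frequency

# X_i = 1_A for independent repetitions of the experiment (Bernoulli trials).
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.random(n) < p          # indicator of A in each of n experiments
    q_n = x.mean()                 # Q_n = N(A)/N, the relative frequency
    print(f"n = {n:>6d}   N(A)/N = {q_n:.4f}   |Q_n - p| = {abs(q_n - p):.4f}")
```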

Remark. For the proof above to work, $\operatorname{Var}(X)$ must be finite. When the tails of the distribution are too heavy, the law itself may fail, as the following example shows.

Example. [Cauchy distribution] The Cauchy distribution is given by $f(x) = \dfrac{1}{\pi(1 + x^2)}$, where $1/\pi$ is the normalizing constant. Let $X$ be a random variable with the Cauchy distribution. Although the Cauchy density looks similar to the normal density, $X$ has no finite variance: the tails decay so slowly as $|x| \to \infty$ that $\int x^2 f(x)\,dx$ diverges. In fact even $E(X)$ is not defined, since $\int |x| f(x)\,dx$ also diverges; by symmetry the natural center of the distribution is $\mu = 0$. So the question is: does $Q_n$ converge to $\mu$? The answer is negative; indeed $Q_n$ has exactly the same Cauchy distribution as a single $X_i$ for every $n$, so it never concentrates. This example shows that when the variance is not finite, the law of large numbers may fail.
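
The failure can be observed numerically: the average of Cauchy samples keeps fluctuating no matter how large $n$ gets, while the average of normal samples settles down. A minimal sketch, assuming NumPy (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

for n in [100, 10_000, 1_000_000]:
    normal_avg = rng.normal(0, 1, n).mean()        # finite variance: concentrates near 0
    cauchy_avg = rng.standard_cauchy(n).mean()     # infinite variance (no mean): does not
    print(f"n = {n:>8d}   normal Q_n = {normal_avg:+.4f}   Cauchy Q_n = {cauchy_avg:+.4f}")
```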


Conditional distributions and conditional expectation

(This section is a supplement to the lecture.)

Definition 1. The conditional distribution function of $Y$ given $X = x$ is the function $F_{Y|X}(\cdot \mid x)$ given by
$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} \frac{f(x, v)}{f_X(x)}\, dv$$
for any $x$ such that $f_X(x) > 0$. It is sometimes denoted $P(Y \le y \mid X = x)$.

Remembering that distribution functions are integrals of density functions, we are led to the following definition.

Definition 2. The conditional density function of $F_{Y|X}$, written $f_{Y|X}$, is given by
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\, dy}$$
for any $x$ such that $f_X(x) > 0$.

Theorem 2. The conditional expectation $\psi(X) = E(Y \mid X)$ satisfies $E(\psi(X)) = E(Y)$.

Theorem 3. The conditional expectation $\psi(X) = E(Y \mid X)$ satisfies $E(\psi(X)\, g(X)) = E(Y g(X))$ for any function $g$ for which both expectations exist.
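
Theorems 2 and 3 can be verified by simulation for a joint distribution where $E(Y \mid X)$ is known in closed form. A minimal sketch, assuming NumPy; the particular choice $Y = X^2 + \text{noise}$ (so that $\psi(X) = X^2$) and $g = \sin$ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.normal(0, 1, n)
y = x**2 + rng.normal(0, 1, n)     # here psi(X) = E(Y|X) = X^2 exactly

psi = x**2
g = np.sin(x)                      # any function g with finite expectations works

print("E(psi(X))    =", psi.mean(),       "   E(Y)      =", y.mean())        # Theorem 2
print("E(psi(X)g(X))=", (psi * g).mean(), "   E(Y g(X)) =", (y * g).mean())  # Theorem 3
```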

Functions of continuous random variables

Let $X$ be a random variable with density function $f$, and let $g : \mathbb{R} \to \mathbb{R}$ be a sufficiently nice function. Then $Y = g(X)$ is a random variable also. In order to calculate the distribution of $Y$, we proceed thus:
$$P(Y \le y) = P(g(X) \le y) = P\big(g(X) \in (-\infty, y]\big) = P\big(X \in g^{-1}(-\infty, y]\big) = \int_{g^{-1}(-\infty, y]} f(x)\, dx.$$
Here $g^{-1}$ is defined as follows: if $A \subseteq \mathbb{R}$, then $g^{-1}A = \{x \in \mathbb{R} : g(x) \in A\}$.

Example. Let $g(x) = ax + b$ for fixed $a, b \in \mathbb{R}$ with $a \ne 0$. Then $Y = g(X) = aX + b$ has distribution function
$$P(Y \le y) = P(aX + b \le y) = \begin{cases} P\big(X \le (y-b)/a\big) & \text{if } a > 0, \\ P\big(X \ge (y-b)/a\big) & \text{if } a < 0. \end{cases}$$
Differentiate to obtain $f_Y(y) = |a|^{-1} f_X\big((y-b)/a\big)$.
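
As a sanity check on the formula $f_Y(y) = |a|^{-1} f_X((y-b)/a)$, the following sketch (assuming NumPy and SciPy; the choices $X \sim N(0,1)$, $a = -2$, $b = 1$ are arbitrary) compares an empirical histogram of $Y = aX + b$ with the transformed density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
a, b = -2.0, 1.0
x = rng.normal(0, 1, 1_000_000)
y = a * x + b

# Empirical density of Y on a grid vs the predicted |a|^{-1} f_X((y-b)/a).
edges = np.linspace(-6, 8, 29)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = norm.pdf((centers - b) / a) / abs(a)

print("max |empirical - predicted| =", np.max(np.abs(hist - predicted)))
```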

More generally, if $X_1$ and $X_2$ have joint density function $f$, and $g, h$ are two functions mapping $\mathbb{R}^2 \to \mathbb{R}$, then we can use the Jacobian to find the joint density function of the pair $Y_1 = g(X_1, X_2)$, $Y_2 = h(X_1, X_2)$.

Let $y_1 = y_1(x_1, x_2)$, $y_2 = y_2(x_1, x_2)$ be a one-one mapping $T : (x_1, x_2) \mapsto (y_1, y_2)$ taking some domain $D \subseteq \mathbb{R}^2$ onto some range $R \subseteq \mathbb{R}^2$. The transformation can be inverted as $x_1 = x_1(y_1, y_2)$, $x_2 = x_2(y_1, y_2)$; the Jacobian of this inverse is defined to be the determinant
$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} \\[6pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = \frac{\partial x_1}{\partial y_1}\frac{\partial x_2}{\partial y_2} - \frac{\partial x_1}{\partial y_2}\frac{\partial x_2}{\partial y_1},$$
which we express as a function $J = J(y_1, y_2)$. We assume the partial derivatives are continuous.

Theorem 4. If $g : \mathbb{R}^2 \to \mathbb{R}$, and $T$ maps the set $A \subseteq D$ onto the set $B \subseteq R$, then
$$\iint_A g(x_1, x_2)\, dx_1\, dx_2 = \iint_B g\big(x_1(y_1, y_2),\, x_2(y_1, y_2)\big)\, |J(y_1, y_2)|\, dy_1\, dy_2.$$

Corollary 4.1. If $X_1$, $X_2$ have joint density function $f$, then the pair $Y_1, Y_2$ given by $(Y_1, Y_2) = T(X_1, X_2)$ has joint density function
$$f_{Y_1,Y_2}(y_1, y_2) = \begin{cases} f\big(x_1(y_1, y_2),\, x_2(y_1, y_2)\big)\, |J(y_1, y_2)| & \text{if } (y_1, y_2) \text{ is in the range of } T, \\ 0 & \text{otherwise.} \end{cases}$$

A similar result holds for mappings of $\mathbb{R}^n$ into $\mathbb{R}^n$. This technique is sometimes referred to as the method of change of variables.

Example. Suppose that
$$X_1 = aY_1 + bY_2, \qquad X_2 = cY_1 + dY_2,$$
where $ad - bc \ne 0$. Check that
$$f_{Y_1,Y_2}(y_1, y_2) = |ad - bc|\, f_{X_1,X_2}(ay_1 + by_2,\, cy_1 + dy_2).$$
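
Here is a numerical check of Corollary 4.1 for this linear example. It is a sketch assuming NumPy and SciPy; the matrix entries and the choice of independent standard normal $X_1, X_2$ are illustrative, chosen so that the density of $(Y_1, Y_2)$ is also available directly for comparison:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

a, b, c, d = 1.0, 2.0, -1.0, 3.0              # ad - bc = 5 (nonzero)
A = np.array([[a, b], [c, d]])                # X1 = a Y1 + b Y2, X2 = c Y1 + d Y2

# Take X1, X2 independent N(0,1); then Y = A^{-1} X is N(0, A^{-1} A^{-T}).
direct = multivariate_normal(mean=[0.0, 0.0], cov=np.linalg.inv(A) @ np.linalg.inv(A).T)

def f_y_via_corollary(y1, y2):
    """f_{Y1,Y2}(y1, y2) = |ad - bc| f_{X1,X2}(a y1 + b y2, c y1 + d y2)."""
    x1, x2 = a * y1 + b * y2, c * y1 + d * y2
    return abs(a * d - b * c) * norm.pdf(x1) * norm.pdf(x2)

for y1, y2 in [(0.0, 0.0), (0.3, -0.2), (1.0, 0.5)]:
    print(f_y_via_corollary(y1, y2), direct.pdf([y1, y2]))   # the two values agree
```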

Multivariate normal distribution

Definition and properties

Definition 3. The vector $X = (X_1, X_2, \ldots, X_n)$ has the **multivariate normal distribution** (or **multinormal distribution**), written $N(\mu, V)$, if its joint density function is
$$f(x) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[-\tfrac{1}{2}(x - \mu)^T V^{-1} (x - \mu)\right], \qquad x \in \mathbb{R}^n,$$
where $V$ is a positive definite symmetric matrix.
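
The normalization in this density can be checked against a library implementation. A minimal sketch assuming NumPy and SciPy; the particular $\mu$, $V$, and evaluation point are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.6],
              [0.6, 1.0]])          # positive definite symmetric

def mvn_pdf(x, mu, V):
    """Evaluate f(x) = exp(-(x-mu)^T V^{-1} (x-mu) / 2) / sqrt((2*pi)^n |V|)."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(V) @ diff
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(V))
    return np.exp(-0.5 * quad) / norm_const

x = np.array([0.5, -1.0])
print(mvn_pdf(x, mu, V), multivariate_normal(mean=mu, cov=V).pdf(x))  # should agree
```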

Theorem 5. If X is N(μ,V), then

  1. $E(X) = \mu$, which is to say that $E(X_i) = \mu_i$ for all $i$,
  2. $V = (v_{ij})$ is called the covariance matrix, because $v_{ij} = \operatorname{cov}(X_i, X_j)$.

Theorem 6. If $X = (X_1, X_2, \ldots, X_n)$ is $N(\mu, V)$ and $Y = (Y_1, Y_2, \ldots, Y_m)$ is given by $Y = XD$ for some $n \times m$ matrix $D$ of rank $m \le n$, then $Y$ is $N(\mu D, D^T V D)$.

Definition 4. The vector $X = (X_1, X_2, \ldots, X_n)$ of random variables is said to have the **multivariate normal distribution** whenever, for all $a \in \mathbb{R}^n$, the linear combination $Xa^T = a_1 X_1 + \cdots + a_n X_n$ has a normal distribution.
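
Definition 4 can be illustrated by sampling: every linear combination $a_1 X_1 + \cdots + a_n X_n$ of a multinormal vector should itself look normal, with mean $\sum_i a_i \mu_i$ and variance $a^T V a$. A minimal sketch assuming NumPy and SciPy; $\mu$, $V$, and $a$ are arbitrary choices:

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 0.5]])
a = np.array([0.7, -1.2, 2.0])

X = rng.multivariate_normal(mu, V, size=200_000)   # rows are samples of (X1, X2, X3)
lin = X @ a                                        # a1 X1 + a2 X2 + a3 X3

m, s2 = a @ mu, a @ V @ a                          # predicted mean and variance
print("sample mean/var:", lin.mean(), lin.var(), "   predicted:", m, s2)
# Kolmogorov-Smirnov test of the standardized combination against the N(0,1) c.d.f.
print(kstest((lin - m) / np.sqrt(s2), norm.cdf))
```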

Distributions arising from the normal distribution

Suppose that $X_1, X_2, \ldots, X_n$ is a collection of independent $N(\mu, \sigma^2)$ variables for some fixed but unknown values of $\mu$ and $\sigma^2$. We can use them to estimate $\mu$ and $\sigma^2$.

Definition 5. The **sample mean** of a sequence of random variables $X_1, X_2, \ldots, X_n$ is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. It is usually used as a guess at the value of $\mu$.

Definition 6. The **sample variance** of a sequence of random variables $X_1, X_2, \ldots, X_n$ is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. It is usually used as a guess at the value of $\sigma^2$.

Remark. The sample mean and the sample variance have the property of being 'unbiased' in that $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$. Note that in some texts the sample variance is defined with $n$ in place of $(n-1)$.

Theorem 7. If $X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ variables, then $\bar{X}$ and $S^2$ are independent. We have that $\bar{X}$ is $N(\mu, \sigma^2/n)$ and $(n-1)S^2/\sigma^2$ is $\chi^2(n-1)$.

Definition 7. If $X_1, X_2, \ldots, X_n$ are independent standard normal random variables, then the sum of their squares, $Q = \sum_{i=1}^{n} X_i^2$, is distributed according to the $\chi^2$ distribution with $n$ **degrees of freedom**. This is usually denoted as $Q \sim \chi^2(n)$ or $Q \sim \chi_n^2$. The probability density function (p.d.f.) of the $\chi^2$ distribution is
$$f(x; n) = \begin{cases} \dfrac{x^{n/2 - 1}\, e^{-x/2}}{2^{n/2}\, \Gamma(n/2)} & x > 0, \\ 0 & \text{otherwise.} \end{cases}$$
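
Both Definition 7 and the claim of Theorem 7 about $(n-1)S^2/\sigma^2$ can be checked by simulation. A minimal sketch assuming NumPy; $n$, $\mu$, $\sigma$, and the number of replications are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 200_000

# Definition 7: Q, a sum of n squared standard normals, is chi^2(n),
# so E(Q) = n and Var(Q) = 2n.
q = (rng.normal(0, 1, size=(reps, n)) ** 2).sum(axis=1)
print("E(Q)   =", q.mean(), "  (chi^2(n) mean     =", n, ")")
print("Var(Q) =", q.var(),  "  (chi^2(n) variance =", 2 * n, ")")

# Theorem 7: for N(mu, sigma^2) samples, (n-1) S^2 / sigma^2 is chi^2(n-1).
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)                  # sample variance S^2 with divisor n-1
stat = (n - 1) * s2 / sigma**2
print("E((n-1)S^2/sigma^2) =", stat.mean(), "  (chi^2(n-1) mean =", n - 1, ")")
```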

Sampling from a distribution

A basic way of generating a random variable with given distribution function is to use the following theorem.

Theorem 8. [Inverse transform technique] Let F be a distribution function, and let U be uniformly distributed on the interval [0,1].

  1. If $F$ is a continuous function, the random variable $X = F^{-1}(U)$ has distribution function $F$.
  2. Let $F$ be the distribution function of a random variable taking non-negative integer values. The random variable $X$ given by $X = k$ if and only if $F(k-1) < U \le F(k)$ has distribution function $F$.
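
A minimal sketch of both parts of the theorem, assuming NumPy and SciPy; the exponential and Poisson target distributions are illustrative choices:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
u = rng.random(100_000)                      # U uniform on [0, 1]

# Part 1: continuous F. For Exponential(lam), F(x) = 1 - exp(-lam x),
# so F^{-1}(u) = -log(1 - u) / lam.
lam = 2.0
x_exp = -np.log(1.0 - u) / lam
print("exponential: sample mean =", x_exp.mean(), "  target 1/lam =", 1 / lam)

# Part 2: integer-valued F. Set X = k iff F(k-1) < U <= F(k); here F is the
# Poisson(3) c.d.f. tabulated up to a generous cutoff.
ks = np.arange(0, 50)
F = poisson.cdf(ks, mu=3.0)
x_pois = np.searchsorted(F, u)               # smallest k with U <= F(k)
print("poisson:     sample mean =", x_pois.mean(), "  target =", 3.0)
```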