Recursive Least Squares
Recursive least squares (RLS) is an adaptive filtering algorithm that recursively finds the coefficients to minimize a weighted linear least squares cost function relating to the input signals. Let $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$ be the $i$-th input and output signals which satisfy the following linear relationship:
\[y_i = x_i^T w + v_i,\]where $v_i$ is random noise and $w\in \mathbb{R}^n$ is the parameter that we want to find. Assume we have $m$ signal pairs. We denote $Y_m = [y_1,\dots,y_m] \in \mathbb{R}^m$, $X_m = [x_1,\dots, x_m] \in \mathbb{R}^{n\times m}$, and $V_m = [v_1, \dots, v_m] \in \mathbb{R}^m$. Then, we can write the compact form
\[Y_m = X_m^T w + V_m.\]Note: some references define $X_m \in \mathbb{R}^{m\times n}$ so that $Y_m = X_m w + V_m$. We stick to our notation here.
Our objective is to find an estimate of the parameter $w$ that minimizes the least-squares error over the $m$ signal pairs. Define the error $e_i = y_i - x_i^T w$. Our objective is
\[\min_w \quad \sum_{i=1}^m \| e_i \|^2_2 = \| Y_m - X_m^T w \|_2^2.\]The solution to the above optimization problem is $w_m^\ast = (X_m X_m^T)^{-1} X_m Y_m$, assuming $X_m X_m^T$ is invertible.
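As a quick sanity check, the batch solution can be computed directly. The sketch below uses made-up data and illustrative variable names (not from the text above), and calls NumPy's least-squares solver rather than forming an explicit inverse:

```python
import numpy as np

# Hypothetical example data: n = 3 parameters, m = 100 signal pairs.
rng = np.random.default_rng(0)
n, m = 3, 100
w_true = rng.standard_normal(n)                   # unknown parameter w
X = rng.standard_normal((n, m))                   # X_m, columns are x_i
Y = X.T @ w_true + 0.1 * rng.standard_normal(m)   # y_i = x_i^T w + v_i

# Batch least-squares solution w_m^* = (X X^T)^{-1} X Y.
# np.linalg.lstsq solves the equivalent problem min_w ||Y - X^T w||_2^2.
w_batch, *_ = np.linalg.lstsq(X.T, Y, rcond=None)
print(w_batch)
```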
Now assume a new signal pair $(x_{m+1}, y_{m+1})$ arrives. It augments the existing data set to $X_{m+1}$ and $Y_{m+1}$. We want to update $w^\ast_{m}$ to minimize the least-squares loss on $X_{m+1}$ and $Y_{m+1}$. We could completely discard $w^\ast_m$ and solve a new problem for $w^\ast_{m+1}$, but that is computationally expensive. We want to do it quickly and reduce the computational burden. Can we leverage $w^\ast_m$ to update the solution? The answer is yes. We know that $w^\ast_{m+1} = (X_{m+1} X_{m+1}^T)^{-1} X_{m+1} Y_{m+1}$ with $X_{m+1} = [X_m, x_{m+1}] \in \mathbb{R}^{n\times(m+1)}$ and $Y_{m+1} = [Y_m, y_{m+1}] \in \mathbb{R}^{m+1}$. Therefore, we have
\[w^\ast_{m+1} = (X_m X_m^T + x_{m+1} x_{m+1}^T)^{-1} (X_m Y_m + x_{m+1} y_{m+1}).\]Using the Matrix Inversion Lemma (the Sherman-Morrison-Woodbury formula)
\[\left(A+UCV\right)^{-1}=A^{-1}-A^{-1}U\left(C^{-1}+VA^{-1}U\right)^{-1}VA^{-1},\]we let $A = X_m X_m^T$, $U=x_{m+1}$, $V = x_{m+1}^T$, and $C=I_1$. We have
\[\begin{align*} &(X_m X_m^T + x_{m+1} x_{m+1}^T)^{-1} \\= &(X_m X_m^T)^{-1} - (X_m X_m^T)^{-1} x_{m+1} \left(I_1 + x_{m+1}^T (X_m X_m^T)^{-1} x_{m+1}\right)^{-1} x_{m+1}^T (X_m X_m^T)^{-1}. \end{align*}\]Let $P_m = (X_m X_m^T)^{-1}$, then we have the update
\[P_{m+1} = P_m - P_m x_{m+1} (I_1 + x_{m+1}^T P_m x_{m+1})^{-1} x_{m+1}^T P_m.\]Then, we can write
\[w^\ast_{m+1} = P_{m+1}(P_m^{-1} w^\ast_m + x_{m+1} y_{m+1}).\]Note from the definition that $P_m^{-1} = X_m X_m^T$ and $P_{m+1}^{-1} = P_m^{-1} + x_{m+1} x_{m+1}^T$, we can further simplify $w^\ast_{m+1}$ as
\[w^\ast_{m+1} = w^\ast_m + P_{m+1} x_{m+1} (y_{m+1} - x_{m+1}^T w^\ast_m).\]We can initialize $P_0 = I$ and $w^\ast_0 = 0$. For $m=0,1,2,\dots$, we have
- Observe $(x_{m+1}, y_{m+1})$.
- Update $P_{m+1}$.
- Update $w^\ast_{m+1}$.
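A minimal, self-contained sketch of this recursion in NumPy (the helper name rls_update and the toy data are illustrative, not from any particular library):

```python
import numpy as np

def rls_update(P, w, x, y):
    """One RLS step: observe (x, y), update P and w per the formulas above."""
    Px = P @ x
    denom = 1.0 + x @ Px                     # I_1 + x^T P x (a scalar here)
    P_next = P - np.outer(Px, Px) / denom    # P_{m+1}
    w_next = w + P_next @ x * (y - x @ w)    # w_{m+1}
    return P_next, w_next

# Hypothetical streaming example.
rng = np.random.default_rng(0)
n = 3
w_true = rng.standard_normal(n)
P, w = np.eye(n), np.zeros(n)                # P_0 = I, w_0 = 0
for _ in range(500):
    x = rng.standard_normal(n)
    y = x @ w_true + 0.1 * rng.standard_normal()
    P, w = rls_update(P, w, x, y)
print(w, w_true)                             # w should approach w_true
```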
Initialization
We should note that RLS can start from an empty data set, in which case we need to set $P_0$ and $w_0$. It is easy to get confused here, so recall some facts about linear systems. Suppose we want to solve $b = Ax$, where $A \in \mathbb{R}^{m\times n}$. When $m < n$, the system is under-determined, i.e., there are more variables than equations. Therefore, infinitely many solutions exist if $b \in \mathrm{col}(A)$.
In RLS, when there is only one data point, the system is under-determined. So can we solve for $w^\ast$ from the optimization with a single data point? The answer is no, because $(x x^T)^{-1}$ does not exist for a single point. More generally, when there are fewer data points than decision variables ($m < n$), we cannot use the formula $(X_m X_m^T)^{-1} X_m Y_m$ to find the optimal $w^\ast$; it only holds when $m \geq n$ (and $X_m$ has full row rank). Therefore, we need to assume an invertible $P_0$ and follow the update rules above.
In fact, when $m < n$, $Ax = b$ has only two possibilities: if $b \in \mathrm{col}(A)$, there are infinitely many solutions; if $b \not\in \mathrm{col}(A)$, there is no solution.
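One way to interpret the choice of $P_0$ (a remark added here for clarity; it follows by unrolling the recursion with $w_0 = 0$) is that the recursive estimate after $m$ steps equals the regularized least-squares solution
\[w^\ast_m = \left(P_0^{-1} + X_m X_m^T\right)^{-1} X_m Y_m,\]so choosing $P_0 = \delta I$ with a large $\delta$ corresponds to a small ridge penalty $\delta^{-1}\|w\|_2^2$, whose effect fades as more data arrive.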
Decayed Error
We can add a decay (forgetting) factor to the objective so that past data contribute less to the loss than recent data. The objective becomes \(\min_w \quad \sum_{i=1}^m \lambda^{m-i} \| e_i \|^2_2\) for $\lambda \in (0,1)$. The update rule has the same form, but $\lambda$ is involved.
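For reference, the standard exponentially weighted RLS update (stated here rather than derived above; it follows from the same Woodbury step applied to $\lambda P_m^{-1} + x_{m+1} x_{m+1}^T$) is
\[\begin{align*} P_{m+1} &= \frac{1}{\lambda}\left(P_m - \frac{P_m x_{m+1} x_{m+1}^T P_m}{\lambda + x_{m+1}^T P_m x_{m+1}}\right),\\ w^\ast_{m+1} &= w^\ast_m + P_{m+1} x_{m+1} \left(y_{m+1} - x_{m+1}^T w^\ast_m\right), \end{align*}\]which reduces to the earlier update when $\lambda = 1$.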
Multi-dimensional Signals
The idea naturally extends to $y_i \in \mathbb{R}^p$ for $p > 1$. In this case, the input signal is generally a matrix $x_i \in \mathbb{R}^{n\times p}$, and we have $y_i = x_i^T w + v_i$ with $v_i \in \mathbb{R}^p$. The formulation does not change.
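Concretely (a direct consequence of the same Woodbury step with $U = x_{m+1} \in \mathbb{R}^{n\times p}$ and $C = I_p$, stated here for completeness), the covariance update becomes
\[P_{m+1} = P_m - P_m x_{m+1} \left(I_p + x_{m+1}^T P_m x_{m+1}\right)^{-1} x_{m+1}^T P_m,\]where the inverse is now of a $p \times p$ matrix, and the parameter update keeps the form $w^\ast_{m+1} = w^\ast_m + P_{m+1} x_{m+1}(y_{m+1} - x_{m+1}^T w^\ast_m)$.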
Nonlinear RLS
Previously, we used a linear filter to achieve the least-squares estimation. The input-output signals can also follow a nonlinear rule
\[y_i = f(x_i) + v_i.\]In this case, we use a nonlinear filter $f(x, w)$ parameterized by $w \in \mathbb{R}^p$ to minimize the square loss
\[\min_w \quad \sum_{i=1}^m \lambda^{m-i} \| y_i - f(x_i, w) \|^2_2.\]However, general nonlinear functions can be hard to optimize, so in practice we first linearize $f$ at some $\hat{w}$ and then minimize the linearized loss. We want to perform this incrementally (or recursively) to reduce the computational burden. Assume we already have some $w^\ast_{m-1}$. The optimal $w^\ast_m$ on the data set $X_m$ and $Y_m$ solves the following problem:
\[w^\ast_m = \arg\min_w \sum_{i=1}^{m} \lambda^{m-i} \| y_i - f(x_i, w^\ast_{m-1}) - \nabla_w f(x_i, w^\ast_{m-1})^T (w - w^\ast_{m-1}) \|^2_2.\]Now we let $\bar{y}_i (w^\ast_m) = y_i - f(x_i, w^\ast_m) + \nabla_w f(x_i, w^\ast_m)^T w^\ast_m$ and $\bar{x}_i (w^\ast_m) = \nabla_w f(x_i,w^\ast_m)$. We can formulate new data matrices $\bar{Y}_m(w^\ast_m)$ and $\bar{X}_m (w^\ast_m)$. At step $m+1$, we receive $(x_{m+1}, y_{m+1})$ and can compute $\bar{y}_{m+1} (w^\ast_m)$ and $\bar{x}_{m+1} (w^\ast_m)$. Then, we can use the same method as in linear RLS to update $w^\ast_m$. The difference is that we need to recompute $\bar{Y}_m$ and $\bar{X}_m$ at $w^\ast_m$ at every step $m$.
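As an illustration, here is a minimal sketch of one recursive step under the common simplification of linearizing only the new data point at the current estimate (the helper names and the toy model are hypothetical; the full scheme described above would also re-linearize the past data at every step):

```python
import numpy as np

def nonlinear_rls_update(P, w, x, y, f, grad_f, lam=1.0):
    """One Gauss-Newton-style RLS step: linearize f at the current w,
    then apply the linear RLS update with forgetting factor lam."""
    g = grad_f(x, w)                              # x_bar = gradient of f w.r.t. w
    Pg = P @ g
    P_next = (P - np.outer(Pg, Pg) / (lam + g @ Pg)) / lam
    w_next = w + P_next @ g * (y - f(x, w))       # innovation y - f(x, w)
    return P_next, w_next

# Hypothetical toy model: f(x, w) = w[0] * exp(w[1] * x), scalar x and y.
f = lambda x, w: w[0] * np.exp(w[1] * x)
grad_f = lambda x, w: np.array([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])

rng = np.random.default_rng(0)
w_true = np.array([2.0, -0.5])
P, w = 100.0 * np.eye(2), np.array([1.0, 0.0])    # large P_0, rough initial guess
for _ in range(2000):
    x = rng.uniform(0.0, 2.0)
    y = f(x, w_true) + 0.01 * rng.standard_normal()
    P, w = nonlinear_rls_update(P, w, x, y, f, grad_f, lam=0.999)
print(w)                                          # should be close to w_true
```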
A discussion of the convergence of nonlinear RLS can be found in Bertsekas' book Nonlinear Programming.