Recursive Least Squares
Recursive least squares (RLS) is an adaptive filtering algorithm that recursively finds the coefficients to minimize a weighted linear least squares cost function relating to the input signals. Let $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$ be the $i$-th input and output signals which satisfy the following linear relationship:
\[y_i = x_i^T w + v_i,\]where $v_i$ is random noise and $w\in \mathbb{R}^n$ is the parameter that we want to find. Assume we have $m$ signal pairs. We denote $Y_m = [y_1,\dots,y_m] \in \mathbb{R}^m$, $X_m = [x_1,\dots, x_m] \in \mathbb{R}^{n\times m}$, and $V_m = [v_1, \dots, v_m] \in \mathbb{R}^m$. Then, we can write the compact form
\[Y_m = X_m^T w + V_m.\]Note: some references define $X_m \in \mathbb{R}^{m\times n}$ so that $Y_m = X_m w + V_m$. We stick to our notation here.
Our objective is to find an estimate of the parameter $w$ that minimizes the least-squares error over the $m$ signal pairs. Define the error $e_i = y_i - x_i^T w$. Our objective is
\[\min_w \quad \sum_{i=1}^m \| e_i \|^2_2 = \| Y_m - X_m^T w \|_2^2.\]The solution to the above optimization problem is $w_m^\ast = (X_m X_m^T)^{-1} X_m Y_m$, assuming $X_m X_m^T$ is invertible.
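As a quick sanity check, the batch solution can be computed directly. The sketch below uses made-up data and illustrative variable names (not from the text above), and calls NumPy's least-squares solver rather than forming an explicit inverse:

```python
import numpy as np

# Hypothetical example data: n = 3 parameters, m = 100 signal pairs.
rng = np.random.default_rng(0)
n, m = 3, 100
w_true = rng.standard_normal(n)                   # unknown parameter w
X = rng.standard_normal((n, m))                   # X_m, columns are x_i
Y = X.T @ w_true + 0.1 * rng.standard_normal(m)   # y_i = x_i^T w + v_i

# Batch least-squares solution w_m^* = (X X^T)^{-1} X Y.
# np.linalg.lstsq solves the equivalent problem min_w ||Y - X^T w||_2^2.
w_batch, *_ = np.linalg.lstsq(X.T, Y, rcond=None)
print(w_batch)
```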
Now assume a new signal pair $(x_{m+1}, y_{m+1})$ arrives. It augments the existing data set to $X_{m+1}$ and $Y_{m+1}$. We want to update $w^\ast_{m}$ to minimize the least-squares loss on $X_{m+1}$ and $Y_{m+1}$. We could completely discard $w^\ast_m$ and solve a new problem for $w^\ast_{m+1}$, but that is computationally expensive. We want to do it quickly and reduce the computational burden. Can we leverage $w^\ast_m$ to update the solution? The answer is yes. We know that $w^\ast_{m+1} = (X_{m+1} X_{m+1}^T)^{-1} X_{m+1} Y_{m+1}$ with $X_{m+1} = [X_m, x_{m+1}] \in \mathbb{R}^{n\times(m+1)}$ and $Y_{m+1} = [Y_m, y_{m+1}] \in \mathbb{R}^{m+1}$. Therefore, we have
\[w^\ast_{m+1} = (X_m X_m^T + x_{m+1} x_{m+1}^T)^{-1} (X_m Y_m + x_{m+1} y_{m+1}).\]Using the Matrix Inversion Lemma (the Sherman-Morrison-Woodbury formula)
\[\left(A+UCV\right)^{-1}=A^{-1}-A^{-1}U\left(C^{-1}+VA^{-1}U\right)^{-1}VA^{-1},\]we let $A = X_m X_m^T$, $U=x_{m+1}$, $V = x_{m+1}^T$, and $C=I_1$. We have
\[\begin{align*} &(X_m X_m^T + x_{m+1} x_{m+1}^T)^{-1} \\= &(X_m X_m^T)^{-1} - (X_m X_m^T)^{-1} x_{m+1} \left(I_1 + x_{m+1}^T (X_m X_m^T)^{-1} x_{m+1}\right)^{-1} x_{m+1}^T (X_m X_m^T)^{-1}. \end{align*}\]Let $P_m = (X_m X_m^T)^{-1}$, then we have the update
\[P_{m+1} = P_m - P_m x_{m+1} (I_1 + x_{m+1}^T P_m x_{m+1})^{-1} x_{m+1}^T P_m.\]Then, we can write
\[w^\ast_{m+1} = P_{m+1}(P_m^{-1} w^\ast_m + x_{m+1} y_{m+1}).\]Note from the definition that $P_m^{-1} = X_m X_m^T$ and $P_{m+1}^{-1} = P_m^{-1} + x_{m+1} x_{m+1}^T$, we can further simplify $w^\ast_{m+1}$ as
\[w^\ast_{m+1} = w^\ast_m + P_{m+1} x_{m+1} (y_{m+1} - x_{m+1}^T w^\ast_m).\]We can initialize $P_0 = I$ and $w^\ast_0 = 0$. For $m=0,1,2,\dots$, we have
- Observe $(x_{m+1}, y_{m+1})$.
- Update $P_{m+1}$.
- Update $w^\ast_{m+1}$.
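A minimal, self-contained sketch of this recursion in NumPy (the helper name rls_update and the toy data are illustrative, not from any particular library):

```python
import numpy as np

def rls_update(P, w, x, y):
    """One RLS step: observe (x, y), update P and w per the formulas above."""
    Px = P @ x
    denom = 1.0 + x @ Px                     # I_1 + x^T P x (a scalar here)
    P_next = P - np.outer(Px, Px) / denom    # P_{m+1}
    w_next = w + P_next @ x * (y - x @ w)    # w_{m+1}
    return P_next, w_next

# Hypothetical streaming example.
rng = np.random.default_rng(0)
n = 3
w_true = rng.standard_normal(n)
P, w = np.eye(n), np.zeros(n)                # P_0 = I, w_0 = 0
for _ in range(500):
    x = rng.standard_normal(n)
    y = x @ w_true + 0.1 * rng.standard_normal()
    P, w = rls_update(P, w, x, y)
print(w, w_true)                             # w should approach w_true
```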
Initialization
We should note that RLS can start from an empty data set, in which case we need to set $P_0$ and $w_0$. It is easy to get confused here, so recall some facts about linear systems. Suppose we want to solve $b = Ax$, where $A \in \mathbb{R}^{m\times n}$. When $m < n$, the system is under-determined, i.e., there are more variables than equations. Therefore, infinitely many solutions exist if $b \in \mathrm{col}(A)$.
In RLS, when there is only one data point, the system is under-determined. So can we solve for $w^\ast$ from the optimization with a single data point? The answer is no, because $(x x^T)^{-1}$ does not exist for a single point. More generally, when there are fewer data points than decision variables ($m < n$), we cannot use the formula $(X_m X_m^T)^{-1} X_m Y_m$ to find the optimal $w^\ast$; it only holds when $m \geq n$ (and $X_m$ has full row rank). Therefore, we need to assume an invertible $P_0$ and follow the update rules above.
In fact, when $m < n$, $Ax = b$ has only two possibilities: if $b \in \mathrm{col}(A)$, there are infinitely many solutions; if $b \not\in \mathrm{col}(A)$, there is no solution.
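One way to interpret the choice of $P_0$ (a remark added here for clarity; it follows by unrolling the recursion with $w_0 = 0$) is that the recursive estimate after $m$ steps equals the regularized least-squares solution
\[w^\ast_m = \left(P_0^{-1} + X_m X_m^T\right)^{-1} X_m Y_m,\]so choosing $P_0 = \delta I$ with a large $\delta$ corresponds to a small ridge penalty $\delta^{-1}\|w\|_2^2$, whose effect fades as more data arrive.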
Decayed Error
We can add a decay (forgetting) factor to the objective so that past data contribute less to the loss than recent data. The objective becomes \(\min_w \quad \sum_{i=1}^m \lambda^{m-i} \| e_i \|^2_2\) for $\lambda \in (0,1)$. The update rule has the same form, but $\lambda$ is involved.
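For reference, the standard exponentially weighted RLS update (stated here rather than derived above; it follows from the same Woodbury step applied to $\lambda P_m^{-1} + x_{m+1} x_{m+1}^T$) is
\[\begin{align*} P_{m+1} &= \frac{1}{\lambda}\left(P_m - \frac{P_m x_{m+1} x_{m+1}^T P_m}{\lambda + x_{m+1}^T P_m x_{m+1}}\right),\\ w^\ast_{m+1} &= w^\ast_m + P_{m+1} x_{m+1} \left(y_{m+1} - x_{m+1}^T w^\ast_m\right), \end{align*}\]which reduces to the earlier update when $\lambda = 1$.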
Multi-dimensional Signals
The idea naturally extends to $y_i \in \mathbb{R}^p$ for $p > 1$. In this case, the input signal is generally a matrix $x_i \in \mathbb{R}^{n\times p}$, and we have $y_i = x_i^T w + v_i$ with $v_i \in \mathbb{R}^p$. The formulation does not change.
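Concretely (a direct consequence of the same Woodbury step with $U = x_{m+1} \in \mathbb{R}^{n\times p}$ and $C = I_p$, stated here for completeness), the covariance update becomes
\[P_{m+1} = P_m - P_m x_{m+1} \left(I_p + x_{m+1}^T P_m x_{m+1}\right)^{-1} x_{m+1}^T P_m,\]where the inverse is now of a $p \times p$ matrix, and the parameter update keeps the form $w^\ast_{m+1} = w^\ast_m + P_{m+1} x_{m+1}(y_{m+1} - x_{m+1}^T w^\ast_m)$.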
Nonlinear RLS
Previously, we used a linear filter to achieve the least-squares estimation. The input-output signals can also follow a nonlinear rule
\[y_i = f(x_i) + v_i.\]In this case, we use a nonlinear filter $f(x, w)$ parameterized by $w \in \mathbb{R}^p$ to minimize the square loss
\[\min_w \quad \sum_{i=1}^m \lambda^{m-i} \| y_i - f(x_i, w) \|^2_2.\]However, general nonlinear functions can be hard to optimize, so in practice we first linearize $f$ at some $\hat{w}$ and then minimize the linearized loss. We want to perform this incrementally (or recursively) to reduce the computational burden. Assume we already have some $w^\ast_{m-1}$. The optimal $w^\ast_m$ on the data set $X_m$ and $Y_m$ solves the following problem:
\[w^\ast_m = \arg\min_w \sum_{i=1}^{m} \lambda^{m-i} \| y_i - f(x_i, w^\ast_{m-1}) - \nabla_w f(x_i, w^\ast_{m-1})^T (w - w^\ast_{m-1}) \|^2_2.\]Now we let $\bar{y}_i (w^\ast_m) = y_i - f(x_i, w^\ast_m) + \nabla_w f(x_i, w^\ast_m)^T w^\ast_m$ and $\bar{x}_i (w^\ast_m) = \nabla_w f(x_i,w^\ast_m)$. We can formulate new data matrices $\bar{Y}_m(w^\ast_m)$ and $\bar{X}_m (w^\ast_m)$. At step $m+1$, we receive $(x_{m+1}, y_{m+1})$ and can compute $\bar{y}_{m+1} (w^\ast_m)$ and $\bar{x}_{m+1} (w^\ast_m)$. Then, we can use the same method as in linear RLS to update $w^\ast_m$. The difference is that we need to recompute $\bar{Y}_m$ and $\bar{X}_m$ at $w^\ast_m$ at every step $m$.
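As an illustration, here is a minimal sketch of one recursive step under the common simplification of linearizing only the new data point at the current estimate (the helper names and the toy model are hypothetical; the full scheme described above would also re-linearize the past data at every step):

```python
import numpy as np

def nonlinear_rls_update(P, w, x, y, f, grad_f, lam=1.0):
    """One Gauss-Newton-style RLS step: linearize f at the current w,
    then apply the linear RLS update with forgetting factor lam."""
    g = grad_f(x, w)                              # x_bar = gradient of f w.r.t. w
    Pg = P @ g
    P_next = (P - np.outer(Pg, Pg) / (lam + g @ Pg)) / lam
    w_next = w + P_next @ g * (y - f(x, w))       # innovation y - f(x, w)
    return P_next, w_next

# Hypothetical toy model: f(x, w) = w[0] * exp(w[1] * x), scalar x and y.
f = lambda x, w: w[0] * np.exp(w[1] * x)
grad_f = lambda x, w: np.array([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])

rng = np.random.default_rng(0)
w_true = np.array([2.0, -0.5])
P, w = 100.0 * np.eye(2), np.array([1.0, 0.0])    # large P_0, rough initial guess
for _ in range(2000):
    x = rng.uniform(0.0, 2.0)
    y = f(x, w_true) + 0.01 * rng.standard_normal()
    P, w = nonlinear_rls_update(P, w, x, y, f, grad_f, lam=0.999)
print(w)                                          # should be close to w_true
```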
A discussion of the convergence of nonlinear RLS can be found in Bertsekas' book Nonlinear Programming.