Conjugate Gradient: An Optimal Stopping Time Approach


\let\vec\mathbf

Introduction


Suppose that we have a symmetric matrix A\in \mathbb{R}^{N\times N} and our aim is to solve the linear system A\vec{x}=\vec{b} to find the exact solution \vec{x_*}. One of the algorithms most widely used for this task is the conjugate gradient (CG) method. The standard formulation of the CG method is as follows (a minimal code sketch is given after the list):

  1. We define the initial step: \vec{x_0} = \vec{0},\; \vec{r_0} =\vec{b},\; \vec{p_0}=\vec{r_0}.
  2. We compute the length of our next step: \alpha_n = \frac{\vec{r_{n-1}}^t \vec{r_{n-1}}}{\vec{p_{n-1}}^t A\vec{p_{n-1}}}.
  3. We compute the approximate solution: \vec{x_n}=\vec{x_{n-1}} + \alpha_n \vec{p_{n-1}}.
  4. We compute the residual: \vec{r_{n}}=\vec{r_{n-1}} - \alpha_n A\vec{p_{n-1}}.
  5. Last but not least, we update the search direction: \vec{p_{n}} = \vec{r_n} + \frac{\vec{r_{n}}^t \vec{r_{n}}}{\vec{r_{n-1}}^t\vec{r_{n-1}}}\vec{p_{n-1}}.
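Below is a minimal NumPy sketch of the iteration above, mainly to fix notation. The function name cg_sketch, the fixed number of steps, and the random test system are illustrative choices of mine, not part of the method description.

```python
import numpy as np

def cg_sketch(A, b, n_steps):
    """Minimal conjugate gradient iteration following steps 1-5 above.

    Returns the lists of iterates x_0, ..., x_n and residuals r_0, ..., r_n.
    Assumes A is symmetric positive definite.
    """
    x = np.zeros_like(b)          # step 1: x_0 = 0
    r = b.copy()                  # r_0 = b - A x_0 = b
    p = r.copy()                  # p_0 = r_0
    xs, rs = [x.copy()], [r.copy()]
    for _ in range(n_steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # step 2: step length
        x = x + alpha * p                 # step 3: new iterate
        r_new = r - alpha * Ap            # step 4: new residual
        beta = (r_new @ r_new) / (r @ r)  # step 5: direction update coefficient
        p = r_new + beta * p
        r = r_new
        xs.append(x.copy())
        rs.append(r.copy())
    return xs, rs

# Small usage example on a random SPD system (illustrative only).
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # symmetric positive definite
b = rng.standard_normal(50)
xs, rs = cg_sketch(A, b, 20)
print(np.linalg.norm(rs[-1]))
```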

The idea behind this method is to find the best approximation \vec{x_n} in the Krylov subspace of order n, K_n, that minimizes the quantity \phi(\vec{x_n})=\frac{1}{2}\vec{x_n}^t A \vec{x_n}-\vec{x_n}^t\vec{b}. A couple of interesting results for the CG method are the following:

Theorem 1

The following statements hold when the CG method is applied to a symmetric matrix:

  1. K_n=<\vec{b},A\vec{b},\dots, A^{n-1}\vec{b}>=<\vec{x_1},\dots,\vec{x_n}>=<\vec{p_0},\dots,\vec{p_{n-1}}>=<\vec{r_0},\dots,\vec{r_{n-1}}>.
  2. The residuals are orthogonal: \vec{r_i}^t\vec{r_j}=0 for i\neq j.
  3. The search directions are A-conjugate: \vec{p_i}^t A \vec{p_j}=0 for i\neq j.

Theorem 2

If we apply the CG method to a symmetric positive definite matrix, the element \vec{x_n}\in K_n that minimizes \phi(\vec{x_n}) is unique, and the convergence with respect to the error

e_n=||\vec{x_*} - \vec{x_n}||_A

is monotonic.

More details on the CG method and the theorems mentioned above can be found in Numerical Linear Algebra by Lloyd N. Trefethen and David Bau. An interesting aspect is that even if Theorem 2 states that the error of CG in the norm induced by A is monotonically decreasing, we cannot state the same for the residual \vec{r_n}. In the next sections we will use techniques developed in the field of optimal stopping to address the question of when it is convenient to stop the method in order to minimize the residual.
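To illustrate this point, here is a small self-contained experiment; the ill-conditioned test matrix, the use of scipy.sparse.linalg.cg with a callback, and all parameter values are my own illustrative choices. It records both the A-norm error ||\vec{x_*}-\vec{x_n}||_A and the Euclidean residual norm along the iteration: the first count is expected to be zero (floating point aside), while the second is often positive for ill-conditioned systems, i.e. the residual norm oscillates.

```python
import numpy as np
from scipy.sparse.linalg import cg

# Hypothetical ill-conditioned SPD test matrix (illustrative, not from the post).
rng = np.random.default_rng(1)
N = 100
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
eigs = np.logspace(0, 6, N)             # condition number around 1e6
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(N)
x_star = np.linalg.solve(A, b)          # reference solution for the A-norm error

a_norm_errors, resid_norms = [], []

def record(xk):
    e = x_star - xk
    a_norm_errors.append(np.sqrt(e @ (A @ e)))     # ||x_* - x_n||_A
    resid_norms.append(np.linalg.norm(b - A @ xk)) # ||r_n||

x_cg, info = cg(A, b, maxiter=60, callback=record)

# Count the steps at which each quantity increases: Theorem 2 guarantees the
# A-norm error never does, while the residual norm has no such guarantee.
inc_err = sum(e2 > e1 for e1, e2 in zip(a_norm_errors, a_norm_errors[1:]))
inc_res = sum(r2 > r1 for r1, r2 in zip(resid_norms, resid_norms[1:]))
print(f"A-norm error increased at {inc_err} steps, residual norm at {inc_res} steps")
```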

Preliminary Example


Let’s begin by addressing a very simple problem: suppose that we can compute up to n steps of the CG method and we have reached step n-1. How can we decide whether it would be convenient, in terms of residual, to compute the n-th step? A straightforward answer would be that we should compute \vec{x_n} if:

\mathbb{E}\Big[||A\vec{\xi_n} - \vec{b}||\Big]< ||\vec{r_{n-1}}||\;\;\;\; (D1)


where \vec{\xi_n} is the random variable whose realization is \vec{x_n}, and we can use any norm of our choosing without altering much the meaning of the equation. Considering the fact that each step of the CG method is computed using only the previous one, we can assume that \vec{\xi_n} is a Markov process. In light of the fact that \vec{\xi_n} is a Markov chain, we can replace the decision criterion (D1) with:

\mathbb{E}\Big[||A\vec{\xi_n} - \vec{b}||\; \Big| \; \vec{\xi_{n-1}}=\vec{x_{n-1}}\Big]< || \vec{r_{n-1}}||\;\;\;\; (D2)


The idea of introducing \vec{\xi_n} is justified by the fact that the value of \vec{x_n} is unknown to the user at time n-1, even if it can be exactly computed at the n-th step. The idea is for the user to make an “educated guess” of the outcome of the CG method at step n, and to decide, thanks to this educated guess, whether to proceed to compute \vec{x_n}. The “education” of our guess consists in the distribution that we assume \vec{\xi_n} has. We know that the CG method moves along a direction \vec{p_{n-1}} with step length \alpha_n to minimize \phi(\vec{x_n}), and we need to find a distribution that correctly represents this behavior. To do so we start from a multivariate Gaussian in \mathbb{R}^N centered in \vec{x_{n-1}}:

u^0(\vec{x})= \frac{1}{\sqrt{(2\pi)^N \det(\Sigma)}}\exp\Big(-\frac{1}{2}(\vec{x}-\vec{x_{n-1}})^t\Sigma^{-1}(\vec{x}-\vec{x_{n-1}})\Big)


Then we puncture the Gaussian by propagating it using the wave equation:

\begin{cases}\partial^2_t u_{_N} = \Delta u_{_N} \\ u_{_N}(\vec{x},0)=u^0(\vec{x})\\ \partial_t u_{_N}(\vec{x},0) \equiv 0 \end{cases}

The solution of the above problem, which we will denote by u_{_N}(\vec{x},t), is still a probability density function under certain hypotheses that we will investigate later. Furthermore, it has the shape of a punctured Gaussian. Such a shape tells us that we have greater probability of finding \vec{\xi_n} around the N-dimensional hole produced around \vec{x_{n-1}}. In particular, since we know that for Gaussian pulses the wave front propagates as a sphere of radius ct, the density u_{_N}(\vec{x},\alpha_n) seems a fair distribution for \vec{\xi_n}, given that we have c=1. We will remark later on how the spectral properties of A, together with the choice of \Sigma, can improve the way we build our distribution, and in the same section we will note that a general formula to compute u_{_N} is available. Now, if we decide to measure the residual with the squared Euclidean norm, the criterion (D2) becomes:

\int_{\mathbb{R}^N} ||A\vec{x} - \vec{b}||^2 \,u_{_N}(\vec{x},\alpha_n)\, d\vec{x}\leq ||\vec{r_{n-1}}||^2

\int_{\mathbb{R}^N} \Big(\vec{x}^t (A^tA) \vec{x} -2\vec{x}^t A^t \vec{b} + \vec{b}^t\vec{b}\Big)u_{_N}(\vec{x},\alpha_n)\, d\vec{x}\leq ||\vec{r_{n-1}}||^2

\int_{\mathbb{R}^N} \vec{x}^t (A^tA) \vec{x} \;u_{_N}(\vec{x},\alpha_n)\, d\vec{x}-2\int_{\mathbb{R}^N} \vec{x}^t A^t \vec{b}\, u_{_N}(\vec{x},\alpha_n)\; d\vec{x}+\int_{\mathbb{R}^N} \vec{b}^t\vec{b}\;u_{_N}(\vec{x},\alpha_n)\, d\vec{x}\leq ||\vec{r_{n-1}}||^2\;\;\;\; (D3)


Equation (D3) tells us when it is convenient to compute \vec{x_n} according to our educated guess.
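As a sketch of how (D3) could be checked in practice, the snippet below estimates its left-hand side by Monte Carlo. Since sampling exactly from u_{_N} is not straightforward, it uses a crude stand-in that I introduce purely for illustration: Gaussian samples around \vec{x_{n-1}} pushed onto the sphere of radius \alpha_n, mimicking the mass concentrated on the wave front. The function name and all parameters are hypothetical.

```python
import numpy as np

def should_compute_next_step(A, b, x_prev, alpha_n, Sigma_chol,
                             n_samples=10_000, rng=None):
    """Monte Carlo version of the decision rule (D3), under a crude stand-in for u_N.

    Samples z ~ N(x_prev, Sigma) (Sigma_chol is a Cholesky factor of Sigma) and
    projects them onto the sphere of radius alpha_n around x_prev. Returns True
    when the estimated expected squared residual at step n is below ||r_{n-1}||^2.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(b)
    z = x_prev + rng.standard_normal((n_samples, N)) @ Sigma_chol.T
    d = z - x_prev
    d /= np.linalg.norm(d, axis=1, keepdims=True)                    # unit directions
    xi = x_prev + alpha_n * d                                        # samples on the shell
    expected_sq_resid = np.mean(np.sum((xi @ A.T - b) ** 2, axis=1)) # E[||A xi_n - b||^2]
    current_sq_resid = np.linalg.norm(A @ x_prev - b) ** 2           # ||r_{n-1}||^2
    return expected_sq_resid < current_sq_resid
```

One would call this right after step 2 of the CG loop, once \alpha_n is available, with Sigma_chol = np.linalg.cholesky(\Sigma); note that the shell projection keeps only the directional anisotropy of \Sigma, not its radial scale.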

Further Discussion Regarding The Distribution


Here we want to extend the explanation of why the distribution u_n^{_N} was chosen, together with some general remarks regarding it. We will start from a two-dimensional view to ease our minds, or at least mine. We can imagine the CG method as moving from \vec{x_1}\in \mathbb{R}^2 to \vec{x_2}\in \mathbb{R}^2 along the direction \vec{p_1} with step length \alpha_2. So we can start by assuming that \vec{\xi_2} is distributed around \vec{x_1} with radius \alpha_2. But clearly we want the density function of \vec{\xi_2} to be null near \vec{x_1}, since we know we are moving away from it. To achieve this result we begin from a Gaussian centered in \vec{x_1}:

f_2(\vec{x})=\frac{1}{2\pi\det(\Sigma)^{\frac{1}{2}}}\exp\Big(-\frac{1}{2}(\vec{x}-\vec{x_1})^t \Sigma^{-1}(\vec{x}-\vec{x_1})\Big)

Then we propagate this distribution using the wave equation. In particular, we are interested in finding the solution to the problem:

\begin{cases} \partial_t^2 u_2^{_2} = \Delta u_2^{_2}\\ u_2^{_2}(\vec{x},0) = f_2(\vec{x})\\  \partial_t u_2^{_2}(\vec{x},0) \equiv 0 \end{cases}
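This propagation can be reproduced qualitatively with a standard leapfrog finite-difference scheme for the 2D wave equation; the grid, the isotropic choice \Sigma=\sigma^2 I, and the final time below are arbitrary illustrative choices of mine.

```python
import numpy as np

# Leapfrog finite-difference sketch for u_tt = u_xx + u_yy with a Gaussian
# initial condition and zero initial velocity.
L, nx = 4.0, 201                      # domain [-L, L]^2
x = np.linspace(-L, L, nx)
h = x[1] - x[0]
dt = 0.4 * h                          # satisfies the CFL condition dt <= h/sqrt(2)
X, Y = np.meshgrid(x, x, indexing="ij")
sigma = 0.3
u_prev = np.exp(-(X**2 + Y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def laplacian(u):
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
                       - 4 * u[1:-1, 1:-1]) / h**2
    return lap

# First step uses the zero initial velocity: u^1 = u^0 + (dt^2 / 2) * Lap(u^0).
u_curr = u_prev + 0.5 * dt**2 * laplacian(u_prev)

t_final = 1.5                         # plays the role of alpha_n
for _ in range(int(t_final / dt)):
    u_next = 2 * u_curr - u_prev + dt**2 * laplacian(u_curr)
    u_prev, u_curr = u_curr, u_next

# The propagated pulse is small near the origin (possibly negative, which is why
# the text later takes the absolute value before normalizing) and peaks near the
# ring of radius roughly t_final: the Gaussian has been "punctured".
center_val = u_curr[nx // 2, nx // 2]
ring_val = u_curr[nx // 2 + int(t_final / h), nx // 2]
print(center_val, ring_val)
```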

We can see from Figure 1 that this procedure creates a hole in our normal distribution, and this hole expands with time. Furthermore, let us consider the Green's function associated with this problem:

G(\vec{x},t) = \frac{1}{2\pi} \frac{1}{\sqrt{t^2-||\vec{x}||^2}}\Theta(t-||\vec{x}||)

where \Theta is the Heaviside function. We can easily see that the radius of the “hole” produced by propagating f_2 using the wave equation is t, since:

\Theta(x) = \begin{cases} 1 & \text{if } x>0 \\ 0.5 & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}

We can also compute explicitly the solution to the wave equation:

H(t,\vec{y},\vec{x})=\bigg( t^2 - |\vec{y} - \vec{x} |^2 \bigg)^{-\frac{1}{2}}

u_2^{_2}(\vec{x},t) = \frac{1}{2\pi t} \frac{1}{m\Big(B(\vec{x},t)\Big)}\int_{B(\vec{x},t)} \bigg( t\, u_2^{_2} (\vec{y},0) + t^2\, \partial_t u_2^{_2}(\vec{y},0) + t\, \nabla u_2^{_2}(\vec{y},0) \cdot (\vec{y}-\vec{x}) \bigg) H(t,\vec{y},\vec{x})\,d\vec{y}

u_2^{_2}(\vec{x},t) = \frac{1}{2\pi t} \frac{1}{m\Big(B(\vec{x},t)\Big)}\int_{B(\vec{x},t)} \frac{t}{2\pi \det(\Sigma)^\frac{1}{2}}\bigg(1-\Sigma^{-1}(\vec{y}-\vec{x_1})\cdot(\vec{y}-\vec{x})\bigg)\exp\Big(-\frac{1}{2} (\vec{y}-\vec{x_1})^t\Sigma^{-1}(\vec{y}-\vec{x_1})\Big)H(t,\vec{y},\vec{x})\,d\vec{y}


We consider only the absolute value of this function to obtain a function u_2^{_2}(\vec{x},\alpha_2) that is Lebesgue integrable and non-negative, i.e. it is a probability density function once we normalize. In particular, we will call u_2^{_2} the probability density function obtained in this way. We can follow the same procedure for a Gaussian in \mathbb{R}^N, and we can compute analytically the solution to the wave propagation, as in Wave Equation in Higher Dimensions, Lecture Notes Maths 220A Stanford, to see that it is Lebesgue integrable. We will call u_n^{_N} the pdf obtained by normalizing the absolute value of the previously mentioned solution with t=\alpha_n, starting from a Gaussian centered in \vec{x_{n-1}}. Furthermore, even in higher dimensions we know that the wave front propagates at a distance t from the mean of the Gaussian, and so we will have a hole of radius t produced at the center of the Gaussian.
Last but not least, since we know the direction of search, we can choose \Sigma such that it has eigenvector \vec{p_{n-1}} and that this eigenvector is associated with the greatest eigenvalue of the matrix. This produces the Gaussian shown in Figure 2.
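A minimal sketch of this choice of \Sigma, assuming we simply prescribe one variance along \vec{p_{n-1}} and a smaller one on the orthogonal complement (the two variances are free parameters I introduce for illustration):

```python
import numpy as np

def covariance_along_direction(p, sigma_par=1.0, sigma_perp=0.1):
    """Build Sigma with p (the search direction) as the eigenvector of the
    largest eigenvalue: Sigma = sigma_par^2 p p^T + sigma_perp^2 (I - p p^T).

    Requires sigma_par > sigma_perp so that the Gaussian is stretched along p.
    """
    p = np.asarray(p, dtype=float)
    p = p / np.linalg.norm(p)
    P = np.outer(p, p)                       # orthogonal projector onto span{p}
    return sigma_par**2 * P + sigma_perp**2 * (np.eye(len(p)) - P)

# Example: in R^2, align the long axis of the Gaussian with p_1 = (1, 1)/sqrt(2).
Sigma = covariance_along_direction([1.0, 1.0])
print(np.linalg.eigh(Sigma))   # eigenvalues sigma_perp^2 and sigma_par^2
```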


The last notation we will adopt is u_{n,\vec{\mu}}^{_N}, to indicate the probability density function obtained as above using the direction of search \vec{p_{n-1}}, but starting from a Gaussian centered in \vec{\mu}, propagated until t=\alpha_n.

Optimal Stopping Approach


We will here address the problem of finding the optimal stopping time for the CG method within a finite horizon m\leq N.
Let’s consider again the Markov chain \{\xi_n\}_{n\geq 0}; we can suppose, as we did in our preliminary example, that \xi_n\sim u_n^{_N}(\vec{x},\alpha_n). Here we have to deal with a time-inhomogeneous chain; therefore, to apply the results developed in the book Optimal Stopping and Free-Boundary Problems by Goran Peskir and Albert N. Shiryaev, we need to introduce a time-homogeneous Markov chain Z_n = (n,\xi_n). Now, to find the optimal stopping time for the CG method, we will introduce a weak version of the result presented in that book.

Theorem 3

Let’s consider the optimal stopping time problem:

V^{m}(n,\vec{x}) = \underset{0\leq \tau \leq m-n}{\sup} \mathbb{E} \Big[ G(n+\tau,\xi_{n+\tau}) \Big|  \xi_{n} = \vec{x} \Big]

Assuming that \mathbb{E}\Big[ \underset{0\leq \tau \leq m-n}{\sup} |G(n+\tau,\xi_{n+\tau})| \Big] <  \infty, where G:\{0,\dots,m\}\times \mathbb{R}^N\to \mathbb{R} is our gain function, the following statements hold:

  1. The Wald–Bellman equation can be used to compute V^{m}(n,\vec{x}) = \max\bigg\{ G(n,\vec{x}), T\Big[ V^{m}(n,\vec{x}) \Big]\bigg\} backwards for n=m-1,\dots,0, with V^{m}(m,\vec{x}) = G(m,\vec{x}).
  2. The optimal stopping time \tau_D can be computed as \tau_D = \inf \{0 \leq \tau \leq m-n \;such\;that\; (n+\tau,\xi_{n+\tau}) \in D \}, where D=\Big\{ (n,\vec{x})\in \{0,\dots,m\}\times\mathbb{R}^N \; such \; that\; V^{m}(n,\vec{x})=G(n,\vec{x})\Big\}.

Here the transition operator T is defined as follows:

T\Big[ g(n,\vec{x}) \Big] = \mathbb{E} \Big[ g(n+1,\xi_{n+1}) \,\Big|\,  \xi_n=\vec{x}  \Big]


In this case our gain function will be defined as follows:

G:\{0,\dots,m\}\times \mathbb{R}^N \to \mathbb{R}

(n,\vec{x}) \mapsto \frac{1}{||A\vec{x}-\vec{b}||}


which respects the hypothesis of the previous theorem, since the largest value the gain function can attain is \frac{1}{\epsilon_m}<\infty. Our transition operator becomes:

T\Big[ g(n,\vec{x}) \Big] = \int_{\mathbb{R}^N} g(n+1,\vec{y})\,u_{n+1,\vec{x}}^{_N}(\vec{y},\alpha_{n+1})\, d\vec{y}

In particular, we can use the Wald–Bellman equation to compute V^{m}(\cdot,\vec{x}) with a technique known as backward induction (a common practice in dynamic programming):

V^m(m-1,\vec{x}) = \max \bigg\{ \frac{1}{||A\vec{x}-\vec{b}||},\int_{\mathbb{R}^N} \frac{1}{||A\vec{y}-\vec{b}||}\, u_{m,\vec{x}}^{_N}(\vec{y},\alpha_{m})\, d\vec{y}\bigg\}\;\;\;\; (OP1)


V^m(m-2,\vec{x}) = \max \bigg\{ \frac{1}{||A\vec{x}-\vec{b}||}, \int_{\mathbb{R}^N} V^m(m-1,\vec{y})\, u_{m-1,\vec{x}}^{_N}(\vec{y},\alpha_{m-1})\, d\vec{y}\bigg\}\;\;\;\; (OP2)

\vdots


This allows us to build the set D=\Big\{ (n,\vec{x})\in \{0,\dots,m\}\times\mathbb{R}^N \; such \; that\; V^{m}(n,\vec{x})=G(n,\vec{x})\Big\}, whose associated entry time \tau_D is the optimal stopping time we were searching for.
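To make the backward structure of (OP1), (OP2), … concrete, here is a naive nested Monte Carlo sketch of the Wald–Bellman recursion. It reuses the same crude shell-projection stand-in for u^{_N} as in the earlier sketch, its cost grows exponentially with the horizon m-n, and every name and parameter in it is an illustrative assumption rather than part of the method above.

```python
import numpy as np

def gain(A, b, x):
    """G(n, x) = 1 / ||A x - b|| (independent of n here)."""
    return 1.0 / np.linalg.norm(A @ x - b)

def sample_next(x, alpha, Sigma_chol, n_samples, rng):
    """Crude stand-in sampler for u_{n+1, x}^N: Gaussian samples around x
    projected onto the sphere of radius alpha (same assumption as before)."""
    z = rng.standard_normal((n_samples, len(x))) @ Sigma_chol.T
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return x + alpha * z

def value(A, b, n, x, m, alphas, Sigma_chol, n_samples=50, rng=None):
    """Wald-Bellman recursion V^m(n, x) = max{ G(n, x), T[V^m](n, x) },
    with the transition operator estimated by Monte Carlo.

    alphas[k] holds the CG step length alpha_k for k = 1, ..., m (alphas[0] unused).
    Exponential in the horizon m - n, so only usable for very short horizons.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = gain(A, b, x)
    if n == m:                       # terminal condition V^m(m, x) = G(m, x)
        return g
    samples = sample_next(x, alphas[n + 1], Sigma_chol, n_samples, rng)
    continuation = np.mean([value(A, b, n + 1, y, m, alphas, Sigma_chol, n_samples, rng)
                            for y in samples])
    return max(g, continuation)

def stop_here(A, b, n, x, m, alphas, Sigma_chol):
    """(n, x) belongs to the stopping set D when V^m(n, x) equals G(n, x)."""
    return np.isclose(value(A, b, n, x, m, alphas, Sigma_chol), gain(A, b, x))
```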

Conclusion

We have shown here two ideas: the first one is presented in section 2 and explains how to decide at step n-1 whether it is convenient to compute step n; the second one provides a technique to compute the optimal stopping time for the CG method. These ideas are mainly an interesting exercise in optimal stopping and a probabilistic approach to the stopping criteria of numerical linear algebra. This is because, to compute the densities u_{n}^{_N} and u_{n,\vec{\mu}}^{_N}, we need \vec{p_n} and \alpha_n, whose computation is the most expensive operation in the CG method, \mathcal{O}(N^2).
An interesting approach to make the second idea computationally worthwhile would be to use a low-rank approximation of order M\ll N. In this way the complexity of computing the approximations of \vec{p_n} and \alpha_n is only \mathcal{O}(MN).
Last but not least, it is important to mention that if we wanted to implement the aforementioned ideas, we could use a Monte Carlo integration technique to evaluate the integrals in equations (D3), (OP1) and (OP2), and a multidimensional root-finding algorithm to compute D.
Interesting aspects that are worth investigating are how to build the best possible low-rank approximation and whether theories such as that of generalized Toeplitz sequences can give us useful information to build our density function in a more efficient way.
