Suppose that we have a symmetric matrix $A \in \mathbb{R}^{n \times n}$ and that our aim is to solve the linear system $Ax = b$, finding the exact solution $x^*$. One of the algorithms most widely implemented to perform such a task is the conjugate gradient (CG) method. The standard formulation of the CG method is as follows:

- We define the initial step: $x_0 = 0$, $p_0 = r_0 = b - A x_0 = b$.
- We compute the length of our next step: $\alpha_k = \dfrac{r_k^T r_k}{p_k^T A p_k}$.
- We compute the approximated solution: $x_{k+1} = x_k + \alpha_k p_k$.
- We compute the residual: $r_{k+1} = r_k - \alpha_k A p_k$.
- Last but not least, we update the search direction: $p_{k+1} = r_{k+1} + \beta_k p_k$, with $\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$ (a minimal code sketch of the whole iteration follows this list).
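For concreteness, here is a minimal MATLAB sketch of the iteration just described; it assumes $A$ symmetric positive definite, starts from $x_0 = 0$, and runs a fixed number of steps. The residual bookkeeping and the function name are illustrative, not part of the original formulation.

```matlab
% Minimal CG sketch: assumes A symmetric positive definite, x_0 = 0.
function [x, resnorm] = cg_sketch(A, b, nsteps)
    x = zeros(size(b));                 % x_0
    r = b - A*x;                        % r_0
    p = r;                              % p_0
    resnorm = zeros(nsteps, 1);
    for k = 1:nsteps
        Ap = A*p;
        alpha = (r'*r) / (p'*Ap);       % step length alpha_k
        x = x + alpha*p;                % x_{k+1}
        rnew = r - alpha*Ap;            % residual r_{k+1}
        beta = (rnew'*rnew) / (r'*r);   % beta_k
        p = rnew + beta*p;              % new search direction p_{k+1}
        r = rnew;
        resnorm(k) = norm(r);           % track the (possibly non-monotonic) residual
    end
end
```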

The idea behind this method is to find the best approximation $x_k$ in the Krylov subspace of order $k$, $\mathcal{K}_k = \mathrm{span}\{b, Ab, \dots, A^{k-1}b\}$, that minimizes the quantity $\|x^* - x_k\|_A$, the error measured in the norm induced by $A$. A couple of interesting results for the CG method are the following:

**Theorem** **1**

The following statements hold when the CG method is applied to a symmetric matrix:

- The Krylov subspaces coincide: $\mathcal{K}_k = \mathrm{span}\{x_1, \dots, x_k\} = \mathrm{span}\{p_0, \dots, p_{k-1}\} = \mathrm{span}\{r_0, \dots, r_{k-1}\} = \mathrm{span}\{b, Ab, \dots, A^{k-1}b\}$.
- The residuals are orthogonal: $r_k^T r_j = 0$ for $j < k$.
- The search directions are conjugate: $p_k^T A p_j = 0$ for $j < k$.

**Theorem 2**

If we apply the CG method to a **positive definite** symmetric matrix, the element $x_k \in \mathcal{K}_k$ that minimizes the error is unique, and the convergence with respect to the norm

$$\| x^* - x_k \|_A = \sqrt{(x^* - x_k)^T A (x^* - x_k)}$$

is monotonic.

More details on the CG method and the theorems mentioned above can be found in *Numerical Linear Algebra*, Lloyd N. Trefethen. An interesting aspect is that even if Theorem 2 states that the error of CG in the norm induced by $A$ is monotonically decreasing, we cannot state the same for the residual $r_k = b - A x_k$. In the next sections we will try to use techniques developed in the field of optimal stopping to address the question of when it is convenient to stop our method in order to minimize the residual.

Let's begin by addressing a very simple problem: suppose that we can compute at most $N$ steps of the CG and that we have reached step $k$. How can we decide whether it would be convenient, in terms of residual, to compute the $(k+1)$-th step $x_{k+1}$? A straightforward answer would be that we should compute $x_{k+1}$ if

$$\mathbb{E}\big[\, \| b - A X_{k+1} \| \,\big] \le \| b - A x_k \|,$$

where $X_{k+1}$ is the random variable whose realization is $x_{k+1}$, and we can use any norm of our choosing without altering much the meaning of the equation. Considering the fact that each step of the CG method is computed only from the previous one, we can assume that $(X_k)_k$ is a Markov process. In light of the fact that $(X_k)_k$ is a Markov chain, we can replace the decision criterion above with:

$$\mathbb{E}\big[\, \| b - A X_{k+1} \| \,\big|\, X_k = x_k \,\big] \le \| b - A x_k \|.$$

The idea of introducing $X_{k+1}$ is justified by the fact that the value of $x_{k+1}$ is unknown to the user at step $k$, even if it can be exactly computed at the $(k+1)$-th step. The idea is for the user to make an “educated guess” of the outcome of the CG method at step $k+1$, and to decide thanks to this educated guess whether to proceed and compute $x_{k+1}$. The “education” of our guess consists in the distribution that we assume $X_{k+1}$ has. We know that the CG method moves along a direction $p_k$ of length $\alpha_k$ to minimize the error in the norm induced by $A$, and we need to find a distribution that correctly represents this behaviour. To do so we start from a multivariate Gaussian in $\mathbb{R}^n$ centered in $x_k$:

Then we puncture the Gaussian by propagating it with the wave equation:

The solution of the above problem, under certain hypotheses that we will investigate later, is still (after normalization) a probability density function. Furthermore, it has the shape of a punctured Gaussian. Such a shape tells us that we have a greater probability of finding $X_{k+1}$ around the hole produced around $x_k$. In particular, since we know that in the particular case of Gaussian pulses the wave front propagates as a sphere of radius $ct$, the density seems a fair distribution for $X_{k+1}$, given that we know the length $\alpha_k$ of the step the CG method is about to take. We will remark later on how the spectral properties of $A$, together with the choice of the covariance, can improve the way we build our distribution; we will also remark in the same section that we can provide a general formula to compute this density. Now, if we decide to use the Euclidean norm to evaluate our residual, the decision criterion above is equivalent to:

The equation above tells us when it is convenient to compute $x_{k+1}$ according to our educated guess.

Here we want to extend the explanation of why this distribution was chosen, together with some general remarks regarding it. We will start from a two-dimensional view to ease our minds, or at least mine. We can imagine the CG method as moving from $x_k$ to $x_{k+1}$ along the direction $p_k$ with step length $\alpha_k$. So we can start by assuming that $X_{k+1}$ is distributed around $x_k$ with radius $\alpha_k$. But clearly we want the density function of $X_{k+1}$ to be null near $x_k$, since we know we are moving away from it. To achieve this result we begin from a Gaussian centered in $x_k$:

Then we propagate this distribution using the wave equation. In particular, we are interested in finding the solution to the problem:

We can see from Figure 1 that this procedure creates a hole in our normal distribution, and this hole expands with time. Furthermore, if we consider the Green's function associated with this problem:

where $H$ is the Heaviside function, we can easily see that the radius of the “hole” produced by propagating with the wave equation is $ct$, since:

We can also compute explicitly the solution to the wave equation:

We consider only the absolute value of this function to obtain a function that is Lebesgue integrable and non-negative, i.e. a probability density function once we normalize; this is the density we will use in the two-dimensional case. We can perform the same procedure for a Gaussian in $\mathbb{R}^n$, and we can compute analytically the solution of the wave propagation, as in *Wave Equation in Higher Dimensions, Lecture Notes Maths 220A, Stanford*, to see that we again obtain a Lebesgue integrable function. We will use as pdf the normalization of the absolute value of the previously mentioned solution, starting from a Gaussian centered in $x_k$. Furthermore, even in higher dimension we know that the wave front propagates at a distance $ct$ from the mean of the Gaussian, and so we will have a hole of radius $ct$ produced in the center of the Gaussian.

Last but not least, since we know the search direction $p_k$, we can choose the covariance of the Gaussian so that $p_k$ is one of its eigenvectors and this eigenvector is associated with the greatest eigenvalue of the covariance matrix. This produces the Gaussian shown in Figure 2.

**Figure 2**: a Gaussian that propagates in the direction of the CG search direction.
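One possible way to realize this choice of covariance, sketched here as an assumption rather than the post's exact construction: give the search direction a larger variance than every orthogonal direction, so that it is the eigenvector associated with the largest eigenvalue.

```matlab
% Hypothetical sketch: a covariance with the search direction p as the
% eigenvector of largest eigenvalue (sig1 > sig0 are illustrative parameters).
function Sigma = direction_covariance(p, sig1, sig0)
    u = p / norm(p);                              % unit vector along the search direction
    n = numel(p);
    Sigma = sig0*eye(n) + (sig1 - sig0)*(u*u');   % eigenvalues: sig1 along u, sig0 elsewhere
end
```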

The last notation we will adopt denotes the probability density function obtained using the search direction as above, starting from a Gaussian centered in $x_k$ and propagated until the hole has radius $\alpha_k$.

We will here address the problem of finding the optimal stopping time for the CG method within a finite horizon $N$.

Let's consider again the Markov chain $(X_k)_k$; we can suppose, as we did in our preliminary example, that its transitions are described by the punctured Gaussian built above. Here we have to deal with a time-inhomogeneous Markov chain; therefore, to apply the results developed in the book *Optimal Stopping and Free-Boundary Problems*, Peskir, Goran, Shiryaev, Albert N., we need to introduce a time-homogeneous Markov chain, obtained in the standard way by pairing the step index with the state. Now, to find the optimal stopping time for the CG method, we will introduce a weak version of the result presented in the same book.

**Theorem 3**

Let's consider the optimal stopping time problem:

$$V = \sup_{0 \le \tau \le N} \mathbb{E}\big[\, G(X_\tau) \,\big].$$

Assuming that $\mathbb{E}\big[\sup_{0 \le k \le N} |G(X_k)|\big] < \infty$, where $G$ is our gain function, the following statements hold:

- The Wald–Bellman equation $V^n = \max\big(G, T V^{n-1}\big)$, with $V^0 = G$, can be used to compute the value functions $V^n$ for $n = 1, \dots, N$.
- The optimal stopping time can be computed as $\tau^* = \inf\{\, 0 \le k \le N : X_k \in D_k \,\}$, where $D_k = \{\, x : V^{N-k}(x) = G(x) \,\}$.

Here the transition operator $T$ is defined as follows:

$$T F(x) = \mathbb{E}\big[\, F(X_{k+1}) \,\big|\, X_k = x \,\big].$$

In this case our gain function will be defined as follows:

which respects the hypotheses of the previous theorem, since the gain function is bounded from above. Our transition operator becomes:

In particular, we can use the Wald–Bellman equation to compute the value functions by backward induction (a common practice in dynamic programming):

$$V^0 = G, \qquad V^n = \max\big(G, T V^{n-1}\big), \qquad n = 1, \dots, N,$$

and this allows us to build the stopping sets $D_k$; the first step at which the chain enters them (the infimum in Theorem 3) is the optimal stopping time we were searching for.
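To make the backward recursion concrete, here is an illustrative MATLAB sketch on a *discrete-state* Markov chain with a known transition matrix `P` and gain vector `G`; this is a stand-in for the actual chain of CG iterates, not the post's construction.

```matlab
% Finite-horizon optimal stopping by backward induction (Wald-Bellman):
% P is the transition matrix, G the gain over the states (a column vector),
% N the horizon.
function [V, stop] = backward_induction(P, G, N)
    nstates = numel(G);
    V = zeros(nstates, N+1);
    stop = false(nstates, N+1);       % stop(s,k): stop in state s at step k?
    V(:, N+1) = G;                    % at the horizon we must stop
    stop(:, N+1) = true;
    for k = N:-1:1
        cont = P * V(:, k+1);         % expected value of continuing one step
        V(:, k) = max(G, cont);
        stop(:, k) = (G >= cont);     % stop where the gain beats continuation
    end
end
```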

We have shown two ideas here: the first one, presented in Section 2, explains how to decide at step $k$ whether it is convenient to compute $x_{k+1}$; the second one provides a technique to compute the optimal stopping time for the CG method. These ideas are mostly an interesting exercise in optimal stopping and a probabilistic take on stopping criteria for numerical linear algebra. This is because, to build the densities involved, we need quantities such as $p_k$ and $\alpha_k$, whose computation requires the matrix–vector products that are the most expensive operations of the CG method ($O(n^2)$ flops each for a dense matrix).

An interesting approach to make the second idea computationally worthwhile would be to use a low-rank approximation of $A$. In this way the complexity of computing the approximations of the quantities above is greatly reduced: a rank-$k$ matrix–vector product costs $O(kn)$ flops instead of $O(n^2)$.

Last but not least, it is important to mention that if we would like to implement the aforementioned ideas, we could use a Monte Carlo integration technique to evaluate the expectations involved and a multidimensional root-finding algorithm to compute the remaining quantities.
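As a hedged illustration of the Monte Carlo step, the snippet below estimates the expected residual norm at the next iterate by sampling. The sampler is a placeholder isotropic Gaussian around $x_k$ with scale $\alpha_k$; it would have to be replaced by a sampler for the punctured density described above.

```matlab
% Monte Carlo estimate of E[ ||b - A*X|| ] with a placeholder sampler.
function est = mc_expected_residual(A, b, xk, alphak, nsamples)
    n = numel(b);
    acc = 0;
    for s = 1:nsamples
        X = xk + alphak*randn(n, 1);   % placeholder: NOT the punctured Gaussian
        acc = acc + norm(b - A*X);
    end
    est = acc / nsamples;
end
```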

Interesting aspects that are worth investigating are how to build the best possible low-rank approximation and whether theories such as that of generalized Toeplitz sequences can give us useful information to build our density function in a more efficient way.

During my first year of the undergraduate Maths degree at the University of Pavia I was seriously evaluating the possibility of changing degree course to something more related to Software Engineering. Thankfully I came across the course of Numerical Linear Algebra. I remember that during that course ideas that I hadn't fully understood until that moment became clearer, and I was fascinated in particular by the combination of scientific computing and pure mathematical analysis of the methods treated in the course. My attention was captured by a very interesting algorithm known by the name of Arnoldi iteration.

In this review we will focus our attention on the Arnoldi iteration, a numerical linear algebra method first introduced in the 1950s by W. E. Arnoldi \cite{arnoldi1951}. To understand what this method does and where it can be used, we will take a brief detour through some general methods of numerical linear algebra. The key definition needed to understand the Arnoldi iteration is the following:

**Definition – Hessenberg**

A matrix $H \in \mathbb{C}^{n \times n}$ is said to be in (upper) **Hessenberg** form if it has such a shape that $h_{ij} = 0$ whenever $i > j + 1$, where we call $h_{ij}$ the entry of $H$ in the $i$-th row and $j$-th column. Therefore $H$ has to be shaped as:

$$H = \begin{pmatrix} h_{11} & h_{12} & \cdots & \cdots & h_{1n} \\ h_{21} & h_{22} & \cdots & \cdots & h_{2n} \\ & h_{32} & \ddots & & \vdots \\ & & \ddots & \ddots & \vdots \\ & & & h_{n,n-1} & h_{nn} \end{pmatrix}.$$
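As a quick MATLAB illustration on a random test matrix, the built-in `hess` reduces a matrix to Hessenberg form by a unitary similarity, and the zero pattern below the first subdiagonal can be inspected directly:

```matlab
% Reduce a random matrix to Hessenberg form and check the decomposition.
A = randn(6);
[P, H] = hess(A);        % A = P*H*P' with P unitary and H upper Hessenberg
disp(round(H, 4))        % entries with i > j+1 are (numerically) zero
norm(A - P*H*P')         % should be of the order of machine precision
```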

We will now try to show why this particular matrix form is of great importance in numerical linear algebra, by considering the eigenvalue problem.

Let's consider a square matrix $A \in \mathbb{C}^{n \times n}$ and let's search for the values $\lambda \in \mathbb{C}$ such that there exists $v \neq 0$ verifying the equation $Av = \lambda v$. We know from linear algebra that this problem is equivalent to computing the zeros of the characteristic polynomial of $A$:

$$p_A(z) = \det(zI - A);$$

the zeros of $p_A$ are the eigenvalues of $A$, and we call spectrum of $A$ the set:

$$\sigma(A) = \{\, \lambda \in \mathbb{C} : p_A(\lambda) = 0 \,\}.$$

A numerical approach to this problem is the **power iteration method**; a general overview of this method can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997. Briefly, this method consists in studying the sequences:

$$v_{k+1} = \frac{A v_k}{\| A v_k \|}, \qquad \lambda_k = v_k^* A v_k.$$

Under the assumptions that $A$ is Hermitian, that the spectrum of $A$ verifies the inequality chain $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$, and that the randomly chosen vector $v_0$ has no zero component in the direction of the eigenvector associated with $\lambda_1$, it is possible to prove that the sequence $\lambda_k$ converges to $\lambda_1$; the actual proof can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997.
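A minimal MATLAB sketch of the power iteration just described, under the same assumptions (Hermitian $A$ with a dominant eigenvalue); the Rayleigh quotient is used as the eigenvalue estimate:

```matlab
% Power iteration sketch: returns an estimate of the dominant eigenpair.
function [lambda, v] = power_iteration(A, niter)
    v = randn(size(A, 1), 1);
    v = v / norm(v);                 % random starting vector of unit norm
    for k = 1:niter
        w = A*v;                     % the O(n^2) matrix-vector product
        v = w / norm(w);
        lambda = v' * (A*v);         % Rayleigh quotient estimate of lambda_1
    end
end
```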

The main cost of the power iteration method is computing $A v_k$, in particular the matrix–vector product; to be precise, to compute $A v_k$ we need $O(n^2)$ flops. We can easily understand that for each zero entry of the matrix the computational cost is reduced, therefore we aim to represent $A$ in a form with the greatest number of zero entries. If $A$ is Hermitian, the spectral theorem tells us that $A$ can be diagonalized: there exists a unitary $Q$ such that $A = Q \Lambda Q^*$, with $\Lambda$ a diagonal matrix. Clearly the diagonal form would be the best in terms of zero entries; unfortunately, knowing this decomposition is equivalent to knowing the eigenvalues of $A$, and that is precisely the problem we are addressing. The next best matrix forms in terms of zero entries are the tridiagonal, the triangular and the Hessenberg form. If we try to express $A$ in triangular form by orthogonal changes of basis, we need to perform a Schur decomposition. The problem with this approach is the cost of the Schur decomposition, which is computed by the QR algorithm at a cost of order $O(n^3)$; details about the Schur decomposition can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997, and Quarteroni, Alfio, Riccardo Sacco, and Fausto Saleri, *Numerical Mathematics*, Vol. 37, Springer Science & Business Media, 2010. We will come back to the cost of the QR algorithm later on, because reducing a matrix to Hessenberg form is a great advantage when computing the QR decomposition; more information regarding the QR algorithm can be found in Watkins, David S., “Understanding the QR algorithm,” *SIAM Review* 24.4 (1982): 427-440.

If $A$ is a Hermitian matrix, the Lanczos theorem tells us that we can reduce $A$ to tridiagonal form by unitary transformations.

**Theorem – Lanczos**

Let $A \in \mathbb{C}^{n \times n}$ be a Hermitian matrix; then there exists a unitary matrix $Q$ such that the following equation is verified:

$$Q^* A Q = T,$$

with $T$ being a tridiagonal matrix.

**Proof**

We will prove this by explicitly building the matrices $Q$ and $T$; in this way the reader can get an idea of how the Lanczos algorithm works. Let's choose a random vector $q_1$ of unit norm, and let's define the quantities $q_0 = 0$ and $\beta_0 = 0$. Then for all $k = 1, \dots, n$ we compute the sequences:

$$\alpha_k = q_k^* A q_k, \qquad v_k = A q_k - \beta_{k-1} q_{k-1} - \alpha_k q_k, \qquad \beta_k = \| v_k \|, \qquad q_{k+1} = \frac{v_k}{\beta_k};$$

*in case $v_k$ is the null vector we take $q_{k+1}$ as a random vector of unit norm. Now we define the following matrices:*

$$Q = \big( q_1 \,\big|\, q_2 \,\big|\, \dots \,\big|\, q_n \big), \qquad T = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{n-1} \\ & & \beta_{n-1} & \alpha_n \end{pmatrix}.$$

Carrying out the matrix multiplication, thanks to how we have defined $Q$ and $T$, we obtain the equation $Q^* A Q = T$.
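A minimal MATLAB sketch of the construction used in the proof (the three-term recurrence, for Hermitian $A$); variable and function names are illustrative:

```matlab
% Lanczos sketch: builds the orthonormal vectors q_k and the tridiagonal
% entries alpha (diagonal) and beta (off-diagonal) of T.
function [Q, alpha, beta] = lanczos_sketch(A, m)
    n = size(A, 1);
    Q = zeros(n, m);  alpha = zeros(m, 1);  beta = zeros(m, 1);
    q = randn(n, 1);  q = q / norm(q);       % q_1, random of unit norm
    qprev = zeros(n, 1);  b = 0;             % q_0 = 0, beta_0 = 0
    for k = 1:m
        Q(:, k) = q;
        v = A*q - b*qprev;                   % three-term recurrence
        alpha(k) = q' * v;
        v = v - alpha(k)*q;
        b = norm(v);  beta(k) = b;
        qprev = q;
        if b == 0, break, end                % breakdown: restart with a new vector
        q = v / b;
    end
end
```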

Further details on the Lanczos method can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997; Quarteroni, Alfio, Riccardo Sacco, and Fausto Saleri, *Numerical Mathematics*, Vol. 37, Springer Science & Business Media, 2010; and Golub, Gene H., and Dianne P. O'Leary, “Some history of the conjugate gradient and Lanczos algorithms: 1948–1976,” *SIAM Review* 31.1 (1989): 50-102.

Unfortunately, for a non-Hermitian matrix nothing tells us that we can represent the matrix in tridiagonal form; in fact, the best we can hope to achieve is representing the matrix in Hessenberg form. From what we have said so far we know that, if $A$ is Hermitian, to apply the power iteration we want to represent it in tridiagonal form; but what happens if $A$ is not Hermitian, or not even normal?

Well, if $A$ isn't Hermitian we cannot use the power iteration method; instead we use the QR algorithm. Briefly, the QR algorithm consists in studying the sequence:

$$A_0 = A, \qquad A_k = Q_k R_k, \qquad A_{k+1} = R_k Q_k,$$

with $A_k = Q_k R_k$ being the QR factorization of $A_k$. It is possible to prove that $A_k$ converges to an upper triangular matrix; therefore we have performed a Schur decomposition and we can find the eigenvalues of $A$ along the diagonal entries of the limit. Further details on the QR algorithm can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997; Quarteroni, Alfio, Riccardo Sacco, and Fausto Saleri, *Numerical Mathematics*, Vol. 37, Springer Science & Business Media, 2010; and Watkins, David S., “Understanding the QR algorithm,” *SIAM Review* 24.4 (1982): 427-440. In reality, what matters here is that if $A$ is expressed in Hessenberg form, the cost of one step of the QR algorithm drops from $O(n^3)$ to $O(n^2)$.
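For illustration, a rough MATLAB sketch of the unshifted QR iteration described above (practical implementations first reduce $A$ to Hessenberg form and use shifts):

```matlab
% Unshifted QR iteration sketch: the iterates are unitarily similar to A,
% and (under suitable assumptions) converge to an upper triangular matrix.
function Ak = qr_iteration(A, niter)
    Ak = A;
    for k = 1:niter
        [Q, R] = qr(Ak);     % QR factorization of the current iterate
        Ak = R * Q;          % reversed product: a similarity transformation
    end
end
```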

So far we have understood that, given a matrix $A$, if we want to find its greatest eigenvalue we want $A$ to be transformed by unitary transformations into a tridiagonal matrix if it is Hermitian, or to be represented in Hessenberg form otherwise. To be precise, we just want a method that represents $A$ in Hessenberg form by unitary transformations, because unitary transformations preserve the symmetry, and a Hessenberg matrix that is also Hermitian can only be tridiagonal. Such a method is exactly the Arnoldi iteration.

**Theorem – Arnoldi**

Let $A \in \mathbb{C}^{n \times n}$; then there exists a unitary matrix $Q$ such that the following equation is verified:

$$Q^* A Q = H,$$

with $H$ being a matrix expressed in Hessenberg form.

**Proof**

We will prove this result as well by direct construction of the decomposition shown in the previous equation, even if in very peculiar cases this construction might fail, as we can see in Embree, Mark, “The Arnoldi eigenvalue iteration with exact shifts can fail,” *SIAM Journal on Matrix Analysis and Applications* 31.1 (2009): 1-10. Such failures are due to computational problems and not to the linear algebra argument standing behind the algorithm. Let's start from a random vector $q_1$ with unit norm. We will then build a sequence of orthonormal vectors $q_1, q_2, \dots$ such that for every $k$ the following equation is verified:

$$A q_k = h_{1k}\, q_1 + h_{2k}\, q_2 + \dots + h_{kk}\, q_k + h_{k+1,k}\, q_{k+1}.$$

We will also define the entries of a matrix $H$ as:

$$h_{ik} = q_i^* A q_k, \qquad i \le k.$$

Now, by matrix multiplication, we can show that the matrices $Q$ and $H$ verify $A Q = Q H$, with $Q$ being the matrix defined as:

$$Q = \big( q_1 \,\big|\, q_2 \,\big|\, \dots \,\big|\, q_n \big);$$

By breaking down this equation we can easily obtain a method to perform the Arnoldi iteration.

**Algorithm – Arnoldi**

We can write such a method in pseudocode as follows:

```matlab
% Arnoldi iteration: build an orthonormal basis Q of the Krylov subspace and
% the (m+1) x m Hessenberg matrix H such that A*Q(:,1:m) = Q*H.
q1 = ones(n,1);  q1 = q1/norm(q1);       % starting vector of unit norm
Q = zeros(n, m+1);  Q(:,1) = q1;
H = zeros(m+1, m);
for k = 1:m
    z = A*Q(:,k);                        % new Krylov direction
    for i = 1:k                          % orthogonalize against the previous q_i
        H(i,k) = Q(:,i)'*z;
        z = z - H(i,k)*Q(:,i);
    end
    H(k+1,k) = norm(z);
    if H(k+1,k) == 0, return, end        % breakdown: an invariant subspace was found
    Q(:,k+1) = z/H(k+1,k);
end
```

If we compare the Arnoldi algorithm with the Gram–Schmidt algorithm for the QR factorization (the code can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997, and Quarteroni, Alfio, Riccardo Sacco, and Fausto Saleri, *Numerical Mathematics*, Vol. 37, Springer Science & Business Media, 2010), we immediately understand that the Arnoldi iteration is nothing more than the Gram–Schmidt method applied to the matrix:

$$K_m = \big( q_1 \,\big|\, A q_1 \,\big|\, A^2 q_1 \,\big|\, \dots \,\big|\, A^{m-1} q_1 \big);$$

this matrix is well known in numerical linear algebra and goes by the name of Krylov matrix of order $m$ associated with $A$ and $q_1$; we will often refer to such a matrix simply by writing $K_m$. More information regarding Krylov matrices can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997. For now we are only concerned with the fact that $H_m$ is the projection of the matrix $A$ onto the Krylov subspace $\mathcal{K}_m$, defined as the subspace spanned by the vectors $q_1, A q_1, \dots, A^{m-1} q_1$. What we have just said leads to the following result:

**Proposition**

Let $A \in \mathbb{C}^{n \times n}$; if $K_m$ is the Krylov matrix of order $m$ associated with $A$ and $q_1$, then the following equation is verified:

$$H_m = Q_m^* A Q_m,$$

with $H_m$ being the Hessenberg representation produced by the Arnoldi iteration, and with $K_m = Q_m R_m$ being the QR decomposition of $K_m$.

Now a very legitimate question arises: why was the Schur decomposition discarded, as too costly, as a method to produce a zero-filled representation of $A$, in favour of the Arnoldi iteration, which in the form presented so far still has an $O(n^3)$ cost? This question shines a light on the true beauty of the Arnoldi iteration: its iterative nature. What we mean when we speak about its iterative nature is the fact that if we stop the Arnoldi method after $m < n$ steps, before computing the vector $q_{m+1}$, we have built a Hessenberg matrix $H_m$ and a matrix $Q_m$ with orthonormal columns such that the following equation is verified:

$$Q_m^* A Q_m = H_m.$$

In the next section we will discuss why, to solve the problem presented in the Introduction, i.e. computing $\lambda_1$, it could be enough to study the matrix $H_m$ with $m \ll n$. Clearly, studying the matrix $H_m$ with $m \ll n$ means working with an approximation of $A$ whose zero-filled representation is much cheaper to obtain, since it only requires $m$ matrix–vector products and the corresponding orthogonalizations. To simplify the discussion, we will refer to the following as the Arnoldi decomposition of order $m$:

where $H_m$ and $Q_m$ are the same as in the previous equation. Further information regarding the Arnoldi iteration can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997; Watkins, David S., “Some perspectives on the eigenvalue problem,” *SIAM Review* 35.3 (1993): 430-471; and Saad, Yousef, *Numerical Methods for Large Eigenvalue Problems: Revised Edition*, Vol. 66, SIAM, 2011.

Let's consider a matrix $A$ with spectrum verifying the chain of inequalities $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$, exactly the same scenario as in the previous discussion. To compute the eigenvalues of $A$ we could perform an Arnoldi decomposition of order $n$ and then apply the QR algorithm; the question we are going to ask ourselves is what happens if, instead of performing an Arnoldi decomposition of order $n$, we perform an Arnoldi decomposition of order $m < n$ and then apply the QR algorithm.

The basic idea that permeates this section is the following: if we consider the Arnoldi decomposition of order $m$ shown before, we know that $H_m$ is the projection of $A$ onto the Krylov subspace of order $m$ associated with $A$ and $q_1$, therefore we expect $H_m$ to preserve some information regarding the spectral properties of $A$. Furthermore, considering the structure of the Krylov subspace and the fact that, when multiplying by $A$, all the components in the direction of the eigenvector associated with $\lambda_1$ get amplified more than the others, since $\lambda_1$ is the greatest eigenvalue, we expect spectral information regarding $\lambda_1$ in particular to resonate in $H_m$ more than the rest. We can verify numerically that this reasoning works by computing the eigenvalues of $A$ and $H_m$ and observing that the eigenvalue approximations are not too far off; furthermore, the approximations get closer and closer as $m$ approaches $n$.
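The following MATLAB snippet is an illustrative version of this numerical check (random symmetric test matrix, full re-orthogonalization): it runs the Arnoldi loop and compares the largest eigenvalue of $H_m$ with $\lambda_1$ for increasing $m$.

```matlab
% Compare the largest eigenvalue of H_m with lambda_1 as m grows.
n = 200;  A = randn(n);  A = (A + A')/2;     % random symmetric test matrix
lambda1 = max(eig(A));
q = randn(n, 1);  q = q / norm(q);
Q = q;  H = [];
for m = 1:40
    z = A*Q(:, m);
    h = Q'*z;  z = z - Q*h;                  % orthogonalize against previous q's
    H(1:m, m) = h;                           % grow the Hessenberg matrix
    if mod(m, 10) == 0
        fprintf('m = %3d   |max eig(H_m) - lambda_1| = %.2e\n', ...
                m, abs(max(eig(H(1:m, 1:m))) - lambda1));
    end
    hnext = norm(z);
    if hnext == 0, break, end
    H(m+1, m) = hnext;
    Q(:, m+1) = z / hnext;
end
```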

As explained before, the spectral information resonates particularly for the greater eigenvalues; therefore we would expect, even for $m \ll n$, that the matrix $H_m$ could still provide useful information regarding $\lambda_1$, as is clearly visible in the figure above, where even for small $m$ the greatest eigenvalue of $H_m$ is very close to $\lambda_1$. In this section we will present the standard explanation of why such a phenomenon occurs; in particular we present the explanation proposed in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997; Watkins, David S., “Some perspectives on the eigenvalue problem,” *SIAM Review* 35.3 (1993): 430-471; and Greenbaum, Anne, and Lloyd N. Trefethen, “GMRES/CR and Arnoldi/Lanczos as matrix approximation problems,” *SIAM Journal on Scientific Computing* 15.2 (1994): 359-368. But before presenting this idea we need to characterize the Arnoldi iteration further, and we will do so thanks to the next result:

**Proposition**

Let $A \in \mathbb{C}^{n \times n}$ and $q_1 \in \mathbb{C}^n$; if $K_m$ has full rank, then there is a monic polynomial $p^*$ of degree $m$ that minimizes $\| p(A) q_1 \|$, where $p$ ranges over the monic polynomials of degree $m$. Furthermore, $p^*$ is unique and is given by the characteristic polynomial of $H_m$.

**Proof**

Since $p$ is monic, the following equation holds for some $y \in \mathbb{C}^m$:

$$p(A)\, q_1 = A^m q_1 - Q_m R_m\, y,$$

where $Q_m R_m$ is the QR factorization of $K_m$.

Readers familiar with the least squares problem (details can be found in Trefethen, Lloyd N., and David Bau III, *Numerical Linear Algebra*, Vol. 50, SIAM, 1997; Watkins, David S., “Some perspectives on the eigenvalue problem,” *SIAM Review* 35.3 (1993): 430-471; and Quarteroni, Alfio, Riccardo Sacco, and Fausto Saleri, *Numerical Mathematics*, Vol. 37, Springer Science & Business Media, 2010) will easily understand that minimizing the norm in the previous equation is a least squares problem; therefore we know that the minimizer $p^*(A) q_1$ is orthogonal to the Krylov subspace $\mathcal{K}_m$. This means that $Q_m^*\, p^*(A) q_1 = 0$, which implies $p^*(H_m) = 0$, and so by the Cayley–Hamilton theorem we know that $p^*$ is the characteristic polynomial of $H_m$.

Now let's start by tackling the first part of the problem proposed before the last proposition, i.e. why do the eigenvalues of $H_m$ get closer and closer to the eigenvalues of $A$ as $m$ approaches $n$?

We know that $\| p^*(A) q_1 \|$ must be as small as possible; in particular, if $p$ is the characteristic polynomial of $A$ we have $p(A) q_1 = 0$, and that is as small as it gets. Therefore, as $m$ approaches $n$, $p^*$ gets closer and closer to the characteristic polynomial of $A$, so the eigenvalues of $H_m$ also become closer and closer to the eigenvalues of $A$. Later on we will try to take a different approach to formally prove that the spectrum of $H_m$ converges to the spectrum of $A$. The second part of the problem presented above is the convergence speed: in fact, even if we gave a reasonable idea of why the spectrum of $H_m$ converges to that of $A$, we haven't yet explained why, even for a small $m$, $p^*$ is a good approximation, in terms of roots, of the characteristic polynomial of $A$.

In particular, we observe numerically that, if we call $\mu_m$ the greatest eigenvalue of $H_m$, $\mu_m$ converges geometrically to $\lambda_1$; such a phenomenon can be observed in the figure above. To get an idea of why this phenomenon occurs we shall take into consideration the matrix:

The characteristic polynomial of this matrix is immediate to write down. We now consider a monic polynomial, with roots close to the eigenvalues of the matrix, as a candidate for the characteristic polynomial of $H_m$. If we take the candidate roots close enough to the true eigenvalues, we have that the following estimate holds:

Therefore the minimized quantity decreases geometrically, which means that the convergence is geometric. Clearly, the ideas presented in this section do not allow us to fully understand how the Arnoldi iteration locates the eigenvalues; indeed, the way the Arnoldi iteration locates the eigenvalues is still an open question.

**Theorem (Woodbury identity)**

Let's consider invertible matrices $A \in \mathbb{C}^{n \times n}$ and $C \in \mathbb{C}^{k \times k}$, and matrices $U \in \mathbb{C}^{n \times k}$, $V \in \mathbb{C}^{k \times n}$; then, whenever the inverses involved exist, we can write:

$$(A + U C V)^{-1} = A^{-1} - A^{-1} U \big( C^{-1} + V A^{-1} U \big)^{-1} V A^{-1}.$$

**Proof**

We can easily prove this result by just showing that the product of $A + UCV$ with the right-hand side yields the identity matrix.
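A quick numerical sanity check of the identity on small random matrices (sizes and shifts are illustrative, chosen only to keep everything comfortably invertible):

```matlab
% Verify the Woodbury identity numerically.
n = 6;  k = 2;
A = randn(n) + n*eye(n);                  % shift to make A comfortably invertible
U = randn(n, k);  V = randn(k, n);
C = randn(k) + k*eye(k);
lhs = inv(A + U*C*V);
rhs = inv(A) - inv(A)*U*inv(inv(C) + V*inv(A)*U)*V*inv(A);
norm(lhs - rhs)                           % should be near machine precision
```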

I'll now introduce what is called the Neumann series expansion; with this term we identify the following result: $(I - B)^{-1} = \sum_{k=0}^{\infty} B^k$. It is possible to prove that the series mentioned above converges, and therefore the above identity makes sense, if the operator norm of $B$ is strictly less than $1$. To prove what we mentioned above we usually use the following argument:

We first define the partial sums $S_N = \sum_{k=0}^{N} B^k$, and then we can easily prove that:

$$(I - B)\, S_N = I - B^{N+1} \longrightarrow I \quad \text{as } N \to \infty, \text{ since } \|B\| < 1.$$
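A quick numerical check of this argument (illustrative sizes; the matrix is rescaled so that its norm is below one):

```matlab
% Truncated Neumann series vs. the exact inverse of (I - B).
n = 5;
B = randn(n);  B = 0.5 * B / norm(B);     % enforce norm(B) < 1
S = zeros(n);  P = eye(n);
for k = 0:60
    S = S + P;                            % S = I + B + ... + B^k
    P = P * B;
end
norm(S - inv(eye(n) - B))                 % should be tiny
```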

**A Cute Proof of the Neumann Series Expansion**

Let's decompose the matrix using the Arnoldi iteration; once we have written this decomposition, we can easily write:

We then expand the remaining inverse using the Woodbury identity again:

substituting this in the above equation we obtain:

Therefore, once we apply the Woodbury identity recursively to expand the remaining term, we obtain the Neumann series expansion.

This seemed like a nice thing for my first post, so this is all from me today!
