
# Trust Region and Line-search Methods.

In continuous optimization, you basically have the choice between two kinds of algorithms:
• Line-search methods
• Trust region methods.
In this section we will motivate the choice of a trust region algorithm for unconstrained optimization. We will see that trust region algorithms are a natural evolution of the line-search algorithms.

## Conventions

• $f(x): \Re^n \rightarrow \Re$ is the objective function. We search for the minimum of it.
• $x^*$ is the optimum. We search for it.
• $g(x)$ is the gradient of $f(x)$.
• $H(x)$ is the Hessian matrix of $f(x)$.
• $H_k = H(x_k)$ is the Hessian matrix of $f$ at point $x_k$.
• $B_k$ is the current approximation of the Hessian matrix of $f$ at point $x_k$. If not stated explicitly, we will always assume $B_k = H_k$.
• $H^* = H(x^*)$ is the Hessian matrix at the optimum point.
• $Q(x)$ is the quadratic approximation of $f$ around $x$.

All vectors are column vectors.
In the following section we will also use the following convention.
• $k$ is the iteration index of the algorithm.
• $s_k$ is the search direction. Conceptually, it is only a direction, not a length.
• $\delta_k = x_{k+1} - x_k$ is the step performed at iteration $k$.
• $\alpha_k$ is the length of the step performed at iteration $k$: $\delta_k = \alpha_k s_k$.
• $e_k = x_k - x^*$; $\|e_k\|$ is the distance from the current point to the optimum.

## General principle

The outline of a simple optimization algorithm is:
1. Search for a descent direction $s_k$ around $x_k$ ($x_k$ is the current position).
2. In the direction $s_k$, search for the length $\alpha_k$ of the step.
3. Make a step in this direction: $x_{k+1} = x_k + \delta_k$ (with $\delta_k = \alpha_k s_k$).
4. Increment $k$. Stop if $g_k \approx 0$, otherwise go to step 1.
A simple algorithm based on this scheme is the "steepest descent" algorithm:

$$ s_k = -g_k = -\nabla f(x_k) \qquad (2.1) $$

We choose $\alpha_k = \arg\min_\alpha f(x_k + \alpha s_k)$ and therefore:

$$ x_{k+1} = x_k - \alpha_k g_k \qquad (2.2) $$

This is a very slow algorithm: it converges linearly (see next section about convergence speed).
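To make this concrete, here is a minimal sketch of the steepest descent algorithm in Python. The quadratic test function and the backtracking line-search parameters are illustrative choices of mine, not taken from the text:

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10_000):
    """Minimize f by following -grad(x), with a simple backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # stop when the gradient is near zero
            break
        s = -g                           # steepest descent direction (2.1)
        alpha = 1.0
        # Backtrack until we get a sufficient decrease of the objective.
        while f(x + alpha * s) > f(x) - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x = x + alpha * s
    return x

# Ill-conditioned quadratic: steepest descent converges, but only linearly.
A = np.diag([1.0, 50.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_min = steepest_descent(f, grad, np.array([1.0, 1.0]))
```

The ill-conditioned matrix `A` makes the slow, zig-zagging behaviour of the method visible: the number of iterations grows with the condition number.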

## Notion of Speed of convergence.

• linear convergence: $\lim_{k \rightarrow \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = a$ with $0 < a < 1$
• superlinear convergence: $\lim_{k \rightarrow \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$
• quadratic convergence: $\lim_{k \rightarrow \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} = a$ with $a > 0$

These convergence criteria are sorted by speed, the slowest first.
Reaching quadratic convergence speed is very difficult.
Superlinear convergence can also be written in the following way:

$$ \|x_{k+1} - x^*\| = o(\|x_k - x^*\|) \qquad (2.3) $$
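As a purely numerical illustration, the error sequences below (invented examples, not from the text) show how the ratio $\|x_{k+1}-x^*\| / \|x_k-x^*\|$ behaves in each regime:

```python
# Invented error sequences |x_k - x*| illustrating the three convergence regimes.
linear = [0.5 ** k for k in range(10)]                    # e_{k+1} = 0.5 * e_k
superlinear = [2.0 ** (-k * (k + 1) // 2) for k in range(10)]  # ratio = 2^-(k+1)
quadratic = [0.5 ** (2 ** k) for k in range(6)]           # e_{k+1} = (e_k)^2

ratio_lin = [linear[k + 1] / linear[k] for k in range(9)]        # constant 0.5
ratio_sup = [superlinear[k + 1] / superlinear[k] for k in range(9)]  # tends to 0
```

The linear sequence keeps a constant error ratio, the superlinear one drives the ratio to zero, and the quadratic one roughly doubles the number of correct digits at each step.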

## A simple line-search method: Newton's method.

We will use $B_k = H_k$: the exact Hessian matrix.
We will use the curvature information contained inside the Hessian matrix of $f(x)$ to find the descent direction. Let us write the Taylor expansion, limited to degree 2, of $f(x)$ around $x_k$:

$$ f(x_k + \delta) \approx Q(\delta) = f(x_k) + g_k^T \delta + \frac{1}{2} \delta^T H_k \delta $$

The unconstrained minimum of $Q(\delta)$ is attained when:

$$ \nabla Q(\delta_k) = g_k + H_k \delta_k = 0 \quad \Leftrightarrow \quad H_k \delta_k = -g_k \qquad (2.4) $$

Equation 2.4 is called the equation of the Newton's step $\delta_k$.
So, Newton's method is:
1. Solve $H_k \delta_k = -g_k$ (go to the minimum of the current quadratic approximation of $f(x)$).
2. Set $x_{k+1} = x_k + \delta_k$.
3. Increment $k$. Stop if $g_k \approx 0$, otherwise go to step 1.
In more complex line-search methods, we will run a one-dimensional search ($\min_\alpha f(x_k + \alpha s_k)$) in the direction $s_k$ to find a value of $\alpha_k$ which reduces the objective function "sufficiently" (see Section 13.1.2 for the conditions on $\alpha_k$: the Wolfe conditions. A sufficient reduction is obtained if you respect these conditions.)

Near the optimum, we must always take $\alpha_k = 1$, to allow a "full step of one" to occur: see Section 2.1.6 for more information.

Newton's method has quadratic convergence: see Section 13.1.1 for proof.
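A minimal sketch of this Newton iteration in Python. The test function $f(x) = x_1^2 + x_1^4 + x_2^2$ is an arbitrary choice for illustration (its minimum is at the origin and its Hessian is nonsingular there):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve H_k * delta_k = -g_k, then take the full step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stop when the gradient is near zero
            break
        delta = np.linalg.solve(hess(x), -g)  # Newton's step equation (2.4)
        x = x + delta                         # "full step of one" (alpha_k = 1)
    return x

# f(x) = x1^2 + x1^4 + x2^2
grad = lambda x: np.array([2 * x[0] + 4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[2 + 12 * x[0] ** 2, 0.0], [0.0, 2.0]])
x_min = newton(grad, hess, np.array([1.0, 1.0]))
```

On this function, the iteration reaches the optimum to machine precision in a handful of steps, which is the quadratic convergence claimed above.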

## $B_k$ must be positive definite for line-search methods.

In line-search methods, we want the search direction $s_k$ to be a descent direction:

$$ s_k^T g_k < 0 \qquad (2.5) $$

Taking the value of $g_k$ from Equation 2.4 (written with $B_k$: $B_k s_k = -g_k$) and putting it in Equation 2.5, we have:

$$ -s_k^T B_k s_k < 0 \quad \Leftrightarrow \quad s_k^T B_k s_k > 0 \qquad (2.6) $$

Equation 2.6 says that $B_k$ must always be positive definite. In line-search methods, we must always construct the matrix $B_k$ so that it is positive definite.

One possibility is to take $B_k = I$ (the identity matrix), which is a very bad approximation of the Hessian but which is always positive definite. We then retrieve the "steepest descent" algorithm.

Another possibility, if $B_k$ is not positive definite, is to use instead $B_k + \lambda I$, with $\lambda$ being a very big number, such that $B_k + \lambda I$ is positive definite. Then we solve, as usual, the Newton's step equation (see Equation 2.4): $(B_k + \lambda I)\, s_k = -g_k$. Choosing a high value for $\lambda$ has 2 effects:
1. $B_k$ will become negligible and we will find, as search direction, the "steepest descent" step.
2. The step size $\|s_k\|$ is reduced.
In fact, only the above second point is important. It can be proved that, if we impose a proper limitation on the step size $\|s_k\| \le \Delta_k$, we maintain global convergence even if $B_k$ is an indefinite matrix ($\Delta_k$ is called the trust region radius). Trust region algorithms are based on this principle.
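A small Python illustration of these two effects; the indefinite matrix $B$ and the gradient $g$ below are arbitrary examples of mine:

```python
import numpy as np

# An indefinite "Hessian" approximation and a gradient (arbitrary examples).
B = np.array([[1.0, 0.0], [0.0, -2.0]])   # eigenvalues 1 and -2: not positive definite
g = np.array([1.0, 1.0])

def shifted_step(lam):
    """Solve the shifted Newton's step equation (B + lambda*I) s = -g."""
    return np.linalg.solve(B + lam * np.eye(2), -g)

s_small = shifted_step(3.0)   # lambda just large enough for positive definiteness
s_large = shifted_step(1e6)   # very large lambda

# With a huge lambda, s ~ -g/lambda: steepest-descent direction, tiny step.
cos_large = (s_large @ -g) / (np.linalg.norm(s_large) * np.linalg.norm(g))
```

With $\lambda = 10^6$ the computed step is almost exactly aligned with $-g$ and is much shorter than the step obtained with the smaller shift, which is exactly the behaviour described above.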

The old Levenberg-Marquardt algorithm uses a technique which adapts the value of $\lambda$ during the optimization. If the iteration was successful, we decrease $\lambda$ to exploit more of the curvature information contained inside $B_k$. If the previous iteration was unsuccessful, the quadratic model does not fit the real function properly. We must then rely only on the "basic" gradient information: we will increase $\lambda$ in order to follow the gradient more closely (the "steepest descent" algorithm).

For intermediate values of $\lambda$, we will thus follow a direction which is a mixture of the "steepest descent" step and the "Newton" step. This direction is based on a perturbed Hessian matrix $B_k + \lambda I$ and can sometimes be disastrous (there is no geometrical meaning of the perturbation on $B_k$).

This old algorithm is the basis for the explanation of the update of the trust region radius in trust region algorithms. However, in trust region algorithms, the direction of the next point to evaluate is perfectly controlled.

To summarize:
• Line-search methods:
We search for the step $s_k$ with $B_k s_k = -g_k$ and we impose that $B_k$ be positive definite.
• Trust region methods:
The step $s_k$ is the solution of the following constrained optimization problem:

$$ \min_s \; Q(x_k + s) \quad \text{subject to} \quad \|s\| \le \Delta_k $$

$B_k$ can be any matrix. We can even have $B_k$ indefinite.
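For small dimensions, this constrained subproblem can be solved by searching for the value of $\lambda \ge 0$ such that $(B + \lambda I)\, s = -g$ and $\|s\| = \Delta$ (the connection between $\lambda$ and $\Delta$ described above). The sketch below is illustrative only: it uses plain bisection on $\lambda$ and ignores the degenerate so-called "hard case":

```python
import numpy as np

def trust_region_step(B, g, delta):
    """Sketch: solve min_s g@s + 0.5*s@B@s subject to ||s|| <= delta,
    by bisection on lambda in (B + lambda*I) s = -g. Ignores the "hard case"."""
    n = len(g)
    eigmin = float(np.linalg.eigvalsh(B)[0])
    if eigmin > 0:                         # B positive definite: try the Newton step
        s = np.linalg.solve(B, -g)
        if np.linalg.norm(s) <= delta:
            return s                       # unconstrained minimum is inside the region
    lo = max(0.0, -eigmin)                 # smallest shift making B + lambda*I PSD
    hi = lo + 1.0
    # Increase lambda until the step fits inside the trust region...
    while np.linalg.norm(np.linalg.solve(B + hi * np.eye(n), -g)) > delta:
        hi *= 2.0
    # ...then bisect on lambda so that ||s(lambda)|| = delta.
    while hi - lo > 1e-10:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(B + mid * np.eye(n), -g)) > delta:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(B + hi * np.eye(n), -g)

B = np.array([[2.0, 0.0], [0.0, -1.0]])   # indefinite: allowed in trust region methods
g = np.array([1.0, 1.0])
s = trust_region_step(B, g, delta=0.8)
```

Note that the indefinite $B$ poses no problem here: the constraint $\|s\| \le \Delta$ does the job that positive definiteness does in line-search methods.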

## Why is Newton's method crucial: Dennis-Moré theorem

The Dennis-Moré theorem is a very important theorem. It says that a non-linear optimization algorithm will converge superlinearly on the condition that, asymptotically, the steps made are equal to the Newton's steps: the steps satisfy $B_k s_k = -g_k$ ($B_k$ is not necessarily the Hessian matrix of $f$), and we must have $\lim_{k \rightarrow \infty} \frac{\|(B_k - H(x^*))\, s_k\|}{\|s_k\|} = 0$ (and $x_k \rightarrow x^*$). This is a very general result, applicable also to constrained optimization, non-smooth optimization, ...

Definition:
A function $H(x)$ is said to be Lipschitz continuous if there exists a constant $\gamma > 0$ such that:

$$ \|H(x) - H(y)\| \le \gamma \, \|x - y\| \qquad (2.7) $$

To make the proof, we first establish two lemmas.

### Lemma 1.

The well-known Riemann integral gives:

$$ f(b) - f(a) = \int_a^b f'(x) \, dx $$

The extension to the multivariate case is straightforward:

$$ g(x + s) - g(x) = \int_x^{x+s} H(z) \, dz $$

After a change in the integration variable, $z = x + t s$ with $t \in [0, 1]$, we obtain:

$$ g(x + s) - g(x) = \int_0^1 H(x + t s) \, s \, dt $$

We subtract $H(x)\,s$ on the left side and $\int_0^1 H(x)\, s \, dt$ on the right side:

$$ g(x + s) - g(x) - H(x)\, s = \int_0^1 \left[ H(x + t s) - H(x) \right] s \, dt \qquad (2.8) $$

Using the fact that $\left\| \int_0^1 v(t) \, dt \right\| \le \int_0^1 \|v(t)\| \, dt$, and the Cauchy-Schwarz inequality, we can write:

$$ \|g(x + s) - g(x) - H(x)\, s\| \le \int_0^1 \|H(x + t s) - H(x)\| \, \|s\| \, dt $$

Using the fact that the Hessian is Lipschitz continuous (see Equation 2.7), so that $\|H(x + t s) - H(x)\| \le \gamma \, t \, \|s\|$:

$$ \|g(x + s) - g(x) - H(x)\, s\| \le \int_0^1 \gamma \, t \, \|s\|^2 \, dt = \frac{\gamma}{2} \, \|s\|^2 \qquad (2.9) $$

Equation 2.9 is the result of Lemma 1.
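A quick numerical check of the bound in Equation 2.9, on a function whose Hessian has a known Lipschitz constant (the cubic test function is my choice, not from the text):

```python
import numpy as np

# f(x) = x1^3 + x2^3: H(x) = diag(6*x1, 6*x2) is Lipschitz continuous with gamma = 6.
grad = lambda x: 3.0 * x ** 2
hess = lambda x: np.diag(6.0 * x)
gamma = 6.0

rng = np.random.default_rng(0)
ratios = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0, 2)
    s = rng.uniform(-1.0, 1.0, 2)
    lhs = np.linalg.norm(grad(x + s) - grad(x) - hess(x) @ s)  # left side of (2.9)
    rhs = 0.5 * gamma * np.linalg.norm(s) ** 2                 # (gamma/2) * ||s||^2
    ratios.append(lhs / rhs)

worst = max(ratios)   # the bound (2.9) holds when worst <= 1
```

For this particular function the left side equals $3\sqrt{s_1^4 + s_2^4}$, so the bound $3\|s\|^2$ is respected for every random trial.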

### Lemma 2.

Lemma 2 states that, close enough to the optimum $x^*$ (where $g(x^*) = 0$ and $H(x^*)^{-1}$ exists), the norm of the gradient is bounded below: $\|g(x)\| \ge \alpha \, \|x - x^*\|$ for some $\alpha > 0$.
If we write the triangle inequality with $H(x^*)(x - x^*)$ and $g(x) - g(x^*) - H(x^*)(x - x^*)$, we obtain:

$$ \|g(x)\| \ge \|H(x^*)(x - x^*)\| - \|g(x) - g(x^*) - H(x^*)(x - x^*)\| $$

Using the Cauchy-Schwarz inequality with $H(x^*)^{-1}$ and $H(x^*)(x - x^*)$, and using Equation 2.9 (Lemma 1):

$$ \|g(x)\| \ge \frac{\|x - x^*\|}{\|H(x^*)^{-1}\|} - \frac{\gamma}{2} \, \|x - x^*\|^2 $$

Using the hypothesis that $\|H(x^*)^{-1}\| \le \mu$:

$$ \|g(x)\| \ge \frac{1}{\mu} \, \|x - x^*\| - \frac{\gamma}{2} \, \|x - x^*\|^2 $$

Thus, if $\|x - x^*\| \le \frac{1}{\gamma \mu}$, the lemma is proven with $\alpha = \frac{1}{2 \mu}$.

### Proof of the Dennis-Moré theorem.

We first write the "step" equation $B_k s_k = -g_k$:

$$ 0 = g_k + B_k s_k = g_k + H(x^*) \, s_k + (B_k - H(x^*)) \, s_k $$

Hence $g_{k+1} = \left[ g_{k+1} - g_k - H(x^*) \, s_k \right] - (B_k - H(x^*)) \, s_k$. Defining $v_k = \frac{\|(B_k - H(x^*)) \, s_k\|}{\|s_k\|}$, we obtain:

$$ \|g_{k+1}\| \le \|g_{k+1} - g_k - H(x^*) \, s_k\| + v_k \, \|s_k\| $$

Using Equation 2.8 and the Lipschitz continuity of $H$ around $x^*$, which give $\|g_{k+1} - g_k - H(x^*) \, s_k\| \le \frac{\gamma}{2} (\|e_k\| + \|e_{k+1}\|) \, \|s_k\|$:

$$ \|g_{k+1}\| \le \left[ \frac{\gamma}{2} (\|e_k\| + \|e_{k+1}\|) + v_k \right] \|s_k\| \qquad (2.10) $$

Using Lemma 2, there exists $k_0$ such that, for all $k > k_0$, we have (using $\alpha = \frac{1}{2\mu}$):

$$ \|g_{k+1}\| \ge \alpha \, \|e_{k+1}\| \qquad (2.11) $$

Combining Equations 2.10 and 2.11, and using $\|s_k\| \le \|e_k\| + \|e_{k+1}\|$:

$$ \alpha \, \|e_{k+1}\| \le \epsilon_k \, (\|e_k\| + \|e_{k+1}\|), \qquad \lim_{k \rightarrow \infty} \epsilon_k = 0 \qquad (2.12) $$

where we have defined $\epsilon_k = \frac{\gamma}{2} (\|e_k\| + \|e_{k+1}\|) + v_k$ ($\epsilon_k \rightarrow 0$ because $x_k \rightarrow x^*$ and $v_k \rightarrow 0$). This implies that:

$$ \lim_{k \rightarrow \infty} \frac{\|e_{k+1}\|}{\|e_k\|} \le \lim_{k \rightarrow \infty} \frac{\epsilon_k}{\alpha - \epsilon_k} = 0 \qquad (2.13) $$

which completes the proof of superlinear convergence.
Since $g$ is Lipschitz continuous, it is easy to show that the Dennis-Moré theorem remains true if the condition $\lim_{k \rightarrow \infty} \frac{\|(B_k - H(x^*)) \, s_k\|}{\|s_k\|} = 0$ is replaced by:

$$ \lim_{k \rightarrow \infty} \frac{\|(B_k - H(x_k)) \, s_k\|}{\|s_k\|} = 0 \qquad (2.14) $$

This means that, if $\delta_k^N = -H(x_k)^{-1} g_k$ is the Newton step at $x_k$, then the steps $\delta_k = \alpha_k s_k$ must satisfy $\lim_{k \rightarrow \infty} \frac{\|\delta_k - \delta_k^N\|}{\|\delta_k\|} = 0$ to have superlinear convergence. In other words, to have superlinear convergence, the "steps" of a secant method must converge in magnitude and direction to the Newton steps (see Equation 2.4) of the same points. A step with $\alpha_k = 1$ is called a "full step of one". It is necessary to allow a "full step of one" to take place when we are near the optimum to have superlinear convergence. The Wolfe conditions (see Equations 13.4 and 13.5 in Section 13.1.2) always allow a "full step of one".
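A one-dimensional Python illustration of this point; the function $g(x) = e^x - 1$, with its zero at $x^* = 0$, is my own example. Artificially capping the step length at $\alpha_k = 0.5$ destroys superlinear convergence, while the "full step of one" preserves it:

```python
import numpy as np

def damped_newton_errors(alpha, x0=0.5, n=6):
    """Damped Newton iteration on g(x) = exp(x) - 1 (zero of the gradient at x* = 0).
    Returns the sequence of errors |x_k - x*| = |x_k|."""
    errs, x = [], x0
    for _ in range(n):
        errs.append(abs(x))
        x = x - alpha * (np.exp(x) - 1.0) / np.exp(x)  # step = alpha * (Newton step)
    return errs

full = damped_newton_errors(1.0)   # "full step of one"
half = damped_newton_errors(0.5)   # step length artificially capped

ratio_full = full[4] / full[3]     # tends to 0: superlinear convergence
ratio_half = half[4] / half[3]     # stays near 0.5: only linear convergence
```

With $\alpha_k = 1$ the error ratio collapses toward zero within a few iterations; with $\alpha_k = 0.5$ it settles near $0.5$, the signature of merely linear convergence.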

When we deal with constraints, it is sometimes not possible to take a "full step of one" because we "bump" into the frontier of the feasible space. In such cases, algorithms like FSQP will try to "bend" or "modify" slightly the search direction to allow a "full step of one" to occur.

This is also why the "trust region radius" must be large near the optimum, to allow a "full step of one" to occur.

Frank Vanden Berghen 2004-04-19