*May, 2022 - François HU*

*Master of Science - EPITA*

*This lecture is available here: https://curiousml.github.io/*

**Generalities on optimization problems**

- Notion of critical point
- Necessary and sufficient condition of optimality

**Optimization in dimension 1**

- Golden section search
- Newton method

**Unconstrained optimization in dimension $n\geq2$**

Let $f:\mathbb{R}^n \to\mathbb{R}$ be a real-valued function of $n$ variables.

We know that a minimum $x^*$ verifies $\nabla f(x^*) = 0$. We can therefore try to solve the equation $\nabla f(x) = 0$ by Newton's method.

We can also approximate $f$ by $$ f(x+h) \approx f(x) + \nabla f(x)^T h + \dfrac{1}{2}h^TH_f(x)h $$ and minimize this quadratic approximation as a function of $h$.

In both cases, we obtain the iteration $$ x_{k+1} = x_k - H_f^{-1}(x_k)\nabla f(x_k) $$

We do not explicitly calculate the inverse of the Hessian. Instead, we solve the linear system $$ H_f(x_k)\,s_k = -\nabla f(x_k) $$ and set $$ x_{k+1} = x_k + s_k $$

The convergence of Newton's method is quadratic provided you start the iteration close enough to the result.

**Rmk: the calculation of the Hessian is expensive!** See Hessian-free alternatives such as quasi-Newton methods, in particular the BFGS method.
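As an illustration, here is a minimal NumPy sketch of this iteration (the callables `grad` and `hess`, returning the gradient vector and the Hessian matrix, are assumptions of this sketch):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method for solving grad f(x) = 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stop once the gradient is (almost) zero
            break
        # solve H_f(x_k) s_k = -grad f(x_k) instead of inverting the Hessian
        s = np.linalg.solve(hess(x), -g)
        x = x + s
    return x
```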

- We want to minimize the function: $$ f(x)=0.5x_1^2+2.5x_2^2 $$

- Compute the **gradient** and the **Hessian**

- We start with $x_0=\begin{bmatrix}5\\1\end{bmatrix}$, what is the value of $\nabla f(x_0)$?

- The linear system to be solved becomes: $$ H_f(x_0)s_0 = -\nabla f(x_0) $$ (see Algorithm Workshop 4 for solving linear systems)

- Compute the value $x_1$ (a numerical check is sketched below).
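Once the computations have been done by hand, they can be checked numerically; here is a sketch with NumPy (the gradient and Hessian below are those of the example function):

```python
import numpy as np

# f(x) = 0.5*x1**2 + 2.5*x2**2
grad = lambda x: np.array([x[0], 5.0 * x[1]])        # gradient of f
hess = lambda x: np.array([[1.0, 0.0],
                           [0.0, 5.0]])              # Hessian of f (constant)

x0 = np.array([5.0, 1.0])
print(grad(x0))                                      # [5. 5.]
s0 = np.linalg.solve(hess(x0), -grad(x0))            # Newton step s_0
x1 = x0 + s0
print(x1)                                            # [0. 0.]
```

Since $f$ is quadratic, the quadratic approximation is exact and Newton's method reaches the minimizer in a single step.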

- At each point $x$ where the gradient is non-zero, $-\nabla f(x)$ is locally the direction of steepest descent: $f$ decreases more rapidly in this direction than in any other.

**Gradient method:** we start from an initial point $x_0$ and calculate the successive iterates by $$ x_{k+1} = x_k - \alpha_k \nabla f(x_k) $$ where $\alpha_k$ is a parameter that determines the distance to travel in the direction $-\nabla f(x_k)$.

- this parameter $\alpha_k$ can be calibrated as the solution of the minimization problem: $$ \min\limits_{\alpha_k\geq 0} f(x_k - \alpha_k \nabla f(x_k)) $$ which can be solved with a one-dimensional optimization algorithm (see previous lesson); a sketch of the full method is given after this list

- as long as the gradient is non-zero, we decrease $f(x)$. Convergence is linear.
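A minimal sketch of the gradient method with an exact line search (here the one-dimensional problem is delegated to `scipy.optimize.minimize_scalar` over the bracket $[0, 1]$, an assumption made for this sketch; a golden section search from the previous lesson would work just as well):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    """Gradient method with an exact line search on the step size alpha_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                  # gradient (almost) zero: stop
            break
        # alpha_k solves min_{alpha >= 0} f(x_k - alpha * grad f(x_k))
        alpha = minimize_scalar(lambda a: f(x - a * g),
                                bounds=(0.0, 1.0), method="bounded").x
        x = x - alpha * g
    return x
```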

- We want to minimize the function: $$ f(x)=0.5x_1^2+2.5x_2^2 $$

- The **gradient** is given by $\nabla f(x)=\begin{bmatrix}x_1\\5x_2\end{bmatrix}$

- We start with $x_0=\begin{bmatrix}5\\1\end{bmatrix}$, we have $\nabla f(x_0) = \begin{bmatrix}5\\5\end{bmatrix}$

- Compute the values $\alpha_0$ and $x_1$ (a numerical check is sketched below).
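As a numerical check, the first step of the gradient method can be reproduced directly on the example (a sketch; `minimize_scalar` and the bracket $[0, 1]$ are choices made here for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: 0.5 * x[0] ** 2 + 2.5 * x[1] ** 2
grad = lambda x: np.array([x[0], 5.0 * x[1]])

x0 = np.array([5.0, 1.0])
g0 = grad(x0)                                        # [5, 5]
phi = lambda a: f(x0 - a * g0)                       # 1-D function of the step size
alpha0 = minimize_scalar(phi, bounds=(0.0, 1.0), method="bounded").x
x1 = x0 - alpha0 * g0
print(alpha0, x1)                                    # alpha0 close to 1/3, x1 close to [10/3, -2/3]
```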

**Principle**

- We want to approximate the gradient by finite difference: $$ (\nabla f(x))_i \approx \dfrac{f(x+t e_i)-f(x)}{t} $$ with small $t$ and $e_i$ the i-th standard basis vector.

- This gives only limited precision on the gradient (see the sketch below).
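A minimal sketch of this approximation (the step $t = 10^{-6}$ and the test function are illustrative choices):

```python
import numpy as np

def fd_gradient(f, x, t=1e-6):
    """Forward-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                                 # i-th standard basis vector
        g[i] = (f(x + t * e_i) - f(x)) / t
    return g

# check on f(x) = 0.5*x1**2 + 2.5*x2**2, whose exact gradient is [x1, 5*x2]
f = lambda x: 0.5 * x[0] ** 2 + 2.5 * x[1] ** 2
print(fd_gradient(f, np.array([5.0, 1.0])))          # close to [5, 5], but not exact
```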

**Cross-Entropy method**

- The Cross-Entropy method can be used to optimize an objective function $S$ with a sampling approach. The idea is to sample randomly (following a *probability distribution*) in a search space that we hope to reduce iteratively.

- Let us denote by $g^* = f(\cdot, \theta^*) \in \{f(\cdot, \theta), \theta\in\Theta\}$ the optimal sampling distribution, i.e. the one that samples the optimizer.

- Iteratively,
  - we sample $Y_1, \dots, Y_n \sim g_{\theta_t} = f(\cdot, \theta_t)$
  - we estimate $\theta_{t+1}$ (e.g. by MLE) on the "best" $T\%$ of the $Y_i$ (usually $10\%$)

- If $\theta_{t+1}$ converges (e.g. $|\theta_{t+1}-\theta_{t}|<\varepsilon$) then we stop the process

**Optimization, find:** $x^*\in\arg\max_{x} S(x)$

For $t = 0, 1, 2, \dots$:

- we sample $Y_1, \dots, Y_n \sim \mathcal{N}(\mu_t, \sigma_t^2)$
- we choose the best $10\%$ of the $Y_i$, i.e. those that maximize $S$
- we estimate (by MLE) $\mu_{t+1}$ and $\sigma_{t+1}^2$:
  - $\mu_{t+1} = $ empirical mean of the best $Y_i$
  - $\sigma_{t+1}^2 = $ empirical variance of the best $Y_i$

A sketch of this loop is given below.
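A minimal NumPy sketch of the Gaussian Cross-Entropy loop (the objective `S`, the sample size $n = 100$, and the initial parameters are illustrative choices; `S` is assumed to accept an array of samples):

```python
import numpy as np

def cross_entropy_maximize(S, mu0, sigma0, n=100, elite_frac=0.10,
                           eps=1e-6, max_iter=200):
    """Cross-Entropy method: maximize S on the real line with Gaussian sampling."""
    mu, sigma = mu0, sigma0
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        Y = rng.normal(mu, sigma, size=n)                    # Y_1, ..., Y_n ~ N(mu_t, sigma_t^2)
        elite = Y[np.argsort(S(Y))][-int(elite_frac * n):]   # best 10% w.r.t. S
        mu_new, sigma_new = elite.mean(), elite.std()        # MLE of the Gaussian parameters
        if abs(mu_new - mu) < eps:                           # stop when the parameter converges
            return mu_new
        mu, sigma = mu_new, sigma_new
    return mu

# illustrative objective: S(x) = -(x - 2)^2, maximized at x* = 2
S = lambda x: -(x - 2.0) ** 2
print(cross_entropy_maximize(S, mu0=0.0, sigma0=5.0))        # close to 2
```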