*May, 2022 - François HU*

*Master of Science - EPITA*

*This lecture is available here: https://curiousml.github.io/*

**Generalities on optimization problems**

- Notion of critical point
- Necessary and sufficient condition of optimality

**Unconstrained optimization in dimension $n=1$**

- Golden section search
- Newton's method

**Unconstrained optimization in dimension $n\geq 2$**

- Newton's method
- Gradient descent method
- Finite-difference method
- Cross-Entropy method

**Constrained optimization**

- Equality constraints
  - Lagrange multipliers
  - Sequential quadratic programming

- Inequality constraints
  - Lagrange duality
  - KKT conditions

- (optional) Application: SVM
  - Linear classification

A **constrained optimization** problem is written as:
$$ \min\limits_x f(x) \quad \text{subject to} \quad g(x) = 0 \text{ and } h(x) \leq 0 $$
with:

- $f: \mathbb{R}^n \to \mathbb{R}$ the function to minimize;
- $g: \mathbb{R}^n \to \mathbb{R}^m$ the **equality constraints**;
- and $h: \mathbb{R}^n \to \mathbb{R}^p$ the **inequality constraints**.

Later on, we will call this type of optimization problem the **primal problem**.
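As an illustrative sketch (not part of the original notes), this general form can be solved numerically with `scipy.optimize.minimize`. Note that SciPy's `"ineq"` convention is $h(x) \geq 0$, the opposite sign of the convention used here. The objective, constraints, and starting point below are hypothetical examples:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example:
# minimize f(x) = x1^2 + x2^2
# subject to g(x) = x1 + x2 - 1 = 0   (equality constraint)
#        and h(x) = -x1 <= 0          (inequality constraint, i.e. x1 >= 0)
f = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "eq", "fun": lambda x: x[0] + x[1] - 1},
    {"type": "ineq", "fun": lambda x: x[0]},  # SciPy expects fun(x) >= 0
]
res = minimize(f, x0=[2.0, 0.0], constraints=constraints)
print(res.x)  # close to [0.5, 0.5]
```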

An **optimization** problem with equality constraints is written as:
\begin{align*} \min \quad & f(x) \\ \text{subject to} \quad & g(x) = 0 \end{align*}
with $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^n \to \mathbb{R}^m$.

A **necessary condition** for a feasible point $x^*$ to be a solution is that
$$ \nabla f(x^*) = - J_g(x^*)^T\lambda $$
with $J_g$ the **Jacobian matrix** ($\neq$ the gradient!) of $g$:
$$ J_g(x) = \begin{bmatrix} \dfrac{\partial g}{\partial x_1} & \dots & \dfrac{\partial g}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T g_1\\ \vdots \\ \nabla^T g_m\\ \end{bmatrix} = \begin{bmatrix} \dfrac{\partial g_1}{\partial x_1} & \dots & \dfrac{\partial g_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \dfrac{\partial g_m}{\partial x_1} & \dots & \dfrac{\partial g_m}{\partial x_n}\\ \end{bmatrix} $$
and $\lambda\in\mathbb{R}^m$ the vector of **Lagrange multipliers** (named after the mathematician Joseph-Louis Lagrange, who introduced them in 1788).
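A minimal numerical check of this condition (an added sketch, not from the original notes), on the hypothetical problem $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$, whose minimizer $x^* = (0.5, 0.5)$ and multiplier $\lambda = -1$ are worked out by hand:

```python
import numpy as np

# Hypothetical example: f(x) = x1^2 + x2^2, g(x) = x1 + x2 - 1 (m = 1)
grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])
J_g = lambda x: np.array([[1.0, 1.0]])  # 1 x 2 Jacobian of g

x_star = np.array([0.5, 0.5])   # the constrained minimizer (by hand)
lam = np.array([-1.0])          # the associated Lagrange multiplier

# necessary condition: grad f(x*) = -J_g(x*)^T lambda, i.e. the sum vanishes
print(grad_f(x_star) + J_g(x_star).T @ lam)  # -> [0. 0.]
```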

- The **Lagrangian** $\mathcal{L}:\mathbb{R}^{n+m}\to \mathbb{R}$ is defined by
$$ \mathcal{L}(x, \lambda) = f(x) + \lambda^T g(x) = f(x) + \sum\limits_{i=1}^{m}\lambda_i g_i(x) $$

- Its **gradient** is given by
$$ \nabla \mathcal{L}(x, \lambda) = \begin{bmatrix} \nabla f(x) + J_g(x)^T\lambda \\ g(x) \end{bmatrix} \implies \text{a necessary condition: a critical point of the Lagrangian, } \nabla \mathcal{L}(x, \lambda) = 0 $$

- Its **Hessian** is given by
$$ H_\mathcal{L}(x, \lambda) = \begin{bmatrix} H_f(x) + \sum_{i=1}^{m}\lambda_i H_{g_i}(x) & J_g(x)^T\\ J_g(x) & 0 \end{bmatrix} = \begin{bmatrix} B(x, \lambda) & J_g(x)^T\\ J_g(x) & 0 \end{bmatrix} $$

- By applying Newton's method to the non-linear system
$$\nabla \mathcal{L}(x, \lambda) = \begin{bmatrix} \nabla f(x) + J_g(x)^T\lambda \\ g(x) \end{bmatrix} = 0 $$
we obtain, at each iteration, the linear system
$$ \begin{bmatrix} B(x, \lambda) & J_g(x)^T\\ J_g(x) & 0 \end{bmatrix} \begin{bmatrix} s\\ \delta\lambda \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + J_g(x)^T\lambda \\ g(x) \end{bmatrix} $$
whose solution gives the updates $x \leftarrow x + s$ and $\lambda \leftarrow \lambda + \delta\lambda$.

- (*Constrained optimization*) This approach is called **sequential quadratic programming**.

- (*Unconstrained optimization*) If the problem is unconstrained, the method reduces to **Newton's method**.
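The Newton step above can be sketched with NumPy on a hypothetical quadratic problem, $\min x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$ (an added example; since the constraint is linear, $H_{g_i} = 0$, so $B = H_f$ and a single step reaches the solution):

```python
import numpy as np

# Hypothetical example: min x1^2 + x2^2  s.t.  g(x) = x1 + x2 - 1 = 0
def newton_step(x, lam):
    B = np.array([[2.0, 0.0], [0.0, 2.0]])   # H_f(x); the H_{g_i} vanish here
    J = np.array([[1.0, 1.0]])               # J_g(x)
    grad_f = np.array([2 * x[0], 2 * x[1]])
    g = np.array([x[0] + x[1] - 1])
    # assemble and solve the KKT linear system for (s, delta_lambda)
    KKT = np.block([[B, J.T], [J, np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f + J.T @ lam, g])
    sol = np.linalg.solve(KKT, rhs)
    s, dlam = sol[:2], sol[2:]
    return x + s, lam + dlam

x, lam = newton_step(np.array([3.0, -1.0]), np.array([0.0]))
print(x, lam)  # -> [0.5 0.5] [-1.]
```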

An **optimization** problem with an *equality constraint* can easily be rewritten with *inequality constraints*:
$$ g(x) = 0 \iff (g(x) \geq 0 \quad\&\quad g(x) \leq 0) $$
Consider an equality constrained problem: \begin{align*} \min \quad & f(x) \\ \text{subject to} \quad & g(x) = 0 \end{align*} It can be written as \begin{align*} \min \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0\\ \quad & -g(x) \leq 0\\ \end{align*}

For simplicity, we only **consider inequality constraints** in what follows.

- An **optimization** problem with (inequality) constraints is written as: \begin{align*} \min \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0 \end{align*} with $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^n \to \mathbb{R}^m$.

- The **Lagrangian** for this optimization problem is $$ \mathcal{L}(x, \lambda) = f(x) + \lambda^T g(x) = f(x) + \sum\limits_{i=1}^{m}\lambda_i g_i(x) $$ with $\lambda_i \geq 0$ the **Lagrange multipliers**.

The (Lagrange) **dual function** associated to the constrained optimization problem is defined by
$$ q(\lambda) = \inf\limits_{x\in\mathbb{R}^n} \mathcal{L}(x, \lambda) $$
with $\lambda_i \geq 0$.

- We call the constrained optimization problem the **primal problem**:
\begin{align*} \min \quad & f(x) \\ \text{subject to} \quad & g(x) \leq 0 \end{align*}

- and we call the following optimization problem the associated **dual problem**:
$$ \max\limits_{\lambda \geq 0} \; \inf\limits_{x\in\mathbb{R}^n} \mathcal{L}(x, \lambda) $$
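A worked sketch (an added example, not from the original notes) on the hypothetical one-dimensional problem $\min x^2$ subject to $1 - x \leq 0$: here $p^* = 1$ at $x^* = 1$, the dual function can be computed in closed form, and maximizing it numerically recovers $d^* = p^*$:

```python
from scipy.optimize import minimize_scalar

# Hypothetical primal: min x^2  s.t.  g(x) = 1 - x <= 0   (p* = 1 at x* = 1)
# Lagrangian: L(x, lambda) = x^2 + lambda * (1 - x)
# Minimizing over x (at x = lambda/2) gives the dual function in closed form:
#   q(lambda) = lambda - lambda^2 / 4
q = lambda lam: lam - lam ** 2 / 4

# dual problem: max_{lambda >= 0} q(lambda); maximize by minimizing -q
res = minimize_scalar(lambda lam: -q(lam), bounds=(0, 10), method="bounded")
print(res.x, -res.fun)  # lambda* close to 2, d* close to 1 = p*
```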

- Note that this dual problem is **always convex**, since the dual function is concave (it is a pointwise infimum of functions that are linear w.r.t. $\lambda$).

If we denote $p^*$ the solution of the primal problem (a.k.a. **primal optimal**) and $d^*$ the solution of the dual problem (a.k.a. **dual optimal**), then:

- (*weak duality*) the inequality $d^* \leq p^*$ always holds;
- (*strong duality*) the equality $d^* = p^*$ does not hold in general. Strong duality does hold when a convex problem satisfies some **constraint qualifications** (to be defined later).

**Remark:** the Lagrange dual problem is often **easier to solve** (simpler constraint)!

- We call the problem
\begin{align*}
\min \quad & f(x) \\
\text{subject to} \quad & g(x) \leq 0
\end{align*}
a **convex optimization problem** if $f$ and the $g_i$ are convex functions.

For a convex optimization problem, we **usually** have strong duality, but not always.

**Slater's condition** (or Slater's constraint qualification): there exists an $x\in\mathbb{R}^n$ such that $g_i(x) < 0$ for all $i\in \{1, \dots, m\}$ (strict feasibility!).

Slater's condition is a **sufficient condition for strong duality** to hold for a convex optimization problem.

**Theorem (Karush-Kuhn-Tucker (KKT) conditions):** Assume that the *primal problem is convex* and that *Slater's constraint qualification holds*. We have **strong duality**, attained at a primal optimal $x^*$ and a dual optimal $\lambda^*$, if and only if all the following conditions hold:

- (*primal feasibility*) $x^*$ is feasible: $g(x^*) \leq 0$
- (*dual feasibility*) $\lambda^* \geq 0$
- (*complementary slackness*) $\lambda^{*T}g(x^*) = 0$, or equivalently $\lambda_i^*g_i(x^*) = 0$ for all $i\in\{1, \dots, m\}$
- (*stationarity*) $\nabla_x\mathcal{L}(x^*, \lambda^*) = 0$
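A minimal sketch (an added example) checking the four KKT conditions on a hypothetical convex problem, $\min x_1^2 + x_2^2$ subject to $1 - x_1 \leq 0$, whose solution $x^* = (1, 0)$ and multiplier $\lambda^* = 2$ are worked out by hand (Slater holds, e.g. at $x = (2, 0)$):

```python
import numpy as np

# Hypothetical example: f(x) = x1^2 + x2^2,  g(x) = 1 - x1 <= 0
x_star, lam_star = np.array([1.0, 0.0]), 2.0

g = 1 - x_star[0]  # constraint value at x*
# stationarity residual: grad f(x*) + lambda* * grad g(x*)
grad_L = np.array([2 * x_star[0], 2 * x_star[1]]) + lam_star * np.array([-1.0, 0.0])

print(g <= 0)                  # primal feasibility
print(lam_star >= 0)           # dual feasibility
print(lam_star * g == 0)       # complementary slackness
print(np.allclose(grad_L, 0))  # stationarity
```

All four checks print `True`, so strong duality holds for this problem.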
