Convexity and smoothness are two of the most used assumptions in the convergence analysis of optimization algorithms. In this post, I would like to present some useful results on these two important properties.
A function $f: \mathbb{R}^d \rightarrow \mathbb{R} \cup \{ +\infty \}$ is convex if:
\[\forall x, y \in \mathbb{R}^d, \quad \forall t \in [0, 1], \quad f(tx + (1 - t)y) \leq tf(x) + (1 - t)f(y) \tag{1}\]The next lemmas characterize the convexity of a function with the help of first and second-order derivatives.
If $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and differentiable then:
\[\forall x, y \in \mathbb{R}^d, \quad f(x) \geq f(y) + \langle\nabla f(y), x - y\rangle \tag{2}\]We can deduce Lemma-2 from Definition-1: writing $tx + (1-t)y = y + t(x-y)$ in $(1)$, subtracting $f(y)$ from both sides, and dividing by $t$ gives:
\[\frac{f(y + t(x - y)) - f(y)}{t} \leq f(x) - f(y)\]Taking the limit as $t \rightarrow 0$ and using the definition of the directional derivative gives:
\[\langle\nabla f(y), x - y\rangle \leq f(x) - f(y)\]Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a differentiable convex function. Then we have:
\[\langle \nabla f(x) - \nabla f(y), x-y \rangle \geq 0 \tag{3}\]The proof consists in using Lemma-2 twice (permuting the roles of $x$ and $y$); adding the two resulting inequalities gives the lemma.
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be convex and twice differentiable. Then, for all $x \in \mathbb{R}^d$, for every eigenvalue $\lambda$ of $\nabla^2f(x)$, we have $\lambda \geq 0$.
Using the definition of the derivative, for all $v \in \mathbb{R}^d$, we have:
\[\begin{align*} \langle \nabla^2 f(x)v, v \rangle &= \langle \underset{t \rightarrow 0}{\lim} \frac{\nabla f(x + tv) - \nabla f(x)}{t}, v \rangle \\ &= \underset{t \rightarrow 0}{\lim} \frac{1}{t^2}\langle \nabla f(x + tv) - \nabla f(x), (x + tv) - x \rangle \geq 0 \tag{4} \end{align*}\]The first equality comes from the definition of the derivative, and the inequality comes from Lemma-3. Now, for an eigenvalue $\lambda$ of $\nabla^2 f(x)$, we can take any nonzero eigenvector $v$ and write:
\[\lambda ||v||^2 = \langle \lambda v, v \rangle = \langle \nabla^2 f(x)v, v \rangle \geq 0 \tag{5}\]which concludes the proof.
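As a quick numerical sanity check (not a proof, and not part of the original argument), the snippet below verifies Definition-1, Lemma-2 and the positive semi-definiteness of the Hessian on random points for the log-sum-exp function. It assumes NumPy, and the helper names `f`, `grad_f`, `hess_f` are my own.

```python
import numpy as np

def f(x):
    # log-sum-exp: a classic smooth convex function
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    p = np.exp(x - np.max(x))   # stabilized softmax = gradient of log-sum-exp
    return p / p.sum()

def hess_f(x):
    p = grad_f(x)
    return np.diag(p) - np.outer(p, p)   # Hessian of log-sum-exp

rng = np.random.default_rng(0)
d = 5
for _ in range(100):
    x, y, t = rng.normal(size=d), rng.normal(size=d), rng.uniform()
    # Definition-1, inequality (1)
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
    # Lemma-2, inequality (2)
    assert f(x) >= f(y) + grad_f(y) @ (x - y) - 1e-12
    # Lemma-4: every eigenvalue of the Hessian is >= 0
    assert np.linalg.eigvalsh(hess_f(x)).min() >= -1e-12
print("all convexity checks passed")
```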
Let $f: \mathbb{R}^d \rightarrow \mathbb{R} \cup \{ +\infty \}$, and $\mu > 0$. We say that $f$ is $\mu$-strongly convex if, for every $x, y \in \mathbb{R}^d$, and every $t \in [0, 1]$, we have that:
\[\mu \frac{t(1 - t)}{2}|| x - y ||^2 + f(tx + (1 - t)y) \leq tf(x) + (1 - t)f(y) \tag{6}\]$\mu$ is then the strong convexity constant of $f$.
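For example, $f(x) = \frac{1}{2}\| x \|^2$ is $1$-strongly convex: expanding the squared norms gives the identity
\[t \cdot \frac{1}{2}\|x\|^2 + (1 - t) \cdot \frac{1}{2}\|y\|^2 - \frac{1}{2}\|tx + (1 - t)y\|^2 = \frac{t(1-t)}{2}\| x-y \|^2,\]so $(6)$ holds with $\mu = 1$, and even with equality.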
Next we can show that it is easy to build a strongly convex function: just add a quadratic term $\frac{\mu}{2}\| \cdot \|^2$ to a convex function. This technique is called Tikhonov regularization (ridge regularization), and it is used a lot in machine learning and inverse problems.
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ and $\mu > 0$. Then $f$ is $\mu$-strongly convex if and only if there exists a convex function $g: \mathbb{R}^d \rightarrow \mathbb{R}$ such that $f(x) = g(x) + \frac{\mu}{2}||x||^2$.
Starting from the definition of strong convexity applied to $f$:
\[\begin{align*} & f(tx + (1 - t)y) + \frac{\mu}{2}t(1-t)\|x-y\|^2 \leq tf(x) + (1-t)f(y) \\ & \Leftrightarrow \frac{\mu}{2}t(1-t)\|x-y\|^2 + g(tx + (1-t)y) + \frac{\mu}{2}\|tx + (1-t)y\|^2 \\ & \qquad \leq tg(x) + \frac{\mu}{2}t\| x \|^2 + (1-t)g(y) + \frac{\mu}{2}(1-t)\| y \|^2 \\ & \Leftrightarrow g(tx + (1-t)y) \leq tg(x) + (1-t)g(y) \\ & \Leftrightarrow \textrm{$g$ is convex} \end{align*}\]The second equivalence comes from grouping all the terms multiplied by $\mu$ and using the identity $t\|x\|^2 + (1-t)\|y\|^2 - \|tx + (1-t)y\|^2 = t(1-t)\|x-y\|^2$.
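To make the construction concrete, here is a minimal sketch (assuming NumPy; the data `A`, `b` and all the names are mine) that builds a strongly convex objective by adding the Tikhonov term to a convex least-squares loss, and checks inequality $(6)$ on random points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 20, 5, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def g(x):
    # convex least-squares loss
    return 0.5 * np.sum((A @ x - b) ** 2)

def f(x):
    # Tikhonov / ridge regularization: f = g + (mu / 2) * ||x||^2
    return g(x) + 0.5 * mu * np.sum(x ** 2)

# numerical check of the strong convexity inequality (6)
for _ in range(100):
    x, y, t = rng.normal(size=d), rng.normal(size=d), rng.uniform()
    lhs = 0.5 * mu * t * (1 - t) * np.sum((x - y) ** 2) + f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y)
    assert lhs <= rhs + 1e-9
print("f = g + (mu/2)||.||^2 passed the strong convexity checks")
```

Note that when $A$ does not have full column rank, $g$ alone is not strongly convex, which is precisely the situation where the regularization term helps.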
This regularization technique is often used in machine learning and optimization because strongly convex functions have a nice property: they admit a unique minimizer.
If $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a continuous strongly convex function, then $f$ admits a unique minimizer.
The existence of a minimizer follows from the fact that $f$ is continuous and coercive (strong convexity implies coercivity). For uniqueness, suppose that $f$ admits two minimizers $x_1 \neq x_2$. Since $f$ is strongly convex, we then have:
\[f\left(\frac{x_1+x_2}{2}\right) < \frac{f(x_1) + f(x_2)}{2} \leq f(x_1),\]where we used $f(x_2) \leq f(x_1)$ since $x_2$ is also a minimizer of $f$. This contradicts the fact that $x_1$ is a minimizer of $f$. Hence $f$ admits a unique minimizer.
Now we will see some useful variational inequalities satisfied by strongly convex functions.
If $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is differentiable and $\mu$-strongly convex then:
\[\forall x, y \in \mathbb{R}^d, \quad f(y) \geq f(x) + \langle \nabla f(x), y-x \rangle + \frac{\mu}{2}\| x-y \|^2 \tag{7}\]Define $g(x) = f(x) - \frac{\mu}{2}\| x \|^2$. According to Lemma-5, $g$ is convex, and it is clearly differentiable. Then, applying Lemma-2 to $g$:
\[\begin{align*} & g(y) \geq g(x) + \langle \nabla g(x), y-x \rangle \\ & \Leftrightarrow f(y) - \frac{\mu}{2}\| y \|^2 \geq f(x) - \frac{\mu}{2}\| x \|^2 + \langle \nabla f(x) - \mu x, y-x \rangle \\ & \Leftrightarrow f(y) \geq f(x) + \langle \nabla f(x), y-x \rangle + \frac{\mu}{2}\| x-y \|^2 \end{align*}\]Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a twice differentiable $\mu$-strongly convex function. Then, for all $x \in \mathbb{R}^d$, for every eigenvalue $\lambda$ of $\nabla^2 f(x)$, we have $\lambda \geq \mu$.
Define $g(x) := f(x) - \frac{\mu}{2}\| x \|^2$. Then according to Lemma-5, $g$ is convex. $g$ is also twice differentiable, by definition. Since we have $\nabla^2 f(x) = \nabla^2 g(x) + \mu \textrm{Id}$, the eigenvalues of $\nabla^2 f(x)$ are equal to the eigenvalues of $\nabla^2 g(x)$ plus $\mu$. Lemma-4 gives us the conclusion.
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a differentiable $\mu$-strongly convex function. Then we have:
\[\langle\nabla f(x) - \nabla f(y), x - y\rangle \geq \mu \| x-y \|^2 \tag{8}\]Once again we use Lemma-5 to write $g(x) := f(x) - \frac{\mu}{2}\| x \|^2$ as a differentiable convex function. The proof then simply follows from the monotonicity of the gradient of $g$, as in Lemma-3.
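The three results above can be checked numerically on a strongly convex quadratic. This is only an illustration under my own choice of $Q$ and $\mu$ (assuming NumPy), not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
d, mu = 5, 0.1
M = rng.normal(size=(d, d))
Q = M.T @ M + mu * np.eye(d)        # symmetric matrix with eigenvalues >= mu

f = lambda x: 0.5 * x @ Q @ x       # mu-strongly convex quadratic
grad_f = lambda x: Q @ x

# every eigenvalue of the (constant) Hessian Q is >= mu
assert np.linalg.eigvalsh(Q).min() >= mu - 1e-12

for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    # quadratic lower bound (7)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) + 0.5 * mu * np.sum((x - y) ** 2) - 1e-9
    # strong monotonicity of the gradient (8)
    assert (grad_f(x) - grad_f(y)) @ (x - y) >= mu * np.sum((x - y) ** 2) - 1e-9
print("strong convexity inequalities verified on random points")
```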
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$, and $L > 0$. $f$ is $L$-smooth if it is differentiable and if $\nabla f$ is $L$-Lipschitz:
\[\forall x, y \in \mathbb{R}^d, \quad \| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \| \tag{9}\]As with (strong) convexity, smoothness can also be characterized by means of first and second-order derivatives.
If $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is $L$-smooth then:
\[\forall x, y \in \mathbb{R}^d, \quad f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2} \| y-x \|^2 \tag{10}\]Let $\phi(t) := f(x + t(y-x))$. Using the fundamental theorem of Calculus on $\phi$, we have:
\[\begin{align*} \phi(1) - \phi(0) &= f(y) - f(x) = \int_{t=0}^{t=1} \phi'(t)\, dt \\ &= \int_{t=0}^{t=1} \langle \nabla f(x+t(y-x)), y-x \rangle dt \\ &= \langle \nabla f(x), y-x \rangle + \int_{t=0}^{t=1} \langle \nabla f(x+t(y-x)) - \nabla f(x), y-x \rangle dt \\ & \leq \langle \nabla f(x), y-x \rangle + \int_{t=0}^{t=1} \| \nabla f(x+t(y-x)) - \nabla f(x) \| \| y-x \| dt \\ & \leq \langle \nabla f(x), y-x \rangle + \int_{t=0}^{t=1} Lt\| y-x \|^2 dt \\ &= \langle \nabla f(x), y-x \rangle + \frac{L}{2} \| y-x \|^2 \end{align*}\]The first inequality in the proof uses the Cauchy-Schwarz inequality, and the second one uses Definition-10.
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ a twice differentiable and $L$-smooth function. Then, for all $x \in \mathbb{R}^d$, for every eigenvalue $\lambda$ of $\nabla^2 f(x)$, we have $|\lambda| \leq L$.
We admit this result without proof.
Remarks: From Lemma-7 and Lemma-11, we can see that if a function is $L$-smooth and $\mu$-strongly convex, then $\mu \leq L$. We usually define the condition number $\kappa = \frac{L}{\mu}$.
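As a concrete example, for a quadratic $f(x) = \frac{1}{2} x^\top Q x$ with a symmetric positive definite matrix $Q$, the gradient $\nabla f(x) = Qx$ is Lipschitz with $L = \lambda_{\max}(Q)$, $f$ is $\mu$-strongly convex with $\mu = \lambda_{\min}(Q)$, and the condition number is $\kappa = \lambda_{\max}(Q)/\lambda_{\min}(Q)$. The sketch below (my own construction, assuming NumPy) computes these constants and checks $(9)$ and $(10)$ on random points.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
M = rng.normal(size=(d, d))
Q = M.T @ M + 0.1 * np.eye(d)       # symmetric positive definite Hessian

eigs = np.linalg.eigvalsh(Q)
L, mu = eigs.max(), eigs.min()      # smoothness and strong convexity constants
print("L =", L, "mu =", mu, "kappa =", L / mu)

f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    # (9): the gradient is L-Lipschitz
    assert np.linalg.norm(grad_f(x) - grad_f(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # (10): quadratic upper bound
    assert f(y) <= f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2) + 1e-9
```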
We now present the descent lemma, which is used quite a lot in the convergence analysis of optimization algorithms.
If $f$ is $L$-smooth and $\lambda > 0$, then:
\[\forall x, y \in \mathbb{R}^d, \quad f(x-\lambda \nabla f(x)) - f(x) \leq -\lambda \left(1 - \frac{\lambda L}{2}\right)\| \nabla f(x) \|^2 \tag{11}\]If moreover, $f$ is bounded from below, then:
\[\forall x \in \mathbb{R}^d, \frac{1}{2L} \| \nabla f(x) \|^2 \leq f(x) - \inf f \tag{12}\]For the first inequality, using Lemma-11 with $y = x-\lambda \nabla f(x)$, we have:
\[\begin{align*} f(x - \lambda \nabla f(x)) &\leq f(x) - \lambda \langle \nabla f(x), \nabla f(x) \rangle + \frac{L}{2} \lambda^2 \| \nabla f(x) \|^2 \\ &= f(x) - \lambda (1 - \lambda L / 2)\| \nabla f(x) \|^2 \end{align*}\]The second inequality follows by taking $\lambda = 1/L$, which gives $f(x - \frac{1}{L} \nabla f(x)) - f(x) \leq -\frac{1}{2L}\| \nabla f(x) \|^2$, and by noting that $\inf f \leq f(x - \frac{1}{L} \nabla f(x))$.
Remarks: By looking at the first inequality, we can see that if $\lambda \leq 2/L$, then $f(x-\lambda \nabla f(x)) \leq f(x)$: a gradient step with a small enough step-size cannot increase the objective. This is the reason why optimization algorithms often impose a condition such as $\gamma \leq 1/L$ on the step-size $\gamma$.
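To illustrate how the descent lemma is used in practice, here is a minimal gradient descent sketch on a least-squares objective with step-size $1/L$ (the setup and names are mine, assuming NumPy); each iteration decreases the objective by at least $\frac{1}{2L}\|\nabla f(x)\|^2$, as guaranteed by $(11)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)    # L-smooth (and convex) objective
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)                   # Lipschitz constant of the gradient

x = rng.normal(size=d)
for _ in range(50):
    g = grad_f(x)
    x_next = x - (1.0 / L) * g                   # gradient descent step with lambda = 1/L
    # descent lemma (11) with lambda = 1/L: decrease of at least ||grad||^2 / (2L)
    assert f(x_next) <= f(x) - 0.5 / L * (g @ g) + 1e-9
    x = x_next
print("final objective value:", f(x))
```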
In optimization, there are many cases where the objective function is assumed to be both smooth and convex. Such functions enjoy stronger properties, which are presented in the lemma below.
If $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and $L$-smooth, then for all $x, y \in \mathbb{R}^d$, we have that:
\[\begin{align*} \frac{1}{2L} \| \nabla f(x) - \nabla f(y) \|^2 & \leq f(y) - f(x) - \langle \nabla f(x), y-x \rangle \tag{13} \\ \frac{1}{L} \| \nabla f(x) - \nabla f(y) \|^2 & \leq \langle \nabla f(y) - \nabla f(x), y-x \rangle \quad \textrm{(Co-coercivity)} \tag{14} \end{align*}\]To prove $(13)$, we start by writing, for an arbitrary $z \in \mathbb{R}^d$:
\[f(x) - f(y) = f(x) - f(z) + f(z) - f(y)\]Using the convexity of $f$ on $f(x) - f(z)$, and the smoothness of $f$ on $f(z) - f(y)$, we have:
\[f(x) - f(y) \leq \langle \nabla f(x), x-z \rangle + \langle \nabla f(y), z-y \rangle + \frac{L}{2}\| y-z \|^2\]The right-hand side is clearly a convex function of $z$. Hence, in order to minimize it with respect to $z$, we can set its gradient to zero to obtain:
\[z = y - \frac{1}{L} (\nabla f(y) - \nabla f(x))\]Substituting this $z$ into the inequality above gives $f(x) - f(y) \leq \langle \nabla f(x), x-y \rangle - \frac{1}{2L} \| \nabla f(x) - \nabla f(y) \|^2$, which is $(13)$ after rearranging the terms.
To obtain $(14)$, we simply apply $(13)$ twice by interchanging the roles of $x$ and $y$, and then adding the two resulting inequalities gives the desired result.
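As with the previous lemmas, $(13)$ and $(14)$ can be checked numerically on a convex and $L$-smooth function, here a least-squares loss (a sketch with my own setup, assuming NumPy; a check on random points, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)    # convex and L-smooth
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)

for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    gx, gy = grad_f(x), grad_f(y)
    # inequality (13)
    assert 0.5 / L * np.sum((gx - gy) ** 2) <= f(y) - f(x) - gx @ (y - x) + 1e-9
    # co-coercivity (14)
    assert 1.0 / L * np.sum((gx - gy) ** 2) <= (gy - gx) @ (y - x) + 1e-9
print("co-coercivity checks passed")
```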
In some problems where the objective function is non-convex, it is sometimes useful to assume that the function satisfies the Polyak-Lojasiewicz inequality, which is defined below:
Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a differentiable function, and $\mu > 0$. We say that $f$ is $\mu$-Polyak-Lojasiewicz if it is bounded from below and if, for all $x \in \mathbb{R}^d$,
\[f(x) - \inf f \leq \frac{1}{2\mu} \| \nabla f(x) \|^2\]Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a differentiable and $\mu$-strongly convex function. Then $f$ is $\mu$-Polyak-Lojasiewicz.
Let $x^*$ be a minimizer of $f$. By using the strong convexity of $f$, we have:
\[\begin{align*} f(x) - f(x^*) & \leq \langle \nabla f(x), x-x^* \rangle - \frac{\mu}{2} \| x-x^* \|^2 \\ &= -\frac{1}{2} \left\| \sqrt{\mu}(x - x^*) - \frac{1}{\sqrt{\mu}} \nabla f(x) \right\|^2 + \frac{1}{2 \mu} \| \nabla f(x) \|^2 \\ & \leq \frac{1}{2\mu} \| \nabla f(x) \|^2 \end{align*}\]The Polyak-Lojasiewicz property is therefore weaker than strong convexity, and it is important to note that it can hold without strong convexity, or even convexity.
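To illustrate the last point, the snippet below (my own example, assuming NumPy; a numerical estimate, not a proof) looks at $f(x) = x^2 + 3\sin^2(x)$, a commonly used example of a non-convex function satisfying the Polyak-Lojasiewicz inequality, and estimates its PL constant on a grid.

```python
import numpy as np

# Non-convex test function f(x) = x^2 + 3 sin^2(x), with inf f = 0 attained at x = 0.
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

xs = np.linspace(-10.0, 10.0, 200001)
xs = xs[np.abs(xs) > 1e-6]                 # avoid the 0/0 ratio at the minimizer

# largest mu compatible with f(x) - inf f <= ||f'(x)||^2 / (2 mu) on this grid
mu_hat = 0.5 * np.min(df(xs) ** 2 / f(xs))
print("estimated PL constant on [-10, 10]:", mu_hat)

# f is not convex: its second derivative 2 + 6 cos(2x) takes negative values
print("min of f'' on the grid:", np.min(2.0 + 6.0 * np.cos(2.0 * xs)))
```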
This post is heavily inspired by (Garrigos and Gower 2023).