<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mastane.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mastane.github.io/" rel="alternate" type="text/html" /><updated>2026-02-26T06:23:44-08:00</updated><id>https://mastane.github.io/feed.xml</id><title type="html">Mastane Achab</title><subtitle>Researcher in artificial intelligence</subtitle><author><name>Mastane Achab</name></author><entry><title type="html">Beyond Convexity #2: Barygradient flow</title><link href="https://mastane.github.io/posts/2024/12/blog-post-2/" rel="alternate" type="text/html" title="Beyond Convexity #2: Barygradient flow" /><published>2024-12-07T00:00:00-08:00</published><updated>2024-12-07T00:00:00-08:00</updated><id>https://mastane.github.io/posts/2024/12/blog-post-2</id><content type="html" xml:base="https://mastane.github.io/posts/2024/12/blog-post-2/"><![CDATA[<p>In <a href="https://arxiv.org/pdf/2411.00928">my latest paper</a>, I introduced a generalized proximal point algorithm (PPA): given a point $(x,q) \in \mathbb{R}^m \times \mathring \Delta_S$ ($m\ge 1$, $S \ge 2$, and $\mathring \Delta_S$ the interior of the probability simplex $\Delta_S$), the next iterate $(x',q')$ is given by:
\(( \nabla f + \lambda A )(x',q') = \nabla f(x,q) + c \begin{pmatrix} 0_m \\ 1_S \end{pmatrix}\)
where \(c = \log(\sum_s e^{\log(q'_s)-\lambda \ell_s(x')}) -\log(\sum_s e^{\log(q_s)}),\) 
for step-size $\lambda&gt;0$, $f(x,q)=\frac12 ||x||^2 + h(q)$ with $h(q)=\sum_{s=1}^S q_s \log(q_s)$ the negentropy, and
\(A(x,q) = \begin{pmatrix}
J_\ell(x)^\intercal q \\
-\ell(x)
\end{pmatrix}\)
where $J_\ell$ denotes the Jacobian matrix of $\ell=(\ell_1,\dots,\ell_S):\mathbb{R}^m \rightarrow \mathbb{R}^S$ with each $\ell_s$ convex ($\forall 1\le s \le S$).</p>

<p>In particular, I showed that $A$ is a monotone operator and that the $f$-resolvent $(\nabla f + \lambda A)^{-1} \circ \nabla f$ is Bregman firmly nonexpansive with respect to the Bregman divergence $D_f$.
In other words, this guarantees that the generalized PPA converges to a fixed point, whenever one exists.</p>
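<p>The monotonicity of $A$ can be sanity-checked numerically. The sketch below is illustrative only (not the paper's code): it assumes hypothetical quadratic losses $\ell_s(x)=\frac12 ||x-a_s||^2$ and verifies that $\langle A(x_1,q_1)-A(x_2,q_2), (x_1,q_1)-(x_2,q_2) \rangle \ge 0$ on random pairs of points:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, S = 3, 4
anchors = rng.normal(size=(S, m))  # hypothetical quadratic losses: l_s(x) = 0.5 * ||x - a_s||^2

def ell(x):
    return 0.5 * np.sum((x - anchors) ** 2, axis=1)

def A(x, q):
    # A(x, q) = (J_l(x)^T q, -l(x)); for these quadratics, J_l(x) has rows (x - a_s)^T
    return np.concatenate([(x - anchors).T @ q, -ell(x)])

worst = np.inf
for _ in range(1000):
    x1, x2 = rng.normal(size=(2, m))
    q1, q2 = rng.dirichlet(np.ones(S), size=2)  # interior points of the simplex
    dz = np.concatenate([x1 - x2, q1 - q2])
    worst = min(worst, (A(x1, q1) - A(x2, q2)) @ dz)

print(worst >= -1e-12)  # True: the inner product never goes negative
```

<p>This is the saddle-point operator of $(x,q) \mapsto q^\intercal \ell(x)$, convex in $x$ and linear in $q$, which is why monotonicity holds for any choice of convex $\ell_s$.</p>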

<p>In this blog post, we propose to generalize the gradient flow ordinary differential equation (ODE) by letting $\lambda \rightarrow 0$ in our generalized PPA (for a great introduction to gradient flow, see <a href="https://francisbach.com/gradient-flows/">Bach’s blog post</a>).</p>

<p><u>Definition:</u> Let $F(x,q) = q^\intercal \ell(x)$. We define the barygradient flow ODE as
\(\dot \zeta(t) = - \begin{pmatrix}
I_m &amp; 0 \\
0 &amp; -I_S
\end{pmatrix} \nabla F( (\nabla f)^{-1}( \zeta(t) -\log(\sum_s e^{\xi_s(t)-1}) \begin{pmatrix} 0_m \\ 1_S \end{pmatrix} ) ) + \gamma(t) \begin{pmatrix} 0_m \\ 1_S \end{pmatrix} ,\)
where $\zeta = (x,\xi) : \mathbb{R}_+ \rightarrow \mathbb{R}^m \times \mathbb{R}^S$ and
\(\gamma(t) = \frac{\sum_s [ \dot \xi_s(t) - \ell_s(x(t)) ] e^{\xi_s(t)-1}}{\sum_s e^{\xi_s(t)-1}} = q(t)^\intercal [ \dot \xi(t) - \ell(x(t)) ]\)
with $q(t)=(\nabla h)^{-1}(\xi(t)-\log(\sum_s e^{\xi_s(t)-1}) 1_S)$.</p>
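<p>To build intuition, one can integrate the ODE with an explicit Euler scheme. The sketch below is illustrative only (not the paper's code): it assumes hypothetical quadratic losses $\ell_s(x)=\frac12 ||x-a_s||^2$ and picks the gauge $\gamma(t)=0$, which is self-consistent here since $\dot\xi = \ell(x)$ then gives $q(t)^\intercal[\dot\xi(t)-\ell(x(t))]=0$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
m, S = 2, 3
anchors = rng.normal(size=(S, m))  # hypothetical quadratics: l_s(x) = 0.5 * ||x - a_s||^2

def ell(x):
    return 0.5 * np.sum((x - anchors) ** 2, axis=1)

def q_of_xi(xi):
    # q_s proportional to e^{xi_s - 1} (the softargmax map from the definition)
    z = np.exp(xi - xi.max())
    return z / z.sum()

# Explicit Euler on the barygradient flow with gauge gamma(t) = 0:
#   dx/dt  = -J_l(x)^T q   (weighted gradient step on x)
#   dxi/dt =  l(x)         (logits drift toward the larger losses)
dt = 1e-2
x, xi = rng.normal(size=m), np.zeros(S)
for _ in range(5000):
    q = q_of_xi(xi)
    x, xi = x - dt * ((x - anchors).T @ q), xi + dt * ell(x)

print(x, q_of_xi(xi))
```

<p>For these quadratics, $J_\ell(x)^\intercal q = x - \sum_s q_s a_s$, so the $x$-update pulls $x$ toward the current $q$-barycenter of the anchors, while the weights $q$ drift toward the losses that are currently largest.</p>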

<p>We point out that 
\(\begin{pmatrix}
I_m &amp; 0 \\
0 &amp; -I_S
\end{pmatrix} \nabla F( (\nabla f)^{-1}( \zeta(t) -\log(\sum_s e^{\xi_s(t)-1}) \begin{pmatrix} 0_m \\ 1_S \end{pmatrix}) ) = A(x(t), q(t)) .\)</p>

<h2 id="monotonicity-analysis">Monotonicity analysis</h2>

<p>Contrary to the classic gradient flow, the function $F(x(t),q(t))$ is not necessarily nonincreasing along the flow.
Indeed,
\(\frac{d}{dt}F( (\nabla f)^{-1}( \zeta(t) -\log(\sum_s e^{\xi_s(t)-1}) \begin{pmatrix} 0_m \\ 1_S \end{pmatrix} ) ) = \frac{d}{dt}[(\nabla h)^{-1}(\xi(t)-\log(\sum_s e^{\xi_s(t)-1}) 1_S)]^\intercal \ell(x(t)) + q(t)^\intercal \frac{d}{dt} \ell(x(t)) ,\)
where 
\(\frac{d}{dt}[(\nabla h)^{-1}(\xi(t)-\log(\sum_s e^{\xi_s(t)-1}) 1_S)] = [\nabla^2 h(q(t))]^{-1} \dot \xi(t) - \frac{\sum_s \dot \xi_s(t) e^{\xi_s(t)-1}}{\sum_s e^{\xi_s(t)-1}} q(t)\)
and $\frac{d}{dt} \ell(x(t)) = J_\ell(x(t)) \dot x(t)$.</p>

<p>Hence,
\(\frac{d}{dt}F(x(t),q(t)) = \underbrace{ \ell(x(t))^\intercal [\nabla^2 h(q(t))]^{-1} \ell(x(t)) - F(x(t),q(t))^2 }_{\text{Var}_{\tau \sim q(t)}(\ell_\tau(x(t)))} - ||J_\ell(x(t))^\intercal q(t)||^2 ,\)
which is not necessarily nonpositive.</p>
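<p>A concrete illustration of this possible increase (toy numbers, not from the paper): take two quadratic losses on $\mathbb{R}$ and place $x$ at the $q$-barycenter of their minimizers, so that the gradient term vanishes while the variance term stays positive:</p>

```python
import numpy as np

# two hypothetical quadratics on R: l_s(x) = 0.5 * (x - a_s)^2, minimizers a = (0, 4)
a = np.array([0.0, 4.0])
q = np.array([0.8, 0.2])
x = q @ a                          # the q-barycenter, so J_l(x)^T q = x - q @ a = 0

losses = 0.5 * (x - a) ** 2                  # l(x)
var = q @ losses ** 2 - (q @ losses) ** 2    # Var_{tau ~ q}(l_tau(x))
grad_sq = (x - q @ a) ** 2                   # ||J_l(x)^T q||^2

print(var - grad_sq)  # ≈ 3.69 > 0, so dF/dt > 0 at this point
```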

<h2 id="entropy-analysis">Entropy analysis</h2>

<p>Denote $\chi(t) = h(q(t))$. Then,
\(\frac{d}{dt} \chi(t) = \dot q(t)^\intercal \nabla h(q(t))
= \{ [\nabla^2 h(q(t))]^{-1} \dot \xi(t) - [q(t)^\intercal \dot \xi(t)] q(t) \}^\intercal \{\xi(t)-\log(\sum_s e^{\xi_s(t)-1}) 1_S\} \\ = \xi(t)^\intercal \underbrace{[ \text{Diag}(q(t))-q(t)q(t)^\intercal ]}_{\text{Cov}(q(t))} \ell(x(t)),\)
where $\text{Cov}(q(t))$ denotes the covariance matrix<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> of the categorical distribution $q(t)$.</p>
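<p>The footnote's identification of $\text{Cov}(q(t))$ with the Jacobian of the softargmax map can be sanity-checked by finite differences (an illustrative sketch at an arbitrary test point $\xi$):</p>

```python
import numpy as np

def softargmax(xi):
    # q = (nabla h)^{-1}(xi - log(sum_s e^{xi_s - 1}) 1_S), i.e. q_s proportional to e^{xi_s - 1}
    z = np.exp(xi - xi.max())
    return z / z.sum()

xi = np.array([0.3, -1.2, 0.7])      # arbitrary test point
q = softargmax(xi)
cov = np.diag(q) - np.outer(q, q)    # Diag(q) - q q^T

# central finite-difference Jacobian: column j holds dq / dxi_j
eps = 1e-6
jac = np.column_stack([
    (softargmax(xi + eps * e) - softargmax(xi - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(jac, cov, atol=1e-8))  # True
```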

<p><u>Remark:</u> The barygradient flow can be equivalently rewritten as the following natural gradient flow:
\(\dot \zeta(t) = - \begin{pmatrix}
I_m &amp; 0 \\
0 &amp; -\text{Cov}(q(t))^\dagger
\end{pmatrix} \nabla \tilde F( \zeta(t) ) + [ \gamma(t) + \frac{1_S^\intercal \ell(x(t))}{S} ] \begin{pmatrix} 0_m \\ 1_S \end{pmatrix} ,\)
where $\dagger$ denotes the Moore–Penrose pseudoinverse and
\(\tilde F(x,\xi) = (\nabla h)^{-1}(\xi - \log(\sum_s e^{\xi_s-1}) 1_S)^\intercal \ell(x).\)</p>
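<p>A quick way to see where the extra $\frac{1_S^\intercal \ell(x(t))}{S} 1_S$ correction comes from: since $\text{Cov}(q)$ is symmetric with null space spanned by $1_S$ (for $q$ in the interior of the simplex), $\text{Cov}(q)^\dagger \text{Cov}(q)$ is the orthogonal projection onto $\{u : 1_S^\intercal u = 0\}$, i.e. it subtracts the mean. A minimal numerical check (illustrative values, not from the paper):</p>

```python
import numpy as np

q = np.array([0.4, 0.3, 0.2, 0.1])   # an interior point of the simplex (arbitrary)
cov = np.diag(q) - np.outer(q, q)    # Cov(q); singular, since Cov(q) @ 1_S = 0
v = np.array([1.0, -2.0, 0.5, 3.0])  # stand-in for the loss vector l(x(t))

# pinv(Cov) @ Cov is the orthogonal projection onto range(Cov) = {u : 1^T u = 0},
# so applying it to v just removes the mean of v
proj = np.linalg.pinv(cov) @ cov @ v
print(np.allclose(proj, v - v.mean()))  # True
```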

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>$\text{Cov}(q(t))$ is also the Jacobian matrix of the softargmax function $\xi \mapsto (\nabla h)^{-1}(\xi-\log(\sum_s e^{\xi_s-1})1_S)$ evaluated at $\xi(t)$. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Mastane Achab</name></author><category term="gradient flow" /><category term="baryconvex optimization" /><category term="Bregman divergence" /><summary type="html"><![CDATA[In my latest paper, I introduced a generalized proximal point algorithm (PPA): given a point $(x,q) \in \mathbb{R}^m \times \mathring \Delta_S$ ($m\ge 1$, $S \ge 2$, and $\mathring \Delta_S$ the interior of the probability simplex $\Delta_S$), the next iterate $(x',q')$ is given by: \(( \nabla f + \lambda A )(x',q') = \nabla f(x,q) + c \begin{pmatrix} 0_m \\ 1_S \end{pmatrix}\) where \(c = \log(\sum_s e^{\log(q'_s)-\lambda \ell_s(x')}) -\log(\sum_s e^{\log(q_s)}),\) for step-size $\lambda&gt;0$, $f(x,q)=\frac12 ||x||^2 + h(q)$ with $h(q)=\sum_{s=1}^S q_s \log(q_s)$ the negentropy, and \(A(x,q) = \begin{pmatrix} J_\ell(x)^\intercal q \\ -\ell(x) \end{pmatrix}\) where $J_\ell$ denotes the Jacobian matrix of $\ell=(\ell_1,\dots,\ell_S):\mathbb{R}^m \rightarrow \mathbb{R}^S$ with each $\ell_s$ convex ($\forall 1\le s \le S$).]]></summary></entry><entry><title type="html">Beyond Convexity #1: Introduction to Cross-Convexity</title><link href="https://mastane.github.io/posts/2023/10/blog-post-1/" rel="alternate" type="text/html" title="Beyond Convexity #1: Introduction to Cross-Convexity" /><published>2023-10-14T00:00:00-07:00</published><updated>2023-10-14T00:00:00-07:00</updated><id>https://mastane.github.io/posts/2023/10/blog-post-1</id><content type="html" xml:base="https://mastane.github.io/posts/2023/10/blog-post-1/"><![CDATA[<p>In this blog post, we introduce a generalized notion of convexity for functions, which we call <a href="https://arxiv.org/abs/2309.15298">“cross-convexity”</a>, yielding inequalities that involve additional interaction terms compared to <a href="https://en.wikipedia.org/wiki/Convex_function">standard convexity</a>.</p>

<p><u>Definition:</u> A function $F:\mathbb{R}^d \rightarrow \mathbb{R}$ is said to be cross-convex if there exist $S\ge 1$ <a href="https://en.wikipedia.org/wiki/Logarithmically_concave_function">log-concave functions</a> $p_1,\dots,p_S$ such that
\(\forall x\in \mathbb{R}^d , \   F(x) = -\log\left( \sum_{s=1}^S p_s(x) \right) .\)</p>

<p>If $S=1$, then a cross-convex function is simply convex.
In the general case $S\ge 1$, we argue that this family of functions is still a natural one to consider as it includes the negative log-likelihood of the Gaussian mixture model.</p>

<p>Let us first recall that a differentiable function $f:\mathbb{R}^d \rightarrow \mathbb{R}$ is convex if it dominates all its tangent hyperplanes:
\(\forall a, \forall x, f(x) \ge f(a) + \nabla f(a)^\top (x-a).\)</p>

<p>$\bullet$ <u>Summation/Affine Closure:</u> A very important characteristic of the family of convex functions is that it is closed under (i) summation and (ii) affine reparametrization.
Formally, (i) if $f_1,f_2$ are two convex functions, then $f_1+f_2$ is also convex; and (ii) if $f$ is convex, then $z \mapsto f(Az+b)$ is also convex.
These two closure properties are very important in machine learning where the objective function is typically equal to a sum over the dataset with affine neuronal transformations of the input data.
A first challenge when trying to generalize convexity for ML applications is to pick a family of functions satisfying such closedness under summation and affine reparametrization.
For instance, while the notion of <a href="https://en.wikipedia.org/wiki/Quasiconvex_function">quasi-convexity</a> is often cited as a natural extension of standard convexity, unfortunately it is not closed under summation since the sum of two quasi-convex functions is not necessarily quasi-convex.</p>

<p><u>Proposition:</u> Let $F = -\log\left( \sum_{s=1}^S p_s \right)$ and $\tilde F = -\log\left( \sum_{s=1}^{\tilde S} \tilde p_s \right)$ be two cross-convex functions with $S,\tilde S \ge 1$ and $p_s,\tilde p_s : \mathbb{R}^d \rightarrow (0,+\infty)$ log-concave functions.
Then, the sum $F+\tilde F$ is also cross-convex. Indeed,
\(F+\tilde F = -\log\left( \sum_{s=1}^S \sum_{s'=1}^{\tilde S} p_s \tilde p_{s'} \right) ,\)
where the product of two log-concave functions $p_s \tilde p_{s'}$ is also log-concave.</p>

<p>Moreover, the class of cross-convex functions is also closed under affine reparametrization: this follows from the fact that log-concave functions are themselves closed under affine reparametrization.</p>
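<p>The summation closure in the Proposition is easy to test numerically. The sketch below (illustrative, not from the paper) takes two hypothetical sum-log-concave functions built from Gaussian bumps on $\mathbb{R}$, with $S=\tilde S=2$, and checks the product expansion on a grid:</p>

```python
import numpy as np

def F(x):
    # -log(p_1 + p_2) with hypothetical bumps p_s(x) = exp(-(x - c_s)^2 / 2), centers (1, -1)
    return -np.logaddexp(-(x - 1.0) ** 2 / 2, -(x + 1.0) ** 2 / 2)

def F_tilde(x):
    # same construction with centers (2, 0)
    return -np.logaddexp(-(x - 2.0) ** 2 / 2, -x ** 2 / 2)

def F_sum(x):
    # -log of the S * S~ = 4 pairwise products p_s * p~_{s'}
    terms = [-(x - 1.0) ** 2 / 2 - (x - 2.0) ** 2 / 2,
             -(x - 1.0) ** 2 / 2 - x ** 2 / 2,
             -(x + 1.0) ** 2 / 2 - (x - 2.0) ** 2 / 2,
             -(x + 1.0) ** 2 / 2 - x ** 2 / 2]
    return -np.logaddexp.reduce(terms)

xs = np.linspace(-3.0, 3.0, 101)
print(np.allclose(F(xs) + F_tilde(xs), F_sum(xs)))  # True
```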

<p>$\bullet$ <u>Generalized convexity inequality:</u> Note that the right-hand side in the convexity inequality, namely “$f(a) + \nabla f(a)^\top (x-a)$”, represents the tangent hyperplane of $f$ at the point $a$.
In particular, this lower bound is a critical ingredient in the analysis of gradient descent (see e.g. <a href="https://www.di.ens.fr/~fbach/ltfp_book.pdf">Bach’s LTFP book</a>).</p>

<p><u>Key Remark:</u> For any element $\mu \in [0,1]^S$ of the probability simplex (i.e., $\lVert \mu \rVert_1=1$), the KL-regularized log-loss
\(\ell_\mu(x) = -\log( p_1(x)+\dots+p_S(x) ) + D_{\text{KL}}\left( \mu \Bigg\| \left[ \frac{p_s(x)}{ \sum_{s'} p_{s'}(x) } \right]_{s} \right)\)
is convex.</p>

<p>In the following, we explain step by step how to obtain a similar lower bound in the cross-convex case.
For simplicity, we focus on the very specific case $S=2, d=2$ and $p_1(x,y)=p_2(y,x)=p(x)$ for some log-concave function $p: \mathbb{R} \rightarrow (0,+\infty)$, which already unveils phenomena unseen in the convex scenario. We refer the reader to Lemma 1 in <a href="https://arxiv.org/abs/2309.15298">“Beyond Log-Concavity: Theory and Algorithm for Sum-Log-Concave Optimization”</a><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> for the more general statement.</p>

<h3 id="steps">Steps</h3>
<ul>
  <li>Given a log-concave function $p$</li>
  <li>Form the cross-convex function $F(x,y)=-\log(p(x)+p(y))$</li>
  <li>Tangent lower bound $\mathcal{T}_{a,b}(x, y) \le F(x,y)$ at point $(a,b)$:
\(\mathcal{T}_{a,b}(x, y) = F(a,b) + \nabla F(a,b)^\top \begin{pmatrix} x-a \\ y-b \end{pmatrix} - D_{\text{KL}}\left( \begin{pmatrix} \frac{p(a)}{p(a)+p(b)} \\ \frac{p(b)}{p(a)+p(b)} \end{pmatrix} \, \Bigg \| \, \begin{pmatrix} \frac{p(x)}{p(x)+p(y)} \\ \frac{p(y)}{p(x)+p(y)} \end{pmatrix} \right)\)</li>
  <li><u>Note:</u> <em>Actually, the negative sign in front of the KL term is bad news for the analysis of gradient descent… check out my paper to see how to solve that issue, by considering a reweighted version of the gradient.</em></li>
</ul>
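<p>The steps above can be verified numerically. The sketch below (not the paper's code) assumes the hypothetical Gaussian kernel $p(t)=e^{-t^2/2}$, for which $p'(t)=-t\,p(t)$, and tests the tangent inequality $\mathcal{T}_{a,b}(x, y) \le F(x,y)$ on random points:</p>

```python
import numpy as np

def p(t):
    # hypothetical log-concave kernel: Gaussian bump (assumed example)
    return np.exp(-t ** 2 / 2)

def F(x, y):
    return -np.log(p(x) + p(y))

def grad_F(a, b):
    w = p(a) + p(b)
    return np.array([a * p(a) / w, b * p(b) / w])  # uses p'(t) = -t * p(t)

def weights(x, y):
    w = p(x) + p(y)
    return np.array([p(x) / w, p(y) / w])

def kl(u, v):
    return np.sum(u * np.log(u / v))

def tangent(a, b, x, y):
    # T_{a,b}(x, y): tangent plane of F at (a, b), minus the KL interaction term
    return (F(a, b) + grad_F(a, b) @ np.array([x - a, y - b])
            - kl(weights(a, b), weights(x, y)))

rng = np.random.default_rng(2)
ok = True
for _ in range(1000):
    a, b, x, y = rng.normal(size=4)
    if tangent(a, b, x, y) > F(x, y) + 1e-12:
        ok = False

print(ok)  # True
```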

<p>$\bullet$ \(\mathfrak{B}\)<u>estiary:</u> Collection of illustrations of cross-convex functions (in red) with their tangent surface (in green): <a href="https://github.com/mastane/TheXCB">https://github.com/mastane/TheXCB</a>. The green tangent surface represents the lower bound $\mathcal{T}_{0,0}$ in the generalized convexity inequality.</p>

<ul>
  <li>
    <p>\(\mathcal{G}\)<u>aussian mixture</u>
<img src="https://mastane.github.io/images/XCB/gifs/rotating_plot_00_Gaussian.gif" width="100%" height="100%" /></p>
  </li>
  <li>
    <p>\(\mathcal{L}\)<u>ogistic mixture</u>
<img src="https://mastane.github.io/images/XCB/gifs/rotating_plot_00_Logistic.gif" width="100%" height="100%" /></p>
  </li>
  <li>
    <p>\(\mathcal{H}\)<u>yperbolic</u> \(\mathcal{S}\)<u>ecant mixture</u>
<img src="https://mastane.github.io/images/XCB/gifs/rotating_plot_00_Sech.gif" width="100%" height="100%" /></p>
  </li>
  <li>
    <p>\(\mathcal{G}\)<u>umbel mixture</u>
<img src="https://mastane.github.io/images/XCB/gifs/rotating_plot_00_Gumbel.gif" width="100%" height="100%" /></p>
  </li>
</ul>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><em><u>Note:</u></em> in Proposition 5.7 the cross gradient formula further simplifies to \(\left[ \sigma_l(Z_k) - \frac{\sigma_l(Z'_k) \Xi_{m-1,y-l}(Z'_{-k})}{\Xi_{m,y}(Z')} \right]_{0\le l\le c-1}.\) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Mastane Achab</name></author><category term="generalized convexity inequality" /><category term="mixture of log-concave distributions" /><summary type="html"><![CDATA[In this blog post, we introduce a generalized notion of convexity for functions, which we call “cross-convexity”, yielding inequalities that involve additional interaction terms compared to standard convexity.]]></summary></entry></feed>