I am currently a fourth year PhD student at the Department of Computer Sciences, UW-Madison. I am fortunate to be advised by Prof. Jelena Diakonikolas.
Earlier, I received my BS in Math from Zhiyuan Honors College at Shanghai Jiao Tong University.

My research interests are in **optimization**, with applications in machine learning.

Please reach out via xcai74 [at] wisc [dot] edu.

#### Recent Papers

(chronologically, * denotes equal contribution)
**Last Iterate Convergence of Incremental Methods and Applications in Continual Learning.**

Xufeng Cai, Jelena Diakonikolas.

arXiv preprint, arXiv:2403.06873, 2024.

abstract / arXiv

Incremental gradient methods and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature.
Yet, when it comes to their convergence guarantees, nonasymptotic (first-order or proximal) oracle complexity bounds have been obtained fairly recently, almost exclusively applying to the average iterate.
Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods,
in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor)
the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights,
which can be seen as interpolating between the last iterate and the average iterate guarantees. Additionally, we discuss how our results can be generalized to variants of studied incremental methods with permuted ordering of updates.
Our results generalize last iterate guarantees for incremental methods compared to state of the art, as such results were previously known only for overparameterized linear models,
which correspond to convex quadratic problems with infinitely many solutions.

**Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions.**

Xufeng Cai*, Ahmet Alacaoglu*, Jelena Diakonikolas.

In Proc. ICLR'24, 2024.

abstract / arXiv

Machine learning approaches relying on such criteria as adversarial robustness or multi-agent settings have raised the need for solving game-theoretic equilibrium problems.
Of particular relevance to these applications are methods targeting finite-sum structure, which generically arises in empirical variants of learning problems in these contexts.
Further, methods with computable approximation errors are highly desirable, as they provide verifiable exit criteria. Motivated by these applications,
we study finite-sum monotone inclusion problems, which model broad classes of equilibrium problems.
Our main contributions are variants of the classical Halpern iteration that employ variance reduction to obtain improved complexity guarantees in which \(n\) component operators in the finite sum are ``on average''
either cocoercive or Lipschitz continuous and monotone, with parameter \(L\). The resulting oracle complexity of our methods, which provide guarantees for the last iterate and for a (computable) operator norm residual,
is \(\widetilde{\mathcal{O}}( n + \sqrt{n}L\varepsilon^{-1})\), which improves upon existing methods by a factor up to \(\sqrt{n}\).
This constitutes the first variance reduction-type result for general finite-sum monotone inclusions and for more specific problems such as convex-concave optimization when operator norm residual is the optimality measure.
We further argue that, up to poly-logarithmic factors, this complexity is unimprovable in the monotone Lipschitz setting; i.e., the provided result is near-optimal.

**Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds.**

Xufeng Cai, Cheuk Yin Lin, Jelena Diakonikolas.

arXiv preprint, arXiv:2306.12498, 2023.

abstract / arXiv

Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern
machine learning. Contrary to the empirical practice of sampling from the datasets without replacement and
with (possible) reshuffling at each epoch, the theoretical counterpart of SGD usually relies on the assumption
of sampling with replacement. It is only very recently that SGD with sampling without replacement – shuffled SGD – has been analyzed. For convex finite sum problems with n components and under the
\(L\)-smoothness assumption for each component function, there are matching upper and lower bounds, under
sufficiently small – \(\mathcal{O}(\frac{1}{nL})\) – step sizes. Yet those bounds appear too pessimistic – in fact, the predicted
performance is generally no better than for full gradient descent – and do not agree with the empirical
observations. In this work, to narrow the gap between the theory and practice of shuffled SGD, we sharpen
the focus from general finite sum problems to empirical risk minimization with linear predictors. This
allows us to take a primal-dual perspective and interpret shuffled SGD as a primal-dual method with cyclic
coordinate updates on the dual side. Leveraging this perspective, we prove a fine-grained complexity bound
that depends on the data matrix and is never worse than what is predicted by the existing bounds. Notably,
our bound can predict much faster convergence than the existing analyses – by a factor of the order of \(\sqrt{n}\)
in some cases. We empirically demonstrate that on common machine learning datasets our bound is indeed
much tighter. We further show how to extend our analysis to convex nonsmooth problems, with similar improvements.

**Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization.**

Xufeng Cai, Chaobing Song, Stephen J. Wright, Jelena Diakonikolas.

In Proc. ICML'23, 2023.

abstract / arXiv

Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is
commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems
with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition
with respect to a Mahalanobis norm, inspired by a recent progress on cyclic block coordinate methods. In deterministic
settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient
Lipschitz constant being defined w.r.t. a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to
decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient
methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result
when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic
convergence guarantees — variance-reduced or not — for a cyclic block coordinate method in general composite (smooth
+ nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in
training deep neural nets.

**Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions.**

Xufeng Cai, Chaobing Song, Cristóbal Guzmán, Jelena Diakonikolas.

In Proc. NeurIPS'22, 2022.

abstract / arXiv

We study stochastic monotone inclusion problems, which widely appear in machine learning applications,
including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction.
In the cocoercive---and more generally Lipschitz-monotone---setup,
our algorithm attains \(\epsilon\) norm of the operator with \(\mathcal{O}(\frac{1}{\epsilon^3})\) stochastic operator evaluations,
which significantly improves over state of the art \(\mathcal{O}(\frac{1}{\epsilon^4})\) stochastic operator evaluations
required for existing monotone inclusion solvers applied to the same problem classes. %is the first result of this kind.
We further show how to couple one of the proposed variants of stochastic Halpern iteration
with a scheduled restart scheme to solve stochastic monotone inclusion problems with \({\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})\)
stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.

#### Talks

**Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions.**

The seventh International Conference on Continuous Optimization (ICCOPT), Bethlehem, PA, USA. 2022.

#### Teaching

**CS639**, Foundations of Data Science, UW-Madison, Spring 2024 (TA).

**CS639**, Foundations of Data Science, UW-Madison, Spring 2022 (TA).

**CS760**, Machine Learning, UW-Madison, Spring 2021 (TA).

**CS760**, Machine Learning, UW-Madison, Fall 2020 (TA).

#### Miscellaneous

In my free time, I enjoy reading (not these days), photography (cameras and gear), and music (vinyls).