I am currently a fourth-year PhD student in the Department of Computer Sciences at UW-Madison. I am fortunate to be advised by Jelena Diakonikolas.
Earlier, I received my BS in Mathematics from Zhiyuan Honors College at Shanghai Jiao Tong University.

My research interests are in the interplay between **optimization** and **machine learning**.

Please reach out via xcai74 [at] wisc [dot] edu.

#### Recent Papers

**Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds.**

Xufeng Cai, Cheuk Yin Lin, Jelena Diakonikolas.

Preprint, 2023.

abstract / arXiv

Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern
machine learning. Contrary to the empirical practice of sampling from the datasets without replacement and
with (possible) reshuffling at each epoch, the theoretical counterpart of SGD usually relies on the assumption
of sampling with replacement. It is only very recently that SGD with sampling without replacement – shuffled SGD – has been analyzed. For convex finite sum problems with n components and under the
\(L\)-smoothness assumption for each component function, there are matching upper and lower bounds, under
sufficiently small – \(\mathcal{O}(\frac{1}{nL})\) – step sizes. Yet those bounds appear too pessimistic – in fact, the predicted
performance is generally no better than for full gradient descent – and do not agree with the empirical
observations. In this work, to narrow the gap between the theory and practice of shuffled SGD, we sharpen
the focus from general finite sum problems to empirical risk minimization with linear predictors. This
allows us to take a primal-dual perspective and interpret shuffled SGD as a primal-dual method with cyclic
coordinate updates on the dual side. Leveraging this perspective, we prove a fine-grained complexity bound
that depends on the data matrix and is never worse than what is predicted by the existing bounds. Notably,
our bound can predict much faster convergence than the existing analyses – by a factor of the order of \(\sqrt{n}\)
in some cases. We empirically demonstrate that on common machine learning datasets our bound is indeed
much tighter. We further show how to extend our analysis to convex nonsmooth problems, with similar improvements.
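As an illustrative sketch (a toy least-squares instance, not the algorithms analyzed in the paper), the without-replacement sampling that distinguishes shuffled SGD looks like this in code:

```python
import numpy as np

def shuffled_sgd(A, b, step, epochs, rng):
    """Shuffled SGD for least squares: min_x (1/2n) ||Ax - b||^2.

    Each epoch visits all n component functions exactly once, in a
    fresh random order (sampling without replacement, reshuffled
    every epoch).
    """
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):           # without-replacement pass
            grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of the i-th component
            x -= step * grad_i
        # with-replacement SGD would instead draw i = rng.integers(n)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                                 # noiseless linear model
x_hat = shuffled_sgd(A, b, step=0.01, epochs=100, rng=rng)
```

The step size, epoch count, and problem dimensions here are arbitrary choices for the toy example.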

**Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization.**

Xufeng Cai, Chaobing Song, Stephen J. Wright, Jelena Diakonikolas.

In Proc. ICML'23, 2023.

abstract / arXiv

Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is
commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems
with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition
with respect to a Mahalanobis norm, inspired by recent progress on cyclic block coordinate methods. In deterministic
settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient
Lipschitz constant being defined w.r.t. a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to
decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient
methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result
when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic
convergence guarantees — variance-reduced or not — for a cyclic block coordinate method in general composite (smooth
+ nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in
training deep neural nets.
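As a minimal illustration (a plain-gradient toy on a quadratic, without the paper's variance reduction or Mahalanobis-norm analysis), a cyclic block coordinate update can be sketched as:

```python
import numpy as np

def cyclic_bcd(grad, x0, blocks, step, iters):
    """Cyclic block coordinate descent: in each outer iteration,
    sweep through the blocks in a fixed cyclic order and take a
    gradient step on one block at a time.

    grad(x) returns the full gradient; only the active block's
    entries are used for each update.
    """
    x = x0.copy()
    for _ in range(iters):
        for blk in blocks:
            g = grad(x)            # in practice, a partial gradient suffices
            x[blk] -= step * g[blk]
    return x

# Toy strongly convex quadratic: f(x) = 0.5 x^T Q x - c^T x
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
Q = M @ M.T + 6 * np.eye(6)        # positive definite
c = rng.standard_normal(6)
grad = lambda x: Q @ x - c
step = 1.0 / np.linalg.eigvalsh(Q).max()      # 1/L step size
blocks = [np.arange(0, 3), np.arange(3, 6)]   # two coordinate blocks
x = cyclic_bcd(grad, np.zeros(6), blocks, step, iters=500)
```

The two equal-sized blocks and the `1/L` step size are illustrative assumptions, not the stepsizes from the paper's analysis.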

**Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions.**

Xufeng Cai, Chaobing Song, Cristóbal Guzmán, Jelena Diakonikolas.

In Proc. NeurIPS'22, 2022.

abstract / arXiv

We study stochastic monotone inclusion problems, which widely appear in machine learning applications,
including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction.
In the cocoercive---and more generally Lipschitz-monotone---setup,
our algorithm attains an operator norm of at most \(\epsilon\) with \(\mathcal{O}(\frac{1}{\epsilon^3})\) stochastic operator evaluations,
which significantly improves over the \(\mathcal{O}(\frac{1}{\epsilon^4})\) stochastic operator evaluations
required by state-of-the-art monotone inclusion solvers applied to the same problem classes.
We further show how to couple one of the proposed variants of stochastic Halpern iteration
with a scheduled restart scheme to solve stochastic monotone inclusion problems with \({\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})\)
stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.
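For context, these variants build on the classical (deterministic) Halpern iteration, which anchors each iterate to the initial point; for an operator \(F\) with step size \(\eta\) it reads

\[
x_{k+1} = \lambda_{k+1}\, x_0 + (1 - \lambda_{k+1})\big(x_k - \eta F(x_k)\big), \qquad \lambda_{k+1} = \frac{1}{k+2}.
\]

The stochastic variants proposed in the paper replace \(F(x_k)\) with a recursively variance-reduced estimator.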

#### Talks

**Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions.**

The Seventh International Conference on Continuous Optimization (ICCOPT), Bethlehem, PA, USA. 2022.

#### Teaching

**CS639**, Foundations of Data Science, UW-Madison, Spring 2022 (TA).

**CS760**, Machine Learning, UW-Madison, Spring 2021 (TA).

**CS760**, Machine Learning, UW-Madison, Fall 2020 (TA).

#### Miscellaneous

In my free time, I enjoy reading, photography, and music.