_bibliography/preprints.bib: 25 additions & 1 deletion
@@ -1,5 +1,16 @@
 ---
 ---
+@misc{sun2026robustnessmixturesexpertsfeature,
+title={Robustness of Mixtures of Experts to Feature Noise},
+author={Dong Sun and Rahul Nittala and Rebekka Burkholz},
+year={2026},
+eprint={2601.14792},
+archivePrefix={arXiv},
+primaryClass={cs.LG},
+url={https://arxiv.org/abs/2601.14792},
+img={robustness-of-moes.png},
+abstract={Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.},
+}
 @misc{gadhikar2024cyclicsparsetrainingenough,
 title={Cyclic Sparse Training: Is it Enough?},
 author={Advait Gadhikar and Sree Harsha Nelaturu and Rebekka Burkholz},
abstract={The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their repeated cyclic training schedules enable improved optimization. To verify this, we show that pruning at initialization is significantly boosted by repeated cyclic training, even outperforming standard iterative pruning methods. The dominant mechanism by which this is achieved, as we conjecture, can be attributed to a better exploration of the loss landscape leading to a lower training loss. However, at high sparsity, repeated cyclic training alone is not enough for competitive performance. A strong coupling between learnt parameter initialization and mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., repeated cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask, which is able to match the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.}
abstract={The training success, training speed and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, miss this property and suffer from degrading separability of different inputs for increasing depth and instability without Batch Normalization or lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width. It balances the contributions of the residual and skip branches unlike other schemes, which initially bias towards the skip connections. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.},
abstract={The strong lottery ticket hypothesis holds the promise that pruning randomly initialized deep neural networks could offer a computationally efficient alternative to deep learning with stochastic gradient descent. Common parameter initialization schemes and existence proofs, however, are focused on networks with zero biases, thus foregoing the potential universal approximation property of pruning. To fill this gap, we extend multiple initialization schemes and existence proofs to nonzero biases, including explicit 'looks-linear' approaches for ReLU activation functions. These do not only enable truly orthogonal parameter initialization but also reduce potential pruning errors. In experiments on standard benchmark data, we further highlight the practical benefits of nonzero bias initialization schemes, and present theoretically inspired extensions for state-of-the-art strong lottery ticket pruning.}
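The new preprint entry above attributes the robustness gains to sparse expert activation. For readers who want a concrete picture of that mechanism, here is a minimal sketch of a Mixture-of-Experts layer with top-k gating; it is illustrative only, not code from the paper or this repository, and all names (`TopKMoE`, `n_experts`, `k`) are placeholders.

```python
# Toy Mixture-of-Experts layer with top-k routing (illustrative sketch only,
# not the implementation studied in the preprint above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_in, n_experts)  # router: one score per expert and input
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_in))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                           # (batch, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)           # renormalize over the k selected experts
        out = torch.zeros_like(x)
        # Only the k selected experts are evaluated per input; this sparse, modular
        # computation is what the abstract above credits with filtering feature noise.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

layer = TopKMoE(d_in=32, d_hidden=64)
y = layer(torch.randn(16, 32))  # (16, 32); each row used only 2 of the 8 experts
```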
_bibliography/references.bib: 16 additions & 2 deletions
@@ -1,12 +1,14 @@
 ---
 ---
-
 @inproceedings{merge,
 title={Bridging Domains through Subspace-Aware Model Merging},
 author={Levy Chaves and Chao Zhou and Rebekka Burkholz and Eduardo Valle and Andra Avila},
 year={2026},
 booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
 img={model-merging.png},
+url={https://arxiv.org/abs/2603.05768},
+pdf={https://arxiv.org/pdf/2603.05768},
+abstract={Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.},
 }

 @inproceedings{sanyal2026games,
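The abstract of the entry above outlines the SCORE recipe at a high level: take the SVD of each model's task matrix, build a shared orthogonal basis from the principal components of the concatenated leading singular vectors, and project each task matrix into that basis before merging. The NumPy sketch below is one plausible reading of that recipe for a single weight matrix; it is not the authors' implementation, the off-diagonal pruning step is omitted, and `merge_task_matrices`, the `rank` parameter, and the final averaging are assumptions.

```python
# Illustrative sketch of subspace-aware merging in the spirit of the SCORE recipe
# described above (not the authors' code; all details are assumptions).
import numpy as np

def merge_task_matrices(base: np.ndarray, task_weights: list[np.ndarray], rank: int = 16) -> np.ndarray:
    """Merge fine-tuned weight matrices by projecting their task matrices
    (fine-tuned weights minus the shared base) onto a shared orthogonal basis."""
    deltas = [w - base for w in task_weights]

    # Leading left singular vectors of each task matrix.
    lead = []
    for d in deltas:
        u, _, _ = np.linalg.svd(d, full_matrices=False)
        lead.append(u[:, :rank])

    # Shared basis: principal components of the concatenated leading singular vectors.
    stacked = np.concatenate(lead, axis=1)              # (d_out, rank * n_models)
    q, _, _ = np.linalg.svd(stacked, full_matrices=False)
    shared = q[:, :rank]                                # orthonormal columns

    # Project each task matrix onto the shared subspace, average, and add back to the base.
    projected = [shared @ (shared.T @ d) for d in deltas]
    return base + np.mean(projected, axis=0)
```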
@@ -65,7 +67,19 @@ @inproceedings{
 url={https://openreview.net/forum?id=XKB5Hu0ACY},
 pdf={https://openreview.net/pdf?id=XKB5Hu0ACY},
 abstract={Understanding the implicit bias of optimization algorithms is key to explaining and improving the generalization of deep models. The hyperbolic implicit bias induced by pointwise overparameterization promotes sparsity, but also yields a small inverse Riemannian metric near zero, slowing down parameter movement and impeding meaningful parameter sign flips. To overcome this obstacle, we propose Hyperbolic Aware Minimization (HAM), which alternates a standard optimizer step with a lightweight hyperbolic mirror step. The mirror step incurs less compute and memory than pointwise overparameterization, reproduces its beneficial hyperbolic geometry for feature learning, and mitigates the small-inverse-metric bottleneck. Our characterization of the implicit bias in the context of underdetermined linear regression provides insights into the mechanism by which HAM consistently increases performance, even in the case of dense training, as we demonstrate in experiments with standard vision benchmarks. HAM is especially effective in combination with different sparsification methods, advancing the state of the art.},
-img={ham-hyperbolic-step.png},
+img={hyperbolic-aware-minimization.png},
+}
+
+@inproceedings{
+adnan2026sparseopt,
+title={SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training},
+author={Mohammed Adnan and Rohan Jain and Tom Jacobs and Ekansh Sharma and Rahul G Krishnan and Rebekka Burkholz and Yani Ioannou},
+booktitle={The Third Conference on Parsimony and Learning (Recent Spotlight Track)},
+year={2026},
+url={https://openreview.net/forum?id=qerVUczDMf},
+pdf={https://openreview.net/pdf?id=qerVUczDMf},
+abstract={Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense training, often requiring comparable training time to achieve similar accuracy. We demonstrate both analytically and empirically that Batch Normalization (BN) adversely affects sparse training and propose SparseOpt, a sparsity-aware optimizer, to address this. Experiments on ResNet models across CIFAR-100 and ImageNet demonstrate consistently faster convergence and improved generalization with our proposed method. Our work highlights the limitations of current normalization layers in sparse training and provides the first systematic study of the interaction between Batch Normalization, sparse layers, and DST, taking a significant step toward making DST practically competitive with dense training.},
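The HAM abstract above hinges on alternating a standard optimizer step with a lightweight hyperbolic mirror step. As a concrete illustration of what a hyperbolic mirror step can look like, the sketch below performs one mirror-descent update under the hypentropy mirror map, whose gradient is arcsinh(w / beta); this is a generic textbook-style update, not HAM's actual rule, and `hyperbolic_mirror_step`, `lr`, and `beta` are placeholder names.

```python
# Generic hyperbolic (hypentropy) mirror-descent step, shown only to illustrate the
# kind of geometry the HAM abstract refers to; this is not the HAM algorithm itself.
import numpy as np

def hyperbolic_mirror_step(w: np.ndarray, grad: np.ndarray, lr: float = 1e-2, beta: float = 1e-3) -> np.ndarray:
    """One mirror-descent step under the hypentropy mirror map.

    The mirror map has gradient arcsinh(w / beta), so the update
        arcsinh(w_new / beta) = arcsinh(w / beta) - lr * grad
    is inverted with sinh. For small beta the geometry is sparsity-biased:
    coordinates near zero move slowly in parameter space.
    """
    dual = np.arcsinh(w / beta) - lr * grad
    return beta * np.sinh(dual)
```

The slow movement near zero is the small-inverse-metric bottleneck the abstract mentions; HAM reportedly recovers the beneficial hyperbolic geometry while avoiding that slowdown by interleaving such a mirror step with ordinary optimizer updates.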
description: "I focus on understanding the intricate dynamics of training and fine-tuning in machine learning models, with the goal of developing more efficient and effective learning algorithms. My research explores how optimization processes evolve and how we can refine these methods to improve performance. Currently, I am particularly interested in gradient compression techniques."
21
+
description: "I focus on understanding the intricate dynamics of training and fine-tuning in machine learning models, with the goal of developing more efficient and effective learning algorithms. My research explores how optimization processes evolve and how we can refine these methods to improve performance. Currently, I am particularly interested in gradient compression techniques. I obtained my PhD under the supervision of Prof. Miguel Rodrigues at University College London, UK."
description: "I work on research problems at the intersection of machine learning and causality, focusing on modeling, inference, and interpreting machine learning models from a causal perspective to enhance their robustness and trustworthiness."
30
+
description: "I work on research problems at the intersection of machine learning and causality, focusing on modeling, inference, and interpreting machine learning models from a causal perspective to enhance their robustness and trustworthiness. I received my Ph.D. at the Indian Institute of Technology Hyderabad, where I was advised by Prof. Vineeth N Balasubramanian. During my Ph.D., I was awarded the prestigious Prime Minister's Research Fellowship (PMRF)."
description: "I focus on graph learning and similarity measures for graphs, with the aim of improving efficiency, expressivity, and accuracy. I completed my doctorate with distinction in the Kriege group at University of Vienna, and received the Award of Excellence from the Austrian Ministry of Women, Science and Research for my thesis."
40
+
40
41
- role: PhD students
41
42
members:
42
43
- name: Advait Gadhikar
@@ -88,8 +89,7 @@
 start_date: Jul 2024
 email: dong.sun@cispa.de
 url: https://cispa.de/en/people/c01dosu
-description: "My current research focuses on theoretically elucidating the superior performance of Mixture of Experts models, with an emphasis on their generalization performance, sample complexity, training dynamics, and robustness to adversarial noises.
-I did my master's degree at ETH Zurich."
+description: "My current research focuses on theoretically elucidating the superior performance of Mixture of Experts models, with an emphasis on their generalization performance, sample complexity, training dynamics, and robustness to adversarial noise. I completed my master's degree at ETH Zurich."
_pages/openings.md: 3 additions & 5 deletions
@@ -46,14 +46,12 @@ We are a small team with a flat management structure and a collaborative work culture.

 The starting dates of the positions are flexible. We are committed to providing a healthy work environment and fostering diversity and respectful interaction. We welcome applications by candidates from all backgrounds and also support non-standard careers.

-### Current open positions
-
-* We have PhD and postdoc positions available for 2026.
-*[PhD and Postdocs in Efficient Deep Learning](https://career.cispa.de/jobs/group-relationalml-53) at CISPA Helmholtz Center for Information Security.
-
 ### Past open positions
 This is a non-exhaustive list of past open positions in our group.

+* We had PhD and postdoc positions available for 2026.
+*[PhD and Postdocs in Efficient Deep Learning](https://career.cispa.de/jobs/group-relationalml-53) at CISPA Helmholtz Center for Information Security.
+
 * We received an ERC Starting Grant in 2023 ([SPARSE-ML](https://cispa.de/en/research/grants/sparse-ml)) and had several open positions for PhD students and Postdocs:
 *[PhD position in sparse machine learning](https://euraxess.ec.europa.eu/jobs/144401).
 *[Postdoctoral position in sparse machine learning](https://euraxess.ec.europa.eu/jobs/144392).