29th International Conference on Digital Audio Effects (DAFx26), Cambridge, MA, USA, 1–4 September 2026

Audio To Audio Via Diffusion Warm Initialization

Cristóbal Andrade¹, Sebastian J. Schlecht¹

¹Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany

Abstract

In this paper, we propose diffusion warm initialization as a simple yet effective approach for a range of audio-to-audio transformation tasks. To illustrate the generality of the approach, we demonstrate its use in timbre transfer, MIDI-to-Real synthesis, and multiple audio enhancement tasks. We conduct a detailed empirical analysis on timbre transfer to investigate the role of the initialization time $t_\text{init}$. The effect of $t_\text{init}$ is evaluated using pitch-based Jaccard Distance and Fréchet Audio Distance to quantify faithfulness to the input signal and alignment with the target distribution. Our results provide practical guidance for selecting $t_\text{init}$ and show that, once properly chosen, a single pretrained diffusion model combined with warm initialization can support multiple transformation objectives without task-specific training or conditioning. Despite its simplicity, this approach already achieves competitive results when compared with more complex pipelines designed specifically for these tasks. We further observe that warm initialization does not necessarily require explicit noise injection, as the guide signal itself can often serve as a valid initialization state for the backward diffusion process. Together, these findings show that warm initialization provides a simple and effective framework that serves as a fundamental building block for more complex audio transformation pipelines.
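As an illustration of the faithfulness metric named in the abstract, the following is a minimal sketch of a pitch-based Jaccard distance. It assumes both signals have already been transcribed to binary piano rolls of shape (pitch, frame). The transcription step, the function name `pitch_jaccard_distance`, and the toy piano rolls are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def pitch_jaccard_distance(roll_a, roll_b):
    """Jaccard distance between two binary piano rolls (pitch x frame).

    Returns 0.0 when the active pitch/frame cells match exactly and
    1.0 when they do not overlap at all.
    """
    a = np.asarray(roll_a, dtype=bool)
    b = np.asarray(roll_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("piano rolls must have the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both rolls are silent: treat them as identical
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union

# Toy example (hypothetical data): the guide holds C4 for four frames,
# the output plays C4 for two frames and then D4 for two frames.
guide_roll = np.zeros((128, 4), dtype=bool)
output_roll = np.zeros((128, 4), dtype=bool)
guide_roll[60, :] = True
output_roll[60, :2] = True
output_roll[62, 2:] = True

distance = pitch_jaccard_distance(guide_roll, output_roll)
```

In this toy case two cells are shared out of six active cells in total, so the distance is 1 − 2/6 ≈ 0.67. A lower value indicates that the output preserves more of the pitch content of the guide.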

Timbre Transfer

Example	Guide Signal	Ours	Prior work
Oboe to Piano			—
String to Clarinet			Wavetransfer [1] DiffTransfer [2]
Synth to Violin			Diff Mutual Information [3]
Violin to Flute			Latent DiffBridge [4]
Computer Mouse to Piano			—
Nature to Piano			—

Effect of $t_\text{init}$

Guide signal	$t_\text{init}$ = 0.55	$t_\text{init}$ = 0.70	$t_\text{init}$ = 0.80	$t_\text{init}$ = 0.85	$t_\text{init}$ = 0.90

Effect of the noise

Guide signal	λ = 1.00	λ = 0.75	λ = 0.50	λ = 0.25	λ = 0.00
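The two sweeps above vary the initialization time and the amount of injected noise. The following sketch shows one way these two parameters could enter a warm-initialized sampler. It is not the authors' implementation: it assumes a variance-exploding schedule with noise level σ(t) = t, reads λ as a scale on the injected noise (so λ = 0 starts directly from the guide signal), uses a plain Euler solver, and replaces the pretrained model with a hypothetical `toy_denoiser` that is exact only for unit-variance Gaussian data. All function names are illustrative.

```python
import numpy as np

def warm_init(guide, t_init, lam=1.0, rng=None):
    """Starting state for the backward process: guide plus scaled noise.

    Assumes sigma(t) = t, so full noise injection (lam = 1) adds Gaussian
    noise with standard deviation t_init; lam = 0 returns the guide itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(guide.shape)
    return guide + lam * t_init * noise

def sample_from(x, t_init, denoise, n_steps=20):
    """Euler integration of dx/dt = (x - D(x, t)) / t from t_init down to 0."""
    if t_init <= 0:
        return x  # nothing to denoise
    ts = np.linspace(t_init, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        direction = (x - denoise(x, t)) / t
        x = x + (t_next - t) * direction
    return x

def toy_denoiser(x, t):
    # Hypothetical stand-in for a pretrained diffusion model:
    # the exact posterior mean when clean data is distributed as N(0, 1).
    return x / (1.0 + t**2)

rng = np.random.default_rng(0)
guide = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))

x_init = warm_init(guide, t_init=0.8, lam=0.5, rng=rng)
output = sample_from(x_init, t_init=0.8, denoise=toy_denoiser)
```

Under these assumptions, a larger initialization time lets the model move the signal further toward its learned distribution, while a smaller one keeps the output closer to the guide. With a real pretrained model, `toy_denoiser` would be replaced by the network's denoised estimate at noise level `t`.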

MIDI to Real

Example	MIDI input	Ours	Prior work
Violin			DDSP (Heuristic) [7] MIDI2Params (Aligned) [7] MIDI2Params (Transcribed) [7]
French Horn			CoSaRef (SDEdit) [5] CoSaRef (Zeta) [5] MIDI-DDSP [6]
Piano 1			—
Piano 2			—
Piano 3			—

Audio Enhancement

Example	Prior work	Ground truth
Guitar w/ background conversation	—
Trumpet	Mel2Mel DiffWave [8] Mel2Mel GL [8] Demucs [9]
Birds chirping	AudioSep [10]	—
Declipping - 10 dB	a-spade [11] cqtdiff [12] ss-pew [13]	—
Declipping - 1 dB	a-spade [11] cqtdiff [12] ss-pew [13]	—
Keyboard typing	AudioSep [10]
Water drops	AudioSep [10]

Limitations

Example	Ours	Prior work	Ground truth
Humming to Piano		—	—
Sketch to Sound		Sketch2sound [14]	—
Stem Separation Flute and Guitar	Guitar Flute	Guitar AudioSep [10]	—
Stem Separation Guitar and Drums	Drums Guitar	—	Drums Guitar

References

[1] T. Baoueb, X. Bie, H. Janati, and G. Richard, “Wavetransfer: A flexible end-to-end multi-instrument timbre transfer with diffusion,” arXiv preprint arXiv:2409.15321, 2024.
[2] L. Comanducci, F. Antonacci, and A. Sarti, “Timbre transfer using image-to-image denoising diffusion implicit models,” in Proc. ISMIR, 2023.
[3] C. H. Lee, J. Nistal, S. Lattner, M. Pasini, and G. Fazekas, “Diffusion timbre transfer via mutual information guided inpainting,” arXiv preprint arXiv:2601.01294, 2026.
[4] M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.-H. Lai, S. Uhlich, J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, and Y. Mitsufuji, “Latent diffusion bridges for unsupervised musical audio timbre transfer,” arXiv preprint arXiv:2409.06096, 2025.
[5] O. Take and T. Akama, “Annotation-free MIDI-to-audio synthesis via concatenative synthesis and generative refinement,” arXiv preprint arXiv:2410.16785, 2025.
[6] Y. Wu, E. Manilow, Y. Deng, R. S. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, and J. Engel, “MIDI-DDSP: Detailed control of musical performance via hierarchical modeling,” arXiv preprint arXiv:2112.09312, 2022.
[7] R. Castellon, C. Donahue, and P. Liang, “Towards realistic MIDI instrument synthesizers,” in Proc. NeurIPS Workshop on Machine Learning for Creativity and Design, 2020.
[8] N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” arXiv preprint arXiv:2204.13289, 2022.
[9] A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Proc. Interspeech, 2020.
[10] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, R. Xia, Y. Wang, M. D. Plumbley, and W. Wang, “Separate Anything You Describe (AudioSep),”
[11] S. Kitić, N. Bertin, and R. Gribonval, “Sparsity and cosparsity for audio declipping: A flexible non-convex approach,” in Proc. Int. Conf. Latent Variable Analysis Signal Separation, Aug. 2015, pp. 243–250.
[12] E. Moliner, J. Lehtinen, and V. Välimäki, “Solving audio inverse problems with a diffusion model,” arXiv preprint arXiv:2210.15228, 2022.
[13] K. Siedenburg, M. Kowalski, and M. Dörfler, “Audio declipping with social sparsity,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2014, pp. 1577–1581.
[14] H. F. Garcia, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman, “Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,” arXiv preprint arXiv:2412.08550, 2025.