29th International Conference on Digital Audio Effects (DAFx26), Cambridge, MA, USA, 1–4 September 2026
Audio To Audio Via Diffusion Warm Initialization
Cristóbal Andrade¹, Sebastian J. Schlecht¹
¹Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany
Abstract
We propose diffusion warm initialization as a simple yet effective approach to a range of audio-to-audio transformation tasks. To illustrate its generality, we demonstrate it on timbre transfer, MIDI-to-Real synthesis, and several audio enhancement tasks. We conduct a detailed empirical analysis on timbre transfer to investigate the role of the initialization time $t_\text{init}$. Its effect is evaluated with pitch-based Jaccard Distance, which quantifies faithfulness to the input signal, and Fréchet Audio Distance, which quantifies alignment with the target distribution. Our results provide practical guidance for selecting $t_\text{init}$ and show that, once it is properly chosen, a single pretrained diffusion model combined with warm initialization can support multiple transformation objectives without task-specific training or conditioning. Despite its simplicity, the approach is already competitive with more complex pipelines designed specifically for these tasks. We further observe that warm initialization does not necessarily require explicit noise injection, as the guide signal itself can often serve as a valid initialization state for the backward diffusion process. Together, these findings show that warm initialization is a simple and effective framework that can serve as a building block for more complex audio transformation pipelines.
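The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the geometric noise schedule `sigma`, the Euler sampler, the noise scale `lam`, and the Gaussian `toy_denoiser` that stands in for a pretrained diffusion model are all assumptions made for illustration. The sketch starts the backward process from the guide signal at time `t_init` instead of from pure noise, and setting `lam=0` corresponds to using the guide signal itself as the initial state.

```python
import numpy as np

def sigma(t, sigma_min=1e-3, sigma_max=1.0):
    """Hypothetical geometric noise schedule for t in [0, 1] (an assumption)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def toy_denoiser(x, s, mu=0.0, s0=0.1):
    """Posterior-mean denoiser for a toy Gaussian target N(mu, s0^2).
    Stands in for a pretrained diffusion model (assumption)."""
    return mu + (s0**2 / (s0**2 + s**2)) * (x - mu)

def warm_init(guide, t_init, lam=1.0, rng=None):
    """Initial state: guide plus lam-scaled Gaussian noise at level sigma(t_init).
    lam=0 uses the guide signal itself, with no explicit noise injection."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(guide.shape)
    return guide + lam * sigma(t_init) * noise

def reverse_diffusion(x, t_init, denoiser, n_steps=50):
    """Deterministic Euler steps of the probability-flow ODE from t_init to 0."""
    ts = np.linspace(t_init, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        s_cur, s_next = sigma(t_cur), sigma(t_next)
        d = (x - denoiser(x, s_cur)) / s_cur  # direction dx/dsigma
        x = x + (s_next - s_cur) * d
    return x

# Sweep t_init on a toy guide signal; the target distribution is centred at 0.
guide = np.full(16, 2.0)
for t_init in (0.55, 0.70, 0.80, 0.85, 0.90):
    out = reverse_diffusion(warm_init(guide, t_init, lam=0.0), t_init, toy_denoiser)
    print(t_init, np.linalg.norm(out - guide), np.linalg.norm(out))
```

In this toy setting, a larger `t_init` moves the output further from the guide and closer to the target distribution, which mirrors the trade-off between faithfulness and target alignment that the analysis of $t_\text{init}$ addresses.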
Timbre Transfer
Example
Guide Signal
Ours
Prior work
Oboe to Piano
—
String to Clarinet
Wavetransfer [1]
DiffTransfer [2]
Synth to Violin
Diff Mutual Information [3]
Violin to Flute
Latent DiffBridge [4]
Computer Mouse to Piano
—
Nature to Piano
—
Effect of $\tau_\text{init}$
Guide signal
$\tau_\text{init}$ = 0.55
$\tau_\text{init}$ = 0.70
$\tau_\text{init}$ = 0.80
$\tau_\text{init}$ = 0.85
$\tau_\text{init}$ = 0.90
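The abstract quantifies faithfulness to the guide signal with a pitch-based Jaccard Distance. The sketch below shows one way such a measure could be computed. The pitch representation is an assumption, not the authors' definition: each signal is reduced to a set of (frame index, rounded MIDI pitch) pairs taken from a frame-wise f0 track, and the helper names `pitch_set` and `jaccard_distance` are hypothetical.

```python
import numpy as np

def pitch_set(f0_hz, voiced):
    """Set of (frame, MIDI pitch) pairs for voiced frames (assumed representation)."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    frames = np.flatnonzero(voiced & (f0_hz > 0))
    # Convert Hz to the nearest MIDI note number (A4 = 440 Hz = note 69).
    midi = np.rint(69 + 12 * np.log2(f0_hz[frames] / 440.0)).astype(int)
    return set(zip(frames.tolist(), midi.tolist()))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; defined as 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy f0 tracks: the output deviates from the guide by one semitone in frame 1.
guide_pitches = pitch_set([440.0, 440.0, 0.0, 220.0], [True, True, False, True])
output_pitches = pitch_set([440.0, 466.16, 0.0, 220.0], [True, True, False, True])
print(jaccard_distance(guide_pitches, output_pitches))
```

A distance of 0 means the output preserves every pitch event of the guide signal, and a distance of 1 means none are shared.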
Effect of the noise
Guide signal
λ = 1.00
λ = 0.75
λ = 0.50
λ = 0.25
λ = 0.00
MIDI to Real
Example
MIDI input
Ours
Prior work
Violin
DDSP (Heuristic) [7]
Midi2Params (Aligned) [7]
Midi2Params (Transcribed) [7]
French Horn
CoSaRef (SDEdit) [5]
CoSaRef (Zeta) [5]
midi-ddsp [6]
Piano 1
—
Piano 2
—
Piano 3
—
Audio Enhancement
Example
Input
Ours
Prior work
Ground truth
Guitar w/ background conversation
—
Trumpet
Mel2Mel DiffWave [8]
Mel2Mel GL [8]
Demucs [9]
Birds chirping
AudioSep [10]
—
Declipping - 10 dB
a-spade [11]
cqtdiff [12]
ss-pew [13]
—
Declipping - 1 dB
a-spade [11]
cqtdiff [12]
ss-pew [13]
—
Keyboard typing
AudioSep [10]
Water drops
AudioSep [10]
Limitations
Example
Input
Ours
Prior work
Ground truth
Humming to Piano
—
—
Sketch to Sound
Sketch2sound [14]
—
Stem Separation Flute and Guitar
Guitar
Flute
Guitar AudioSep [10]
—
Stem Separation Guitar and Drums
Drums
Guitar
—
Drums
Guitar
References
[1] T. Baoueb, X. Bie, H. Janati, and G. Richard,
“Wavetransfer: A flexible end-to-end multi-instrument timbre transfer with diffusion,”
arXiv preprint arXiv:2409.15321, 2024.
[2] L. Comanducci, F. Antonacci, and A. Sarti,
“Timbre transfer using image-to-image denoising diffusion implicit models,”
in Proc. ISMIR, 2023.
[3] C. H. Lee, J. Nistal, S. Lattner, M. Pasini, and G. Fazekas,
“Diffusion timbre transfer via mutual information guided inpainting,”
arXiv preprint arXiv:2601.01294, 2026.
[4] M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.-H. Lai, S. Uhlich, J. Koo,
M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, and Y. Mitsufuji,
“Latent diffusion bridges for unsupervised musical audio timbre transfer,”
arXiv preprint arXiv:2409.06096, 2025.
[5] O. Take and T. Akama,
“Annotation-free MIDI-to-audio synthesis via concatenative synthesis and generative refinement,”
arXiv preprint arXiv:2410.16785, 2025.
[6] Y. Wu, E. Manilow, Y. Deng, R. S. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, and J. Engel,
“MIDI-DDSP: Detailed control of musical performance via hierarchical modeling,”
arXiv preprint arXiv:2112.09312, 2022.
[7] R. Castellon, C. Donahue, and P. Liang,
“Towards realistic MIDI instrument synthesizers,”
in Proc. NeurIPS Workshop on Machine Learning for Creativity and Design, 2020.
[8] N. Kandpal, O. Nieto, and Z. Jin,
“Music enhancement via image translation and vocoding,”
arXiv preprint arXiv:2204.13289, 2022.
[9] A. Défossez, G. Synnaeve, and Y. Adi,
“Real time speech enhancement in the waveform domain,”
in Proc. Interspeech, 2020.
[10] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, R. Xia, Y. Wang, M. D. Plumbley, and W. Wang,
“Separate Anything You Describe (AudioSep),”
[11] S. Kitić, N. Bertin, and R. Gribonval,
“Sparsity and cosparsity for audio declipping: A flexible non-convex approach,”
in Proc. Int. Conf. Latent Variable Analysis Signal Separation, Aug. 2015, pp. 243–250.
[12] E. Moliner, J. Lehtinen, and V. Välimäki,
“Solving audio inverse problems with a diffusion model,”
arXiv preprint arXiv:2210.15228, 2022.
[13] K. Siedenburg, M. Kowalski, and M. Dörfler,
“Audio declipping with social sparsity,”
in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2014, pp. 1577–1581.
[14] H. F. Garcia, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman,
“Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,”
arXiv preprint arXiv:2412.08550, 2025.