29th International Conference on Digital Audio Effects (DAFx26), Cambridge, MA, USA, 1–4 September 2026

Audio To Audio Via Diffusion Warm Initialization

Cristóbal Andrade1, Sebastian J. Schlecht1

1Multimedia Communications and Signal Processing
Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany

Abstract

In this paper, we propose diffusion warm initialization as a simple yet effective approach for a range of audio-to-audio transformation tasks. To illustrate the generality of the approach, we demonstrate its use in timbre transfer, MIDI-to-Real synthesis, and multiple audio enhancement tasks. We conduct a detailed empirical analysis on timbre transfer to investigate the role of the initialization time $t_\text{init}$. The effect of $t_\text{init}$ is evaluated using pitch-based Jaccard Distance and Fréchet Audio Distance to quantify faithfulness to the input signal and alignment with the target distribution. Our results provide practical guidance for selecting $t_\text{init}$ and show that, once properly chosen, a single pretrained diffusion model combined with warm initialization can support multiple transformation objectives without task-specific training or conditioning. Despite its simplicity, this approach already achieves competitive results when compared with more complex pipelines designed specifically for these tasks. We further observe that warm initialization does not necessarily require explicit noise injection, as the guide signal itself can often serve as a valid initialization state for the backward diffusion process. Together, these findings show that warm initialization provides a simple and effective framework that serves as a fundamental building block for more complex audio transformation pipelines.
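
The idea of warm initialization described above can be illustrated with a minimal sketch: the guide signal is noised to the level that corresponds to $t_\text{init}$, and the reverse process starts from that state instead of from pure noise. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear schedule `sigma`, the closed-form `toy_denoiser` (a stand-in for a pretrained network, exact only for a Gaussian target), the deterministic Euler sampler, and all function and parameter names.

```python
import numpy as np


def sigma(t, sigma_max=10.0):
    """Hypothetical noise schedule: noise std grows linearly with t in [0, 1]."""
    return sigma_max * t


def toy_denoiser(x, sig, mu=0.0, s=1.0):
    """Stand-in for a pretrained denoiser (assumption, not the paper's model).

    Returns the exact posterior mean E[x0 | x] when x0 ~ N(mu, s^2) and
    x = x0 + sig * noise. A real system would call a trained network here.
    """
    return mu + (s**2 / (s**2 + sig**2)) * (x - mu)


def warm_init_sample(guide, t_init, denoiser=toy_denoiser, n_steps=50, seed=0):
    """Noise the guide signal to time t_init, then integrate back to t = 0."""
    guide = np.asarray(guide, dtype=float)
    if t_init <= 0.0:
        return guide.copy()  # no noise and no denoising: output equals the guide
    rng = np.random.default_rng(seed)
    ts = np.linspace(t_init, 0.0, n_steps + 1)
    # Warm initialization: start from the noised guide, not from pure noise.
    x = guide + sigma(ts[0]) * rng.standard_normal(guide.shape)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        sig_cur, sig_next = sigma(t_cur), sigma(t_next)
        # Euler step of the ODE dx/dsigma = (x - D(x, sigma)) / sigma.
        slope = (x - denoiser(x, sig_cur)) / sig_cur
        x = x + (sig_next - sig_cur) * slope
    return x


# Usage: a small t_init stays close to the guide, a large one moves toward
# the (toy) target distribution and keeps little of the guide.
guide = 3.0 * np.sin(np.linspace(0.0, 20.0, 1000))
faithful = warm_init_sample(guide, t_init=0.01)
transformed = warm_init_sample(guide, t_init=1.0)
```

In this toy setting the trade-off is visible directly: the smaller the initialization time, the more of the guide survives, while larger values let the denoiser's prior dominate.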

Timbre Transfer

| Example | Guide Signal | Ours | Prior work |
|---|---|---|---|
| Oboe to Piano | | | |
| String to Clarinet | | | Wavetransfer [1], DiffTransfer [2] |
| Synth to Violin | | | Diff Mutual Information [3] |
| Violin to Flute | | | Latent DiffBridge [4] |
| Computer Mouse to Piano | | | |
| Nature to Piano | | | |

Effect of $\tau_\text{init}$

| Guide signal | $\tau_\text{init} = 0.55$ | $\tau_\text{init} = 0.70$ | $\tau_\text{init} = 0.80$ | $\tau_\text{init} = 0.85$ | $\tau_\text{init} = 0.90$ |
|---|---|---|---|---|---|
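
The abstract states that the effect of the initialization time is measured with a pitch-based Jaccard distance, which quantifies faithfulness to the input. This page does not say how the pitch sets are built, so the sketch below assumes a simple variant: each signal is represented as a boolean piano roll (pitch by frame), and the distance is one minus the ratio of shared active cells to all active cells. The function name and the piano-roll representation are illustrative assumptions.

```python
import numpy as np


def pitch_jaccard_distance(roll_a, roll_b):
    """Jaccard distance between two boolean piano rolls of shape (pitches, frames).

    0.0 means identical pitch activity, 1.0 means no overlap at all.
    """
    a = np.asarray(roll_a, dtype=bool)
    b = np.asarray(roll_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("piano rolls must have the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both rolls are silent: treat them as identical
    intersection = np.logical_and(a, b).sum()
    return float(1.0 - intersection / union)


# Usage: the same pitch is held in both rolls, shifted by one frame.
guide_roll = np.zeros((4, 5), dtype=bool)
guide_roll[1, 0:3] = True
output_roll = np.zeros((4, 5), dtype=bool)
output_roll[1, 1:4] = True
distance = pitch_jaccard_distance(guide_roll, output_roll)  # 2 shared of 4 active cells
```

Under this reading, a low distance means the transformed audio keeps the pitch content of the guide signal.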

Effect of the noise

| Guide signal | $\lambda = 1.00$ | $\lambda = 0.75$ | $\lambda = 0.50$ | $\lambda = 0.25$ | $\lambda = 0.00$ |
|---|---|---|---|---|---|
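
This page does not define λ. One plausible reading, consistent with the abstract's remark that explicit noise injection is not strictly required, is that λ scales the noise added to the guide signal before the reverse process starts, so that λ = 1 gives the full noise level and λ = 0 uses the guide signal itself as the initial state. The sketch below follows that assumed reading; `initial_state`, `sigma_init`, and `lam` are hypothetical names.

```python
import numpy as np


def initial_state(guide, sigma_init, lam, seed=0):
    """Build the starting state of the reverse process from a guide signal.

    Assumption: lam in [0, 1] scales the injected Gaussian noise, whose
    standard deviation at the initialization time is sigma_init.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    guide = np.asarray(guide, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(guide.shape)
    return guide + lam * sigma_init * noise


# Usage: sweep the same values of lambda as the table above.
guide = np.sin(np.linspace(0.0, 10.0, 2000))
states = {lam: initial_state(guide, sigma_init=2.0, lam=lam)
          for lam in (1.00, 0.75, 0.50, 0.25, 0.00)}
```

With λ = 0 the returned state equals the guide, which corresponds to starting the reverse process without any added noise.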

MIDI to Real

| Example | MIDI input | Ours | Prior work |
|---|---|---|---|
| Violin | | | DDSP (Heuristic) [7], Midi2Params (Aligned) [7], Midi2Params (Transcribed) [7] |
| French Horn | | | CoSaRef (SDEdit) [5], CoSaRef (Zeta) [5], midi-ddsp [6] |
| Piano 1 | | | |
| Piano 2 | | | |
| Piano 3 | | | |

Audio Enhancement

| Example | Input | Ours | Prior work | Ground truth |
|---|---|---|---|---|
| Guitar w/ background conversation | | | | |
| Trumpet | | | Mel2Mel DiffWave [8], Mel2Mel GL [8], Demucs [9] | |
| Birds chirping | | | AudioSep [10] | |
| Declipping - 10 dB | | | a-spade [11], cqtdiff [12], ss-pew [13] | |
| Declipping - 1 dB | | | a-spade [11], cqtdiff [12], ss-pew [13] | |
| Keyboard typing | | | AudioSep [10] | |
| Water drops | | | AudioSep [10] | |

Limitations

Example Input Ours Prior work Ground truth
Humming to Piano
Sketch to Sound
Sketch2sound [14]
Stem Separation
Flute and Guitar
Guitar
Flute
Guitar AudioSep [10]
Stem Separation
Guitar and Drums
Drums
Guitar
Drums
Guitar

References

  1. T. Baoueb, X. Bie, H. Janati, and G. Richard, “Wavetransfer: A flexible end-to-end multi-instrument timbre transfer with diffusion,” arXiv preprint arXiv:2409.15321, 2024.
  2. L. Comanducci, F. Antonacci, and A. Sarti, “Timbre transfer using image-to-image denoising diffusion implicit models,” in Proc. ISMIR, 2023.
  3. C. H. Lee, J. Nistal, S. Lattner, M. Pasini, and G. Fazekas, “Diffusion timbre transfer via mutual information guided inpainting,” arXiv preprint arXiv:2601.01294, 2026.
  4. M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.-H. Lai, S. Uhlich, J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, and Y. Mitsufuji, “Latent diffusion bridges for unsupervised musical audio timbre transfer,” arXiv preprint arXiv:2409.06096, 2025.
  5. O. Take and T. Akama, “Annotation-free midi-to-audio synthesis via concatenative synthesis and generative refinement,” arXiv preprint arXiv:2410.16785, 2025.
  6. Y. Wu, E. Manilow, Y. Deng, R. S. Swavely, K. Kastner, T. Cooijmans, A. Courville, C.-Z. A. Huang, and J. Engel, “Midi-ddsp: Detailed control of musical performance via hierarchical modeling,” arXiv preprint arXiv:2112.09312, 2022.
  7. R. Castellon, C. Donahue, and P. Liang, “Towards realistic MIDI instrument synthesizers,” in Proceedings of the NeurIPS Workshop on Machine Learning for Creativity and Design, 2020.
  8. N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” arXiv preprint arXiv:2204.13289, 2022.
  9. A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Proc. Interspeech, 2020.
  10. X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, R. Xia, Y. Wang, M. D. Plumbley, and W. Wang, “Separate Anything You Describe (AudioSep).”
  11. S. Kitić, N. Bertin, and R. Gribonval, “Sparsity and cosparsity for audio declipping: A flexible non-convex approach,” in Proc. Int. Conf. Latent Variable Analysis Signal Separation, Aug. 2015, pp. 243–250.
  12. E. Moliner, J. Lehtinen, and V. Välimäki, “Solving audio inverse problems with a diffusion model,” arXiv preprint arXiv:2210.15228, 2022.
  13. K. Siedenburg, M. Kowalski, and M. Dörfler, “Audio declipping with social sparsity,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2014, pp. 1577–1581.
  14. H. F. Garcia, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman, “Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,” arXiv preprint arXiv:2412.08550, 2025.