DiffVox: Sound Differentiation Model
1. Tutorial Introduction

DiffVox was jointly released in May 2025 by a research team from Sony AI, Sony Corporation, and Queen Mary University of London. It is a model focused on vocal effects style transfer. Its core capability comes from an inference-time optimization method combined with a Gaussian prior constraint, which lets it transform a raw vocal recording into high-quality audio that audibly approximates a target reference while keeping the effect parameters in line with professional mixing practice. The related papers are "DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions" (accepted by DAFx25) and "Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior" (accepted by WASPAA 2025).
This tutorial uses a single RTX 5090 graphics card by default; a single RTX 4090 is the minimum needed to start the program.
2. Project Examples

3. Operation Steps
1. Start the container

2. Open the webpage to use the model
If "Bad Gateway" is displayed, the model is still initializing. Wait 2-3 minutes and refresh the page. In Safari, audio may not play directly in the browser and may need to be downloaded first.

Related parameter descriptions
Main controller and preset
Rapid Audio
- Purpose: The main control panel contains the core audio processing functions and the preset selection.
- Description: This is the entry point for the entire effects processing chain and coordinates the work of all effect modules.
Dry/Wet Ratio
- Purpose: Controls the mixing ratio of the dry signal (original sound) and the wet signal (processed sound).
- Description:
  - 0%: Fully dry; outputs only the original sound.
  - 50%: Equal mix of the dry and wet signals.
  - 100%: Fully wet; outputs only the processed sound.
- Application: Used to control the intensity of the effects processing and avoid over-processing.
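The three settings above can be illustrated with a minimal sketch. It assumes a simple linear crossfade between the two signals; the tutorial does not state the exact mixing law used by the demo, and the function name `mix_dry_wet` is hypothetical.

```python
def mix_dry_wet(dry, wet, wet_percent):
    """Blend two equal-length sample lists with a linear crossfade.

    Assumption: a linear mixing law. The actual demo may mix differently.
    """
    if not 0.0 <= wet_percent <= 100.0:
        raise ValueError("wet_percent must be between 0 and 100")
    if len(dry) != len(wet):
        raise ValueError("dry and wet must have the same length")
    w = wet_percent / 100.0
    return [(1.0 - w) * d + w * x for d, x in zip(dry, wet)]


dry = [1.0, 0.5]
wet = [0.0, 1.0]
print(mix_dry_wet(dry, wet, 0))    # only the dry signal
print(mix_dry_wet(dry, wet, 50))   # equal blend of both
print(mix_dry_wet(dry, wet, 100))  # only the wet signal
```

Under this assumption, lowering the ratio scales the processed signal back toward the original, which is why it works as an overall intensity control.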
Output Audio
- Purpose: The final mixed output audio.
- Description: The complete result after all effects processing and dry/wet mixing.
Dry Audio
- Purpose: Raw, unprocessed audio without any effects.
- Description: It preserves the original characteristics of the recording, making it suitable for comparison or further post-processing.
Wet Audio
- Purpose: The wet signal after all effects processing.
- Description: The sound with all effects applied, such as equalization, compression, delay, and reverb.
Select Preset (1~365)
- Purpose: Selects an effect from the preset library.
- Description:
  - Includes 365 professionally tuned effect presets.
  - Covers a wide range of music styles and sound characteristics.
  - Can serve as a starting point for personalized adjustments.
Parametric equalizer
Parametric EQ
- Purpose: A precise tone-shaping tool.
- Description: Uses multiple filters to boost or attenuate specific frequency bands and so shape the spectrum of the sound.
High-pass filter
- Purpose: Removes low-frequency components below a specified frequency.
- Application:
  - Remove low-frequency noise such as breath noise and wind noise.
  - Reduce muddiness and increase clarity.
  - Typical settings: 80-120 Hz
Low Shelf (low-shelf filter)
- Purpose: Boosts or attenuates all frequencies below the shelf frequency.
- Application:
  - Increase the thickness and warmth of the sound.
  - Reduce low-frequency boominess.
  - Typical frequency: 100-250 Hz
Peak Filter
- Purpose: Precise adjustment around a specific frequency.
- Application:
  - Remove resonant peaks.
  - Enhance the sense of presence in vocals.
  - Correct timbre issues in specific frequency bands.
High Shelf (high-shelf filter)
- Purpose: Boosts or attenuates all frequencies above the shelf frequency.
- Application:
  - Increase airiness and brightness.
  - Reduce harsh high frequencies.
  - Typical frequency: 8-12 kHz
Frequency
- Purpose: Selects the center frequency to process.
- Description: Determines the frequency at which the filter operates.
Gain
- Purpose: Controls how much the selected frequency is boosted or attenuated.
- Range: -12 dB to +12 dB
- Positive value: boosts this frequency.
- Negative value: attenuates this frequency.
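To give a feel for what the decibel range means, the standard amplitude conversion can be sketched as follows. The helper name `db_to_linear` is hypothetical; the formula itself is the usual 20·log10 amplitude convention.

```python
def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


for gain_db in (-12.0, -6.0, 0.0, 6.0, 12.0):
    print(f"{gain_db:+.0f} dB -> amplitude x {db_to_linear(gain_db):.3f}")
```

At the ends of the range, +12 dB is roughly a fourfold amplitude increase and -12 dB roughly a fourfold decrease, so small slider moves already have an audible effect.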
Q
- Purpose: Controls the width of the affected frequency range.
- Description:
  - High Q value: narrow range of influence, highly targeted.
  - Low Q value: wide range of influence, smooth effect.
- Application: A narrow (high) Q is used for precise correction, while a wide (low) Q is used for broad tonal adjustment.
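How Frequency, Gain, and Q interact in a peak filter can be sketched with the widely used Audio EQ Cookbook (RBJ) peaking biquad. This is an assumption for illustration only: the tutorial does not say which filter design DiffVox uses, and the function names and the 44.1 kHz sample rate are hypothetical choices.

```python
import cmath
import math


def peaking_biquad(f0, gain_db, q, fs=44100.0):
    """Audio EQ Cookbook peaking filter; returns normalized (b, a) coefficients."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / amp
    b = [(1.0 + alpha * amp) / a0, -2.0 * math.cos(w0) / a0, (1.0 - alpha * amp) / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha / amp) / a0]
    return b, a


def magnitude_db(b, a, f, fs=44100.0):
    """Magnitude response of the biquad at frequency f, in dB."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * f / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


narrow = peaking_biquad(1000.0, 6.0, q=8.0)
wide = peaking_biquad(1000.0, 6.0, q=0.5)
for f in (500.0, 1000.0, 2000.0):
    print(f, round(magnitude_db(*narrow, f), 2), round(magnitude_db(*wide, f), 2))
```

Both filters reach the full gain at the center frequency, but one octave away the high-Q filter has almost no effect while the low-Q filter still boosts noticeably.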
Compressors and expanders
Compressor and Expander
- Purpose: Dynamic range processor.
- Function: The compressor reduces the dynamic range, while the expander increases it.
Threshold
- Purpose: Sets the level at which compression begins.
- Description:
  - Signals above this level are compressed.
  - Signals below this level are not compressed (the expander has its own threshold, Exp. Threshold).
- Range: -60 dB to 0 dB
Comp. Ratio (compression ratio)
- Purpose: Controls the intensity of compression.
- Description:
  - 2:1: mild compression
  - 4:1: medium compression
  - 10:1: strong compression
  - ∞:1: limiter effect
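The ratios above can be read off a static compression curve. The sketch below assumes a textbook hard-knee compressor working on levels in dB; the function name `compressor_output_db` is hypothetical and the demo's actual gain computer may differ.

```python
def compressor_output_db(level_db, threshold_db, ratio):
    """Static hard-knee compression curve (textbook form, levels in dB)."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    if level_db <= threshold_db:
        return level_db  # below the threshold the signal is left alone
    # Above the threshold, every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (level_db - threshold_db) / ratio


threshold = -20.0
for ratio in (2.0, 4.0, 10.0, float("inf")):
    out = compressor_output_db(-10.0, threshold, ratio)
    print(f"ratio {ratio}:1 -> input -10 dB becomes {out:.1f} dB")
```

With an infinite ratio the output never rises above the threshold, which is why that setting behaves like a limiter.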
Make up (gain compensation)
- Purpose: Compensates for the level lost after compression.
- Application: Brings the volume after compression back to roughly the level before compression.
Attack Time
- Purpose: Controls how quickly the compressor starts working.
- Description:
  - Fast attack: the compressor clamps down on transients quickly, giving a smoother sound.
  - Slow attack: transients pass through before the compression engages, preserving impact.
- Range: 0.1-100 ms
Release Time
- Purpose: Controls how quickly the compressor stops working.
- Description:
  - Fast release: the gain recovers quickly, which may produce an audible pumping effect.
  - Slow release: the gain recovers more slowly, giving a more natural result.
- Range: 50-1000 ms
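Attack and release are commonly modeled as two one-pole smoothers applied to the gain reduction. The following sketch uses the textbook time-constant formula, which is an assumption here; DiffVox may map milliseconds to coefficients differently, and both function names are hypothetical.

```python
import math


def time_to_coef(time_ms, fs=44100.0):
    """One-pole smoothing coefficient for a time constant in milliseconds."""
    return math.exp(-1.0 / (time_ms * 0.001 * fs))


def smooth_gain_db(target_db, attack_ms, release_ms, fs=44100.0):
    """Smooth a sequence of target gain-reduction values (dB, 0 or negative)."""
    a_att = time_to_coef(attack_ms, fs)
    a_rel = time_to_coef(release_ms, fs)
    out, g = [], 0.0
    for t in target_db:
        # More reduction requested than currently applied: attack phase.
        coef = a_att if t < g else a_rel
        g = coef * g + (1.0 - coef) * t
        out.append(g)
    return out


# A 6 dB reduction is requested for 50 samples, then released.
fs = 1000.0
target = [-6.0] * 50 + [0.0] * 50
curve = smooth_gain_db(target, attack_ms=10.0, release_ms=100.0, fs=fs)
print(round(curve[9], 2), round(curve[49], 2), round(curve[99], 2))
```

In this model the gain reduction covers about 63% of the requested change after one time constant, so a longer attack leaves more of each transient untouched.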
Exp. Ratio
- Purpose: Controls the intensity of expansion.
- Description:
  - 1:2: for every 1 dB the input falls below the threshold, the output falls 2 dB.
  - 1:10: strong expansion that effectively reduces noise.
- Range: 0-1 (the value is actually the reciprocal of the expansion ratio)
Exp. Threshold
- Purpose: Sets the level at which the expander starts working.
- Description: Signals below this threshold are attenuated further.
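The expander settings can be sketched as a static downward-expansion curve. The parameter `exp_ratio` follows the 0-1 convention described above (0.5 for 1:2, 0.1 for 1:10); the hard-knee form and the function name `expander_output_db` are assumptions, and a value of exactly 0 is excluded in this sketch to avoid division by zero.

```python
def expander_output_db(level_db, threshold_db, exp_ratio):
    """Static downward-expansion curve (textbook hard-knee form, levels in dB).

    exp_ratio is the reciprocal of the expansion ratio, e.g. 0.5 for 1:2.
    """
    if not 0.0 < exp_ratio <= 1.0:
        raise ValueError("exp_ratio must be in (0, 1]")
    if level_db >= threshold_db:
        return level_db  # above the threshold the signal is left alone
    # Below the threshold, the distance to the threshold is stretched.
    return threshold_db + (level_db - threshold_db) / exp_ratio


threshold = -40.0
for exp_ratio in (1.0, 0.5, 0.1):
    out = expander_output_db(-50.0, threshold, exp_ratio)
    print(f"exp_ratio {exp_ratio} -> input -50 dB becomes {out:.1f} dB")
```

A signal 10 dB below the threshold ends up 20 dB below it at 1:2 and 100 dB below it at 1:10, which is why strong expansion pushes background noise toward silence.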
RMS Averaging coefficient
- Purpose: Controls how quickly the compressor's level detector responds to the signal.
- Description:
  - High value: sensitive to average volume, smooth response.
  - Low value: sensitive to instantaneous peaks, fast response.
- Application: Adjust the response characteristics according to the music style and your needs.
Ping-pong delay
Ping-Pong Delay
- Purpose: Stereo delay effect.
- Features: The echoes alternate between the left and right channels.
Delay Time
- Purpose: Controls the time interval between echoes.
- Range: 100-1000 ms
- Application:
  - Short delay: adds a sense of space and depth.
  - Long delay: creates a clearly audible echo.
Feedback
- Purpose: Controls the number of echo repetitions.
- Description:
  - Low feedback: only a few echoes.
  - High feedback: many repetitions; values close to 1 may lead to self-oscillation.
- Range: 0-1
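The link between the feedback value and the number of audible repeats can be sketched under a simple assumption: each pass through the feedback loop multiplies the echo amplitude by the feedback value, with no extra filtering. The function name `repeats_until_floor` and the -60 dB floor are hypothetical choices.

```python
import math


def repeats_until_floor(feedback, floor_db=-60.0):
    """Smallest number of feedback passes after which the echo is at or below floor_db.

    Assumption: each pass scales the amplitude by `feedback` and nothing else.
    """
    if not 0.0 < feedback < 1.0:
        raise ValueError("feedback must be strictly between 0 and 1")
    per_pass_db = 20.0 * math.log10(feedback)  # negative value
    return math.ceil(floor_db / per_pass_db)


for feedback in (0.2, 0.5, 0.9):
    print(f"feedback {feedback}: about {repeats_until_floor(feedback)} repeats")
```

As the value approaches 1 the loss per pass shrinks toward 0 dB, so the repeats barely decay; that is the regime where self-oscillation becomes a risk.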
Gain
- Purpose: Controls the volume of the delay effect.
- Range: -80 dB to 0 dB
Odd/Even Delay Pan
- Purpose: Controls the stereo position of the odd-numbered and even-numbered echoes separately.
- Description:
  - -100: fully left channel
  - 0: centered
  - 100: fully right channel
- Application: Creates a sense of movement across the stereo field.
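The alternating placement of odd and even echoes can be sketched as follows. The constant-power pan law used here is a common choice but an assumption; the tutorial does not state which pan law the demo applies, and both function names are hypothetical.

```python
import math


def pan_gains(pan):
    """Map a pan value in [-100, 100] to (left, right) gains.

    Assumption: constant-power pan law; the demo may use a different law.
    """
    if not -100.0 <= pan <= 100.0:
        raise ValueError("pan must be between -100 and 100")
    theta = (pan / 100.0 + 1.0) * math.pi / 4.0  # 0 at full left, pi/2 at full right
    return math.cos(theta), math.sin(theta)


def echo_positions(odd_pan, even_pan, count):
    """(left, right) gains for echoes 1..count, alternating odd and even pans."""
    return [pan_gains(odd_pan if n % 2 == 1 else even_pan) for n in range(1, count + 1)]


for n, (left, right) in enumerate(echo_positions(-100.0, 100.0, 4), start=1):
    print(f"echo {n}: left={left:.2f} right={right:.2f}")
```

Setting the two pans to opposite extremes gives the classic bouncing effect, while moving both toward 0 narrows the echoes toward the center.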
Low Pass Frequency
- Purpose: Applies low-pass filtering to the delayed echoes.
- Application:
  - Simulates the high-frequency loss of natural decay.
  - Creates a warm, non-harsh echo.
Reverb Send
- Purpose: Sets the amount of delay signal sent to the reverb.
- Application: Adds a sense of space to the delayed echoes for a more natural effect.
FDN reverb
FDN Reverb
- Purpose: High-quality digital reverb effect.
- Features: Based on a feedback delay network (FDN), it provides natural spatial simulation.
Tone Correction (PEQ)
- Purpose: The equalizer inside the reverb effect.
- Function:
  - Adjusts the frequency response of the reverb tail.
  - Controls the brightness or warmth of the reverb.
  - Prevents the reverb from clashing with the main sound.
Decay Time
- Purpose: Controls the decay time of the reverb.
- Description:
  - Short decay: small room effect
  - Long decay: hall or church effect
- Range: 0-9 seconds
- Application: Adjust the reverb duration according to the size of the space being simulated.
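In a feedback delay network, a decay time is usually turned into a per-delay-line feedback gain. The sketch below uses the standard 60 dB decay (T60) relation; whether DiffVox parameterizes its FDN exactly this way is an assumption, and the function name, the 44.1 kHz sample rate, and the delay length are hypothetical.

```python
import math


def fdn_line_gain(delay_samples, decay_time_s, fs=44100.0):
    """Feedback gain so that a delay line loses 60 dB over decay_time_s seconds."""
    if decay_time_s <= 0.0:
        raise ValueError("decay_time_s must be positive")
    return 10.0 ** (-3.0 * delay_samples / (fs * decay_time_s))


delay = 1000  # delay line length in samples (illustrative value)
for decay_time in (0.5, 2.0, 9.0):
    gain = fdn_line_gain(delay, decay_time)
    print(f"decay {decay_time} s -> feedback gain {gain:.4f} "
          f"({20.0 * math.log10(gain):.2f} dB per pass)")
```

Longer decay times push the gain closer to 1, so each pass through the delay line loses less energy and the tail rings for longer, as in a hall or church.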
Citation Information
The citation information for this project is as follows:
@inproceedings{ycy2025diffvox,
title={DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions},
author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Ben Hayes and Wei-Hsiang Liao and György Fazekas and Yuki Mitsufuji},
year={2025},
booktitle={Proc. DAFx},
}
@inproceedings{ycy2025ito,
title={Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior},
author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Wei-Hsiang Liao and Yuki Mitsufuji and György Fazekas},
year={2025},
booktitle={Proc. WASPAA},
}