HyperAI

1. Tutorial Introduction

The DiffVox project was jointly released in May 2025 by a research team from Sony AI, Sony Corporation, and Queen Mary University of London. Its core capabilities are an inference-time optimization method and a Gaussian prior constraint on the effect parameters. Together they let the model transform a raw vocal recording into audio that audibly approximates a target reference while keeping the effect parameters in line with professional mixing practice. DiffVox focuses on vocal effects style transfer. The related papers are "DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions" (accepted by DAFx25) and "Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior" (accepted by WASPAA 2025).
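
To give an intuition for optimizing under a Gaussian prior, here is a toy sketch. It is not the authors' implementation: a single scalar parameter stands in for the full set of effect parameters, a quadratic mismatch term stands in for the audio loss, and the names `map_estimate`, `obs_var`, `prior_mean`, and `prior_var` are hypothetical.

```python
def map_estimate(target, obs_var, prior_mean, prior_var, steps=500, lr=0.1):
    """Gradient descent on a quadratic mismatch plus a Gaussian prior penalty.

    Toy objective (an assumption for illustration only):
        (theta - target)^2 / (2 * obs_var) + (theta - prior_mean)^2 / (2 * prior_var)
    """
    theta = prior_mean  # start from the prior mean
    for _ in range(steps):
        grad = (theta - target) / obs_var + (theta - prior_mean) / prior_var
        theta -= lr * grad
    return theta


# The reference alone would suggest 6.0; the prior is centred on 0.0.
estimate = map_estimate(target=6.0, obs_var=1.0, prior_mean=0.0, prior_var=4.0)
print(round(estimate, 3))
```

The result lands between the value suggested by the reference and the prior mean, which is the general effect of a prior: it discourages parameter values that are far from typical settings.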

This tutorial uses a single RTX 5090 graphics card as the default resource. A single RTX 4090 is the minimum needed to start the program.

2. Project Examples

3. Operation steps

1. Start the container

2. After entering the webpage, you can use the model

If "Bad Gateway" is displayed, it means the model is initializing. Please wait 2-3 minutes and refresh the page. When using Safari, audio may not play directly and needs to be downloaded first.

Related parameter descriptions

Main controller and preset

Rapid Audio

Effect: The main control panel contains the core audio processing functions and preset selection.
Description: This is the entry point for the entire effects processing chain, responsible for coordinating the work of all effect modules.

Dry/Wet Ratio

Effect: Control the mixing ratio of the dry (original) sound and the wet (processed) sound.
Description:
- 0%: Fully dry, outputs only the original sound.
- 50%: Equal mix of dry and wet sound.
- 100%: Fully wet, outputs only the processed sound.
Application: Used to control the intensity of effects processing and avoid over-processing.
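
As a minimal sketch of the three settings listed above, assuming the control is a plain linear crossfade (the actual mixing law used by the demo is not stated here, and `mix_dry_wet` is a hypothetical helper name):

```python
def mix_dry_wet(dry, wet, wet_percent):
    """Linear crossfade between two equal-length sample lists.

    wet_percent is in [0, 100]: 0 returns the dry signal, 100 the wet signal.
    """
    if not 0.0 <= wet_percent <= 100.0:
        raise ValueError("wet_percent must be between 0 and 100")
    if len(dry) != len(wet):
        raise ValueError("dry and wet must have the same length")
    w = wet_percent / 100.0
    return [(1.0 - w) * d + w * x for d, x in zip(dry, wet)]


dry = [1.0, -0.5, 0.25]
wet = [0.5, 0.5, -0.25]
for percent in (0, 50, 100):
    print(percent, mix_dry_wet(dry, wet, percent))
```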

Output Audio

Effect: The final mixed output audio.
Description: The complete result after all effects processing and dry/wet mixing.

Dry Audio

Effect: Raw, unprocessed audio without any effects.
Description: It preserves the original characteristics of the recording, making it suitable for comparison or post-processing.

Wet Audio

Effect: The wet sound after all effects processing.
Description: The sound with all effects applied, such as equalization, compression, delay, and reverb.

Select Preset (1~365)

Effect: Select a preset from the effects library.
Description:
- Includes 365 professionally tuned effect presets.
- Covers a wide range of music styles and sound characteristics.
- Can serve as a starting point for personalized adjustments.

Parametric equalizer

Parametric EQ

Effect: A precise tone adjustment tool.
Description: Multiple filters boost or attenuate specific frequency bands to shape the spectral characteristics of the sound.

High-pass filter

Effect: Remove low-frequency components below a specified frequency.
Application:
- Remove low-frequency noise such as breath noise and wind noise.
- Reduce muddiness and increase clarity.
- Typical settings: 80-120 Hz

Low Shelf (Low-Frequency Shelf-Type Equalizer)

Effect: Overall boost or attenuation of all low frequencies.
Application:
- Increase the thickness and warmth of the sound.
- Reduce low-frequency boominess.
- Typical frequency: 100-250 Hz

Peak Filter

Effect: Precise adjustment around a specific frequency.
Application:
- Eliminate resonance peaks.
- Enhance the sense of presence in vocals.
- Correct timbre issues in specific frequency bands.

High Shelf Equalizer

Effect: Overall boost or attenuation of all high frequencies.
Application:
- Increase the sense of airiness and brightness.
- Reduce harsh high frequencies.
- Typical frequency: 8-12 kHz

Frequency

Effect: Select the center frequency to process.
Description: Determines the frequency at which the filter operates.

Gain

Effect: Control how much the frequency is boosted or attenuated.
Range: -12 dB to +12 dB
Positive value: Boost this frequency.
Negative value: Attenuate this frequency.

Q

Effect: Control the width of the affected frequency range.
Description:
- High Q value: Narrow range of influence, highly targeted.
- Low Q value: Wide range of influence, smooth effect.
Application: A narrow Q is used for precise correction, while a wide Q is used for overall adjustment.
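
To make the interaction of frequency, gain, and Q concrete, the sketch below builds a peak filter from the widely used Audio EQ Cookbook biquad formulas and measures its response. This is an assumption for illustration: the exact filter parameterization inside DiffVox may differ, and `peak_filter_coeffs` and `response_db` are hypothetical helper names.

```python
import cmath
import math


def peak_filter_coeffs(freq_hz, gain_db, q, sample_rate=44100.0):
    """Biquad peak (bell) filter coefficients, Audio EQ Cookbook form."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin)
    a = (1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin)
    return b, a


def response_db(b, a, freq_hz, sample_rate=44100.0):
    """Magnitude response in dB of the biquad at one frequency."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate)  # z^-1
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


# Same +6 dB boost at 3 kHz, once with a narrow Q and once with a wide Q.
narrow = peak_filter_coeffs(3000.0, 6.0, q=8.0)
wide = peak_filter_coeffs(3000.0, 6.0, q=0.7)
for name, (b, a) in (("narrow", narrow), ("wide", wide)):
    print(name, [round(response_db(b, a, f), 2) for f in (1000.0, 2000.0, 3000.0)])
```

Both filters reach the full gain at the center frequency, but the wide-Q filter still boosts noticeably at 2 kHz while the narrow-Q filter has almost returned to 0 dB there.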

Compressors and expanders

Compressor and Expander

Effect: Dynamic range processor.
Function: The compressor reduces the dynamic range, while the expander increases it.

Threshold

Effect: Set the level at which compression begins.
Description:
- Signals above this level are compressed.
- Signals below this level are left uncompressed.
Range: -60 dB to 0 dB

Comp.Ratio (compression ratio)

Effect: Control the intensity of compression.
Description:
- 2:1: Mild compression
- 4:1: Medium compression
- 10:1: Strong compression
- ∞:1: Limiter effect

Make up (gain compensation)

Effect: Compensate for the level lost through compression.
Application: Makes the volume after compression equivalent to the volume before compression.
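
The threshold, ratio, and make-up controls can be read as a static input/output curve in dB. The sketch below assumes a hard-knee compressor, which is a simplification for illustration and not necessarily the exact curve used by the model; `compressed_level_db` is a hypothetical helper name.

```python
import math


def compressed_level_db(level_db, threshold_db, ratio, makeup_db=0.0):
    """Static hard-knee compressor curve.

    Above the threshold, every `ratio` dB of input yields 1 dB of output.
    """
    if ratio < 1.0:
        raise ValueError("compression ratio must be >= 1")
    if level_db > threshold_db:
        level_db = threshold_db + (level_db - threshold_db) / ratio
    return level_db + makeup_db


# An input 12 dB above a -20 dB threshold, at the ratios listed above.
for ratio in (2.0, 4.0, 10.0, math.inf):
    print(ratio, compressed_level_db(-8.0, -20.0, ratio))
```

With a ratio of ∞:1 the output never rises above the threshold, which is why that setting behaves like a limiter.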

Attack Time

Effect: Control how quickly the compressor starts working.
Description:
- Fast attack: Clamps down on transients quickly, resulting in a smoother sound.
- Slow attack: Lets transients pass before compression sets in, preserving impact.
Range: 0.1-100 ms

Release Time

Effect: Control how quickly the compressor stops working.
Description:
- Fast release: Rapid recovery, which may produce a pumping effect.
- Slow release: Slower dynamic recovery, resulting in a more natural effect.
Range: 50-1000 ms
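
A common way to realize attack and release times is a one-pole smoother on the requested gain reduction, with one coefficient while the reduction is growing and another while it is recovering. The sketch below follows that textbook approach as an assumption; the mapping from milliseconds to coefficients and the names `time_to_coeff` and `smooth_gain_reduction` are illustrative, not taken from the project.

```python
import math


def time_to_coeff(time_ms, sample_rate):
    """One-pole smoothing coefficient for a time constant in milliseconds."""
    return 1.0 - math.exp(-1.0 / (time_ms * 1e-3 * sample_rate))


def smooth_gain_reduction(target_db, attack_ms, release_ms, sample_rate=44100):
    """Smooth a per-sample gain-reduction request (positive dB values)."""
    attack = time_to_coeff(attack_ms, sample_rate)
    release = time_to_coeff(release_ms, sample_rate)
    state, out = 0.0, []
    for target in target_db:
        coeff = attack if target > state else release  # growing vs recovering
        state += coeff * (target - state)
        out.append(state)
    return out


# A 10 ms burst that asks for 6 dB of reduction, followed by 10 ms of silence.
request = [6.0] * 441 + [0.0] * 441
fast = smooth_gain_reduction(request, attack_ms=1.0, release_ms=100.0)
slow = smooth_gain_reduction(request, attack_ms=50.0, release_ms=100.0)
print(round(fast[440], 2), round(slow[440], 2))
```

The two printed numbers are the reduction actually applied at the end of the burst: the 1 ms attack has nearly reached the requested 6 dB, while the 50 ms attack has applied only a small part of it.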

Exp. Ratio

Effect: Control the intensity of expansion.
Description:
- 1:2: Each 1 dB the input falls below the threshold becomes 2 dB at the output.
- 1:10: Strong expansion, effectively reducing noise.
Range: 0-1 (the value is the reciprocal of the expansion ratio)

Exp. Threshold

Effect: Set the level at which the expander starts working.
Description: Signals below this threshold will be further attenuated.
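
The expander controls can be sketched as a static curve in the same way as a compressor, here assuming a hard-knee downward expander whose ratio is given as a value between 0 and 1 (0.5 for 1:2, 0.1 for 1:10). The function name `expanded_level_db` is hypothetical, and a value of exactly 0 is rejected in this sketch because it would mean infinite attenuation.

```python
def expanded_level_db(level_db, exp_threshold_db, exp_ratio):
    """Static hard-knee downward expander curve.

    exp_ratio is the reciprocal of the expansion ratio, in (0, 1].
    """
    if not 0.0 < exp_ratio <= 1.0:
        raise ValueError("exp_ratio must be in (0, 1]")
    if level_db >= exp_threshold_db:
        return level_db  # above the threshold the signal is untouched
    return exp_threshold_db + (level_db - exp_threshold_db) / exp_ratio


# Background noise 6 dB below a -50 dB expander threshold.
for exp_ratio in (1.0, 0.5, 0.1):
    print(exp_ratio, expanded_level_db(-56.0, -50.0, exp_ratio))
```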

RMS Averaging coefficient

Effect: Control how the compressor's level detector responds to the signal.
Description:
- High value: Sensitive to average volume, smooth response.
- Low value: Sensitive to instantaneous peaks, fast response.
Application: Adjust the response characteristics according to the music style and needs.

Ping-pong delay

Ping-Pong Delay

Effect: Stereo delay effect.
Features: The echo alternates between the left and right channels.

Delay Time

Effect: Control the time interval between echoes.
Range: 100-1000 ms
Application:
- Short delay: Increases the sense of space and depth.
- Long delay: Creates a noticeable echo effect.

Feedback

Effect: Control the number of echo repetitions.
Description:
- Low feedback: Only a few echoes.
- High feedback: Many repetitions, which may lead to self-oscillation.
Range: 0-1
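
To see why the feedback value sets the number of audible repeats, the sketch below assumes the simplest model, in which every pass through the delay multiplies the echo amplitude by the feedback value. Any filtering in the loop is ignored, and `echo_level_db` and `repeats_until` are hypothetical helper names.

```python
import math


def echo_level_db(feedback, n):
    """Level of the n-th echo relative to the first pass, in dB."""
    if not 0.0 < feedback < 1.0:
        raise ValueError("feedback must be strictly between 0 and 1")
    return n * 20.0 * math.log10(feedback)


def repeats_until(feedback, floor_db=-60.0):
    """Smallest number of repeats after which the echo is at or below floor_db."""
    if not 0.0 < feedback < 1.0:
        raise ValueError("feedback must be strictly between 0 and 1")
    return math.ceil(floor_db / (20.0 * math.log10(feedback)))


for feedback in (0.2, 0.5, 0.9):
    print(feedback, round(echo_level_db(feedback, 1), 2), repeats_until(feedback))
```

As the feedback value approaches 1 the echoes barely decay at all, which matches the warning about high feedback settings.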

Gain

Effect: Control the volume of the delay effect.
Range: -80 dB to 0 dB

Odd/Even Delay Pan

Effect: Control the stereo position of the odd-numbered and even-numbered echoes separately.
Description:
- -100: Fully left channel
- 0: Centered
- 100: Fully right channel
Application: Create a sense of spatial movement.
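
The sketch below shows how a pan value from -100 to 100 can be turned into left and right gains, and how odd and even echoes alternate between two pan positions. It assumes a constant-power pan law and echo numbering that starts at 1; both are illustrative choices, and `pan_gains` and `ping_pong_schedule` are hypothetical names.

```python
import math


def pan_gains(pan):
    """Constant-power (left, right) gains for a pan value in [-100, 100]."""
    if not -100.0 <= pan <= 100.0:
        raise ValueError("pan must be between -100 and 100")
    angle = (pan / 100.0 + 1.0) * math.pi / 4.0  # 0 = hard left, pi/2 = hard right
    return math.cos(angle), math.sin(angle)


def ping_pong_schedule(delay_ms, odd_pan, even_pan, count):
    """(time in ms, pan) for each echo; odd echoes use odd_pan, even use even_pan."""
    return [(n * delay_ms, odd_pan if n % 2 else even_pan)
            for n in range(1, count + 1)]


for time_ms, pan in ping_pong_schedule(250, -100, 100, 4):
    left, right = pan_gains(pan)
    print(time_ms, pan, round(left, 3), round(right, 3))
```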

Low Pass Frequency

Effect: Low-pass filtering of the delayed echoes.
Application:
- Simulate the high-frequency loss of natural decay.
- Create a warm, non-harsh echo.

Reverb Send

Effect: The amount of delay signal sent to the reverb.
Application: Adds a sense of space to the delayed echoes for a more natural effect.

FDN reverb

FDN Reverb

Effect: High-quality digital reverb effect.
Features: Based on a feedback delay network (FDN), it provides natural spatial simulation.

Tone Correction (PEQ)

Effect: The equalizer inside the reverb effect.
Function:
- Adjust the frequency response of the reverb tail.
- Control the brightness or warmth of the reverb.
- Avoid conflicts between the reverb and the main sound.

Decay Time

Effect: Control the decay time of the reverb.
Description:
- Short decay: Small room effect
- Long decay: Hall or church effect
Range: 0-9 seconds
Application: Adjust the reverb duration according to the size of the space and the needs of the mix.
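
In a feedback delay network, a decay time is commonly translated into one attenuation gain per delay line, chosen so that the signal has dropped by 60 dB after the decay time. The sketch below uses that standard relation as an assumption about how such a control can work; it is not taken from the DiffVox code, and the delay lengths and the name `fdn_line_gains` are made up for illustration.

```python
import math


def fdn_line_gains(delay_samples, decay_s, sample_rate=44100.0):
    """Per-line feedback gains so the level falls 60 dB after decay_s seconds."""
    if decay_s <= 0.0:
        raise ValueError("decay_s must be positive")
    return [10.0 ** (-3.0 * d / (decay_s * sample_rate)) for d in delay_samples]


delays = [1009, 1997, 2503]  # hypothetical delay-line lengths in samples
for decay_s in (0.5, 2.0, 9.0):
    print(decay_s, [round(g, 4) for g in fdn_line_gains(delays, decay_s)])
```

Longer decay times push every gain closer to 1, so the energy circulating in the network dies away more slowly, as in a hall or church.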

Citation Information

The citation information for this project is as follows:

@inproceedings{ycy2025diffvox,
     title={DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions}, 
     author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Ben Hayes and Wei-Hsiang Liao and György Fazekas and Yuki Mitsufuji},
     year={2025},
     booktitle={Proc. DAFx},
}

@inproceedings{ycy2025ito,
     title={Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior}, 
     author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Wei-Hsiang Liao and Yuki Mitsufuji and György Fazekas},
     year={2025},
     booktitle={Proc. WASPAA},
}
