An Alternative Approach in Voice Extraction

Research on audio clue-based target speaker extraction (TSE) has mostly focused on modeling the mixture and reference speech, achieving high performance in English thanks to the availability of large datasets. However, less attention has been given to the properties of human speech that remain consistent across languages. To bridge this gap, we introduce an alternative model that addresses the challenge of transferring TSE models from one language to another without fine-tuning. In this work, we propose a gating mechanism that modifies specific frequencies based on the speaker's acoustic features. The model achieves an SI-SDR of 17.3544 dB on clean English speech and 13.2032 dB on clean speech mixed with WHAM! noise, outperforming all other models in its ability to adapt to different languages.
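
For readers unfamiliar with the reported metric, the following is a minimal sketch of how scale-invariant signal-to-distortion ratio (SI-SDR) is commonly computed, in decibels, between an extracted signal and the clean target. It is not the evaluation code used for the results above; the function name `si_sdr` and the stabilizing constant `eps` are illustrative assumptions.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name); `eps` is an assumed small
    constant that guards against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual.
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]

    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))
```

Because the target is rescaled before the residual is measured, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why the metric suits comparisons across systems with different output levels.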