
Abstract

Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it substantially increases model size because it uses multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates the outputs of conventional 2D convolution and FDY conv as static and dynamic branches, respectively. PFD-CRNN with one eighth of its output coming from the dynamic branch reduces the parameter count by 51.9% relative to FDY-CRNN while retaining performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution layer. The best resulting MDFD-CRNN, with five non-dilated FDY conv branches, three differently dilated DFD conv branches, and a static branch, achieved a 3.17% improvement in polyphonic sound detection score (PSDS) over FDY conv without class-wise median filter. Applying sound event bounding box as post-processing to the best MDFD-CRNN achieved a true PSDS1 of 0.485, which is the state-of-the-art score on the DESED dataset without external datasets or pretrained models. From extensive ablation studies, we found that not only multiple dynamic branches but also a specific proportion of static branch helps SED. In addition, non-dilated dynamic branches are necessary alongside dilated dynamic branches to obtain optimal SED performance. The results and discussion of the ablation studies further enhance the understanding and usability of FDY conv variants.
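
To illustrate why routing only part of a layer's output through the dynamic branch saves parameters, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It counts only convolution kernel weights for a single layer and ignores biases, the attention module that weights the basis kernels, and the recurrent part of the CRNN. The function names, the layer sizes (64 input and 128 output channels, 3x3 kernels), and the choice of four basis kernels are all assumptions made for illustration.

```python
def split_channels(out_channels, dynamic_ratio):
    """Return (static_channels, dynamic_channels) for one PFD-style layer.

    Hypothetical helper: assumes the dynamic branch receives a fixed
    fraction of the output channels and the static branch gets the rest.
    """
    dynamic = round(out_channels * dynamic_ratio)
    return out_channels - dynamic, dynamic


def conv_weights(in_channels, out_channels, kernel_size=3, n_basis=1):
    """Kernel weight count of a 2D convolution with n_basis basis kernels.

    Biases and any attention parameters are deliberately ignored.
    """
    return n_basis * in_channels * out_channels * kernel_size * kernel_size


def pfd_layer_weights(in_channels, out_channels, dynamic_ratio,
                      n_basis=4, kernel_size=3):
    """Kernel weights of a layer whose output concatenates both branches."""
    static_ch, dynamic_ch = split_channels(out_channels, dynamic_ratio)
    static = conv_weights(in_channels, static_ch, kernel_size)            # one kernel
    dynamic = conv_weights(in_channels, dynamic_ch, kernel_size, n_basis)  # n_basis kernels
    return static + dynamic


# Assumed layer sizes, chosen only for illustration.
full_fdy = pfd_layer_weights(64, 128, dynamic_ratio=1.0)   # every channel dynamic
pfd = pfd_layer_weights(64, 128, dynamic_ratio=1 / 8)      # one eighth dynamic
print(full_fdy, pfd, round(1 - pfd / full_fdy, 3))          # 294912 101376 0.656
```

The per-layer saving printed here is not expected to match the 51.9% whole-model figure reported above, since a full CRNN also contains components that this count leaves out.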
