Effective Combination of DenseNet and BiLSTM for Keyword Spotting
Keyword spotting (KWS) is a major component of human–computer interaction for smart on-device terminals and service robots; its purpose is to maximize detection accuracy while keeping the footprint small. In this paper, building on the strong ability of DenseNet to extract local feature maps, we propose a new network architecture (DenseNet-BiLSTM) for KWS. In our DenseNet-BiLSTM, the DenseNet is applied primarily to obtain local features, while the BiLSTM captures time-series features. DenseNet is generally used in computer vision tasks, and it may corrupt the contextual information of speech audio. To make DenseNet suitable for KWS, we propose a DenseNet variant, called DenseNet-Speech, which removes pooling along the time dimension in the transition layers so as to preserve the temporal information of speech. In addition, DenseNet-Speech uses fewer dense blocks and filters to keep the model small, thereby reducing time consumption on mobile devices. The experimental results show that feature maps from DenseNet-Speech preserve time-series information well. Our method outperforms state-of-the-art methods in terms of accuracy on the Google Speech Commands dataset: DenseNet-BiLSTM achieves 96.6% accuracy on the 20-commands recognition task with 223K trainable parameters.
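To make the architectural idea concrete, the sketch below illustrates, in PyTorch (an assumed framework, not necessarily the authors'), a transition layer that pools only along the frequency axis while leaving the time axis intact, followed by a BiLSTM over the resulting frame sequence. The class names (`TransitionSpeech`, `DenseNetBiLSTMSketch`) and all layer sizes are hypothetical and purely illustrative; they are not the paper's configuration.

```python
import torch
import torch.nn as nn

class TransitionSpeech(nn.Module):
    """Transition layer that pools only along the frequency axis,
    leaving the time axis untouched (sketch of the DenseNet-Speech
    idea described in the abstract; sizes are illustrative)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        # Pool with kernel/stride (1, 2): time preserved, frequency halved.
        self.pool = nn.AvgPool2d(kernel_size=(1, 2), stride=(1, 2))

    def forward(self, x):  # x: (batch, channels, time, freq)
        return self.pool(self.conv(torch.relu(self.bn(x))))

class DenseNetBiLSTMSketch(nn.Module):
    """Toy DenseNet-BiLSTM: a small convolutional front end whose
    feature maps keep their time resolution, followed by a BiLSTM
    and a linear classifier. Not the paper's exact model."""
    def __init__(self, n_classes=20, n_mels=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(),
            TransitionSpeech(32, 64),          # freq: n_mels -> n_mels // 2
        )
        self.bilstm = nn.LSTM(64 * (n_mels // 2), 64,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_classes)

    def forward(self, x):  # x: (batch, 1, time, n_mels)
        f = self.features(x)                   # (batch, C, time, n_mels // 2)
        b, c, t, q = f.shape
        seq = f.permute(0, 2, 1, 3).reshape(b, t, c * q)  # one vector per frame
        out, _ = self.bilstm(seq)
        return self.classifier(out[:, -1])     # classify from last time step

model = DenseNetBiLSTMSketch()
logits = model(torch.randn(2, 1, 101, 40))     # e.g. 101 frames x 40 mel bins
print(logits.shape)                            # torch.Size([2, 20])
```

The key design point mirrored here is that the convolutional feature extractor never downsamples the time axis, so the BiLSTM still sees one feature vector per input frame; only the frequency axis is reduced by the transition layers.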