Abstract: |
Feature selection is a crucial step in monaural speech enhancement with supervised learning algorithms: more robust feature combinations make it easier to obtain superior enhancement models. On one hand, although the commonly used Deep Neural Network (DNN) can exploit existing robust hand-crafted features, these features are limited and cannot capture the relationship between frames. On the other hand, the Convolutional Neural Network (CNN) is widely used for speech separation because it extracts features from the spectra of adjacent frames; however, experiments show that the features learned by the CNN describe the temporal and spatial structure of the current frame poorly. To combine the strengths of both approaches and obtain a more robust feature set, this paper proposes a deep stacked residual network architecture. The main idea is to use robust traditional features together with a CNN that explores the contextual relationships in the spectrum, thereby improving network performance. Experimental results show that the proposed algorithm achieves good speech enhancement performance and generalizes well, with substantial improvements in metrics such as speech quality and objective intelligibility. |
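The abstract gives no implementation details; the following is a minimal sketch, assuming a PyTorch implementation, of the kind of architecture it describes: hand-crafted per-frame features concatenated with CNN features extracted from a window of adjacent spectral frames, then refined by stacked residual blocks. All module names, layer sizes, context width, and the mask-style output are illustrative assumptions, not the authors' configuration.

# Hypothetical sketch (not the authors' code): fuse robust hand-crafted
# frame features with CNN features drawn from adjacent spectral frames,
# then pass them through stacked residual blocks.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.net(x))  # residual (skip) connection

class StackedResidualEnhancer(nn.Module):
    def __init__(self, n_freq=257, context=11, handcrafted_dim=246, hidden=512, n_blocks=4):
        super().__init__()
        # CNN branch: learns context from a (context x n_freq) spectral patch
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 64)),  # -> (B, 32, 1, 64)
            nn.Flatten(),                   # -> (B, 2048)
        )
        self.fuse = nn.Linear(2048 + handcrafted_dim, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        self.out = nn.Linear(hidden, n_freq)  # e.g., a per-frame mask estimate

    def forward(self, spec_patch, handcrafted):
        # spec_patch: (B, 1, context, n_freq); handcrafted: (B, handcrafted_dim)
        cnn_feat = self.cnn(spec_patch)
        x = torch.relu(self.fuse(torch.cat([cnn_feat, handcrafted], dim=1)))
        return torch.sigmoid(self.out(self.blocks(x)))

# Example forward pass with random tensors
model = StackedResidualEnhancer()
mask = model(torch.randn(8, 1, 11, 257), torch.randn(8, 246))
print(mask.shape)  # torch.Size([8, 257])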