Current natural language processing research is primarily focused on spoken languages, e.g., English, Mandarin, German, etc. With the advent of smart speakers and voice-activated digital assistants, there is a renewed focus on speech recognition. However, such language technology inherently excludes deaf and mute users, motivating the importance of sign language recognition/translation research. Existing sign language translation (SLT) methods typically treat the problem as one of video captioning and extract features directly from the whole image or only from the dominant hand. In reality, however, important linguistic information is contained in the facial expression and the non-dominant hand's gestures. Existing approaches also rely solely on RNNs, which may fail to sufficiently model long-range dependencies. To address this gap, we propose a model consisting of a three-stream 3D-CNN to capture the local spatiotemporal features of signing and a Transformer to decode the sentence from those spatiotemporal features. We conduct experiments on a large-scale benchmark dataset to investigate the effectiveness of our proposed local features and decoder for sign language translation.
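As a rough illustration of the architecture described above (not the authors' implementation), the sketch below wires three 3D-CNN streams over different spatial crops into a Transformer decoder; the choice of streams (full frame, hands, face), layer sizes, and depths are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class Stream3DCNN(nn.Module):
    """One 3D-CNN stream: a video clip is mapped to per-timestep feature vectors."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis, pool space away
        )

    def forward(self, clip):                     # clip: (B, 3, T, H, W)
        f = self.conv(clip)                      # (B, C, T, 1, 1)
        return f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, C)

class ThreeStreamSLT(nn.Module):
    """Three spatiotemporal streams fused per timestep, decoded by a Transformer."""
    def __init__(self, vocab_size, feat_dim=256, d_model=512):
        super().__init__()
        self.streams = nn.ModuleList([Stream3DCNN(feat_dim) for _ in range(3)])
        self.fuse = nn.Linear(3 * feat_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, full, hand, face, tgt_tokens):
        # Encode each stream, concatenate features per timestep, project to d_model.
        feats = torch.cat(
            [s(x) for s, x in zip(self.streams, (full, hand, face))], dim=-1
        )
        memory = self.fuse(feats)                        # (B, T, d_model)
        tgt = self.embed(tgt_tokens)                     # (B, L, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(tgt, memory, tgt_mask=mask)   # causal sentence decoding
        return self.out(dec)                             # (B, L, vocab_size)
```

In this sketch the decoder attends over the fused per-timestep video features, so facial and non-dominant-hand cues can influence every output token rather than only a recurrent hidden state.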