Project

Skeleton-based American Sign Language Recognition

In a previous study we proposed ResNet-based network models, named ContiNet and F-ContiNet, for sign language recognition. ContiNet is a 3D convolutional network that operates on a constant number of temporal snapshots, and F-ContiNet is an evolution of ContiNet that applies a fusion idea so that spatial and temporal information from the early layers has a greater direct impact on the final recognition. We evaluated the models with RGB (raw) and Mask (skeleton) video inputs from a Chinese sign language (CSL) dataset. The experimental results indicated that the proposed ContiNet and F-ContiNet outperformed state-of-the-art approaches, and that taking Mask videos as inputs always resulted in better performance. The purpose of the project reported here is to further our understanding of the ContiNet and F-ContiNet models using a different dataset. We used the same experimental settings to evaluate ContiNet and F-ContiNet on the Microsoft sign language (MSL) dataset. To our surprise, the experimental results were the opposite of the findings on the CSL dataset. In particular, ContiNet and F-ContiNet did not produce the best performance, and taking RGB videos as inputs resulted in better performance in most settings. We analyzed the quality of the MSL dataset to diagnose the problems and identify possible solutions.
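The abstract notes that ContiNet consumes a constant number of temporal snapshots regardless of the input video's length. One common way to achieve this is uniform temporal sampling; the sketch below illustrates that general idea. The function name `sample_snapshots` and the snapshot count of 16 are illustrative assumptions, not details taken from the ContiNet papers.

```python
import numpy as np

def sample_snapshots(frames, num_snapshots=16):
    """Uniformly sample a fixed number of temporal snapshots from a video.

    `frames` is any indexable sequence of frames (e.g. a list of arrays).
    If the video is shorter than `num_snapshots`, some frames are repeated,
    so the output length is always exactly `num_snapshots`.
    """
    if len(frames) == 0:
        raise ValueError("cannot sample from an empty video")
    # Evenly spaced (possibly fractional) positions over the full clip,
    # rounded to the nearest valid frame index.
    indices = np.linspace(0, len(frames) - 1, num_snapshots).round().astype(int)
    return [frames[i] for i in indices]
```

For example, a 100-frame clip and a 30-frame clip both reduce to 16 snapshots, which is what lets a fixed-shape 3D convolutional network accept videos of varying duration.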
