Evaluating Pretrained MotionGPT Models for American Sign Language Generation
Published on 2025-11-12 • Avichala Research
Abstract: This research investigates the effectiveness of pretrained MotionGPT models, incorporating Large Language Models (LLMs) and instruction tuning, for generating American Sign Language (ASL) videos. The authors explore multiple pretraining strategies – direct alignment and fusion, joint pretraining, and staged pretraining – alongside model combinations with LLMs like LLaMA and Qwen, to achieve state-of-the-art performance on key ASL generation metrics, demonstrating a significant improvement over existing methods.
Problem Statement: Access to information and communication remains a significant challenge for individuals who are Deaf or hard of hearing. While sign language translation tools exist, generating high-quality, natural-looking ASL video is still a complex problem: current approaches often struggle with realism, fluency, and the nuanced gestures required for accurate sign language expression. This research addresses that gap by leveraging pretrained models, initially inspired by MotionGPT, and enhancing them with LLMs to produce more realistic and effective ASL video. The goal is to move beyond simple gesture mapping toward systems capable of producing truly communicative ASL content.
Methodology: The study employs a multi-faceted approach centered on pretraining and fine-tuning models for ASL video generation. The core of the research is adapting the MotionGPT architecture, which tokenizes human motion so that motion sequences can be processed and generated much like text. Several key strategies are investigated:
- Pretraining Modules: The authors experiment with three pretraining strategies. “Direct Alignment and Fusion” aligns motion-capture representations directly with textual representations and then fuses them within a single model. “Joint pretraining” trains on motion and text data together from the start. “Staged pretraining” splits the process into stages, first focusing on motion data and subsequently incorporating LLM knowledge.
- Model Combinations: The study compares several architectures (an illustrative coupling of a gesture tokenizer with an LLM is sketched after this list):
  - MLP: a Multi-Layer Perceptron used as the foundational motion model.
  - LLM + Alignment: the MLP paired with a pretrained LLM (Qwen or LLaMA) through an alignment step to inject linguistic understanding.
  - MLP/LLM: a combined MLP-and-LLM architecture.
  - MLP+LLM+LLM: a stacked architecture incorporating multiple LLMs for richer contextual awareness.
- Datasets: The research uses three publicly available datasets: How2Sign, American Sign Language (ASL) HamNoSys, and the RWTH-PHOENIX-Weather dataset, covering multiple sign languages (ASL and German Sign Language) to test robustness and generalization.
- Hyperparameters: Key settings include a VQ-VAE gesture tokenizer with a codebook size of 1024, the AdamW optimizer, learning rates of 2e-4 and 1e-6, a batch size of 256, a max token length of 250, and Qwen or LLaMA as the backbone LLM, each with its own architecture and training regime. A minimal configuration sketch follows this list.
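To make these settings concrete, here is a minimal sketch of a VQ-VAE-style gesture tokenizer wired up with the reported values (codebook size 1024, AdamW, learning rate 2e-4, max token length 250). The class name, pose dimensionality, and network sizes are illustrative assumptions rather than details from the paper, and assigning the 2e-4 learning rate to this stage is likewise an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureVQVAE(nn.Module):
    """Toy VQ-VAE gesture tokenizer: pose frames -> discrete motion tokens."""

    def __init__(self, pose_dim=150, hidden_dim=512, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Codebook of 1024 motion "words", matching the reported codebook size.
        self.codebook = nn.Embedding(codebook_size, hidden_dim)
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, pose_dim),
        )

    def forward(self, poses):
        # poses: (batch, frames, pose_dim)
        z = self.encoder(poses)
        # Nearest-codebook-entry lookup; the straight-through estimator and
        # commitment loss of a full VQ-VAE are omitted for brevity, so gradients
        # here reach only the decoder and codebook.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        tokens = dists.argmin(dim=-1)                              # discrete motion tokens
        recon = self.decoder(self.codebook(tokens))
        return recon, tokens

# Reported training settings: AdamW, lr 2e-4 (assumed for this stage),
# max token length 250 frames; pose_dim=150 is a placeholder.
model = GestureVQVAE(codebook_size=1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
poses = torch.randn(8, 250, 150)   # reported batch size is 256; 8 keeps the demo light
recon, tokens = model(poses)
loss = F.mse_loss(recon, poses)
loss.backward()
optimizer.step()
```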
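And here is one common way such a tokenizer is coupled with a pretrained LLM in MotionGPT-style pipelines: the discrete motion codes are registered as new vocabulary items so that text and motion form a single token stream for causal language modeling. The checkpoint name, prompt format, and token IDs below are illustrative assumptions, and the paper's MLP alignment and fusion modules (e.g. projecting motion features into the LLM embedding space) are not reproduced here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the study pairs its motion tokenizer with Qwen or
# LLaMA, but the exact checkpoints are not given in this summary.
BASE_MODEL = "Qwen/Qwen2-1.5B"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
llm = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Staged pretraining (assumed split): stage 1 trains the gesture tokenizer
# (previous sketch); stage 2 registers its 1024 motion codes as new "words"
# so the LLM can model text and motion as one sequence.
motion_vocab = [f"<motion_{i}>" for i in range(1024)]
tokenizer.add_tokens(motion_vocab, special_tokens=True)
llm.resize_token_embeddings(len(tokenizer))

# A text prompt followed by the target motion-token sequence becomes a single
# causal-LM training example (token IDs 17, 903, 44 are made up for illustration).
example = "Translate to ASL: hello " + " ".join(f"<motion_{t}>" for t in (17, 903, 44))
inputs = tokenizer(example, return_tensors="pt")
outputs = llm(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()   # standard next-token loss over mixed text + motion tokens
```

Read the MLP, MLP/LLM, and MLP+LLM+LLM variants above as different choices for where this text-motion alignment happens and how many language backbones are stacked on top of the tokenizer.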
Findings & Results: The experiments yielded impressive results. Across multiple metrics, the combined MLP+LLM models consistently outperformed the baseline MotionGPT and other models. Specifically:
- BLEU@1 & ROUGE: The MotionGPT baseline scored 2.5 (BLEU@1) and 8.3 (ROUGE), while the MLP+LLM reached 14.2 and 11.9, respectively, indicating improved contextual understanding and fluency.
- CIDEr: The models showed a significant improvement, with the MLP+LLM achieving 28.0, and the larger models performing even better.
- WER (Word Error Rate), Insertions, Deletions: These metrics consistently improved, reflecting more accurate and natural ASL generation; the largest models reported WERs of 1.5 and 1.2, versus MotionGPT baseline values of 140 and 4.2 (see the metric sketch after this list).
- Overall Performance: Taken together, the MLP+LLM configurations delivered the strongest results, with CIDEr scores in the 26.9 to 28.0 range depending on configuration, dramatically surpassing previous benchmarks.
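For readers unfamiliar with these metrics: WER is the Levenshtein edit distance between the predicted and reference word/gloss sequences, normalized by the reference length, so it can exceed 100 when a system inserts many spurious tokens, which is how a baseline figure of 140 is possible. Below is a minimal, self-contained sketch of that computation with the insertion/deletion breakdown; the function name and example glosses are illustrative, and the paper's evaluation script may differ in tokenization details.

```python
def wer_breakdown(reference, hypothesis):
    """Word error rate plus substitution/deletion/insertion counts via Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, substitutions, deletions, insertions)
    # for aligning ref[:i] against hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for j in range(len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)                      # empty reference: j insertions
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)                      # empty hypothesis: i deletions
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # exact match, no edit
                continue
            e_sub, s, d, ins = dp[i - 1][j - 1]
            e_del = dp[i - 1][j][0]
            e_ins = dp[i][j - 1][0]
            if e_sub <= e_del and e_sub <= e_ins:    # substitution is cheapest
                dp[i][j] = (e_sub + 1, s + 1, d, ins)
            elif e_del <= e_ins:                     # deletion is cheapest
                _, s, d, ins = dp[i - 1][j]
                dp[i][j] = (e_del + 1, s, d + 1, ins)
            else:                                    # insertion is cheapest
                _, s, d, ins = dp[i][j - 1]
                dp[i][j] = (e_ins + 1, s, d, ins + 1)
    edits, subs, dels, inserts = dp[-1][-1]
    return edits / max(len(ref), 1), subs, dels, inserts

# Example glosses (made up): one substitution and one deletion against a
# 4-gloss reference gives WER 0.5 -> prints (0.5, 1, 1, 0).
print(wer_breakdown("HELLO MY NAME JOHN", "HELLO YOUR NAME"))
```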
Limitations: The research acknowledges several limitations. The reliance on fixed, pretrained LLMs introduces a dependency on the quality and biases of those models. The study also focuses exclusively on generating ASL video, which may limit its applicability to other sign languages. Further, while the experiments showed clear improvements, generating truly expressive ASL, including nuanced facial expressions and emotional conveyance, remains a significant challenge.
Future Work & Outlook: Future research directions include exploring techniques to mitigate bias in LLMs used for ASL generation. Investigating adaptive learning mechanisms that can tailor the generation process to the specific communicative intent of the user is a promising avenue. Developing more sophisticated control mechanisms, potentially incorporating reinforcement learning, could enable finer-grained control over the generated ASL, allowing for realistic emotional expression and interactive sign language communication. Exploring multi-modal ASL generation – incorporating audio and visual cues – represents a particularly exciting frontier, potentially leading to significantly improved accessibility and naturalness.
Avichala Commentary: This work represents a crucial step in moving beyond simple gesture-based ASL generation towards systems capable of truly communicating. The integration of LLMs alongside MotionGPT signals a key shift in AI – moving towards systems that understand not just what is being communicated but how it’s being communicated, mirroring the cognitive processes involved in human sign language use. It aligns with the broader trend of LLMs becoming agents capable of interacting with the real world. This research complements the growing field of AI Agents and offers a foundational model for creating intelligent sign language communication systems, a domain that, until recently, has been largely unexplored by mainstream AI research. The results significantly bolster the case for leveraging LLMs to unlock new capabilities in diverse areas of human-computer interaction.
Link to the arXiv paper: https://arxiv.org/abs/2511.08535v1
© 2025 Avichala Research & Education Team. Explore more summaries at www.avichala.com/research.