SPEAK: An AI-Based Assistive Video Communication System for Speech and Sign Language Translation

Thejuskrishnan; Amal; Vyshnav M; Narayanan K; Saira Shamsudheen K S

Authors

Thejuskrishnan, Toc H Institute of Science & Technology

Amal, Toc H Institute of Science & Technology

Vyshnav M, Toc H Institute of Science & Technology

Narayanan K, Toc H Institute of Science & Technology

Saira Shamsudheen K S, Toc H Institute of Science & Technology

Abstract

Effective communication between the hearing and deaf/hard-of-hearing (DHH) communities remains difficult, especially in digital video conferencing. Platforms such as Zoom and Google Meet are widely used but lack native, real-time bidirectional translation, so users frequently depend on costly human interpreters or invasive hardware sensors. To close this modality gap, this paper presents SPEAK (Sign Processing Enhanced Audio Kommunicator), a novel sensor-less, browser-based platform. By translating spoken language to text captions for DHH users and sign language to text/speech for hearing users, SPEAK enables smooth, two-way communication.
For visual recognition, the system’s architecture uses the Detection Transformer (DETR) model with a ResNet-50 backbone. In contrast to conventional CNN-based detectors that rely on region proposals, DETR formulates detection as a direct set prediction problem using a bipartite matching loss and self-attention mechanisms, which enhances robustness against complex backgrounds and removes the need for intricate, handcrafted anchors. The audio pipeline incorporates OpenAI’s Whisper model for high-fidelity Automatic Speech Recognition (ASR) and Microsoft’s SpeechT5 for natural Text-to-Speech (TTS) synthesis, with Voice Activity Detection (VAD) used to save bandwidth. To keep video frames and translation outputs synchronized, all modules are coordinated within a low-latency WebRTC environment using a Flask-React framework. SPEAK is validated as a scalable, affordable solution for inclusive digital interaction: experimental evaluation on a custom dataset under various lighting conditions shows a sign detection accuracy of 92
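
The bipartite matching step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the cost function (negative class probability plus L1 box distance), the dictionary layout, and the names `match_cost` and `bipartite_match` are assumptions made for illustration, and the brute-force search over permutations stands in for the Hungarian algorithm that is normally used for this assignment problem.

```python
from itertools import permutations


def match_cost(pred, target):
    """Cost of pairing one prediction with one ground-truth object.

    Assumed simplification: negative probability of the true class
    plus the L1 distance between the two boxes.
    """
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost


def bipartite_match(preds, targets):
    """Find the one-to-one assignment of predictions to targets with minimum total cost.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to targets[j]. Brute force over permutations, so it
    is only practical for a handful of objects.
    """
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_assignment, best_cost


# Hypothetical data: three predictions, two ground-truth signs.
# Boxes are (center_x, center_y, width, height) in normalized coordinates.
preds = [
    {"probs": [0.1, 0.9], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.8, 0.2], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.5, 0.5], "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    {"label": 0, "box": (0.20, 0.30, 0.10, 0.10)},
    {"label": 1, "box": (0.50, 0.50, 0.20, 0.20)},
]

assignment, total_cost = bipartite_match(preds, targets)
print(assignment)  # prediction index chosen for each target
```

In this toy example the third prediction is left unmatched; in a DETR-style loss, unmatched predictions are trained toward a "no object" class, which is what lets the model output a fixed-size set without anchors or non-maximum suppression.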

Keywords:

Sign Language Recognition, DETR, WebRTC, OpenAI Whisper, Assistive Technology, Deep Learning
