Edge-Cloud Framework for Privacy-Preserving Speech Translation

ai-technology · 2026-05-28

Researchers propose ESRT (Edge-cloud Speech Recognition and Translation), a collaborative edge-cloud MLLM framework for speech-to-text translation. Its split inference architecture keeps a lightweight speech encoder and adapter on the device and transmits only highly compressed intermediate features to the cloud, which prevents voiceprint leakage and reduces bandwidth requirements by up to 10 times. The design targets three problems: the privacy risks and bandwidth bottlenecks of centralized cloud systems, the resource constraints of fully on-device models, and the English-centric bias that limits scaling to many-to-many translation.

Key facts

ESRT stands for Edge-cloud Speech Recognition and Translation.
It is a collaborative edge-cloud MLLM framework.
It uses a split inference architecture.
A lightweight speech encoder and adapter remain on the device.
Only highly compressed intermediate features are transmitted to the cloud.
This prevents voiceprint leakage.
Bandwidth requirements are reduced by up to 10 times.
The framework aims to overcome English-centric biases for many-to-many translation.
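
The split described above can be illustrated with a toy sketch. It is not ESRT's implementation: the summary does not say how the encoder, adapter, or compression work, so every name and number below (edge_encode, quantize, cloud_decode, BLOCK, FEATURE_DIM, 16 kHz 16-bit audio) is a hypothetical stand-in. Chunk averaging takes the place of the neural encoder, and 8-bit quantization takes the place of whatever compression ESRT uses. The sketch shows only the data flow, in which the device sends feature bytes rather than audio, plus the bandwidth arithmetic. It does not demonstrate voiceprint protection.

```python
import math
import struct

SAMPLE_RATE = 16_000   # Hz; hypothetical input format (16-bit mono PCM)
BLOCK = 1_600          # samples per block (100 ms); hypothetical
FEATURE_DIM = 320      # values transmitted per block; hypothetical


def edge_encode(samples):
    """Stand-in for the on-device encoder and adapter.

    Reduces each full block of BLOCK samples to FEATURE_DIM averages.
    A real encoder would be a neural network; trailing partial blocks
    are dropped here for simplicity.
    """
    chunk = BLOCK // FEATURE_DIM
    features = []
    for start in range(0, len(samples) - BLOCK + 1, BLOCK):
        window = samples[start:start + BLOCK]
        features.append(
            [sum(window[i:i + chunk]) / chunk for i in range(0, BLOCK, chunk)]
        )
    return features


def quantize(features):
    """Pack each feature value (expected in [-1, 1]) as one signed byte."""
    payload = bytearray()
    for vector in features:
        for value in vector:
            q = max(-127, min(127, round(value * 127)))
            payload += struct.pack("b", q)
    return bytes(payload)


def cloud_decode(payload):
    """Stand-in for the cloud side, which only ever receives feature bytes."""
    values = [b / 127 for b in struct.unpack(f"{len(payload)}b", payload)]
    return [values[i:i + FEATURE_DIM] for i in range(0, len(values), FEATURE_DIM)]


def bandwidth_ratio(samples, payload, bytes_per_sample=2):
    """Size of the raw PCM audio divided by the size of the payload sent."""
    return (len(samples) * bytes_per_sample) / len(payload)


if __name__ == "__main__":
    # One second of a 440 Hz tone as placeholder "speech".
    audio = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
    sent = quantize(edge_encode(audio))
    received = cloud_decode(sent)
    print(len(audio) * 2, "raw bytes ->", len(sent), "bytes sent")
    print("blocks received by cloud:", len(received))
    print(f"bandwidth ratio: {bandwidth_ratio(audio, sent):.1f}")
```

The toy parameters were picked so that one second of audio (32,000 bytes as 16-bit PCM) becomes 3,200 transmitted bytes, a ratio of 10. This only makes the "up to 10 times" figure concrete and says nothing about ESRT's actual settings.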
