WebSocket API for asynchronous, low-latency audio translation, designed for medical consultations.
Main Features:
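Architecture: The system uses a pipeline architecture with parallel branches per detected sentence:
- Branch A: literal translation with back-translation and TTS audio
- Branch B: contextually reformulated translation with back-translation and TTS audio
- Branch C: literal translation assuming the opposite speaker (only when enable_dual_direction is set, or on demand via trigger_branch_c)
Processing Flow:
1. Client sends HandshakeRequest with language/gender/speed configuration
2. Server replies with HandshakeSuccess
3. Client streams audio chunks; the server returns ASR and branch results asynchronously
4. Client sends ConversationExit to end the session gracefully, or the server closes automatically on inactivity timeout
5. Server sends ConversationArchive if save_conversation was true, then closes
Development server for audio streaming and orchestration.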
⚠️ Connection Requirements (HTTP/1.1 Only): This endpoint strictly requires a standard HTTP/1.1 WebSocket Handshake.
Clients must send the following headers:
- Connection: Upgrade
- Upgrade: websocket
- Sec-WebSocket-Version: 13
- Sec-WebSocket-Key: [Base64-encoded 16-byte random key]
Note: HTTP/2 is not supported for the handshake. Clients like curl or strict proxies must force HTTP/1.1 or the required Upgrade headers will be stripped.
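The header list above can be sketched in a few lines of Python. This is only an illustration of what a client library normally does internally: `make_websocket_key` and `build_upgrade_request` are hypothetical helper names, and the host and path passed in are placeholders, since the endpoint URL is not given here.

```python
import base64
import os


def make_websocket_key() -> str:
    """Return a Base64-encoded 16-byte random value for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def build_upgrade_request(host: str, path: str) -> str:
    """Build the raw HTTP/1.1 upgrade request (host and path are placeholders)."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: websocket",
        "Sec-WebSocket-Version: 13",
        f"Sec-WebSocket-Key: {make_websocket_key()}",
    ]
    # Header lines end with CRLF; an empty line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"
```

Most WebSocket client libraries generate these headers themselves, so the sketch is mainly useful for checking what a proxy or a hand-rolled client actually puts on the wire.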
Main WebSocket channel for audio streaming, translation results, and session control.
Communication Sequence:
1. Client -> Server: HandshakeRequest (JSON)
2. Server -> Client: HandshakeSuccess (JSON)
3. Client -> Server: AudioChunk (bytes) - continuous streaming
4. Server -> Client: ASRResult, BranchAResult, BranchBResult (JSON) - asynchronous
5. Client -> Server: ConversationExit (JSON) - to end the session gracefully
6. Server -> Client: ExitAcknowledged (JSON) - confirms exit received
7. Server -> Client: ConversationArchive (JSON) - on session end, if save_conversation was true
Control messages (Client -> Server): JSON text frames sent at any point after the handshake.
- set_speed - change TTS speed for subsequent sentences
- flag - flag a sentence or the whole conversation for review
- correct_speaker - correct a misidentified speaker role
- trigger_branch_c - run the alternate-speaker hypothesis for one specific sentence on demand
- conversation_exit - end the session gracefully (triggers cleanup and optional archive delivery)
Timeouts:
- Handshake: the HandshakeRequest must arrive within 10 seconds of connection
- Inactivity: the server closes the session automatically on inactivity timeout
WebSocket Close Codes:
- 1000: normal closure after a graceful conversation_exit
Send audio chunks and session control messages
Accepts one of the following messages:
First JSON message sent by client - must arrive within 10 seconds of connection
French doctor, English patient, normal speed
{
"language_medic": "fr-FR",
"gender_medic": "female",
"language_patient": "en-GB",
"gender_patient": "male",
"patient_age": "adult",
"speed": 1
}
Slowed TTS
{
"language_medic": "fr-FR",
"gender_medic": "male",
"language_patient": "de-DE",
"gender_patient": "female",
"patient_age": "senior",
"speed": 0.8
}
Session with archive delivery at end
{
"language_medic": "fr-FR",
"gender_medic": "male",
"language_patient": "en-US",
"gender_patient": "female",
"save_conversation": true
}
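The three examples above suggest that the language and gender fields are always present, while `patient_age`, `speed`, and `save_conversation` can be left out. That reading is an assumption drawn from the examples only, not a confirmed schema. Under that assumption, a handshake payload could be built like this (`build_handshake` is a hypothetical helper):

```python
import json
from typing import Optional


def build_handshake(
    language_medic: str,
    gender_medic: str,
    language_patient: str,
    gender_patient: str,
    patient_age: Optional[str] = None,
    speed: Optional[float] = None,
    save_conversation: Optional[bool] = None,
) -> str:
    """Serialize a HandshakeRequest, leaving out optional fields that are unset."""
    payload = {
        "language_medic": language_medic,
        "gender_medic": gender_medic,
        "language_patient": language_patient,
        "gender_patient": gender_patient,
    }
    # Assumed optional fields: only included when the caller supplies a value.
    optional = {
        "patient_age": patient_age,
        "speed": speed,
        "save_conversation": save_conversation,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload)
```

The returned string is sent as the first text frame, within the 10-second handshake window.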
Raw audio data streamed continuously after handshake
Raw PCM audio - s16le, 16kHz, mono. Recommended chunk size: 4096 bytes.
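At s16le, 16kHz, mono, each sample takes 2 bytes, so a 4096-byte chunk holds 2048 samples, or 128 ms of audio. A minimal sketch of splitting a PCM buffer into chunks of the recommended size follows; the function names are illustrative, not part of the API.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # s16le: signed 16-bit little-endian, mono
CHUNK_BYTES = 4096    # recommended chunk size


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of the PCM buffer; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def chunk_duration_seconds(chunk_bytes: int = CHUNK_BYTES) -> float:
    """Duration of audio carried by one chunk at the documented format."""
    return chunk_bytes / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)
```

Each yielded chunk would be sent as one binary WebSocket frame. When streaming from a file rather than a microphone, pacing the sends by `chunk_duration_seconds()` keeps the stream close to real time.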
Adjust TTS playback speed for all sentences processed after this message. Takes effect on the next sentence - the pipeline currently in-flight is not affected.
Increase speed by 25%
{
"type": "set_speed",
"speed": 1.25
}
Reduce speed for easier comprehension
{
"type": "set_speed",
"speed": 0.75
}
Mark a sentence (or the whole session) for later review. Omit `sentence_id` to flag the entire conversation.
Flag a specific sentence as a mistranslation
{
"type": "flag",
"sentence_id": 3,
"flag_type": "mistranslation",
"note": "The word 'douleur' was translated as 'pain' but context suggests 'ache'"
}
Flag the whole conversation
{
"type": "flag",
"flag_type": "other",
"note": "Audio quality was poor throughout"
}
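The two flag variants differ only in whether `sentence_id` is present. A small sketch of a builder that enforces this on the client side is shown below. `build_flag` is a hypothetical helper, and treating `note` as optional is an assumption, since both examples include one.

```python
import json
from typing import Optional


def build_flag(flag_type: str, note: Optional[str] = None,
               sentence_id: Optional[int] = None) -> str:
    """Serialize a flag control message; omit sentence_id to flag the whole conversation."""
    message = {"type": "flag", "flag_type": flag_type}
    if sentence_id is not None:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(sentence_id, bool) or not isinstance(sentence_id, int):
            raise TypeError("sentence_id must be an integer")
        message["sentence_id"] = sentence_id
    if note is not None:
        message["note"] = note
    return json.dumps(message)
```

Checking the type before sending avoids a round trip that would otherwise end in a `Control` error from the server.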
Override the speaker role assigned by ASR language detection for a sentence. Useful when both speakers share a language or ASR misidentified the speaker.
{
"type": "correct_speaker",
"sentence_id": 2,
"speaker": "patient"
}
Run Branch C on demand for a single sentence, regardless of whether `enable_dual_direction` was set at handshake. Useful when the client suspects a speaker misidentification on a specific sentence and wants the alternate translation without paying the cost for every sentence. The server sends an immediate `BranchCTriggered` acknowledgment, then the `BranchCResult` (or an `error`) arrives asynchronously once translation and TTS complete.
{
"type": "trigger_branch_c",
"sentence_id": 3
}
Signals the server to end the session cleanly. The server will:
1. Send `ExitAcknowledged` immediately
2. Await any in-flight audio pipelines
3. Send `ConversationArchive` if `save_conversation` was true
4. Close the WebSocket with code 1000
Preferred over simply closing the connection - ensures in-flight sentences are processed to completion and the archive is delivered before the socket closes.
{
"type": "conversation_exit"
}
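After sending `conversation_exit`, a client keeps reading until the socket closes, because late results and the archive can still arrive. The sketch below processes the messages received in that window, assuming they have already been parsed from JSON into dictionaries. `drain_after_exit` is a hypothetical name.

```python
from typing import Iterable, List, Optional, Tuple


def drain_after_exit(
    messages: Iterable[dict], save_conversation: bool
) -> Tuple[bool, Optional[dict], List[dict]]:
    """Return (acknowledged, archive_or_archive_error, late_results)."""
    acknowledged = False
    outcome: Optional[dict] = None
    late_results: List[dict] = []
    for message in messages:
        kind = message.get("type")
        if kind == "exit_acknowledged":
            acknowledged = True
        elif kind == "conversation_archive" and save_conversation:
            outcome = message
        elif kind == "error" and message.get("service") == "Conversation Archive":
            # Archive construction failed server-side; sent instead of the archive.
            outcome = message
        else:
            # Results from pipelines that were still in flight when exit was sent.
            late_results.append(message)
    return acknowledged, outcome, late_results
```

Collecting the late results separately means sentences that were still being translated at exit time are not lost.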
Receive transcription, translation results, and acknowledgments
Accepts one of the following messages:
Server accepted the configuration - audio streaming may begin
{
"type": "handshake_success",
"message": "Configuration received and accepted. Ready to receive audio stream.",
"medic_lang": "fr-FR",
"gender_medic": "female",
"patient_lang": "en-GB",
"gender_patient": "male",
"patient_age": "adult",
"speed": 1,
"save_conversation": false
}
ASR transcription of a detected sentence with speaker identification
{
"type": "asr_result",
"id": 1,
"text": "Bonjour, comment vous sentez-vous aujourd'hui ?",
"language": "fr-FR",
"speaker": "medic"
}
{
"type": "asr_result",
"id": 2,
"text": "I have had a headache since yesterday",
"language": "en-GB",
"speaker": "patient"
}
Literal translation with back-translation and TTS audio
{
"type": "branch_a_result",
"sentence_id": 1,
"original_text": "Bonjour, comment vous sentez-vous aujourd'hui ?",
"translated_text": "Hello, how are you feeling today?",
"back_translated_text": "Bonjour, comment vous sentez-vous aujourd'hui ?",
"audio_format": "wav"
}
Contextually reformulated translation with back-translation and TTS audio
{
"type": "branch_b_result",
"sentence_id": 1,
"reformulated_source": "Bonjour, pouvez-vous me décrire votre état de santé actuel ?",
"translated_reformulation": "Hello, can you describe your current health condition?",
"back_translated_reformulation": "Bonjour, pouvez-vous décrire votre état de santé actuel ?",
"audio_format": "wav"
}
Literal translation assuming the opposite speaker to the one ASR detected. Sent for every sentence when `enable_dual_direction` was true in the handshake, or for a single sentence after a `trigger_branch_c` request. Always arrives after the corresponding `BranchAResult` for the same sentence.
{
"type": "branch_c_result",
"sentence_id": 1,
"assumed_speaker_role": "patient",
"original_text": "Bonjour, comment vous sentez-vous aujourd'hui ?",
"translated_text": "Hello, how are you feeling today?",
"back_translated_text": "Bonjour, comment vous sentez-vous aujourd'hui ?",
"audio_format": "wav"
}
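Because results arrive asynchronously, a client typically regroups them per sentence. Note that the examples use `id` in `asr_result` but `sentence_id` in the branch results. The sketch below assumes both refer to the same sentence number, which the matching example texts suggest but do not confirm. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable

RESULT_TYPES = {"asr_result", "branch_a_result", "branch_b_result", "branch_c_result"}


def sentence_key(message: dict) -> int:
    """asr_result carries the sentence number in "id"; branch results use "sentence_id"."""
    if message["type"] == "asr_result":
        return message["id"]
    return message["sentence_id"]


def group_by_sentence(messages: Iterable[dict]) -> Dict[int, Dict[str, dict]]:
    """Map sentence number -> {message type -> message}, ignoring non-result messages."""
    grouped: Dict[int, Dict[str, dict]] = defaultdict(dict)
    for message in messages:
        if message.get("type") in RESULT_TYPES:
            grouped[sentence_key(message)][message["type"]] = message
    return dict(grouped)
```

Keying by message type makes it easy to render a sentence as soon as its ASR text is known and fill in each branch when it arrives.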
Confirms the new TTS speed value after a set_speed control message
{
"type": "speed_updated",
"speed": 1.25
}
Confirms receipt of a flag control message
{
"type": "flag_ack",
"sentence_id": 3
}
{
"type": "flag_ack",
"sentence_id": null
}
Confirms receipt of a correct_speaker control message
{
"type": "correction_ack",
"sentence_id": 2,
"speaker": "patient"
}
Immediate acknowledgment that a `trigger_branch_c` request was accepted. The actual result arrives later as a `BranchCResult` message once translation and TTS complete.
{
"type": "branch_c_triggered",
"sentence_id": 3
}
Confirms receipt of a `conversation_exit` control message. The server is now draining in-flight pipelines. The `ConversationArchive` (if applicable) and the WebSocket close frame follow shortly after.
{
"type": "exit_acknowledged"
}
Delivered at session end when `save_conversation` was true in the handshake and at least one sentence was processed. Contains a Base64-encoded tar.gz archive with ASR transcripts, translations, reformulations, and raw audio. Sent before the WebSocket close frame. If archive construction fails server-side, an `ErrorMessage` with `service: "Conversation Archive"` is sent instead - the client should handle both cases when `save_conversation` is true.
{
"type": "conversation_archive",
"format": "tar.gz",
"data": "H4sIAAAAAAAAAwspKk0tLk5MTQUAiO9PLBAAAAA="
}
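Unpacking the archive takes two steps: Base64-decode `data`, then read the result as a gzip-compressed tar stream. A minimal sketch using only the Python standard library follows. `unpack_archive` is a hypothetical helper, and the file names inside the archive are not specified here, so the code simply returns whatever members it finds.

```python
import base64
import io
import tarfile
from typing import Dict


def unpack_archive(message: dict) -> Dict[str, bytes]:
    """Decode a conversation_archive message into {member name: file bytes}."""
    if message.get("type") != "conversation_archive" or message.get("format") != "tar.gz":
        raise ValueError("not a tar.gz conversation_archive message")
    raw = base64.b64decode(message["data"])
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = archive.extractfile(member)
            if extracted is not None:
                files[member.name] = extracted.read()
    return files
```

Reading members into memory instead of extracting to disk sidesteps path handling for member names. For long sessions with raw audio, writing to a vetted directory may be more practical.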
Sent when a service fails for a specific sentence or when a control message cannot be processed. Does not close the session - streaming continues.
{
"type": "error",
"service": "VAD",
"message": "VAD unavailable."
}
{
"type": "error",
"service": "Translation Pipeline",
"message": "Translation service request timed out.",
"sentence_id": 3
}
{
"type": "error",
"service": "Control",
"message": "sentence_id must be an integer"
}
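The three examples above cover a service-wide failure, a failure tied to one sentence, and a rejected control message. A client can route them differently, as in the sketch below. Treating `service: "Control"` as the marker for control-message errors is an assumption based on the third example, and `classify_error` is a hypothetical name.

```python
from typing import Optional, Tuple


def classify_error(message: dict) -> Tuple[str, Optional[int]]:
    """Return (scope, sentence_id), where scope is "control", "sentence", or "session"."""
    if message.get("type") != "error":
        raise ValueError("not an error message")
    sentence_id = message.get("sentence_id")
    if message.get("service") == "Control":
        # Assumed convention: control-message failures report service "Control".
        return "control", sentence_id
    if sentence_id is not None:
        return "sentence", sentence_id
    return "session", None
```

Since an error does not close the session, none of these scopes should stop the audio stream. They only decide where the message is shown.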
Retrieve the list of supported BCP-47 language codes. Standard HTTP GET - not a WebSocket channel.
Get supported languages
Accepts the following message:
Sorted list of BCP-47 locale codes supported by the ASR engine
[
"ar-SA",
"de-DE",
"en-GB",
"en-US",
"fr-FR"
]
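One use of this list is to check a handshake configuration before opening the WebSocket. The sketch below takes the already-fetched list as input, since the URL of the endpoint is not shown here. `unsupported_languages` is a hypothetical helper, and exact, case-sensitive matching of codes is an assumption.

```python
from typing import Iterable, List


def unsupported_languages(handshake: dict, supported: Iterable[str]) -> List[str]:
    """Return a description of each handshake language field not in the supported list."""
    supported_set = set(supported)
    problems: List[str] = []
    for field in ("language_medic", "language_patient"):
        code = handshake.get(field)
        if code not in supported_set:
            problems.append(f"{field}={code!r}")
    return problems
```

An empty result means both language fields appear in the list. A missing field is reported with the value `None`.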