Orchestrator WebSocket API 1.5.2

WebSocket API for async low-latency audio translation designed for medical consultations.

Main Features:

  • Speech-to-Text (ASR)
  • Voice Activity Detection (VAD)
  • Bidirectional async low-latency translation
  • Contextual reformulation
  • Text-to-Speech (TTS) with adjustable speed
  • Back-translation for validation
  • Alternate speaker hypothesis (Branch C)
  • In-session control messages (flag, correct, speed)

Architecture: The system uses a pipeline architecture with parallel branches per detected sentence:

  • Branch A: Literal translation -> TTS -> back-translation
  • Branch B: Reformulation -> translation -> TTS -> back-translation
  • Branch C: Same as A but with speaker roles flipped (optional, see enable_dual_direction)
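
As a rough illustration of the branch layout above, the following Python sketch runs the branches for one sentence concurrently with asyncio. The stage functions (`translate`, `reformulate`, `synthesize`), the `"en->fr"` direction strings, and the result keys are hypothetical stubs chosen for the example, not the server's implementation; only the composition of the branches follows the description.

```python
import asyncio

# Hypothetical stand-ins for the real services; each just tags its input.
async def translate(text: str, direction: str) -> str:
    await asyncio.sleep(0)
    return f"[{direction}] {text}"

async def reformulate(text: str) -> str:
    await asyncio.sleep(0)
    return f"reformulated({text})"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)
    return text.encode("utf-8")

def flip(direction: str) -> str:
    """Swap source and target, e.g. 'en->fr' becomes 'fr->en'."""
    src, dst = direction.split("->")
    return f"{dst}->{src}"

async def literal_branch(text: str, direction: str) -> dict:
    # Branch A shape: translation -> TTS -> back-translation.
    translated = await translate(text, direction)
    audio = await synthesize(translated)
    back = await translate(translated, flip(direction))
    return {"translation": translated, "audio": audio, "back_translation": back}

async def reformulated_branch(text: str, direction: str) -> dict:
    # Branch B shape: reformulation first, then the same steps as Branch A.
    return await literal_branch(await reformulate(text), direction)

async def process_sentence(text: str, direction: str,
                           enable_dual_direction: bool = False) -> dict:
    branches = {
        "A": literal_branch(text, direction),
        "B": reformulated_branch(text, direction),
    }
    if enable_dual_direction:
        # Branch C: same as A with the speaker roles (direction) flipped.
        branches["C"] = literal_branch(text, flip(direction))
    results = await asyncio.gather(*branches.values())
    return dict(zip(branches.keys(), results))
```

Calling `asyncio.run(process_sentence("hello", "en->fr", enable_dual_direction=True))` returns one entry per branch, which mirrors how a client receives separate result messages for the same sentence.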

Processing Flow:

  1. Client sends HandshakeRequest with language/gender/speed configuration
  2. Server responds with HandshakeSuccess
  3. Client streams raw PCM audio (16 kHz, mono, 16-bit)
  4. VAD detects sentence boundaries
  5. ASR -> Translation -> TTS run in parallel across branches
  6. Results sent asynchronously as JSON with Base64 audio
  7. Client may send JSON control messages at any time during streaming
  8. Client sends ConversationExit to end the session gracefully, or the server closes automatically on inactivity timeout
  9. Server sends ConversationArchive if save_conversation was true, then closes

Tags:

  • #WebSocket
  • #Audio
  • #Translation
  • #Medical

Servers

  • staging.fr.vokaalia.com/v1/orchestrator (protocol: ws, server name: orchestrator)

    Development server for audio streaming and orchestration.

    ⚠️ Connection Requirements (HTTP/1.1 Only): This endpoint strictly requires a standard HTTP/1.1 WebSocket Handshake.

    Clients must send the following headers:

    • Connection: Upgrade
    • Upgrade: websocket
    • Sec-WebSocket-Version: 13
    • Sec-WebSocket-Key: [Base64-encoded 16-byte random key]

    Note: HTTP/2 is not supported for the handshake. Clients like curl or strict proxies must force HTTP/1.1 or the required Upgrade headers will be stripped.
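
The header requirements above can be sketched in Python using only the standard library. The helper names are illustrative, and the host and path are supplied by the caller. The `Sec-WebSocket-Accept` check is not mentioned in this document; it comes from RFC 6455, which defines how a server derives its response from the client key.

```python
import base64
import hashlib
import os

# Fixed GUID defined by RFC 6455 for deriving Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def make_websocket_key() -> str:
    """Return a Base64-encoded 16-byte random key for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")

def expected_accept(key: str) -> str:
    """Value a compliant server returns in Sec-WebSocket-Accept for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

def build_upgrade_request(host: str, path: str, key: str) -> bytes:
    """Build the raw HTTP/1.1 upgrade request with the required headers."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: Upgrade",
        "Upgrade: websocket",
        "Sec-WebSocket-Version: 13",
        f"Sec-WebSocket-Key: {key}",
        "",  # blank line that terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

In practice a WebSocket client library builds this request itself; the sketch is mainly useful for checking what a proxy or debugging tool actually forwards.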

Operations

  • PUB /ws/audio

    Main WebSocket channel for audio streaming, translation results, and session control.

    Communication Sequence:

    1. Client -> Server: HandshakeRequest (JSON)
    2. Server -> Client: HandshakeSuccess (JSON)
    3. Client -> Server: AudioChunk (bytes) - continuous streaming
    4. Server -> Client: ASRResult, BranchAResult, BranchBResult, and (when Branch C is enabled or triggered) BranchCResult (JSON) - asynchronous
    5. Client -> Server: control messages (JSON) - optional, any time during streaming
    6. Server -> Client: acknowledgment messages (JSON) - in response to control messages
    7. Client -> Server: ConversationExit (JSON) - to end the session gracefully
    8. Server -> Client: ExitAcknowledged (JSON) - confirms exit received
    9. Server -> Client: ConversationArchive (JSON) - on session end, if save_conversation was true

    Control messages (Client -> Server): JSON text frames sent at any point after the handshake.

    • set_speed - change TTS speed for subsequent sentences
    • flag - flag a sentence or the whole conversation for review
    • correct_speaker - correct a misidentified speaker role
    • trigger_branch_c - run the alternate-speaker hypothesis for one specific sentence on demand
    • conversation_exit - end the session gracefully (triggers cleanup and optional archive delivery)
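
A minimal sketch of how a client might serialize these control messages as JSON text frames. The five message names and the `sentence_id` field come from this document. The `type` discriminator key and the example `sentence_id` values are assumptions, since the payload schemas are not reproduced here; check them against the actual schema before use.

```python
import json
from typing import Any

# Control message names as listed above.
CONTROL_TYPES = {
    "set_speed",
    "flag",
    "correct_speaker",
    "trigger_branch_c",
    "conversation_exit",
}

def control_message(kind: str, **fields: Any) -> str:
    """Serialize a control message; fields set to None are omitted."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind}")
    # ASSUMPTION: "type" as the discriminator key is a guess, not from the spec.
    payload = {"type": kind}
    payload.update({k: v for k, v in fields.items() if v is not None})
    return json.dumps(payload)

def flag(sentence_id: Any = None) -> str:
    """Flag one sentence, or the whole conversation when sentence_id is omitted."""
    return control_message("flag", sentence_id=sentence_id)
```

The returned string would be sent as a text frame, while audio chunks go out as binary frames on the same connection.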

    Timeouts:

    • Handshake: 10 seconds
    • Client inactivity: 60 seconds
    • Max session: 9000 seconds
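
To make the three timeouts concrete, here is a small client-side sketch that predicts which limit would apply at a given moment. The class and reason strings are illustrative. It assumes that the session limit is measured from connection time and that inactivity is measured from the last client frame; the document does not say where either clock starts.

```python
from dataclasses import dataclass
from typing import Optional

HANDSHAKE_TIMEOUT_S = 10
INACTIVITY_TIMEOUT_S = 60
MAX_SESSION_S = 9000  # 150 minutes

@dataclass
class SessionClock:
    connected_at: float
    handshake_at: Optional[float] = None      # set when HandshakeRequest is sent
    last_activity_at: Optional[float] = None  # set on every client frame

    def expired_reason(self, now: float) -> Optional[str]:
        """Return the timeout that has elapsed at `now`, or None."""
        if self.handshake_at is None:
            if now - self.connected_at > HANDSHAKE_TIMEOUT_S:
                return "handshake_timeout"
            return None
        # ASSUMPTION: the session limit counts from connection time.
        if now - self.connected_at > MAX_SESSION_S:
            return "max_session"
        last = self.last_activity_at
        if last is None:
            last = self.handshake_at
        if now - last > INACTIVITY_TIMEOUT_S:
            return "client_inactivity"
        return None
```

A client could poll `expired_reason(time.monotonic())` to warn the user shortly before the server closes the connection.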

    WebSocket Close Codes:

    • 1000 - Normal closure (handshake timeout, client inactivity, clean session end)
    • 1003 - Unsupported data (invalid JSON in handshake)
    • 1008 - Policy violation (missing required fields, unsupported language)
    • 1011 - Internal error (server processing error)
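
The close codes above map naturally onto an enum. The sketch below is one possible client-side handling; the reconnect policy in `should_reconnect` is an assumption made for illustration and is not specified by this API.

```python
from enum import IntEnum

class CloseCode(IntEnum):
    NORMAL = 1000            # handshake timeout, client inactivity, clean session end
    UNSUPPORTED_DATA = 1003  # invalid JSON in handshake
    POLICY_VIOLATION = 1008  # missing required fields, unsupported language
    INTERNAL_ERROR = 1011    # server processing error

_DESCRIPTIONS = {
    CloseCode.NORMAL: "Normal closure",
    CloseCode.UNSUPPORTED_DATA: "Unsupported data",
    CloseCode.POLICY_VIOLATION: "Policy violation",
    CloseCode.INTERNAL_ERROR: "Internal error",
}

def describe(code: int) -> str:
    """Human-readable label for a close code documented above."""
    try:
        return _DESCRIPTIONS[CloseCode(code)]
    except ValueError:
        return f"Unknown close code {code}"

def should_reconnect(code: int) -> bool:
    # ASSUMPTION: only a server-side error is worth retrying unchanged;
    # 1003 and 1008 indicate a handshake the client must fix first.
    return code == CloseCode.INTERNAL_ERROR
```

Note that 1000 covers timeouts as well as clean exits, so a client cannot tell the two apart from the close code alone.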

    Send audio chunks and session control messages

    Operation ID: sendAudioStream

    Accepts one of the following messages:

    • #0 Initial session configuration

      First JSON message sent by the client. It must arrive within 10 seconds of connection.

      Message ID: handshakeRequest
      Payload: object

    • #1 PCM audio chunk

      Raw audio data streamed continuously after the handshake.

      Message ID: audioChunk
      Payload: string

      Raw PCM audio - s16le (signed 16-bit little-endian), 16 kHz, mono. Recommended chunk size: 4096 bytes (2048 samples, or 128 ms of audio).

      Examples
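
A minimal sketch of splitting a PCM buffer into chunks of the recommended size. The function names are illustrative; the constants follow the format stated above (16-bit samples, 16 kHz, mono, 4096-byte chunks).

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2      # s16le: 16-bit samples
CHUNK_BYTES = 4096        # recommended chunk size

def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive chunks; the last one may be shorter."""
    if chunk_bytes <= 0 or chunk_bytes % BYTES_PER_SAMPLE:
        raise ValueError("chunk size must be a positive multiple of the sample size")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]

def chunk_duration_s(chunk: bytes) -> float:
    """Duration of mono s16le audio at 16 kHz, in seconds."""
    return len(chunk) / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)
```

Each yielded chunk would be sent as one binary WebSocket frame. A full 4096-byte chunk holds 2048 samples, which is 0.128 seconds of audio.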

    • #2 Change TTS speed

      Adjust TTS playback speed for all sentences processed after this message. Takes effect on the next sentence - the pipeline currently in flight is not affected.

      Message ID: setSpeed
      Payload: object

    • #3 Flag sentence or conversation

      Mark a sentence (or the whole session) for later review. Omit `sentence_id` to flag the entire conversation.

      Message ID: flag
      Payload: object

    • #4 Correct speaker identification

      Override the speaker role assigned by ASR language detection for a sentence. Useful when both speakers share a language or ASR misidentified the speaker.

      Message ID: correctSpeaker
      Payload: object

    • #5 Request alternate-speaker hypothesis for one sentence

      Run Branch C on demand for a single sentence, regardless of whether `enable_dual_direction` was set at handshake. Useful when the client suspects a speaker misidentification on a specific sentence and wants the alternate translation without paying the cost for every sentence. The server sends an immediate `BranchCTriggered` acknowledgment, then the `BranchCResult` (or an `error`) arrives asynchronously once translation and TTS complete.

      Message ID: triggerBranchC
      Payload: object

    • #6 End the session gracefully

      Signals the server to end the session cleanly. The server will:

      1. Send `ExitAcknowledged` immediately
      2. Await any in-flight audio pipelines
      3. Send `ConversationArchive` if `save_conversation` was true
      4. Close the WebSocket with code 1000

      This is preferred over simply closing the connection: it ensures in-flight sentences are processed to completion and the archive is delivered before the socket closes.

      Message ID: conversationExit
      Payload: object

      Examples
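
The shutdown sequence can be tracked on the client with a small helper like the one below. It is a sketch under assumptions: the `type` key and its values (`exit_acknowledged`, `conversation_archive`, `error`) are hypothetical names for the messages described here, while the `service: "Conversation Archive"` marker comes from the archive description in this document.

```python
from typing import Iterable

def track_exit(messages: Iterable[dict]) -> dict:
    """Summarize the messages received after sending conversation_exit."""
    state = {
        "acknowledged": False,    # ExitAcknowledged seen
        "archive": None,          # ConversationArchive message, if delivered
        "archive_failed": False,  # archive construction failed server-side
        "late_results": [],       # results from pipelines still in flight
    }
    for msg in messages:
        kind = msg.get("type")  # ASSUMPTION: discriminator key is "type"
        if kind == "exit_acknowledged":
            state["acknowledged"] = True
        elif kind == "conversation_archive":
            state["archive"] = msg
        elif kind == "error" and msg.get("service") == "Conversation Archive":
            state["archive_failed"] = True
        else:
            state["late_results"].append(msg)
    return state
```

Because the archive is only sent when at least one sentence was processed, the close frame with code 1000, rather than the archive, is the reliable end-of-session signal.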

  • SUB /ws/audio

    Receive transcription, translation results, and acknowledgments

    Operation ID: receiveResults

    Accepts one of the following messages:

    • #0 Configuration confirmation

      Server accepted the configuration - audio streaming may begin

      Message ID: handshakeSuccess
      Payload: object

    • #1 Transcription result

      ASR transcription of a detected sentence with speaker identification

      Message ID: asrResult
      Payload: object

    • #2 Literal translation result

      Literal translation with back-translation and TTS audio

      Message ID: branchAResult
      Payload: object

    • #3 Reformulated translation result

      Contextually reformulated translation with back-translation and TTS audio

      Message ID: branchBResult
      Payload: object

    • #4 Alternate speaker hypothesis result

      Literal translation assuming the opposite speaker to what ASR detected. Sent when `enable_dual_direction` was true in the handshake, or when the client requested it for a specific sentence with `trigger_branch_c`. Always arrives after the corresponding `BranchAResult` for the same sentence.

      Message ID: branchCResult
      Payload: object

      Examples
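
Since results for one sentence arrive as separate asynchronous messages, a client typically groups them before comparing the detected-speaker and alternate-speaker translations. The sketch below assumes each result carries a `sentence_id` (the field named in the flag description) and a hypothetical `type` key with values such as `branch_a_result`; neither the key nor the values are confirmed by this document.

```python
from collections import defaultdict
from typing import Any, Optional, Tuple

class SentenceBoard:
    """Group result messages per sentence and check the A-before-C ordering."""

    def __init__(self) -> None:
        self.by_sentence = defaultdict(dict)

    def add(self, msg: dict) -> None:
        sid, kind = msg["sentence_id"], msg["type"]  # ASSUMPTION: key names
        if kind == "branch_c_result" and "branch_a_result" not in self.by_sentence[sid]:
            # The description states that Branch C always follows Branch A.
            raise ValueError(f"BranchCResult before BranchAResult for sentence {sid}")
        self.by_sentence[sid][kind] = msg

    def alternates(self, sid: Any) -> Tuple[Optional[dict], Optional[dict]]:
        """Return (Branch A result, Branch C result) for a sentence, if present."""
        entry = self.by_sentence.get(sid, {})
        return entry.get("branch_a_result"), entry.get("branch_c_result")
```

A UI could use `alternates()` to show both hypotheses side by side and offer `correct_speaker` when the alternate reads better.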

    • #5 Speed change acknowledgment

      Confirms the new TTS speed value after a `set_speed` control message

      Message ID: speedUpdated
      Payload: object

    • #6 Flag acknowledgment

      Confirms receipt of a `flag` control message

      Message ID: flagAck
      Payload: object

    • #7 Speaker correction acknowledgment

      Confirms receipt of a `correct_speaker` control message

      Message ID: correctionAck
      Payload: object

    • #8 Branch C on-demand acknowledgment

      Immediate acknowledgment that a `trigger_branch_c` request was accepted. The actual result arrives later as a `BranchCResult` message once translation and TTS complete.

      Message ID: branchCTriggered
      Payload: object

    • #9 Session exit acknowledgment

      Confirms receipt of a `conversation_exit` control message. The server is now draining in-flight pipelines. The `ConversationArchive` (if applicable) and the WebSocket close frame follow shortly after.

      Message ID: exitAcknowledged
      Payload: object

    • #10 Conversation archive

      Delivered at session end when `save_conversation` was true in the handshake and at least one sentence was processed. Contains a Base64-encoded tar.gz archive with ASR transcripts, translations, reformulations, and raw audio. Sent before the WebSocket close frame. If archive construction fails server-side, an `ErrorMessage` with `service: "Conversation Archive"` is sent instead - the client should handle both cases when `save_conversation` is true.

      Message ID: conversationArchive
      Payload: object

      Examples
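
A minimal sketch of unpacking the Base64-encoded tar.gz payload in memory with the Python standard library. The function names are illustrative, and the name of the JSON field that carries the Base64 string is not given in this document, so the helpers take the string directly.

```python
import base64
import io
import tarfile
from typing import List

def list_archive_members(archive_b64: str) -> List[str]:
    """Return the sorted names of regular files in the archive."""
    raw = base64.b64decode(archive_b64)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

def read_member(archive_b64: str, name: str) -> bytes:
    """Read one file from the archive without extracting it to disk."""
    raw = base64.b64decode(archive_b64)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        member = tar.extractfile(name)  # raises KeyError if the name is absent
        if member is None:
            raise ValueError(f"{name} is not a regular file")
        return member.read()
```

Reading members in memory avoids writing archive paths to disk unchecked, which matters when the content is patient-related.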

    • #11 Error notification

      Sent when a service fails for a specific sentence or when a control message cannot be processed. Does not close the session - streaming continues.

      Message ID: errorMessage
      Payload: object

  • SUB /languages

    Retrieve the list of supported BCP-47 language codes. Standard HTTP GET - not a WebSocket channel.

    Get supported languages

    Operation ID: getLanguages

    Accepts the following message:

    Supported languages

    Sorted list of BCP-47 locale codes supported by the ASR engine

    Message ID: languagesResponse
    Payload: array<string>

    Examples
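
A short sketch of validating a language choice against this response before opening the WebSocket. The sample codes in the usage note are illustrative, not the actual supported list. The binary search assumes the list is sorted in plain string order and that codes must match exactly, including case.

```python
import json
from bisect import bisect_left
from typing import List, Sequence

def parse_languages(body: str) -> List[str]:
    """Parse the response body, which should be a JSON array of strings."""
    data = json.loads(body)
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON array of strings")
    return data

def is_supported(languages: Sequence[str], code: str) -> bool:
    """Binary search in the sorted list of BCP-47 codes."""
    i = bisect_left(languages, code)
    return i < len(languages) and languages[i] == code
```

For example, with a body of `["de-DE", "en-US", "fr-FR"]`, `is_supported(langs, "en-US")` is true and `is_supported(langs, "es-ES")` is false. Checking up front avoids a 1008 close for an unsupported language during the handshake.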

Messages

  Each message is described in full under Operations above. This index lists the message number, message ID, payload type, and title.

  • #1 handshakeRequest (object) - Initial session configuration
  • #2 audioChunk (string) - PCM audio chunk
  • #3 setSpeed (object) - Change TTS speed
  • #4 flag (object) - Flag sentence or conversation
  • #5 correctSpeaker (object) - Correct speaker identification
  • #6 triggerBranchC (object) - Request alternate-speaker hypothesis for one sentence
  • #7 conversationExit (object) - End the session gracefully
  • #8 handshakeSuccess (object) - Configuration confirmation
  • #9 asrResult (object) - Transcription result
  • #10 branchAResult (object) - Literal translation result
  • #11 branchBResult (object) - Reformulated translation result
  • #12 branchCResult (object) - Alternate speaker hypothesis result
  • #13 speedUpdated (object) - Speed change acknowledgment
  • #14 flagAck (object) - Flag acknowledgment
  • #15 correctionAck (object) - Speaker correction acknowledgment
  • #16 branchCTriggered (object) - Branch C on-demand acknowledgment
  • #17 exitAcknowledged (object) - Session exit acknowledgment
  • #18 conversationArchive (object) - Conversation archive
  • #19 errorMessage (object) - Error notification
  • #20 languagesResponse (array<string>) - Supported languages
