Orchestrator WebSocket API 1.3.0

WebSocket API for asynchronous, low-latency audio translation, designed for medical consultations.

Main Features:

  • Speech-to-Text (ASR)
  • Voice Activity Detection (VAD)
  • Bidirectional, asynchronous, low-latency translation
  • Contextual reformulation
  • Text-to-Speech (TTS)
  • Back-translation for validation

Architecture: The system uses a pipeline architecture with two parallel branches:

  • Branch A: Literal translation
  • Branch B: Translation with contextual reformulation

Processing Flow:

  1. Client sends handshake with language/gender configuration
  2. Client streams PCM audio (16kHz, mono, 16-bit)
  3. Automatic sentence detection via VAD
  4. ASR → Translation → TTS (parallel on 2 branches)
  5. Results sent to client in JSON with Base64 audio
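Step 5 implies that the client must Base64-decode the audio carried inside each JSON result. The following is a minimal sketch of that decoding step. The field name `audio` is a placeholder chosen for illustration, since the result schemas are not reproduced in this document.

```python
import base64
import json


def extract_audio(message_text: str, field: str = "audio") -> bytes:
    """Decode the Base64 audio carried in a JSON result message.

    `field` is a hypothetical key name; replace it with the key used by
    the actual result schema.
    """
    message = json.loads(message_text)
    encoded = message.get(field)
    if not encoded:
        # Some messages may carry no audio at all.
        return b""
    # validate=True rejects characters outside the Base64 alphabet.
    return base64.b64decode(encoded, validate=True)
```

The audio container or codec of the decoded bytes is not stated here, so the sketch stops at raw bytes.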

Servers

  • staging.fr.vokaalia.com/v1/orchestrator (protocol: ws, server name: orchestrator)

    Development server for audio streaming and orchestration.

    ⚠️ Connection Requirements (HTTP/1.1 Only): This endpoint strictly requires a standard HTTP/1.1 WebSocket Handshake.

    Clients must send the following headers:

    • Connection: Upgrade
    • Upgrade: websocket
    • Sec-WebSocket-Version: 13
    • Sec-WebSocket-Key: [Base64-encoded 16-byte random key]

    Note: HTTP/2 is not supported for the handshake. With clients such as curl, or behind strict proxies, force HTTP/1.1 (for example, with curl's --http1.1 option) so that the client never attempts an HTTP/2 connection, which would strip the required Upgrade headers.
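The required headers can be produced with the standard library alone. The sketch below generates a Sec-WebSocket-Key and computes the Sec-WebSocket-Accept value that RFC 6455 specifies for a compliant server, which a hand-rolled client can use to verify the response. The function names are illustrative; most WebSocket libraries perform these steps internally.

```python
import base64
import hashlib
import os

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def make_websocket_key() -> str:
    """Return a Base64-encoded 16-byte random value for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key: str) -> str:
    """Return the Sec-WebSocket-Accept value expected for the given key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_headers(key: str) -> dict[str, str]:
    """Build the four headers listed above for the HTTP/1.1 upgrade request."""
    return {
        "Connection": "Upgrade",
        "Upgrade": "websocket",
        "Sec-WebSocket-Version": "13",
        "Sec-WebSocket-Key": key,
    }
```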

Operations

  • PUB /ws/audio

    Main WebSocket channel for audio streaming and receiving translation results.

    Communication Sequence:

    1. Client → Server: HandshakeRequest (JSON)
    2. Server → Client: HandshakeSuccess (JSON)
    3. Client → Server: AudioChunk (bytes) [continuous streaming]
    4. Server → Client: ASRResult, BranchAResult, BranchBResult (JSON) [asynchronous]
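As a rough illustration of step 1, the sketch below builds the HandshakeRequest JSON and checks the two required fields named in this document (language_medic and language_patient) before anything is sent. The `build_handshake` helper and the `**extra` mechanism are assumptions for illustration; any further fields, such as the gender configuration, should use the names from the actual schema.

```python
import json

# Required fields as named in the close-code description of this document.
REQUIRED_FIELDS = ("language_medic", "language_patient")


def build_handshake(language_medic: str, language_patient: str, **extra: str) -> str:
    """Serialize the first JSON message of the session.

    `extra` carries any additional configuration keys; their names are
    not defined here and must match the server's schema.
    """
    config = {
        "language_medic": language_medic,
        "language_patient": language_patient,
        **extra,
    }
    missing = [name for name in REQUIRED_FIELDS if not config.get(name)]
    if missing:
        # Failing locally avoids a server-side close for missing fields.
        raise ValueError("missing required fields: " + ", ".join(missing))
    return json.dumps(config)
```

The language values passed in would typically come from the list returned by /languages.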

    Timeouts:

    • Handshake: 10 seconds
    • Client inactivity: 60 seconds
    • Max session: 9000 seconds (2.5 hours)
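A client may want to track these limits locally so that it can reconnect before the server closes the session. The sketch below is one possible approach, with timestamps passed in explicitly (for example from `time.monotonic()`). The `SessionClock` class is hypothetical, and it assumes that every audio chunk sent resets the inactivity timer.

```python
from dataclasses import dataclass

# Limits taken from the timeout list above.
HANDSHAKE_TIMEOUT_S = 10
INACTIVITY_TIMEOUT_S = 60
MAX_SESSION_S = 9000


@dataclass
class SessionClock:
    """Client-side estimate of how long the session can stay open."""

    started_at: float
    last_audio_at: float

    def note_audio_sent(self, now: float) -> None:
        """Record that an audio chunk was sent at time `now`."""
        self.last_audio_at = now

    def seconds_until_close(self, now: float) -> float:
        """Return the time left before the earliest limit is reached."""
        idle_left = INACTIVITY_TIMEOUT_S - (now - self.last_audio_at)
        session_left = MAX_SESSION_S - (now - self.started_at)
        return max(0.0, min(idle_left, session_left))
```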

    WebSocket Close Codes (1XXX): The server may close the connection with the following standard WebSocket status codes:

    • 1000 (Normal Closure):

      • Handshake timeout (client didn't send configuration within 10s).
      • Client inactivity (no audio received for 60s).
    • 1003 (Unsupported Data):

      • Invalid JSON format in handshake message.
    • 1008 (Policy Violation):

      • Missing required fields in handshake (e.g., language_medic, language_patient).
      • Unsupported language: The provided language code is not supported by the ASR engine.
    • 1011 (Internal Error):

      • Server processing error or failure to send handshake confirmation.
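The close codes above can be mapped to client behavior with a small lookup table. In the sketch below, the retry policy is an assumption, not something this API specifies: only 1011 is treated as worth retrying with an unchanged configuration, because 1003 and 1008 indicate a handshake the client must fix first.

```python
# Reasons summarized from the close-code list above.
CLOSE_REASONS = {
    1000: "Normal closure: handshake timeout or client inactivity",
    1003: "Unsupported data: handshake message was not valid JSON",
    1008: "Policy violation: missing handshake field or unsupported language",
    1011: "Internal error: server processing failure",
}

# Assumption: only a server-side failure is retried without changes.
RETRYABLE_CODES = {1011}


def describe_close(code: int) -> tuple[str, bool]:
    """Return a human-readable reason and whether an unchanged retry is sensible."""
    reason = CLOSE_REASONS.get(code, "Close code not listed in this document")
    return reason, code in RETRYABLE_CODES
```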

    Send configuration and audio

    Operation ID: sendAudioStream

    Accepts one of the following messages:

    • #0 Initial session configuration

      First JSON message sent by client to configure the session

      object

    • #1 PCM audio chunk

      Raw audio data streamed continuously after handshake

      Payload: string (format: binary)

      Raw PCM audio (s16le, 16kHz, mono). Recommended size: 4096 bytes per chunk.
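The recommended chunk size translates directly into a duration: at 16 kHz, mono, 2 bytes per sample, the stream is 32,000 bytes per second, so a 4096-byte chunk holds 2048 samples, or 128 ms of audio. The sketch below works through that arithmetic and splits a PCM buffer into chunks; the helper names are illustrative.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # s16le: signed 16-bit little-endian
CHANNELS = 1              # mono
CHUNK_BYTES = 4096        # recommended chunk size


def chunk_duration_ms(chunk_bytes: int = CHUNK_BYTES) -> float:
    """Return how many milliseconds of audio a chunk of this size holds."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return 1000 * chunk_bytes / bytes_per_second


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of the buffer; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

For one second of audio (32,000 bytes), `iter_chunks` yields seven full chunks followed by a final chunk of 3328 bytes.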

  • SUB /ws/audio

    Receive transcription and translation results

    Operation ID: receiveResults

    Accepts one of the following messages:

    • #0 Configuration confirmation

      Confirmation that server accepted the configuration

      object

    • #1 Transcription result

      ASR transcription of a detected sentence with speaker identification

      object

    • #2 Literal translation result (Branch A)

      Literal translation with back-translation and TTS audio

      object

    • #3 Reformulated translation result (Branch B)

      Translation with contextual reformulation, back-translation and TTS audio

      object

    • #4 Error message

      Error notification or service degradation

      object
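Because these five message kinds arrive asynchronously on one channel, a client needs some way to route each incoming JSON message to the right handler. The sketch below shows one possible dispatcher. It assumes, purely for illustration, that each message carries a discriminator key named `type` whose value matches a message ID such as ASRResult; the real discriminator must be taken from the actual schemas.

```python
import json
from typing import Callable

Handler = Callable[[dict], None]


def make_dispatcher(
    handlers: dict[str, Handler], kind_field: str = "type"
) -> Callable[[str], bool]:
    """Build a function that routes a JSON text frame to a handler.

    `kind_field` is a hypothetical discriminator key, not one confirmed
    by this document.
    """

    def dispatch(message_text: str) -> bool:
        message = json.loads(message_text)
        kind = message.get(kind_field) if isinstance(message, dict) else None
        handler = handlers.get(kind) if isinstance(kind, str) else None
        if handler is None:
            # Unknown or untyped message: report it as unhandled.
            return False
        handler(message)
        return True

    return dispatch
```

Returning False for unknown kinds, rather than raising, lets a client tolerate message types added in later API versions.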

  • SUB /languages

    Retrieve the list of languages supported by the ASR engine. This is a standard HTTP GET request.

    Get supported languages

    Operation ID: getLanguages

    Accepts the following message:

    Supported Languages

    Sorted list of BCP-47 locale codes supported by the ASR engine

    array<string>
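Checking a language code against this list before opening the WebSocket avoids a 1008 close for an unsupported language. The sketch below parses a response body and looks up a code with binary search. The helper names and the sample codes are illustrative only, and the HTTP request itself is left out.

```python
import json
from bisect import bisect_left


def parse_languages(body: str) -> list[str]:
    """Parse the response body into a sorted list of locale codes."""
    languages = json.loads(body)
    if not isinstance(languages, list) or not all(
        isinstance(code, str) for code in languages
    ):
        raise ValueError("expected a JSON array of strings")
    # Sort defensively so the binary search below stays valid.
    return sorted(languages)


def is_supported(code: str, languages: list[str]) -> bool:
    """Return True if `code` appears in the sorted list (exact match)."""
    index = bisect_left(languages, code)
    return index < len(languages) and languages[index] == code
```

The comparison is exact and case-sensitive, so codes should be passed in the same form the server returns them.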

Messages

  • #1 Initial session configuration

    First JSON message sent by client to configure the session

    Message ID: HandshakeRequest
    object

  • #2 PCM audio chunk

    Raw audio data streamed continuously after handshake

    Message ID: AudioChunk
    Payload: string (format: binary)

    Raw PCM audio (s16le, 16kHz, mono). Recommended size: 4096 bytes per chunk.

  • #3 Configuration confirmation

    Confirmation that server accepted the configuration

    Message ID: HandshakeSuccess
    object

  • #4 Transcription result

    ASR transcription of a detected sentence with speaker identification

    Message ID: ASRResult
    object

  • #5 Literal translation result (Branch A)

    Literal translation with back-translation and TTS audio

    Message ID: BranchAResult
    object

  • #6 Reformulated translation result (Branch B)

    Translation with contextual reformulation, back-translation and TTS audio

    Message ID: BranchBResult
    object

  • #7 Error message

    Error notification or service degradation

    Message ID: ErrorMessage
    object

  • #8 Supported Languages

    Sorted list of BCP-47 locale codes supported by the ASR engine

    Message ID: LanguagesResponse
    array<string>
