Audio Configuration

Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows. The API supports both containerized audio formats (such as Ogg and WebM) and raw PCM audio streams.

Please ensure your audio files conform to the specifications listed below. Let us know if you need help with audio formatting or API request configuration.

Supported Audio Formats

Container-based Audio

The following audio containers and their associated codecs are supported by the Corti API:

Container	Supported Encodings	Comments
Ogg	Opus, Vorbis	Excellent quality at low bandwidth
WebM	Opus, Vorbis	Excellent quality at low bandwidth
MP4/M4A	AAC, MP3	Compression may degrade transcription quality
MP3	MP3	Compression may degrade transcription quality

Allowable MIME types for streamed audio

This parameter is optional but recommended.

The audioFormat parameter can be defined in the transcribe and streams configuration to declare the audio format the speech-to-text system should expect in the incoming audio stream.

Format	Accepted MIME types
Ogg	`audio/ogg`
WebM	`audio/webm`
Opus	`audio/opus`
Vorbis	`audio/vorbis`
MP3	`audio/mpeg`, `audio/mp3`, `audio/mpeg3`
FLAC	`audio/flac`
M4A / AAC	`audio/mp4`, `audio/m4a`

For container formats (audio/ogg, audio/webm), you can optionally specify a codecs parameter. Allowed codecs are opus and vorbis.

Examples:

audio/ogg; codecs=opus
audio/webm; codecs=opus
audio/ogg; codecs=vorbis
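
As a minimal sketch of how a client might assemble these strings, the helper below builds an audioFormat value for a container type and rejects codecs outside the list above. The function name and constants are hypothetical and not part of the Corti API; only the MIME types and codec names come from this page.

```python
# Assumption: these sets mirror the container MIME types and codecs listed
# above. The helper itself is illustrative and not part of any SDK.
CONTAINER_TYPES = {"audio/ogg", "audio/webm"}
ALLOWED_CODECS = {"opus", "vorbis"}


def container_audio_format(mime_type, codec=None):
    """Return an audioFormat string such as 'audio/ogg; codecs=opus'."""
    if mime_type not in CONTAINER_TYPES:
        raise ValueError(f"codecs only applies to {sorted(CONTAINER_TYPES)}")
    if codec is None:
        # The codecs parameter is optional for container formats.
        return mime_type
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    return f"{mime_type}; codecs={codec}"
```

Calling `container_audio_format("audio/webm", "opus")` yields `audio/webm; codecs=opus`, matching the second example above.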

WAV files are supported for upload to the /recordings endpoint, but raw PCM audio should follow the approach outlined below.

Raw Audio

Raw pulse code modulation (PCM) audio is supported when the rate, channels, and bits parameters are defined in the configuration.

Allowable MIME types for raw audio configuration

This parameter is required for use with raw PCM audio.

The audioFormat parameter can be defined in the transcribe and streams configuration to declare the audio format the speech-to-text system should expect in the incoming audio stream.

Format	Accepted MIME types
Raw PCM	`audio/pcm`

For raw audio (audio/pcm), the parameters rate, channels, and bits must be defined.

Parameter	Type	Required	Possible Values
rate	int	`required`	`8000-48000`
channels	int	`required`	`1-2`
bits	int	`required`	`8`, `16`, `24`, or `32`
endian	str	`optional`	`little`, `big`
encoding	str	`optional`	`sint`, `uint`

Examples:

audio/pcm; rate=16000; channels=1; bits=16
audio/pcm; rate=44100; channels=2; bits=32
audio/pcm; rate=8000; channels=1; bits=8
audio/pcm; rate=48000; channels=2; bits=24
audio/pcm; rate=16000; channels=1; bits=16; endian=little; encoding=sint

When using raw PCM audio, 16-bit little-endian mono at 16 kHz is recommended.
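
The parameter table above can be turned into a small validation helper. The following is a sketch only: the function name is hypothetical, and the checks simply restate the ranges listed on this page rather than any server-side behaviour.

```python
def pcm_audio_format(rate, channels, bits, endian=None, encoding=None):
    """Build an 'audio/pcm; ...' string, checking values against the table above."""
    if not 8000 <= rate <= 48000:
        raise ValueError("rate must be between 8000 and 48000")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")
    if bits not in (8, 16, 24, 32):
        raise ValueError("bits must be 8, 16, 24, or 32")
    if endian is not None and endian not in ("little", "big"):
        raise ValueError("endian must be 'little' or 'big'")
    if encoding is not None and encoding not in ("sint", "uint"):
        raise ValueError("encoding must be 'sint' or 'uint'")

    parts = ["audio/pcm", f"rate={rate}", f"channels={channels}", f"bits={bits}"]
    # Optional parameters are appended only when the caller supplies them.
    if endian is not None:
        parts.append(f"endian={endian}")
    if encoding is not None:
        parts.append(f"encoding={encoding}")
    return "; ".join(parts)
```

For the recommended format, `pcm_audio_format(16000, 1, 16, endian="little", encoding="sint")` produces the last example string listed above.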

Audio streaming recommendations

Sample rate of 16 kHz

A 16 kHz sample rate captures the full range of human speech frequencies. Higher rates offer negligible recognition benefit while increasing computational cost.

Audio chunk size of 250 milliseconds

A 250 ms chunk size works well for both dictation and AI scribing workflows. Sending much smaller chunks more frequently can degrade recognition accuracy without improving latency.
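
For raw PCM, the chunk duration translates directly into a byte count: sample frames per chunk multiplied by bytes per frame. The sketch below shows that arithmetic; the function name is hypothetical, and the calculation does not apply to compressed container formats, where the byte size of 250 ms of audio varies.

```python
def pcm_chunk_bytes(rate, channels, bits, chunk_ms=250):
    """Bytes of raw PCM needed to cover chunk_ms of audio."""
    frames = rate * chunk_ms // 1000          # sample frames in one chunk
    bytes_per_frame = channels * bits // 8    # one sample per channel
    return frames * bytes_per_frame


print(pcm_chunk_bytes(16000, 1, 16))  # 8000
```

At the recommended 16 kHz, 16-bit mono format, each 250 ms chunk is 4000 frames of 2 bytes, or 8000 bytes.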

Stream at real-time speed

Audio should be streamed at or near real-time speed. Streaming audio faster than real time is not recommended and may cause buffering issues, degraded results, or stream termination. Pace audio chunks according to their actual audio duration.
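
When replaying a pre-recorded file over a stream, pacing has to be added by the client. The following is one possible sketch: it schedules each chunk against a monotonic clock so that timing drift does not accumulate. The `send` callable stands in for whatever transport you use (for example, a WebSocket send method) and is an assumption, as are the function and parameter names.

```python
import time


def stream_paced(chunks, chunk_seconds, send, sleep=time.sleep, clock=time.monotonic):
    """Send chunks no faster than real time.

    chunks        -- iterable of audio byte chunks
    chunk_seconds -- audio duration each chunk represents (e.g. 0.25)
    send          -- hypothetical callable that transmits one chunk
    """
    start = clock()
    for index, chunk in enumerate(chunks):
        # Chunk i is due i * chunk_seconds after the first chunk was sent.
        due = start + index * chunk_seconds
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        send(chunk)
```

Scheduling against the start time, rather than sleeping a fixed interval after each send, keeps the average rate at real time even when individual sends are slow.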

Microphone Configuration

Dictation

Setting	Recommendation	Rationale
echoCancellation	Off	Ensure clear, unfiltered audio from near-field recording.
autoGainControl	Off	Manual calibration of microphone gain level provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when dictation environments change (e.g., moving from a quiet to noisy environment). Recommend setting input gain with average loudness around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping.
noiseSuppression	Mild (-15dB)	Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.

Ambient Conversation

Setting	Recommendation	Rationale
echoCancellation	On	Suppresses echo from audio played by your device speaker (e.g., a remote call participant’s voice or system alert sounds).
autoGainControl	On	Adaptive correction of input gain to support varying loudness and speaking patterns of conversational audio.
noiseSuppression	Mild (-15dB)	Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.

Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.
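
To check a recording against these targets, RMS and peak levels can be measured directly from the samples. The sketch below assumes 16-bit little-endian signed PCM and references levels to digital full scale (amplitude 32768); some tools instead define a full-scale sine wave as 0 dBFS RMS, which reads about 3 dB higher. The function name is hypothetical.

```python
import math
import struct


def measure_dbfs(pcm16):
    """Return (rms_dbfs, peak_dbfs) for 16-bit little-endian signed PCM bytes."""
    count = len(pcm16) // 2
    if count == 0:
        raise ValueError("no samples to measure")
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])

    full_scale = 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / count) / full_scale
    peak = max(abs(s) for s in samples) / full_scale

    def to_db(ratio):
        # Digital silence has no finite level in decibels.
        return 20 * math.log10(ratio) if ratio > 0 else float("-inf")

    return to_db(rms), to_db(peak)
```

A signal measuring roughly -12 dB RMS and -3 dB peak with this helper sits at the levels recommended above, under the stated reference assumption.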

Channel Configuration

Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.

Audio type	Workflow	Rationale
Mono	Dictation or in-room doctor/patient conversation	Speech-to-text models expect a single coherent input source. Mono also reduces bandwidth and file size without affecting accuracy.
Multichannel (dual mono)	Telehealth or remote doctor/patient conversations	Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate speaker attribution. Provides better control over noise suppression and improves transcription accuracy when voices overlap.

Mono input supports transcription with diarization; however, speaker separation may be unreliable when the dialogue lacks clear turn-taking. Multichannel input (two audio channels, one per participant, as in a telehealth workflow) allows for improved speaker separation and labeling.
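
If each participant is captured as a separate mono PCM buffer, the two buffers need to be combined into one two-channel stream. The sketch below interleaves them sample by sample, which is the usual layout for multichannel PCM (as in WAV files); whether the API expects interleaved frames is an assumption here and should be confirmed before relying on it. The function name is hypothetical.

```python
def interleave_dual_mono(channel0, channel1, bytes_per_sample=2):
    """Interleave two equal-length mono PCM buffers into one two-channel buffer."""
    if len(channel0) != len(channel1):
        # Unequal lengths would shift one speaker in time relative to the other.
        raise ValueError("channels must be the same length to stay time-aligned")
    if len(channel0) % bytes_per_sample:
        raise ValueError("buffer length is not a whole number of samples")

    out = bytearray()
    for offset in range(0, len(channel0), bytes_per_sample):
        out += channel0[offset:offset + bytes_per_sample]   # channel 0 sample
        out += channel1[offset:offset + bytes_per_sample]   # channel 1 sample
    return bytes(out)
```

The length check reflects the requirement that channels stay aligned in time: pad the shorter buffer with silence rather than trimming either one independently.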

Streams Endpoint

Mono Audio Stream with Diarization

{
  "type": "config",
  "configuration": {
    "transcription": {
      "primaryLanguage": "en",
      "diarize": true,
      "isMultichannel": false,
      "participants": [
        {"channel": 0, "role": "multiple"}
      ]
    },
    "mode": {
      "type": "facts",
      "outputLocale": "en"
    }
  }
}

Multichannel Audio Stream with Participants

{
  "type": "config",
  "configuration": {
    "transcription": {
      "primaryLanguage": "en",
      "diarize": false,
      "isMultichannel": true,
      "participants": [
        {"channel": 0, "role": "doctor"},
        {"channel": 1, "role": "patient"}
      ]
    },
    "mode": {
      "type": "facts",
      "outputLocale": "en"
    }
  }
}

Diarization Disabled

{
  "type": "config",
  "configuration": {
    "transcription": {
      "primaryLanguage": "en",
      "diarize": false,
      "isMultichannel": false,
      "participants": []
    },
    "mode": {
      "type": "facts",
      "outputLocale": "en"
    }
  }
}
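
The three stream configurations above differ only in the diarize flag, the isMultichannel flag, and the participants list. As a sketch, they could be generated from one helper. The function name and its parameters are hypothetical; the field names and values are taken from the samples above, and deriving isMultichannel from the number of roles is an assumption that happens to match all three samples.

```python
import json


def stream_config(language, roles, diarize=False, output_locale=None):
    """Build a streams 'config' message.

    roles -- one role name per audio channel, e.g. ["doctor", "patient"];
             use ["multiple"] for diarized mono, or [] to omit participants.
    """
    participants = [
        {"channel": channel, "role": role} for channel, role in enumerate(roles)
    ]
    return {
        "type": "config",
        "configuration": {
            "transcription": {
                "primaryLanguage": language,
                "diarize": diarize,
                # Assumption: more than one role means one channel per role.
                "isMultichannel": len(roles) > 1,
                "participants": participants,
            },
            "mode": {
                "type": "facts",
                "outputLocale": output_locale or language,
            },
        },
    }


message = json.dumps(stream_config("en", ["doctor", "patient"]))
```

Here `stream_config("en", ["multiple"], diarize=True)` reproduces the mono sample, and `stream_config("en", [])` reproduces the sample with diarization disabled.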

Transcripts Endpoint

Mono Audio File with Diarization

{
  "recordingId": "uuid",
  "primaryLanguage": "en",
  "spokenPunctuation": true,
  "isMultichannel": false,
  "diarize": true,
  "participants": [
    {"channel": 0, "role": "multiple"}
  ]
}

Multichannel Audio File with Participants

{
  "recordingId": "uuid",
  "primaryLanguage": "en",
  "spokenPunctuation": true,
  "isMultichannel": true,
  "diarize": false,
  "participants": [
    {"channel": 0, "role": "doctor"},
    {"channel": 1, "role": "patient"}
  ]
}

Audio File Diarization Disabled

{
  "recordingId": "uuid",
  "primaryLanguage": "en",
  "spokenPunctuation": true,
  "isMultichannel": false,
  "diarize": false,
  "participants": []
}

Additional Notes

Enabling diarization is typically only required for mono audio.
Mono audio with diarization disabled will produce transcripts with one channel (-1), whereas diarized mono transcripts will have two channels (0, 1).
For multichannel audio, each channel should capture only one participant’s microphone feed. This avoids cross-talk or echo between channels and prevents duplicated transcript content.
Keep all channels aligned in time; do not trim or delay audio streams independently.
The recommended capture format is 16-bit / 16 kHz PCM.

Please contact us if you need more information about supported audio formats or are having issues processing an audio file.

Additional references and resources:

Endpoints

Features

Best Practices

Resources
