Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows. The API supports containerized audio formats (such as Ogg and WebM) as well as raw PCM audio streams.
Supported Audio Formats
Container-based Audio
The following audio containers and their associated codecs are supported by the Corti API:

| Container | Supported Encodings | Comments |
|---|---|---|
| Ogg | Opus, Vorbis | Excellent quality at low bandwidth |
| WebM | Opus, Vorbis | Excellent quality at low bandwidth |
| MP4/M4A | AAC, MP3 | Compression may degrade transcription quality |
| MP3 | MP3 | Compression may degrade transcription quality |
Allowable MIME types for streamed audio
This parameter is optional but recommended.
The audioFormat parameter can be defined in the transcribe and streams configuration to declare the audio format that the speech-to-text system should expect in the incoming audio stream.

| Format | Accepted MIME types |
|---|---|
| Ogg | audio/ogg |
| WebM | audio/webm |
| Opus | audio/opus |
| Vorbis | audio/vorbis |
| MP3 | audio/mpeg, audio/mp3, audio/mpeg3 |
| FLAC | audio/flac |
| M4A / AAC | audio/mp4, audio/m4a |
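
As a minimal sketch, the table above can be turned into a client-side lookup that checks a MIME type before it is sent as audioFormat. The helper name `format_for_mime` and the normalization rules (case-insensitive, parameters after `;` ignored) are assumptions made for illustration; they are not part of the Corti API.

```python
from typing import Optional

# Accepted MIME types per format, copied from the table above.
ACCEPTED_MIME_TYPES = {
    "Ogg": {"audio/ogg"},
    "WebM": {"audio/webm"},
    "Opus": {"audio/opus"},
    "Vorbis": {"audio/vorbis"},
    "MP3": {"audio/mpeg", "audio/mp3", "audio/mpeg3"},
    "FLAC": {"audio/flac"},
    "M4A / AAC": {"audio/mp4", "audio/m4a"},
}


def format_for_mime(mime_type: str) -> Optional[str]:
    """Return the table's format name for a MIME type, or None if it is not listed."""
    # Hypothetical normalization: compare only the base type, ignoring case,
    # surrounding whitespace, and any ";"-separated parameters.
    base = mime_type.split(";", 1)[0].strip().lower()
    for name, mimes in ACCEPTED_MIME_TYPES.items():
        if base in mimes:
            return name
    return None
```

A caller could reject or re-encode audio when `format_for_mime` returns None instead of sending an undeclared format.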
For container formats (audio/ogg, audio/webm), you can optionally specify a codec parameter. Allowed codecs are opus and vorbis.
WAV files are supported for upload to the /recordings endpoint, but raw PCM audio should follow the approach outlined below.
Raw Audio
Raw pulse code modulation (PCM) audio is supported when the rate, channels, and bits parameters are defined in the configuration.
Allowable MIME types for raw audio configuration
This parameter is required for use with raw PCM audio.
The audioFormat parameter can be defined in the transcribe and streams configuration to declare the audio format that the speech-to-text system should expect in the incoming audio stream.

| Format | Accepted MIME types |
|---|---|
| Raw PCM | audio/pcm |
When using raw PCM (audio/pcm), the parameters rate, channels, and bits must be defined.

| Parameter | Type | Possible Values |
|---|---|---|
| rate | int | 8000-48000 |
| channels | int | 1-2 |
| bits | int | 8, 16, 24, or 32 |
When using raw PCM audio, 16-bit little-endian mono at 16 kHz is recommended.
Note that only signed, little-endian (LE) audio is supported. Use of big-endian (BE) audio will result in corrupted transcripts.
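To make the byte layout concrete, the following sketch converts floating-point samples to signed 16-bit little-endian PCM using Python's standard `struct` module. The parameter names audioFormat, rate, channels, and bits come from the text above; the shape of `RAW_PCM_CONFIG` and the helper `floats_to_pcm16le` are assumptions for illustration, not the documented request schema.

```python
import struct
from typing import Iterable

# Illustrative configuration fragment. The keys are the parameters named in
# this section; how they nest inside a real request is an assumption.
RAW_PCM_CONFIG = {
    "audioFormat": "audio/pcm",
    "rate": 16000,   # 16 kHz, the recommended sample rate
    "channels": 1,   # mono
    "bits": 16,      # 16-bit samples
}


def floats_to_pcm16le(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    out = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clamp out-of-range input
        value = int(round(clipped * 32767))    # scale to the int16 range
        out += struct.pack("<h", value)        # "<h": little-endian signed 16-bit
    return bytes(out)
```

Using `">h"` instead of `"<h"` would produce big-endian output, which is the byte order the note above warns against.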
Audio streaming recommendations
Sample rate of 16 kHz
Captures the full range of human speech frequencies. Higher rates offer negligible recognition benefit but increase computational cost.
Audio chunk size of 250 milliseconds
Optimal cadence to support both dictation and AI scribing workflows. Sending much smaller chunks more frequently can degrade recognition accuracy without improving latency.
Stream at real-time speed
Audio should be streamed at or near real-time speed. Streaming audio faster than real time is not recommended and may cause buffering issues, degraded results, or stream termination. Pace audio chunks according to their actual audio duration.
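The recommendations above translate into simple arithmetic: at 16 kHz, 16-bit mono, a 250 ms chunk is 16000 × 0.25 × 2 = 8000 bytes. The sketch below computes the chunk size for raw PCM and yields chunks at roughly real-time pace. The function names and the injectable `sleep` argument are hypothetical, and the actual send call is left to the caller.

```python
import time
from typing import Callable, Iterator


def chunk_size_bytes(rate: int, channels: int, bits: int, chunk_ms: int = 250) -> int:
    """Number of bytes of raw PCM that hold chunk_ms milliseconds of audio."""
    bytes_per_frame = channels * (bits // 8)
    frames = rate * chunk_ms // 1000
    return frames * bytes_per_frame  # always a whole number of frames


def paced_chunks(
    pcm: bytes,
    rate: int,
    channels: int,
    bits: int,
    chunk_ms: int = 250,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield chunks of pcm, pausing after each one for its audio duration."""
    size = chunk_size_bytes(rate, channels, bits, chunk_ms)
    bytes_per_second = rate * channels * (bits // 8)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        yield chunk
        # Runs when the consumer asks for the next chunk; the final chunk
        # may be shorter, so the pause is based on its actual length.
        sleep(len(chunk) / bytes_per_second)
```

This simple pacing ignores the time spent sending each chunk, so a long-running stream may want to schedule against a monotonic clock instead.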
Microphone Configuration
Dictation
| Setting | Recommendation | Rationale |
|---|---|---|
| echoCancellation | Off | Ensure clear, unfiltered audio from near-field recording. |
| autoGainControl | Off | Manual calibration of microphone gain provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when the dictation environment changes (e.g., moving from a quiet to a noisy environment). Set input gain so that average loudness is around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping. |
| noiseSuppression | Mild (-15 dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
Ambient Conversation
| Setting | Recommendation | Rationale |
|---|---|---|
| echoCancellation | On | Suppresses echo from audio played by your device speaker, e.g., a remote call participant’s voice or system alert sounds. |
| autoGainControl | On | Adaptive correction of input gain to support varying loudness and speaking patterns of conversational audio. |
| noiseSuppression | Mild (-15 dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.
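One way to check a recording against the loudness targets above is to measure its RMS and peak levels in dBFS. The sketch below does this for signed 16-bit little-endian mono PCM. It assumes levels are referenced to the maximum sample amplitude (so a full-scale sine reads about –3 dBFS RMS); the source does not say which dBFS convention applies, and the helper name `levels_dbfs` is hypothetical.

```python
import math
import struct
from typing import Tuple


def levels_dbfs(pcm16le: bytes) -> Tuple[float, float]:
    """Return (rms_dbfs, peak_dbfs) for signed 16-bit little-endian mono PCM."""
    count = len(pcm16le) // 2
    if count == 0:
        raise ValueError("no samples to measure")
    samples = struct.unpack(f"<{count}h", pcm16le[: count * 2])

    full_scale = 32768.0  # assumed reference: maximum int16 magnitude
    rms = math.sqrt(sum(s * s for s in samples) / count) / full_scale
    peak = max(abs(s) for s in samples) / full_scale

    def to_db(ratio: float) -> float:
        # Digital silence has no finite level in decibels.
        return 20 * math.log10(ratio) if ratio > 0 else float("-inf")

    return to_db(rms), to_db(peak)
```

With this convention, a reading near (–12, –3) on typical speech would match the targets stated above.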
Channel Configuration
Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.

| Audio type | Workflow | Rationale |
|---|---|---|
| Mono | Dictation or in-room doctor/patient conversation | Speech-to-text models expect a single coherent input source. Mono also reduces bandwidth and file size without affecting accuracy. |
| Multichannel (dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate speaker attribution. Provides better control over noise suppression and improves transcription accuracy when voices overlap. |
Streams Endpoint
Mono Audio Stream with Diarization
Multichannel Audio Stream with Participants
Diarization Disabled
Transcripts Endpoint
Mono Audio File with Diarization
Multichannel Audio File with Participants
Audio File Diarization Disabled
Additional Notes
- Enabling diarization is typically only required on mono audio.
- Mono audio with diarization disabled will produce transcripts with one channel (-1), whereas diarized-mono transcripts will have two channels (0, 1).
- For multichannel audio, each channel should capture only one participant’s microphone feed. This avoids cross-talk or echo between channels and duplicated transcript content.
- Keep all channels aligned in time; do not trim or delay audio streams independently.
- The recommended capture format is 16-bit, 16 kHz PCM.
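As an illustration of the dual mono notes above, the sketch below combines two time-aligned mono sample sequences into a single two-channel 16-bit little-endian buffer. It assumes the conventional interleaved PCM layout (channel 0 sample, then channel 1 sample, per frame); the source does not state the layout the API expects, and the helper name `interleave_dual_mono` is hypothetical.

```python
import struct
from typing import Sequence


def interleave_dual_mono(ch0: Sequence[int], ch1: Sequence[int]) -> bytes:
    """Interleave two mono int16 sample sequences into 2-channel 16-bit LE PCM."""
    frames = max(len(ch0), len(ch1))
    out = bytearray()
    for i in range(frames):
        # Pad the shorter channel with silence at the end only, so both
        # channels stay aligned from the first sample onward.
        left = ch0[i] if i < len(ch0) else 0
        right = ch1[i] if i < len(ch1) else 0
        out += struct.pack("<hh", left, right)
    return bytes(out)
```

Both inputs are expected to start at the same instant; trimming leading silence from one of them before interleaving would break the alignment the notes call for.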
Please contact us if you need more information about supported audio formats or are having issues processing an audio file.