Submit audio or video files for transcription. Supports 100+ languages with speaker diarization, PII redaction, and advanced features.
POST /voice/transcription
curl -X POST https://api.case.dev/voice/transcription \
-H "Authorization: Bearer sk_case_your_api_key_here" \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://your-storage.com/deposition-audio.m4a",
"speaker_labels": true,
"language_code": "en"
}'
{
  "id": "5f9c5c3e-1234-5678-9abc-def012345678",
  "status": "queued",
  "audio_url": "https://your-storage.com/deposition-audio.m4a",
  "language_code": "en",
  "speaker_labels": true
}
Required:
audio_url (string): Publicly accessible URL to your audio/video file
Supported formats: M4A, MP3, MP4, WAV, FLAC, OGG, WebM, and more
Max file size: 5GB
Max duration: 7 hours

Optional:
language_code (string): Language for transcription (default: auto-detect)
en - English
es - Spanish
fr - French
de - German
pt - Portuguese
zh - Chinese
ja - Japanese
... 100+ languages supported

speaker_labels (boolean): Enable speaker diarization (default: false)
Identifies different speakers as "Speaker A", "Speaker B", etc. Perfect for depositions, interviews, and meetings

auto_highlights (boolean): Automatically detect key phrases (default: false)
Identifies important moments in the audio

content_safety_labels (boolean): Detect sensitive content (default: false)
Flags potentially sensitive topics

redact_pii (boolean): Redact personally identifiable information (default: false)
Removes names, addresses, SSNs, credit card numbers, etc. Essential for HIPAA compliance

redact_pii_policies (array): Specific PII types to redact
Options: name, address, email, phone_number, ssn, credit_card, date_of_birth, medical, bank_account

webhook_url (string): URL to receive a completion notification
Called when the transcription completes; recommended for long audio files

language_detection (boolean): Detect the language automatically (default: false)
Useful for multilingual audio

Retrieve transcription status and completed transcript.
GET /voice/transcription/:id
curl https://api.case.dev/voice/transcription/5f9c5c3e-1234-5678-9abc-def012345678 \
-H "Authorization: Bearer sk_case_your_api_key_here"
{
  "id": "5f9c5c3e-1234-5678-9abc-def012345678",
  "status": "processing",
  "audio_url": "https://your-storage.com/deposition-audio.m4a",
  "audio_duration": 3847.2
}
{
  "id": "5f9c5c3e-1234-5678-9abc-def012345678",
  "status": "completed",
  "audio_url": "https://your-storage.com/deposition-audio.m4a",
  "text": "Speaker A: Please state your name for the record. Speaker B: My name is Dr. Sarah Johnson...",
  "words": [
    {
      "text": "Please",
      "start": 100,
      "end": 350,
      "confidence": 0.99,
      "speaker": "A"
    },
    {
      "text": "state",
      "start": 400,
      "end": 650,
      "confidence": 0.98,
      "speaker": "A"
    }
  ],
  "utterances": [
    {
      "text": "Please state your name for the record.",
      "start": 100,
      "end": 2450,
      "confidence": 0.97,
      "speaker": "A"
    },
    {
      "text": "My name is Dr. Sarah Johnson.",
      "start": 3100,
      "end": 5200,
      "confidence": 0.96,
      "speaker": "B"
    }
  ],
  "audio_duration": 3847.2,
  "confidence": 0.95,
  "language_code": "en"
}
Status values:
queued - Job accepted, waiting to start
processing - Transcription in progress
completed - Finished successfully
error - Failed (check the error message)

Response fields:
text (string): Full transcript with speaker labels
words (array): Word-level timing and confidence
text - The word
start - Start time in milliseconds
end - End time in milliseconds
confidence - Accuracy score (0-1)
speaker - Speaker label if diarization is enabled
utterances (array): Sentence-level speaker turns; groups words into complete sentences per speaker
audio_duration (number): Duration in seconds
confidence (number): Overall transcription confidence
language_code (string): Detected or specified language
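As an illustration, the utterances array from a completed transcript can be rendered as a speaker-labeled script. This is a sketch using the field names from the sample response; the formatting itself is just one possible presentation, not part of the API:

```javascript
// Sketch: render `utterances` as a speaker-labeled script.
// Field names (text, start, speaker) come from the sample response;
// `start` is in milliseconds.
function formatUtterances(utterances) {
  return utterances
    .map(u => `[${(u.start / 1000).toFixed(1)}s] Speaker ${u.speaker}: ${u.text}`)
    .join('\n');
}
```

Applied to the two sample utterances, this yields "[0.1s] Speaker A: Please state your name for the record." followed by "[3.1s] Speaker B: My name is Dr. Sarah Johnson."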
curl -X POST https://api.case.dev/voice/transcription \
-H "Authorization: Bearer sk_case_..." \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://vault.s3.amazonaws.com/deposition-2024-1234.m4a",
"speaker_labels": true,
"language_code": "en",
"redact_pii": true,
"redact_pii_policies": ["name", "address", "ssn", "medical"],
"webhook_url": "https://your-app.com/transcription-complete"
}'
Perfect for:
Depositions with multiple speakers
Witness interviews
Client consultations
Court proceedings
curl -X POST https://api.case.dev/voice/transcription \
-H "Authorization: Bearer sk_case_..." \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://storage.com/doctor-notes.mp3",
"redact_pii": true,
"redact_pii_policies": ["name", "medical", "date_of_birth"],
"content_safety_labels": true
}'
HIPAA Compliant:
Automatically redacts PHI
Flags sensitive medical topics
Maintains compliance logs
curl -X POST https://api.case.dev/voice/transcription \
-H "Authorization: Bearer sk_case_..." \
-H "Content-Type: application/json" \
-d '{
"audio_url": "https://storage.com/spanish-interview.m4a",
"language_detection": true,
"speaker_labels": true
}'
Supports:
Automatic language detection
100+ languages including Spanish, Mandarin, Arabic
Speaker labels work across languages

Audio Duration    Typical Processing Time
5 minutes         30-60 seconds
30 minutes        2-4 minutes
1 hour            5-8 minutes
3 hours           15-25 minutes
Processing speed:
~0.15x realtime (1 hour of audio = ~9 minutes of processing)
Higher accuracy than real-time transcription
Use webhooks for files over 30 minutes

For shorter audio files without webhooks:
#!/bin/bash
TRANSCRIPT_ID="5f9c5c3e-1234-5678-9abc-def012345678"

while true; do
  RESPONSE=$(curl -s https://api.case.dev/voice/transcription/$TRANSCRIPT_ID \
    -H "Authorization: Bearer sk_case_...")
  STATUS=$(echo "$RESPONSE" | jq -r '.status')
  echo "Status: $STATUS"

  if [ "$STATUS" = "completed" ]; then
    echo "Transcription complete!"
    echo "$RESPONSE" | jq -r '.text' > transcript.txt
    break
  elif [ "$STATUS" = "error" ]; then
    echo "Transcription failed"
    break
  fi

  sleep 5
done
When transcription completes, we POST to your webhook_url:
{
  "transcript_id": "5f9c5c3e-1234-5678-9abc-def012345678",
  "status": "completed",
  "text": "Full transcript text...",
  "audio_duration": 1847.3,
  "confidence": 0.96
}
Webhook verification:
Requests include an X-Signature header with an HMAC-SHA256 signature
Verify it to confirm requests are from CaseMark

Per-minute pricing:
Voice transcription: $0.30 per minute ($18.00 per hour)

Example costs:
1-hour deposition: $18.00
3-hour medical interview: $54.00
30-minute client call: $9.00

No additional charges for:
Language detection
Multiple languages
Webhook delivery
Word-level timestamps

Supported audio formats:
M4A (recommended for iOS recordings)
MP3 (universal compatibility)
MP4 (video files - audio extracted)
WAV (uncompressed, highest quality)
FLAC (lossless compression)
OGG/Opus (web optimized)
WebM (browser recordings)
AMR (phone recordings)

Video formats supported:
MP4, MOV, AVI, MKV (audio extracted automatically)

Industry-leading accuracy:
95%+ for clear audio
90%+ for phone/courtroom recordings
Works with background noise, accents, and technical jargon

Advanced features:
Speaker diarization: Identify who said what
PII redaction: HIPAA/GDPR compliant
100+ languages: Auto-detect or specify
Custom vocabulary: Coming soon for legal terms
Paragraph formatting: Natural text structure
Timestamps: Word and sentence level

Transcribe audio files stored in vaults without downloading them. The transcription API accepts S3 URLs directly for seamless integration.
# Get vault object with audio file
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
OBJECT_ID="audio123"

# Get presigned download URL (valid for 1 hour)
DOWNLOAD_URL=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.downloadUrl')

# Submit for transcription
curl -X POST https://api.case.dev/voice/transcription \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"audio_url\": \"$DOWNLOAD_URL\",
    \"speaker_labels\": true,
    \"language_code\": \"en\"
  }"
For audio files that take longer to process, generate a presigned URL with extended expiry:
# Generate 24-hour presigned URL
PRESIGNED_RESPONSE=$(curl -s -X POST https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID/presigned-url \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d '{"operation": "GET", "expiresIn": 86400}')
AUDIO_URL=$(echo "$PRESIGNED_RESPONSE" | jq -r '.presignedUrl')

# Submit for transcription
curl -X POST https://api.case.dev/voice/transcription \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d "{\"audio_url\": \"$AUDIO_URL\", \"speaker_labels\": true}"
#!/bin/bash
API_KEY="sk_case_your_api_key_here"
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
AUDIO_FILE="deposition-recording.m4a"

# Step 1: Upload audio to vault
UPLOAD_RESPONSE=$(curl -s -X POST https://api.case.dev/vault/$VAULT_ID/upload \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"filename\": \"$AUDIO_FILE\",
    \"contentType\": \"audio/m4a\",
    \"metadata\": {
      \"case\": \"2024-CV-1234\",
      \"type\": \"deposition\",
      \"date\": \"2024-11-10\"
    }
  }")
OBJECT_ID=$(echo "$UPLOAD_RESPONSE" | jq -r '.objectId')
UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')

# Upload the file
curl -X PUT "$UPLOAD_URL" \
  -H "Content-Type: audio/m4a" \
  --data-binary "@$AUDIO_FILE"
echo "✓ Audio uploaded to vault: $OBJECT_ID"

# Step 2: Get download URL
DOWNLOAD_URL=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer $API_KEY" \
  | jq -r '.downloadUrl')

# Step 3: Submit for transcription
TRANSCRIPT_RESPONSE=$(curl -s -X POST https://api.case.dev/voice/transcription \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"audio_url\": \"$DOWNLOAD_URL\",
    \"speaker_labels\": true,
    \"language_code\": \"en\",
    \"redact_pii\": true
  }")
TRANSCRIPT_ID=$(echo "$TRANSCRIPT_RESPONSE" | jq -r '.id')
echo "✓ Transcription started: $TRANSCRIPT_ID"

# Step 4: Poll for completion
while true; do
  STATUS_RESPONSE=$(curl -s https://api.case.dev/voice/transcription/$TRANSCRIPT_ID \
    -H "Authorization: Bearer $API_KEY")
  STATUS=$(echo "$STATUS_RESPONSE" | jq -r '.status')
  echo "Transcription status: $STATUS"

  if [ "$STATUS" = "completed" ]; then
    echo "✓ Transcription complete!"

    # Save transcript
    echo "$STATUS_RESPONSE" | jq -r '.text' > transcript.txt

    # Step 5: Upload transcript back to vault
    TRANSCRIPT_UPLOAD=$(curl -s -X POST https://api.case.dev/vault/$VAULT_ID/upload \
      -H "Authorization: Bearer $API_KEY" \
      -H "Content-Type: application/json" \
      -d "{
        \"filename\": \"${AUDIO_FILE%.m4a}-transcript.txt\",
        \"contentType\": \"text/plain\",
        \"metadata\": {
          \"source_audio_id\": \"$OBJECT_ID\",
          \"transcript_id\": \"$TRANSCRIPT_ID\"
        }
      }")
    TRANSCRIPT_OBJECT_ID=$(echo "$TRANSCRIPT_UPLOAD" | jq -r '.objectId')
    TRANSCRIPT_UPLOAD_URL=$(echo "$TRANSCRIPT_UPLOAD" | jq -r '.uploadUrl')

    curl -X PUT "$TRANSCRIPT_UPLOAD_URL" \
      -H "Content-Type: text/plain" \
      --data-binary "@transcript.txt"
    echo "✓ Transcript uploaded to vault: $TRANSCRIPT_OBJECT_ID"
    break
  elif [ "$STATUS" = "error" ]; then
    echo "✗ Transcription failed"
    exit 1
  fi

  sleep 10
done

echo ""
echo "=== Complete! ==="
echo "Audio in vault: $OBJECT_ID"
echo "Transcript in vault: $TRANSCRIPT_OBJECT_ID"
echo "Transcript ID: $TRANSCRIPT_ID"
No Downloads Required
The transcription service accesses audio directly from S3
Eliminates local file handling

Secure
Presigned URLs expire automatically
Audio files stay encrypted in the vault

Integrated Workflow
Upload → Transcribe → Store, all in one platform
Keep audio and transcripts together

Cost Effective
Avoid S3 egress charges
Pay only for transcription time

Real-time speech-to-text for live audio streams via WebSocket. Get transcripts as you speak with ultra-low latency.
wss://casemark-ai--websocket-stream-helper-fastapi-app.modal.run/ws?token=sk_case_your_api_key_here
Ultra-Fast Transcription
300ms P50 latency on word emission
91% word accuracy rate
Intelligent endpointing for turn detection

Pricing
$0.30 per minute ($18.00 per hour)
Same rate as async transcription
Unlimited concurrent streams
No setup fees or minimums

Use Cases
Live deposition transcription with real-time captions
Phone call transcription as conversations happen
Court proceeding transcription with live display
Voice agent applications with immediate feedback

Connect via WebSocket with your API key in the query string:
const token = 'sk_case_your_api_key_here';
const ws = new WebSocket(`wss://casemark-ai--websocket-stream-helper-fastapi-app.modal.run/ws?token=${token}`);

ws.onopen = () => {
  console.log('Connected to streaming transcription');
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Transcript:', data);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = (event) => {
  console.log('Connection closed:', event.code, event.reason);
};
Pass your API key as a query parameter:
Parameter: token
Format: ?token=sk_case_your_api_key_here
Required: Yes

The WebSocket connection will be rejected if:
No token is provided
The token is invalid or expired
The API key doesn't have voice, transcription, or streaming permissions

Required format:
Encoding: PCM signed 16-bit little-endian
Sample rate: 16,000 Hz (16kHz)
Channels: Mono (1 channel)

Send raw audio bytes as binary WebSocket messages:
// Example: Send audio from the microphone
// Note: MediaRecorder typically emits compressed audio (e.g. WebM/Opus),
// so raw PCM must be produced separately -- see the AudioWorklet-based
// browser example below. convertToPCM16 here is a placeholder.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const mediaRecorder = new MediaRecorder(stream);
    const audioContext = new AudioContext({ sampleRate: 16000 });

    mediaRecorder.ondataavailable = async (event) => {
      const audioData = await event.data.arrayBuffer();

      // Convert to PCM 16-bit if needed (placeholder)
      const pcmData = convertToPCM16(audioData);

      // Send to WebSocket
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(pcmData);
      }
    };

    mediaRecorder.start(100); // Emit data every 100ms
  });
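The convertToPCM16 call above is left undefined; one plausible implementation, assuming you have Float32 samples in [-1, 1] (the native Web Audio sample format) rather than MediaRecorder's compressed output:

```javascript
// Sketch: convert Float32 samples in [-1, 1] to 16-bit signed
// little-endian PCM, the format the streaming endpoint requires.
function convertToPCM16(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp
    // Scale negatives by 0x8000 and positives by 0x7FFF to use the
    // full asymmetric int16 range; write little-endian (final `true`)
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```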
Recommended: 100-250ms chunks
Minimum: 50ms
Maximum: 1000ms

Smaller chunks = lower latency, but more overhead.
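Chunk duration maps directly to bytes at the required format (16,000 samples/s × 2 bytes per sample, mono); a small helper (the name is ours, not part of the API):

```javascript
// Sketch: bytes per chunk at 16 kHz, 16-bit (2-byte) mono PCM.
function chunkBytes(durationMs) {
  return Math.round(16000 * 2 * (durationMs / 1000));
}
```

A 100 ms chunk is 3200 bytes, which matches the chunk size used in the file-streaming Node.js example.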
The WebSocket will send JSON messages with different types:
{
  "type": "session_begins",
  "session_id": "abc123",
  "message": "Streaming session started"
}
{
  "message_type": "PartialTranscript",
  "text": "Hello, this is a test",
  "audio_start": 0,
  "audio_end": 2000,
  "confidence": 0.95,
  "words": [
    {
      "text": "Hello",
      "start": 0,
      "end": 400,
      "confidence": 0.98
    },
    {
      "text": "this",
      "start": 400,
      "end": 600,
      "confidence": 0.97
    }
  ]
}
Partial transcripts are interim results that may change as more audio is processed.
{
  "message_type": "FinalTranscript",
  "text": "Hello, this is a test.",
  "audio_start": 0,
  "audio_end": 2500,
  "confidence": 0.96,
  "punctuated": true,
  "words": [
    {
      "text": "Hello",
      "start": 0,
      "end": 400,
      "confidence": 0.98
    },
    {
      "text": "this",
      "start": 400,
      "end": 600,
      "confidence": 0.97
    },
    {
      "text": "is",
      "start": 600,
      "end": 750,
      "confidence": 0.96
    },
    {
      "text": "a",
      "start": 750,
      "end": 850,
      "confidence": 0.95
    },
    {
      "text": "test",
      "start": 850,
      "end": 2500,
      "confidence": 0.97
    }
  ]
}
Final transcripts are immutable and won't change. Use these for official records.
{
  "message_type": "SessionTerminated"
}
Sent when the session ends (either by client or server).
Send a JSON message to end the session cleanly:
ws.send(JSON.stringify({ terminate: true }));

// Wait a moment for final transcripts
setTimeout(() => {
  ws.close();
}, 1000);
Sessions automatically end after:
5 minutes of silence (no audio data received)
Connection errors
Client disconnect
const WebSocket = require('ws');
const fs = require('fs');

const token = 'sk_case_your_api_key_here';
const ws = new WebSocket(`wss://casemark-ai--websocket-stream-helper-fastapi-app.modal.run/ws?token=${token}`);

ws.on('open', () => {
  console.log('✓ Connected to streaming transcription');

  // Read audio file (16kHz, PCM 16-bit, mono)
  const audioFile = fs.readFileSync('./audio.raw');

  // Send in chunks (100ms at 16kHz = 3200 bytes)
  const chunkSize = 3200;
  let offset = 0;

  const sendChunk = () => {
    if (offset < audioFile.length) {
      const chunk = audioFile.slice(offset, offset + chunkSize);
      ws.send(chunk);
      offset += chunkSize;
      setTimeout(sendChunk, 100); // Send every 100ms
    } else {
      // End of audio
      console.log('✓ Audio sent, waiting for final transcripts...');
      ws.send(JSON.stringify({ terminate: true }));
      setTimeout(() => ws.close(), 2000);
    }
  };

  sendChunk();
});

ws.on('message', (data) => {
  const message = JSON.parse(data.toString());

  if (message.message_type === 'FinalTranscript') {
    console.log('Final:', message.text);
  } else if (message.message_type === 'PartialTranscript') {
    console.log('Partial:', message.text);
  } else {
    console.log('Message:', message);
  }
});

ws.on('error', (error) => {
  console.error('WebSocket error:', error);
});

ws.on('close', (code, reason) => {
  console.log(`Connection closed: ${code} ${reason}`);
});
// Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 16000,
    channelCount: 1,
    echoCancellation: true,
    noiseSuppression: true,
  }
});

const token = 'sk_case_your_api_key_here';
const ws = new WebSocket(`wss://casemark-ai--websocket-stream-helper-fastapi-app.modal.run/ws?token=${token}`);

// Set up AudioWorklet for PCM processing
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);

await audioContext.audioWorklet.addModule('/audio-processor.js');
const processor = new AudioWorkletNode(audioContext, 'pcm-processor');

processor.port.onmessage = (event) => {
  // Send PCM data to WebSocket
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(event.data);
  }
};

source.connect(processor);
processor.connect(audioContext.destination);

// Handle transcripts
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.message_type === 'FinalTranscript') {
    document.getElementById('transcript').textContent += data.text + ' ';
  } else if (data.message_type === 'PartialTranscript') {
    document.getElementById('interim').textContent = data.text;
  }
};

// Stop recording
document.getElementById('stop').onclick = () => {
  ws.send(JSON.stringify({ terminate: true }));
  stream.getTracks().forEach(track => track.stop());
  audioContext.close();
  ws.close();
};
The WebSocket may send error messages:
{
  "error": "Authentication failed",
  "code": "AUTH_INVALID"
}
Common error codes:
AUTH_INVALID - Invalid or missing API key
PERMISSION_DENIED - API key lacks streaming permissions
SERVICE_UNAVAILABLE - Streaming service temporarily down
UPSTREAM_ERROR - AssemblyAI service error
PROCESSING_ERROR - Failed to process audio data
SESSION_NOT_FOUND - No active session

WebSocket close codes indicate why the connection ended:

Code  Reason               Description
1000  Normal closure       Session ended normally
1008  Policy violation     Authentication failed or insufficient permissions
1011  Internal error       Server error (temporary)
4000  Bad audio format     Audio doesn't meet requirements
4001  Rate limit exceeded  Too many concurrent connections
Optimize for accuracy:
Use noise cancellation when capturing microphone input
Minimize background noise
Use high-quality microphones for depositions
Test with your specific audio setup

Minimize end-to-end latency:
Send smaller chunks (100ms) for real-time display
Use a wired internet connection (not WiFi when possible)
Host your application close to your users
Process partial transcripts for immediate feedback

Handle transient failures:
let reconnectAttempts = 0;
const maxReconnects = 3;

function connect() {
  const ws = new WebSocket(`wss://casemark-ai--websocket-stream-helper-fastapi-app.modal.run/ws?token=${token}`);

  ws.onclose = (event) => {
    if (event.code !== 1000 && reconnectAttempts < maxReconnects) {
      reconnectAttempts++;
      console.log(`Reconnecting... (${reconnectAttempts}/${maxReconnects})`);
      setTimeout(connect, 1000 * reconnectAttempts);
    }
  };

  ws.onopen = () => {
    reconnectAttempts = 0; // Reset on successful connection
  };

  return ws;
}
Monitor your usage:
Track connection duration to estimate costs
Implement automatic disconnection after inactivity
Set session time limits for budget control
Monitor concurrent connections
const startTime = Date.now();

ws.onclose = () => {
  const durationSeconds = (Date.now() - startTime) / 1000;
  const cost = (durationSeconds / 60) * 0.30; // $0.30 per minute
  console.log(`Session duration: ${durationSeconds}s, Est. cost: $${cost.toFixed(2)}`);
};
Feature          Async Transcription   Streaming Transcription
Latency          Minutes               300ms
Protocol         HTTP REST             WebSocket
Use Case         Pre-recorded files    Live audio
Pricing          $0.30/minute          $0.30/minute
Input            Audio URL             Raw audio stream
Output           Complete transcript   Progressive transcripts
Speaker Labels   ✓ Yes                 Coming soon
Auto Highlights  ✓ Yes                 ✗ No
Content Safety   ✓ Yes                 ✗ No
When to use async:
Transcribing pre-recorded depositions
Batch processing of multiple files
Need for speaker diarization or advanced features

When to use streaming:
Live courtroom transcription
Real-time phone call transcription
Voice assistant applications
Interactive voice agents