Submit a document for OCR processing. The API extracts text, preserves layout, and can generate searchable PDFs.
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://your-storage.com/scanned-deposition.pdf",
    "document_id": "case-2024-1234-depo",
    "org_id": "your-org-id",
    "engine": "doctr",
    "features": {
      "embed": {}
    }
  }'
{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "pending",
  "document_id": "case-2024-1234-depo",
  "org_id": "your-org-id",
  "document_url": "https://your-storage.com/scanned-deposition.pdf",
  "engine": "doctr",
  "features": {},
  "page_count": 0,
  "chunk_count": 0,
  "chunks_completed": 0,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "links": {
    "self": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367",
    "original": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/original",
    "json": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/json",
    "text": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/text",
    "chunks": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/chunks"
  }
}
Required:
document_url (string): Publicly accessible URL to your document
  Supports: PDF, PNG, JPG, TIFF, and more
  Max file size: 500MB
  Can be S3 URL (we'll generate presigned URL automatically)
Optional:
document_id (string): Your internal reference ID
org_id (string): Your organization ID (auto-detected from API key if not provided)
engine (string): OCR engine to use (default: doctr)
  doctr - Fast, good for printed text
  tesseract - Better for handwriting
  paddle - Specialized for tables and forms
callback_url (string): Webhook for completion notification
features (object): Additional processing options
  embed: Generate searchable PDF with text layer
  tables: Extract tables as structured data
  forms: Detect and extract form fields
result_bucket (string): S3 bucket to store results
result_prefix (string): S3 key prefix for results
If your document is in S3, you can use an s3:// URL and we'll handle presigning:
{
  "document_url": "s3://your-bucket/documents/deposition-scan.pdf",
  "document_id": "depo-2024-1234"
}
We automatically generate a presigned URL valid for 24 hours (OCR can take a while for large documents).
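The request body described above can also be assembled in code. The following is a minimal Python sketch, not an official client: `build_process_request` is a hypothetical helper name, and the engine names and JSON fields are the ones listed on this page. Rejecting anything other than http(s):// and s3:// URLs is an assumption of the sketch. It only builds the request; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

API_BASE = "https://api.case.dev/ocr/v1"
ENGINES = {"doctr", "tesseract", "paddle"}  # engines listed on this page


def build_process_request(api_key, document_url, document_id=None,
                          engine="doctr", features=None):
    """Build (but do not send) a POST request for /ocr/v1/process."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    # Assumption: only http(s):// and s3:// URLs are meaningful here.
    if not document_url.startswith(("https://", "http://", "s3://")):
        raise ValueError("document_url must be an http(s):// or s3:// URL")

    payload = {"document_url": document_url, "engine": engine}
    if document_id is not None:
        payload["document_id"] = document_id
    if features:
        payload["features"] = features

    return urllib.request.Request(
        f"{API_BASE}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_process_request(
    "sk_case_your_api_key_here",
    "s3://your-bucket/documents/deposition-scan.pdf",
    document_id="depo-2024-1234",
    features={"embed": {}},
)
# To submit the job: urllib.request.urlopen(req)
```

Validating the engine name locally catches typos before a request is made; the API itself remains the authority on which values it accepts.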
Get the current status and results of your OCR job.
GET /ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
-H "Authorization: Bearer sk_case_your_api_key_here"
Response while the job is processing:
{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "processing",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 23,
  "chunks_processing": 15,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:35:47Z"
}
Response once the job has completed:
{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "completed",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 50,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "text": "Full extracted text from all 245 pages...",
  "confidence": 0.96,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:48:23Z",
  "processing_time_ms": 1091000,
  "links": {
    "original": "https://vision-api.com/results/original.pdf",
    "searchable_pdf": "https://vision-api.com/results/searchable.pdf",
    "json": "https://vision-api.com/results/data.json",
    "text": "https://vision-api.com/results/text.txt"
  }
}
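Progress can be derived from the chunk counters in a status response. The sketch below is illustrative only: `summarize_job` is a hypothetical helper, and treating `completed`, `error`, and `failed` as terminal states is an assumption based on the status values this page documents. The sample input is a trimmed copy of the in-progress response.

```python
import json

# Assumption: these statuses mean the job will not change any further.
TERMINAL_STATUSES = {"completed", "error", "failed"}


def summarize_job(job):
    """Return (status, fraction_of_chunks_done, is_terminal)."""
    total = job.get("chunk_count", 0)
    done = job.get("chunks_completed", 0)
    # A freshly queued job reports chunk_count 0, so avoid dividing by zero.
    fraction = done / total if total else 0.0
    status = job["status"]
    return status, fraction, status in TERMINAL_STATUSES


sample = json.loads("""
{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "processing",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 23,
  "chunks_processing": 15,
  "chunks_failed": 0
}
""")

status, fraction, is_terminal = summarize_job(sample)
print(f"{status}: {fraction:.0%} of chunks finished, terminal={is_terminal}")
```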
Status:
pending: Job queued, not started yet
processing: OCR in progress
completed: Successfully finished
error: Failed (check error message)
failed: Failed processing
Progress:
page_count: Total pages in document
chunk_count: Number of chunks the document was split into for parallel processing
chunks_completed: Chunks finished
chunks_processing: Chunks currently being processed
chunks_failed: Chunks that failed (indicates quality issues)
Results:
text: Full extracted text (only when completed)
confidence: Overall accuracy (0-1, higher is better)
processing_time_ms: How long OCR took, in milliseconds
Output Files:
links.original: Original uploaded document
links.searchable_pdf: PDF with embedded text layer (searchable)
links.json: Structured JSON with page/word coordinates
links.text: Plain text extraction
links.chunks: Individual chunk results

Document Type          Pages   Typical Time
Simple typed doc       10      30 seconds
Scanned deposition     100     3-5 minutes
Large discovery file   500     15-20 minutes
Mixed quality scan     250     8-12 minutes

Factors affecting speed:
Page count (linear scaling)
Image quality (low quality = slower)
Complexity (tables/forms = slower)
Handwriting (much slower than print)

We provide direct download endpoints for all OCR result types:
GET /ocr/v1/:id/download/text - Plain text extraction
GET /ocr/v1/:id/download/json - Structured OCR data with coordinates
GET /ocr/v1/:id/download/pdf - Searchable PDF with text layer
GET /ocr/v1/:id/download/original - Original uploaded document
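The four endpoints above share one URL pattern, so a small helper can build any of them. This is a sketch under stated assumptions: `download_url` and `save_result` are hypothetical names, and only the paths and the Bearer header are taken from this page.

```python
import shutil
import urllib.request

API_BASE = "https://api.case.dev/ocr/v1"
DOWNLOAD_KINDS = ("text", "json", "pdf", "original")  # from the list above


def download_url(job_id, kind):
    """Return the direct download URL for one result type of a job."""
    if kind not in DOWNLOAD_KINDS:
        raise ValueError(f"kind must be one of {DOWNLOAD_KINDS}, got {kind!r}")
    return f"{API_BASE}/{job_id}/download/{kind}"


def save_result(api_key, job_id, kind, path):
    """Stream one result type to a local file (performs a network call)."""
    req = urllib.request.Request(
        download_url(job_id, kind),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks; suits large PDFs


# Example (not executed here):
# save_result("sk_case_your_api_key_here",
#             "1f4a195e-026b-41ff-b367-c61089f5f367", "pdf", "searchable.pdf")
```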
GET /ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/text
# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/text \
-H "Authorization: Bearer sk_case_..." \
-o extracted-text.txt
GET /ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/pdf
# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/pdf \
-H "Authorization: Bearer sk_case_..." \
-o searchable-deposition.pdf
GET /ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/json
# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/json \
-H "Authorization: Bearer sk_case_..." \
-o ocr-data.json
The JSON download includes:
Word-level bounding boxes
Confidence scores per word
Page-level layout information
Table structures (if extracted)
GET /ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/original
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/original \
-H "Authorization: Bearer sk_case_..." \
-o original-document.pdf
You can also extract the text directly from the status endpoint (for text only):
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
-H "Authorization: Bearer sk_case_..." \
| jq -r '.text' > extracted-text.txt
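One caveat with the `jq` pipeline above: if the job has not completed, the response has no `text` field and `jq -r '.text'` writes the literal string `null` to the file. A hedged Python equivalent that refuses to write in that case might look like this; `save_text_from_status` is a hypothetical helper that takes the raw response body as a string.

```python
import json
from pathlib import Path


def save_text_from_status(status_body, path):
    """Write the `text` field of a completed status response to `path`.

    Returns the number of characters written. Raises ValueError when the
    job is not completed, since `text` is only present on completion.
    """
    job = json.loads(status_body)
    if job.get("status") != "completed" or "text" not in job:
        raise ValueError(f"no text available (status={job.get('status')!r})")
    Path(path).write_text(job["text"], encoding="utf-8")
    return len(job["text"])


# Example (not executed here): pass the body returned by
# GET /ocr/v1/<job id>, e.g. save_text_from_status(body, "extracted-text.txt")
```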
Per page: ~$0.01-0.03 depending on complexity
Typical deposition (150 pages): ~$2-4
Medical record (500 pages): ~$8-15

10 pages: ~30 seconds
50 pages: ~2 minutes
200 pages: ~8 minutes
500 pages: ~18 minutes
Speed varies by:
Image quality (low quality = slower)
Layout complexity (tables/forms = slower)
Engine choice (doctr fastest, paddle slowest but best)

Process vault documents with OCR for text extraction without downloading. The OCR API accepts S3 URLs directly, making vault integration seamless.
You can submit vault documents for OCR using the s3:// URL format - the router automatically generates presigned URLs:
# Get vault object to find S3 bucket and key
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
OBJECT_ID="i5ar122d3h11a1802a3mogob"
OBJECT_INFO=$(curl -s "https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID" \
  -H "Authorization: Bearer sk_case_your_api_key_here")
FILENAME=$(echo "$OBJECT_INFO" | jq -r '.filename')
# Get vault info for bucket name
VAULT_INFO=$(curl -s "https://api.case.dev/vault/$VAULT_ID" \
  -H "Authorization: Bearer sk_case_your_api_key_here")
FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')
# Submit for OCR using s3:// URL
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"s3://$FILES_BUCKET/objects/$OBJECT_ID/$FILENAME\",
    \"document_id\": \"vault-$OBJECT_ID\",
    \"engine\": \"doctr\"
  }"
The router automatically generates a 24-hour presigned URL - perfect for large documents that take time to process.
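The s3:// URL used above follows the pattern `s3://<filesBucket>/objects/<object id>/<filename>`. As a rough sketch, the same URL can be built in Python from the two vault responses. `vault_s3_url` is a hypothetical helper and the bucket name in the example is a placeholder; the `filesBucket` and `filename` fields are the ones the shell example reads with `jq`.

```python
def vault_s3_url(vault_info, object_info, object_id):
    """Build the s3:// URL for a vault object from the two API responses.

    vault_info:  parsed JSON from GET /vault/<vault id>
    object_info: parsed JSON from GET /vault/<vault id>/objects/<object id>
    """
    bucket = vault_info.get("filesBucket")
    filename = object_info.get("filename")
    # Fail early rather than submit a URL such as s3://None/objects/...
    if not bucket:
        raise ValueError("vault response has no filesBucket")
    if not filename:
        raise ValueError("object response has no filename")
    return f"s3://{bucket}/objects/{object_id}/{filename}"


url = vault_s3_url(
    {"filesBucket": "example-vault-files"},      # placeholder bucket name
    {"filename": "scanned-deposition.pdf"},
    "i5ar122d3h11a1802a3mogob",
)
print(url)
```

The explicit checks matter because a missing field would otherwise produce a syntactically valid but wrong URL, and the failure would only surface later as an OCR job error.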
Alternatively, use the vault's download URL directly:
# Get download URL from vault (expires in 1 hour)
DOWNLOAD_URL=$(curl -s "https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID" \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.downloadUrl')
# Submit to OCR
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$DOWNLOAD_URL\",
    \"document_id\": \"vault-doc-001\"
  }"
Here's a production-ready end-to-end workflow:
#!/bin/bash
set -e
API_KEY="sk_case_your_api_key_here"
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
LOCAL_FILE="scanned-deposition.pdf"
echo "=== Step 1: Upload to Vault ==="
UPLOAD_RESPONSE=$(curl -s -X POST "https://api.case.dev/vault/$VAULT_ID/upload" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"filename\": \"$(basename "$LOCAL_FILE")\",
    \"contentType\": \"application/pdf\",
    \"metadata\": {
      \"case\": \"2024-CV-1234\",
      \"type\": \"deposition\",
      \"witness\": \"Dr. Sarah Johnson\"
    }
  }")
OBJECT_ID=$(echo "$UPLOAD_RESPONSE" | jq -r '.objectId')
UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')
echo "Object ID: $OBJECT_ID"
# Upload the file
curl -s -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/pdf" \
  --data-binary "@$LOCAL_FILE"
echo "✓ Document uploaded to vault"
echo ""
echo "=== Step 2: Get Vault Info for S3 Bucket ==="
VAULT_INFO=$(curl -s "https://api.case.dev/vault/$VAULT_ID" \
  -H "Authorization: Bearer $API_KEY")
FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')
echo "Bucket: $FILES_BUCKET"
echo ""
echo "=== Step 3: Submit for OCR ==="
S3_URL="s3://$FILES_BUCKET/objects/$OBJECT_ID/$(basename "$LOCAL_FILE")"
OCR_RESPONSE=$(curl -s -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$S3_URL\",
    \"document_id\": \"depo-$OBJECT_ID\",
    \"engine\": \"doctr\",
    \"features\": {\"embed\": {}}
  }")
OCR_JOB_ID=$(echo "$OCR_RESPONSE" | jq -r '.id')
echo "✓ OCR job submitted: $OCR_JOB_ID"
echo ""
echo "=== Step 4: Wait for OCR Completion ==="
while true; do
  OCR_STATUS_RESPONSE=$(curl -s "https://api.case.dev/ocr/v1/$OCR_JOB_ID" \
    -H "Authorization: Bearer $API_KEY")
  STATUS=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.status')
  PAGE_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.page_count')
  CHUNKS_COMPLETED=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunks_completed')
  CHUNK_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunk_count')
  echo "Status: $STATUS | Pages: $PAGE_COUNT | Chunks: $CHUNKS_COMPLETED/$CHUNK_COUNT"
  if [ "$STATUS" = "completed" ]; then
    echo "✓ OCR completed!"
    break
  # Both "failed" and "error" are documented failure statuses
  elif [ "$STATUS" = "failed" ] || [ "$STATUS" = "error" ]; then
    echo "✗ OCR failed (status: $STATUS)"
    exit 1
  fi
  sleep 5
done
echo ""
echo "=== Step 5: Download OCR Results ==="
# Download extracted text
curl -s "https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/text" \
  -H "Authorization: Bearer $API_KEY" \
  -o extracted-text.txt
echo "✓ Text saved to extracted-text.txt"
# Download searchable PDF
curl -s "https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/pdf" \
  -H "Authorization: Bearer $API_KEY" \
  -o searchable.pdf
echo "✓ Searchable PDF saved to searchable.pdf"
echo ""
echo "=== Step 6: Trigger Vault Ingestion for Semantic Search ==="
curl -s -X POST "https://api.case.dev/vault/$VAULT_ID/ingest/$OBJECT_ID" \
  -H "Authorization: Bearer $API_KEY" > /dev/null
echo "✓ Vault ingestion started"
echo ""
echo "=== Complete! ==="
echo "- Document in vault: $OBJECT_ID"
echo "- OCR job: $OCR_JOB_ID"
echo "- Extracted text: extracted-text.txt"
echo "- Searchable PDF: searchable.pdf"
echo "- Semantic search: Processing (will be ready in a few minutes)"
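The polling loop in Step 4 checks every 5 seconds with no upper bound on total wait time. For long-running jobs, a bounded wait with increasing delays may be preferable. The following Python sketch shows one way to do that; `fetch_status` and `wait_for_job` are hypothetical helpers, and the one-hour timeout, the doubling delay, and the 60-second cap are arbitrary choices for illustration rather than recommendations from the API.

```python
import json
import time
import urllib.request

API_BASE = "https://api.case.dev/ocr/v1"


def fetch_status(api_key, job_id):
    """GET the job status and return the parsed JSON (network call)."""
    req = urllib.request.Request(
        f"{API_BASE}/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(fetch, timeout_s=3600, first_delay_s=5, max_delay_s=60,
                 sleep=time.sleep, clock=time.monotonic):
    """Call `fetch()` until the job finishes or `timeout_s` would be exceeded.

    Returns the final job dict on "completed"; raises RuntimeError on
    "error"/"failed" and TimeoutError if the deadline would be passed.
    """
    deadline = clock() + timeout_s
    delay = first_delay_s
    while True:
        job = fetch()
        status = job.get("status")
        if status == "completed":
            return job
        if status in ("error", "failed"):
            raise RuntimeError(f"OCR job ended with status {status!r}")
        if clock() + delay > deadline:
            raise TimeoutError(f"OCR job still {status!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 5, 10, 20, 40, 60, ...


# Example (not executed here):
# job = wait_for_job(lambda: fetch_status("sk_case_your_api_key_here", job_id))
```

Passing `fetch`, `sleep`, and `clock` as parameters keeps the loop testable without network access or real waiting.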
No Downloads Required
  OCR processes files directly from S3
  Eliminates download/upload roundtrip
Cost Effective
  Avoid S3 egress charges from repeated downloads
  Pay only for OCR processing
Faster Processing
  Direct S3 access is faster than HTTPS downloads
  24-hour presigned URLs work for large files
Secure
  Presigned URLs expire automatically
  No need to make files publicly accessible
Integrated Workflow
  Store → OCR → Search all in one platform
  OCR text feeds back into vault ingestion
  Semantic search across all documents

Scanned Depositions
# Upload scanned PDF → OCR → Make searchable
curl -X POST .../vault/$VAULT_ID/upload ...
curl -X POST .../ocr/v1/process -d '{"document_url": "s3://..."}'
curl -X POST .../vault/$VAULT_ID/ingest/$OBJECT_ID
Medical Records Processing
# Batch process multiple medical records
for file in medical-records/*.pdf; do
  echo "Processing $file"
  # Upload to vault
  # Submit for OCR
  # Wait for completion
  # Trigger ingestion
done
Discovery Document Analysis
# Upload 500-page document
# OCR with table extraction
curl -X POST .../ocr/v1/process \
-d '{"features": {"tables": {}, "embed": {}}}'
# Get structured JSON with tables