API Reference

Process Document

Submit a document for OCR processing. The API extracts text, preserves layout, and can generate searchable PDFs.

Endpoint

POST /ocr/v1/process

Example Request

curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://your-storage.com/scanned-deposition.pdf",
    "document_id": "case-2024-1234-depo",
    "org_id": "your-org-id",
    "engine": "doctr",
    "features": {
      "embed": {}
    }
  }'

Example Response

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "pending",
  "document_id": "case-2024-1234-depo",
  "org_id": "your-org-id",
  "document_url": "https://your-storage.com/scanned-deposition.pdf",
  "engine": "doctr",
  "features": {},
  "page_count": 0,
  "chunk_count": 0,
  "chunks_completed": 0,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "links": {
    "self": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367",
    "original": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/original",
    "json": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/json",
    "text": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/text",
    "chunks": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/chunks"
  }
}

Request Parameters

Required:

document_url (string): URL the API can fetch the document from: a publicly reachable or presigned HTTPS URL, or an s3:// URL
- Supports: PDF, PNG, JPG, TIFF, and more
- Max file size: 500MB
- For s3:// URLs, we generate a presigned URL automatically

Optional:

document_id (string): Your internal reference ID
org_id (string): Your organization ID (auto-detected from API key if not provided)
engine (string): OCR engine to use (default: doctr)
- doctr - Fast, good for printed text
- tesseract - Better for handwriting
- paddle - Specialized for tables and forms
callback_url (string): Webhook for completion notification
features (object): Additional processing options
- embed: Generate searchable PDF with text layer
- tables: Extract tables as structured data
- forms: Detect and extract form fields
result_bucket (string): S3 bucket to store results
result_prefix (string): S3 key prefix for results
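
As a rough illustration of how the required and optional parameters fit together, here is a minimal sketch that assembles a request body in plain shell. The helper names json_escape and build_ocr_request are hypothetical (they are not part of the API), and only document_url, document_id, and engine are covered; it escapes quotes and backslashes but not control characters.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes so a value can sit
# inside a JSON string. Control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build_ocr_request DOCUMENT_URL [DOCUMENT_ID] [ENGINE]
# Prints a JSON body for POST /ocr/v1/process. Optional fields are left out
# when empty, so the API falls back to its defaults (e.g. engine: doctr).
build_ocr_request() (
  body=$(printf '{"document_url":"%s"' "$(json_escape "$1")")
  if [ -n "${2:-}" ]; then
    body=$(printf '%s,"document_id":"%s"' "$body" "$(json_escape "$2")")
  fi
  if [ -n "${3:-}" ]; then
    body=$(printf '%s,"engine":"%s"' "$body" "$(json_escape "$3")")
  fi
  printf '%s}\n' "$body"
)

# Example (not run here):
#   curl -X POST https://api.case.dev/ocr/v1/process \
#     -H "Authorization: Bearer sk_case_your_api_key_here" \
#     -H "Content-Type: application/json" \
#     -d "$(build_ocr_request "https://your-storage.com/scanned-deposition.pdf" "case-2024-1234-depo" "paddle")"
```

Building the body in one place keeps quoting mistakes out of the curl command itself. If jq is available, jq -n with --arg is a sturdier way to do the same thing.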

Using S3 URLs

If your document is in S3, you can use an s3:// URL and we'll handle presigning:

{
  "document_url": "s3://your-bucket/documents/deposition-scan.pdf",
  "document_id": "depo-2024-1234"
}

We automatically generate a presigned URL valid for 24 hours (OCR can take a while for large documents).

Check OCR Status

Get the current status and results of your OCR job.

Endpoint

GET /ocr/v1/:id

Example Request

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
  -H "Authorization: Bearer sk_case_your_api_key_here"

Example Response (Processing)

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "processing",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 23,
  "chunks_processing": 15,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:35:47Z"
}

Example Response (Completed)

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "completed",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 50,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "text": "Full extracted text from all 245 pages...",
  "confidence": 0.96,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:48:23Z",
  "processing_time_ms": 1091000,
  "links": {
    "original": "https://vision-api.com/results/original.pdf",
    "searchable_pdf": "https://vision-api.com/results/searchable.pdf",
    "json": "https://vision-api.com/results/data.json",
    "text": "https://vision-api.com/results/text.txt"
  }
}

Status Values

pending: Job queued, not started yet
processing: OCR in progress
completed: Successfully finished
error: Failed (check error message)
failed: Failed processing
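
The status list above can be turned into a polling loop along these lines. This is a sketch under a few assumptions: it treats completed, error, and failed as the states where polling should stop, it reads the key from a CASE_API_KEY environment variable (a hypothetical name), and the 30-minute timeout and 10-second interval are arbitrary defaults rather than documented recommendations. It needs curl and jq.

```shell
#!/bin/sh
# Return 0 when a status means the job will not change any more.
is_terminal_status() {
  case "$1" in
    completed|error|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: wait_for_ocr JOB_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls GET /ocr/v1/:id and prints the final status.
# Exit codes: 0 = completed, 1 = error or failed, 2 = timed out.
wait_for_ocr() (
  job_id=$1
  timeout=${2:-1800}
  interval=${3:-10}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(curl -sf "https://api.case.dev/ocr/v1/$job_id" \
      -H "Authorization: Bearer $CASE_API_KEY" | jq -r '.status')
    if is_terminal_status "$status"; then
      printf '%s\n' "$status"
      if [ "$status" = "completed" ]; then exit 0; else exit 1; fi
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s" >&2
  exit 2
)

# Example (not run here):
#   wait_for_ocr 1f4a195e-026b-41ff-b367-c61089f5f367 3600 15
```

Checking for both error and failed matters: a loop that only watches for one of them would keep polling forever if the job ends in the other.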

Response Fields

Progress:

page_count: Total pages in document
chunk_count: Document split into chunks for parallel processing
chunks_completed: Chunks finished
chunks_processing: Chunks currently being processed
chunks_failed: Chunks that failed (may indicate document quality issues)

Results:

text: Full extracted text (only when completed)
confidence: Overall accuracy (0-1, higher is better)
processing_time_ms: How long OCR took

Output Files:

links.original: Original uploaded document
links.searchable_pdf: PDF with embedded text layer (searchable)
links.json: Structured JSON with page/word coordinates
links.text: Plain text extraction
links.chunks: Individual chunk results
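
The progress fields above are enough to show a percentage while a job runs. A minimal sketch follows; progress_percent is a hypothetical helper name, and it assumes chunks_completed divided by chunk_count is a fair proxy for overall progress.

```shell
#!/bin/sh
# Hypothetical helper: progress_percent CHUNKS_COMPLETED CHUNK_COUNT
# Prints a whole-number percentage. While chunk_count is still 0 (as in the
# pending response) or not a number, it prints 0 instead of dividing by zero.
progress_percent() {
  if [ "$2" -gt 0 ] 2>/dev/null; then
    echo $(( $1 * 100 / $2 ))
  else
    echo 0
  fi
}

# Example with the values from the processing response: 23 of 50 chunks.
progress_percent 23 50
```

With jq you could feed it straight from a status response, for example progress_percent "$(jq -r '.chunks_completed' status.json)" "$(jq -r '.chunk_count' status.json)", where status.json is a saved response.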

Processing Times

Document Type          Pages   Typical Time
Simple typed doc       10      30 seconds
Scanned deposition     100     3-5 minutes
Mixed quality scan     250     8-12 minutes
Large discovery file   500     15-20 minutes

Factors affecting speed:

Page count (linear scaling)
Image quality (low quality = slower)
Complexity (tables/forms = slower)
Handwriting (much slower than print)
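
Because time scales roughly linearly with page count, a back-of-the-envelope estimate is easy to compute before you submit a job. The sketch below assumes 2 to 3 seconds per page, a figure read off the table above rather than a documented rate, and estimate_seconds is a hypothetical helper name. Handwriting or poor scans can land well outside this range.

```shell
#!/bin/sh
# Hypothetical helper: estimate_seconds PAGES
# Prints "LOW HIGH" in seconds, assuming 2-3 seconds per page (an assumption
# derived from the typical times above, not a guarantee).
estimate_seconds() {
  echo "$(( $1 * 2 )) $(( $1 * 3 ))"
}

# Example: a 100-page scanned deposition.
estimate_seconds 100
```

An estimate like this is mostly useful for choosing a polling timeout, for instance a few times the upper bound.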

Download OCR Results

We provide direct download endpoints for all OCR result types:

Endpoints

GET /ocr/v1/:id/download/text       - Plain text extraction
GET /ocr/v1/:id/download/json       - Structured OCR data with coordinates
GET /ocr/v1/:id/download/pdf        - Searchable PDF with text layer
GET /ocr/v1/:id/download/original   - Original uploaded document
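
If you script several downloads, it can help to build these URLs in one place. A small sketch, where ocr_download_url is a hypothetical helper name and the four result types are exactly the ones listed above:

```shell
#!/bin/sh
# Hypothetical helper: ocr_download_url JOB_ID KIND
# Prints the download URL for KIND (text, json, pdf, or original).
# Returns 1 for any other KIND instead of building a URL that would not exist.
ocr_download_url() {
  case "$2" in
    text|json|pdf|original)
      printf 'https://api.case.dev/ocr/v1/%s/download/%s\n' "$1" "$2" ;;
    *)
      echo "unknown result type: $2" >&2
      return 1 ;;
  esac
}

# Example (not run here):
#   curl "$(ocr_download_url "$OCR_JOB_ID" pdf)" \
#     -H "Authorization: Bearer sk_case_your_api_key_here" \
#     -o searchable.pdf
```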

Download Plain Text

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/text \
  -H "Authorization: Bearer sk_case_..." \
  -o extracted-text.txt

Download Searchable PDF

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/pdf \
  -H "Authorization: Bearer sk_case_..." \
  -o searchable-deposition.pdf

Download Structured JSON

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/json \
  -H "Authorization: Bearer sk_case_..." \
  -o ocr-data.json

The JSON download includes:

Word-level bounding boxes
Confidence scores per word
Page-level layout information
Table structures (if extracted)

Download Original Document

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/original \
  -H "Authorization: Bearer sk_case_..." \
  -o original-document.pdf

Alternative: Extract from Status Response

Once the job is completed, the status response also carries the full text in its text field, so you can extract it without a separate download (text only):

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.text' > extracted-text.txt

Pricing & Speed

Processing Cost

Per page: ~$0.01-0.03 depending on complexity
Typical deposition (150 pages): ~$2-4
Medical record (500 pages): ~$8-15
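
To budget a batch, you can apply the per-page range directly. The sketch below uses the 1 to 3 cents per page figure from above and integer arithmetic in cents; estimate_cost is a hypothetical helper name, and actual charges depend on document complexity.

```shell
#!/bin/sh
# Hypothetical helper: estimate_cost PAGES
# Prints a "$LOW-$HIGH" range at 1 to 3 cents per page.
# Works in whole cents so no floating point is needed.
estimate_cost() (
  low=$(( $1 * 1 ))
  high=$(( $1 * 3 ))
  printf '$%d.%02d-$%d.%02d\n' \
    $((low / 100)) $((low % 100)) $((high / 100)) $((high % 100))
)

# Example: a 150-page deposition.
estimate_cost 150
```

For 150 pages this gives a range of $1.50 to $4.50, which brackets the typical deposition figure quoted above.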

Processing Speed

10 pages: ~30 seconds
50 pages: ~2 minutes
200 pages: ~8 minutes
500 pages: ~18 minutes

Speed varies by:

Image quality (low quality = slower)
Layout complexity (tables/forms = slower)
Engine choice (doctr is fastest; paddle is slowest but strongest on tables and forms)

Vault Integration

Process vault documents with OCR for text extraction without downloading. The OCR API accepts S3 URLs directly, making vault integration seamless.

Using S3 URLs for Vault Documents (Recommended)

You can submit vault documents for OCR using the s3:// URL format - the router automatically generates presigned URLs:

# Get vault object to find S3 bucket and key
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
OBJECT_ID="i5ar122d3h11a1802a3mogob"

OBJECT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer sk_case_your_api_key_here")

FILENAME=$(echo "$OBJECT_INFO" | jq -r '.filename')

# Get vault info for bucket name
VAULT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID \
  -H "Authorization: Bearer sk_case_your_api_key_here")

FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')

# Submit for OCR using s3:// URL
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"s3://$FILES_BUCKET/objects/$OBJECT_ID/$FILENAME\",
    \"document_id\": \"vault-$OBJECT_ID\",
    \"engine\": \"doctr\"
  }"

The router automatically generates a 24-hour presigned URL - perfect for large documents that take time to process.
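
One thing to watch in the script above: jq -r prints the literal string null when a field is missing, which would silently produce a broken s3:// URL. A minimal sketch of a guard follows; vault_s3_url is a hypothetical helper name, and the objects/OBJECT_ID/FILENAME key layout is taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper: vault_s3_url FILES_BUCKET OBJECT_ID FILENAME
# Prints s3://FILES_BUCKET/objects/OBJECT_ID/FILENAME, or returns 1 if any
# part is empty or the literal "null" that jq -r emits for a missing field.
vault_s3_url() {
  for part in "${1:-}" "${2:-}" "${3:-}"; do
    if [ -z "$part" ] || [ "$part" = "null" ]; then
      echo "missing bucket, object ID, or filename" >&2
      return 1
    fi
  done
  printf 's3://%s/objects/%s/%s\n' "$1" "$2" "$3"
}

# Example (not run here):
#   S3_URL=$(vault_s3_url "$FILES_BUCKET" "$OBJECT_ID" "$FILENAME") || exit 1
```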

Using Presigned URLs from Vault

Alternatively, use the vault's download URL directly:

# Get download URL from vault (expires in 1 hour)
DOWNLOAD_URL=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.downloadUrl')

# Submit to OCR
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$DOWNLOAD_URL\",
    \"document_id\": \"vault-doc-001\"
  }"

Complete Vault + OCR + Ingestion Workflow

Here's an end-to-end workflow script that uploads a document, runs OCR, downloads the results, and triggers ingestion:

#!/bin/bash
set -e

API_KEY="sk_case_your_api_key_here"
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
LOCAL_FILE="scanned-deposition.pdf"

echo "=== Step 1: Upload to Vault ==="
UPLOAD_RESPONSE=$(curl -s -X POST https://api.case.dev/vault/$VAULT_ID/upload \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"filename\": \"$(basename "$LOCAL_FILE")\",
    \"contentType\": \"application/pdf\",
    \"metadata\": {
      \"case\": \"2024-CV-1234\",
      \"type\": \"deposition\",
      \"witness\": \"Dr. Sarah Johnson\"
    }
  }")

OBJECT_ID=$(echo "$UPLOAD_RESPONSE" | jq -r '.objectId')
UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')

echo "Object ID: $OBJECT_ID"

# Upload the file (-f makes curl fail on an HTTP error, so set -e stops the script)
curl -sf -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/pdf" \
  --data-binary "@$LOCAL_FILE"

echo "✓ Document uploaded to vault"

echo ""
echo "=== Step 2: Get Vault Info for S3 Bucket ==="
VAULT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID \
  -H "Authorization: Bearer $API_KEY")

FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')
echo "Bucket: $FILES_BUCKET"

echo ""
echo "=== Step 3: Submit for OCR ==="
S3_URL="s3://$FILES_BUCKET/objects/$OBJECT_ID/$(basename "$LOCAL_FILE")"

OCR_RESPONSE=$(curl -s -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$S3_URL\",
    \"document_id\": \"depo-$OBJECT_ID\",
    \"engine\": \"doctr\",
    \"features\": {\"embed\": {}}
  }")

OCR_JOB_ID=$(echo "$OCR_RESPONSE" | jq -r '.id')
echo "✓ OCR job submitted: $OCR_JOB_ID"

echo ""
echo "=== Step 4: Wait for OCR Completion ==="
while true; do
  OCR_STATUS_RESPONSE=$(curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID \
    -H "Authorization: Bearer $API_KEY")

  STATUS=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.status')
  PAGE_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.page_count')
  CHUNKS_COMPLETED=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunks_completed')
  CHUNK_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunk_count')

  echo "Status: $STATUS | Pages: $PAGE_COUNT | Chunks: $CHUNKS_COMPLETED/$CHUNK_COUNT"

  if [ "$STATUS" = "completed" ]; then
    echo "✓ OCR completed!"
    break
  elif [ "$STATUS" = "failed" ] || [ "$STATUS" = "error" ]; then
    echo "✗ OCR failed"
    exit 1
  fi

  sleep 5
done

echo ""
echo "=== Step 5: Download OCR Results ==="
# Download extracted text
curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/text \
  -H "Authorization: Bearer $API_KEY" \
  -o extracted-text.txt

echo "✓ Text saved to extracted-text.txt"

# Download searchable PDF
curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/pdf \
  -H "Authorization: Bearer $API_KEY" \
  -o searchable.pdf

echo "✓ Searchable PDF saved to searchable.pdf"

echo ""
echo "=== Step 6: Trigger Vault Ingestion for Semantic Search ==="
curl -s -X POST https://api.case.dev/vault/$VAULT_ID/ingest/$OBJECT_ID \
  -H "Authorization: Bearer $API_KEY" > /dev/null

echo "✓ Vault ingestion started"
echo ""
echo "=== Complete! ==="
echo "- Document in vault: $OBJECT_ID"
echo "- OCR job: $OCR_JOB_ID"
echo "- Extracted text: extracted-text.txt"
echo "- Searchable PDF: searchable.pdf"
echo "- Semantic search: Processing (will be ready in a few minutes)"

Key Benefits of Vault Integration

No Downloads Required

OCR processes files directly from S3
Eliminates download/upload roundtrip

Cost Effective

Avoid S3 egress charges from repeated downloads
Pay only for OCR processing

Faster Processing

Direct S3 access is faster than HTTPS downloads
24-hour presigned URLs work for large files

Secure

Presigned URLs expire automatically
No need to make files publicly accessible

Integrated Workflow

Store → OCR → Search all in one platform
OCR text feeds back into vault ingestion
Semantic search across all documents

Common Use Cases

Scanned Depositions

# Upload scanned PDF → OCR → Make searchable
curl -X POST .../vault/$VAULT_ID/upload ...
curl -X POST .../ocr/v1/process -d '{"document_url": "s3://..."}'
curl -X POST .../vault/$VAULT_ID/ingest/$OBJECT_ID

Medical Records Processing

# Batch process multiple medical records
for file in medical-records/*.pdf; do
  # Upload to vault
  # Submit for OCR
  # Wait for completion
  # Trigger ingestion
done

Discovery Document Analysis

# Upload 500-page document
# OCR with table extraction
curl -X POST .../ocr/v1/process \
  -d '{"features": {"tables": {}, "embed": {}}}'
# Get structured JSON with tables
