API Reference

Process Document

Submit a document for OCR processing. The API extracts text, preserves layout, and can generate searchable PDFs.

Endpoint

POST /ocr/v1/process

Example Request

curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "document_url": "https://your-storage.com/scanned-deposition.pdf",
    "document_id": "case-2024-1234-depo",
    "org_id": "your-org-id",
    "engine": "doctr",
    "features": {
      "embed": {}
    }
  }'

Example Response

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "pending",
  "document_id": "case-2024-1234-depo",
  "org_id": "your-org-id",
  "document_url": "https://your-storage.com/scanned-deposition.pdf",
  "engine": "doctr",
  "features": {},
  "page_count": 0,
  "chunk_count": 0,
  "chunks_completed": 0,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "links": {
    "self": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367",
    "original": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/original",
    "json": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/json",
    "text": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/text",
    "chunks": "https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/chunks"
  }
}

Request Parameters

Required:

  • document_url (string): URL of the document to process, either a publicly accessible URL or an s3:// URL
    • Supports: PDF, PNG, JPG, TIFF, and more
    • Max file size: 500MB
    • For s3:// URLs, a presigned URL is generated automatically

Optional:

  • document_id (string): Your internal reference ID
  • org_id (string): Your organization ID (auto-detected from API key if not provided)
  • engine (string): OCR engine to use (default: doctr)
    • doctr - Fast, good for printed text
    • tesseract - Better for handwriting
    • paddle - Specialized for tables and forms
  • callback_url (string): Webhook for completion notification
  • features (object): Additional processing options
    • embed: Generate searchable PDF with text layer
    • tables: Extract tables as structured data
    • forms: Detect and extract form fields
  • result_bucket (string): S3 bucket to store results
  • result_prefix (string): S3 key prefix for results
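
The parameters above can be assembled into a request body with a small shell helper. This is a minimal sketch, not part of the API: build_process_payload and json_escape are hypothetical names, only document_url and engine are handled, and the escaping covers backslashes and double quotes only.

```shell
# json_escape STRING: escape backslashes and double quotes for a JSON string.
# Sketch only: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# build_process_payload URL [ENGINE]: print a body for POST /ocr/v1/process.
# ENGINE defaults to doctr; anything other than the three documented
# engines is rejected before a request is ever sent.
build_process_payload() {
  url=$1
  engine=${2:-doctr}
  case "$engine" in
    doctr|tesseract|paddle) ;;
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  printf '{"document_url":"%s","engine":"%s"}\n' "$(json_escape "$url")" "$engine"
}
```

The output can be passed to curl with -d "$(build_process_payload "$URL" paddle)" in place of the inline JSON shown in the example request.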

Using S3 URLs

If your document is in S3, you can use an s3:// URL and we'll handle presigning:

{
  "document_url": "s3://your-bucket/documents/deposition-scan.pdf",
  "document_id": "depo-2024-1234"
}

We automatically generate a presigned URL valid for 24 hours (OCR can take a while for large documents).


Check OCR Status

Get the current status and results of your OCR job.

Endpoint

GET /ocr/v1/:id

Example Request

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
  -H "Authorization: Bearer sk_case_your_api_key_here"

Example Response (Processing)

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "processing",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 23,
  "chunks_processing": 15,
  "chunks_failed": 0,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:35:47Z"
}

Example Response (Completed)

{
  "id": "1f4a195e-026b-41ff-b367-c61089f5f367",
  "status": "completed",
  "page_count": 245,
  "chunk_count": 50,
  "chunks_completed": 50,
  "chunks_processing": 0,
  "chunks_failed": 0,
  "text": "Full extracted text from all 245 pages...",
  "confidence": 0.96,
  "created_at": "2025-11-04T09:30:12Z",
  "updated_at": "2025-11-04T09:48:23Z",
  "processing_time_ms": 1091000,
  "links": {
    "original": "https://vision-api.com/results/original.pdf",
    "searchable_pdf": "https://vision-api.com/results/searchable.pdf",
    "json": "https://vision-api.com/results/data.json",
    "text": "https://vision-api.com/results/text.txt"
  }
}

Status Values

  • pending: Job queued, not started yet
  • processing: OCR in progress
  • completed: Successfully finished
  • error: Failed (check error message)
  • failed: Failed processing
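
A polling client mostly needs to know whether to keep waiting, stop with results, or stop with an error. The sketch below maps the five status values listed above onto those three outcomes. The function name job_state and the labels it prints are assumptions made for illustration.

```shell
# job_state STATUS: map an API status value to a client-side action.
# Prints "wait", "done", or "failed"; anything else prints "unknown"
# so that an unexpected value is never silently treated as in progress.
job_state() {
  case "$1" in
    pending|processing) echo "wait" ;;
    completed)          echo "done" ;;
    error|failed)       echo "failed" ;;
    *)                  echo "unknown" ;;
  esac
}
```

Treating both error and failed as terminal keeps a polling loop from running indefinitely when a job ends in either state.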

Response Fields

Progress:

  • page_count: Total pages in document
  • chunk_count: Document split into chunks for parallel processing
  • chunks_completed: Chunks finished
  • chunks_processing: Chunks currently being processed
  • chunks_failed: Chunks that failed (indicates quality issues)

Results:

  • text: Full extracted text (only when completed)
  • confidence: Overall accuracy (0-1, higher is better)
  • processing_time_ms: How long OCR took

Output Files:

  • links.original: Original uploaded document
  • links.searchable_pdf: PDF with embedded text layer (searchable)
  • links.json: Structured JSON with page/word coordinates
  • links.text: Plain text extraction
  • links.chunks: Individual chunk results
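
The progress and confidence fields lend themselves to two small client-side checks: a percentage for display and a flag for results that may need manual review. This is a sketch under assumptions: both function names are hypothetical, and any review threshold (0.90 in the usage note) is an arbitrary choice, not an API recommendation.

```shell
# progress_percent CHUNKS_COMPLETED CHUNK_COUNT: integer percent of chunks done.
# Prints 0 while chunk_count is still 0, as in a freshly created job.
progress_percent() {
  if [ "$2" -gt 0 ]; then
    echo $(( $1 * 100 / $2 ))
  else
    echo 0
  fi
}

# below_confidence CONFIDENCE THRESHOLD: succeed if confidence < threshold.
# awk does the comparison because POSIX shell arithmetic is integer-only.
below_confidence() {
  awk -v c="$1" -v t="$2" 'BEGIN { exit !(c + 0 < t + 0) }'
}
```

With the processing response shown earlier, progress_percent 23 50 prints 46. A check such as below_confidence "$CONFIDENCE" 0.90 could then gate a manual review step.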

Processing Times

Document Type          Pages   Typical Time
Simple typed doc       10      30 seconds
Scanned deposition     100     3-5 minutes
Large discovery file   500     15-20 minutes
Mixed quality scan     250     8-12 minutes

Factors affecting speed:

  • Page count (linear scaling)
  • Image quality (low quality = slower)
  • Complexity (tables/forms = slower)
  • Handwriting (much slower than print)

Download OCR Results

We provide direct download endpoints for all OCR result types:

Endpoints

GET /ocr/v1/:id/download/text       - Plain text extraction
GET /ocr/v1/:id/download/json       - Structured OCR data with coordinates
GET /ocr/v1/:id/download/pdf        - Searchable PDF with text layer
GET /ocr/v1/:id/download/original   - Original uploaded document
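
Since the four endpoints differ only in their last path segment, a script can build them from a job ID and a result type. The helper below is an illustrative sketch; download_url is a hypothetical name, and the base URL is taken from the examples in this reference.

```shell
# download_url JOB_ID TYPE: print the download URL for one result type.
# TYPE must be one of the four documented types: text, json, pdf, original.
download_url() {
  case "$2" in
    text|json|pdf|original)
      echo "https://api.case.dev/ocr/v1/$1/download/$2" ;;
    *)
      echo "unknown result type: $2" >&2
      return 1 ;;
  esac
}
```

For example, curl "$(download_url "$JOB_ID" pdf)" -H "Authorization: Bearer $API_KEY" -o searchable.pdf would fetch the searchable PDF.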

Download Plain Text

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/text \
  -H "Authorization: Bearer sk_case_..." \
  -o extracted-text.txt

Download Searchable PDF

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/pdf \
  -H "Authorization: Bearer sk_case_..." \
  -o searchable-deposition.pdf

Download Structured JSON

# Direct download (recommended)
curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/json \
  -H "Authorization: Bearer sk_case_..." \
  -o ocr-data.json

The JSON download includes:

  • Word-level bounding boxes
  • Confidence scores per word
  • Page-level layout information
  • Table structures (if extracted)

Download Original Document

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367/download/original \
  -H "Authorization: Bearer sk_case_..." \
  -o original-document.pdf

Alternative: Extract from Status Response

Once a job is completed, the status response also contains the full text in its text field, so you can extract it without a separate download (text only):

curl https://api.case.dev/ocr/v1/1f4a195e-026b-41ff-b367-c61089f5f367 \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.text' > extracted-text.txt

Pricing & Speed

Processing Cost

  • Per page: ~$0.01-0.03 depending on complexity
  • Typical deposition (150 pages): ~$2-4
  • Medical record (500 pages): ~$8-15
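
The arithmetic behind these figures is simply page count times the per-page rate. The sketch below works in whole cents to stay within integer shell arithmetic and uses the quoted range of 1 to 3 cents per page. The function names are hypothetical, and actual billing may differ.

```shell
# cents_to_dollars CENTS: format integer cents as dollars, e.g. 150 -> 1.50.
cents_to_dollars() {
  printf '%d.%02d' $(( $1 / 100 )) $(( $1 % 100 ))
}

# estimate_cost PAGES: print a low-high cost range in dollars,
# using the quoted per-page range of 1 to 3 cents.
estimate_cost() {
  low=$(( $1 * 1 ))
  high=$(( $1 * 3 ))
  printf '$%s-$%s\n' "$(cents_to_dollars "$low")" "$(cents_to_dollars "$high")"
}
```

For a 150-page deposition, estimate_cost 150 prints $1.50-$4.50, which brackets the typical figure quoted above.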

Processing Speed

  • 10 pages: ~30 seconds
  • 50 pages: ~2 minutes
  • 200 pages: ~8 minutes
  • 500 pages: ~18 minutes

Speed varies by:

  • Image quality (low quality = slower)
  • Layout complexity (tables/forms = slower)
  • Engine choice (doctr is fastest; paddle is slowest but highest quality)
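
These figures can be turned into a rough client-side timeout for polling. The sketch below assumes 3 seconds per page, the slowest per-page rate implied by the numbers above (10 pages in about 30 seconds), and doubles it as a safety margin. Both function names, the factor of 2, and the 60-second floor are assumptions, and handwriting or poor scans may still exceed the estimate.

```shell
# estimate_seconds PAGES: rough processing time at 3 seconds per page,
# the slowest per-page rate in the figures above.
estimate_seconds() {
  echo $(( $1 * 3 ))
}

# poll_timeout PAGES: a client-side timeout of twice the estimate,
# never less than 60 seconds. Both margins are arbitrary choices.
poll_timeout() {
  t=$(( $(estimate_seconds "$1") * 2 ))
  if [ "$t" -lt 60 ]; then t=60; fi
  echo "$t"
}
```

For a 500-page document, estimate_seconds 500 prints 1500 (25 minutes, somewhat above the quoted ~18 minutes) and poll_timeout 500 prints 3000.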

Vault Integration

Documents stored in a vault can be submitted for OCR without downloading them first, because the OCR API accepts s3:// URLs directly.

Submit a vault document using the s3:// URL format; a presigned URL is generated automatically:

# Get vault object to find S3 bucket and key
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
OBJECT_ID="i5ar122d3h11a1802a3mogob"

OBJECT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer sk_case_your_api_key_here")

FILENAME=$(echo "$OBJECT_INFO" | jq -r '.filename')

# Get vault info for bucket name
VAULT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID \
  -H "Authorization: Bearer sk_case_your_api_key_here")

FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')

# Submit for OCR using s3:// URL
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"s3://$FILES_BUCKET/objects/$OBJECT_ID/$FILENAME\",
    \"document_id\": \"vault-$OBJECT_ID\",
    \"engine\": \"doctr\"
  }"

A 24-hour presigned URL is generated automatically, which leaves enough time for large documents to process.
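
One practical pitfall in the script above is that jq -r prints the literal string null when a field is missing, which would produce a URL like s3://null/objects/... without any error. The helper below is a sketch that builds the URL using the objects/<object id>/<filename> key layout from the example and rejects empty or null parts. vault_s3_url is a hypothetical name.

```shell
# vault_s3_url BUCKET OBJECT_ID FILENAME: build the s3:// URL for a vault
# object using the key layout from the example above. Fails if any part is
# empty or the literal "null" that jq -r prints for a missing field.
vault_s3_url() {
  for part in "$1" "$2" "$3"; do
    case "$part" in
      ''|null)
        echo "missing bucket, object ID, or filename" >&2
        return 1 ;;
    esac
  done
  printf 's3://%s/objects/%s/%s\n' "$1" "$2" "$3"
}
```

In the script above, the document_url value could then come from vault_s3_url "$FILES_BUCKET" "$OBJECT_ID" "$FILENAME", with the script stopping if the call fails.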

Using Presigned URLs from Vault

Alternatively, use the vault's download URL directly:

# Get download URL from vault (expires in 1 hour)
DOWNLOAD_URL=$(curl -s https://api.case.dev/vault/$VAULT_ID/objects/$OBJECT_ID \
  -H "Authorization: Bearer sk_case_..." \
  | jq -r '.downloadUrl')

# Submit to OCR
curl -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer sk_case_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$DOWNLOAD_URL\",
    \"document_id\": \"vault-doc-001\"
  }"

Complete Vault + OCR + Ingestion Workflow

Here's an end-to-end workflow script:

#!/bin/bash
set -e

API_KEY="sk_case_your_api_key_here"
VAULT_ID="sytp1b5f5j1yuj7uffzzxgw6"
LOCAL_FILE="scanned-deposition.pdf"

echo "=== Step 1: Upload to Vault ==="
UPLOAD_RESPONSE=$(curl -s -X POST https://api.case.dev/vault/$VAULT_ID/upload \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"filename\": \"$(basename "$LOCAL_FILE")\",
    \"contentType\": \"application/pdf\",
    \"metadata\": {
      \"case\": \"2024-CV-1234\",
      \"type\": \"deposition\",
      \"witness\": \"Dr. Sarah Johnson\"
    }
  }")

OBJECT_ID=$(echo "$UPLOAD_RESPONSE" | jq -r '.objectId')
UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')

echo "Object ID: $OBJECT_ID"

# Upload the file
curl -s -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/pdf" \
  --data-binary "@$LOCAL_FILE"

echo "✓ Document uploaded to vault"

echo ""
echo "=== Step 2: Get Vault Info for S3 Bucket ==="
VAULT_INFO=$(curl -s https://api.case.dev/vault/$VAULT_ID \
  -H "Authorization: Bearer $API_KEY")

FILES_BUCKET=$(echo "$VAULT_INFO" | jq -r '.filesBucket')
echo "Bucket: $FILES_BUCKET"

echo ""
echo "=== Step 3: Submit for OCR ==="
S3_URL="s3://$FILES_BUCKET/objects/$OBJECT_ID/$(basename "$LOCAL_FILE")"

OCR_RESPONSE=$(curl -s -X POST https://api.case.dev/ocr/v1/process \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"document_url\": \"$S3_URL\",
    \"document_id\": \"depo-$OBJECT_ID\",
    \"engine\": \"doctr\",
    \"features\": {\"embed\": {}}
  }")

OCR_JOB_ID=$(echo "$OCR_RESPONSE" | jq -r '.id')
echo "✓ OCR job submitted: $OCR_JOB_ID"

echo ""
echo "=== Step 4: Wait for OCR Completion ==="
while true; do
  OCR_STATUS_RESPONSE=$(curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID \
    -H "Authorization: Bearer $API_KEY")

  STATUS=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.status')
  PAGE_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.page_count')
  CHUNKS_COMPLETED=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunks_completed')
  CHUNK_COUNT=$(echo "$OCR_STATUS_RESPONSE" | jq -r '.chunk_count')

  echo "Status: $STATUS | Pages: $PAGE_COUNT | Chunks: $CHUNKS_COMPLETED/$CHUNK_COUNT"

  if [ "$STATUS" = "completed" ]; then
    echo "✓ OCR completed!"
    break
  elif [ "$STATUS" = "failed" ] || [ "$STATUS" = "error" ]; then
    echo "✗ OCR failed (status: $STATUS)"
    exit 1
  fi

  sleep 5
done

echo ""
echo "=== Step 5: Download OCR Results ==="
# Download extracted text
curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/text \
  -H "Authorization: Bearer $API_KEY" \
  -o extracted-text.txt

echo "✓ Text saved to extracted-text.txt"

# Download searchable PDF
curl -s https://api.case.dev/ocr/v1/$OCR_JOB_ID/download/pdf \
  -H "Authorization: Bearer $API_KEY" \
  -o searchable.pdf

echo "✓ Searchable PDF saved to searchable.pdf"

echo ""
echo "=== Step 6: Trigger Vault Ingestion for Semantic Search ==="
curl -s -X POST https://api.case.dev/vault/$VAULT_ID/ingest/$OBJECT_ID \
  -H "Authorization: Bearer $API_KEY" > /dev/null

echo "✓ Vault ingestion started"
echo ""
echo "=== Complete! ==="
echo "- Document in vault: $OBJECT_ID"
echo "- OCR job: $OCR_JOB_ID"
echo "- Extracted text: extracted-text.txt"
echo "- Searchable PDF: searchable.pdf"
echo "- Semantic search: Processing (will be ready in a few minutes)"
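
The Step 4 loop polls every 5 seconds for as long as the job runs. For long documents, a variant with an overall timeout and a growing delay between polls may be preferable. The following is a sketch under assumptions: wait_for_job and next_delay are hypothetical names, the 1800-second default timeout and 60-second delay cap are arbitrary, and API_KEY, curl, and jq are expected to be available.

```shell
# next_delay CURRENT MAX: double the delay, capped at MAX seconds.
next_delay() {
  d=$(( $1 * 2 ))
  if [ "$d" -gt "$2" ]; then d=$2; fi
  echo "$d"
}

# wait_for_job JOB_ID [TIMEOUT_SECONDS]: poll until the job reaches a
# terminal status or the timeout elapses.
# Returns 0 on completed, 1 on failed/error, 2 on timeout.
wait_for_job() {
  job_id=$1
  timeout=${2:-1800}
  delay=5
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # A failed request leaves status empty, so it is retried on the next pass.
    status=$(curl -fsS "https://api.case.dev/ocr/v1/$job_id" \
      -H "Authorization: Bearer $API_KEY" | jq -r '.status')
    case "$status" in
      completed) return 0 ;;
      failed|error)
        echo "OCR job $job_id ended with status: $status" >&2
        return 1 ;;
    esac
    sleep "$delay"
    waited=$(( waited + delay ))
    delay=$(next_delay "$delay" 60)
  done
  echo "Timed out after ${timeout}s waiting for $job_id" >&2
  return 2
}
```

With these defaults the delays run 5, 10, 20, 40, then 60 seconds, so a long job generates far fewer status requests than a fixed 5-second interval.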

Key Benefits of Vault Integration

No Downloads Required

  • OCR processes files directly from S3
  • Eliminates download/upload roundtrip

Cost Effective

  • Avoid S3 egress charges from repeated downloads
  • Pay only for OCR processing

Faster Processing

  • Direct S3 access is faster than HTTPS downloads
  • 24-hour presigned URLs work for large files

Secure

  • Presigned URLs expire automatically
  • No need to make files publicly accessible

Integrated Workflow

  • Store → OCR → Search all in one platform
  • OCR text feeds back into vault ingestion
  • Semantic search across all documents

Common Use Cases

Scanned Depositions

# Upload scanned PDF → OCR → Make searchable
curl -X POST .../vault/$VAULT_ID/upload ...
curl -X POST .../ocr/v1/process -d '{"document_url": "s3://..."}'
curl -X POST .../vault/$VAULT_ID/ingest/$OBJECT_ID

Medical Records Processing

# Batch process multiple medical records
for file in medical-records/*.pdf; do
  echo "Processing $file"
  # Upload to vault
  # Submit for OCR
  # Wait for completion
  # Trigger ingestion
done
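
When batch processing, it helps to give each job a document_id that can be traced back to its file. One possible convention, shown here purely as an assumption, is the file name without its directory or .pdf extension, with spaces replaced by dashes. document_id_for is a hypothetical helper name.

```shell
# document_id_for FILE: derive a document_id from a file path by dropping
# the directory and the .pdf extension and replacing spaces with dashes.
document_id_for() {
  basename "$1" .pdf | tr ' ' '-'
}
```

Inside the loop, document_id_for "$file" would supply the document_id value for each OCR request.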

Discovery Document Analysis

# Upload 500-page document
# OCR with table extraction
curl -X POST .../ocr/v1/process \
  -d '{"features": {"tables": {}, "embed": {}}}'
# Get structured JSON with tables