Best Practices

Pro Tips

Get Better Accuracy

Use high-quality scans
- 300 DPI or higher
- Good contrast
- Straight, not skewed

Choose the right engine
- doctr: Best for typed/printed text (fastest)
- tesseract: Better for mixed print/handwriting
- paddle: Specialized for complex layouts and tables

Enable table extraction for forms and spreadsheets

{
  "features": {
    "tables": {
      "format": "csv"
    }
  }
}
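
The engine guidance above can be turned into a small request builder. This is a minimal Python sketch, not part of the API: the document-type labels (`printed`, `handwriting`, `tables`) and the `build_request` helper are hypothetical names chosen for illustration, while the engine names and the `features.tables` shape are taken from this page.

```python
# Engine names come from the list above; the document-type labels are
# hypothetical keys chosen for this sketch.
ENGINE_BY_DOC_TYPE = {
    "printed": "doctr",          # typed/printed text, fastest
    "handwriting": "tesseract",  # mixed print/handwriting
    "tables": "paddle",          # complex layouts, tables
}


def build_request(document_url, doc_type, want_tables=False):
    """Build an OCR request body as a plain dict."""
    body = {
        "document_url": document_url,
        "engine": ENGINE_BY_DOC_TYPE[doc_type],
    }
    if want_tables:
        # Same shape as the table-extraction example above.
        body["features"] = {"tables": {"format": "csv"}}
    return body
```

Leaving `features` out unless it is requested keeps the request minimal, which matches the advice to skip features you don't need.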

Working with Large Files

Option 1: Webhook (Recommended)

{
  "document_url": "https://...",
  "callback_url": "https://your-app.com/ocr-complete",
  "features": { "embed": {} }
}

We POST the results to your callback_url when processing is done, so there is no need to poll.
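
On the receiving side, the callback endpoint only needs to accept a POST and decode its body. The sketch below uses Python's standard `http.server` and is an illustration under assumptions: the callback body is assumed to be a JSON object, and the `status` field is assumed to mirror the polling response. Check the actual callback payload before relying on either.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(raw_body):
    """Decode a callback body; return (http_status, payload_or_None)."""
    try:
        payload = json.loads(raw_body)
    except (ValueError, UnicodeDecodeError):
        return 400, None
    if not isinstance(payload, dict):
        return 400, None
    return 200, payload


class OcrCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = parse_callback(self.rfile.read(length))
        if payload is not None:
            # "status" is an assumed field name, mirroring the polling response.
            print("OCR callback received, status:", payload.get("status"))
        self.send_response(status)
        self.end_headers()


def serve(port=8000):
    """Block forever, accepting callbacks on the given port."""
    HTTPServer(("", port), OcrCallbackHandler).serve_forever()
```

Calling `serve()` starts a local listener for testing. In production the handler would sit behind HTTPS at the address given in `callback_url`.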

Option 2: Polling with Backoff

# Check every minute for the first 5 checks,
# then every 5 minutes.
# Give up after about 1 hour (5 x 1 min + 11 x 5 min).

RETRIES=0
MAX_RETRIES=16

while [ "$RETRIES" -lt "$MAX_RETRIES" ]; do
  STATUS=$(curl -s "https://api.case.dev/ocr/v1/$JOB_ID" \
    -H "Authorization: Bearer sk_case_..." | jq -r '.status')

  if [ "$STATUS" = "completed" ]; then
    break
  fi

  # Progressive backoff
  if [ $RETRIES -lt 5 ]; then
    sleep 60  # 1 minute for first 5 checks
  else
    sleep 300 # 5 minutes after that
  fi

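  RETRIES=$((RETRIES + 1))
done

# Report a timeout if the job never completed
if [ "$STATUS" != "completed" ]; then
  echo "Timed out waiting for job $JOB_ID (last status: $STATUS)" >&2
fi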
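
The same backoff schedule can be expressed in Python, which makes the timing easy to check without waiting. This is a sketch only: `fetch_status` is a hypothetical callable that you supply (for example, a function that requests the job and returns its `status` string), and the one-hour budget follows the comments in the shell script above.

```python
import time


def backoff_delays(fast_checks=5, fast_delay=60, slow_delay=300, budget=3600):
    """Yield sleep intervals: 1 minute for the first 5 checks, then 5 minutes,
    stopping once the total wait reaches the budget (1 hour by default)."""
    waited = 0
    checks = 0
    while waited < budget:
        delay = fast_delay if checks < fast_checks else slow_delay
        delay = min(delay, budget - waited)  # never sleep past the budget
        yield delay
        waited += delay
        checks += 1


def wait_for_job(fetch_status, sleep=time.sleep, delays=None):
    """Poll until fetch_status() returns "completed". Return True on success,
    False if the delay schedule runs out first."""
    for delay in (delays if delays is not None else backoff_delays()):
        if fetch_status() == "completed":
            return True
        sleep(delay)
    # One last check after the final sleep.
    return fetch_status() == "completed"
```

Passing `sleep` and `delays` as parameters is a design choice for testability: the loop can be exercised with a fake clock and a short schedule.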

Cost Optimization

Batch similar documents together (parallel processing)
Use an appropriate engine - simpler engines are faster and cheaper
Skip features you don't need (table and form extraction slow processing down)
Cache results - don't reprocess the same document
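
One way to avoid reprocessing is to key cached results by a hash of the document's bytes. The following is a minimal sketch of that idea, not a feature of the API: `cached_ocr`, the `ocr_cache` directory, and the `run_ocr` callable are all hypothetical names, and the result is assumed to be JSON-serializable.

```python
import hashlib
import json
from pathlib import Path


def cached_ocr(document_bytes, run_ocr, cache_dir="ocr_cache"):
    """Return the OCR result for document_bytes, calling run_ocr only on a miss."""
    key = hashlib.sha256(document_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = run_ocr(document_bytes)  # hypothetical callable that submits the job
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hashing the content rather than the filename means a renamed copy of the same scan still hits the cache.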

Common Issues & Solutions

Issue: "Failed to download document"

Cause: URL not accessible

Solution:

Verify URL is publicly accessible
Use S3 presigned URLs if file is private
Or use s3:// URLs (we handle presigning)
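
Some of these failures can be caught before submitting. The sketch below checks only the shape of the URL. It assumes that `http`, `https`, and `s3` are the accepted schemes (this page shows only `https://` and `s3://` examples), and it cannot tell whether a remote URL is actually reachable. `check_document_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def check_document_url(url):
    """Return a list of problems found with a document URL (empty if none)."""
    parsed = urlparse(url)
    problems = []
    # Accepted schemes are an assumption based on the examples on this page.
    if parsed.scheme not in ("http", "https", "s3"):
        problems.append(f"unsupported scheme: {parsed.scheme or '(none)'}")
    if not parsed.netloc:
        problems.append("missing host or bucket name")
    if parsed.hostname in ("localhost", "127.0.0.1"):
        problems.append("host is only reachable from your own machine")
    return problems
```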

Issue: Poor text accuracy

Cause: Low-quality scan, handwriting, or wrong engine

Solution:

Try different engine (tesseract for handwriting)
Improve scan quality (300+ DPI)
Check original image quality
Review confidence score in results

Issue: Tables not extracted correctly

Cause: Complex table layouts

Solution:

Use paddle engine (better for tables)
Enable table features explicitly
Consider manual review for critical tables

Issue: Processing takes too long

Cause: Large document or complex layout

Solution:

Use webhooks instead of polling
Split very large documents (500+ pages)
Check chunks_processing to see progress

Best Practices

Before Processing

Verify document quality - view the PDF/image first
Check file size - under 500MB recommended
Test with small sample before processing hundreds of pages
Choose appropriate engine based on document type
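
The file-size check is easy to automate for local files. A small sketch, assuming "500MB" means 500 × 1024 × 1024 bytes (the page does not say which definition applies); `check_size` and `preflight` are hypothetical helper names.

```python
import os

# "under 500MB recommended"; treating MB as 1024 * 1024 bytes is an assumption.
MAX_RECOMMENDED_BYTES = 500 * 1024 * 1024


def check_size(size_bytes):
    """Return a list of warnings for a file of the given size."""
    if size_bytes == 0:
        return ["file is empty"]
    if size_bytes >= MAX_RECOMMENDED_BYTES:
        megabytes = size_bytes / (1024 * 1024)
        return [f"file is {megabytes:.0f} MB; under 500 MB is recommended"]
    return []


def preflight(path):
    """Run the size check against a local file."""
    return check_size(os.path.getsize(path))
```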

During Processing

Use webhooks for documents over 50 pages
Poll every 30-60 seconds (not more frequently)
Monitor chunks progress to estimate completion time
Implement timeout logic (30 minutes for very large files)
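
Chunk progress can be turned into a rough completion estimate by assuming every chunk takes about as long as the ones already finished. This is a sketch of that linear extrapolation only. How the done and total chunk counts are read from the job response is not shown here, because the exact field names are not given on this page.

```python
def estimate_remaining_seconds(chunks_done, chunks_total, elapsed_seconds):
    """Linear estimate of remaining time; None until one chunk has finished."""
    if chunks_total <= 0 or chunks_done <= 0:
        return None
    if chunks_done >= chunks_total:
        return 0.0
    per_chunk = elapsed_seconds / chunks_done
    return per_chunk * (chunks_total - chunks_done)
```

The estimate is coarse: pages with dense tables or poor scans may take longer than earlier chunks, so treat it as a guide for timeout logic rather than a guarantee.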

After Completion

Check confidence score - below 0.85 needs manual review
Verify critical information - dates, names, numbers
Download searchable PDF for easier review/sharing
Store results - OCR is expensive, don't reprocess
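
The confidence guideline can be applied automatically when results come back. In this sketch the 0.85 threshold is the one stated above, but the `confidence` key is an assumed field name, so adjust it to match the actual response.

```python
REVIEW_THRESHOLD = 0.85  # from the guideline above


def needs_manual_review(result, threshold=REVIEW_THRESHOLD):
    """True when the confidence is missing or below the threshold."""
    confidence = result.get("confidence")  # assumed field name
    return confidence is None or confidence < threshold
```

Treating a missing score as "needs review" is a deliberately cautious default.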

Advanced Features

Table Extraction

Get tables as structured data:

{
  "document_url": "https://storage.com/financial-records.pdf",
  "engine": "paddle",
  "features": {
    "tables": {
      "format": "csv",
      "include_headers": true,
      "preserve_formatting": true
    }
  }
}

Results include CSV files for each detected table.
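
Once a table's CSV text has been downloaded, Python's standard `csv` module can turn it into rows. A minimal sketch, assuming the CSV has a header row (as requested with `include_headers`) and that you already hold its contents as a string; `parse_table` is a hypothetical helper.

```python
import csv
import io


def parse_table(csv_text):
    """Parse one extracted table (CSV text with a header row) into dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

Each row is a dict keyed by column header, and all values are strings, so numeric columns need an explicit conversion such as `float(row["Amount"])` for a column named `Amount`.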

Form Field Detection

Extract form fields and values:

{
  "document_url": "https://storage.com/intake-form.pdf",
  "engine": "paddle",
  "features": {
    "forms": {
      "extract_checkboxes": true,
      "extract_signatures": true
    }
  }
}

Results include field names and values as JSON.

Searchable PDF Generation

Convert scans into searchable PDFs:

{
  "document_url": "https://storage.com/scanned-contract.pdf",
  "features": {
    "embed": {
      "preserve_images": true,
      "font": "Arial",
      "font_size": "auto"
    }
  }
}

Download the result from links.searchable_pdf. It looks identical to the original scan, but the text is now searchable.
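
Fetching the file might look like the following sketch. The `links.searchable_pdf` path comes from this page. The rest is assumption: that the completed job response is available as a dict, and that the link can be downloaded without an extra authorization header.

```python
import shutil
import urllib.request


def searchable_pdf_url(result):
    """Return links.searchable_pdf from a completed job, or None if absent."""
    return (result.get("links") or {}).get("searchable_pdf")


def download(url, dest_path):
    """Stream a URL to a local file. Assumes the link needs no extra auth header."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A typical use is to call `searchable_pdf_url(result)` once the job has completed and pass the returned URL to `download` only when it is not `None`.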

Use Cases

Discovery Documents

OCR scanned exhibits
Generate searchable PDFs for review
Extract text for keyword search
Feed to LLM for analysis

Medical Records

Digitize handwritten doctor notes
Extract vital signs from charts
Parse medication lists from forms
Enable semantic search across records

Depositions

Convert scanned transcripts to searchable text
Extract Q&A format automatically
Index for quick reference
Analyze with LLMs

Contracts

Digitize old paper contracts
Extract clauses and terms
Compare versions
Search across document library
