Best Practices

Pro Tips

Get Better Accuracy

  1. Use high-quality scans
    • 300 DPI or higher
    • Good contrast
    • Straight/not skewed
  2. Choose the right engine
    • doctr: Best for typed/printed text (fastest)
    • tesseract: Better for mixed print/handwriting
    • paddle: Specialized for complex layouts, tables
  3. Enable table extraction for forms/spreadsheets
{
  "features": {
    "tables": {
      "format": "csv"
    }
  }
}
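
The engine guidance above can be captured in a small lookup. This is only a sketch: the document-type labels (printed, handwriting, table, and so on) and the function name are illustrative assumptions, while the engine names are the ones listed above.

```bash
# Hypothetical helper: map a document type to an engine name.
# The type labels are assumptions made for this example.
choose_engine() {
  case "$1" in
    printed|typed)      echo "doctr" ;;      # typed/printed text (fastest)
    handwriting|mixed)  echo "tesseract" ;;  # mixed print/handwriting
    table|form|complex) echo "paddle" ;;     # complex layouts, tables
    *)
      echo "unknown document type: $1" >&2
      return 1
      ;;
  esac
}
```

For example, ENGINE=$(choose_engine table) sets ENGINE to paddle, which can then be placed in the "engine" field of the request.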

Working with Large Files

Option 1: Webhook (Recommended)

{
  "document_url": "https://...",
  "callback_url": "https://your-app.com/ocr-complete",
  "features": { "embed": {} }
}

We POST the results to your callback_url when processing finishes, so there is no need to poll.

Option 2: Polling with Backoff

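# Check every minute for the first 5 minutes,
# then every 5 minutes. Time out after 1 hour.

RETRIES=0
MAX_RETRIES=16  # 5 x 1 min + 11 x 5 min = 60 minutes
STATUS=""

while [ "$RETRIES" -lt "$MAX_RETRIES" ]; do
  # Fetch the job and pull out its status field
  STATUS=$(curl -s "https://api.case.dev/ocr/v1/$JOB_ID" \
    -H "Authorization: Bearer sk_case_..." | jq -r '.status')

  if [ "$STATUS" = "completed" ]; then
    break
  fi

  # Progressive backoff
  if [ "$RETRIES" -lt 5 ]; then
    sleep 60   # 1 minute for the first 5 checks
  else
    sleep 300  # 5 minutes after that
  fi

  RETRIES=$((RETRIES + 1))
done

# Leaving the loop without "completed" means the hour ran out
if [ "$STATUS" != "completed" ]; then
  echo "Timed out waiting for job $JOB_ID (last status: $STATUS)" >&2
  exit 1
fi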

Cost Optimization

  1. Batch similar documents together (parallel processing)
  2. Use the appropriate engine - simpler engines are faster and cheaper
  3. Skip features you don't need (table and form extraction slow processing down)
  4. Cache results - don't reprocess the same document
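
One way to avoid reprocessing is to key stored results by a hash of the document's bytes. The sketch below assumes a local cache directory; OCR_CACHE_DIR, doc_hash, cache_lookup, and cache_store are hypothetical names, not part of the API.

```bash
# Hypothetical cache layout: one result file per document, keyed by the
# SHA-256 of the document's bytes. OCR_CACHE_DIR is an assumed variable name.
OCR_CACHE_DIR="${OCR_CACHE_DIR:-$HOME/.cache/ocr-results}"

# Print the SHA-256 hex digest of a file (sha256sum on Linux, shasum on macOS).
doc_hash() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# On a cache hit, print the cached result path and return 0.
# On a miss, print nothing and return 1.
cache_lookup() {
  local path
  path="$OCR_CACHE_DIR/$(doc_hash "$1").json"
  [ -f "$path" ] && echo "$path"
}

# Store a result file in the cache under the document's hash.
cache_store() {
  mkdir -p "$OCR_CACHE_DIR"
  cp "$2" "$OCR_CACHE_DIR/$(doc_hash "$1").json"
}
```

A typical flow would call cache_lookup before submitting a document and cache_store once the result has been downloaded. Hashing the content rather than the filename means a renamed copy of the same scan still hits the cache.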

Common Issues & Solutions

Issue: "Failed to download document"

Cause: URL not accessible

Solution:

  • Verify URL is publicly accessible
  • Use S3 presigned URLs if file is private
  • Or use s3:// URLs (we handle presigning)

Issue: Poor text accuracy

Cause: Low quality scan, handwriting, or wrong engine

Solution:

  • Try a different engine (tesseract for handwriting)
  • Improve scan quality (300+ DPI)
  • Check original image quality
  • Review confidence score in results

Issue: Tables not extracted correctly

Cause: Complex table layouts

Solution:

  • Use the paddle engine (better for tables)
  • Enable table features explicitly
  • Consider manual review for critical tables

Issue: Processing takes too long

Cause: Large document or complex layout

Solution:

  • Use webhooks instead of polling
  • Split very large documents (500+ pages)
  • Check chunks_processing to see progress
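
When splitting a very large document, the page ranges for each part can be computed up front. This is a sketch: page_ranges is a hypothetical helper, and the default of 500 pages per part is an assumption borrowed from the "500+ pages" guideline above rather than a documented limit.

```bash
# Print inclusive page ranges ("start-end", one per line) that split a
# document of TOTAL pages into parts of at most CHUNK pages.
# Usage: page_ranges TOTAL [CHUNK]   (CHUNK defaults to 500, an assumption)
page_ranges() {
  local total="$1" chunk="${2:-500}" start=1 end
  while [ "$start" -le "$total" ]; do
    end=$((start + chunk - 1))
    # Clamp the final range to the last page
    [ "$end" -gt "$total" ] && end="$total"
    echo "${start}-${end}"
    start=$((end + 1))
  done
}
```

For a 1200-page document, page_ranges 1200 prints 1-500, 501-1000, and 1001-1200. Each range can then be passed to whatever PDF splitting tool you already use, and each part submitted as its own job.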

Best Practices

Before Processing

  1. Verify document quality - view the PDF/image first
  2. Check file size - under 500MB recommended
  3. Test with small sample before processing hundreds of pages
  4. Choose appropriate engine based on document type
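
The file size check can be automated before upload. A minimal sketch follows; within_size_limit and MAX_OCR_BYTES are hypothetical names, and 500MB is read as 500 * 1024 * 1024 bytes, which is an assumption since the guideline does not say which definition of MB applies.

```bash
# Assumed limit: 500 MB interpreted as 500 * 2^20 bytes.
MAX_OCR_BYTES=$((500 * 1024 * 1024))

# Return 0 if FILE is at or under the limit, 1 if it is larger,
# and 2 if the file does not exist.
# Usage: within_size_limit FILE [LIMIT_BYTES]
within_size_limit() {
  local file="$1" limit="${2:-$MAX_OCR_BYTES}" size
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 2; }
  # wc -c counts bytes; tr strips the padding some platforms add
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -le "$limit" ]
}
```

For example, within_size_limit scan.pdf || echo "scan.pdf is over the recommended size" warns before a large upload is attempted.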

During Processing

  1. Use webhooks for documents over 50 pages
  2. Poll every 30-60 seconds (not more frequently)
  3. Monitor chunks progress to estimate completion time
  4. Implement timeout logic (30 minutes for very large files)
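
The 50-page rule above can be applied when the request body is built. The sketch below is illustrative only: build_request and WEBHOOK_PAGE_THRESHOLD are hypothetical names, the document_url and callback_url fields are the ones shown in the webhook example, and the page count is assumed to be known in advance.

```bash
# Documents with more pages than this get a callback_url (webhook delivery).
WEBHOOK_PAGE_THRESHOLD=50

# Print a JSON request body for a document.
# Usage: build_request DOCUMENT_URL PAGE_COUNT [CALLBACK_URL]
# Assumes the URLs contain no double quotes or backslashes, because no
# JSON escaping is done here.
build_request() {
  local doc_url="$1" pages="$2" callback_url="${3:-}"
  if [ "$pages" -gt "$WEBHOOK_PAGE_THRESHOLD" ]; then
    # Large document: ask for results to be POSTed to the callback
    printf '{"document_url": "%s", "callback_url": "%s"}\n' "$doc_url" "$callback_url"
  else
    # Small document: no callback, the caller polls for the result
    printf '{"document_url": "%s"}\n' "$doc_url"
  fi
}
```

The output can be passed to curl with -d as the request body. For anything beyond simple URLs, building the JSON with jq is safer because it handles escaping.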

After Completion

  1. Check confidence score - below 0.85 needs manual review
  2. Verify critical information - dates, names, numbers
  3. Download searchable PDF for easier review/sharing
  4. Store results - OCR is expensive, don't reprocess
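
The 0.85 threshold can be checked in a script once the confidence score has been read from the results. This sketch assumes the score is already available as a decimal number; needs_review is a hypothetical helper, and where the score sits in the response is not shown here.

```bash
# Return 0 (true) when a confidence score is below the review threshold.
# Usage: needs_review SCORE [THRESHOLD]   (THRESHOLD defaults to 0.85)
# awk does the floating-point comparison, which [ ] cannot do.
needs_review() {
  local score="$1" threshold="${2:-0.85}"
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 < t + 0) }'
}
```

For example, needs_review 0.80 succeeds, so the document would be routed to manual review, while needs_review 0.97 fails and the document passes through.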

Advanced Features

Table Extraction

Get tables as structured data:

{
  "document_url": "https://storage.com/financial-records.pdf",
  "engine": "paddle",
  "features": {
    "tables": {
      "format": "csv",
      "include_headers": true,
      "preserve_formatting": true
    }
  }
}

Results include a CSV file for each detected table.

Form Field Detection

Extract form fields and values:

{
  "document_url": "https://storage.com/intake-form.pdf",
  "engine": "paddle",
  "features": {
    "forms": {
      "extract_checkboxes": true,
      "extract_signatures": true
    }
  }
}

Results include field names and values as JSON.

Searchable PDF Generation

Convert scans into searchable PDFs:

{
  "document_url": "https://storage.com/scanned-contract.pdf",
  "features": {
    "embed": {
      "preserve_images": true,
      "font": "Arial",
      "font_size": "auto"
    }
  }
}

Download the file from links.searchable_pdf. It looks identical to the original scan, but the text is now searchable.


Use Cases

Discovery Documents

  • OCR scanned exhibits
  • Generate searchable PDFs for review
  • Extract text for keyword search
  • Feed to LLM for analysis

Medical Records

  • Digitize handwritten doctor notes
  • Extract vital signs from charts
  • Parse medication lists from forms
  • Enable semantic search across records

Depositions

  • Convert scanned transcripts to searchable text
  • Extract Q&A format automatically
  • Index for quick reference
  • Analyze with LLMs

Contracts

  • Digitize old paper contracts
  • Extract clauses and terms
  • Compare versions
  • Search across document library