Best Practices
Pro Tips
Get Better Accuracy
- Use high-quality scans
  - 300 DPI or higher
  - Good contrast
  - Straight, not skewed
- Choose the right engine
  - doctr: Best for typed/printed text (fastest)
  - tesseract: Better for mixed print/handwriting
  - paddle: Specialized for complex layouts and tables
- Enable table extraction for forms/spreadsheets
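The engine guidance above can be captured in a small helper. This is only a sketch: the function name and its two flags are made up for illustration, and giving tables priority over handwriting is an assumption, not documented behavior. Only the engine names come from the list above.

```python
def choose_engine(has_handwriting: bool = False, has_tables: bool = False) -> str:
    """Return an engine name based on what the document contains."""
    if has_tables:
        return "paddle"     # complex layouts and tables
    if has_handwriting:
        return "tesseract"  # mixed print/handwriting
    return "doctr"          # typed/printed text, fastest


print(choose_engine(has_tables=True))
```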
Working with Large Files
Option 1: Webhook (Recommended)
We POST results when done - no need to poll!
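On your side, the webhook needs an endpoint that accepts a POST with a JSON body. The sketch below uses only the Python standard library. The payload field names (job_id, status) are assumptions for illustration; check a real webhook delivery for the actual shape.

```python
import json
from http.server import BaseHTTPRequestHandler


def parse_webhook(body: bytes) -> dict:
    """Decode a webhook body and keep the fields this sketch cares about."""
    payload = json.loads(body.decode("utf-8"))
    return {
        "job_id": payload.get("job_id"),  # hypothetical field name
        "status": payload.get("status"),  # hypothetical field name
    }


class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal receiver: read the POST body, parse it, acknowledge."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = parse_webhook(self.rfile.read(length))
        print("OCR job update:", result)
        self.send_response(204)  # acknowledge with an empty response
        self.end_headers()
```

To try it locally, serve the handler with `HTTPServer(("", 8000), WebhookHandler).serve_forever()` from `http.server`. A production receiver would also want authentication and error handling for malformed bodies.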
Option 2: Polling with Backoff
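If you can't receive webhooks, poll with an increasing delay and give up after a deadline. The sketch below is generic: fetch_status is a placeholder for whatever call you use to read the job status, and the "status" key with its "completed"/"failed" values is an assumption about the response shape. The defaults follow the 30-60 second polling interval and 30-minute timeout recommended elsewhere in this guide.

```python
import time
from typing import Callable


def poll_with_backoff(
    fetch_status: Callable[[], dict],
    initial_delay: float = 30.0,
    max_delay: float = 60.0,
    timeout: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job finishes, doubling the wait up to max_delay."""
    delay = initial_delay
    waited = 0.0  # total time spent sleeping; ignores time spent inside fetch_status
    while True:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):  # assumed status values
            return job
        if waited + delay > timeout:
            raise TimeoutError(f"job still running after {waited:.0f}s of waiting")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)
```

The sleep parameter is injectable so the loop can be tested without actually waiting.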
Cost Optimization
- Batch similar documents together (parallel processing)
- Use appropriate engine - simpler engines are faster/cheaper
- Skip features you don't need (table and form extraction slow processing down)
- Cache results - don't reprocess the same document
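One simple way to avoid reprocessing is to key results on a hash of the file contents. This is a minimal in-memory sketch: run_ocr stands in for your own function that submits a document and returns its result, and a real setup would persist the cache to disk or a database.

```python
import hashlib
from typing import Callable

_cache: dict[str, dict] = {}


def ocr_cached(document: bytes, run_ocr: Callable[[bytes], dict]) -> dict:
    """Return a cached result when the same bytes were processed before."""
    key = hashlib.sha256(document).hexdigest()
    if key not in _cache:
        _cache[key] = run_ocr(document)  # only pay for OCR on a cache miss
    return _cache[key]
```

Hashing the content rather than the filename means a renamed copy of the same file still hits the cache.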
Common Issues & Solutions
Issue: "Failed to download document"
Cause: URL not accessible
Solution:
- Verify URL is publicly accessible
- Use S3 presigned URLs if file is private
- Or use s3:// URLs (we handle presigning)
Issue: Poor text accuracy
Cause: Low-quality scan, handwriting, or wrong engine
Solution:
- Try a different engine (tesseract for handwriting)
- Improve scan quality (300+ DPI)
- Check original image quality
- Review the confidence score in results
Issue: Tables not extracted correctly
Cause: Complex table layouts
Solution:
- Use the paddle engine (better for tables)
- Enable table features explicitly
- Consider manual review for critical tables
Issue: Processing takes too long
Cause: Large document or complex layout
Solution:
- Use webhooks instead of polling
- Split very large documents (500+ pages)
- Check chunks_processing to see progress
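To split a very large document, first work out the page ranges for each part. The helper below is a sketch that only computes the ranges. Using 500 pages as the chunk size is an assumption based on the threshold above, and the actual splitting is left to whatever PDF tool you already use.

```python
def page_ranges(total_pages: int, max_pages: int = 500) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(page_ranges(1200))  # [(1, 500), (501, 1000), (1001, 1200)]
```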
Best Practices
Before Processing
- Verify document quality - view the PDF/image first
- Check file size - under 500MB recommended
- Test with small sample before processing hundreds of pages
- Choose appropriate engine based on document type
During Processing
- Use webhooks for documents over 50 pages
- Poll every 30-60 seconds (not more frequently)
- Monitor chunks progress to estimate completion time
- Implement timeout logic (30 minutes for very large files)
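Chunk progress gives a rough completion estimate if you assume every chunk takes about the same time. The sketch below takes the counts as plain arguments; how you read them from the status response is up to you, and the parameter names are illustrative only.

```python
from typing import Optional


def estimate_remaining_seconds(
    chunks_done: int, chunks_total: int, elapsed_seconds: float
) -> Optional[float]:
    """Linear estimate: average time per finished chunk times chunks left."""
    if chunks_done <= 0:
        return None  # nothing finished yet, so there is no rate to extrapolate
    remaining = max(chunks_total - chunks_done, 0)
    return elapsed_seconds * remaining / chunks_done


print(estimate_remaining_seconds(5, 20, 60.0))  # 180.0
```

Treat the number as a hint rather than a deadline, since complex pages can take longer than simple ones.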
After Completion
- Check confidence score - below 0.85 needs manual review
- Verify critical information - dates, names, numbers
- Download searchable PDF for easier review/sharing
- Store results - OCR is expensive, don't reprocess
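The confidence check is easy to automate. This sketch applies the 0.85 threshold from the list above to a batch of results; the list-of-pairs input format is an assumption made for the example.

```python
REVIEW_THRESHOLD = 0.85  # from the guidance above


def needs_manual_review(confidence: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the confidence score falls below the review threshold."""
    return confidence < threshold


def split_for_review(results: list[tuple[str, float]]) -> tuple[list[str], list[str]]:
    """Split (document_name, confidence) pairs into (accepted, to_review) names."""
    accepted = [name for name, score in results if not needs_manual_review(score)]
    to_review = [name for name, score in results if needs_manual_review(score)]
    return accepted, to_review
```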
Advanced Features
Table Extraction
Get tables as structured data:
Results include CSV files for each detected table.
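Once you have a table's CSV text, Python's standard csv module turns it into rows. The sketch assumes the first row of the CSV is a header row, which may not hold for every detected table.

```python
import csv
import io


def parse_table(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = parse_table("item,qty\nwidget,3\ngadget,7\n")
print(rows[0]["item"], rows[0]["qty"])  # widget 3
```

All values come back as strings, so convert numeric columns yourself.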
Form Field Detection
Extract form fields and values:
Results include field names and values as JSON.
Searchable PDF Generation
Convert scans into searchable PDFs:
Download it from links.searchable_pdf - it looks identical to the original scan but is now searchable!
Use Cases
Discovery Documents
- OCR scanned exhibits
- Generate searchable PDFs for review
- Extract text for keyword search
- Feed to LLM for analysis
Medical Records
- Digitize handwritten doctor notes
- Extract vital signs from charts
- Parse medication lists from forms
- Enable semantic search across records
Depositions
- Convert scanned transcripts to searchable text
- Extract Q&A format automatically
- Index for quick reference
- Analyze with LLMs
Contracts
- Digitize old paper contracts
- Extract clauses and terms
- Compare versions
- Search across document library