The problem: Opposing counsel sent you 500 pages of blurry photocopies. You need to search them, but they’re just images.
The solution: Run OCR to extract text, then search or analyze with AI.
1. Submit for OCR
import Casedev from 'casedev';
const client = new Casedev({ apiKey: process.env.CASEDEV_API_KEY });
// Process a document uploaded by your user
const job = await client.ocr.v1.process({
  document_url: documentUrl, // URL from your user's upload
  engine: 'doctr', // Fast, good for printed text
  features: {
    embed: {} // Generate searchable PDF
  }
});
console.log(`OCR job started: ${job.id}`);
2. Wait for completion
OCR runs asynchronously. Poll for status or use webhooks to notify your users:
// Poll every 5 seconds until the job leaves the in-progress states
let result = await client.ocr.v1.retrieve(job.id);
while (result.status === 'processing' || result.status === 'pending') {
  console.log(`Status: ${result.status} (${result.chunks_completed}/${result.chunk_count} pages)`);
  await new Promise(r => setTimeout(r, 5000));
  result = await client.ocr.v1.retrieve(job.id);
}
if (result.status === 'completed') {
  console.log(`✅ OCR complete! ${result.page_count} pages processed.`);
  console.log(`Confidence: ${(result.confidence * 100).toFixed(1)}%`);
} else {
  // Any other final status (for example, a failure) should be surfaced to your user
  console.error(`OCR job ${job.id} ended with status: ${result.status}`);
}
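A fixed 5-second loop with no upper bound can run forever if a job stalls. The sketch below shows one way to add exponential backoff and a deadline. It is an illustration, not part of the SDK: `waitForJob` and its option names are hypothetical, and `retrieve` is assumed to be any async function that returns an object with a `status` field, such as `(id) => client.ocr.v1.retrieve(id)`.

```javascript
// Sketch: poll a job until it leaves the in-progress states, with backoff and a deadline.
// `waitForJob` and its options are illustrative names, not SDK functions.
async function waitForJob(retrieve, jobId, options = {}) {
  const {
    initialDelayMs = 2000,      // first wait between polls
    maxDelayMs = 30000,         // cap for the backoff
    timeoutMs = 15 * 60 * 1000, // give up after 15 minutes by default
  } = options;

  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;

  while (true) {
    const result = await retrieve(jobId);
    // 'pending' and 'processing' are the in-progress statuses used in the loop above
    if (result.status !== 'pending' && result.status !== 'processing') {
      return result;
    }
    // Stop before sleeping past the deadline
    if (Date.now() + delay > deadline) {
      throw new Error(`Timed out waiting for job ${jobId} (last status: ${result.status})`);
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs); // double the wait, up to the cap
  }
}
```

With the client from step 1, a call might look like `const result = await waitForJob((id) => client.ocr.v1.retrieve(id), job.id);`. Doubling the delay keeps short jobs responsive while reducing request volume for long ones.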
3. Download results
Provide extracted text, structured data, or a searchable PDF:
import fs from 'node:fs'; // Needed for writeFileSync below
// Download plain text for your user
const text = await client.ocr.v1.download(job.id, 'text');
// Download searchable PDF (original with invisible text layer)
const pdf = await client.ocr.v1.download(job.id, 'pdf');
fs.writeFileSync('searchable-document.pdf', Buffer.from(pdf));
// Download structured JSON (with word coordinates for highlighting)
const json = await client.ocr.v1.download(job.id, 'json');
console.log(`Extracted ${json.pages.length} pages`);
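Word coordinates make it possible to highlight search hits on the page image. The sketch below shows the idea under a loudly stated assumption: the source only confirms that the JSON has a `pages` array, so the per-page `words` array and the `text` and `bbox` fields used here are hypothetical. Check the actual response and adjust the field names before relying on it. `findWordMatches` is an illustrative helper, not an SDK function.

```javascript
// Sketch: find every occurrence of a single-word search term in structured OCR output.
// ASSUMED shape (hypothetical, verify against the real response):
//   { pages: [ { words: [ { text: 'Smith', bbox: [x0, y0, x1, y1] } ] } ] }
function findWordMatches(ocrJson, term) {
  const needle = term.trim().toLowerCase();
  if (!needle) return [];

  const matches = [];
  ocrJson.pages.forEach((page, pageIndex) => {
    for (const word of page.words ?? []) {
      // Strip leading/trailing punctuation so "Smith," still matches "smith"
      const normalized = word.text
        .toLowerCase()
        .replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '');
      if (normalized === needle) {
        // Report 1-based page numbers alongside the original text and box
        matches.push({ page: pageIndex + 1, text: word.text, bbox: word.bbox });
      }
    }
  });
  return matches;
}
```

For example, `findWordMatches(json, 'indemnify')` would return one entry per hit, each carrying the page number and bounding box your viewer needs to draw a highlight. Multi-word phrases would need matching across adjacent words, which this sketch does not attempt.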
4. Analyze with AI
Pass the extracted text to an LLM to pull out structured data automatically:
// Extract key information for your user
const analysis = await client.llm.v1.chat.createCompletion({
  model: 'anthropic/claude-sonnet-4.5',
  messages: [
    {
      role: 'system',
      content: 'Extract key dates, parties, and claims from this document. Format as JSON.'
    },
    {
      role: 'user',
      content: text
    }
  ],
  temperature: 0 // Minimize randomness for factual extraction
});
// Return structured data to your user
console.log(analysis.choices[0].message.content);
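The prompt asks for JSON, but the reply arrives as a string and models sometimes wrap it in a Markdown code fence or add a sentence around it. A defensive parse step, sketched below, avoids crashing on those cases. `parseModelJson` is a hypothetical helper, not part of the SDK, and it assumes the reply is expected to hold a single JSON object.

```javascript
// Sketch: turn a model reply that should contain JSON into an object, or null on failure.
function parseModelJson(reply) {
  if (typeof reply !== 'string') return null;
  let candidate = reply.trim();

  // Unwrap a Markdown code fence if the model added one.
  // The fence marker is built at runtime to avoid a literal backtick run in this snippet.
  const fence = '`'.repeat(3);
  const fencedPattern = new RegExp(fence + '(?:json)?\\s*([\\s\\S]*?)' + fence, 'i');
  const fenced = candidate.match(fencedPattern);
  if (fenced) candidate = fenced[1].trim();

  try {
    return JSON.parse(candidate);
  } catch {
    // Fall back to the outermost {...} span in case of surrounding commentary
    const start = candidate.indexOf('{');
    const end = candidate.lastIndexOf('}');
    if (start === -1 || end <= start) return null;
    try {
      return JSON.parse(candidate.slice(start, end + 1));
    } catch {
      return null;
    }
  }
}
```

A call such as `const extracted = parseModelJson(analysis.choices[0].message.content);` then yields either an object or `null`, so your feature can decide whether to retry or show the raw text.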
OCR engines
Choose the right engine based on your users’ document types:
| Engine | Best for | Speed |
|---|---|---|
| doctr | Clean printed text | Fast |
| tesseract | Mixed print/handwriting | Medium |
| paddle | Tables, forms, complex layouts | Slower |
Recommendation: Start with doctr for most use cases. Switch to paddle if your users need table extraction or have complex document layouts.
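If your application already knows something about an upload, the table and recommendation above can be encoded as a small selection function. This is a minimal sketch: `chooseEngine` and its trait flags are hypothetical names, and how you detect tables or handwriting is left to your application.

```javascript
// Sketch: map document traits to an engine name from the table above.
// The flags are illustrative; they are not parameters of the OCR API.
function chooseEngine({ hasTables = false, hasComplexLayout = false, hasHandwriting = false } = {}) {
  // Tables, forms, and complex layouts take priority because only paddle targets them
  if (hasTables || hasComplexLayout) return 'paddle';
  // Mixed print and handwriting
  if (hasHandwriting) return 'tesseract';
  // Default recommended for most use cases
  return 'doctr';
}
```

The result can be passed straight to the `engine` field in step 1, for example `engine: chooseEngine({ hasTables: true })`.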