Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │ ──▶ │     OCR      │ ──▶ │  Embeddings  │ ──▶ │    Search    │
│    Upload    │     │  Processing  │     │  Generation  │     │    Index     │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

Prerequisites

  • Case.dev API key
  • Node.js 18+ or Python 3.9+
  • Documents to process (PDFs, images, Word docs)
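
The examples below read the key from the CASEDEV_API_KEY environment variable; a quick guard at the top of your script catches a missing key early (a minimal sketch):

if (!process.env.CASEDEV_API_KEY) {
  throw new Error('Set CASEDEV_API_KEY before running these examples');
}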

Step 1: Create a vault

import Casedev from 'casedev';
import fs from 'fs';
import path from 'path';

const client = new Casedev({ apiKey: process.env.CASEDEV_API_KEY });

async function createDiscoveryPipeline(matterId: string) {
  // 1. Create a vault for this matter
  const vault = await client.vault.create({
    name: `Matter ${matterId} - Discovery`,
    description: 'Documents received from opposing counsel'
  });
  
  console.log(`✅ Created vault: ${vault.id}`);
  return vault;
}

Step 2: Batch upload documents

async function uploadDocuments(vaultId: string, matterId: string, documentsDir: string) {
  const files = fs.readdirSync(documentsDir);
  const results = [];
  
  // Map file extensions to MIME types (built once, outside the loop)
  const contentTypes: Record<string, string> = {
    '.pdf': 'application/pdf',
    '.doc': 'application/msword',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.png': 'image/png',
    '.tiff': 'image/tiff',
    '.txt': 'text/plain',
  };
  
  for (const file of files) {
    const filePath = path.join(documentsDir, file);
    const stat = fs.statSync(filePath);
    
    if (!stat.isFile()) continue;
    
    const ext = path.extname(file).toLowerCase();
    const contentType = contentTypes[ext] || 'application/octet-stream';
    
    // Get a presigned upload URL
    const upload = await client.vault.upload(vaultId, {
      filename: file,
      contentType,
      metadata: {
        source: 'discovery',
        matter_id: matterId,
        original_path: filePath,
      }
    });
    
    // Upload the file to the presigned URL and fail fast on errors
    const fileBuffer = fs.readFileSync(filePath);
    const res = await fetch(upload.uploadUrl, {
      method: 'PUT',
      headers: { 'Content-Type': contentType },
      body: fileBuffer
    });
    if (!res.ok) throw new Error(`Upload failed for ${file}: ${res.status}`);
    
    console.log(`📄 Uploaded: ${file}`);
    results.push({ file, objectId: upload.objectId });
  }
  
  return results;
}

Step 3: Trigger ingestion

Ingestion runs OCR (if needed) and generates embeddings for search.

async function ingestDocuments(vaultId: string, uploads: { file: string; objectId: string }[]) {
  const jobs = [];
  
  for (const { file, objectId } of uploads) {
    // Trigger ingestion (OCR + embeddings)
    const job = await client.vault.ingest(vaultId, objectId);
    jobs.push({ file, jobId: job.id });
    console.log(`🔄 Ingesting: ${file}`);
  }
  
  // Wait for all jobs to reach a terminal state
  for (const { file, jobId } of jobs) {
    let status = 'processing';
    while (status === 'processing' || status === 'pending') {
      await new Promise(r => setTimeout(r, 5000));
      const job = await client.vault.getIngestStatus(vaultId, jobId);
      status = job.status;
    }
    // Anything other than processing/pending is treated as terminal;
    // the 'failed' status name is an assumption about the API's values
    if (status === 'failed') {
      console.error(`❌ Ingestion failed: ${file}`);
    } else {
      console.log(`✅ Ingested: ${file}`);
    }
  }
}
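
Wiring the first three steps together might look like this (the matter ID and directory are illustrative):

const vault = await createDiscoveryPipeline('2024-1234');
const uploads = await uploadDocuments(vault.id, '2024-1234', './discovery_dump');
await ingestDocuments(vault.id, uploads);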

Step 4: Search your documents

async function searchDiscovery(vaultId: string, query: string) {
  const results = await client.vault.search(vaultId, {
    query,
    method: 'hybrid',  // Combines semantic + keyword
    topK: 10
  });
  
  console.log(`\n🔍 Results for: "${query}"\n`);
  
  for (const chunk of results.chunks) {
    console.log(`📄 ${chunk.filename} (page ${chunk.page})`);
    console.log(`   Score: ${chunk.hybridScore.toFixed(2)}`);
    console.log(`   "${chunk.text.substring(0, 200)}..."\n`);
  }
  
  return results;
}
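
With ingestion complete, you can run targeted queries against the vault (the query strings are illustrative):

await searchDiscovery(vault.id, 'communications about safety inspections');
await searchDiscovery(vault.id, 'internal memos referencing the 2023 audit');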

Complete example

import Casedev from 'casedev';
import fs from 'fs';
import path from 'path';

const client = new Casedev({ apiKey: process.env.CASEDEV_API_KEY });

async function main() {
  const matterId = '2024-1234';
  const documentsDir = './discovery_dump';
  
  // 1. Create vault
  const vault = await client.vault.create({
    name: `Matter ${matterId} - Discovery`,
    description: 'Documents from opposing counsel'
  });
  
  // 2. Upload and ingest all documents (assumes PDFs for brevity;
  //    see Step 2 for per-extension content types)
  const files = fs.readdirSync(documentsDir);
  for (const file of files) {
    const filePath = path.join(documentsDir, file);
    if (!fs.statSync(filePath).isFile()) continue;
    
    const upload = await client.vault.upload(vault.id, {
      filename: file,
      contentType: 'application/pdf'
    });
    
    await fetch(upload.uploadUrl, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/pdf' },
      body: fs.readFileSync(filePath)
    });
    
    await client.vault.ingest(vault.id, upload.objectId);
    console.log(`✅ ${file}`);
  }
  
  // 3. Search (in production, wait for the ingestion jobs to complete first; see Step 3)
  const results = await client.vault.search(vault.id, {
    query: 'evidence of safety violations in 2023',
    method: 'hybrid'
  });
  
  console.log(results.chunks);
}

main();

Production tip: For large document sets (1,000+ files), upload in parallel with a concurrency limit of 10-20 to maximize throughput while staying under rate limits, as sketched below.
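
One way to bound concurrency is a small worker pool. Here is a minimal sketch; uploadOne is a hypothetical helper that wraps the presigned-URL upload from Step 2:

async function uploadWithConcurrency(files: string[], limit = 10): Promise<void> {
  const queue = [...files];
  // Spawn `limit` workers that each pull the next file until the queue drains;
  // single-threaded JS makes the shared shift() safe
  const workers = Array.from({ length: limit }, async () => {
    for (let file = queue.shift(); file !== undefined; file = queue.shift()) {
      await uploadOne(file); // hypothetical helper; see Step 2 for the upload logic
    }
  });
  await Promise.all(workers);
}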

Cost estimate

Component     Cost
Storage       $0.023/GB/month
OCR           $0.01/page
Embeddings    $0.0001/1K tokens
Search        $0.001/query

Example: 5,000 pages of discovery ≈ $50-75 one-time processing (OCR alone is 5,000 × $0.01 = $50), plus $5-10/month storage.
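
As a back-of-the-envelope check in code (the page count comes from the example above; tokens per page is an assumption):

const pages = 5_000;
const ocrCost = pages * 0.01;                    // $50.00 at $0.01/page
const tokens = pages * 600;                      // assume ~600 tokens per page
const embeddingCost = (tokens / 1_000) * 0.0001; // $0.30 at $0.0001/1K tokens
console.log(`One-time processing ≈ $${(ocrCost + embeddingCost).toFixed(2)}`);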