Best Practices

Tips and best practices for using LLMs

Pro Tips

Reduce Costs

  1. Use smaller models for simple tasks (e.g., Haiku instead of Sonnet or Opus)
  2. Cache system prompts - models that support prompt caching reuse previously processed context, which cuts input-token costs on repeated prefixes
  3. Limit max_tokens - only generate what you need (see the sketch after this list)
  4. Try cheaper alternatives - DeepSeek and Qwen models are often 10-100x cheaper per token
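As a rough illustration, the sketch below requests a small model with a capped max_tokens. It assumes an OpenAI-compatible chat completions endpoint at /llm/v1/chat/completions on the same gateway that serves /llm/v1/models; the base URL, API key variable, and the "claude-haiku" model ID are placeholders - look up real model IDs via /llm/v1/models.

import os
import requests

# Placeholders: BASE_URL, the environment variable name, and the model ID are
# assumptions - check /llm/v1/models for the IDs actually available to you.
BASE_URL = "https://api.example.com/llm/v1"
API_KEY = os.environ["LLM_API_KEY"]

def cheap_completion(prompt: str) -> str:
    # Send a single chat request to a small model with a hard output cap.
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "claude-haiku",   # smaller, cheaper model for simple tasks
            "max_tokens": 256,         # cap output so costs cannot run away
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]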

Improve Quality

  1. Use system prompts to set context and behavior
  2. Provide examples in your prompt (few-shot learning; see the sketch after this list)
  3. Break complex tasks into multiple smaller requests
  4. Use vision models for scanned documents instead of OCR + text models
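For example, a few-shot classification prompt might look like the sketch below. The messages array follows the common chat-completions shape; the task, labels, and examples are illustrative only.

# A system prompt fixes the behavior, a few example turns show the expected
# output format, and the real input goes last.
messages = [
    {"role": "system",
     "content": "You classify support tickets as 'billing', 'bug', or 'other'. Reply with the label only."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "How do I change my email address?"},
]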

Handle Long Documents

  1. Claude models support 200K tokens (~150K words)
  2. Gemini 2.5 Flash supports 1M tokens (entire depositions)
  3. Chunk and summarize progressively for ultra-long documents (see the sketch after this list)
  4. Use embeddings + retrieval instead of sending the entire document
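A minimal map-reduce style sketch of progressive summarization, reusing the hypothetical cheap_completion() helper from the cost sketch above; the character-based chunk size is a placeholder, and a real implementation should split on token counts.

def chunk(text: str, size: int = 8000) -> list[str]:
    # Naive character-based chunking; a production version should split on
    # token counts and respect paragraph boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_long(document: str) -> str:
    # Map step: summarize each chunk independently.
    partials = [
        cheap_completion(f"Summarize this section in 5 bullet points:\n\n{part}")
        for part in chunk(document)
    ]
    # Reduce step: merge the partial summaries into one final summary.
    merged = "\n\n".join(partials)
    return cheap_completion(
        f"Combine these section summaries into a single coherent summary:\n\n{merged}"
    )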

Error Handling

Rate Limits:

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Solution: Implement exponential backoff retry logic
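A minimal backoff sketch is below; the HTTP 429 status check is an assumption about how the gateway signals rate limits, so adjust it to the actual error response.

import random
import time

import requests

def post_with_retry(url: str, payload: dict, headers: dict, max_retries: int = 5) -> requests.Response:
    # Retry on HTTP 429 with exponential backoff plus jitter so concurrent
    # clients do not retry in lockstep. Other errors raise immediately.
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter
    raise RuntimeError("rate limit retries exhausted")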

Context Length Exceeded:

{
  "error": {
    "message": "Context length exceeded",
    "type": "invalid_request_error"
  }
}

Solution: Reduce input text or switch to a model with a larger context window
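If you need to trim input programmatically, a rough sketch using the common ~4-characters-per-token approximation is shown below; for exact limits, count with the model's own tokenizer.

def truncate_to_tokens(text: str, max_tokens: int) -> str:
    # Rough cut: ~4 characters per token is an approximation, not a tokenizer.
    approx_chars = max_tokens * 4
    return text if len(text) <= approx_chars else text[:approx_chars]

# Example usage (full_document is a placeholder for your own text):
#   safe_input = truncate_to_tokens(full_document, 100_000)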

Invalid Model:

{
  "error": {
    "message": "Model not found",
    "type": "invalid_request_error"
  }
}

Solution: Check /llm/v1/models for valid model IDs
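A quick listing of available model IDs might look like the sketch below; only the /llm/v1/models path comes from this page, while the base URL, bearer-token auth, and OpenAI-style response shape are assumptions.

import os
import requests

# BASE_URL and bearer-token auth are assumptions; only the /llm/v1/models path
# comes from this page.
BASE_URL = "https://api.example.com/llm/v1"
headers = {"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"}

resp = requests.get(f"{BASE_URL}/models", headers=headers, timeout=30)
resp.raise_for_status()
# Assumes an OpenAI-style {"data": [{"id": ...}, ...]} response shape.
print([m["id"] for m in resp.json().get("data", [])])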


Common Issues & Solutions

Issue: Response is cut off

Cause: The response hit the max_tokens limit

Solution: Increase max_tokens or ask the model to be more concise

Issue: Slow responses

Cause: Large context or complex models

Solution:

  • Use faster models (Haiku, Flash, Mini variants)
  • Reduce input length
  • Enable streaming for better UX (see the sketch after this list)
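A streaming sketch is shown below; it assumes the gateway supports OpenAI-style server-sent events when "stream": true is set, which should be verified against the actual API. The base URL, API key variable, and model ID are the same placeholders used in earlier sketches.

import json
import os
import requests

BASE_URL = "https://api.example.com/llm/v1"        # placeholder
API_KEY = os.environ["LLM_API_KEY"]                # placeholder

def stream_completion(prompt: str) -> None:
    # Assumes OpenAI-style SSE: lines prefixed with "data: ", ending with
    # "data: [DONE]". Verify the chunk format against the actual gateway.
    with requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "claude-haiku",               # hypothetical model ID
            "stream": True,
            "messages": [{"role": "user", "content": prompt}],
        },
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)       # render tokens as they arrive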

Issue: Hallucinations or inaccurate information

Cause: The model is generating plausible-sounding but unsupported content

Solution:

  • Lower temperature (try 0.3 or 0)
  • Add "Only use information provided in the context" to system prompt
  • Use RAG (retrieval) to ground responses in facts (see the sketch after this list)
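A grounded-generation sketch follows: retrieved passages are injected into the prompt, the system prompt restricts the model to that context, and temperature is set to 0. The retrieval step itself is outside the sketch, and the endpoint, auth, and model ID are the same placeholders as before.

import os
import requests

BASE_URL = "https://api.example.com/llm/v1"        # placeholder
API_KEY = os.environ["LLM_API_KEY"]                # placeholder

def grounded_answer(question: str, passages: list[str]) -> str:
    # passages would come from your own embeddings + retrieval step.
    context = "\n\n".join(passages)
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "claude-haiku",               # hypothetical model ID
            "temperature": 0,                      # deterministic, factual tasks
            "max_tokens": 512,
            "messages": [
                {"role": "system",
                 "content": "Only use information provided in the context. "
                            "If the answer is not in the context, say you don't know."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]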

Issue: Cost too high

Cause: Using expensive models or long contexts

Solution:

  • Switch to cheaper models (check pricing in /llm/v1/models)
  • Implement caching for repeated prompts
  • Summarize documents before analysis

Best Practices Checklist

  1. Always set max_tokens to avoid runaway costs
  2. Use temperature=0 for factual/deterministic tasks
  3. Include system prompts for consistent behavior
  4. Monitor usage in responses to track costs (see the sketch after this list)
  5. Implement retry logic with exponential backoff
  6. Cache frequently used prompts with models that support it
  7. Test with cheaper models first, scale up if needed
  8. Use streaming for better user experience in UIs
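As a final illustration, the sketch below reads the usage field from a response to track token consumption; the prompt_tokens / completion_tokens field names follow the common OpenAI-style usage object and should be verified against real responses from this gateway.

import os
import requests

BASE_URL = "https://api.example.com/llm/v1"        # placeholder
API_KEY = os.environ["LLM_API_KEY"]                # placeholder

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-haiku",                   # hypothetical model ID
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize: ..."}],
    },
    timeout=60,
)
resp.raise_for_status()
# Field names assume the common OpenAI-style usage object.
usage = resp.json().get("usage", {})
print("prompt_tokens:", usage.get("prompt_tokens"),
      "completion_tokens:", usage.get("completion_tokens"))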