Best Practices
Tips and best practices for using LLMs
Pro Tips
Reduce Costs
- Use smaller models for simple tasks (e.g., Haiku instead of Sonnet or Opus)
- Cache system prompts: models with cache support reuse previous context instead of reprocessing it
- Limit max_tokens: only generate as much output as you need
- Try cheaper alternatives: DeepSeek and Qwen models are often 10-100x cheaper
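For example, a small routing helper can send short, simple prompts to a cheap model and reserve a larger model for harder tasks, while capping `max_tokens`. This is a minimal sketch, assuming an OpenAI-compatible chat endpoint alongside `/llm/v1/models`; the base URL, path, and model IDs below are placeholders, so check `/llm/v1/models` for the real IDs.

```python
import requests

# Assumptions: an OpenAI-compatible chat endpoint next to /llm/v1/models,
# and illustrative model IDs -- check /llm/v1/models for the real ones.
BASE_URL = "https://your-gateway.example.com/llm/v1"
CHEAP_MODEL = "claude-haiku"      # placeholder ID
CAPABLE_MODEL = "claude-sonnet"   # placeholder ID

def complete(prompt: str, simple: bool = True) -> str:
    """Route simple prompts to the cheaper model and cap output length."""
    payload = {
        "model": CHEAP_MODEL if simple else CAPABLE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # only generate what you need
    }
    resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```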
Improve Quality
- Use system prompts to set context and behavior
- Provide examples in your prompt (few-shot learning)
- Break complex tasks into multiple smaller requests
- Use vision models for scanned documents instead of OCR + text models
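A few-shot request can combine the first two points: the system message fixes behavior, and a couple of worked examples show the expected output format. The request body below is illustrative; the message shape follows the common chat-completions convention and the model ID is a placeholder.

```python
# Hypothetical few-shot request body: a system prompt pins the behavior and
# two example exchanges show the model the expected output format.
few_shot_request = {
    "model": "claude-haiku",  # placeholder ID from /llm/v1/models
    "max_tokens": 100,
    "messages": [
        {"role": "system",
         "content": "You classify support tickets as 'billing', 'bug', or 'other'. Reply with the label only."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "The export button does nothing when I click it."},
        {"role": "assistant", "content": "bug"},
        {"role": "user", "content": "Can I change the invoice email address?"},
    ],
}
```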
Handle Long Documents
- Claude models support 200K tokens (~150K words)
- Gemini 2.5 Flash supports 1M tokens (enough for entire depositions)
- Chunk and summarize progressively for ultra-long documents
- Use embeddings + retrieval instead of sending entire document
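One way to summarize progressively is a simple map-reduce loop: summarize each chunk, then summarize the summaries. The sketch below is illustrative; `complete` can be any prompt-to-completion function (such as the routing helper in the cost section), and the character-based chunking is a rough stand-in for proper token-aware splitting.

```python
from typing import Callable

def summarize_long_document(
    text: str,
    complete: Callable[[str], str],
    chunk_chars: int = 12_000,
) -> str:
    """Map-reduce summarization: summarize each chunk, then summarize the summaries.

    `complete` is any prompt -> completion function. Character-based chunking
    is a rough stand-in for token-aware, section-respecting splitting.
    """
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial = [
        complete(f"Summarize the following excerpt in 5 bullet points:\n\n{chunk}")
        for chunk in chunks
    ]
    return complete(
        "Combine these partial summaries into one coherent summary:\n\n"
        + "\n\n".join(partial)
    )
```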
Error Handling
Rate Limits:
Solution: Implement exponential backoff retry logic (see the retry sketch after this list)
Context Length Exceeded:
Solution: Reduce input text or switch to a model with larger context window
Invalid Model:
Solution: Check /llm/v1/models for valid model IDs
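A minimal retry wrapper for rate limits might look like the sketch below. It assumes the gateway signals rate limiting with HTTP 429 (a common but unverified convention here) and retries with exponential backoff plus jitter.

```python
import random
import time

import requests

def post_with_backoff(url: str, payload: dict, max_retries: int = 5) -> dict:
    """Retry on HTTP 429 (rate limit) with exponential backoff and jitter.

    Assumes the gateway signals rate limits with status 429; adjust the
    status check to match your deployment.
    """
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Back off: 1s, 2s, 4s, 8s, ... plus a little jitter.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Rate limited after all retries")
```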
Common Issues & Solutions
Issue: Response is cut off
Cause: Hit max_tokens limit
Solution: Increase max_tokens or ask the model to be more concise
Issue: Slow responses
Cause: Large context or a complex model
Solution:
- Use faster models (Haiku, Flash, Mini variants)
- Reduce input length
- Enable streaming for better UX
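Streaming lets the UI show tokens as they arrive instead of waiting for the full completion. The sketch below assumes the endpoint accepts `"stream": true` and returns Server-Sent Events in the common `data: {...}` chat-completions format; verify this against your gateway before relying on it.

```python
import json

import requests

def stream_completion(url: str, payload: dict):
    """Yield text deltas as they arrive.

    Assumes SSE in the common 'data: {...}' chat-completions format with a
    terminating '[DONE]' event -- verify for your gateway.
    """
    payload = {**payload, "stream": True}
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0].get("delta", {}).get("content")
            if delta:
                yield delta
```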
Issue: Hallucinations or inaccurate information
Cause: The model is making things up
Solution:
- Lower temperature (try 0.3 or 0)
- Add "Only use information provided in the context" to system prompt
- Use RAG (retrieval) to ground responses in facts
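A grounded request combines all three points: retrieved context in the prompt, an instruction to stay inside that context, and temperature 0. The body below is illustrative; `retrieved_passages` stands in for the output of your retrieval step, and the model ID is a placeholder.

```python
# Illustrative grounded request: retrieved_passages would come from your
# retrieval step (e.g., embeddings search); the model ID is a placeholder.
retrieved_passages = ["<passage 1>", "<passage 2>"]

grounded_request = {
    "model": "claude-haiku",
    "temperature": 0,
    "max_tokens": 300,
    "messages": [
        {
            "role": "system",
            "content": "Only use information provided in the context. "
                       "If the context does not contain the answer, say so.",
        },
        {
            "role": "user",
            "content": "Context:\n" + "\n\n".join(retrieved_passages)
                       + "\n\nQuestion: When does the contract expire?",
        },
    ],
}
```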
Issue: Cost too high
Cause: Using expensive models or long contexts
Solution:
- Switch to cheaper models (check pricing in /llm/v1/models)
- Implement caching for repeated prompts
- Summarize documents before analysis
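For repeated, identical requests, a simple client-side cache avoids paying for the same completion twice. This is a sketch only, and it complements rather than replaces provider-side prompt caching; the endpoint URL is an assumed OpenAI-compatible path.

```python
import hashlib
import json

import requests

# Assumption: an OpenAI-compatible chat endpoint next to /llm/v1/models.
CHAT_URL = "https://your-gateway.example.com/llm/v1/chat/completions"

# Simple in-memory cache keyed by a hash of the full request payload.
_cache: dict[str, str] = {}

def cached_complete(payload: dict) -> str:
    """Return a cached completion for identical payloads, calling the API otherwise."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        resp = requests.post(CHAT_URL, json=payload, timeout=60)
        resp.raise_for_status()
        _cache[key] = resp.json()["choices"][0]["message"]["content"]
    return _cache[key]
```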
Best Practices
- Always set max_tokens to avoid runaway costs
- Use temperature=0 for factual/deterministic tasks
- Include system prompts for consistent behavior
- Monitor the usage block returned with responses to track costs (see the sketch after this list)
- Implement retry logic with exponential backoff
- Cache frequently used prompts with models that support it
- Test with cheaper models first, scale up if needed
- Use streaming for better user experience in UIs
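The sketch below ties several of these together: a capped `max_tokens`, `temperature=0` for a factual task, and reading the usage block from the response to track costs. The field names (`prompt_tokens`, `completion_tokens`) follow the common chat-completions convention and the endpoint path and model ID are assumptions; check your gateway's actual response shape.

```python
import requests

# Assumption: an OpenAI-compatible chat endpoint and a usage block with
# prompt/completion token counts in the response.
CHAT_URL = "https://your-gateway.example.com/llm/v1/chat/completions"

payload = {
    "model": "claude-haiku",   # placeholder ID from /llm/v1/models
    "temperature": 0,          # deterministic output for factual tasks
    "max_tokens": 200,         # hard cap to avoid runaway costs
    "messages": [{"role": "user",
                  "content": "Summarize our refund policy in one sentence."}],
}

resp = requests.post(CHAT_URL, json=payload, timeout=60)
resp.raise_for_status()
usage = resp.json().get("usage", {})
print("prompt tokens:", usage.get("prompt_tokens"),
      "| completion tokens:", usage.get("completion_tokens"))
```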