Best Practices
Tips and best practices for using LLMs
Pro Tips
Reduce Costs
- Use smaller models for simple tasks (e.g., Haiku instead of Sonnet or Opus)
- Cache system prompts: models with cache support reuse previous context instead of reprocessing it
- Limit max_tokens: only generate as much output as you need
- Try cheaper alternatives: DeepSeek and Qwen models are often 10-100x cheaper
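For example, a small routing helper can send short, simple prompts to a cheap model and reserve a larger model for harder tasks, while capping `max_tokens`. This is a minimal sketch, assuming an OpenAI-compatible chat endpoint alongside `/llm/v1/models`; the base URL, path, and model IDs below are placeholders, so check `/llm/v1/models` for the real IDs.

```python
import requests

# Assumptions: an OpenAI-compatible chat endpoint next to /llm/v1/models,
# and illustrative model IDs -- check /llm/v1/models for the real ones.
BASE_URL = "https://your-gateway.example.com/llm/v1"
CHEAP_MODEL = "claude-haiku"      # placeholder ID
CAPABLE_MODEL = "claude-sonnet"   # placeholder ID

def complete(prompt: str, simple: bool = True) -> str:
    """Route simple prompts to the cheaper model and cap output length."""
    payload = {
        "model": CHEAP_MODEL if simple else CAPABLE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # only generate what you need
    }
    resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```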
Improve Quality
- Use system prompts to set context and behavior
- Provide examples in your prompt (few-shot learning)
- Break complex tasks into multiple smaller requests
- Use vision models for scanned documents instead of OCR + text models
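A few-shot request can combine the first two points: the system message fixes behavior, and a couple of worked examples show the expected output format. The request body below is illustrative; the message shape follows the common chat-completions convention and the model ID is a placeholder.

```python
# Hypothetical few-shot request body: a system prompt pins the behavior and
# two example exchanges show the model the expected output format.
few_shot_request = {
    "model": "claude-haiku",  # placeholder ID from /llm/v1/models
    "max_tokens": 100,
    "messages": [
        {"role": "system",
         "content": "You classify support tickets as 'billing', 'bug', or 'other'. Reply with the label only."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "The export button does nothing when I click it."},
        {"role": "assistant", "content": "bug"},
        {"role": "user", "content": "Can I change the invoice email address?"},
    ],
}
```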
Handle Long Documents
- Claude models support 200K tokens (~150K words)
- Gemini 2.5 Flash supports 1M tokens (enough for entire depositions)
- Chunk and summarize progressively for ultra-long documents
- Use embeddings + retrieval instead of sending entire document
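One way to summarize progressively is a simple map-reduce loop: summarize each chunk, then summarize the summaries. The sketch below is illustrative; `complete` can be any prompt-to-completion function (such as the routing helper in the cost section), and the character-based chunking is a rough stand-in for proper token-aware splitting.

```python
from typing import Callable

def summarize_long_document(
    text: str,
    complete: Callable[[str], str],
    chunk_chars: int = 12_000,
) -> str:
    """Map-reduce summarization: summarize each chunk, then summarize the summaries.

    `complete` is any prompt -> completion function. Character-based chunking
    is a rough stand-in for token-aware, section-respecting splitting.
    """
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial = [
        complete(f"Summarize the following excerpt in 5 bullet points:\n\n{chunk}")
        for chunk in chunks
    ]
    return complete(
        "Combine these partial summaries into one coherent summary:\n\n"
        + "\n\n".join(partial)
    )
```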
Error Handling
Rate Limits:
Solution: Implement exponential backoff retry logic (see the retry sketch after this list)
Context Length Exceeded:
Solution: Reduce input text or switch to a model with larger context window
Invalid Model:
Solution: Check /llm/v1/models for valid model IDs
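A minimal retry wrapper for rate limits might look like the sketch below. It assumes the gateway signals rate limiting with HTTP 429 (a common but unverified convention here) and retries with exponential backoff plus jitter.

```python
import random
import time

import requests

def post_with_backoff(url: str, payload: dict, max_retries: int = 5) -> dict:
    """Retry on HTTP 429 (rate limit) with exponential backoff and jitter.

    Assumes the gateway signals rate limits with status 429; adjust the
    status check to match your deployment.
    """
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Back off: 1s, 2s, 4s, 8s, ... plus a little jitter.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Rate limited after all retries")
```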
Common Issues & Solutions
Issue: Response is cut off
Cause: Hit max_tokens limit
Solution: Increase max_tokens or ask the model to be more concise
Issue: Slow responses
Cause: Large context or a complex model
Solution:
- Use faster models (Haiku, Flash, Mini variants)
- Reduce input length
- Enable streaming for better UX
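Streaming lets the UI show tokens as they arrive instead of waiting for the full completion. The sketch below assumes the endpoint accepts `"stream": true` and returns Server-Sent Events in the common `data: {...}` chat-completions format; verify this against your gateway before relying on it.

```python
import json

import requests

def stream_completion(url: str, payload: dict):
    """Yield text deltas as they arrive.

    Assumes SSE in the common 'data: {...}' chat-completions format with a
    terminating '[DONE]' event -- verify for your gateway.
    """
    payload = {**payload, "stream": True}
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0].get("delta", {}).get("content")
            if delta:
                yield delta
```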
Issue: Hallucinations or inaccurate information
Cause: The model is making things up
Solution:
- Lower temperature (try 0.3 or 0)
- Add "Only use information provided in the context" to system prompt
- Use RAG (retrieval) to ground responses in facts
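A grounded request combines all three points: retrieved context in the prompt, an instruction to stay inside that context, and temperature 0. The body below is illustrative; `retrieved_passages` stands in for the output of your retrieval step, and the model ID is a placeholder.

```python
# Illustrative grounded request: retrieved_passages would come from your
# retrieval step (e.g., embeddings search); the model ID is a placeholder.
retrieved_passages = ["<passage 1>", "<passage 2>"]

grounded_request = {
    "model": "claude-haiku",
    "temperature": 0,
    "max_tokens": 300,
    "messages": [
        {
            "role": "system",
            "content": "Only use information provided in the context. "
                       "If the context does not contain the answer, say so.",
        },
        {
            "role": "user",
            "content": "Context:\n" + "\n\n".join(retrieved_passages)
                       + "\n\nQuestion: When does the contract expire?",
        },
    ],
}
```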
Issue: Cost too high
Cause: Using expensive models or long contexts
Solution:
- Switch to cheaper models (check pricing in /llm/v1/models)
- Implement caching for repeated prompts
- Summarize documents before analysis
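For repeated, identical requests, a simple client-side cache avoids paying for the same completion twice. This is a sketch only, and it complements rather than replaces provider-side prompt caching; the endpoint URL is an assumed OpenAI-compatible path.

```python
import hashlib
import json

import requests

# Assumption: an OpenAI-compatible chat endpoint next to /llm/v1/models.
CHAT_URL = "https://your-gateway.example.com/llm/v1/chat/completions"

# Simple in-memory cache keyed by a hash of the full request payload.
_cache: dict[str, str] = {}

def cached_complete(payload: dict) -> str:
    """Return a cached completion for identical payloads, calling the API otherwise."""
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        resp = requests.post(CHAT_URL, json=payload, timeout=60)
        resp.raise_for_status()
        _cache[key] = resp.json()["choices"][0]["message"]["content"]
    return _cache[key]
```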
Best Practices
- Always set max_tokens to avoid runaway costs
- Use temperature=0 for factual/deterministic tasks
- Include system prompts for consistent behavior
- Monitor the usage block returned with responses to track costs (see the sketch after this list)
- Implement retry logic with exponential backoff
- Cache frequently used prompts with models that support it
- Test with cheaper models first, scale up if needed
- Use streaming for better user experience in UIs
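The sketch below ties several of these together: a capped `max_tokens`, `temperature=0` for a factual task, and reading the usage block from the response to track costs. The field names (`prompt_tokens`, `completion_tokens`) follow the common chat-completions convention and the endpoint path and model ID are assumptions; check your gateway's actual response shape.

```python
import requests

# Assumption: an OpenAI-compatible chat endpoint and a usage block with
# prompt/completion token counts in the response.
CHAT_URL = "https://your-gateway.example.com/llm/v1/chat/completions"

payload = {
    "model": "claude-haiku",   # placeholder ID from /llm/v1/models
    "temperature": 0,          # deterministic output for factual tasks
    "max_tokens": 200,         # hard cap to avoid runaway costs
    "messages": [{"role": "user",
                  "content": "Summarize our refund policy in one sentence."}],
}

resp = requests.post(CHAT_URL, json=payload, timeout=60)
resp.raise_for_status()
usage = resp.json().get("usage", {})
print("prompt tokens:", usage.get("prompt_tokens"),
      "| completion tokens:", usage.get("completion_tokens"))
```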