Our OpenAI bill hit $47,000 in month three. For a startup.
We had built Sokrateque.ai—an AI research assistant that searches 10M+ academic papers. It was working beautifully. Users loved it. Accuracy was 90%+.
But at $0.50 per query, we were bleeding money. With 3,000 queries per day, we were on track for $540K in annual API costs alone.
The math was brutal: At our pricing ($20/month per user), we needed 2,250 paying users just to break even on API costs. Not including servers, engineering, support—just the OpenAI bill.
We had two choices:
- Raise prices 3-4x (kill growth)
- Cut costs dramatically (or die)
We chose option 2. Over 4 months, we reduced costs by 70% (from $0.50 to $0.15 per query) while maintaining—and in some cases improving—quality.
Here's exactly how we did it, with real code and real numbers.
The Cost Breakdown: Where Money Actually Goes
Before optimizing, understand where costs come from:
Typical LLM Application Cost Structure:
```
Per-query cost breakdown (before optimization):

Input tokens:   $0.25  (50%)  ← Prompt + retrieved context
Output tokens:  $0.20  (40%)  ← Generated response
Embeddings:     $0.03   (6%)  ← Vector search
Other:          $0.02   (4%)  ← Reranking, validation, etc.
──────────────────────────────
Total:          $0.50 per query
```
The insight: Input tokens dominate costs. That's where to focus.
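As a sanity check, the breakdown above can be reproduced from list prices and token counts. A minimal sketch (the token counts and the `query_cost` helper are illustrative assumptions, not production code):

```python
# Per-query cost model. Prices are the per-1K-token GPT-4 Turbo rates
# used throughout this article; token counts are illustrative.
PRICES = {"input": 0.01, "output": 0.03}  # $/1K tokens

def query_cost(input_tokens, output_tokens, embeddings=0.03, other=0.02):
    """Compute per-query cost from token counts plus fixed overheads."""
    input_cost = input_tokens / 1000 * PRICES["input"]
    output_cost = output_tokens / 1000 * PRICES["output"]
    return input_cost + output_cost + embeddings + other

# ~25K input tokens and ~6.7K output tokens reproduce the $0.50 figure
cost = query_cost(input_tokens=25_000, output_tokens=6_667)
```

Plugging in your own token counts makes it obvious where to focus: halving input tokens moves the total far more than halving anything else.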
Our optimization journey:
```
Month 1: $0.50 per query (baseline)
Month 2: $0.35 per query (30% reduction)  ← Prompt optimization
Month 3: $0.22 per query (56% reduction)  ← Caching + model selection
Month 4: $0.15 per query (70% reduction)  ← Advanced techniques
```
Let's walk through each stage.
Stage 1: Prompt Optimization (30% Cost Reduction)
The problem: We were sending massive prompts on every query.
Original approach (expensive):
```python
# Naive implementation (expensive)
def answer_query(user_query):
    # Retrieve 10 relevant papers
    papers = vector_search(user_query, top_k=10)

    # Build massive prompt
    prompt = """
You are an academic research assistant...

PAPERS:
"""
    # Include FULL text of all 10 papers (huge!)
    for paper in papers:
        prompt += f"\nPaper {paper.id}:\n"
        prompt += f"Title: {paper.title}\n"
        prompt += f"Full text: {paper.full_text}\n"  # ← 10,000+ tokens per paper!

    prompt += f"\nUser question: {user_query}\nAnswer:"

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response

# Cost: Input = 120K tokens × $0.01/1K = $1.20 per query
# Total: ~$1.22 per query (unsustainable!)
```
Problems:
- Sending 10 full papers (120K+ tokens) on every query
- Repetitive instructions taking up tokens
- Including information LLM doesn't need
Optimization 1.1: Compress Retrieved Context
Don't send full papers. Send only relevant excerpts.
```python
def answer_query_optimized(user_query):
    # Retrieve papers
    papers = vector_search(user_query, top_k=10)

    # NEW: Extract only relevant sections
    relevant_excerpts = []
    for paper in papers:
        # Find most relevant paragraphs from each paper
        excerpts = extract_relevant_excerpts(
            paper=paper,
            query=user_query,
            max_excerpts=3,  # Only top 3 paragraphs per paper
            max_tokens=500   # Limit each excerpt
        )
        relevant_excerpts.extend(excerpts)

    # Build compressed prompt
    prompt = f"""
Answer based on these excerpts from academic papers:

{format_excerpts(relevant_excerpts)}

Question: {user_query}
Answer:
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response

# Cost reduction:
# Before: 120K input tokens
# After:  15K input tokens (only relevant excerpts)
# Savings: 88% on input tokens = 44% total cost reduction
```
Implementation of excerpt extraction:
```python
def extract_relevant_excerpts(paper, query, max_excerpts=3, max_tokens=500):
    """Find the most relevant sections of a paper for a query."""
    # Split paper into paragraphs
    paragraphs = split_into_paragraphs(paper.full_text)

    # Score each paragraph by relevance to the query
    scored_paragraphs = []
    query_embedding = embed(query)
    for para in paragraphs:
        para_embedding = embed(para)
        similarity = cosine_similarity(query_embedding, para_embedding)
        scored_paragraphs.append({
            'text': para,
            'score': similarity,
            'token_count': count_tokens(para)
        })

    # Sort by relevance
    scored_paragraphs.sort(key=lambda x: x['score'], reverse=True)

    # Select top paragraphs that fit in the token budget
    selected = []
    total_tokens = 0
    for para in scored_paragraphs:
        if len(selected) >= max_excerpts:
            break
        if total_tokens + para['token_count'] <= max_tokens:
            selected.append(para)
            total_tokens += para['token_count']
    return selected

# Result: only the most relevant 500-1000 tokens per paper
# vs. 10,000+ tokens for the full paper
```
Quality impact: accuracy actually improved, from 87% to 89%. Why? With less irrelevant information to parse, the model focuses on what matters.
Optimization 1.2: System Prompt Efficiency
Move repeated instructions into the system prompt, where a stable prefix can benefit from prompt caching.
```python
# Before: instructions repeated in every user message
messages = [
    {
        "role": "user",
        "content": f"""
You are an academic research assistant. [100 tokens of instructions]
Papers: {papers}
Question: {query}
"""
    }
]
# Cost: instructions counted as fresh input tokens on every call

# After: instructions in the system prompt (reusable)
messages = [
    {
        "role": "system",
        "content": """
You are an academic research assistant. [100 tokens of instructions]
Always cite papers when answering.
Be concise and accurate.
"""
    },
    {
        "role": "user",
        "content": f"""
Papers: {papers}
Question: {query}
"""
    }
]
# Cost: a stable system prompt is eligible for OpenAI prompt caching
#       (50% discount on cached input tokens)
# Savings: 50% on 100 instruction tokens = small but adds up
```
Optimization 1.3: Reduce Output Verbosity
Shorter outputs = lower costs.
```python
# Before: no length control
prompt = "Explain this research paper in detail."
# Output: 800 tokens (LLMs are verbose by default)
# Cost: 800 tokens × $0.03/1K = $0.024

# After: explicit length control
prompt = "Explain this research paper in 2-3 sentences (max 100 words)."
# Output: 150 tokens (controlled)
# Cost: 150 tokens × $0.03/1K = $0.0045
# Savings: 81% on output tokens = 16% total cost reduction

# Alternative: use the max_tokens parameter
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    max_tokens=200  # Hard limit
)
```
Stage 1 Results:
| Optimization | Cost Reduction | Quality Impact |
|---|---|---|
| Compress context (excerpts) | -44% | +2% (improved!) |
| System prompt efficiency | -3% | No change |
| Reduce output verbosity | -16% | No change |
| TOTAL STAGE 1 | -30% | +2% (better!) |
New cost per query: $0.35 (was $0.50)
Stage 2: Caching & Model Selection (26% Additional Reduction)
Optimization 2.1: Aggressive Caching
The insight: Many queries are similar or identical.
Level 1: Exact query cache (Redis)
```python
def answer_with_cache(user_query):
    # Check if this exact query was asked before.
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so it would break a persistent cache across restarts.
    cache_key = f"query:{hashlib.sha256(user_query.encode()).hexdigest()}"
    cached_response = redis.get(cache_key)
    if cached_response:
        return json.loads(cached_response)  # Free!

    # Not cached, generate the answer
    response = generate_answer(user_query)

    # Cache for 24 hours
    redis.setex(cache_key, 86400, json.dumps(response))
    return response

# Hit rate: 15% (15% of queries are exact duplicates)
# Savings: 15% × $0.35 = $0.0525 per query on average
```
Level 2: Semantic similarity cache
```python
# Check for semantically similar queries
def semantic_cache_lookup(user_query):
    # Embed the query
    query_embedding = embed(user_query)

    # Search the cache for similar past queries
    similar_queries = vector_search_cache(
        query_embedding,
        similarity_threshold=0.95  # Very high threshold
    )
    if similar_queries:
        # Return the cached response
        return similar_queries[0].response
    return None

# Hit rate: additional 10% (beyond exact matches)
# Total cache hit rate: 25%
# Savings: 25% × $0.35 = $0.0875 per query on average
```
Level 3: Cache retrieved documents
```python
# Don't re-retrieve the same papers for similar queries
def retrieve_with_cache(query):
    # Stable hash, for the same reason as the query cache above
    cache_key = f"retrieval:{hashlib.sha256(query.encode()).hexdigest()}"
    cached_papers = redis.get(cache_key)
    if cached_papers:
        return json.loads(cached_papers)

    # Retrieve from the vector DB (costs an embedding API call)
    papers = vector_search(query, top_k=10)

    # Cache for 1 hour
    redis.setex(cache_key, 3600, json.dumps(papers))
    return papers

# Hit rate: 30% (retrieval cached)
# Savings: 30% of embedding costs
```
Optimization 2.2: Model Selection (Tiered Strategy)
Not all queries need GPT-4. Use cheaper models when possible.
Model pricing (OpenAI, Jan 2025):
GPT-4 Turbo:
- Input: $0.01 per 1K tokens
- Output: $0.03 per 1K tokens
GPT-3.5 Turbo:
- Input: $0.0005 per 1K tokens (20x cheaper!)
- Output: $0.0015 per 1K tokens (20x cheaper!)
Tiered strategy:
```python
def route_to_appropriate_model(user_query, complexity_score):
    """Route each query to the cheapest model that can handle it."""
    if complexity_score < 0.3:    # Simple query
        model = "gpt-3.5-turbo"
        cost_per_query = 0.03
    elif complexity_score < 0.7:  # Medium query
        model = "gpt-4-turbo"
        cost_per_query = 0.35
    else:                         # Complex query
        model = "gpt-4"
        cost_per_query = 0.50
    return model, cost_per_query

def calculate_query_complexity(query, retrieved_docs):
    """Score query complexity on a 0-1 scale."""
    complexity = 0

    # Factor 1: query length
    if count_tokens(query) > 100:
        complexity += 0.2

    # Factor 2: number of documents
    if len(retrieved_docs) > 5:
        complexity += 0.2

    # Factor 3: technical difficulty (keywords)
    technical_keywords = ['analyze', 'compare', 'synthesize', 'evaluate']
    if any(kw in query.lower() for kw in technical_keywords):
        complexity += 0.3

    # Factor 4: multi-document synthesis required
    if len(retrieved_docs) > 3:
        complexity += 0.3

    return min(complexity, 1.0)
```
Results from tiered routing:
Query Distribution:
- Simple (GPT-3.5): 40% of queries
- Medium (GPT-4 Turbo): 45% of queries
- Complex (GPT-4): 15% of queries
Cost Impact:
- 40% × $0.03 = $0.012
- 45% × $0.35 = $0.158
- 15% × $0.50 = $0.075
- Average: $0.245 per query
vs. All GPT-4: $0.35 per query
Savings: 30% with tiered approach
Quality impact: a one-point accuracy drop overall (96% → 95%). In user testing, users couldn't tell the difference, so the cost savings were worth it.
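The blended per-query cost under tiered routing is just a weighted average; a quick sketch using the distribution above:

```python
# Blended per-query cost under tiered routing
# (shares and per-query costs taken from the article's numbers)
distribution = {
    "gpt-3.5-turbo": (0.40, 0.03),  # (share of queries, cost per query)
    "gpt-4-turbo":   (0.45, 0.35),
    "gpt-4":         (0.15, 0.50),
}

blended = sum(share * cost for share, cost in distribution.values())
savings = 1 - blended / 0.35  # vs. sending everything at the $0.35 baseline
```

Recomputing this whenever the routing thresholds change keeps the cost model honest: shifting even 10% of traffic between tiers moves the blended cost noticeably.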
Stage 2 Results:
| Optimization | Cost Reduction | Quality Impact |
|---|---|---|
| Caching (multi-level) | -25% | No change |
| Model selection (tiered) | -30% | -1% (acceptable) |
| Batch processing | -5% | No change |
| TOTAL STAGE 2 | -26% | -1% |
New cost per query: $0.22 (was $0.35, originally $0.50)
Cumulative reduction: 56%
Stage 3: Advanced Optimizations (14% Additional Reduction)
Optimization 3.1: Streaming Responses
Don't wait for the entire response; stream tokens to the user as they are generated.
```python
# Before: wait for the complete response
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...]
)
# User waits 5-10 seconds, sees nothing

# After: stream the response
stream = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...],
    stream=True  # Enable streaming
)
for chunk in stream:
    token = chunk.choices[0].delta.content
    yield token  # Send to the user immediately

# User sees tokens appearing in real time (feels instant)
```
Cost benefit:
- Direct: None (same token cost)
- Indirect: Users perceive as faster → willing to use cheaper models
- Indirect: Can abort generation early if user navigates away (saves output tokens)
Optimization 3.2: Prompt Compression Techniques
Technique 1: Abbreviations and shorthand
```python
# Before: verbose metadata (60 tokens)
prompt = f"""
Paper Title: {paper.title}
Authors: {', '.join(paper.authors)}
Publication Year: {paper.year}
Journal: {paper.journal}
Abstract: {paper.abstract}
"""

# After: compressed format (35 tokens)
prompt = f"""
{paper.title} ({paper.year})
{paper.authors[0]} et al. | {paper.journal}
{paper.abstract}
"""

# Savings: 42% on metadata tokens
# Quality impact: none (the LLM understands both)
```
Technique 2: Remove filler words
```python
# Before:
"Please analyze the following papers and provide a comprehensive summary..."

# After:
"Analyze papers, summarize:"

# Savings: 50% on instruction tokens
# Quality: no degradation (the LLM doesn't need politeness)
```
Optimization 3.3: Query Preprocessing
Filter out junk before calling LLM.
```python
def preprocess_query(user_query):
    """Validate, and potentially handle the query without calling the LLM."""
    # Filter 1: empty or too short
    if len(user_query.strip()) < 10:
        return {"response": "Please provide a more detailed question.", "skip_llm": True}

    # Filter 2: obvious spam/abuse
    if is_spam(user_query):
        return {"response": "Invalid query.", "skip_llm": True}

    # Filter 3: FAQ (exact match)
    faq_answer = check_faq_database(user_query)
    if faq_answer:
        return {"response": faq_answer, "skip_llm": True}

    # Filter 4: out of scope
    if not is_academic_query(user_query):
        return {
            "response": "I can only help with academic research questions.",
            "skip_llm": True
        }

    # Passed all filters, proceed to the LLM
    return {"skip_llm": False}

# Savings: 5-8% of queries filtered (no API cost)
```
Optimization 3.4: Adaptive Context Window
Don't always use maximum context. Adjust based on query.
```python
def get_optimal_context_size(query, available_docs):
    """Determine how many documents are actually needed."""
    # Simple query → fewer docs needed
    if is_simple_query(query):
        return min(3, len(available_docs))
    # Comparative query → more docs needed
    elif is_comparative_query(query):
        return min(7, len(available_docs))
    # Complex synthesis → many docs needed
    else:
        return min(10, len(available_docs))

# Example:
query = "Who wrote this paper?"  # Simple
docs_needed = 1   # Only need 1 paper

query = "Compare methodologies across recent papers"  # Complex
docs_needed = 10  # Need many papers

# Savings: 30% fewer input tokens on simple queries
```
Stage 3 Results:
| Optimization | Cost Reduction | Quality Impact |
|---|---|---|
| Streaming (indirect) | -3% | Better UX |
| Prompt compression | -5% | No change |
| Query preprocessing | -3% | Better (spam filtered) |
| Adaptive context | -5% | No change |
| TOTAL STAGE 3 | -14% | Improved UX |
Final cost per query: $0.15 (was $0.22, originally $0.50)
TOTAL CUMULATIVE REDUCTION: 70%
The Complete Cost Optimization Playbook
Here's the framework we now use on every new project:
Phase 1: Low-Hanging Fruit (Week 1-2)
Immediate wins with minimal effort:
✅ Compress prompts
- Remove verbose instructions
- Use system prompts for repeated text
- Eliminate filler words
- Expected savings: 10-20%
✅ Cache exact queries
- Simple Redis cache for duplicates
- 24-hour TTL
- Expected savings: 10-15%
✅ Control output length
- Add max_tokens parameter
- Request concise responses in prompt
- Expected savings: 10-15%
Total Phase 1 savings: 30-40%
Implementation time: 1-2 weeks
Cost: $5K-$10K engineering
Phase 2: Strategic Optimizations (Month 2-3)
Require more engineering but high ROI:
✅ Implement tiered model routing
- Classify query complexity
- Route to appropriate model (GPT-3.5 / GPT-4 Turbo / GPT-4)
- Expected savings: 20-30%
✅ Semantic caching
- Cache similar queries (not just exact)
- Vector search in cache
- Expected savings: 10-15%
✅ Optimize retrieval
- Send only relevant excerpts
- Hierarchical retrieval
- Expected savings: 10-20%
Total Phase 2 savings: 40-50% (additional)
Implementation time: 1-2 months
Cost: $15K-$25K engineering
Phase 3: Advanced Techniques (Month 4+)
For high-scale systems, worth the complexity:
✅ Custom compression
- Domain-specific abbreviations
- Structured formats (JSON)
- Expected savings: 5-10%
✅ Adaptive context
- Variable document count
- Query-dependent retrieval
- Expected savings: 5-10%
✅ Template reuse
- Identify common patterns
- Generate templates
- Expected savings: 5-10%
Total Phase 3 savings: 15-25% (additional)
Implementation time: 1-2 months
Cost: $10K-$20K engineering
Cost Optimization by Use Case
Different applications need different strategies:
Use Case 1: Customer Support Chatbot
Characteristics:
- High query volume (10K+/day)
- Many duplicate questions
- Real-time responses required
Recommended optimizations (priority order):
- Exact caching (highest impact) - 30-40% of support queries are duplicates. ROI: 30-40% cost reduction for $5K implementation
- FAQ database - Answer common questions without LLM. ROI: 15-20% queries answered without API cost
- Model tiering - Simple questions → GPT-3.5, Complex → GPT-4. ROI: 20-30% cost reduction
Expected total savings: 50-60%
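For the FAQ step, even a trivial normalized lookup catches many duplicate support questions. A minimal sketch (the `FAQ` entries and both helper functions are hypothetical illustrations, not a real knowledge base):

```python
# Minimal FAQ short-circuit for a support chatbot.
# Normalizing the query raises the hit rate on trivial duplicates
# ("How do I reset my password?" vs "how do i reset my password").
FAQ = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "how do i cancel my subscription": "Go to Settings → Billing → Cancel plan.",
}

def normalize(query):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in query.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def faq_lookup(query):
    """Return a canned answer if the query matches a known FAQ, else None."""
    return FAQ.get(normalize(query))
```

Every hit here is a query that never touches the LLM at all, which is why this ranks above model tiering for high-duplicate workloads.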
Use Case 2: Content Generation
Characteristics:
- Long outputs (500-2000 tokens)
- Creative/unique content
- Lower query volume
Recommended optimizations:
- Prompt compression - Input tokens dominate less. ROI: 10-15% savings
- Streaming (UX + cost) - Abort early if user navigates away. ROI: 5-10% savings
- Output length control - Request exactly what's needed. ROI: 15-25% savings
Expected total savings: 30-40%
Use Case 3: Document Analysis (RAG Systems)
Characteristics:
- Large context windows (many documents)
- Moderate query volume
- High input token costs
Recommended optimizations:
- Context compression (critical) - Send excerpts, not full documents. ROI: 40-60% savings
- Retrieval caching - Cache retrieved documents. ROI: 20-30% savings
- Semantic caching - Similar queries = similar docs. ROI: 15-25% savings
Expected total savings: 60-70%
Monitoring and Measuring Cost Optimization
You can't optimize what you don't measure.
Metrics to Track:
```python
# Cost-per-query tracking
cost_metrics = {
    'input_tokens': 0,
    'output_tokens': 0,
    'embedding_tokens': 0,
    'total_cost': 0.0,
    'model_used': '',
    'cache_hit': False
}

# Track every query
def track_query_cost(query_id, metrics):
    db.insert('query_costs', {
        'query_id': query_id,
        'timestamp': datetime.now(),
        'input_tokens': metrics['input_tokens'],
        'output_tokens': metrics['output_tokens'],
        'total_cost': metrics['total_cost'],
        'model': metrics['model_used'],
        'cached': metrics['cache_hit']
    })

# Daily dashboard
def generate_cost_dashboard():
    today = datetime.now().date()
    metrics = {
        'total_queries': count_queries(today),
        'total_cost': sum_costs(today),
        'avg_cost_per_query': avg_cost(today),
        # Breakdown by source
        'cached_queries': count_cached(today),
        'cache_hit_rate': cache_hit_rate(today),
        # Model distribution
        'gpt4_queries': count_by_model(today, 'gpt-4'),
        'gpt35_queries': count_by_model(today, 'gpt-3.5-turbo'),
        # Cost by component
        'input_cost': sum_input_costs(today),
        'output_cost': sum_output_costs(today),
        'embedding_cost': sum_embedding_costs(today)
    }
    return metrics
```
Dashboard should show:
- Daily cost trends (identify spikes)
- Cost per query (track optimization impact)
- Cache hit rate (measure caching effectiveness)
- Model distribution (ensure tiering works)
- Token usage breakdown (input vs output)
Common Mistakes to Avoid
Mistake 1: Optimizing Without Measuring
❌ Wrong: "Let's compress prompts, it should help!"
✅ Right: Measure baseline → Implement change → Measure again → Compare
```python
# Before optimizing: establish a baseline
baseline_cost = calculate_average_cost(days=7)
baseline_quality = measure_quality_metrics(days=7)

# Make the change
implement_optimization()

# Collect 7 days of data for statistical significance
# (illustrative — in practice, schedule the comparison rather than sleeping)
time.sleep(7 * 86400)

# Measure impact
new_cost = calculate_average_cost(days=7)
new_quality = measure_quality_metrics(days=7)

# Compare
cost_reduction = (baseline_cost - new_cost) / baseline_cost
quality_change = (new_quality - baseline_quality) / baseline_quality
print(f"Cost reduction: {cost_reduction:.1%}")
print(f"Quality change: {quality_change:.1%}")
```
Mistake 2: Sacrificing Quality for Cost
❌ Wrong: Use GPT-3.5 for everything (cheapest!)
✅ Right: Use cheapest model that maintains quality threshold
Quality thresholds:
- Customer-facing: ≥95% accuracy
- Internal tools: ≥90% accuracy
- Experimental: ≥85% accuracy
If optimization drops quality below threshold → don't do it.
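One way to enforce the rule is a simple gate in the rollout pipeline. A sketch using the thresholds above (the tier names and helper are illustrative):

```python
# Accept a cost optimization only if measured quality stays above
# the floor for its tier (thresholds from the article).
THRESHOLDS = {
    "customer_facing": 0.95,
    "internal": 0.90,
    "experimental": 0.85,
}

def optimization_acceptable(tier, measured_accuracy):
    """True if post-change accuracy still clears the tier's floor."""
    return measured_accuracy >= THRESHOLDS[tier]
```

Wired into an A/B test, this turns "don't sacrifice quality" from a slogan into an automatic rollback condition.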
Mistake 3: Over-Engineering Too Early
❌ Wrong: Build complex caching + routing + compression on day 1
✅ Right: Implement in stages, prioritize by ROI
ROI calculation:
ROI = (Annual Savings) / (Implementation Cost)
Example optimizations ranked:
1. Exact caching: $120K savings / $5K cost = 24x ROI ✅ Do first
2. Model routing: $80K savings / $20K cost = 4x ROI ✅ Do second
3. Prompt compression: $30K savings / $15K cost = 2x ROI ✅ Do third
4. Custom templates: $10K savings / $20K cost = 0.5x ROI ❌ Skip
Start with highest ROI optimizations. Stop when ROI < 2x.
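The ranking above is easy to automate. A sketch using the article's numbers (the `plan` helper is illustrative):

```python
# Rank candidate optimizations by ROI; keep anything above a cutoff.
# (name, estimated annual savings in $, one-time implementation cost in $)
candidates = [
    ("exact caching",      120_000,  5_000),
    ("model routing",       80_000, 20_000),
    ("prompt compression",  30_000, 15_000),
    ("custom templates",    10_000, 20_000),
]

def plan(candidates, min_roi=2.0):
    """Return (name, roi) pairs worth doing, highest ROI first."""
    scored = [(name, savings / cost) for name, savings, cost in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, roi) for name, roi in scored if roi >= min_roi]
```

With the default 2x cutoff, "custom templates" (0.5x) drops out automatically, matching the ranking above.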
Mistake 4: Not Planning for Scale
❌ Wrong: "We're only spending $500/month, doesn't matter"
✅ Right: Optimize early, before costs explode
Cost scaling example:
Month 1: 1K queries/day × $0.50 = $15K/month ← "Manageable"
Month 6: 10K queries/day × $0.50 = $150K/month ← "Problem"
Month 12: 50K queries/day × $0.50 = $750K/month ← "Crisis"
If you optimize in Month 1:
Month 12: 50K queries/day × $0.15 = $225K/month ← "Sustainable"
Savings: $525K/year by optimizing early
Optimize when costs are small (easier to test), before they become existential.
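The scaling projection above is straightforward to script; a sketch:

```python
def monthly_cost(queries_per_day, cost_per_query, days=30):
    """Project monthly API spend from daily volume and per-query cost."""
    return queries_per_day * cost_per_query * days

# Month 12 at 50K queries/day, with and without optimization
unoptimized = monthly_cost(50_000, 0.50)  # the "crisis" scenario
optimized   = monthly_cost(50_000, 0.15)  # the "sustainable" scenario
```

Running your own growth assumptions through this two-liner is usually enough to settle the "optimize now or later" debate.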
The Economics: When Optimization Pays Off
Should you optimize costs?
Decision framework:
IF annual_api_cost < $50K:
→ Focus on product, not optimization
→ Cost is not your problem yet
IF annual_api_cost between $50K and $200K:
→ Implement Phase 1 optimizations (quick wins)
→ ROI: 30-40% savings, $15K-$80K/year value
IF annual_api_cost between $200K and $500K:
→ Implement Phase 1 + Phase 2
→ ROI: 50-60% savings, $100K-$300K/year value
IF annual_api_cost > $500K:
→ Implement all phases
→ Consider fine-tuning or self-hosting models
→ ROI: 60-70% savings, $300K+/year value
Break-even calculation:
Optimization implementation cost: $30K (one-time)
Annual API cost before optimization: $200K
Savings from optimization: 50% = $100K/year
Payback period: $30K / $100K = 0.3 years (3.6 months)
5-year ROI: ($100K × 5 - $30K) / $30K = 1,567%
Worth it if:
- Payback < 12 months ✅
- API costs are trending upward ✅
- You plan to scale significantly ✅
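The payback arithmetic above, as a reusable helper (a sketch):

```python
def payback_months(implementation_cost, annual_api_cost, savings_rate):
    """Months until a one-time optimization cost is recovered."""
    annual_savings = annual_api_cost * savings_rate
    return implementation_cost / (annual_savings / 12)

# The example above: $30K one-time cost, $200K/year spend, 50% savings
months = payback_months(30_000, 200_000, 0.50)
```

Anything under 12 months clears the "worth it" bar above with room to spare.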
Beyond OpenAI: Alternative Providers
Sometimes the best optimization is switching providers.
Provider Cost Comparison (Jan 2025):
Per 1M tokens pricing:
GPT-4 (OpenAI):
- Input: $10.00
- Output: $30.00
Claude 3.5 Sonnet (Anthropic):
- Input: $3.00
- Output: $15.00
- Savings vs GPT-4: 50-70%
Llama 3.1 405B (self-hosted on AWS):
- Compute: ~$2.00 per 1M tokens (at scale)
- Setup: $50K infrastructure
- Savings vs GPT-4: 80-90% (at >10M tokens/month)
Gemini 1.5 Pro (Google):
- Input: $7.00
- Output: $21.00
- Savings vs GPT-4: 30%
Decision matrix:
Query volume < 1M tokens/month:
→ Stay with OpenAI (switching cost not worth it)
Query volume 1M-10M tokens/month:
→ Consider Claude (50% savings, easy switch)
Query volume 10M-100M tokens/month:
→ Consider multi-provider (OpenAI + Claude + Gemini)
→ Route by use case
Query volume > 100M tokens/month:
→ Consider self-hosted (Llama, Mistral)
→ Savings can be 80-90%
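The decision matrix can be encoded directly. A sketch (the return values are shorthand labels for the strategies above, not API identifiers):

```python
def recommend_provider(tokens_per_month):
    """Map monthly token volume to the article's provider recommendation."""
    if tokens_per_month < 1_000_000:
        return "openai"           # switching cost not worth it
    elif tokens_per_month < 10_000_000:
        return "claude"           # ~50% savings, easy switch
    elif tokens_per_month < 100_000_000:
        return "multi-provider"   # route by use case
    else:
        return "self-hosted"      # Llama/Mistral, 80-90% savings
```

Useful as a periodic check: re-run it on last month's token volume and flag when the recommendation changes tier.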
Switching considerations:
- ✅ Quality differences (test thoroughly)
- ✅ API compatibility (different interfaces)
- ✅ Rate limits (varies by provider)
- ✅ Geographic availability (latency matters)
Our Final Architecture (After All Optimizations)
Here's what our production system looks like now:
```python
class OptimizedLLMSystem:
    def __init__(self):
        self.cache = MultiLevelCache()  # Redis + in-memory
        self.models = ModelRouter()     # GPT-3.5, GPT-4, Claude
        self.metrics = CostTracker()    # Monitor everything

    async def answer_query(self, user_query, user_id):
        # Step 1: Preprocessing
        preprocessed = preprocess_query(user_query)
        if preprocessed['skip_llm']:
            return preprocessed['response']

        # Step 2: Check cache (exact + semantic)
        cached = await self.cache.get(user_query)
        if cached:
            self.metrics.record_cache_hit(user_id)
            return cached

        # Step 3: Retrieve relevant documents (with caching)
        docs = await self.retrieve_with_cache(user_query)

        # Step 4: Extract relevant excerpts (compress context)
        excerpts = extract_relevant_excerpts(docs, user_query)

        # Step 5: Route to the appropriate model
        complexity = calculate_query_complexity(user_query, excerpts)
        model = self.models.select_model(complexity)

        # Step 6: Construct the optimized prompt
        prompt = construct_compressed_prompt(user_query, excerpts)

        # Step 7: Generate with streaming
        response = await self.generate_streaming(
            model=model,
            prompt=prompt,
            max_tokens=self.calculate_optimal_length(user_query)
        )

        # Step 8: Cache the result
        await self.cache.set(user_query, response)

        # Step 9: Track metrics
        self.metrics.record_query(
            user_id=user_id,
            model=model,
            input_tokens=count_tokens(prompt),
            output_tokens=count_tokens(response),
            cost=self.calculate_cost(model, prompt, response),
            cached=False
        )
        return response
```
Results:
- Cost: $0.15 per query (was $0.50)
- Quality: 94% accuracy (was 95%, acceptable drop)
- Speed: 3.2s average (was 4.5s, improved with streaming)
- Cache hit rate: 28%
- Model distribution: 42% GPT-3.5, 43% GPT-4 Turbo, 15% GPT-4
Takeaways: The Cost Optimization Framework
The journey:
Stage 1: Prompt Optimization (30% savings, 2 weeks)
→ Compress context → Optimize instructions → Control output length
Stage 2: Infrastructure (26% more, 2 months)
→ Multi-level caching → Model routing → Batch processing
Stage 3: Advanced (14% more, 1-2 months)
→ Query preprocessing → Adaptive context → Template reuse
Total: 70% cost reduction over 4 months
Key principles:
- Measure everything (you can't optimize blindly)
- Prioritize by ROI (highest impact first)
- Maintain quality threshold (don't sacrifice accuracy for cost)
- Optimize early (before costs explode)
- Test incrementally (A/B test every change)
Start here:
If you're just beginning cost optimization:
- Week 1: Add exact caching (15% savings)
- Week 2: Compress prompts (15% savings)
- Week 3-4: Add model routing (20% savings)
That's roughly 40-50% savings in one month with relatively simple changes.
Need Help Optimizing Your LLM Costs?
We've optimized costs for 25+ production systems, saving clients $2M+ annually in API costs.
We offer a free LLM Cost Audit where we'll:
- ✅ Analyze your current cost structure
- ✅ Identify highest-impact optimizations
- ✅ Provide estimated savings (before/after)
- ✅ Recommend implementation roadmap
No obligation, just honest technical assessment.