You're building an LLM application. You need it to be accurate on your specific domain—legal documents, medical records, proprietary code, whatever.
The question everyone asks: "Should I use RAG, fine-tune the model, or just write better prompts?"
The answer everyone hates: "It depends."
But here's the thing: it depends on very specific, measurable factors. And once you understand those factors, the decision becomes straightforward.
I've built 25+ production LLM systems over the past 3 years. Some used RAG. Some used fine-tuning. Most used sophisticated prompt engineering. A few used all three.
Here's the framework I use to decide—and the technical details you need to implement each approach correctly.
TL;DR: The Decision Matrix
If you just want the answer:
Use PROMPT ENGINEERING when:
✅ Model already knows your domain (general knowledge)
✅ You need to control output format or style
✅ Budget/time constrained (<$5K, <2 weeks)
✅ Requirements change frequently
Use RAG when:
✅ Need access to specific, up-to-date information
✅ Information changes frequently (documents, data)
✅ Cannot fit all context in prompt (>128K tokens)
✅ Want to cite sources and provide transparency
✅ Budget: $10K-$50K, Timeline: 1-2 months
Use FINE-TUNING when:
✅ Need model to learn new patterns or behaviors
✅ Need to teach domain-specific language/style
✅ Need extreme consistency across many queries
✅ Have 1,000+ high-quality training examples
✅ Budget: $50K-$200K, Timeline: 2-4 months
Use RAG + FINE-TUNING when:
✅ Complex domain requiring both new knowledge AND new behavior
✅ Example: Legal AI that needs case law (RAG) and legal reasoning (fine-tuning)
✅ Budget: $100K+, Timeline: 3-6 months
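The matrix above can be sketched as a tiny triage function. The thresholds are the ones from this article, not hard rules, and the function name is hypothetical:

```python
def recommend_approach(needs_fresh_knowledge, needs_new_behavior,
                       labeled_examples, budget_usd):
    """Return a starting-point recommendation from the decision matrix above."""
    if needs_fresh_knowledge and needs_new_behavior and budget_usd >= 100_000:
        return "RAG + fine-tuning"
    if needs_new_behavior and labeled_examples >= 1_000 and budget_usd >= 50_000:
        return "fine-tuning"
    if needs_fresh_knowledge and budget_usd >= 10_000:
        return "RAG"
    return "prompt engineering"
```

Treat the output as a default to validate, not a final answer; the rest of this article covers the factors the function deliberately ignores.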
But let's go deeper. Because the devil (and success) is in the details.
Understanding the Three Approaches
Approach 1: Prompt Engineering
What it is: Crafting the input prompt to guide the model toward desired outputs—without changing the model itself.
How it works:
# Basic prompt
prompt = "Summarize this document"
# Engineered prompt
prompt = """
You are a legal document analyst with 10 years of experience.
Your task: Summarize the attached contract, focusing on:
1. Key obligations of each party
2. Payment terms and schedules
3. Termination clauses
4. Liability limitations
Format your response as a bulleted list under each heading.
Use precise legal terminology.
Document:
{document_text}
"""
Techniques:
- Role prompting ("You are an expert...")
- Few-shot examples (show desired input/output pairs)
- Chain-of-thought (ask model to show reasoning)
- System prompts (persistent instructions)
- Output formatting (JSON, markdown, structured data)
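The few-shot technique from the list above can be mechanized: interleave example input/output pairs as prior chat turns before the real query. A minimal sketch (the helper name and message shape are illustrative, matching the chat-message format used later in this article):

```python
def build_few_shot_messages(system, examples, user_input):
    """Interleave (input, output) example pairs as chat turns before the real query."""
    messages = [{"role": "system", "content": system}]
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": user_input})
    return messages
```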
Pros:
- ✅ Fast: Deploy in minutes to days
- ✅ Cheap: Just API costs ($0.001-$0.10 per request)
- ✅ Flexible: Change prompts instantly
- ✅ No training data required
- ✅ Works with any model (GPT-4, Claude, etc.)
Cons:
- ❌ Limited by model's existing knowledge
- ❌ Context window constraints (4K-128K tokens)
- ❌ Can be fragile (small prompt changes = big output changes)
- ❌ Harder to enforce consistency across many queries
- ❌ Prompt injection risks (users can manipulate prompts)
When it's enough:
- Model already understands your domain (general knowledge)
- You're controlling how it responds, not what it knows
- Your data fits in context window
- You need to iterate quickly
Approach 2: Retrieval-Augmented Generation (RAG)
What it is: Retrieve relevant information from an external knowledge base, then feed it to the LLM as part of the prompt.
How it works:
# Step 1: User asks a question
user_query = "What's our return policy for electronics?"
# Step 2: Convert query to embedding
query_embedding = embed(user_query)
# Step 3: Search vector database for relevant docs
relevant_docs = vector_db.search(
    query_embedding,
    top_k=5,
    similarity_threshold=0.8
)
# Step 4: Construct prompt with retrieved context
prompt = f"""
Answer the user's question based on the following documents:
{format_docs(relevant_docs)}
User question: {user_query}
If the answer isn't in the provided documents, say "I don't know."
"""
# Step 5: Generate answer
answer = llm.generate(prompt)
Architecture:
┌─────────────────────────────────────────┐
│ 1. Document Ingestion                   │
│ Documents → Chunks → Embeddings         │
│ → Store in Vector DB                    │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│ 2. Query Time                           │
│ User Query → Embedding                  │
│ → Search Vector DB                      │
│ → Retrieve Top K Documents              │
│ → Feed to LLM with Query                │
│ → Generate Answer                       │
└─────────────────────────────────────────┘
Pros:
- ✅ Access to an effectively unlimited knowledge base (not limited by the context window)
- ✅ Always up-to-date (update docs, RAG uses new info immediately)
- ✅ Transparency & citation (can show which docs were used)
- ✅ No training required (just index documents)
- ✅ Cost-effective for large knowledge bases
Cons:
- ❌ Retrieval quality critical (bad retrieval = bad answers)
- ❌ Infrastructure required (vector DB, embeddings, pipelines)
- ❌ Latency overhead (retrieval + generation takes longer)
- ❌ Doesn't teach model new behaviors (only provides information)
- ❌ Chunking strategy matters (split docs wrong = poor results)
When it's the right choice:
- You have specific documents/data model needs to reference
- Information changes frequently (docs update often)
- Need to cite sources for transparency/trust
- Knowledge base too large for context window
- Want to add new information without retraining
Approach 3: Fine-Tuning
What it is: Further train a base model on your specific data to teach it new patterns, behaviors, or domain knowledge.
How it works:
# Step 1: Prepare training data (1,000+ examples)
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical diagnosis assistant."},
            {"role": "user", "content": "Patient has fever, cough, fatigue for 3 days"},
            {"role": "assistant", "content": "Likely differential diagnoses:\n1. Viral URI (most common)\n2. Influenza\n3. COVID-19\n\nRecommend: PCR testing, symptomatic treatment..."}
        ]
    },
    # ... 999+ more examples
]
# Step 2: Fine-tune the model (training data must first be uploaded as a JSONL file;
# see the full example in the fine-tuning deep-dive below)
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    model="gpt-4o-mini-2024-07-18",  # a model version that supports fine-tuning
    hyperparameters={"n_epochs": 3}
)

# Step 3: Use the fine-tuned model (available once the job succeeds)
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[{"role": "user", "content": "Patient has severe headache and neck stiffness"}]
)
What fine-tuning teaches:
- New domain knowledge: Medical terminology, legal language, code patterns
- Specific behaviors: How to structure responses, what format to use
- Style and tone: Formal, casual, technical, simple
- Task-specific skills: Classification, extraction, reasoning patterns
- Consistency: Reliably follow domain conventions
Pros:
- ✅ Model learns your domain deeply
- ✅ Consistent behavior across queries
- ✅ Can handle domain-specific language
- ✅ No retrieval overhead (knowledge baked in)
- ✅ Can teach complex reasoning patterns
- ✅ Better at long-tail/edge cases (if trained on them)
Cons:
- ❌ Expensive: $10K-$200K+ for quality fine-tuning
- ❌ Time-consuming: 2-4 months minimum
- ❌ Requires large training dataset (1,000+ high-quality examples)
- ❌ Hard to update (need to retrain to add new info)
- ❌ Risk of overfitting (model gets too specific)
- ❌ Hallucination risk (model "invents" info if undertrained)
- ❌ Model drift over time (base model updates, your fine-tune doesn't)
When it's the right choice:
- Need to teach model new behaviors or reasoning patterns
- Have 1,000+ high-quality training examples
- Need extreme consistency (legal, medical, financial domains)
- Domain language differs significantly from general English
- Budget and timeline support it ($50K+, 2-4 months)
The Technical Deep-Dive: When to Choose What
Let's get specific. Here are the decision criteria:
Factor 1: What Are You Actually Trying to Fix?
Problem: Model doesn't have the information
Example: "Who won the 2024 company sales award?"
Solution: RAG
Why: Model can't know company-specific information. Need to retrieve from docs.
# RAG retrieves from company HR database
relevant_docs = search("2024 sales award winner")
# Returns: "Sarah Johnson won 2024 Q4 Sales Award"
Wrong approach: Fine-tuning
- Would need to retrain every time someone wins an award
- Inefficient and expensive
Problem: Model doesn't respond in the right format/style
Example: Model gives casual responses, you need formal legal language
Solution: Prompt Engineering (or fine-tuning if extreme consistency needed)
Why: Model knows the information, you're just shaping how it responds.
# Prompt engineering
system_prompt = """
You are a formal legal assistant. Always:
- Use proper legal terminology
- Structure responses in numbered paragraphs
- Cite relevant statutes when applicable
- Maintain professional tone
"""
Wrong approach: RAG
- Providing more documents won't change response style
- That's a generation problem, not a knowledge problem
Problem: Model doesn't understand domain-specific reasoning
Example: Medical diagnosis requires understanding symptom patterns and differential diagnosis workflow
Solution: Fine-Tuning
Why: Need to teach model the reasoning process, not just facts.
# Fine-tune on medical reasoning examples
training_example = {
    "input": "Patient: fever 102°F, severe headache, neck stiffness, photophobia",
    "output": "Concerning for meningitis. Differential diagnosis:\n1. Bacterial meningitis (most urgent)\n2. Viral meningitis\n3. Subarachnoid hemorrhage\n\nImmediate actions:\n- Lumbar puncture\n- Blood cultures\n- Head CT if focal neuro signs\n\nEmpiric antibiotics if bacterial suspected (don't wait for LP results)."
}
Wrong approach: Prompt engineering alone
- Can improve somewhat, but won't teach systematic reasoning
- Model needs to internalize the clinical decision-making process
Factor 2: How Much Data Do You Have?
You have: Documents, knowledge base, FAQs, etc.
Solution: RAG
Why: Documents can be indexed and retrieved. Don't need labeled training data.
You have: 100-500 examples
Solution: Prompt Engineering with few-shot examples
Why: Not enough for fine-tuning (need 1,000+), but can show examples in prompt.
# Few-shot prompt (5 examples)
prompt = f"""
Extract key information from contracts in this format:

Example 1:
Contract: [sample contract 1]
Output: {{
    "parties": ["Company A", "Company B"],
    "start_date": "2024-01-01",
    "value": "$500,000"
}}

Example 2:
[sample contract 2]
...

Now extract from this contract:
{new_contract}
"""
You have: 1,000+ high-quality labeled examples
Solution: Fine-Tuning (potentially)
Why: Enough data to train without overfitting. Can teach model your specific patterns.
But consider: Does prompt engineering + RAG get you 90% of the way there for 10% of the cost?
You have: 10,000+ examples with clear patterns
Solution: Fine-Tuning (strongly consider it)
Why: Excellent dataset. Fine-tuning will be effective and consistent.
Factor 3: How Often Does Your Information Change?
Information is static (changes monthly/yearly)
Solution: Fine-Tuning or RAG (both work)
Example: Legal precedents, medical textbooks, historical data
Information changes weekly/daily
Solution: RAG (strongly preferred)
Why: Can update documents immediately without retraining.
Information changes in real-time (hourly/continuously)
Solution: RAG with real-time indexing
Why: Only RAG can keep up with real-time changes.
# Real-time indexing
@app.post("/documents")
async def index_new_document(doc):
    # Extract text
    text = extract_text(doc)
    # Generate embedding
    embedding = embed(text)
    # Index immediately (available in seconds)
    await vector_db.insert(doc.id, embedding, text)
    return {"status": "indexed", "id": doc.id}
Factor 4: Do You Need to Cite Sources?
Need transparency and citations
Solution: RAG (only option)
Why: RAG retrieves specific documents, so you can cite them.
# RAG with citations
answer = {
    "text": "The return policy allows returns within 30 days...",
    "sources": [
        {"doc": "Return_Policy_2024.pdf", "page": 3},
        {"doc": "Customer_Service_Guidelines.docx", "section": "2.1"}
    ]
}
Fine-tuning: Model can't cite sources (knowledge is baked in, no way to trace back)
Factor 5: What's Your Context Window Situation?
All information fits in context (<32K tokens)
Solution: Prompt Engineering or RAG (both work)
# Everything in one prompt
prompt = f"""
Here are all relevant policies:
{policy_doc_1}
{policy_doc_2}
{policy_doc_3}
User question: {question}
"""
Information exceeds context window (>32K tokens)
Solution: RAG (required)
Why: Can't fit everything in prompt. Need to retrieve most relevant subset.
Factor 6: What's Your Budget and Timeline?
Budget: <$10K, Timeline: <2 weeks
Solution: Prompt Engineering only
Budget: $10K-$50K, Timeline: 1-2 months
Solution: RAG
Budget: $50K-$200K, Timeline: 2-4 months
Solution: Fine-Tuning or RAG + Fine-Tuning
Budget: $200K+, Timeline: 4-6 months
Solution: Full Hybrid System (RAG + Fine-Tuning + Advanced Engineering)
Deep Dive: RAG Implementation
Since RAG is the most commonly used approach, let's go technical:
RAG Architecture Components
1. Document Ingestion Pipeline
async def ingest_document(file_path):
    """
    Process document for RAG system.
    """
    # Extract text
    text = extract_text(file_path)
    # Chunk into manageable pieces
    chunks = chunk_document(
        text,
        chunk_size=1000,  # tokens
        overlap=200       # token overlap between chunks
    )
    # Generate embeddings for each chunk
    embeddings = []
    for chunk in chunks:
        embedding = await generate_embedding(chunk.text)
        embeddings.append({
            'text': chunk.text,
            'embedding': embedding,
            'metadata': {
                'source': file_path,
                'chunk_id': chunk.id,
                'page': chunk.page
            }
        })
    # Store in vector database
    await vector_db.insert_batch(embeddings)
    return len(embeddings)
Critical decisions:
Chunk size:
- Too small (100-200 tokens): Loses context
- Too large (2000+ tokens): Less precise retrieval
- Sweet spot: 500-1500 tokens
- Consider document structure (paragraphs, sections)
Overlap:
- Prevents important info from being split across chunks
- Typical: 10-20% overlap
- Example: 1000 token chunks with 200 token overlap
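The chunk-size and overlap guidance above can be sketched as a sliding window. This toy version splits on words for simplicity; a production pipeline would count tokens with the model's tokenizer and respect paragraph boundaries:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Sliding-window chunking: each chunk shares `overlap` words with the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks
```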
2. Embedding Strategy
# Choose embedding model
EMBEDDING_MODELS = {
    'openai': 'text-embedding-3-large',  # $0.13 per 1M tokens
    'cohere': 'embed-english-v3.0',      # $0.10 per 1M tokens
    'voyage': 'voyage-2',                # $0.12 per 1M tokens
}

# Generate embedding
def generate_embedding(text, provider='openai'):
    if provider == 'openai':
        response = client.embeddings.create(
            input=text,
            model=EMBEDDING_MODELS['openai']
        )
        return response.data[0].embedding
Model selection considerations:
- Accuracy: OpenAI's text-embedding-3-large is currently best (as of Jan 2025)
- Cost: Varies 2-3x between providers
- Latency: All are fast (<100ms for single embedding)
- Dimension: 1024-3072 dimensions (higher = more accurate but more storage)
Pro tip: Test multiple embedding models on your data. Quality varies by domain.
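For intuition on what the vector database computes at query time: most of them rank chunks by cosine similarity between the query embedding and each stored embedding. A minimal reference implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In production you never compute this pairwise against every chunk; the vector DB's approximate-nearest-neighbor index does the equivalent search efficiently.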
3. Vector Database Selection
# Pinecone (managed, serverless)
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")
index.upsert(vectors=[
    ("doc1", embedding1, {"text": "...", "source": "..."}),
    ("doc2", embedding2, {"text": "...", "source": "..."})
])

# Query
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True
)
| Database | Type | Pros | Cons | Cost |
|---|---|---|---|---|
| Pinecone | Managed | Easy, scalable | Vendor lock-in | $70-$500/mo |
| Weaviate | Self-hosted | Open source, flexible | Need to manage | Infrastructure only |
| Qdrant | Self-hosted | Fast, Python-native | Newer, smaller community | Infrastructure only |
| pgvector | PostgreSQL extension | Leverage existing Postgres | Limited scale, slower | Existing DB cost |
Recommendation: Start with Pinecone (fast setup), migrate to self-hosted if cost becomes issue at scale.
4. Retrieval Strategy
Don't just do simple vector search. Use hybrid approach:
async def retrieve_relevant_docs(query, top_k=5):
    """
    Multi-strategy retrieval for better results.
    """
    # Strategy 1: Semantic vector search
    query_embedding = await generate_embedding(query)
    vector_results = await vector_db.search(
        query_embedding,
        top_k=20  # Over-retrieve
    )
    # Strategy 2: Keyword search (BM25)
    keyword_results = await elasticsearch.search(
        query=query,
        top_k=20
    )
    # Strategy 3: Hybrid ranking (combine both)
    combined_results = merge_results(
        vector_results,
        keyword_results,
        weights={'vector': 0.7, 'keyword': 0.3}
    )
    # Strategy 4: Rerank with cross-encoder
    reranked = await rerank_with_cross_encoder(
        query=query,
        documents=combined_results[:10],
        model='cross-encoder-ms-marco'
    )
    return reranked[:top_k]
Why hybrid approach:
- Vector search: Good for semantic similarity
- Keyword search: Good for exact matches (names, technical terms)
- Reranking: Improves precision significantly (+10-20% accuracy)
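One way to implement the merge_results step from the snippet above is reciprocal rank fusion (RRF), which combines two rankings without needing their scores to be comparable (an alternative to the weighted merge shown). A sketch, where k=60 is the conventional default constant:

```python
def merge_results(vector_ids, keyword_ids, k=60):
    """Reciprocal rank fusion: score each doc by summed 1/(k + rank) across rankings."""
    scores = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both rankings get two contributions, so they naturally float to the top even when their raw scores live on different scales.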
5. Prompt Construction with Retrieved Context
def construct_rag_prompt(query, retrieved_docs):
"""
Build prompt with retrieved context.
"""
# Format retrieved documents
context = "\n\n".join([
f"Document {i+1} (from {doc.source}):\n{doc.text}"
for i, doc in enumerate(retrieved_docs)
])
# Build prompt
prompt = f"""
Answer the user's question based on the provided documents.
IMPORTANT INSTRUCTIONS:
- Use ONLY information from the provided documents
- If the answer isn't in the documents, say "I don't have enough information to answer that"
- Cite which document(s) you used (e.g., "According to Document 2...")
- Be specific and detailed in your answer
DOCUMENTS:
{context}
USER QUESTION:
{query}
ANSWER:
"""
return prompt
Prompt engineering for RAG:
✅ Do:
- Instruct model to use only provided docs
- Tell it to admit "I don't know" if info missing
- Ask for citations
- Provide document formatting that's easy to parse
❌ Don't:
- Let model fill in gaps with its own knowledge (leads to hallucinations)
- Make prompt too long (wastes tokens)
- Forget to handle edge cases (no relevant docs found)
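For the "no relevant docs found" edge case in the don'ts above, a small guard that refuses before the LLM is ever called keeps hallucination risk down. A sketch, assuming each retrieved doc carries a similarity score field:

```python
REFUSAL = "I don't have enough information to answer that."

def docs_or_refusal(retrieved_docs, min_score=0.5):
    """Return (docs, None) when retrieval is usable, or ([], refusal_message) otherwise."""
    confident = [d for d in retrieved_docs if d.get("score", 0.0) >= min_score]
    if not confident:
        return [], REFUSAL  # skip the LLM call entirely
    return confident, None
```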
RAG Failure Modes and Fixes
Problem 1: Retrieval Returns Irrelevant Documents
Symptoms:
- Model says "I don't know" even though answer exists
- Model answers based on wrong documents
- Search quality score: <70%
Fixes:
- ✅ Improve chunking strategy (preserve context)
- ✅ Add hybrid search (vector + keyword)
- ✅ Tune retrieval parameters (top_k, similarity threshold)
- ✅ Try different embedding models
- ✅ Add metadata filtering (date, source, category)
Problem 2: Model Hallucinates Despite RAG
Symptoms:
- Model invents information not in retrieved docs
- Confident but incorrect answers
- Citations to non-existent documents
Fixes:
# Stricter system prompt
system_prompt = """
You are a helpful assistant that answers questions based ONLY on provided documents.
CRITICAL RULES:
1. If the information is NOT in the provided documents, say "I don't have that information in the provided documents."
2. DO NOT use your general knowledge to fill in gaps
3. DO NOT make assumptions or inferences beyond what's explicitly stated
4. ALWAYS cite which document you're using (e.g., "According to Document 2...")
If you're unsure, err on the side of saying you don't know.
"""
# Validate answer against sources
def validate_answer(answer, source_docs):
    """
    Check if answer is grounded in source docs.
    """
    # Use LLM to verify
    verification_prompt = f"""
Source documents:
{source_docs}

Claimed answer:
{answer}

Question: Is this answer fully supported by the source documents?
Answer only "YES" or "NO" with brief explanation.
"""
    verification = llm.generate(verification_prompt)
    # Check the verdict itself; a bare substring test for "NO" also matches "NOT", "KNOW", etc.
    if verification.strip().upper().startswith("NO"):
        return "I don't have sufficient information to answer that accurately."
    return answer
Problem 3: Context Window Overflow
Symptoms:
- Retrieved docs too long for context window
- Model truncates input
- Answers missing information from later docs
Fixes:
def fit_context_window(retrieved_docs, max_tokens=6000):
    """
    Ensure retrieved docs fit in context window.
    """
    docs_with_tokens = [
        (doc, count_tokens(doc.text))
        for doc in retrieved_docs
    ]
    # Sort by relevance score
    docs_with_tokens.sort(key=lambda x: x[0].score, reverse=True)
    # Include docs until we hit token limit
    selected_docs = []
    total_tokens = 0
    for doc, token_count in docs_with_tokens:
        if total_tokens + token_count <= max_tokens:
            selected_docs.append(doc)
            total_tokens += token_count
        else:
            break
    return selected_docs
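The count_tokens helper above is assumed; the exact count comes from the tokenizer matching your model. If you don't want a tokenizer dependency, the common rough heuristic for English text is about four characters per token:

```python
def estimate_tokens(text):
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)
```

Use the real tokenizer when you're close to the limit; the heuristic is only safe with a generous margin.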
Deep Dive: Fine-Tuning Implementation
Fine-tuning is more complex. Here's what's actually involved:
Step 1: Data Preparation (Hardest Part)
You need 1,000+ high-quality examples in this format:
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a medical diagnosis assistant."
        },
        {
            "role": "user",
            "content": "Patient presents with fever 102°F, severe headache, neck stiffness"
        },
        {
            "role": "assistant",
            "content": "Concerning for meningitis. Differential:\n1. Bacterial meningitis (urgent)\n2. Viral meningitis\n3. SAH\n\nActions:\n- Lumbar puncture\n- Blood cultures\n- Consider empiric antibiotics"
        }
    ]
}
Data quality matters more than quantity:
✅ 1,000 excellent examples > 10,000 mediocre examples
✅ Diverse examples (cover edge cases)
✅ Consistent format and style
✅ Expert-reviewed (not crowdsourced)
Where to get training data:
Option 1: Historic data
- Customer support tickets (question + best answer)
- Internal documentation (Q&A pairs)
- Expert annotations on past cases
Option 2: Synthetic generation
# Generate training examples with GPT-4
def generate_training_examples(topic, count=100):
    examples = []
    for i in range(count):
        # Generate diverse scenarios
        prompt = f"""
Generate a realistic {topic} scenario and expert response.

Format:
Scenario: [realistic user question/input]
Expert Response: [detailed, accurate response]
"""
        response = gpt4.generate(prompt)
        # Parse and structure
        example = parse_into_training_format(response)
        examples.append(example)
    return examples

# Generate 1000 examples
training_data = generate_training_examples("medical diagnosis", 1000)

# CRITICAL: Have domain experts review ALL synthetic examples
# Bad synthetic data = worse than no fine-tuning
Option 3: Hybrid (historic + synthetic + expert review)
- Start with historic data (500 examples)
- Generate synthetic to fill gaps (500 examples)
- Domain experts review and correct all (1,000 examples)
- Best approach for quality
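Whichever data option you choose, hold out a test set before training so the evaluation in Step 3 isn't run on examples the model has already seen. A minimal split helper:

```python
import random

def split_training_data(examples, test_fraction=0.1, seed=42):
    """Shuffle deterministically, then carve off a held-out test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```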
Step 2: Training
# OpenAI fine-tuning API
import time
from openai import OpenAI

client = OpenAI()

# Upload training file (JSONL, one example per line)
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job (use a model version that supports fine-tuning)
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,                  # Number of passes through data
        "learning_rate_multiplier": 2,  # Scales the default learning rate
        "batch_size": 4                 # Examples per training step
    }
)

# Monitor training
while job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")

# Training complete
model_id = job.fine_tuned_model
Training time:
- Small dataset (1K examples): 1-3 hours
- Medium dataset (10K examples): 6-12 hours
- Large dataset (100K examples): 24-48 hours
Cost (OpenAI pricing):
- Training: ~$0.10 per 1K tokens in training data
- Inference: 8x base model cost
- Example: 1M token training data = $100 training cost
- Ongoing: If base model costs $0.01/1K tokens, fine-tuned costs $0.08/1K tokens
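The arithmetic above, as a sketch (the prices are the assumed figures from this section, not a quote):

```python
def finetune_training_cost(training_tokens, price_per_1k_tokens=0.10):
    """Training cost estimate: tokens / 1,000 * price per 1K tokens."""
    return training_tokens / 1_000 * price_per_1k_tokens
```

With the default price, 1M training tokens comes out to the $100 figure quoted above.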
Step 3: Evaluation
def evaluate_fine_tuned_model(model_id, test_set):
    """
    Test fine-tuned model on held-out test set.
    """
    results = []
    for example in test_set:
        # Generate with fine-tuned model
        response = client.chat.completions.create(
            model=model_id,
            messages=example['messages'][:-1]  # Exclude the reference assistant response
        )
        predicted = response.choices[0].message.content
        expected = example['messages'][-1]['content']
        # Calculate similarity
        similarity = calculate_similarity(predicted, expected)
        results.append({
            'input': example['messages'][-2]['content'],
            'expected': expected,
            'predicted': predicted,
            'similarity': similarity
        })
    # Calculate metrics
    avg_similarity = sum(r['similarity'] for r in results) / len(results)
    return {
        'average_similarity': avg_similarity,
        'results': results
    }
Evaluation metrics:
- Accuracy (for classification tasks)
- BLEU score (for generation tasks)
- Human evaluation (gold standard)
- Domain-specific metrics (e.g., medical diagnosis accuracy)
Acceptable performance:
- 90% accuracy for high-stakes domains (medical, legal, financial)
- 80% for moderate-stakes domains
- 70% for low-stakes domains
If below threshold: Need more/better training data or different approach.
The Hybrid Approach: RAG + Fine-Tuning
Sometimes you need both. Here's when and how:
When to Use Both
Scenario 1: Complex Domain with Specific Knowledge + Specific Behavior
Example: Legal AI assistant
What RAG provides:
- Access to case law database (millions of cases)
- Latest legislation and regulations
- Firm-specific precedents
What fine-tuning provides:
- Legal reasoning patterns
- Proper legal citation format
- Jurisdiction-specific analysis style
- Risk assessment methodology
Architecture:
async def hybrid_legal_assistant(query):
    # Step 1: RAG retrieves relevant cases
    relevant_cases = await rag_search(query, top_k=10)
    # Step 2: Fine-tuned model analyzes with legal reasoning
    analysis = await fine_tuned_model.generate(
        prompt=f"""
Query: {query}

Relevant precedents:
{format_cases(relevant_cases)}

Provide legal analysis following firm standards:
1. Applicable law
2. Relevant precedents
3. Analysis
4. Recommendation
""",
        model="gpt-4-legal-finetuned"
    )
    return analysis
Result: Best of both worlds
- RAG: Up-to-date legal knowledge
- Fine-tuning: Consistent legal reasoning and format
Implementation Strategy
Phase 1: Start with RAG (Month 1-2)
- Get system working with RAG
- Validate approach
- Collect real user queries and desired responses
- Cost: $20K-$40K
Phase 2: Fine-tune if needed (Month 3-4)
- Use real queries as training data
- Fine-tune for consistent behavior
- A/B test RAG-only vs. RAG+fine-tuning
- Only proceed if fine-tuning adds significant value
- Additional cost: $50K-$100K
Decision point: Does fine-tuning improve results enough to justify 2x cost?
Practical Decision Framework
Let's make this concrete. Answer these questions:
Question Set 1: Knowledge vs. Behavior
Q: What are you trying to improve?
- A) Model doesn't have the information → Use RAG
- B) Model doesn't respond in the right format/style → Use Prompt Engineering (or fine-tuning if extreme consistency needed)
- C) Model doesn't follow domain-specific reasoning patterns → Use Fine-Tuning
- D) Model needs specific information AND specific behavior → Use RAG + Fine-Tuning
Question Set 2: Data Situation
Q: What data do you have?
- A) Documents, knowledge base, database (no labeled examples) → Use RAG
- B) 100-500 input/output examples → Use Prompt Engineering with few-shot
- C) 1,000-5,000 labeled examples → Consider Fine-Tuning (if other factors support it)
- D) 10,000+ labeled examples → Fine-Tuning strongly recommended
Question Set 3: Update Frequency
Q: How often does your information change?
- A) Real-time or daily → RAG (only option)
- B) Weekly or monthly → RAG (preferred) or Fine-tuning (acceptable)
- C) Quarterly or yearly → Fine-Tuning (acceptable) or RAG (also fine)
- D) Static (never changes) → Fine-Tuning (most efficient) or RAG (more flexible)
Question Set 4: Budget & Timeline
Q: What's your budget and timeline?
- A) <$10K, <2 weeks → Prompt Engineering only
- B) $10K-$50K, 1-2 months → RAG
- C) $50K-$200K, 2-4 months → Fine-Tuning or RAG + Fine-Tuning
- D) $200K+, 4-6 months → Full Hybrid System
Real-World Examples from Our Projects
Example 1: Agent22 (Employee Onboarding) - RAG Only
Problem: Employees asking same questions, knowledge scattered across 10K+ docs
Approach: RAG
Why:
- ✅ Needed access to company docs (knowledge)
- ✅ Docs updated frequently
- ✅ No consistent "behavior" to teach (just answer questions)
- ✅ Budget: $85K (fit RAG perfectly)
Architecture:
- Indexed 10K documents in Pinecone
- Semantic search for relevant docs
- GPT-4 generates answers based on retrieved docs
- No fine-tuning needed
Result: 80% adoption, 67% faster onboarding
Example 2: Sokrateque (Academic Research) - Advanced RAG
Problem: Researchers need to find relevant papers in 10M+ paper corpus
Approach: Sophisticated RAG with multiple strategies
Why:
- ✅ Massive knowledge base (10M papers)
- ✅ Papers updated constantly (new publications daily)
- ✅ Needed citation transparency
- ✅ No specific "behavior" to teach (just retrieve and summarize)
Architecture:
- Multi-strategy retrieval (semantic + citation graph + keyword)
- Hierarchical search (fast → precise)
- GPT-4 for synthesis
- Citation validation to prevent hallucinations
Result: 10x faster literature reviews, 90%+ accuracy
Example 3: LAWEP.AI (Legislative Drafting) - RAG + Fine-Tuning
Problem: Draft legislation using precedents from 1.2M bills, following legal conventions
Approach: RAG for precedents + Fine-tuning for drafting style
Why:
- ✅ Needed access to 1.2M bills (knowledge) → RAG
- ✅ Needed to teach legislative drafting conventions (behavior) → Fine-tuning
- ✅ Bills updated constantly (new legislation) → RAG
- ✅ Consistent legal formatting required → Fine-tuning
- ✅ Budget: $105K (supported hybrid)
Architecture:
- RAG retrieves relevant precedent bills
- Fine-tuned model drafts in proper legislative format
- Constitutional risk analysis (fine-tuned reasoning)
- Human review before finalization
Result: 70% faster drafting, 0 constitutional challenges
Lesson: Some problems genuinely need both. Don't force yourself into one approach.
Common Mistakes to Avoid
Mistake 1: Fine-Tuning When RAG Would Suffice
Scenario: "We need the model to know our product documentation. Let's fine-tune!"
Why this is wrong:
- Product docs change frequently
- Fine-tuning locks in knowledge (hard to update)
- RAG is faster, cheaper, and more flexible
Correct approach: RAG with product docs indexed
When we see this: ~40% of initial consultations
Our response: "Let's start with RAG and only fine-tune if RAG isn't sufficient"
Mistake 2: Using RAG When Prompt Engineering Would Work
Scenario: "Users want responses in JSON format. Let's build a RAG system!"
Why this is wrong:
- This is a formatting issue, not a knowledge issue
- Prompt engineering can handle format control
- Building RAG adds complexity with no benefit
Correct approach: System prompt with JSON formatting instructions
system_prompt = """
Return all responses as JSON with this structure:
{
"answer": "your answer here",
"confidence": 0.95,
"sources": ["source1", "source2"]
}
"""
Mistake 3: Skipping Prompt Engineering Entirely
Scenario: "RAG/fine-tuning will handle everything, no need for good prompts!"
Why this is wrong:
- Even with RAG or fine-tuning, prompts matter
- Good prompts improve results significantly (20-30%)
- Bad prompts sabotage even great architectures
Correct approach: Invest in prompt engineering regardless of architecture
Mistake 4: Not Testing Approaches
Scenario: "We'll fine-tune because that seems most sophisticated."
Why this is wrong:
- Assumption-based decisions lead to wasted effort
- Might not need fine-tuning
- Could have validated with RAG first
Correct approach:
- Start with prompt engineering (1 week)
- Add RAG if needed (2 weeks)
- Fine-tune only if RAG+prompting insufficient (2 months)
Validate incrementally, don't jump to most complex solution.
Cost Comparison: Real Numbers
Let's compare actual costs for a typical use case (customer support knowledge base):
Scenario:
- 1,000 documents in knowledge base
- 10,000 queries per month
- Need accurate, sourced answers
Option 1: Prompt Engineering Only
Setup Cost: $5K-$10K
- Engineering time: 40-80 hours
- No infrastructure needed
Monthly Cost: $300-$500
- API costs: $300-$500 (GPT-4)
- No additional infrastructure
Year 1 Total: $8,600-$16,000
Pros: Cheapest, fastest
Cons: No access to documents beyond context window, can't cite sources
Option 2: RAG
Setup Cost: $20K-$40K
- Engineering time: 150-300 hours
- Vector DB setup
- Embedding generation
Monthly Cost: $800-$1,500
- API costs: $500-$1,000 (GPT-4 + embeddings)
- Vector DB: $200-$300 (Pinecone)
- Infrastructure: $100-$200
Year 1 Total: $29,600-$58,000
Pros: Access to all documents, citable sources, updatable
Cons: More complex, higher cost
Option 3: Fine-Tuning
Setup Cost: $50K-$150K
- Data preparation: $20K-$50K
- Training: $10K-$30K
- Engineering: $20K-$70K
Monthly Cost: $2,400-$4,000
- API costs: $2,400-$4,000 (fine-tuned model is 8x base price)
- No vector DB needed
Year 1 Total: $78,800-$198,000
Pros: Consistent behavior, no retrieval overhead
Cons: Most expensive, hardest to update, longest timeline
Option 4: RAG + Fine-Tuning
Setup Cost: $80K-$200K
- RAG setup: $20K-$40K
- Fine-tuning: $60K-$160K
Monthly Cost: $3,200-$5,500
- API costs: $3,000-$5,000
- Vector DB: $200-$300
- Infrastructure: $100-$200
Year 1 Total: $118,400-$266,000
Pros: Best quality, most comprehensive
Cons: Most expensive, most complex
Decision based on budget:
| Budget | Recommended Approach |
|---|---|
| <$20K | Prompt Engineering only |
| $20K-$50K | RAG |
| $50K-$100K | RAG (sophisticated) or Fine-Tuning (if justified) |
| $100K+ | RAG + Fine-Tuning (if needed) |
Final Recommendation: Start Simple, Scale Complexity
Here's the playbook that works:
Week 1-2: Prompt Engineering
- Spend 1-2 weeks optimizing prompts
- Test different structures, examples, instructions
- Measure: accuracy, consistency, user satisfaction
- Cost: $5K-$10K
Decision point: Is this good enough (>80% accuracy, >4/5 user satisfaction)?
- Yes: Ship it. Done.
- No: Proceed to RAG
Month 1-2: Add RAG
- Index your documents
- Implement retrieval
- Combine with optimized prompts
- Test thoroughly
- Additional cost: $20K-$40K
Decision point: Is this good enough now (>90% accuracy, >4.5/5 satisfaction)?
- Yes: Ship it. Done.
- No: Consider fine-tuning
Month 3-4: Fine-Tune (If Truly Needed)
- Collect 1,000+ training examples
- Fine-tune model
- A/B test RAG vs. RAG+Fine-tuning
- Only deploy if significant improvement (>10% accuracy gain)
- Additional cost: $50K-$100K
Decision point: Does fine-tuning provide enough value to justify 2-3x cost?
- Yes: Deploy hybrid system
- No: Stick with RAG
When to Get Help
You should consider partnering with AI specialists if:
- ✅ You're unsure which approach is right
- ✅ You've tried one approach and it's not working
- ✅ Your use case is complex (might need hybrid)
- ✅ You need production quality quickly
- ✅ Budget >$50K (justifies expert guidance)
We offer a free technical architecture review where we'll:
- Analyze your specific use case
- Recommend RAG vs. fine-tuning vs. hybrid
- Provide implementation guidance
- Share estimated timelines and costs
No sales pitch, just technical guidance.
Book Free Architecture Review →
Conclusion: The Right Tool for the Right Job
There is no "best" approach. There's only the right approach for your specific situation.
Decision framework recap:
Prompt Engineering:
Controlling format/style, Budget <$10K, Timeline <2 weeks, Model already knows domain
RAG:
Need access to specific documents, Information changes frequently, Need to cite sources, Budget $10K-$50K, Timeline 1-2 months
Fine-Tuning:
Teaching new behaviors/patterns, Need extreme consistency, Have 1,000+ quality examples, Budget $50K-$200K, Timeline 2-4 months
RAG + Fine-Tuning:
Complex domain requiring both, Budget $100K+, Timeline 3-6 months
The winning strategy: Start simple, add complexity only when justified.
Most projects succeed with RAG + excellent prompt engineering. Fine-tuning is the exception, not the rule.
Need Help with Your AI Project?
We offer free 45-minute strategy calls to help you avoid these mistakes.
Book Free Call


