## The Gap Between Dev and Prod LLM Observability
In development, LLM observability is easy. You enable tracing, call your model, see the trace. Great.
Production is different. You have concurrent requests, multiple models, token budgets, latency SLOs, hallucination risks, and cost spikes that can triple your AI budget overnight. A tracing setup that worked in a Jupyter notebook is not built for this.
Here's what production LLM observability actually requires.
## Instrumentation That Covers Every Call
The most common mistake: instrumenting your custom code but forgetting about third-party libraries.
```python
# Wrong: only instruments direct OpenAI calls
from ddtrace.llmobs import LLMObs

LLMObs.enable()

import openai
# ✓ OpenAI calls are traced

# But if you use LangChain, LlamaIndex, or custom RAG:
from langchain.chat_models import ChatOpenAI
# ✗ NOT traced unless the matching ddtrace integration is enabled

# Right: enable integrations explicitly
from ddtrace import patch
patch(langchain=True)

# Or use the comprehensive approach
import ddtrace
ddtrace.patch_all()  # patches all supported integrations (verify against your ddtrace version)
```
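One way to catch a forgotten library before it reaches production is to check which LLM client packages are importable in the deployed environment and compare them with what you have instrumented. The sketch below uses only the standard library. The `LLM_LIBRARIES` list and both function names are assumptions for illustration, not a list of what ddtrace supports.

```python
import importlib.util

# Assumed list of top-level packages that can issue LLM calls; adjust for your stack.
LLM_LIBRARIES = ["openai", "anthropic", "langchain", "llama_index"]


def find_installed(candidates: list[str]) -> list[str]:
    """Return the candidate top-level packages importable in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


def uninstrumented(candidates: list[str], instrumented: set[str]) -> list[str]:
    """Return installed packages that are missing from the instrumented set."""
    return [name for name in find_installed(candidates) if name not in instrumented]


if __name__ == "__main__":
    # Suppose only the OpenAI integration has been enabled so far.
    print(uninstrumented(LLM_LIBRARIES, instrumented={"openai"}))
```

Running a check like this at service startup, or in CI, turns "we forgot LangChain" into a visible warning rather than a silent gap in the traces.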
For Databricks Model Serving, add this to your serving endpoint initialization:
```python
import os

import mlflow

# Enable MLflow → Datadog bridge
os.environ["MLFLOW_TRACKING_URI"] = "databricks"
os.environ["DD_MLFLOW_METRICS_ENABLED"] = "true"
mlflow.set_tracking_uri("databricks")
# All model invocations now emit to both MLflow and Datadog
```
## The 7 Monitors Every LLM Production System Needs
### Monitor 1 — Token spend rate of change
```
Alert condition: week_over_week_change(token_spend) > 30%
Severity: WARNING
Message: "Token spend increased {value}% week-over-week.
          Check for new traffic patterns or prompt expansion."
```
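The alert condition above is pseudocode, so it helps to pin down the arithmetic. A minimal sketch, assuming you already have the total token spend for the current and previous week; `week_over_week_change` and `should_warn` are illustrative names, not Datadog functions.

```python
def week_over_week_change(current: float, previous: float) -> float:
    """Percent change from the previous week's token spend to the current week's."""
    if previous <= 0:
        raise ValueError("previous week's spend must be positive")
    return (current - previous) * 100.0 / previous


def should_warn(current: float, previous: float, threshold_pct: float = 30.0) -> bool:
    """True when spend grew by more than threshold_pct week-over-week."""
    return week_over_week_change(current, previous) > threshold_pct
```

For example, going from 100M to 150M tokens is a 50% increase and trips the 30% threshold, while 100M to 120M does not.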
### Monitor 2 — P95 latency per model
```
Alert condition: p95(llm.request.duration) by model_name > 5000ms
Severity: WARNING (> 5s) / CRITICAL (> 10s)
Message: "Model {model_name} P95 latency at {value}ms, exceeds SLO"
```
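To make the two severity tiers concrete, here is a small sketch of the same logic applied to raw samples. It assumes latency samples arrive as `(model_name, duration_ms)` pairs and uses a nearest-rank percentile; a metrics backend may interpolate differently, and the function names are illustrative.

```python
from collections import defaultdict


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based, integer math
    return ordered[rank - 1]


def latency_severity(samples: list[tuple[str, float]],
                     warn_ms: float = 5000, crit_ms: float = 10000) -> dict[str, str]:
    """Map each model name to OK, WARNING, or CRITICAL based on its P95 latency."""
    by_model = defaultdict(list)
    for model_name, duration_ms in samples:
        by_model[model_name].append(duration_ms)

    result = {}
    for model_name, durations in by_model.items():
        value = p95(durations)
        if value > crit_ms:
            result[model_name] = "CRITICAL"
        elif value > warn_ms:
            result[model_name] = "WARNING"
        else:
            result[model_name] = "OK"
    return result
```

Grouping by model before taking the percentile matters: a single slow model can hide inside a healthy fleet-wide P95.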
### Monitor 3 — Error rate spike
```
Alert condition: rate(llm.request.error) > 5% over 5m
Severity: CRITICAL
Message: "LLM error rate at {value}%. Possible model endpoint issue or rate limiting"
```
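The threshold here is a share of requests (5%, or 0.05 as a fraction) measured over a five-minute window. A minimal in-process sketch of that calculation, assuming each request is recorded with a timestamp in seconds; `ErrorRateWindow` is a hypothetical helper, not part of any Datadog client.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the error fraction of requests seen in the last window_s seconds."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._events = deque()  # (timestamp, is_error), appended in time order

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))

    def rate(self, now: float) -> float:
        """Fraction of requests in the window that failed (0.0 if no traffic)."""
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)
```

Note that the result is a fraction: compare it with `0.05`, and multiply by 100 before formatting it into a message that prints a percent sign.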
### Monitor 4 — Hallucination proxy metric
```
# Build a faithfulness evaluator and emit this custom metric
Alert condition: avg(custom.llm.faithfulness_score) by app < 0.75
Severity: WARNING
Message: "Faithfulness score dropped to {value} for {app}.
          Review recent prompt changes or retrieval quality."
```
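The monitor assumes a faithfulness evaluator exists. As a placeholder while you build a real one, the sketch below scores an answer by lexical overlap with the retrieved context. This is a deliberately crude stand-in (an assumption of this example, not a recommended evaluator): production systems usually use an LLM judge or an entailment model. The stopword list and function names are illustrative.

```python
import re

# Small illustrative stopword list; extend as needed.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "in", "on", "to", "and", "or", "it"}


def _content_tokens(text: str) -> list[str]:
    """Lowercased word tokens with common stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def faithfulness_score(answer: str, context: str) -> float:
    """Fraction of the answer's content tokens that also appear in the context."""
    answer_tokens = _content_tokens(answer)
    if not answer_tokens:
        return 1.0  # nothing asserted, so nothing unsupported
    context_vocab = set(_content_tokens(context))
    supported = sum(1 for t in answer_tokens if t in context_vocab)
    return supported / len(answer_tokens)
```

Whatever evaluator you end up with, keeping its output in the 0 to 1 range lets the 0.75 threshold in the monitor stay unchanged.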
### Monitor 5 — Per-agent cost anomaly
```
Alert condition: anomalies(sum(llm.tokens.total) by agent_id, 'agile', 3) > 0
Severity: WARNING
Message: "Agent {agent_id} token consumption is anomalous.
          Current: {value} / Historical baseline: {baseline}"
```
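To illustrate what "anomalous relative to a baseline" means, here is a plain z-score check per agent. It is a simplified stand-in, not Datadog's `agile` algorithm, which also accounts for trend and seasonality; the `bounds` parameter plays the role of the `3` in the condition above. Function names are illustrative.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, bounds: float = 3.0) -> bool:
    """True when current lies more than `bounds` standard deviations from the
    mean of history. Returns False if there is too little history."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > bounds


def anomalous_agents(history_by_agent: dict[str, list[float]],
                     current_by_agent: dict[str, float]) -> list[str]:
    """Return the sorted agent IDs whose current token total is anomalous."""
    return sorted(
        agent_id
        for agent_id, current in current_by_agent.items()
        if is_anomalous(history_by_agent.get(agent_id, []), current)
    )
```

Evaluating each agent against its own history is the point: an agent that normally uses 1M tokens a day and one that normally uses 10K need different baselines.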
### Monitor 6 — Prompt injection detection
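```python
import re

from datadog import statsd

# Custom check: run on each incoming prompt before it reaches the model.
# These phrases are a heuristic, not a complete defense.
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"disregard the above",
    r"you are now",
    r"DAN mode",
    r"jailbreak",
]

def check_for_injection_patterns(prompt: str) -> float:
    """Return the fraction of known injection patterns found in the prompt."""
    matches = sum(1 for p in INJECTION_PATTERNS if re.search(p, prompt, re.IGNORECASE))
    return matches / len(INJECTION_PATTERNS)

def report_injection_risk(prompt: str, session_id: str) -> float:
    # Emit as metric and alert if score > 0
    score = check_for_injection_patterns(prompt)
    statsd.gauge("custom.llm.injection_risk_score", score, tags=[f"session:{session_id}"])
    return score
```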
### Monitor 7 — RAG retrieval quality
```python
# After each RAG retrieval, emit a context relevance score
from datadog import statsd

def log_retrieval_quality(query: str, retrieved_docs: list,
                          relevance_scores: list[float], index_name: str) -> None:
    if not relevance_scores:
        return  # nothing retrieved, so there is no average to report
    avg_score = sum(relevance_scores) / len(relevance_scores)
    category = classify_query(query)  # classify_query: your own query classifier (not shown)
    statsd.gauge(
        "custom.rag.retrieval_relevance",
        avg_score,
        tags=[
            f"index:{index_name}",
            f"query_category:{category}",
        ],
    )
    if avg_score < 0.65:
        # Alert: retrieval is failing to find relevant context
        statsd.event(
            "RAG Relevance Warning",
            f"Average relevance {avg_score:.2f} below threshold for query category {category}",
            alert_type="warning",
        )
```
## Cost Control: The Part That Actually Saves Your Budget
Token costs compound fast. Here's a three-layer cost control architecture:
Layer 1 — Request-level token budgets
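```python
import tiktoken
from datadog import statsd

MAX_TOKENS_BY_AGENT = {
    "agentic-remediation": 2000,
    "report-generator": 8000,
    "chat-assistant": 1500,
}
DEFAULT_MAX_TOKENS = 2000

# cl100k_base matches some OpenAI models; use the encoding for your model
enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_token_limit(prompt: str, max_tokens: int) -> str:
    """One possible implementation: keep only the first max_tokens tokens."""
    return enc.decode(enc.encode(prompt)[:max_tokens])

def enforce_token_budget(agent_id: str, prompt: str) -> str:
    tokens = len(enc.encode(prompt))
    limit = MAX_TOKENS_BY_AGENT.get(agent_id, DEFAULT_MAX_TOKENS)
    if tokens > limit * 0.8:
        # Over 80% of the budget: truncate to 70% and log the event
        statsd.increment("custom.llm.prompt_truncated", tags=[f"agent:{agent_id}"])
        return truncate_to_token_limit(prompt, int(limit * 0.7))
    return prompt
```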
Layer 2 — Daily hard limits per team
Configure in Datadog via Usage Attribution: create a cost center per business unit and trigger an automated shutoff when a team reaches 150% of its daily budget.
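Whatever raises the alert, the shutoff itself has to be enforced somewhere in the request path. A minimal sketch of that gate, assuming per-team daily token budgets and the 150% cutoff described above. `DailyTeamBudget` is a hypothetical in-memory helper; a multi-process deployment would need a shared store instead.

```python
from collections import defaultdict
from datetime import date


class DailyTeamBudget:
    """In-memory daily token budget per team with a hard cutoff."""

    def __init__(self, budgets: dict[str, int], shutoff_ratio: float = 1.5):
        self.budgets = budgets              # team -> daily token budget
        self.shutoff_ratio = shutoff_ratio  # 1.5 means shut off at 150% of budget
        self._used = defaultdict(int)       # (team, day) -> tokens used so far

    def allow(self, team: str, tokens: int, day: date) -> bool:
        """Record the request and return True, or return False if it would
        push the team past shutoff_ratio times its daily budget."""
        limit = self.budgets[team] * self.shutoff_ratio
        key = (team, day)
        if self._used[key] + tokens > limit:
            return False
        self._used[key] += tokens
        return True
```

Keying usage by `(team, day)` means the counter resets naturally at the date boundary without a scheduled job.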
Layer 3 — Semantic caching
```python
import hashlib

from datadog import statsd
from redis import Redis

cache = Redis()
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt: str, model: str) -> str:
    # Exact-match key: only an identical (model, prompt) pair hits the cache
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        statsd.increment("custom.llm.cache_hit", tags=[f"model:{model}"])
        return cached.decode()
    result = call_llm(prompt, model)  # call_llm: your own model client wrapper (not shown)
    cache.setex(cache_key, CACHE_TTL, result)
    statsd.increment("custom.llm.cache_miss", tags=[f"model:{model}"])
    return result
```
Semantic caching alone typically reduces token spend by 20-40% for enterprise AI applications with repetitive query patterns.
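Hashing the prompt only catches byte-identical repeats. What makes a cache semantic is looking up by similarity, so that rephrasings of the same question also hit. The sketch below shows that lookup pattern with the standard library only. `toy_embed` is a bag-of-words stand-in and an assumption of this example: it captures word overlap, not meaning, and should be replaced by a real embedding model. The 0.9 threshold is also illustrative.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached response when a stored prompt is similar enough."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response)

    def get(self, prompt: str):
        """Return the response for the most similar stored prompt, or None."""
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((toy_embed(prompt), response))
```

The linear scan is fine for a sketch; at scale the same lookup is usually served by a vector index. Set the threshold conservatively, since a false hit returns the wrong answer with no model call to correct it.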
## The Observability Dashboard Structure
Organize your LLM dashboard into four sections:
1. Health — Error rate, P50/P95/P99 latency, availability by model
2. Quality — Faithfulness scores, retrieval relevance, hallucination rate proxy
3. Cost — Token spend by model/agent/team, week-over-week trend, projected monthly
4. Security — Injection risk scores, anomalous agent behavior, scope violations
This structure aligns to your stakeholders: ops owns health, AI team owns quality, finance owns cost, security owns the last section.
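The "projected monthly" figure in the Cost section can start as a simple run-rate projection. A minimal sketch, assuming spend accrues roughly evenly and that today counts as a full day; `projected_monthly_spend` is an illustrative name.

```python
import calendar
from datetime import date


def projected_monthly_spend(month_to_date: float, today: date) -> float:
    """Linear run-rate projection: average daily spend so far times days in month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date / today.day * days_in_month
```

A run rate is noisy in the first few days of a month, so it is worth showing next to the week-over-week trend rather than on its own.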
Production LLM observability is a discipline, not a feature flag. The teams that invest in it early spend 40% less on token costs and catch model drift weeks before their users notice.