
Technical Guide | 14 min read | March 25, 2025

# Datadog LLM Observability in Production: What Nobody Tells You

Alex Chen

AI Infrastructure Lead

## The Gap Between Dev and Prod LLM Observability

In development, LLM observability is easy. You enable tracing, call your model, see the trace. Great.

Production is different. You have concurrent requests, multiple models, token budgets, latency SLOs, hallucination risks, and cost spikes that can triple your AI budget overnight. A tracing setup that worked in a Jupyter notebook is not equipped for this.

Here's what production LLM observability actually requires.

## Instrumentation That Covers Every Call

The most common mistake: instrumenting your direct provider calls but forgetting the frameworks and third-party libraries that make LLM calls on your behalf.

```python
# Incomplete: assumes every LLM call goes through the OpenAI client directly
from ddtrace.llmobs import LLMObs
LLMObs.enable()
import openai
# ✓ OpenAI calls are traced

# But if you use LangChain, LlamaIndex, or custom RAG:
from langchain.chat_models import ChatOpenAI
# ✗ Traced only if ddtrace's LangChain integration is active. Some ddtrace
#   releases turn on supported integrations from LLMObs.enable(); verify this
#   for your version rather than assuming it.

# Right: enable integrations explicitly through the public patch() API
from ddtrace import patch
patch(langchain=True)

# Or use the comprehensive approach
import ddtrace
ddtrace.patch_all()  # patches all supported integrations
```



For Databricks Model Serving, add this to your serving endpoint initialization:

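```python
import os

import mlflow

# Point MLflow at the Databricks tracking server.
# The environment variable and set_tracking_uri() do the same job; one is enough.
os.environ["MLFLOW_TRACKING_URI"] = "databricks"
mlflow.set_tracking_uri("databricks")

# Flag intended to enable the MLflow → Datadog bridge. Confirm that your
# ddtrace / Datadog Agent version actually reads this variable before relying on it.
os.environ["DD_MLFLOW_METRICS_ENABLED"] = "true"

# With the bridge active, model invocations emit to both MLflow and Datadog
```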



## The 7 Monitors Every LLM Production System Needs

### Monitor 1 — Token spend rate of change


```
Alert condition: week_over_week_change(token_spend) > 30%
Severity: WARNING
Message: "Token spend increased {value}% week-over-week.
          Check for new traffic patterns or prompt expansion."
```
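
If you want to sanity-check the alert condition outside Datadog, the week-over-week calculation is small enough to sketch by hand. This is a minimal illustration, not a Datadog API: the function names and the way weekly totals are passed in are assumptions.

```python
def week_over_week_change(current: float, previous: float) -> float:
    """Return the percentage change from last week's spend to this week's."""
    if previous <= 0:
        raise ValueError("previous week's spend must be positive")
    return (current - previous) * 100.0 / previous


def should_alert(current: float, previous: float, threshold_pct: float = 30.0) -> bool:
    # Fires only when spend grew by more than the threshold (the > 30% condition)
    return week_over_week_change(current, previous) > threshold_pct
```

A drop in spend produces a negative percentage and never fires, which matches a monitor that only cares about increases.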



### Monitor 2 — P95 latency per model


```
Alert condition: p95(llm.request.duration) by model_name > 5000ms
Severity: WARNING (> 5s) / CRITICAL (> 10s)
Message: "Model {model_name} P95 latency at {value}ms exceeds SLO"
```
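
To make the per-model grouping concrete, here is a rough offline equivalent of that monitor. It uses a nearest-rank percentile, which may differ slightly from how Datadog aggregates distributions, and the `(model_name, duration_ms)` sample shape is an assumption for illustration.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_breaches(samples, warn_ms=5000, crit_ms=10000):
    """samples: iterable of (model_name, duration_ms). Returns {model_name: severity}."""
    by_model = defaultdict(list)
    for model_name, duration_ms in samples:
        by_model[model_name].append(duration_ms)

    breaches = {}
    for model_name, durations in by_model.items():
        value = p95(durations)
        if value > crit_ms:
            breaches[model_name] = "CRITICAL"
        elif value > warn_ms:
            breaches[model_name] = "WARNING"
    return breaches
```

Grouping before computing the percentile matters: a fast, high-volume model can otherwise hide a slow one in a fleet-wide P95.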



### Monitor 3 — Error rate spike


```
Alert condition: rate(llm.request.error) > 5% over 5m
Severity: CRITICAL
Message: "LLM error rate at {value}%: possible model endpoint issue or rate limiting"
```
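
The "over 5m" part is a sliding window. A minimal in-process sketch of that window follows; the class name and the caller-supplied timestamps are assumptions, and in practice Datadog computes this for you from the emitted metrics.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the error ratio of requests seen in the last window_seconds."""

    def __init__(self, window_seconds=300.0, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        # Drop events that have aged out of the window
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def error_rate(self):
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def breached(self):
        return self.error_rate() > self.threshold
```

Because the rate is a ratio, low-traffic periods are noisy: one failure out of ten requests already reads as 10%. Consider pairing the threshold with a minimum request count.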



### Monitor 4 — Hallucination proxy metric


```
# Build a faithfulness evaluator and emit this custom metric
Alert condition: avg(custom.llm.faithfulness_score) by app < 0.75
Severity: WARNING
Message: "Faithfulness score dropped to {value} for {app}.
          Review recent prompt changes or retrieval quality."
```
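
What the evaluator looks like is up to you. As a starting point only, here is a crude lexical proxy: the share of answer tokens that also appear in the retrieved context. This is an assumption-heavy stand-in, not a real faithfulness evaluator; production setups usually rely on an LLM judge or an entailment model, and the 0.75 threshold would need recalibrating for whatever scorer you choose.

```python
import re


def _tokens(text):
    # Lowercased alphanumeric tokens; punctuation is ignored
    return re.findall(r"[a-z0-9]+", text.lower())


def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the context (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(_tokens(context))
    supported = sum(1 for token in answer_tokens if token in context_vocab)
    return supported / len(answer_tokens)
```

A score like this rewards copying and misses paraphrase, so treat it as a trend signal: a sudden drop after a prompt change is worth a look even if the absolute number means little.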



### Monitor 5 — Per-agent cost anomaly


```
Alert condition: anomalies(sum(llm.tokens.total) by agent_id, 'agile', 3) > 0
Severity: WARNING
Message: "Agent {agent_id} token consumption is anomalous.
          Current: {value} / Historical baseline: {baseline}"
```
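
Datadog's anomaly detection accounts for trend and seasonality, which is more than a blog snippet can reproduce. To show the basic idea of "how far is this from the baseline", here is a plain z-score check. It is a deliberately simpler technique than the 'agile' algorithm, and the function name and inputs are illustrative assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, bounds=3.0):
    """Flag current if it sits more than `bounds` standard deviations from the baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any deviation at all is unusual
        return current != baseline
    return abs(current - baseline) / spread > bounds
```

The `bounds` argument plays the same role as the `3` in the monitor definition: widen it and you get fewer, more severe alerts.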



### Monitor 6 — Prompt injection detection

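```python
import re

from datadog import statsd

# Heuristic check: run it in your request path, before the prompt reaches the model.
# Pattern matching will not catch paraphrased or obfuscated attacks.
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"disregard the above",
    r"you are now",
    r"DAN mode",
    r"jailbreak",
]

def check_for_injection_patterns(prompt: str) -> float:
    """Return the fraction of known injection patterns found in the prompt."""
    matches = sum(1 for p in INJECTION_PATTERNS if re.search(p, prompt, re.IGNORECASE))
    return matches / len(INJECTION_PATTERNS)

def report_injection_risk(prompt: str, session_id: str) -> float:
    # Emit as metric and alert if score > 0
    score = check_for_injection_patterns(prompt)
    statsd.gauge("custom.llm.injection_risk_score", score, tags=[f"session:{session_id}"])
    return score
```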



### Monitor 7 — RAG retrieval quality

```python
# After each RAG retrieval, emit context relevance score
from datadog import statsd

def log_retrieval_quality(query: str, retrieved_docs: list, relevance_scores: list, index_name: str):
    # An empty result set is itself a retrieval failure, so score it as 0
    avg_score = sum(relevance_scores) / len(relevance_scores) if relevance_scores else 0.0
    # classify_query is your own query classifier, defined elsewhere
    category = classify_query(query)
    statsd.gauge(
        "custom.rag.retrieval_relevance",
        avg_score,
        tags=[
            f"index:{index_name}",
            f"query_category:{category}",
        ]
    )
    if avg_score < 0.65:
        # Alert: retrieval is failing to find relevant context
        statsd.event(
            "RAG Relevance Warning",
            f"Average relevance {avg_score:.2f} below threshold for query category {category}",
            alert_type="warning",
        )
```



## Cost Control: The Part That Actually Saves Your Budget

Token costs compound fast. Here's a three-layer cost control architecture:

Layer 1 — Request-level token budgets

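```python
import tiktoken
from datadog import statsd

MAX_TOKENS_BY_AGENT = {
    "agentic-remediation": 2000,
    "report-generator": 8000,
    "chat-assistant": 1500,
}

# Build the encoder once. cl100k_base matches GPT-4 and GPT-3.5-turbo;
# counts for other models are approximate.
ENC = tiktoken.get_encoding("cl100k_base")

def truncate_to_token_limit(prompt: str, max_tokens: int) -> str:
    # Keep the first max_tokens tokens and decode them back to text
    return ENC.decode(ENC.encode(prompt)[:max_tokens])

def enforce_token_budget(agent_id: str, prompt: str) -> str:
    tokens = len(ENC.encode(prompt))
    limit = MAX_TOKENS_BY_AGENT.get(agent_id, 2000)
    if tokens > limit * 0.8:
        # Over 80% of budget: truncate to 70% to leave headroom, and count it
        statsd.increment("custom.llm.prompt_truncated", tags=[f"agent:{agent_id}"])
        return truncate_to_token_limit(prompt, int(limit * 0.7))
    return prompt
```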



Layer 2 — Daily hard limits per team
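Tag your token metrics by team and business unit, then alert in Datadog when a team's daily spend crosses its budget. Two caveats: Datadog's Usage Attribution feature attributes your Datadog product usage, not your LLM provider spend, and a monitor can notify a webhook but cannot block requests on its own. Enforce the automated shutoff at 150% of budget in your own gateway or proxy.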
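
The enforcement side has to live in your own request path. A minimal in-memory sketch of a per-team daily guard with the 150% shutoff might look like this. The class, its method names, and the dollar-denominated budgets are all assumptions for illustration; a real deployment would keep the counters in a shared store so every replica sees the same totals.

```python
from collections import defaultdict
from datetime import date


class TeamBudgetGuard:
    """Tracks daily spend per team and reports when a team should be shut off."""

    def __init__(self, daily_budgets_usd, shutoff_ratio=1.5):
        self.daily_budgets_usd = dict(daily_budgets_usd)
        self.shutoff_ratio = shutoff_ratio  # 1.5 means shut off at 150% of budget
        self._spend = defaultdict(float)  # keyed by (team, day)

    def record_spend(self, team, cost_usd, day=None):
        self._spend[(team, day or date.today())] += cost_usd

    def is_shut_off(self, team, day=None):
        budget = self.daily_budgets_usd[team]
        spent = self._spend[(team, day or date.today())]
        return spent >= budget * self.shutoff_ratio
```

Call `is_shut_off` before dispatching a request and `record_spend` after the response comes back with its token usage. Keying on the day means the counters reset naturally at midnight.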

Layer 3 — Semantic caching

```python
import hashlib

from datadog import statsd
from redis import Redis

redis_client = Redis()
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt: str, model: str) -> str:
    # Key on an exact hash of model + prompt
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached is not None:
        statsd.increment("custom.llm.cache_hit", tags=[f"model:{model}"])
        return cached.decode()

    # call_llm is your existing model client wrapper, defined elsewhere
    result = call_llm(prompt, model)
    redis_client.setex(cache_key, CACHE_TTL, result)
    statsd.increment("custom.llm.cache_miss", tags=[f"model:{model}"])
    return result
```

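Strictly speaking, the cache above is exact-match: it keys on a hash of the model and prompt, so only byte-identical requests hit. A semantic cache matches on embedding similarity instead, so paraphrased queries hit too. Semantic caching alone typically reduces token spend by 20-40% for enterprise AI applications with repetitive query patterns.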
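
To show what matching on similarity rather than on an exact key means, here is a small self-contained sketch. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.9 threshold is arbitrary, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self._entries = []  # (model, vector, response)

    def get(self, prompt, model):
        """Return the response of the most similar cached prompt, or None."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for entry_model, vector, response in self._entries:
            if entry_model != model:
                continue  # never serve one model's answer for another
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, model, response):
        self._entries.append((model, self.embed(prompt), response))
```

The threshold is the knob that matters. Set it too low and users get answers to questions they did not ask; set it too high and you are back to exact-match behavior. Tune it against real traffic before trusting the hit rate.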

## The Observability Dashboard Structure

Organize your LLM dashboard into four sections:

1. Health — Error rate, P50/P95/P99 latency, availability by model
2. Quality — Faithfulness scores, retrieval relevance, hallucination rate proxy
3. Cost — Token spend by model/agent/team, week-over-week trend, projected monthly
4. Security — Injection risk scores, anomalous agent behavior, scope violations

This structure maps to your stakeholders: ops owns health, the AI team owns quality, finance owns cost, and security owns security.
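
If you manage dashboards as code, the four sections translate naturally into four widget groups. The sketch below builds a skeleton payload with placeholder note widgets. The field names reflect the general shape of Datadog's dashboard JSON (group and note widgets) as I understand it; treat them as assumptions and check them against the Dashboards API reference before posting anything.

```python
import json

# Section name -> placeholder panels, taken from the list above
SECTIONS = {
    "Health": ["Error rate", "P50/P95/P99 latency", "Availability by model"],
    "Quality": ["Faithfulness scores", "Retrieval relevance", "Hallucination rate proxy"],
    "Cost": ["Token spend by model/agent/team", "Week-over-week trend", "Projected monthly"],
    "Security": ["Injection risk scores", "Anomalous agent behavior", "Scope violations"],
}


def build_dashboard(title):
    """Return a dashboard skeleton with one group per section."""
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": [
            {
                "definition": {
                    "type": "group",
                    "layout_type": "ordered",
                    "title": section,
                    # Note widgets are placeholders to swap for real queries
                    "widgets": [
                        {"definition": {"type": "note", "content": panel}}
                        for panel in panels
                    ],
                }
            }
            for section, panels in SECTIONS.items()
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard("LLM Production Overview"), indent=2))
```

Starting from placeholders lets each owning team fill in its own group without stepping on the others.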

Production LLM observability is a discipline, not a feature flag. The teams that invest in it early spend 40% less on token costs and catch model drift weeks before their users notice.
