Advanced9 min read

Production Deployment

Harden prompts for production: caching, rate limiting, fallbacks, and monitoring.

The Gap Between Demo and Production

A prompt that works great in testing can fail spectacularly in production. Real users are creative, adversarial, and unpredictable. Production systems need resilience that prototypes don't require.

Prompt Injection Defense

Prompt injection occurs when user input overrides your system prompt instructions. Mitigations:

  1. Input sanitization — Strip or escape control characters and known injection patterns
  2. Constrained output — Request structured outputs (JSON, fixed schema) that are hard to corrupt
  3. Instruction isolation — Clearly separate system instructions from user data with delimiters
  4. Output validation — Validate every response before using it

Caching

Cache responses for identical or near-identical inputs. Most production workloads have significant repetition. Caching reduces cost and latency simultaneously.

Use semantic caching for near-duplicate detection — even if two prompts aren't identical, they may have the same intent and thus the same ideal response.

Rate Limiting and Fallbacks

Never assume the AI provider is available. Build fallbacks:

  • Retry logic with exponential backoff
  • Graceful degradation (simpler non-AI response)
  • Circuit breakers that disable AI features when error rates spike

Monitoring What Matters

Track in production:

  • Latency (p50, p95, p99)
  • Error rate by error type
  • Token usage and cost per request
  • Output quality (run evals on sampled production traffic)
  • User satisfaction signals (thumbs up/down, re-prompts)

Version Management

When you update a production prompt, do it with a feature flag or gradual rollout. Monitor metrics during the transition. Have a rollback plan ready.

Never update a production prompt at 5 PM on a Friday.

Loading…