Prompt Engineering Guide
Design, test, and optimize LLM prompts systematically — with evaluation frameworks, chain-of-thought patterns, output formatting, and iteration methodology for reliable AI outputs.
# Prompt Engineering: Support Ticket Classifier, Helios SaaS

## Task Definition

**Goal:** Classify incoming support tickets into one of 6 categories for automatic queue routing.

**Model:** Claude 3.5 Sonnet. **Volume:** ~800 tickets/day, budget $0.01/ticket max.

### System Prompt (v4, final)

```
You are a support ticket classifier for Helios, a project management SaaS product.
Given a support ticket (subject and body), classify it into exactly one category.

Categories:
- billing: Payments, invoices, subscriptions, refunds, pricing
- bug: Something broken, errors, crashes, unexpected behavior
- feature_request: Requests for new functionality
- account: Login, password resets, SSO, permissions
- integration: API questions, webhooks, third-party connections
- how_to: Usage questions, workflow guidance

Rules:
- If a ticket spans two categories, choose the one requiring action first.
- Broken integrations are "bug" not "integration."
- If uncertain, respond with "unknown".

Respond with JSON only:
{"category": "string", "confidence": "high | medium | low", "reasoning": "1 sentence"}
```

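The output contract in the prompt can be checked mechanically before a reply is routed. The following is a minimal sketch of such a check, not the validator used in this project: the function name `parse_classification` is hypothetical, and accepting `"unknown"` as a category is an assumption drawn from the prompt's third rule.

```python
import json

# Category names and confidence levels copied from the system prompt.
# "unknown" is included because the prompt's rules allow it (assumption).
CATEGORIES = {"billing", "bug", "feature_request", "account",
              "integration", "how_to", "unknown"}
CONFIDENCES = {"high", "medium", "low"}
REQUIRED_KEYS = {"category", "confidence", "reasoning"}


def parse_classification(raw: str) -> dict:
    """Parse a model reply; raise ValueError if it breaks the output contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError("reply must be an object with exactly: "
                         "category, confidence, reasoning")
    category, confidence = data["category"], data["confidence"]
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    if not isinstance(confidence, str) or confidence not in CONFIDENCES:
        raise ValueError(f"unexpected confidence: {confidence!r}")
    if not isinstance(data["reasoning"], str) or not data["reasoning"].strip():
        raise ValueError("reasoning must be a non-empty string")
    return data
```

A reply that raises `ValueError` here would count against the format compliance metric rather than being routed.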
## Few-Shot Examples

```
Ticket: "Can't export CSV anymore" / "Export button throws 500 error since yesterday."
Output: {"category": "bug", "confidence": "high", "reasoning": "Specific error on previously working feature."}

Ticket: "Add Gantt chart view" / "Would love a timeline view for projects."
Output: {"category": "feature_request", "confidence": "high", "reasoning": "Requesting new functionality."}

Ticket: "Slack notifications stopped" / "Integration was fine until last week, nothing changed."
Output: {"category": "bug", "confidence": "high", "reasoning": "Previously working integration is broken."}

Ticket: "How do I set up recurring tasks?" / "I want tasks to recreate every Monday."
Output: {"category": "how_to", "confidence": "high", "reasoning": "Asking how to use an existing feature."}
```

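The examples share one layout: a `Ticket:` line with quoted subject and body, then an `Output:` line with the JSON. The sketch below shows one way to generate that layout from structured data so new examples stay consistent. The source does not say how the examples are delivered to the model, so joining them into a single user message that ends with an open `Output:` cue is an assumption, and all three function names are hypothetical.

```python
import json


def format_ticket(subject: str, body: str) -> str:
    """Render a ticket as: Ticket: "subject" / "body".

    Quotes inside subject or body are not escaped in this sketch.
    """
    return f'Ticket: "{subject}" / "{body}"'


def format_example(subject: str, body: str, category: str,
                   confidence: str, reasoning: str) -> str:
    """Render one few-shot example in the two-line layout shown above."""
    output = json.dumps({"category": category,
                         "confidence": confidence,
                         "reasoning": reasoning})
    return f"{format_ticket(subject, body)}\nOutput: {output}"


def build_user_message(examples: list[str], subject: str, body: str) -> str:
    """Join formatted examples, then append the new ticket with an open cue."""
    new_ticket = format_ticket(subject, body) + "\nOutput:"
    return "\n\n".join(examples + [new_ticket])
```

Generating the `Output:` line with `json.dumps` keeps every example valid JSON, which matters because the model tends to imitate the formatting it is shown.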
## Evaluation Framework

Test set: 50 manually labeled tickets, stratified by category.

| Metric | Method | Threshold |
|--------|--------|-----------|
| Format compliance | JSON schema validation | 100% |
| Accuracy | Exact match vs. human labels | > 90% |
| High-confidence accuracy | Accuracy on "high"-confidence predictions only | > 95% |
| Latency (p95) | API response time | < 2s |
| Cost per ticket | Token counts × per-token prices | < $0.01 |

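The first three metrics in the table can be computed directly from parsed predictions and human labels. This is a minimal sketch under stated assumptions: the function names are hypothetical, a reply that failed schema validation is represented as `None`, and such a reply is counted as a wrong answer for accuracy.

```python
from typing import Optional


def evaluate(predictions: list[Optional[dict]], labels: list[str]) -> dict:
    """Compute format compliance, accuracy, and high-confidence accuracy.

    predictions[i] is the parsed reply for ticket i, or None if the reply
    failed JSON schema validation. A None counts as a wrong answer.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty "
                         "and the same length")
    pairs = list(zip(predictions, labels))
    total = len(labels)
    valid = sum(p is not None for p, _ in pairs)
    correct = sum(p is not None and p["category"] == y for p, y in pairs)
    high = [(p, y) for p, y in pairs
            if p is not None and p["confidence"] == "high"]
    high_correct = sum(p["category"] == y for p, y in high)
    return {
        "format_compliance": valid / total,
        "accuracy": correct / total,
        # None when nothing was marked "high", to avoid dividing by zero.
        "high_confidence_accuracy": high_correct / len(high) if high else None,
    }


def passes_thresholds(metrics: dict) -> bool:
    """Apply the table's thresholds: 100%, > 90%, > 95%."""
    high_acc = metrics["high_confidence_accuracy"]
    return (metrics["format_compliance"] == 1.0
            and metrics["accuracy"] > 0.90
            and high_acc is not None
            and high_acc > 0.95)
```

Latency and cost are left out because they come from API response metadata rather than from the labels.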
## Iteration Log

| Version | Change | Accuracy | Key Issue |
|---------|--------|----------|-----------|
| v1 | Category list only, no examples | 74% | Confused integration/bug (6 errors), inconsistent JSON (3 errors) |
| v2 | Added 3 few-shot examples | 84% | how_to/feature_request confusion on "Can I do X?" tickets |
| v3 | Added how_to example + broken-integration rule | 90% | 2 billing/account edge cases, 2 ambiguous bug/feature |
| v4 | Added Slack edge case + confidence field | 92% | 4 remaining errors are genuine ambiguities (60% human agreement) |

**v4 accepted.** High-confidence accuracy: 96%. Format compliance: 100%. Avg cost: $0.003/ticket. Avg latency: 680ms.

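On a 50-ticket test set each error is worth 2 percentage points, so the accuracy column maps to whole error counts. The short check below reproduces that arithmetic and the daily spend implied by the stated volume; it uses only figures from this document, and the variable names are illustrative.

```python
TEST_SET_SIZE = 50       # from the Evaluation Framework section
TICKETS_PER_DAY = 800    # approximate volume from the Task Definition
COST_PER_TICKET = 0.003  # measured v4 average, in dollars
BUDGET_PER_TICKET = 0.01

accuracy_pct = {"v1": 74, "v2": 84, "v3": 90, "v4": 92}

# Integer arithmetic keeps the counts exact:
# errors = size * (100 - accuracy) / 100.
errors = {version: TEST_SET_SIZE * (100 - pct) // 100
          for version, pct in accuracy_pct.items()}
print(errors)  # {'v1': 13, 'v2': 8, 'v3': 5, 'v4': 4}

daily_cost = round(TICKETS_PER_DAY * COST_PER_TICKET, 2)
daily_budget = round(TICKETS_PER_DAY * BUDGET_PER_TICKET, 2)
print(daily_cost, daily_budget)  # 2.4 8.0
```

The v4 count of 4 matches the "4 remaining errors" in the log. v3 sits exactly at 90%, which does not clear the "> 90%" threshold, while v4 does.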
## Production Configuration

- **Temperature:** 0 (near-deterministic; identical inputs can still occasionally yield different outputs)
- **Max tokens:** 150
- **Fallback:** "low" confidence tickets route to human triage
- **Monitoring:** Alert if any category's daily volume exceeds 2x its 30-day average
- **Re-evaluation:** Monthly with 10 fresh test tickets added per cycle
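The fallback and monitoring rules above can be sketched as two small functions. This is an illustration under stated assumptions, not the production code: the queue name `human_triage` and both function names are hypothetical, and sending `"unknown"` tickets to human triage is an assumption, since the source only specifies the route for "low" confidence.

```python
from statistics import mean

# Hypothetical queue name for tickets a human should review.
HUMAN_TRIAGE = "human_triage"


def route(result: dict) -> str:
    """Return the queue for a parsed classification.

    "low" confidence goes to human triage, per the Fallback rule.
    Routing "unknown" there as well is an assumption.
    """
    if result["confidence"] == "low" or result["category"] == "unknown":
        return HUMAN_TRIAGE
    return result["category"]


def volume_alerts(today: dict[str, int],
                  history: dict[str, list[int]],
                  factor: float = 2.0) -> list[str]:
    """List categories whose count today exceeds factor x their daily mean.

    history maps each category to its daily counts for the previous 30 days.
    Categories with no history are skipped rather than alerted on.
    """
    alerts = []
    for category, count in today.items():
        past = history.get(category, [])
        if past and count > factor * mean(past):
            alerts.append(category)
    return sorted(alerts)
```

For example, with a 30-day mean of 20 bug tickets per day, a day with 50 bug tickets exceeds the 2x threshold of 40 and would trigger an alert.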