# Prompt Engineering Guide

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What task should the prompt accomplish?** (Classification, generation, extraction, transformation, summarization)
2. **What model will run the prompt?** (GPT-4, Claude, Gemini, open-source — capabilities differ)
3. **What does the input look like?** (Free text, structured data, documents, code)
4. **What does the ideal output look like?** (JSON, markdown, plain text — provide 2-3 real examples)
5. **What are the failure modes?** (Hallucinations, wrong format, refusals, missing edge cases)
6. **How will you evaluate quality?** (Human review, automated checks, ground truth comparison)

## Prompt design template

### 1. Task Definition

Write a clear system prompt that defines the task, role, and constraints:

```
You are a [role] that [core task].
Your job is to [specific action] given [input description].

Rules:
- [Constraint 1: e.g., respond only in valid JSON]
- [Constraint 2: e.g., never fabricate information not in the source]
- [Constraint 3: e.g., if uncertain, say "I don't know"]

Output format:
[Exact schema or structure the model must follow]
```

Every sentence should constrain behavior or clarify expectations. Avoid vague instructions like "be helpful."
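The template above can be filled in programmatically so the same structure is reused across prompts. This is a minimal sketch; the function name and field names are illustrative, not part of the guide:

```python
# Sketch: rendering the system-prompt template from structured parts.
# All parameter values passed in at call time are illustrative examples.
SYSTEM_TEMPLATE = """\
You are a {role} that {core_task}.
Your job is to {action} given {input_description}.

Rules:
{rules}

Output format:
{output_format}
"""

def build_system_prompt(role, core_task, action, input_description, rules, output_format):
    """Render the template, emitting one '- ' bullet per rule."""
    return SYSTEM_TEMPLATE.format(
        role=role,
        core_task=core_task,
        action=action,
        input_description=input_description,
        rules="\n".join(f"- {r}" for r in rules),
        output_format=output_format,
    )
```

Keeping the template in one place makes "change one thing per iteration" (see the iteration methodology below) easy to enforce.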

### 2. Few-Shot Examples

Include 2-5 input-output examples in the prompt. Examples teach format and edge cases more reliably than instructions alone.

```
Input: "The product crashed twice today and support hasn't responded."
Output: {"sentiment": "negative", "topics": ["reliability", "support"], "urgency": "high"}

Input: "Love the new dashboard — the filters are exactly what I needed."
Output: {"sentiment": "positive", "topics": ["dashboard", "filters"], "urgency": "low"}
```

Cover the typical case, a boundary case, and the hardest case. If the model handles the hardest example correctly, the easier cases usually follow.
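One common way to deliver few-shot examples is as alternating user/assistant turns rather than inline text. This sketch assumes an OpenAI-style list of role/content message dicts; adapt the shape to whatever client you actually call:

```python
# Sketch: packing the few-shot examples above into a chat message list.
# The {"role": ..., "content": ...} shape is an assumption about your API.
import json

FEW_SHOT = [
    ("The product crashed twice today and support hasn't responded.",
     {"sentiment": "negative", "topics": ["reliability", "support"], "urgency": "high"}),
    ("Love the new dashboard — the filters are exactly what I needed.",
     {"sentiment": "positive", "topics": ["dashboard", "filters"], "urgency": "low"}),
]

def build_messages(system_prompt, user_input):
    """System prompt first, then each example as a user/assistant pair."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_in, example_out in FEW_SHOT:
        messages.append({"role": "user", "content": example_in})
        messages.append({"role": "assistant", "content": json.dumps(example_out)})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Message-pair examples keep the model's prior turns in exactly the format you want it to continue.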

### 3. Chain-of-Thought Patterns

For reasoning tasks, instruct the model to show its work before answering:

- **Step-by-step:** "Think through this step by step before giving your final answer." Best for math and multi-step analysis.
- **Explain-then-answer:** "First explain your reasoning, then provide the answer on a new line starting with 'Answer:'." Best when you need to audit logic.
- **Self-critique:** "After drafting your response, review it for errors before outputting the final version." Best for generation tasks.

When chain-of-thought adds tokens without improving accuracy (simple classification), skip it.
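The explain-then-answer pattern needs a parsing step on your side, since the reasoning text precedes the answer. A minimal sketch, assuming the instruction above ("answer on a new line starting with 'Answer:'") was followed:

```python
# Sketch: pulling the final answer out of an explain-then-answer response.
# Takes the LAST "Answer:" line, in case the reasoning itself mentions one.
def extract_answer(response):
    """Return the text after the last 'Answer:' line, or None if absent."""
    answer = None
    for line in response.splitlines():
        if line.strip().startswith("Answer:"):
            answer = line.strip()[len("Answer:"):].strip()
    return answer
```

Returning `None` on a missing marker (instead of guessing) lets your pipeline count it as a format failure.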

### 4. Output Formatting

Specify the exact output structure. Ambiguity in format is the top cause of parsing failures.

```
Respond with a JSON object matching this schema exactly:
{
  "summary": "string (1-2 sentences)",
  "confidence": "number (0.0-1.0)",
  "categories": ["string array, from: billing, technical, feature-request, other"],
  "requires_escalation": "boolean"
}
Do not include any text outside the JSON object.
```

For structured outputs, provide the schema and a completed example. For free-text, specify length, tone, and inclusions/exclusions.
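Whatever format you specify, validate it before downstream use. This sketch checks the schema above by hand; a JSON Schema library would also work, but this keeps the dependency surface at zero:

```python
# Sketch: validating a model response against the schema above.
import json

ALLOWED_CATEGORIES = {"billing", "technical", "feature-request", "other"}

def validate_response(raw):
    """Parse and check the response; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    conf = data.get("confidence")
    # Exclude bool explicitly: isinstance(True, int) is True in Python.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must be a number in [0.0, 1.0]")
    cats = data.get("categories")
    if not isinstance(cats, list) or not set(cats) <= ALLOWED_CATEGORIES:
        raise ValueError("categories must be a list drawn from the allowed set")
    if not isinstance(data.get("requires_escalation"), bool):
        raise ValueError("requires_escalation must be a boolean")
    return data
```

Raising on violations (rather than silently coercing) is what makes the format-compliance metric below measurable.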

### 5. Evaluation Framework

Define how you will measure prompt quality before deploying. Build a test set of 20-50 examples with expected outputs.

| Metric | Method | Pass Threshold |
|---|---|---|
| Format compliance | JSON schema validation | 100% |
| Classification accuracy | Match against labels | >90% |
| Hallucination rate | Human review sample | <5% |
| Latency (p95) | API response time | <3s |
| Cost per request | Token count * price | <$0.02 |

Run the full test set after every prompt change. A prompt that improves accuracy but breaks format compliance is a regression, not an improvement.
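The first two metrics can be computed with a small harness. In this sketch, `run_prompt` is a stand-in for your actual model call, injected as a parameter so the harness stays testable offline; the `{"label": ...}` output shape is an assumed example:

```python
# Sketch of a tiny evaluation harness for a classification prompt.
import json

def evaluate(test_set, run_prompt):
    """Return format-compliance and accuracy over (input, expected_label) pairs."""
    parsed, correct = 0, 0
    for text, expected in test_set:
        raw = run_prompt(text)
        try:
            label = json.loads(raw)["label"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # format failure counts against both metrics
        parsed += 1
        if label == expected:
            correct += 1
    n = len(test_set)
    return {"format_compliance": parsed / n, "accuracy": correct / n}
```

Because the model call is injected, the same harness runs against recorded outputs, a mock, or the live API.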

### 6. Iteration Methodology

Follow this loop for every prompt revision:

1. **Run the current prompt** against the full test set. Record scores.
2. **Identify failure patterns.** Group errors by type: format, accuracy, hallucination, edge cases.
3. **Change one thing.** One modification per iteration — instructions, examples, or structure.
4. **Re-run the test set.** Compare scores to the previous version.
5. **Keep or revert.** If any metric degrades, revert and try a different approach.

Log every iteration: what changed, the hypothesis, and results. Prompt engineering without records is guessing.
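An append-only JSONL file is one simple way to keep that record. A minimal sketch; the field names and file path are illustrative choices, not a prescribed format:

```python
# Sketch: appending one record per iteration to a JSONL log, tying each
# prompt change to its hypothesis and measured results.
import json

def log_iteration(path, change, hypothesis, scores):
    """Append one JSON object per line; JSONL keeps the log diff-friendly."""
    entry = {"change": change, "hypothesis": hypothesis, "scores": scores}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

One line per iteration means the log doubles as a dataset: you can later plot score trajectories across revisions.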

## Quality checklist

Before delivering a prompt, verify:

- [ ] System prompt states task, role, and constraints concretely — no vague instructions
- [ ] Output format specified with a schema or example, not just described
- [ ] 2-5 few-shot examples cover typical, boundary, and difficult cases
- [ ] Chain-of-thought included only when it measurably improves accuracy
- [ ] Test set of 20+ examples exists with expected outputs
- [ ] Evaluation metrics and thresholds defined before testing begins
- [ ] Prompt tested on the target model — not assumed to transfer from another
- [ ] Token usage and cost within budget for expected volume
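The last check is simple arithmetic. A sketch of the calculation; the default per-1K-token prices are placeholders, so look up your provider's current rates before relying on the numbers:

```python
# Sketch: estimating monthly spend from average token counts and volume.
# Default prices are placeholders, not any provider's actual rates.
def monthly_cost(input_tokens, output_tokens, requests_per_month,
                 price_in_per_1k=0.0025, price_out_per_1k=0.01):
    """Cost per request (input + output tokens at their rates) times volume."""
    per_request = (input_tokens / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    return per_request * requests_per_month
```

Run this with your real prompt's token counts: few-shot examples and chain-of-thought both inflate `input_tokens` and `output_tokens`, which is why the checklist asks you to justify them.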

## Common mistakes to avoid

- **Writing instructions instead of showing examples.** When instructions and examples conflict, the model follows examples. One example teaches format better than a paragraph of description.
- **Optimizing on vibes.** "This feels better" is not evaluation. Build a test set and compare versions quantitatively.
- **Changing multiple things at once.** One change per iteration — otherwise you cannot attribute improvement or regression.
- **Ignoring model differences.** A prompt tuned for GPT-4 may underperform on Claude or Gemini. Test on the target model.
- **Skipping edge cases in examples.** Happy-path-only examples cause hallucination on unusual inputs. Include the hardest realistic cases.
- **Over-engineering simple tasks.** Start minimal and add complexity only when the test set demands it.
