prompt-engineering-guide/SKILL.md

Design, test, and optimize LLM prompts systematically — with evaluation frameworks, chain-of-thought patterns, output formatting, and iteration methodology for reliable AI outputs.
# Prompt Engineering Guide

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What task should the prompt accomplish?** (Classification, generation, extraction, transformation, summarization)
2. **What model will run the prompt?** (GPT-4, Claude, Gemini, open-source — capabilities differ)
3. **What does the input look like?** (Free text, structured data, documents, code)
4. **What does the ideal output look like?** (JSON, markdown, plain text — provide 2-3 real examples)
5. **What are the failure modes?** (Hallucinations, wrong format, refusals, missing edge cases)
6. **How will you evaluate quality?** (Human review, automated checks, ground truth comparison)

## Prompt design template

### 1. Task Definition

Write a clear system prompt that defines the task, role, and constraints:

```
You are a [role] that [core task].
Your job is to [specific action] given [input description].

Rules:
- [Constraint 1: e.g., respond only in valid JSON]
- [Constraint 2: e.g., never fabricate information not in the source]
- [Constraint 3: e.g., if uncertain, say "I don't know"]

Output format:
[Exact schema or structure the model must follow]
```

Every sentence should constrain behavior or clarify expectations. Avoid vague instructions like "be helpful."
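Filled in for a concrete task, the template might read as follows (the role, rules, and schema here are illustrative, not prescribed by this guide):

```
You are a support-ticket classifier that labels incoming customer messages.
Your job is to assign a category and an urgency level given the raw ticket text.

Rules:
- Respond only in valid JSON.
- Never fabricate details not present in the ticket.
- If the category is unclear, use "other".

Output format:
{"category": "billing | technical | other", "urgency": "low | high"}
```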

### 2. Few-Shot Examples

Include 2-5 input-output examples in the prompt. Examples teach format and edge cases more reliably than instructions alone.

```
Input: "The product crashed twice today and support hasn't responded."
Output: {"sentiment": "negative", "topics": ["reliability", "support"], "urgency": "high"}

Input: "Love the new dashboard — the filters are exactly what I needed."
Output: {"sentiment": "positive", "topics": ["dashboard", "filters"], "urgency": "low"}
```

Cover the typical case, a boundary case, and the hardest case. If the model gets the hard example right, it handles easy cases.
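If the examples live in a test set, it helps to render them into the prompt programmatically so prompt and evaluation data never drift apart. A minimal sketch (the function name and pair format are illustrative):

```python
import json

def build_few_shot_block(examples):
    """Render (input_text, expected_output) pairs in the Input:/Output: format above."""
    lines = []
    for input_text, expected in examples:
        lines.append(f'Input: "{input_text}"')
        # json.dumps keeps the output examples syntactically valid JSON
        lines.append(f"Output: {json.dumps(expected)}")
        lines.append("")  # blank line between examples
    return "\n".join(lines).rstrip()

examples = [
    ("The product crashed twice today.", {"sentiment": "negative", "urgency": "high"}),
    ("Love the new dashboard.", {"sentiment": "positive", "urgency": "low"}),
]
block = build_few_shot_block(examples)
```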

### 3. Chain-of-Thought Patterns

For reasoning tasks, instruct the model to show its work before answering:

- **Step-by-step:** "Think through this step by step before giving your final answer." Best for math and multi-step analysis.
- **Explain-then-answer:** "First explain your reasoning, then provide the answer on a new line starting with 'Answer:'." Best when you need to audit logic.
- **Self-critique:** "After drafting your response, review it for errors before outputting the final version." Best for generation tasks.

When chain-of-thought adds tokens without improving accuracy (simple classification), skip it.

### 4. Output Formatting

Specify the exact output structure. Ambiguity in format is the top cause of parsing failures.

```
Respond with a JSON object matching this schema exactly:
{
  "summary": "string (1-2 sentences)",
  "confidence": "number (0.0-1.0)",
  "categories": ["string array, from: billing, technical, feature-request, other"],
  "requires_escalation": "boolean"
}
Do not include any text outside the JSON object.
```

For structured outputs, provide the schema and a completed example. For free-text, specify length, tone, and inclusions/exclusions.
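The same schema should back the format-compliance check at evaluation time. A stdlib-only sketch of such a validator for the schema above (in practice a full JSON Schema library like `jsonschema` can replace the hand-written checks):

```python
import json

ALLOWED_CATEGORIES = {"billing", "technical", "feature-request", "other"}

def validate_response(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output is compliant."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    errors = []
    if not isinstance(obj.get("summary"), str):
        errors.append("summary must be a string")
    conf = obj.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0.0, 1.0]")
    cats = obj.get("categories")
    if not isinstance(cats, list) or not set(cats) <= ALLOWED_CATEGORIES:
        errors.append("categories must be a list drawn from the allowed set")
    if not isinstance(obj.get("requires_escalation"), bool):
        errors.append("requires_escalation must be a boolean")
    return errors

good = '{"summary": "Billing issue.", "confidence": 0.9, "categories": ["billing"], "requires_escalation": false}'
validate_response(good)  # []
```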

### 5. Evaluation Framework

Define how you will measure prompt quality before deploying. Build a test set of 20-50 examples with expected outputs.

| Metric              | Method                  | Pass Threshold |
|---------------------|-------------------------|----------------|
| Format compliance   | JSON schema validation  | 100%           |
| Classification acc. | Match against labels    | >90%           |
| Hallucination rate  | Human review sample     | <5%            |
| Latency (p95)       | API response time       | <3s            |
| Cost per request    | Token count × price     | <$0.02         |

Run the full test set after every prompt change. A prompt that improves accuracy but breaks format compliance is a regression, not an improvement.
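A minimal harness for the first two metrics might look like this; `run_prompt` is a hypothetical stand-in for your model call, injected as a function so the harness stays model-agnostic:

```python
import json

def evaluate(test_set, run_prompt):
    """test_set: list of (input_text, expected_label) pairs. Returns a metrics dict."""
    format_ok = 0
    correct = 0
    for input_text, expected_label in test_set:
        raw = run_prompt(input_text)
        try:
            obj = json.loads(raw)
            format_ok += 1
        except json.JSONDecodeError:
            continue  # a format failure also counts as a miss on accuracy
        if obj.get("label") == expected_label:
            correct += 1
    n = len(test_set)
    return {"format_compliance": format_ok / n, "accuracy": correct / n}

# Toy stand-in model for demonstration only
fake_model = lambda text: json.dumps({"label": "negative" if "crash" in text else "positive"})
evaluate([("It crashed", "negative"), ("Great!", "positive")], fake_model)
# {'format_compliance': 1.0, 'accuracy': 1.0}
```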

### 6. Iteration Methodology

Follow this loop for every prompt revision:

1. **Run the current prompt** against the full test set. Record scores.
2. **Identify failure patterns.** Group errors by type: format, accuracy, hallucination, edge cases.
3. **Change one thing.** One modification per iteration — instructions, examples, or structure.
4. **Re-run the test set.** Compare scores to the previous version.
5. **Keep or revert.** If any metric degrades, revert and try a different approach.

Log every iteration: what changed, the hypothesis, and results. Prompt engineering without records is guessing.
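Step 5's decision rule and the iteration log can be sketched in a few lines (field names are illustrative; this assumes every metric is higher-is-better, so lower-is-better metrics like hallucination rate would need their sign flipped first):

```python
def keep_change(old_scores: dict, new_scores: dict) -> bool:
    """Keep the revision only if no metric degrades."""
    return all(new_scores[m] >= old_scores[m] for m in old_scores)

log = []
old = {"format_compliance": 1.0, "accuracy": 0.88}
new = {"format_compliance": 0.97, "accuracy": 0.93}
decision = keep_change(old, new)  # False: accuracy rose, but format compliance degraded
log.append({
    "change": "added a third few-shot example",
    "hypothesis": "a boundary-case example fixes misclassified tickets",
    "old": old, "new": new, "kept": decision,
})
```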

## Quality checklist

Before delivering a prompt, verify:

- [ ] System prompt states task, role, and constraints concretely — no vague instructions
- [ ] Output format specified with a schema or example, not just described
- [ ] 2-5 few-shot examples cover typical, boundary, and difficult cases
- [ ] Chain-of-thought included only when it measurably improves accuracy
- [ ] Test set of 20+ examples exists with expected outputs
- [ ] Evaluation metrics and thresholds defined before testing begins
- [ ] Prompt tested on the target model — not assumed to transfer from another
- [ ] Token usage and cost within budget for expected volume

## Common mistakes to avoid

- **Writing instructions instead of showing examples.** When instructions and examples conflict, the model tends to follow the examples. One example teaches format better than a paragraph of description.
- **Optimizing on vibes.** "This feels better" is not evaluation. Build a test set and compare versions quantitatively.
- **Changing multiple things at once.** One change per iteration — otherwise you cannot attribute improvement or regression.
- **Ignoring model differences.** A prompt tuned for GPT-4 may underperform on Claude or Gemini. Test on the target model.
- **Skipping edge cases in examples.** Happy-path-only examples cause hallucination on unusual inputs. Include the hardest realistic cases.
- **Over-engineering simple tasks.** Start minimal and add complexity only when the test set demands it.
| 126 |