prompt-engineering-guide/SKILL.md

Design, test, and optimize LLM prompts systematically — with evaluation frameworks, chain-of-thought patterns, output formatting, and iteration methodology for reliable AI outputs.
# Prompt Engineering Guide

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What task should the prompt accomplish?** (Classification, generation, extraction, transformation, summarization)
2. **What model will run the prompt?** (GPT-4, Claude, Gemini, open-source — capabilities differ)
3. **What does the input look like?** (Free text, structured data, documents, code)
4. **What does the ideal output look like?** (JSON, markdown, plain text — provide 2-3 real examples)
5. **What are the failure modes?** (Hallucinations, wrong format, refusals, missing edge cases)
6. **How will you evaluate quality?** (Human review, automated checks, ground truth comparison)

## Prompt design template

### 1. Task Definition

Write a clear system prompt that defines the task, role, and constraints:

```
You are a [role] that [core task].
Your job is to [specific action] given [input description].

Rules:
- [Constraint 1: e.g., respond only in valid JSON]
- [Constraint 2: e.g., never fabricate information not in the source]
- [Constraint 3: e.g., if uncertain, say "I don't know"]

Output format:
[Exact schema or structure the model must follow]
```

Every sentence should constrain behavior or clarify expectations. Avoid vague instructions like "be helpful."
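Filled in for a concrete task, the template might read as follows (the role, rules, and schema here are illustrative, not prescribed by this guide):

```
You are a support-ticket classifier that labels incoming customer messages.
Your job is to assign a category and an urgency level given the raw ticket text.

Rules:
- Respond only in valid JSON.
- Never fabricate details not present in the ticket.
- If the category is unclear, use "other".

Output format:
{"category": "billing | technical | other", "urgency": "low | high"}
```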

### 2. Few-Shot Examples

Include 2-5 input-output examples in the prompt. Examples teach format and edge cases more reliably than instructions alone.

```
Input: "The product crashed twice today and support hasn't responded."
Output: {"sentiment": "negative", "topics": ["reliability", "support"], "urgency": "high"}

Input: "Love the new dashboard — the filters are exactly what I needed."
Output: {"sentiment": "positive", "topics": ["dashboard", "filters"], "urgency": "low"}
```

Cover the typical case, a boundary case, and the hardest case. If the model gets the hard example right, it handles easy cases.
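If the examples live in a test set, it helps to render them into the prompt programmatically so prompt and evaluation data never drift apart. A minimal sketch (the function name and pair format are illustrative):

```python
import json

def build_few_shot_block(examples):
    """Render (input_text, expected_output) pairs in the Input:/Output: format above."""
    lines = []
    for input_text, expected in examples:
        lines.append(f'Input: "{input_text}"')
        # json.dumps keeps the output examples syntactically valid JSON
        lines.append(f"Output: {json.dumps(expected)}")
        lines.append("")  # blank line between examples
    return "\n".join(lines).rstrip()

examples = [
    ("The product crashed twice today.", {"sentiment": "negative", "urgency": "high"}),
    ("Love the new dashboard.", {"sentiment": "positive", "urgency": "low"}),
]
block = build_few_shot_block(examples)
```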

### 3. Chain-of-Thought Patterns

For reasoning tasks, instruct the model to show its work before answering:

- **Step-by-step:** "Think through this step by step before giving your final answer." Best for math and multi-step analysis.
- **Explain-then-answer:** "First explain your reasoning, then provide the answer on a new line starting with 'Answer:'." Best when you need to audit logic.
- **Self-critique:** "After drafting your response, review it for errors before outputting the final version." Best for generation tasks.

When chain-of-thought adds tokens without improving accuracy (simple classification), skip it.

### 4. Output Formatting

Specify the exact output structure. Ambiguity in format is the top cause of parsing failures.

```
Respond with a JSON object matching this schema exactly:
{
  "summary": "string (1-2 sentences)",
  "confidence": "number (0.0-1.0)",
  "categories": ["string array, from: billing, technical, feature-request, other"],
  "requires_escalation": "boolean"
}
Do not include any text outside the JSON object.
```

For structured outputs, provide the schema and a completed example. For free-text, specify length, tone, and inclusions/exclusions.
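The same schema should back the format-compliance check at evaluation time. A stdlib-only sketch of such a validator for the schema above (in practice a full JSON Schema library like `jsonschema` can replace the hand-written checks):

```python
import json

ALLOWED_CATEGORIES = {"billing", "technical", "feature-request", "other"}

def validate_response(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output is compliant."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    errors = []
    if not isinstance(obj.get("summary"), str):
        errors.append("summary must be a string")
    conf = obj.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0.0, 1.0]")
    cats = obj.get("categories")
    if not isinstance(cats, list) or not set(cats) <= ALLOWED_CATEGORIES:
        errors.append("categories must be a list drawn from the allowed set")
    if not isinstance(obj.get("requires_escalation"), bool):
        errors.append("requires_escalation must be a boolean")
    return errors

good = '{"summary": "Billing issue.", "confidence": 0.9, "categories": ["billing"], "requires_escalation": false}'
validate_response(good)  # []
```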

### 5. Evaluation Framework

Define how you will measure prompt quality before deploying. Build a test set of 20-50 examples with expected outputs.

| Metric              | Method                  | Pass Threshold |
|---------------------|-------------------------|----------------|
| Format compliance   | JSON schema validation  | 100%           |
| Classification acc. | Match against labels    | >90%           |
| Hallucination rate  | Human review sample     | <5%            |
| Latency (p95)       | API response time       | <3s            |
| Cost per request    | Token count × price     | <$0.02         |

Run the full test set after every prompt change. A prompt that improves accuracy but breaks format compliance is a regression, not an improvement.
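A minimal harness for the first two metrics might look like this; `run_prompt` is a hypothetical stand-in for your model call, injected as a function so the harness stays model-agnostic:

```python
import json

def evaluate(test_set, run_prompt):
    """test_set: list of (input_text, expected_label) pairs. Returns a metrics dict."""
    format_ok = 0
    correct = 0
    for input_text, expected_label in test_set:
        raw = run_prompt(input_text)
        try:
            obj = json.loads(raw)
            format_ok += 1
        except json.JSONDecodeError:
            continue  # a format failure also counts as a miss on accuracy
        if obj.get("label") == expected_label:
            correct += 1
    n = len(test_set)
    return {"format_compliance": format_ok / n, "accuracy": correct / n}

# Toy stand-in model for demonstration only
fake_model = lambda text: json.dumps({"label": "negative" if "crash" in text else "positive"})
evaluate([("It crashed", "negative"), ("Great!", "positive")], fake_model)
# {'format_compliance': 1.0, 'accuracy': 1.0}
```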

### 6. Iteration Methodology

Follow this loop for every prompt revision:

1. **Run the current prompt** against the full test set. Record scores.
2. **Identify failure patterns.** Group errors by type: format, accuracy, hallucination, edge cases.
3. **Change one thing.** One modification per iteration — instructions, examples, or structure.
4. **Re-run the test set.** Compare scores to the previous version.
5. **Keep or revert.** If any metric degrades, revert and try a different approach.

Log every iteration: what changed, the hypothesis, and results. Prompt engineering without records is guessing.
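Step 5's decision rule and the iteration log can be sketched in a few lines (field names are illustrative; this assumes every metric is higher-is-better, so lower-is-better metrics like hallucination rate would need their sign flipped first):

```python
def keep_change(old_scores: dict, new_scores: dict) -> bool:
    """Keep the revision only if no metric degrades."""
    return all(new_scores[m] >= old_scores[m] for m in old_scores)

log = []
old = {"format_compliance": 1.0, "accuracy": 0.88}
new = {"format_compliance": 0.97, "accuracy": 0.93}
decision = keep_change(old, new)  # False: accuracy rose, but format compliance degraded
log.append({
    "change": "added a third few-shot example",
    "hypothesis": "a boundary-case example fixes misclassified tickets",
    "old": old, "new": new, "kept": decision,
})
```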

## Quality checklist

Before delivering a prompt, verify:

- [ ] System prompt states task, role, and constraints concretely — no vague instructions
- [ ] Output format specified with a schema or example, not just described
- [ ] 2-5 few-shot examples cover typical, boundary, and difficult cases
- [ ] Chain-of-thought included only when it measurably improves accuracy
- [ ] Test set of 20+ examples exists with expected outputs
- [ ] Evaluation metrics and thresholds defined before testing begins
- [ ] Prompt tested on the target model — not assumed to transfer from another
- [ ] Token usage and cost within budget for expected volume

## Common mistakes to avoid

- **Writing instructions instead of showing examples.** When instructions and examples conflict, the model tends to follow the examples. One example teaches format better than a paragraph of description.
- **Optimizing on vibes.** "This feels better" is not evaluation. Build a test set and compare versions quantitatively.
- **Changing multiple things at once.** One change per iteration — otherwise you cannot attribute improvement or regression.
- **Ignoring model differences.** A prompt tuned for GPT-4 may underperform on Claude or Gemini. Test on the target model.
- **Skipping edge cases in examples.** Happy-path-only examples cause hallucination on unusual inputs. Include the hardest realistic cases.
- **Over-engineering simple tasks.** Start minimal and add complexity only when the test set demands it.
| 126 |