Debugging Guide
Step-by-step debugging methodology for systematic bug finding — covers reproduction, isolation, hypothesis testing, root cause analysis, and fix verification.
Tags: debugging, troubleshooting, bug-fixing, methodology
$ npx skills add The-AI-Directory-Company/(…) --skill debugging-guide

# Debugging Guide

## Before you start

Gather the following before investigating:

1. **What is the expected behavior?** What should happen.
2. **What is the actual behavior?** What happens instead (exact error message, wrong output, crash).
3. **When did it start?** Was it always broken, or did it work before a specific change?
4. **What changed recently?** Deployments, config changes, dependency updates, data migrations.
5. **Who is affected?** All users, specific accounts, specific environments.
6. **Can you reproduce it?** Consistent or intermittent? Steps to trigger?

Do not start guessing at fixes until you can reproduce the bug or have clear evidence of the root cause.

## Procedure

### 1. Reproduce the bug

Before anything else, make the bug happen on demand.

- Follow the exact steps from the bug report
- Use the same environment (OS, browser, API version) as the reporter
- If the bug is intermittent, identify the conditions that increase its likelihood (load, timing, specific data)
- If you cannot reproduce it, gather more data (logs, screenshots, network traces) before proceeding

A bug you cannot reproduce is a bug you cannot verify as fixed.
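
For intermittent bugs, one way to find the conditions that raise the likelihood of failure is to run the suspect operation many times under each condition and compare failure rates. The sketch below is illustrative only: `failure_rate` and `make_flaky_operation` are hypothetical helpers, and the simulated operation stands in for whatever code path you are actually investigating.

```python
import random
from typing import Callable


def failure_rate(operation: Callable[[], None], runs: int = 200) -> float:
    """Run `operation` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            operation()
        except Exception:
            failures += 1
    return failures / runs


def make_flaky_operation(fail_probability: float, seed: int) -> Callable[[], None]:
    """Hypothetical stand-in for the real code path; fails at a fixed probability."""
    rng = random.Random(seed)  # seeded so the experiment itself is repeatable

    def operation() -> None:
        if rng.random() < fail_probability:
            raise RuntimeError("simulated intermittent failure")

    return operation


if __name__ == "__main__":
    # Compare two conditions, for example normal load and heavy load (both simulated).
    baseline = failure_rate(make_flaky_operation(0.02, seed=1))
    stressed = failure_rate(make_flaky_operation(0.30, seed=1))
    print(f"baseline: {baseline:.1%}, stressed: {stressed:.1%}")
```

A clear gap between the two rates suggests the varied condition matters. Similar rates suggest looking at a different condition.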

### 2. Isolate the scope

Narrow down where the bug lives:

- **Layer**: Is it frontend, backend, database, infrastructure, or third-party?
- **Component**: Which module, service, or function?
- **Input**: Which specific inputs trigger the bug? Which inputs do NOT trigger it?

Techniques for isolation:
- **Binary search**: Comment out or bypass half the code path. Does the bug persist? Narrow to the half that matters.
- **Minimal reproduction**: Strip away everything unrelated until you have the smallest code/input that triggers the bug.
- **Environment comparison**: Does it happen in staging but not local? Diff the configs.
- **Git bisect**: If it worked before, use `git bisect` to find the exact commit that introduced it.
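
The binary-search idea behind both the first and last techniques can be sketched in a few lines. This is not how `git bisect` is implemented; it is a minimal illustration of the same halving strategy over an ordered list of candidates. The function `first_bad`, the commit names, and the `reproduces_bug` check are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first candidate for which `is_bad` is true.

    Assumes candidates are ordered oldest to newest, the last one is bad,
    and once a candidate is bad every later one is bad too.
    """
    if not candidates or not is_bad(candidates[-1]):
        raise ValueError("the last candidate must reproduce the bug")
    low, high = 0, len(candidates) - 1  # invariant: candidates[high] is bad
    while low < high:
        mid = (low + high) // 2
        if is_bad(candidates[mid]):
            high = mid       # the bug was introduced at mid or earlier
        else:
            low = mid + 1    # the bug was introduced after mid
    return candidates[low]


if __name__ == "__main__":
    # Hypothetical history: the bug is present from commit "c5" onward.
    commits = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
    checks: list[str] = []

    def reproduces_bug(commit: str) -> bool:
        checks.append(commit)
        return commits.index(commit) >= 4

    print(first_bad(commits, reproduces_bug), "found after", len(checks), "checks")
```

The number of checks grows with the logarithm of the number of candidates, which is why halving tends to beat inspecting changes one by one.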

### 3. Form a hypothesis

State your hypothesis explicitly before testing it:

```
HYPOTHESIS: [what you think is wrong]
EVIDENCE: [what makes you think so]
TEST: [how to confirm or disprove it]
PREDICTION: [what you expect to see if the hypothesis is correct]
```

One hypothesis at a time. If you test multiple changes simultaneously, you will not know which one mattered.

### 4. Test the hypothesis

Run the test you defined. Compare the result to your prediction.

- **Prediction matches**: Your hypothesis is likely correct. Proceed to fix.
- **Prediction does not match**: Your hypothesis is wrong. Do not force it. Return to step 2 with new information.
- **Partial match**: There may be multiple contributing factors. Isolate further.
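
Steps 3 and 4 can be made concrete by writing the prediction down before running the test, then comparing it mechanically to what was observed. The sketch below is one possible shape for that, not a prescribed tool. `Hypothesis`, `evaluate`, `parse_quantity`, and `observe_parse` are hypothetical names, and the bug scenario is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    statement: str               # HYPOTHESIS
    evidence: str                # EVIDENCE
    test: Callable[[], object]   # TEST: runs the experiment, returns the observation
    prediction: object           # PREDICTION: expected observation if the hypothesis holds


def evaluate(hypothesis: Hypothesis) -> str:
    """Run the test once and compare the observation to the prediction."""
    observed = hypothesis.test()
    if observed == hypothesis.prediction:
        return f"SUPPORTED: observed {observed!r} as predicted"
    return f"REJECTED: predicted {hypothesis.prediction!r}, observed {observed!r}"


def parse_quantity(raw: str) -> int:
    """Hypothetical function under investigation."""
    return int(raw.strip())


def observe_parse(raw: str) -> str:
    """Return the parsed value as text, or the exception type name on failure."""
    try:
        return str(parse_quantity(raw))
    except Exception as exc:
        return type(exc).__name__


if __name__ == "__main__":
    hypothesis = Hypothesis(
        statement="parse_quantity rejects inputs with thousands separators",
        evidence="failing orders in the logs all have quantities like '1,200'",
        test=lambda: observe_parse("1,200"),
        prediction="ValueError",
    )
    print(evaluate(hypothesis))
```

This sketch only distinguishes a match from a mismatch. A partial match still needs human judgment and further isolation.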

### 5. Find the root cause

The first fix that makes symptoms disappear is not necessarily the root cause. Ask:

- **Why** does this input cause this behavior? (not just "what" happens)
- Is this a symptom of a deeper issue? (fixing the symptom may leave the real bug)
- Are there other code paths with the same underlying flaw?

Use the "5 Whys" technique:
1. Why did the request fail? The response was 500.
2. Why was it 500? An unhandled null reference.
3. Why was the value null? The database query returned no rows.
4. Why were there no rows? The user ID was from a deleted account.
5. Why was a deleted account ID used? The session was not invalidated on deletion.

Root cause: sessions are not invalidated when accounts are deleted.
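
To illustrate where a root-cause fix for this example would land, here is a minimal sketch using in-memory dictionaries in place of a real database and session store. `AccountService` and its methods are hypothetical. The point is that the fix goes in `delete_account`, where the stale session is created, rather than as a null check at the place where the 500 surfaced.

```python
from typing import Dict, Optional


class AccountService:
    """Hypothetical in-memory stand-in for the account and session stores."""

    def __init__(self) -> None:
        self.accounts: Dict[int, str] = {}   # user_id -> display name
        self.sessions: Dict[str, int] = {}   # session token -> user_id

    def create_account(self, user_id: int, name: str) -> None:
        self.accounts[user_id] = name

    def login(self, token: str, user_id: int) -> None:
        self.sessions[token] = user_id

    def delete_account(self, user_id: int) -> None:
        del self.accounts[user_id]
        # Root-cause fix: drop every session that belongs to the deleted account.
        self.sessions = {
            token: uid for token, uid in self.sessions.items() if uid != user_id
        }

    def current_user(self, token: str) -> Optional[str]:
        user_id = self.sessions.get(token)
        if user_id is None:
            return None  # no valid session; the caller can treat this as logged out
        # Safe direct lookup: sessions never outlive their account.
        return self.accounts[user_id]


if __name__ == "__main__":
    service = AccountService()
    service.create_account(7, "Ada")
    service.login("token-abc", 7)
    service.delete_account(7)
    print(service.current_user("token-abc"))
```

A null check inside `current_user` alone would have stopped the crash while leaving stale sessions in place for every other code path that reads them.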

### 6. Implement and verify the fix

1. Write a test that reproduces the bug (it should fail before the fix)
2. Apply the minimal fix: change the least amount of code possible
3. Run the reproduction test (it should pass now)
4. Run the full test suite and confirm there are no regressions
5. Check related code paths for the same pattern
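
The fail-before, pass-after check in items 1 to 3 can be shown with Python's standard `unittest` module. The pagination bug below is invented for illustration, and `paginate_buggy`, `paginate_fixed`, and `reproduction_passes` are hypothetical names. In a real project there is only one implementation, and you run the same test before and after changing it.

```python
import unittest
from typing import Callable, List

Paginate = Callable[[List[int], int, int], List[int]]


def paginate_buggy(items: List[int], page: int, size: int) -> List[int]:
    """Hypothetical pre-fix version: treats 1-based page numbers as 0-based."""
    start = page * size
    return items[start:start + size]


def paginate_fixed(items: List[int], page: int, size: int) -> List[int]:
    """Minimal fix: convert the 1-based page number to a 0-based offset."""
    start = (page - 1) * size
    return items[start:start + size]


def reproduction_passes(paginate: Paginate) -> bool:
    """Run the reproduction test against one implementation."""

    class FirstPageTest(unittest.TestCase):
        def test_first_page_starts_at_first_item(self) -> None:
            self.assertEqual(paginate([10, 20, 30, 40], 1, 2), [10, 20])

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FirstPageTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("before fix:", reproduction_passes(paginate_buggy))
    print("after fix: ", reproduction_passes(paginate_fixed))
```

A test that already passes before the fix does not reproduce the bug, so it cannot show that the fix worked.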

### 7. Document the finding

Record for future reference:

```
BUG: [one-line summary]
ROOT CAUSE: [what was actually wrong]
FIX: [what was changed]
RELATED: [other areas that might have the same issue]
PREVENTION: [what would have caught this earlier, such as a test, lint rule, or type constraint]
```

## Quality checklist

- [ ] The bug is reproducible with a defined set of steps
- [ ] The root cause is identified, not just the symptom
- [ ] A test exists that fails before the fix and passes after
- [ ] The fix changes the minimum necessary code
- [ ] Related code paths were checked for the same pattern
- [ ] The full test suite passes with no regressions
- [ ] The fix is documented with root cause and prevention notes

## Common mistakes

- **Fixing symptoms instead of root causes.** Adding a null check hides the bug; ask WHY the value is null in the first place.
- **Changing multiple things at once.** If you change three things and the bug disappears, you do not know which change fixed it. Change one thing at a time.
- **Skipping reproduction.** Fixing a bug you cannot reproduce means you cannot verify the fix works. Reproduce first, always.
- **Blaming the environment.** "It works on my machine" is not a diagnosis. If it fails in production, the difference between your machine and production IS the bug.
- **Stopping at the first fix that works.** The first fix may mask the real issue. Verify the root cause before declaring victory.
- **Not checking for related instances.** If a bug exists in one place, the same pattern likely exists elsewhere. Search the codebase for similar code.
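
Searching for related instances can be as simple as a regular-expression scan over the source tree. The sketch below uses only the standard library. `find_pattern` is a hypothetical helper, and the pattern shown (a direct account lookup keyed by a session) is an invented example of a risky pattern, not a rule that applies to every codebase.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def find_pattern(
    root: Path, pattern: str, glob: str = "*.py"
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line under `root` matching `pattern`."""
    regex = re.compile(pattern)
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, number, line.strip()


if __name__ == "__main__":
    # Demo on a throwaway directory so the example does not depend on a real repo.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "handlers.py"
        sample.write_text(
            "user = accounts[session.user_id]\nname = 'ok'\n", encoding="utf-8"
        )
        for path, number, line in find_pattern(Path(tmp), r"accounts\[\s*session"):
            print(f"{path.name}:{number}: {line}")
```

A text search finds only lexically similar code. Instances of the same flaw written differently still need a manual review of the related code paths.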