# ML Model Evaluation

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What problem is the model solving?** — Classification, regression, ranking, recommendation, generation
2. **What is the business objective?** — The real-world outcome (reduce churn, detect fraud, recommend products)
3. **What data is available?** — Dataset size, feature count, label quality, class balance, time range
4. **What are the constraints?** — Latency, model size, interpretability needs, regulatory obligations
5. **What is the baseline?** — Current system performance (rule-based, human, or previous model)
6. **What is the cost of errors?** — False positive vs false negative impact in business terms

## Evaluation template

### 1. Define Success Metrics

Map business objectives to technical metrics. Never evaluate on technical metrics alone.

```
Business Objective: Detect fraudulent transactions before settlement
Primary Metric: Precision at 95% recall
Secondary Metrics: AUC-ROC, F1 score, false positive rate
Business Constraint: <50ms inference latency
Baseline Performance: Rule-based system: 72% precision at 95% recall
Target Performance: >85% precision at 95% recall
```

Metric selection rules:
- **Classification**: Use precision/recall/F1 for imbalanced classes. Accuracy is misleading when 98% of data is one class.
- **Regression**: Use MAE when evaluation should tolerate outliers; use RMSE when large errors are disproportionately costly.
- **Ranking**: NDCG/MAP when order matters, precision@k when only the top results matter.
- Always include a business metric: revenue impact, time saved, error cost reduction.

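The primary metric above can be made concrete with a small pure-Python sketch that sweeps every decision threshold and keeps the best precision among thresholds meeting the recall floor (function and argument names are illustrative; scikit-learn's `precision_recall_curve` performs the same sweep more efficiently):

```python
def precision_at_recall(y_true, y_score, target_recall=0.95):
    """Best precision achievable while keeping recall at or above target_recall.

    Predicts positive when score >= threshold, trying every distinct score
    as a threshold. Assumes 0/1 labels with at least one positive.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    best = 0.0
    for threshold in sorted(set(y_score)):
        preds = [1 if s >= threshold else 0 for s in y_score]
        tp = sum(1 for p, t in zip(preds, y_true) if p and t)
        fp = sum(1 for p, t in zip(preds, y_true) if p and not t)
        if tp / total_pos >= target_recall and tp + fp > 0:
            best = max(best, tp / (tp + fp))
    return best
```

Reporting this single number, rather than a full curve, matches the template's "primary metric with a stated target" rule.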
### 2. Data Splitting Strategy

**Random split** — Default for i.i.d. data: Train 70% / Validation 15% / Test 15%.

**Temporal split** — Required for time-dependent data: Train before T1 / Validation T1-T2 / Test after T2.

**Stratified split** — Required for imbalanced classification: maintain class proportions across splits.

**Group split** — Required when one entity has multiple samples: split by entity ID, not by row.

Critical rules:
- Never use test data for any decision — tuning, feature selection, or threshold setting
- For small datasets (<5000 samples), use k-fold cross-validation instead of a fixed split
- Always check for data leakage: features encoding the label, future data in training

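The group split above can be sketched by hashing the entity ID, so every sample for one entity lands in the same split and the assignment is stable across runs and retrains (the `user_id` field name is illustrative; scikit-learn's `GroupShuffleSplit` is a library alternative):

```python
import hashlib

def group_split(rows, id_key, test_fraction=0.15):
    """Split rows by entity ID so no entity appears in both splits.

    Hashing the ID gives a deterministic bucket in [0, 1000); entities
    whose bucket falls below test_fraction * 1000 go to the test split.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.md5(str(row[id_key]).encode()).hexdigest()
        bucket = int(digest, 16) % 1000
        (test if bucket < test_fraction * 1000 else train).append(row)
    return train, test
```

Because the assignment depends only on the ID, adding new rows for an existing entity can never move it across the train/test boundary — one common source of leakage.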
### 3. Model Comparison

Evaluate all candidates on the same validation set with identical preprocessing:

| Model | Primary Metric | Latency | Model Size | Training Time |
|-------|---------------|---------|------------|---------------|
| Logistic Regression | 0.78 | 2ms | 1MB | 30s |
| XGBoost | 0.86 | 8ms | 50MB | 10min |
| Neural Net | 0.85 | 25ms | 500MB | 2hr |

Always include a simple baseline. If a complex model does not meaningfully beat a simple one, choose the simpler model.

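The "prefer the simpler model on near-ties" rule can be made explicit. A hypothetical helper, where `min_lift` is an assumed margin for what counts as "meaningfully beats" (tune it to your metric's noise):

```python
def pick_model(candidates, baseline_score, min_lift=0.02):
    """Pick the simplest model among those that meaningfully beat the baseline.

    candidates: (name, validation_score, complexity_rank) tuples,
    lower complexity_rank = simpler. Returns None when nothing clears
    the baseline by min_lift.
    """
    viable = [c for c in candidates if c[1] >= baseline_score + min_lift]
    if not viable:
        return None
    best_score = max(c[1] for c in viable)
    # among models within min_lift of the best score, take the simplest
    near_best = [c for c in viable if best_score - c[1] <= min_lift]
    return min(near_best, key=lambda c: c[2])[0]
```

With the table above and a 0.72 rule-based baseline, this picks XGBoost: the neural net's 0.85 is within the margin of 0.86 but is more complex, and logistic regression is too far behind.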
### 4. Error Analysis

Do not stop at aggregate metrics. Examine where the model fails:

- **Confusion matrix**: Inspect false positive and false negative examples manually
- **Segment analysis**: Break down performance by key dimensions (user type, region, value tier). If performance varies >10% across segments, investigate.
- **Error distribution**: For regression, plot residuals — are errors uniform or concentrated?

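A minimal sketch of the segment-analysis step, assuming records carry `y_true`, `y_pred`, and a segment field (field names are illustrative); it applies the >10% gap rule from above:

```python
from collections import defaultdict

def accuracy_by_segment(records, segment_key):
    """Break accuracy down by a segment field and flag large gaps.

    Returns ({segment: accuracy}, flagged), where flagged is True when
    the best and worst segments differ by more than 10 percentage points.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for r in records:
        seg = r[segment_key]
        counts[seg] += 1
        hits[seg] += int(r["y_true"] == r["y_pred"])
    acc = {seg: hits[seg] / counts[seg] for seg in counts}
    flagged = max(acc.values()) - min(acc.values()) > 0.10
    return acc, flagged
```

The same shape works for any segment dimension (user type, region, value tier) and any per-row metric, not just accuracy.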
### 5. Bias Detection

Check for disparities across protected groups:

- **Demographic parity**: Does positive prediction rate differ across groups?
- **Equal opportunity**: Does true positive rate differ across groups?
- **Calibration**: Does a predicted 80% probability mean 80% actual positive rate for all groups?

If disparities exceed acceptable thresholds, investigate data representation, feature encoding, and model architecture.

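A sketch of the first two checks, assuming parallel lists of labels, predictions, and group membership (the protected attribute); comparing the per-group numbers reveals the disparities:

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate (demographic parity check) and
    true-positive rate (equal-opportunity check).

    Returns {group: {"positive_rate": ..., "tpr": ...}}; tpr is None for
    a group with no positive labels.
    """
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        preds = [y_pred[i] for i in idx]
        pos_rate = sum(preds) / len(preds)
        pos_idx = [i for i in idx if y_true[i] == 1]
        tpr = sum(y_pred[i] for i in pos_idx) / len(pos_idx) if pos_idx else None
        out[g] = {"positive_rate": pos_rate, "tpr": tpr}
    return out
```

A gap in `positive_rate` alone may be legitimate if base rates differ; a gap in `tpr` means qualified members of one group are missed more often, which is usually the more actionable signal.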
### 6. Production Readiness

Verify before deployment:

- [ ] Meets primary metric target
- [ ] Meets latency constraint
- [ ] Model size within limits
- [ ] Bias assessment passed
- [ ] Monitoring plan defined (prediction drift, feature drift, business metric tracking)
- [ ] Fallback strategy documented
- [ ] A/B test plan prepared
- [ ] Data pipeline validated
- [ ] Model versioning in place

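Feature-drift monitoring from the plan above can be sketched with the Population Stability Index (PSI), which compares the training-time distribution of a feature against live traffic. The thresholds in the docstring are a common rule of thumb, not a universal standard:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and live (actual) feature sample.

    Rule of thumb (an assumption, tune per use case):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch live values above the training range

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for b in range(bins):
                if edges[b] <= x < edges[b + 1]:
                    counts[b] += 1
                    break
            else:
                counts[0] += 1  # below training range: lump into first bin
        # small smoothing term avoids log(0) on empty bins
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run this per feature (and on the prediction distribution itself) on a schedule; a rising PSI is the early-warning signal that triggers the fallback strategy.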
## Quality checklist

Before delivering a model evaluation, verify:

- [ ] Business objective is mapped to technical metrics with a stated target
- [ ] Data split strategy matches data characteristics (temporal, imbalanced, grouped)
- [ ] Test set was never used for model selection or tuning
- [ ] At least one simple baseline is included for comparison
- [ ] Error analysis examines specific failure cases, not just aggregates
- [ ] Performance is broken down by relevant segments
- [ ] Bias detection covers protected attributes and business segments
- [ ] Production readiness includes latency, monitoring, and fallback

## Common mistakes

- **Evaluating on accuracy alone.** A model predicting "not fraud" for everything achieves 99.5% accuracy on a 0.5% fraud dataset. Use precision/recall for imbalanced problems.
- **Leaking test data.** Using the test set for feature selection or tuning inflates results and breaks the generalization guarantee.
- **Ignoring the simple baseline.** A logistic regression at 90% in 30 seconds often beats a deep learning model at 92% after two weeks of engineering.
- **Reporting only aggregate metrics.** 90% overall accuracy that drops to 50% on a critical segment is not a 90%-accurate model for those users.
- **Skipping the cost analysis.** False positives and false negatives rarely cost the same. The evaluation must reflect the asymmetry.
- **No production monitoring plan.** Models degrade as distributions shift. An evaluation without monitoring is incomplete.