
ML Model Evaluation

Evaluate machine learning models rigorously — with train/test splits, cross-validation, business metric alignment, bias detection, and production readiness assessment.

machine-learning · evaluation · metrics · bias · model-selection

Works well with agents

Data Scientist Agent · ML Engineer Agent · Product Analyst Agent

Works well with skills

Experiment Design · Metrics Framework
$ npx skills add The-AI-Directory-Company/(…) --skill ml-model-evaluation
ml-model-evaluation/
  • churn-prediction.md (3.4 KB)
  • SKILL.md (6.0 KB)
SKILL.md
# ML Model Evaluation

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What problem is the model solving?** — Classification, regression, ranking, recommendation, generation
2. **What is the business objective?** — The real-world outcome (reduce churn, detect fraud, recommend products)
3. **What data is available?** — Dataset size, feature count, label quality, class balance, time range
4. **What are the constraints?** — Latency, model size, interpretability needs, regulatory obligations
5. **What is the baseline?** — Current system performance (rule-based, human, or previous model)
6. **What is the cost of errors?** — False positive vs false negative impact in business terms
## Evaluation template

### 1. Define Success Metrics

Map business objectives to technical metrics. Never evaluate on technical metrics alone.

```
Business Objective: Detect fraudulent transactions before settlement
Primary Metric: Precision at 95% recall
Secondary Metrics: AUC-ROC, F1 score, false positive rate
Business Constraint: <50ms inference latency
Baseline Performance: Rule-based system: 72% precision at 95% recall
Target Performance: >85% precision at 95% recall
```

Metric selection rules:
- **Classification**: Use precision/recall/F1 for imbalanced classes. Accuracy is misleading when 98% of data is one class.
- **Regression**: Use MAE when outliers should not dominate the score, RMSE when large errors are disproportionately costly.
- **Ranking**: Use NDCG/MAP when order matters, precision@k when only top results matter.
- Always include a business metric: revenue impact, time saved, error cost reduction.
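
A metric like the fraud example's "precision at 95% recall" can be computed directly from model scores. A minimal numpy sketch (the function name and interface are illustrative, not part of this skill):

```python
import numpy as np

def precision_at_recall(y_true, scores, min_recall=0.95):
    """Best precision achievable at >= min_recall, scanning score thresholds."""
    order = np.argsort(-np.asarray(scores))   # highest score first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                         # true positives at each cutoff
    fp = np.cumsum(1 - y)                     # false positives at each cutoff
    precision = tp / (tp + fp)
    recall = tp / y.sum()
    ok = recall >= min_recall
    return float(precision[ok].max()) if ok.any() else 0.0
```

The scan considers every cutoff on the ranked scores, keeps those whose recall clears the floor, and reports the best precision among them.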

### 2. Data Splitting Strategy

**Random split** — Default for i.i.d. data: Train 70% / Validation 15% / Test 15%.

**Temporal split** — Required for time-dependent data: Train before T1 / Validation T1-T2 / Test after T2.

**Stratified split** — Required for imbalanced classification: maintain class proportions across splits.

**Group split** — Required when one entity has multiple samples: split by entity ID, not by row.

Critical rules:
- Never use test data for any decision — tuning, feature selection, or threshold setting
- For small datasets (<5000 samples), use k-fold cross-validation instead of a fixed split
- Always check for data leakage: features encoding the label, future data in training
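
The group split is the rule most often violated in practice. A minimal numpy sketch of splitting by entity ID rather than by row (function name, fraction, and seed are illustrative):

```python
import numpy as np

def group_split(groups, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with no entity ID on both sides."""
    groups = np.asarray(groups)
    ids = np.unique(groups)                   # one entry per entity
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    test_mask = np.isin(groups, ids[:n_test]) # rows whose entity is held out
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```

Because whole entities are assigned to one side, a user's later sessions can never leak information into the evaluation of a model trained on their earlier ones.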

### 3. Model Comparison

Evaluate all candidates on the same validation set with identical preprocessing:

| Model | Primary Metric | Latency | Model Size | Training Time |
|-------|---------------|---------|------------|---------------|
| Logistic Regression | 0.78 | 2ms | 1MB | 30s |
| XGBoost | 0.86 | 8ms | 50MB | 10min |
| Neural Net | 0.85 | 25ms | 500MB | 2hr |

Always include a simple baseline. If a complex model does not meaningfully beat a simple one, choose the simpler model.
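
A table like the one above is easiest to keep honest when one harness scores every candidate on the same validation data. A sketch (all names are illustrative) that measures the primary metric and per-row inference latency together so they cannot drift apart:

```python
import time

def compare_models(models, X_val, y_val, metric):
    """Score each candidate on the same validation set; time per-row inference."""
    rows = []
    for name, predict in models.items():
        start = time.perf_counter()
        preds = [predict(x) for x in X_val]
        latency_ms = (time.perf_counter() - start) * 1000 / len(X_val)
        rows.append((name, metric(y_val, preds), latency_ms))
    # best primary metric first; latency is reported, not optimized
    return sorted(rows, key=lambda r: -r[1])
```

Passing a trivial baseline such as `{"majority": lambda x: 0}` alongside the real candidates enforces the simple-baseline rule by construction.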

### 4. Error Analysis

Do not stop at aggregate metrics. Examine where the model fails:

- **Confusion matrix**: Inspect false positive and false negative examples manually
- **Segment analysis**: Break down performance by key dimensions (user type, region, value tier). If performance varies >10% across segments, investigate.
- **Error distribution**: For regression, plot residuals — are errors uniform or concentrated?
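
The segment-analysis check can be automated. A numpy sketch that applies the >10% guideline above (the function name and report shape are illustrative):

```python
import numpy as np

def accuracy_by_segment(y_true, y_pred, segments, gap=0.10):
    """Per-segment accuracy, flagging segments more than `gap` below overall."""
    y_true, y_pred, segments = map(np.asarray, (y_true, y_pred, segments))
    overall = float((y_true == y_pred).mean())
    report = {}
    for seg in np.unique(segments):
        mask = segments == seg
        acc = float((y_true[mask] == y_pred[mask]).mean())
        report[seg] = {"accuracy": acc, "flagged": overall - acc > gap}
    return overall, report
```

Any flagged segment is a candidate for the manual inspection step: pull its misclassified rows and read them.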

### 5. Bias Detection

Check for disparities across protected groups:

- **Demographic parity**: Does positive prediction rate differ across groups?
- **Equal opportunity**: Does true positive rate differ across groups?
- **Calibration**: Does a predicted 80% probability mean 80% actual positive rate for all groups?

If disparities exceed acceptable thresholds, investigate data representation, feature encoding, and model architecture.
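
The first two checks reduce to comparing two rates across groups. A minimal numpy sketch (names are illustrative; groups with no actual positives are skipped for the equal-opportunity gap):

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Largest cross-group gap in positive prediction rate (demographic
    parity) and in true positive rate (equal opportunity)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates, tprs = [], []
    for g in np.unique(group):
        mask = group == g
        rates.append(y_pred[mask].mean())      # positive prediction rate
        pos = mask & (y_true == 1)             # actual positives in this group
        if pos.any():
            tprs.append(y_pred[pos].mean())    # true positive rate
    dp_gap = max(rates) - min(rates)
    eo_gap = max(tprs) - min(tprs) if tprs else 0.0
    return float(dp_gap), float(eo_gap)
```

What counts as an acceptable gap is a policy decision, not a statistical one; the code only surfaces the numbers.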

### 6. Production Readiness

Verify before deployment:

- [ ] Meets primary metric target
- [ ] Meets latency constraint
- [ ] Model size within limits
- [ ] Bias assessment passed
- [ ] Monitoring plan defined (prediction drift, feature drift, business metric tracking)
- [ ] Fallback strategy documented
- [ ] A/B test plan prepared
- [ ] Data pipeline validated
- [ ] Model versioning in place
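
Feature and prediction drift are often monitored with the Population Stability Index. A minimal numpy sketch (the 10-bin quantile binning and the conventional 0.1/0.25 alert thresholds mentioned below are common practice, not part of this skill):

```python
import numpy as np

def psi(reference, live, bins=10):
    """Population Stability Index: how far a live sample has drifted from
    the reference (training-time) distribution of one feature."""
    reference, live = np.asarray(reference, float), np.asarray(live, float)
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # widen the outer edges so out-of-range live values are still counted
    edges[0] = min(edges[0], live.min()) - 1e-9
    edges[-1] = max(edges[-1], live.max()) + 1e-9
    ref = np.histogram(reference, edges)[0] / len(reference)
    cur = np.histogram(live, edges)[0] / len(live)
    ref, cur = np.clip(ref, 1e-6, None), np.clip(cur, 1e-6, None)
    return float(np.sum((cur - ref) * np.log(cur / ref)))
```

A common rule of thumb: PSI below 0.1 is stable, 0.1–0.25 warrants investigation, above 0.25 indicates significant drift.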

## Quality checklist

Before delivering a model evaluation, verify:

- [ ] Business objective is mapped to technical metrics with a stated target
- [ ] Data split strategy matches data characteristics (temporal, imbalanced, grouped)
- [ ] Test set was never used for model selection or tuning
- [ ] At least one simple baseline is included for comparison
- [ ] Error analysis examines specific failure cases, not just aggregates
- [ ] Performance is broken down by relevant segments
- [ ] Bias detection covers protected attributes and business segments
- [ ] Production readiness includes latency, monitoring, and fallback

## Common mistakes

- **Evaluating on accuracy alone.** A model predicting "not fraud" for everything achieves 99.5% accuracy on a 0.5% fraud dataset. Use precision/recall for imbalanced problems.
- **Leaking test data.** Using the test set for feature selection or tuning inflates results and breaks the generalization guarantee.
- **Ignoring the simple baseline.** A logistic regression at 90% in 30 seconds often beats a deep learning model at 92% after two weeks of engineering.
- **Reporting only aggregate metrics.** 90% overall accuracy that drops to 50% on a critical segment is not a 90%-accurate model for those users.
- **Skipping the cost analysis.** False positives and false negatives rarely cost the same. The evaluation must reflect the asymmetry.
- **No production monitoring plan.** Models degrade as distributions shift. An evaluation without monitoring is incomplete.