# ML Model Evaluation

## Before you start

Gather the following from the user. If anything is missing, ask before proceeding:

1. **What problem is the model solving?** — Classification, regression, ranking, recommendation, generation
2. **What is the business objective?** — The real-world outcome (reduce churn, detect fraud, recommend products)
3. **What data is available?** — Dataset size, feature count, label quality, class balance, time range
4. **What are the constraints?** — Latency, model size, interpretability needs, regulatory obligations
5. **What is the baseline?** — Current system performance (rule-based, human, or previous model)
6. **What is the cost of errors?** — False positive vs false negative impact in business terms

## Evaluation template

### 1. Define Success Metrics

Map business objectives to technical metrics. Never evaluate on technical metrics alone.

```
Business Objective: Detect fraudulent transactions before settlement
Primary Metric: Precision at 95% recall
Secondary Metrics: AUC-ROC, F1 score, false positive rate
Business Constraint: <50ms inference latency
Baseline Performance: Rule-based system: 72% precision at 95% recall
Target Performance: >85% precision at 95% recall
```

Metric selection rules:
- **Classification**: Use precision/recall/F1 for imbalanced classes. Accuracy is misleading when 98% of data is one class.
- **Regression**: Use MAE when evaluation should tolerate outliers; use RMSE when large errors are disproportionately costly.
- **Ranking**: NDCG/MAP when order matters, precision@k when only the top results matter.
- Always include a business metric: revenue impact, time saved, error cost reduction.

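The primary metric above can be made concrete with a small pure-Python sketch that sweeps every decision threshold and keeps the best precision among thresholds meeting the recall floor (function and argument names are illustrative; scikit-learn's `precision_recall_curve` performs the same sweep more efficiently):

```python
def precision_at_recall(y_true, y_score, target_recall=0.95):
    """Best precision achievable while keeping recall at or above target_recall.

    Predicts positive when score >= threshold, trying every distinct score
    as a threshold. Assumes 0/1 labels with at least one positive.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    best = 0.0
    for threshold in sorted(set(y_score)):
        preds = [1 if s >= threshold else 0 for s in y_score]
        tp = sum(1 for p, t in zip(preds, y_true) if p and t)
        fp = sum(1 for p, t in zip(preds, y_true) if p and not t)
        if tp / total_pos >= target_recall and tp + fp > 0:
            best = max(best, tp / (tp + fp))
    return best
```

Reporting this single number, rather than a full curve, matches the template's "primary metric with a stated target" rule.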
### 2. Data Splitting Strategy

**Random split** — Default for i.i.d. data: Train 70% / Validation 15% / Test 15%.

**Temporal split** — Required for time-dependent data: Train before T1 / Validation T1-T2 / Test after T2.

**Stratified split** — Required for imbalanced classification: maintain class proportions across splits.

**Group split** — Required when one entity has multiple samples: split by entity ID, not by row.

Critical rules:
- Never use test data for any decision — tuning, feature selection, or threshold setting
- For small datasets (<5000 samples), use k-fold cross-validation instead of a fixed split
- Always check for data leakage: features encoding the label, future data in training

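The group split above can be sketched by hashing the entity ID, so every sample for one entity lands in the same split and the assignment is stable across runs and retrains (the `user_id` field name is illustrative; scikit-learn's `GroupShuffleSplit` is a library alternative):

```python
import hashlib

def group_split(rows, id_key, test_fraction=0.15):
    """Split rows by entity ID so no entity appears in both splits.

    Hashing the ID gives a deterministic bucket in [0, 1000); entities
    whose bucket falls below test_fraction * 1000 go to the test split.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.md5(str(row[id_key]).encode()).hexdigest()
        bucket = int(digest, 16) % 1000
        (test if bucket < test_fraction * 1000 else train).append(row)
    return train, test
```

Because the assignment depends only on the ID, adding new rows for an existing entity can never move it across the train/test boundary — one common source of leakage.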
### 3. Model Comparison

Evaluate all candidates on the same validation set with identical preprocessing:

| Model | Primary Metric | Latency | Model Size | Training Time |
|-------|---------------|---------|------------|---------------|
| Logistic Regression | 0.78 | 2ms | 1MB | 30s |
| XGBoost | 0.86 | 8ms | 50MB | 10min |
| Neural Net | 0.85 | 25ms | 500MB | 2hr |

Always include a simple baseline. If a complex model does not meaningfully beat a simple one, choose the simpler model.

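The "prefer the simpler model on near-ties" rule can be made explicit. A hypothetical helper, where `min_lift` is an assumed margin for what counts as "meaningfully beats" (tune it to your metric's noise):

```python
def pick_model(candidates, baseline_score, min_lift=0.02):
    """Pick the simplest model among those that meaningfully beat the baseline.

    candidates: (name, validation_score, complexity_rank) tuples,
    lower complexity_rank = simpler. Returns None when nothing clears
    the baseline by min_lift.
    """
    viable = [c for c in candidates if c[1] >= baseline_score + min_lift]
    if not viable:
        return None
    best_score = max(c[1] for c in viable)
    # among models within min_lift of the best score, take the simplest
    near_best = [c for c in viable if best_score - c[1] <= min_lift]
    return min(near_best, key=lambda c: c[2])[0]
```

With the table above and a 0.72 rule-based baseline, this picks XGBoost: the neural net's 0.85 is within the margin of 0.86 but is more complex, and logistic regression is too far behind.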
### 4. Error Analysis

Do not stop at aggregate metrics. Examine where the model fails:

- **Confusion matrix**: Inspect false positive and false negative examples manually
- **Segment analysis**: Break down performance by key dimensions (user type, region, value tier). If performance varies >10% across segments, investigate.
- **Error distribution**: For regression, plot residuals — are errors uniform or concentrated?

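A minimal sketch of the segment-analysis step, assuming records carry `y_true`, `y_pred`, and a segment field (field names are illustrative); it applies the >10% gap rule from above:

```python
from collections import defaultdict

def accuracy_by_segment(records, segment_key):
    """Break accuracy down by a segment field and flag large gaps.

    Returns ({segment: accuracy}, flagged), where flagged is True when
    the best and worst segments differ by more than 10 percentage points.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for r in records:
        seg = r[segment_key]
        counts[seg] += 1
        hits[seg] += int(r["y_true"] == r["y_pred"])
    acc = {seg: hits[seg] / counts[seg] for seg in counts}
    flagged = max(acc.values()) - min(acc.values()) > 0.10
    return acc, flagged
```

The same shape works for any segment dimension (user type, region, value tier) and any per-row metric, not just accuracy.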
### 5. Bias Detection

Check for disparities across protected groups:

- **Demographic parity**: Does positive prediction rate differ across groups?
- **Equal opportunity**: Does true positive rate differ across groups?
- **Calibration**: Does a predicted 80% probability mean 80% actual positive rate for all groups?

If disparities exceed acceptable thresholds, investigate data representation, feature encoding, and model architecture.

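A sketch of the first two checks, assuming parallel lists of labels, predictions, and group membership (the protected attribute); comparing the per-group numbers reveals the disparities:

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate (demographic parity check) and
    true-positive rate (equal-opportunity check).

    Returns {group: {"positive_rate": ..., "tpr": ...}}; tpr is None for
    a group with no positive labels.
    """
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        preds = [y_pred[i] for i in idx]
        pos_rate = sum(preds) / len(preds)
        pos_idx = [i for i in idx if y_true[i] == 1]
        tpr = sum(y_pred[i] for i in pos_idx) / len(pos_idx) if pos_idx else None
        out[g] = {"positive_rate": pos_rate, "tpr": tpr}
    return out
```

A gap in `positive_rate` alone may be legitimate if base rates differ; a gap in `tpr` means qualified members of one group are missed more often, which is usually the more actionable signal.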
### 6. Production Readiness

Verify before deployment:

- [ ] Meets primary metric target
- [ ] Meets latency constraint
- [ ] Model size within limits
- [ ] Bias assessment passed
- [ ] Monitoring plan defined (prediction drift, feature drift, business metric tracking)
- [ ] Fallback strategy documented
- [ ] A/B test plan prepared
- [ ] Data pipeline validated
- [ ] Model versioning in place

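Feature-drift monitoring from the plan above can be sketched with the Population Stability Index (PSI), which compares the training-time distribution of a feature against live traffic. The thresholds in the docstring are a common rule of thumb, not a universal standard:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and live (actual) feature sample.

    Rule of thumb (an assumption, tune per use case):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch live values above the training range

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for b in range(bins):
                if edges[b] <= x < edges[b + 1]:
                    counts[b] += 1
                    break
            else:
                counts[0] += 1  # below training range: lump into first bin
        # small smoothing term avoids log(0) on empty bins
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run this per feature (and on the prediction distribution itself) on a schedule; a rising PSI is the early-warning signal that triggers the fallback strategy.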
## Quality checklist

Before delivering a model evaluation, verify:

- [ ] Business objective is mapped to technical metrics with a stated target
- [ ] Data split strategy matches data characteristics (temporal, imbalanced, grouped)
- [ ] Test set was never used for model selection or tuning
- [ ] At least one simple baseline is included for comparison
- [ ] Error analysis examines specific failure cases, not just aggregates
- [ ] Performance is broken down by relevant segments
- [ ] Bias detection covers protected attributes and business segments
- [ ] Production readiness includes latency, monitoring, and fallback

## Common mistakes

- **Evaluating on accuracy alone.** A model predicting "not fraud" for everything achieves 99.5% accuracy on a 0.5% fraud dataset. Use precision/recall for imbalanced problems.
- **Leaking test data.** Using the test set for feature selection or tuning inflates results and breaks the generalization guarantee.
- **Ignoring the simple baseline.** A logistic regression at 90% in 30 seconds often beats a deep learning model at 92% after two weeks of engineering.
- **Reporting only aggregate metrics.** 90% overall accuracy that drops to 50% on a critical segment is not a 90%-accurate model for those users.
- **Skipping the cost analysis.** False positives and false negatives rarely cost the same. The evaluation must reflect the asymmetry.
- **No production monitoring plan.** Models degrade as distributions shift. An evaluation without monitoring is incomplete.