ML Model Evaluation
Evaluate machine learning models rigorously — with train/test splits, cross-validation, business metric alignment, bias detection, and production readiness assessment.
Tags: machine-learning, evaluation, metrics, bias, model-selection
churn-prediction.md
# Model Evaluation — Customer Churn Predictor, StreamVault

## Problem Definition

**Business objective:** Identify subscribers likely to cancel within 30 days so the retention team can intervene with targeted offers. The current rule-based system (usage drop >50%) achieves 61% precision at 44% recall.

**Cost of errors:** A false negative (missed churn) costs ~$480 in lost annual revenue; a false positive (unnecessary outreach) costs ~$8 per call.

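The cost asymmetry above drives every metric choice that follows. A minimal sketch of the cost model (constant and function names are illustrative, not from any library):

```python
# Error costs from the problem definition: a missed churner (FN) loses
# ~$480 of annual revenue; an unnecessary retention call (FP) costs ~$8.
FN_COST = 480.0
FP_COST = 8.0

def expected_cost(n_false_negatives: int, n_false_positives: int) -> float:
    """Total dollar cost of the errors in one scoring run."""
    return n_false_negatives * FN_COST + n_false_positives * FP_COST

# A false negative is 60x as expensive as a false positive, which is why
# recall, not precision, is the primary metric below.
print(FN_COST / FP_COST)  # 60.0
```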
## Success Metrics

| Priority | Metric | Target | Rationale |
|---|---|---|---|
| Primary | Recall | > 70% | Missed churn costs 60x more than false outreach |
| Secondary | Precision at 70% recall | > 55% | Keep volume manageable for the 4-person retention team |
| Business | Net saves per month | > 120 | Currently ~55 saves/month with the rule-based approach |

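"Precision at 70% recall" means: sweep the score threshold and take the best precision among operating points that meet the recall floor. A pure-Python sketch of that sweep (in practice `sklearn.metrics.precision_recall_curve` produces the same points; the function name here is hypothetical):

```python
def precision_at_recall(y_true, scores, min_recall=0.70):
    """Best precision achievable at or above a target recall.

    Walks the ranking from highest score down, tracking precision and
    recall at each cut, and keeps the best precision among cuts whose
    recall meets the floor.
    """
    total_pos = sum(y_true)
    if not total_pos:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    for i in order:
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= min_recall:
            best = max(best, tp / (tp + fp))
    return best
```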
## Data & Splitting

**Dataset:** 214,000 subscriber-months (Jan 2024 - Jun 2025), 34 features (usage, account, engagement), 8.3% positive class. A temporal split is required: churn is seasonal, and a random split would leak future behavior into training.

| Split | Period | Rows | Churn Rate |
|-------|--------|------|------------|
| Train | Jan - Dec 2024 | 142,800 | 8.1% |
| Validation | Jan - Mar 2025 | 35,600 | 8.4% |
| Test | Apr - Jun 2025 | 35,600 | 8.6% |

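The split is a matter of date boundaries, not shuffling. A minimal sketch, assuming each row carries its snapshot date as the first element (the function and row layout are illustrative):

```python
from datetime import date

# Boundaries mirror the split table above.
TRAIN_END = date(2024, 12, 31)
VAL_END = date(2025, 3, 31)

def temporal_split(rows):
    """Partition (snapshot_date, ...) rows by date, never by shuffling,
    so training never sees behavior from a later period."""
    train = [r for r in rows if r[0] <= TRAIN_END]
    val = [r for r in rows if TRAIN_END < r[0] <= VAL_END]
    test = [r for r in rows if r[0] > VAL_END]
    return train, val, test
```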
## Model Comparison (Validation Set)

| Model | AUC-ROC | Precision @ 70% Recall | Latency | Training |
|-------|---------|------------------------|---------|----------|
| Logistic Regression | 0.79 | 0.48 | 0.3 ms | 12 s |
| Random Forest | 0.83 | 0.56 | 1.2 ms | 45 s |
| XGBoost | 0.87 | 0.63 | 0.8 ms | 3 min |
| LightGBM | 0.86 | 0.61 | 0.6 ms | 2 min |
| Rule-based (current) | 0.68 | 0.61* | — | — |

\*At the rule-based system's fixed operating point of 44% recall, not 70%.

**Selected: XGBoost.** Highest AUC and precision at target recall. Test set results: AUC 0.85, precision 0.59 at 72% recall. Slight validation-to-test degradation (0.87 to 0.85) is within the expected range for temporal shift.

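AUC-ROC, used throughout the comparison, is the probability that a randomly chosen positive outranks a randomly chosen negative. A pure-Python sketch of that equivalence (O(n²); `sklearn.metrics.roc_auc_score` is the practical choice on 35K-row splits):

```python
def auc_roc(y_true, scores):
    """AUC-ROC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, y_true) if y]
    neg = [s for s, y in zip(scores, y_true) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```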
## Error Analysis

**False negatives:** Long-tenure users (>24 mo) who churn abruptly after price increases, leaving no gradual usage decline to detect (23% of FNs); annual-plan users who disengage mid-cycle but cancel only at renewal (18% of FNs).

**False positives:** Seasonal users whose low-activity periods mimic churn signals (31% of FPs); users who contacted support about billing but resolved the issue (14% of FPs).

## Segment Performance

| Segment | Recall | Precision @ 70% | Gap |
|---------|--------|-----------------|-----|
| Monthly plan | 74% | 0.62 | — |
| Annual plan | 58% | 0.41 | Under-served |
| Tenure < 6 mo | 78% | 0.67 | — |
| Tenure > 24 mo | 61% | 0.44 | Under-served |
| International users | 68% | 0.55 | 5-pt TPR gap vs. US |

The model leans on usage-decline features that manifest differently for annual-plan and long-tenure subscribers, which depresses performance for those segments. The international TPR gap traces to timezone and connectivity differences in the usage features.

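A segment table like the one above falls out of grouping predictions by attribute before computing the confusion counts. A minimal sketch, assuming `(segment, y_true, y_pred)` records (the helper name is illustrative):

```python
from collections import defaultdict

def segment_metrics(records):
    """Per-segment recall and precision from (segment, y_true, y_pred)
    triples, for spotting under-served groups like those in the table."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for seg, y, pred in records:
        c = counts[seg]
        if y and pred:
            c["tp"] += 1
        elif pred:
            c["fp"] += 1
        elif y:
            c["fn"] += 1
    out = {}
    for seg, c in counts.items():
        pred_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        out[seg] = {
            "recall": c["tp"] / actual_pos if actual_pos else 0.0,
            "precision": c["tp"] / pred_pos if pred_pos else 0.0,
        }
    return out
```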
## Production Readiness

| Criterion | Status | Notes |
|-----------|--------|-------|
| Primary metric | PASS | 72% recall exceeds the 70% target |
| Precision | PASS | 59% at target recall (target > 55%) |
| Latency | PASS | Scores the full 214K-user batch in < 3 min |
| Bias assessment | CONDITIONAL | International TPR gap flagged; monitor, fix in v2 |
| Monitoring | DEFINED | Weekly AUC on a rolling window; alert if < 0.80 |
| Fallback | DEFINED | Revert to rules if AUC < 0.78 for 2 consecutive weeks |
| A/B test | DEFINED | 50/50 split vs. rule-based; measure save rate over 8 weeks |

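The fallback rule in the table reduces to a consecutive-weeks check over the monitoring job's weekly AUC readings. A sketch of that trigger (the function is a hypothetical helper, not part of any monitoring framework):

```python
def should_fall_back(weekly_aucs, floor=0.78, weeks=2):
    """Return True once AUC has stayed below `floor` for `weeks`
    consecutive readings, signaling a revert to the rule-based system."""
    streak = 0
    for auc in weekly_aucs:
        streak = streak + 1 if auc < floor else 0
        if streak >= weeks:
            return True
    return False
```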
**Recommendation:** Approve for A/B test. Schedule v2 feature work for the annual-plan and long-tenure blind spots.