ML Model Evaluation
Evaluate machine learning models rigorously — with train/test splits, cross-validation, business metric alignment, bias detection, and production readiness assessment.
Tags: machine-learning, evaluation, metrics, bias, model-selection
churn-prediction.md
# Model Evaluation — Customer Churn Predictor, StreamVault

## Problem Definition

**Business objective:** Identify subscribers likely to cancel within 30 days so the retention team can intervene with targeted offers. The current rule-based system (usage drop >50%) achieves 61% precision at 44% recall.

**Cost of errors:** A false negative (missed churn) costs ~$480 in lost annual revenue; a false positive (unnecessary outreach) costs ~$8 per call.

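The cost asymmetry above drives every metric choice that follows. A minimal sketch of the cost model (constant and function names are illustrative, not from any library):

```python
# Error costs from the problem definition: a missed churner (FN) loses
# ~$480 of annual revenue; an unnecessary retention call (FP) costs ~$8.
FN_COST = 480.0
FP_COST = 8.0

def expected_cost(n_false_negatives: int, n_false_positives: int) -> float:
    """Total dollar cost of the errors in one scoring run."""
    return n_false_negatives * FN_COST + n_false_positives * FP_COST

# A false negative is 60x as expensive as a false positive, which is why
# recall, not precision, is the primary metric below.
print(FN_COST / FP_COST)  # 60.0
```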
## Success Metrics

| Priority | Metric | Target | Rationale |
|---|---|---|---|
| Primary | Recall | > 70% | Missed churn costs 60x more than false outreach |
| Secondary | Precision at 70% recall | > 55% | Keep volume manageable for the 4-person retention team |
| Business | Net saves per month | > 120 | Currently ~55 saves/month with the rule-based approach |

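"Precision at 70% recall" means: sweep the score threshold and take the best precision among operating points that meet the recall floor. A pure-Python sketch of that sweep (in practice `sklearn.metrics.precision_recall_curve` produces the same points; the function name here is hypothetical):

```python
def precision_at_recall(y_true, scores, min_recall=0.70):
    """Best precision achievable at or above a target recall.

    Walks the ranking from highest score down, tracking precision and
    recall at each cut, and keeps the best precision among cuts whose
    recall meets the floor.
    """
    total_pos = sum(y_true)
    if not total_pos:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    best = 0.0
    for i in order:
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= min_recall:
            best = max(best, tp / (tp + fp))
    return best
```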
## Data & Splitting

**Dataset:** 214,000 subscriber-months (Jan 2024 - Jun 2025), 34 features (usage, account, engagement), 8.3% positive class. A temporal split is required: churn is seasonal, and a random split would leak future behavior into training.

| Split | Period | Rows | Churn Rate |
|-------|--------|------|------------|
| Train | Jan - Dec 2024 | 142,800 | 8.1% |
| Validation | Jan - Mar 2025 | 35,600 | 8.4% |
| Test | Apr - Jun 2025 | 35,600 | 8.6% |

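The split is a matter of date boundaries, not shuffling. A minimal sketch, assuming each row carries its snapshot date as the first element (the function and row layout are illustrative):

```python
from datetime import date

# Boundaries mirror the split table above.
TRAIN_END = date(2024, 12, 31)
VAL_END = date(2025, 3, 31)

def temporal_split(rows):
    """Partition (snapshot_date, ...) rows by date, never by shuffling,
    so training never sees behavior from a later period."""
    train = [r for r in rows if r[0] <= TRAIN_END]
    val = [r for r in rows if TRAIN_END < r[0] <= VAL_END]
    test = [r for r in rows if r[0] > VAL_END]
    return train, val, test
```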
## Model Comparison (Validation Set)

| Model | AUC-ROC | Precision @ 70% Recall | Latency | Training |
|-------|---------|------------------------|---------|----------|
| Logistic Regression | 0.79 | 0.48 | 0.3 ms | 12 s |
| Random Forest | 0.83 | 0.56 | 1.2 ms | 45 s |
| XGBoost | 0.87 | 0.63 | 0.8 ms | 3 min |
| LightGBM | 0.86 | 0.61 | 0.6 ms | 2 min |
| Rule-based (current) | 0.68 | 0.61* | — | — |

\*At the rule-based system's fixed operating point of 44% recall, not 70%.

**Selected: XGBoost.** Highest AUC and precision at target recall. Test set results: AUC 0.85, precision 0.59 at 72% recall. Slight validation-to-test degradation (0.87 to 0.85) is within the expected range for temporal shift.

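AUC-ROC, used throughout the comparison, is the probability that a randomly chosen positive outranks a randomly chosen negative. A pure-Python sketch of that equivalence (O(n²); `sklearn.metrics.roc_auc_score` is the practical choice on 35K-row splits):

```python
def auc_roc(y_true, scores):
    """AUC-ROC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, y_true) if y]
    neg = [s for s, y in zip(scores, y_true) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```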
## Error Analysis

**False negatives:** Long-tenure users (>24 mo) who churn abruptly after price increases, leaving no gradual usage decline to detect (23% of FNs); annual-plan users who disengage mid-cycle but cancel only at renewal (18% of FNs).

**False positives:** Seasonal users whose low-activity periods mimic churn signals (31% of FPs); users who contacted support about billing but resolved the issue (14% of FPs).

## Segment Performance

| Segment | Recall | Precision @ 70% | Gap |
|---------|--------|-----------------|-----|
| Monthly plan | 74% | 0.62 | — |
| Annual plan | 58% | 0.41 | Under-served |
| Tenure < 6 mo | 78% | 0.67 | — |
| Tenure > 24 mo | 61% | 0.44 | Under-served |
| International users | 68% | 0.55 | 5-pt TPR gap vs. US |

The model leans on usage-decline features that manifest differently for annual-plan and long-tenure subscribers, which depresses performance for those segments. The international TPR gap traces to timezone and connectivity differences in the usage features.

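A segment table like the one above falls out of grouping predictions by attribute before computing the confusion counts. A minimal sketch, assuming `(segment, y_true, y_pred)` records (the helper name is illustrative):

```python
from collections import defaultdict

def segment_metrics(records):
    """Per-segment recall and precision from (segment, y_true, y_pred)
    triples, for spotting under-served groups like those in the table."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for seg, y, pred in records:
        c = counts[seg]
        if y and pred:
            c["tp"] += 1
        elif pred:
            c["fp"] += 1
        elif y:
            c["fn"] += 1
    out = {}
    for seg, c in counts.items():
        pred_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        out[seg] = {
            "recall": c["tp"] / actual_pos if actual_pos else 0.0,
            "precision": c["tp"] / pred_pos if pred_pos else 0.0,
        }
    return out
```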
## Production Readiness

| Criterion | Status | Notes |
|-----------|--------|-------|
| Primary metric | PASS | 72% recall exceeds the 70% target |
| Precision | PASS | 59% at target recall (target > 55%) |
| Latency | PASS | Scores the full 214K-user batch in < 3 min |
| Bias assessment | CONDITIONAL | International TPR gap flagged; monitor, fix in v2 |
| Monitoring | DEFINED | Weekly AUC on a rolling window; alert if < 0.80 |
| Fallback | DEFINED | Revert to rules if AUC < 0.78 for 2 consecutive weeks |
| A/B test | DEFINED | 50/50 split vs. rule-based; measure save rate over 8 weeks |

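The fallback rule in the table reduces to a consecutive-weeks check over the monitoring job's weekly AUC readings. A sketch of that trigger (the function is a hypothetical helper, not part of any monitoring framework):

```python
def should_fall_back(weekly_aucs, floor=0.78, weeks=2):
    """Return True once AUC has stayed below `floor` for `weeks`
    consecutive readings, signaling a revert to the rule-based system."""
    streak = 0
    for auc in weekly_aucs:
        streak = streak + 1 if auc < floor else 0
        if streak >= weeks:
            return True
    return False
```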
**Recommendation:** Approve for A/B test. Schedule v2 feature work for the annual-plan and long-tenure blind spots.